Positional Encoding

Positional encoding injects position information into transformer inputs, because self-attention is permutation-invariant — without it, “the cat sat on the mat” and “mat the on sat cat the” produce the same token representations, just reordered. The three main approaches are sinusoidal (original transformer), learned (BERT, GPT-2), and rotary/RoPE (the modern standard, used in LLaMA, Mistral, Gemma).

Self-attention computes dot products between all pairs of tokens. Dot products don’t care about order — swapping two tokens in the input just swaps the corresponding rows in the output. But language (and most sequences) is fundamentally ordered: “dog bites man” and “man bites dog” mean different things. Positional encoding solves this by adding position-dependent information to each token before attention sees it.
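The permutation-invariance claim is easy to verify directly: permuting the input rows of a (toy, randomly initialised) single-head attention just permutes the output rows, so no per-token representation changes.

```python
import torch

torch.manual_seed(0)
T, d = 5, 8
x = torch.randn(T, d)

# Toy single-head attention with random (untrained) weights.
Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3))

def attn(x):
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = torch.softmax(q @ k.T / d**0.5, dim=-1)
    return scores @ v

perm = torch.randperm(T)
out, out_perm = attn(x), attn(x[perm])

# Permuting the input permutes the output rows: each token's
# representation is unchanged, so order carries no information.
print(torch.allclose(out[perm], out_perm, atol=1e-5))  # True
```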

Sinusoidal encoding uses fixed sine/cosine waves at different frequencies — think of it like a clock with many hands spinning at different speeds. Position 0 has one pattern of hand positions, position 1 has a slightly different pattern, and so on. The key property is that the encoding of any position can be expressed as a linear transformation of any other position’s encoding, which in theory lets attention learn relative position patterns.

RoPE (Rotary Position Embeddings) takes a fundamentally different approach: instead of adding position information to the token embeddings, it rotates the query and key vectors by an angle proportional to their position. When you take the dot product of a rotated query with a rotated key, the result depends only on the relative position between them — not on the absolute positions. This is elegant because relative position is what usually matters (“the word two positions back”) and it naturally extends to longer sequences than seen during training.

Sinusoidal (Vaswani et al., 2017):

$$PE_{(pos, 2i)} = \sin\!\left(\frac{pos}{10000^{2i/d}}\right), \quad PE_{(pos, 2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d}}\right)$$

where $pos$ is the position index and $i$ is the dimension-pair index. Each pair of dimensions oscillates at a different frequency; the wavelengths form a geometric progression from $2\pi$ to $10000 \cdot 2\pi$.
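The linear-transformation property mentioned above can be checked numerically: for each (sin, cos) pair, the encoding at position $pos + k$ is a fixed 2×2 rotation of the encoding at position $pos$, and the rotation depends only on the offset $k$. A small NumPy sketch (not from the original paper):

```python
import numpy as np

d, pos, k = 16, 7, 3
i = np.arange(0, d, 2)
w = 1.0 / (10000 ** (i / d))  # per-pair angular frequency

def pe_pairs(p):
    # (sin, cos) pair for each frequency at position p -> (d/2, 2)
    return np.stack([np.sin(p * w), np.cos(p * w)], axis=-1)

# Angle-addition identities give a rotation by k*w on each pair:
# sin((p+k)w) = sin(pw)cos(kw) + cos(pw)sin(kw)
# cos((p+k)w) = cos(pw)cos(kw) - sin(pw)sin(kw)
c, s = np.cos(k * w), np.sin(k * w)
base = pe_pairs(pos)
rotated = np.stack([base[:, 0] * c + base[:, 1] * s,
                    -base[:, 0] * s + base[:, 1] * c], axis=-1)

print(np.allclose(rotated, pe_pairs(pos + k)))  # True
```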

Learned: Simply a trainable matrix $E \in \mathbb{R}^{T_{\max} \times d}$ where row $t$ is looked up and added to the token embedding at position $t$.

RoPE (Su et al., 2021):

For each pair of dimensions $(2i, 2i+1)$, rotate by angle $m\theta_i$, where $m$ is the position:

$$\begin{pmatrix} q'_{2i} \\ q'_{2i+1} \end{pmatrix} = \begin{pmatrix} \cos(m\theta_i) & -\sin(m\theta_i) \\ \sin(m\theta_i) & \cos(m\theta_i) \end{pmatrix} \begin{pmatrix} q_{2i} \\ q_{2i+1} \end{pmatrix}$$

where $m$ is the position and $\theta_i = 10000^{-2i/d}$. The key property: $\langle \text{RoPE}(q, m), \text{RoPE}(k, n) \rangle$ depends only on $q$, $k$, and $m - n$ (relative position).

import torch
import torch.nn as nn
import math
# ── Sinusoidal (fixed, no parameters) ───────────────────────────
def sinusoidal_encoding(T: int, d: int) -> torch.Tensor:
    pos = torch.arange(T).unsqueeze(1).float()  # (T, 1)
    dim = torch.arange(0, d, 2).float()         # (d/2,)
    freq = 1.0 / (10000 ** (dim / d))           # (d/2,)
    pe = torch.zeros(T, d)                      # (T, d)
    pe[:, 0::2] = torch.sin(pos * freq)         # even dims
    pe[:, 1::2] = torch.cos(pos * freq)         # odd dims
    return pe  # Add to token embeddings: x = x + pe[:T]
# ── Learned (trainable) ─────────────────────────────────────────
pos_embed = nn.Embedding(max_seq_len, d_model)
# Usage: x = token_embed(ids) + pos_embed(positions)
# where positions = torch.arange(T) # (T,)
# LIMITATION: cannot handle positions > max_seq_len at inference.
# ── RoPE (applied to Q and K, NOT V) ────────────────────────────
def apply_rope(x, freqs_cis):
    """x: (B, T, n_heads, d_head), freqs_cis: (T, d_head/2) complex"""
    x_complex = torch.view_as_complex(x.float().reshape(*x.shape[:-1], -1, 2))
    # Reshape for broadcasting over batch and head dims: (1, T, 1, d_head/2)
    freqs_cis = freqs_cis[None, :, None, :]
    x_rotated = x_complex * freqs_cis  # complex multiply = 2D rotation per pair
    return torch.view_as_real(x_rotated).reshape(x.shape).type_as(x)

def precompute_rope_freqs(d_head: int, max_T: int, theta: float = 10000.0):
    freqs = 1.0 / (theta ** (torch.arange(0, d_head, 2).float() / d_head))
    t = torch.arange(max_T).float()
    angles = torch.outer(t, freqs)  # (T, d_head/2)
    return torch.polar(torch.ones_like(angles), angles)  # (T, d_head/2) complex
# Apply to Q and K only — V has no position information.
# NEVER apply RoPE to V. Position is encoded via Q-K interaction.
import numpy as np
def sinusoidal_encoding_np(T, d):
    """Equivalent to the sinusoidal encoding above. Returns (T, d)."""
    pos = np.arange(T)[:, None]        # (T, 1)
    dim = np.arange(0, d, 2)[None, :]  # (1, d/2)
    freq = 1.0 / (10000 ** (dim / d))  # (1, d/2)
    pe = np.zeros((T, d))
    pe[:, 0::2] = np.sin(pos * freq)   # (T, d/2)
    pe[:, 1::2] = np.cos(pos * freq)   # (T, d/2)
    return pe
def apply_rope_np(q, k, theta=10000.0):
    """
    Apply RoPE to query and key vectors.
    q, k: (B, T, d) where d must be even.
    Returns: rotated q, k of same shape.
    """
    B, T, d = q.shape
    positions = np.arange(T)[:, None]           # (T, 1)
    dims = np.arange(0, d, 2)[None, :]          # (1, d/2)
    angles = positions / (theta ** (dims / d))  # (T, d/2)
    cos_a = np.cos(angles)                      # (T, d/2)
    sin_a = np.sin(angles)                      # (T, d/2)

    def rotate(x):
        x1 = x[:, :, 0::2]  # (B, T, d/2) even dims
        x2 = x[:, :, 1::2]  # (B, T, d/2) odd dims
        out = np.empty_like(x)
        out[:, :, 0::2] = x1 * cos_a - x2 * sin_a
        out[:, :, 1::2] = x1 * sin_a + x2 * cos_a
        return out

    return rotate(q), rotate(k)
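As a self-contained sanity check of the relative-position property, the snippet below restates the per-pair rotation in a small helper and confirms that two query/key pairs with the same offset $n - m$ produce the same dot product regardless of absolute position:

```python
import numpy as np

def rope_rotate(x, m, theta=10000.0):
    """Rotate a single vector x of shape (d,) for position m, pairing dims (2i, 2i+1)."""
    d = x.shape[0]
    ang = m / (theta ** (np.arange(0, d, 2) / d))  # (d/2,)
    c, s = np.cos(ang), np.sin(ang)
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x1 * c - x2 * s
    out[1::2] = x1 * s + x2 * c
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)

# Same relative offset (n - m = 4) at different absolute positions:
a = rope_rotate(q, 3) @ rope_rotate(k, 7)
b = rope_rotate(q, 10) @ rope_rotate(k, 14)
print(np.allclose(a, b))  # True
```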
  • Original transformer (Vaswani et al., 2017): sinusoidal encoding added to input embeddings (see transformer/)
  • BERT, GPT-2: learned positional embeddings up to 512 / 1024 tokens
  • LLaMA, Mistral, Gemma, Qwen (modern LLMs): RoPE applied to Q and K in every attention layer; the modern standard for autoregressive models
  • Vision Transformers (ViT): learned 2D positional embeddings for image patches
  • Long-context models (LLaMA 3 128K, Gemini 1M): RoPE with adjusted base frequency (NTK-aware scaling or YaRN) for length extrapolation
Alternatives (when to use, and the tradeoff):
  • Sinusoidal (fixed). When to use: simple baselines, no extra parameters. Tradeoff: no learning; works well but slightly worse than learned/RoPE in practice.
  • Learned absolute. When to use: fixed-length tasks (BERT-style). Tradeoff: simple; cannot extrapolate beyond training length.
  • RoPE (rotary). When to use: autoregressive LLMs (the modern default). Tradeoff: elegant relative position; requires careful frequency scaling for long contexts.
  • ALiBi (Press et al., 2022). When to use: length extrapolation without fine-tuning. Tradeoff: adds linear bias to attention scores; simpler than RoPE but less expressive.
  • Relative position bias (T5, Transformer-XL). When to use: when you want explicit learned relative offsets. Tradeoff: adds a learned bias matrix indexed by relative position; more parameters.
  • No positional encoding. When to use: when position doesn’t matter (set-based inputs). Tradeoff: works for tasks like point cloud processing where input order is irrelevant.
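The ALiBi alternative can be made concrete: it leaves Q and K untouched and instead adds a fixed per-head linear penalty, proportional to query-key distance, directly to the attention scores. A minimal sketch following the geometric slope schedule of Press et al. (2022):

```python
import torch

def alibi_bias(T: int, n_heads: int) -> torch.Tensor:
    """Per-head linear distance penalty to add to attention scores."""
    # Geometric slope schedule: 2^(-8*1/n), 2^(-8*2/n), ... per head.
    slopes = torch.tensor([2.0 ** (-8.0 * (h + 1) / n_heads) for h in range(n_heads)])
    rel = torch.arange(T)[None, :] - torch.arange(T)[:, None]  # j - i
    # Penalise attending further back; future positions (j > i) are
    # clamped to 0 here and removed by the causal mask anyway.
    return slopes[:, None, None] * rel[None].clamp(max=0)      # (n_heads, T, T)

bias = alibi_bias(5, 4)
# Usage per head h: scores = q @ k.T / sqrt(d) + bias[h], then mask + softmax.
```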

Positional encoding was introduced as part of the original transformer (Vaswani et al., 2017). The sinusoidal formulation was chosen because the authors theorised it would allow the model to learn relative positions through linear projections, and because it required no additional parameters. Learned embeddings (BERT, 2018; GPT-2, 2019) quickly became popular as they matched or exceeded sinusoidal performance on fixed-length tasks.

The major innovation was RoPE (Su et al., 2021, “RoFormer”), which encoded position directly into the attention computation through rotation rather than as an additive input. RoPE’s adoption by LLaMA (2023) made it the de facto standard for modern LLMs. The challenge of extending RoPE to longer contexts than seen during training has spawned several techniques: NTK-aware scaling (adjusting the base frequency), YaRN (combining frequency scaling with attention scaling), and dynamic NTK, all of which modify RoPE’s frequency schedule to handle longer sequences without fine-tuning.
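The NTK-aware adjustment mentioned above reduces to a one-line change to RoPE's base. A common community recipe (an assumption here; there is no single canonical formula) scales the base by the context-extension factor raised to $d/(d-2)$, which stretches the low-frequency, long-range dimensions the most while leaving the highest frequency roughly unchanged:

```python
def ntk_scaled_base(theta: float, scale: float, d_head: int) -> float:
    """Raise the RoPE base for a longer context.

    scale = target_context / trained_context. The d/(d-2) exponent
    keeps the fastest-rotating dimension near its trained frequency
    while slowing the slow (long-range) dimensions proportionally more.
    """
    return theta * scale ** (d_head / (d_head - 2))

# Extending a model trained with base 10000 at 4k context to 16k:
new_base = ntk_scaled_base(10000.0, 16384 / 4096, 64)  # ~41800
```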