Intuition
- RoPE applies rotations of different frequencies to 2D subspaces of the embedding space (see the sketches after this list)
- This treats pairs of standard basis vectors as the bases of those subspaces
- The generalization to rotating in 2D subspaces spanned by non-standard basis vectors is described in Rotate from one vector to another vector in high dimensional space
- "the intuition behind RoPE is that we can represent the token embeddings as complex numbers and their positions as pure rotations that we apply to them. If we shift both the query and key by the same amount, changing absolute position but not relative position, this will lead both representations to be additionally rotated in the same manner (as we will see in the derivation), thus the angle between them will remain unchanged and thus the dot product will also remain unchanged." [1]
- Instead of working in the usual $\mathbb{R}^d$, we will work in $\mathbb{C}^{d/2}$ by considering consecutive pairs of elements of the query and key vectors to be a single complex number. Specifically, instead of viewing $q \in \mathbb{R}^d$ as a $d$-dimensional real vector, we view it as $q \in \mathbb{C}^{d/2}$.
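To make the pairwise rotation concrete, here is a minimal NumPy sketch (the name `rope_rotate` is illustrative, not from any particular library; the frequencies $\theta_i = 10000^{-2i/d}$ follow the RoPE paper's convention). It also checks the shift-invariance described in the quote above:

```python
import numpy as np

def rope_rotate(x: np.ndarray, pos: float, base: float = 10000.0) -> np.ndarray:
    """Apply RoPE to a vector x of even length d at position pos.

    Each consecutive pair (x[2i], x[2i+1]) is rotated in its own 2D subspace
    by the angle pos * theta_i, where theta_i = base**(-2i/d).
    """
    d = x.shape[-1]
    assert d % 2 == 0, "RoPE pairs up dimensions, so d must be even"
    theta = base ** (-2.0 * np.arange(d // 2) / d)  # one frequency per 2D subspace
    cos, sin = np.cos(pos * theta), np.sin(pos * theta)
    out = np.empty_like(x)
    out[0::2] = x[0::2] * cos - x[1::2] * sin  # standard 2x2 rotation, applied pairwise
    out[1::2] = x[0::2] * sin + x[1::2] * cos
    return out

# Shifting both query and key by the same offset changes absolute but not
# relative position, and the dot product (attention score) is unchanged.
rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)
score   = rope_rotate(q, 5) @ rope_rotate(k, 2)      # positions (5, 2)
shifted = rope_rotate(q, 105) @ rope_rotate(k, 102)  # both shifted by +100
assert np.isclose(score, shifted)
```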
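The same computation in the complex view: each pair becomes one complex component, the rotation becomes multiplication by a phase, and the real dot product is the real part of $\sum_i q_i \bar{k}_i$. A sketch under the same assumptions as above:

```python
import numpy as np

def to_complex(x: np.ndarray) -> np.ndarray:
    """View a real d-vector as a (d/2)-vector of complex numbers."""
    return x[0::2] + 1j * x[1::2]

def rope_complex(z: np.ndarray, pos: float, base: float = 10000.0) -> np.ndarray:
    """RoPE in C^{d/2}: multiply component j by the phase exp(1j * pos * theta_j)."""
    d_half = z.shape[-1]
    theta = base ** (-np.arange(d_half) / d_half)  # equals base**(-2j/d) with d = 2*d_half
    return z * np.exp(1j * pos * theta)

rng = np.random.default_rng(0)
qc, kc = to_complex(rng.normal(size=8)), to_complex(rng.normal(size=8))

# The phases combine into exp(1j * (m - n) * theta): only the relative
# position m - n survives in the score, so shifting both by +100 changes nothing.
score   = np.sum(rope_complex(qc, 5) * np.conj(rope_complex(kc, 2))).real
shifted = np.sum(rope_complex(qc, 105) * np.conj(rope_complex(kc, 102))).real
assert np.isclose(score, shifted)
```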
Notes
From Training and Fine-tuning LLMs - W&B course
- RoPE is an alternative to actually learning positional embeddings
- a popular choice
- tends to slow things down a little bit
- there are now lots of interesting ways of interpolating positions with RoPE to get extrapolation to longer contexts (see the sketch after this list)
- used by LLaMA and by Adept.ai (for their 8B model)
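One such interpolation scheme, linear position interpolation (Chen et al., 2023), simply scales positions beyond the trained context back into the trained range before rotating; the rotation angle is continuous, so fractional positions are fine. A minimal sketch reusing the `rope_rotate` helper from the first code block, with illustrative context lengths:

```python
def rope_interpolated(x, pos, train_len=2048, target_len=8192):
    """Linear position interpolation: compress positions from [0, target_len)
    into the trained range [0, train_len), then rotate as usual."""
    scale = train_len / target_len  # e.g. 0.25 for a 4x context extension
    return rope_rotate(x, pos * scale)  # rope_rotate from the first sketch above

# Position 6000 in the extended context is rotated as if it were position 1500.
out = rope_interpolated(q, 6000)
```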