Intuition
- RoPE applies rotations of different frequencies to 2D subspaces of the embedding space (see the sketches after this list)
- This treats pairs of standard basis vectors as the bases of those subspaces
- The generalization to rotating in 2D subspaces spanned by non-standard basis vectors is described in Rotate from one vector to another vector in high dimensional space
- "the intuition behind RoPE is that we can represent the token embeddings as complex numbers and their positions as pure rotations that we apply to them. If we shift both the query and key by the same amount, changing absolute position but not relative position, this will lead both representations to be additionally rotated in the same manner (as we will see in the derivation), thus the angle between them will remain unchanged and thus the dot product will also remain unchanged." [1]
- Instead of working in the usual $\mathbb{R}^d$, we will work in $\mathbb{C}^{d/2}$ by considering consecutive pairs of elements of the query and key vectors to be a single complex number. Specifically, instead of viewing $q \in \mathbb{R}^d$ as a $d$-dimensional real vector, we view it as $q \in \mathbb{C}^{d/2}$.
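To make the pairwise rotation concrete, here is a minimal NumPy sketch (the name `rope_rotate` is illustrative, not from any particular library; the frequencies $\theta_i = 10000^{-2i/d}$ follow the RoPE paper's convention). It also checks the shift-invariance described in the quote above:

```python
import numpy as np

def rope_rotate(x: np.ndarray, pos: float, base: float = 10000.0) -> np.ndarray:
    """Apply RoPE to a vector x of even length d at position pos.

    Each consecutive pair (x[2i], x[2i+1]) is rotated in its own 2D subspace
    by the angle pos * theta_i, where theta_i = base**(-2i/d).
    """
    d = x.shape[-1]
    assert d % 2 == 0, "RoPE pairs up dimensions, so d must be even"
    theta = base ** (-2.0 * np.arange(d // 2) / d)  # one frequency per 2D subspace
    cos, sin = np.cos(pos * theta), np.sin(pos * theta)
    out = np.empty_like(x)
    out[0::2] = x[0::2] * cos - x[1::2] * sin  # standard 2x2 rotation, applied pairwise
    out[1::2] = x[0::2] * sin + x[1::2] * cos
    return out

# Shifting both query and key by the same offset changes absolute but not
# relative position, and the dot product (attention score) is unchanged.
rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)
score   = rope_rotate(q, 5) @ rope_rotate(k, 2)      # positions (5, 2)
shifted = rope_rotate(q, 105) @ rope_rotate(k, 102)  # both shifted by +100
assert np.isclose(score, shifted)
```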
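The same computation in the complex view: each pair becomes one complex component, the rotation becomes multiplication by a phase, and the real dot product is the real part of $\sum_i q_i \bar{k}_i$. A sketch under the same assumptions as above:

```python
import numpy as np

def to_complex(x: np.ndarray) -> np.ndarray:
    """View a real d-vector as a (d/2)-vector of complex numbers."""
    return x[0::2] + 1j * x[1::2]

def rope_complex(z: np.ndarray, pos: float, base: float = 10000.0) -> np.ndarray:
    """RoPE in C^{d/2}: multiply component j by the phase exp(1j * pos * theta_j)."""
    d_half = z.shape[-1]
    theta = base ** (-np.arange(d_half) / d_half)  # equals base**(-2j/d) with d = 2*d_half
    return z * np.exp(1j * pos * theta)

rng = np.random.default_rng(0)
qc, kc = to_complex(rng.normal(size=8)), to_complex(rng.normal(size=8))

# The phases combine into exp(1j * (m - n) * theta): only the relative
# position m - n survives in the score, so shifting both by +100 changes nothing.
score   = np.sum(rope_complex(qc, 5) * np.conj(rope_complex(kc, 2))).real
shifted = np.sum(rope_complex(qc, 105) * np.conj(rope_complex(kc, 102))).real
assert np.isclose(score, shifted)
```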
Notes
From Training and Fine-tuning LLMs - W&B course
- RoPE is an alternative to actually learning positional embeddings
- a popular choice
- tends to slow things down a little bit
- there are now lots of interesting ways of interpolating positions with RoPE to get extrapolation to longer contexts (see the sketch after this list)
- used by LLaMA and by Adept.ai (for their 8B model)
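One such interpolation scheme, linear position interpolation (Chen et al., 2023), simply scales positions beyond the trained context back into the trained range before rotating; the rotation angle is continuous, so fractional positions are fine. A minimal sketch reusing the `rope_rotate` helper from the first code block, with illustrative context lengths:

```python
def rope_interpolated(x, pos, train_len=2048, target_len=8192):
    """Linear position interpolation: compress positions from [0, target_len)
    into the trained range [0, train_len), then rotate as usual."""
    scale = train_len / target_len  # e.g. 0.25 for a 4x context extension
    return rope_rotate(x, pos * scale)  # rope_rotate from the first sketch above

# Position 6000 in the extended context is rotated as if it were position 1500.
out = rope_interpolated(q, 6000)
```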