Intuition

  • RoPE applies rotations of different frequencies to the 2D subspaces of the query and key vectors (one frequency per consecutive pair of dimensions)
  • “The intuition behind RoPE is that we can represent the token embeddings as complex numbers and their positions as pure rotations that we apply to them. If we shift both the query and key by the same amount, changing absolute position but not relative position, this will lead both representations to be additionally rotated in the same manner---as we will see in the derivation---thus the angle between them will remain unchanged and thus the dot product will also remain unchanged.” 1
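    • Instead of working in the usual $\mathbb{R}^d$, we will work in $\mathbb{C}^{d/2}$ by considering consecutive pairs of elements of the query and key vectors to be a single complex number. Specifically, instead of viewing $\mathbf{q} = (q_1, q_2, q_3, q_4, \ldots, q_d)$ as a d-dimensional real vector we view it as $\mathbf{q} = (q_1 + iq_2, q_3 + iq_4, \ldots, q_{d-1} + iq_d) \in \mathbb{C}^{d/2}$.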
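
A minimal NumPy sketch of the complex-number view described above, not a production implementation. The function name `rope`, the `base=10000.0` default, and the frequency schedule `theta_j = base^(-2j/d)` are assumptions chosen for illustration (the schedule is the commonly used one, but the notes above do not specify it). It pairs consecutive elements into complex numbers, rotates each pair by an angle proportional to the position, and checks that the query-key dot product depends only on the relative offset.

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Rotate a real vector x (even length d) to position pos, RoPE-style."""
    d = x.shape[-1]
    # View consecutive pairs (x0, x1), (x2, x3), ... as complex numbers.
    z = x[0::2] + 1j * x[1::2]
    # One frequency per 2D subspace (assumed schedule): theta_j = base^(-2j/d).
    theta = base ** (-np.arange(d // 2) * 2.0 / d)
    # The position acts as a pure rotation by angle pos * theta_j.
    z_rot = z * np.exp(1j * pos * theta)
    # Unpack back into a real vector of length d.
    out = np.empty_like(x)
    out[0::2], out[1::2] = z_rot.real, z_rot.imag
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)

# Same relative offset (3) at two different absolute positions.
score_a = rope(q, 5) @ rope(k, 2)
score_b = rope(q, 105) @ rope(k, 102)

# A different relative offset (1), for comparison.
score_c = rope(q, 5) @ rope(k, 4)

print(np.isclose(score_a, score_b))  # scores agree up to floating-point error
print(score_a, score_c)
```

Because each pair is only rotated, the norm of the vector is preserved, and shifting both positions by the same amount cancels out in the dot product.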

Notes

From Training and Fine-tuning LLMs - W&B course

  • actually learn positional embeddings
  • a popular choice
  • tend to slow things down a little bit
  • now there are lots of interesting ways of interpolating to get extrapolation with RoPE
  • used by LLaMA, Adept.ai (for the 8B model)
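
As a rough illustration of the "interpolating to get extrapolation" note, here is a sketch of one such approach, linear position interpolation: positions in a longer context are scaled down so they fall inside the range seen during training. The helper names and the context lengths (2048 and 8192) are hypothetical, and this is only one of several interpolation schemes, not necessarily the one the course had in mind.

```python
import numpy as np

def interpolate_positions(positions, train_len, target_len):
    """Linearly squeeze positions from [0, target_len) into [0, train_len)."""
    return np.asarray(positions, dtype=float) * (train_len / target_len)

def rope_angles(positions, d, base=10000.0):
    """Rotation angles per position and 2D subspace, shape (n_pos, d // 2)."""
    theta = base ** (-np.arange(d // 2) * 2.0 / d)  # assumed frequency schedule
    return np.outer(positions, theta)

train_len, target_len = 2048, 8192  # hypothetical context lengths

positions = np.arange(target_len)
scaled = interpolate_positions(positions, train_len, target_len)

# Angles that would be fed to the rotation, with and without interpolation.
angles_plain = rope_angles(positions, d=64)
angles_interp = rope_angles(scaled, d=64)

# After scaling, every position lies inside the range seen during training.
print(scaled.max() < train_len)
```

The trade-off is that neighbouring tokens now sit at fractional positions, closer together than during training, which is why such schemes are usually paired with a short fine-tuning run.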

Footnotes

  1. Rotary Embeddings: A Relative Revolution | EleutherAI Blog