- Features are organized as directions, which can also be seen as points on a hypersphere
- The vectors in the residual stream are combinations of features
→ The model learns how to organize information on the hypersphere. The directions generated by the attention and MLP layers guide the exploration of the hypersphere surface
- The attention layers generate directions from inter-token interactions
- The MLP layers store some (popular?) directions (likely representing combinations of features instead of single features)
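A minimal NumPy sketch of the picture above, to make "moving on the hypersphere" concrete. Everything here is an assumption made for illustration: the sizes, the weights, and the two update vectors standing in for an attention output and an MLP output are made up, and the feature directions are taken orthonormal only to keep the arithmetic easy to follow (feature directions in a real model need not be orthogonal).

```python
import numpy as np

def normalize(v):
    """Project a vector onto the unit hypersphere."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

rng = np.random.default_rng(0)
d_model, n_features = 16, 4  # hypothetical sizes

# Hypothetical feature directions: orthonormal rows, i.e. points on the hypersphere.
q, _ = np.linalg.qr(rng.normal(size=(d_model, n_features)))
features = q.T  # shape (n_features, d_model)

# A residual-stream vector written as a weighted combination of features.
weights = np.array([0.7, 0.2, 0.0, 0.1])
residual = weights @ features

# Stand-ins for the directions written by the layers (not real layer outputs):
# the "attention" update brings in a feature that was absent so far,
# the "MLP" update adds a stored combination of two features.
attn_update = 0.5 * features[2]
mlp_update = 0.3 * normalize(features[0] + features[3])

before = normalize(residual)
after = normalize(residual + attn_update + mlp_update)

# Angle (in radians) travelled on the sphere surface by this update.
angle = np.arccos(np.clip(before @ after, -1.0, 1.0))

print("alignment with feature 2 before:", before @ features[2])
print("alignment with feature 2 after: ", after @ features[2])
print("angle travelled:", angle)
```

Only the direction of the vector is tracked here (the norm is thrown away by `normalize`), which is the simplification that makes the "point on a hypersphere" view work.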
The model doesn’t need to “remember” where a piece of information is on the hypersphere; it just needs to learn how to organize the hypersphere according to the data
- The initial directions are provided by the embeddings
- The model combines the initial directions using the attention layers to make polysemantic features
- Some of these features are stored in the MLP layers, possibly those that are frequently seen during training and have a larger effect on the loss
- Similar samples would generate similar combinations of features and would be placed close together in clusters on the hypersphere
- These clusters would then parameterize a distribution over the output tokens
- Might be related to VAEs?
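A rough sketch of the clustering idea from the bullets above: two nearby points on the hypersphere should map to nearly the same distribution over output tokens. The cluster centre, the noise scale, and the random `unembed` matrix are all hypothetical stand-ins (not taken from any trained model), and a plain linear map followed by a softmax is assumed for the output head.

```python
import numpy as np

def normalize(v):
    """Project vectors (rows) onto the unit hypersphere."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
d_model, vocab_size = 16, 10  # hypothetical sizes

# Hypothetical cluster centre on the hypersphere and random unembedding matrix.
centre = normalize(rng.normal(size=d_model))
unembed = rng.normal(size=(d_model, vocab_size))

# Two "similar samples": small perturbations of the centre, projected back on the sphere.
samples = normalize(centre + 0.01 * rng.normal(size=(2, d_model)))

# Each point on the sphere parameterizes a distribution over output tokens.
probs = softmax(samples @ unembed)  # shape (2, vocab_size)

cosine = samples[0] @ samples[1]
# Total variation distance between the two output distributions (0 = identical).
tv = 0.5 * np.abs(probs[0] - probs[1]).sum()

print("cosine similarity of the two samples:", cosine)
print("total variation distance of their output distributions:", tv)
```

In this toy setup the position on the sphere plays a role loosely similar to a latent code that is decoded into a distribution, which is about as far as the VAE analogy goes here.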