yet another hypothesis on the mechanism of LLMs

Features are organized as directions, which can also be seen as points on a hypersphere
The vectors in the residual stream are combinations of features
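A minimal sketch of the two statements above, assuming a made-up dictionary of random feature directions (`features`, `coeffs`, and the sizes are hypothetical, not taken from any real model). It builds a residual vector as a weighted combination of two features, projects it onto the unit hypersphere, and reads the features back with cosine similarity.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_features = 256, 5  # hypothetical sizes

# Hypothetical feature dictionary: one unit-norm direction per feature.
features = rng.normal(size=(n_features, d_model))
features /= np.linalg.norm(features, axis=1, keepdims=True)

# A residual-stream vector as a weighted combination of features 0 and 3.
coeffs = np.zeros(n_features)
coeffs[[0, 3]] = [1.0, 0.5]
residual = coeffs @ features

# Projecting onto the unit hypersphere keeps only the direction.
point = residual / np.linalg.norm(residual)

# Cosine similarity between the point and each feature direction.
similarity = features @ point
print(np.round(similarity, 2))
```

In high dimensions random directions are nearly orthogonal, so the features with the largest coefficients should have the largest similarities here. Real feature directions are not guaranteed to behave this cleanly.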

→ The model learns how to organize information on the hypersphere
The directions generated by the attention and MLP layers guide the exploration of the hypersphere surface

The attention layers generate directions from inter-token interactions
The MLP layers store some (popular?) directions (likely representing combinations of features instead of single features)
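One way to picture this, as a rough sketch under loud assumptions: a single attention head with random weights, and an MLP treated as a small key-value memory of stored directions. All names (`W_q`, `keys`, `stored_directions`, ...) are hypothetical, and re-projecting onto the unit sphere stands in for the normalization a real model would apply.

```python
import numpy as np

def normalize(x):
    """Project vectors onto the unit hypersphere."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
n_tokens, d_model, n_stored = 4, 32, 8  # hypothetical sizes

# Token positions on the hypersphere (stand-in for embeddings).
x = normalize(rng.normal(size=(n_tokens, d_model)))

# Attention: each token's update direction is a mix of other tokens' values.
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) / np.sqrt(d_model)
                 for _ in range(3))
scores = (x @ W_q) @ (x @ W_k).T / np.sqrt(d_model)
causal_mask = np.triu(np.ones((n_tokens, n_tokens), dtype=bool), k=1)
scores[causal_mask] = -np.inf  # a token cannot look at later tokens
attn_weights = softmax(scores)
attn_direction = attn_weights @ (x @ W_v)
x_mid = x + attn_direction

# MLP as key-value memory: keys detect a pattern, values are stored directions.
keys = rng.normal(size=(n_stored, d_model)) / np.sqrt(d_model)
stored_directions = normalize(rng.normal(size=(n_stored, d_model)))
gate = np.maximum(x_mid @ keys.T, 0.0)  # ReLU: how strongly each key fires
mlp_direction = gate @ stored_directions

# Move along both directions, then return to the sphere surface.
x_next = normalize(x_mid + mlp_direction)
```

The attention update depends on the other tokens in the context, while the MLP update depends only on the current position and the fixed stored directions, which is the distinction the notes draw.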

The model doesn’t need to “remember” where a piece of information is on the hypersphere, it just needs to learn how to organize the hypersphere according to the data

The initial directions are provided by the embeddings
The model combines the initial directions using the attention layers to make polysemantic features
Some of these features are stored in the MLP layers, possibly those that are frequently seen during training and have a larger effect on the loss
Similar samples would generate similar combinations of features and would be placed close together in clusters on the hypersphere
These clusters would then parameterize a distribution over the output tokens
- Idea: might be related to VAEs?
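A toy illustration of the last two steps, assuming a random unembedding matrix `W_U` and two made-up cluster centers (everything here is hypothetical, and it is not a VAE). Points sampled near the same center should map to nearly the same distribution over output tokens, while points from different clusters should not.

```python
import numpy as np

def normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
d_model, vocab_size = 32, 10  # hypothetical sizes

# Hypothetical unembedding: maps a point on the sphere to token logits.
W_U = rng.normal(size=(d_model, vocab_size))

# Two cluster centers on the hypersphere.
center_a = normalize(rng.normal(size=d_model))
center_b = normalize(rng.normal(size=d_model))

def sample_near(center, noise=0.01):
    """Perturb a center slightly and project back onto the sphere."""
    return normalize(center + noise * rng.normal(size=d_model))

def token_distribution(point):
    """Distribution over output tokens parameterized by the point."""
    return softmax(point @ W_U)

def total_variation(p, q):
    return 0.5 * np.abs(p - q).sum()

p_a1 = token_distribution(sample_near(center_a))
p_a2 = token_distribution(sample_near(center_a))
p_b = token_distribution(sample_near(center_b))

within = total_variation(p_a1, p_a2)   # same cluster
between = total_variation(p_a1, p_b)   # different clusters
print(f"within-cluster TV: {within:.3f}, between-cluster TV: {between:.3f}")
```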

