LLM mechanistic-interpretability hypothesis
- MLP layers store conceptual features (people, things, etc.)
- Attention layers compute behavioural features (refusal, truthfulness, emotions, etc.) or association features on the fly
- so perhaps the QKVO matrices learn to associate the conceptual features carried by the tokens and map them into behavioural features (toy sketch below)
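
A minimal numpy sketch of the last bullet, with everything made up: rank-1 toy weights (`W_Q`, `W_O`) and random unit vectors standing in for a conceptual feature (`concept`) and a behavioural feature (`behaviour`). It only shows that a single head's QK circuit *can* select tokens carrying a concept while its OV circuit rewrites that concept as a behaviour; it says nothing about whether trained models actually do this.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 64

# Hypothetical feature directions in the residual stream (made up for the toy).
concept = rng.standard_normal(d_model)    # e.g. a "harmful request" concept
concept /= np.linalg.norm(concept)
behaviour = rng.standard_normal(d_model)  # e.g. a "refusal" behaviour
behaviour /= np.linalg.norm(behaviour)

# QK circuit: rank-1 toy weights, scaled so the softmax saturates when both
# the query- and key-side tokens carry the conceptual feature.
W_Q = 30.0 * np.outer(concept, concept)
W_K = np.eye(d_model)

# OV circuit: reads the conceptual feature out of the attended value and
# rewrites it as the behavioural feature.
W_V = np.eye(d_model)
W_O = np.outer(behaviour, concept)

def head(resid):
    """One attention head over residual states resid of shape (seq, d_model)."""
    q, k, v = resid @ W_Q.T, resid @ W_K.T, resid @ W_V.T
    scores = q @ k.T / np.sqrt(d_model)
    scores -= scores.max(axis=-1, keepdims=True)  # numerically stable softmax
    attn = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
    return (attn @ v) @ W_O.T

# Token 1 carries the conceptual feature; the rest are noise.
resid = 0.1 * rng.standard_normal((4, d_model))
resid[1] += concept

out = head(resid)
# Projection onto the behavioural direction: strongest at position 1, where
# the head attends to the concept token; other positions only mix it in
# weakly via near-uniform attention.
print(np.round(out @ behaviour, 2))
```

Under this framing `W_Q`/`W_K` decide *which* tokens' features get associated and `W_V`/`W_O` decide *what* those features turn into, which is roughly the QK/OV circuit split from Elhage et al.'s "A Mathematical Framework for Transformer Circuits".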