llm mechanistic-interpretability hypothesis

  • MLP layers store conceptual features (people, things, etc.)
  • Attention layers compute behavioural features (refusal, truthfulness, emotions, etc.) or association features on the fly
    • so perhaps the QKVO matrices learn to associate conceptual features across tokens to produce behavioural features
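The hypothesis above can be sketched as a toy numpy example of a single attention head: the Q/K/V/O matrices determine which tokens' features get associated and what combined feature is written back. All shapes, weights, and values here are illustrative assumptions, not taken from any real model.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_head, n_tokens = 8, 4, 3

# Residual-stream states: each row is a token carrying "conceptual features"
X = rng.normal(size=(n_tokens, d_model))

# Hypothetical learned head matrices (random here, for illustration only)
W_Q = rng.normal(size=(d_model, d_head))
W_K = rng.normal(size=(d_model, d_head))
W_V = rng.normal(size=(d_model, d_head))
W_O = rng.normal(size=(d_head, d_model))

Q, K, V = X @ W_Q, X @ W_K, X @ W_V

# Q/K decide which tokens' features get associated with each other
scores = Q @ K.T / np.sqrt(d_head)
attn = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)  # softmax over keys

# V/O decide what mixture of those features is written back to the stream
head_out = (attn @ V) @ W_O  # shape: (n_tokens, d_model)
```

Under this framing, the "on the fly" part of the hypothesis corresponds to `attn` being recomputed per input, while the association itself lives in the fixed W_Q/W_K/W_V/W_O weights.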