This is the master note for Mechanistic Interpretability
what
- reverse engineering the computational mechanisms and representations learnt by neural networks into human-understandable algorithms and concepts 1
- a paradigm shift in interpretability: surface-level analysis (input/output relations) → inner interpretability (internal mechanisms) 1
- similar to the shift from behaviourism to cognitive neuroscience in psychology
- Mech Interp is an approach toward inner interpretability
some terminologies
- Features:
- Neurons 1
- are computational units
- potentially representing individual features
- forming a privileged basis
- Circuit
- sub-graphs of the network, consisting of features and the weights connecting them
- monosemantic and polysemantic neurons
- monosemantic: neurons corresponding to a single semantic concept
- polysemantic: neurons associated with multiple, unrelated concepts
- if neurons were the fundamental primitives of model representations
- all neurons would be monosemantic
- implying a 1-to-1 relation between neurons and features
- however, empirical studies observed that neurons are polysemantic
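A toy illustration (NumPy; the feature names and weights are invented, not taken from any real model): a monosemantic neuron's input weights align with a single feature direction, while a polysemantic neuron's weights overlap with two unrelated features, so it fires for both.

```python
import numpy as np

# Two unrelated "features" as directions in a 4-dimensional activation space
# (hypothetical toy directions, purely for illustration).
f_curve = np.array([1.0, 0.0, 0.0, 0.0])
f_dog = np.array([0.0, 1.0, 0.0, 0.0])

# A monosemantic neuron: its input weights align with exactly one feature.
w_mono = np.array([1.0, 0.0, 0.0, 0.0])

# A polysemantic neuron: its input weights overlap with both unrelated features.
w_poly = np.array([0.7, 0.7, 0.0, 0.0])

for name, x in [("curve input", f_curve), ("dog input", f_dog)]:
    print(f"{name}: mono={w_mono @ x:.1f}  poly={w_poly @ x:.1f}")
# mono fires only for the curve feature; poly fires for both unrelated concepts.
```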
- Privileged bases
- the standard basis of the representation space, whose individual directions (neurons) are made meaningful by elementwise operations such as activation functions
hypotheses
Linear representation hypothesis
- Features / concepts are represented as directions in activation space
- but need not be aligned with the privileged bases
- The network represents features as linear combinations of neurons
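A minimal numerical sketch of this hypothesis (NumPy; the feature names, directions, and values are assumptions for illustration): features are directions, an activation is a linear combination of the active features' directions, and a feature's value can be read back by projecting onto its direction.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 8

# Hypothetical feature directions (unit norm). Under the hypothesis they need
# not be neuron-aligned; here they are just two random orthogonal directions.
dir_sentiment = rng.normal(size=d_model)
dir_sentiment /= np.linalg.norm(dir_sentiment)
dir_tense = rng.normal(size=d_model)
dir_tense -= (dir_tense @ dir_sentiment) * dir_sentiment  # orthogonalize
dir_tense /= np.linalg.norm(dir_tense)

# An activation is a linear combination of the active features' directions:
# x_{f1} * W_{f1} + x_{f2} * W_{f2}
activation = 0.9 * dir_sentiment - 0.4 * dir_tense

# Reading a feature back is a projection (a linear probe) onto its direction.
print(round(float(activation @ dir_sentiment), 2))  # ~0.9
print(round(float(activation @ dir_tense), 2))      # ~-0.4
```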
Superposition Hypothesis
- The representation may encode features not along the basis directions (neurons) but along a larger set of almost-orthogonal directions.
- Neural networks represent more features than they have neurons by encoding features in overlapping combinations of neurons 1
- Non-orthogonality means that features interfere with one another.
- Sparsity means that a feature is rarely active. The assumption is that most features are sparse.
Toy model of superposition
Link to original
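A rough sketch in the spirit of the toy model (NumPy; the sizes and the choice of random directions are assumptions, not the paper's exact setup): more features than dimensions are packed into almost-orthogonal directions, and because only a few features are active at once, a linear readout still roughly recovers them despite the interference.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_dims = 50, 20        # more features than neurons/dimensions

# Almost-orthogonal feature directions as the columns of W (unit norm).
W = rng.normal(size=(n_dims, n_features))
W /= np.linalg.norm(W, axis=0, keepdims=True)

# A sparse feature vector: only a few features are active at once.
x = np.zeros(n_features)
active = rng.choice(n_features, size=3, replace=False)
x[active] = 1.0

h = W @ x           # superposed representation living in only n_dims dimensions
x_hat = W.T @ h     # naive linear readout of every feature

print("readout on active features: ", np.round(x_hat[active], 2))  # close to 1
print("largest inactive readout:   ",
      round(float(np.abs(np.delete(x_hat, active)).max()), 2))      # small interference
```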
A Hierarchy of Feature Properties 3
Four progressively stricter properties that neural network representations might have:
- Decomposability: Decomposable activations can be broken down into features whose meanings do not depend on the values of other features.
- Linearity: Features correspond to directions. Each feature $f_i$ has a corresponding representation direction $W_{f_i}$. The presence of multiple features $f_1, f_2, \ldots$ activating with values $x_{f_1}, x_{f_2}, \ldots$ is represented by $x_{f_1} W_{f_1} + x_{f_2} W_{f_2} + \ldots$
- Superposition vs Non-Superposition: A linear representation exhibits superposition if $W$ (the matrix whose columns are the feature directions $W_{f_i}$) is not invertible, i.e. the features cannot be linearly recovered, for example because there are more features than dimensions. If $W$ is invertible, it does not exhibit superposition.
- Basis-Aligned: A representation is basis aligned if all $W_{f_i}$ are one-hot basis vectors. A representation is partially basis aligned if all $W_{f_i}$ are sparse. This requires a privileged basis.
The first two are hypothesized to be widespread, while the latter two are believed to occur only sometimes.
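A small numerical check of the superposition criterion above (NumPy sketch; the shapes are made up): with the feature directions as the columns of $W$, $W$ is invertible in the relevant (left-inverse) sense exactly when the Gram matrix $W^\top W$ is invertible, so a rank check on $W^\top W$ decides whether the representation can avoid superposition.

```python
import numpy as np

rng = np.random.default_rng(0)

def exhibits_superposition(W: np.ndarray) -> bool:
    """W holds one feature direction per column; rows are model dimensions.
    The Gram matrix W^T W is invertible iff the directions are linearly
    independent, i.e. the features can be linearly recovered (no superposition)."""
    gram = W.T @ W
    return np.linalg.matrix_rank(gram) < gram.shape[0]

W_few = rng.normal(size=(20, 10))    # 10 features in 20 dimensions
W_many = rng.normal(size=(20, 50))   # 50 features in 20 dimensions

print(exhibits_superposition(W_few))   # False: directions are linearly independent
print(exhibits_superposition(W_many))  # True: more features than dimensions forces dependence
```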
Universality hypothesis
- Analogous features and circuits form across models and tasks.
A mechanistic view on LLMs
Virtual Weights and the Residual Stream as a Communication Channel 4
- View the residual stream as the main object that accumulates information
- MLP and attention layers are branches that read from and write to the stream
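A schematic sketch of this view (NumPy; plain linear maps stand in for the attention and MLP blocks, which is an oversimplification, and all shapes are assumptions): every branch reads the current residual stream and adds its output back, so the stream accumulates all the writes.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, n_layers = 5, 16, 4

# Hypothetical small linear maps standing in for the attention and MLP branches.
attn_writes = [rng.normal(scale=0.1, size=(d_model, d_model)) for _ in range(n_layers)]
mlp_writes = [rng.normal(scale=0.1, size=(d_model, d_model)) for _ in range(n_layers)]

x = rng.normal(size=(seq_len, d_model))   # residual stream after the embedding

for W_attn, W_mlp in zip(attn_writes, mlp_writes):
    x = x + x @ W_attn   # attention branch: read the stream, add its output back
    x = x + x @ W_mlp    # MLP branch: same pattern; the stream accumulates every write

print(x.shape)  # the stream keeps its shape while accumulating all layer outputs
```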
Observations about Attention 4
- Applying attention can be described as $h(x) = (A \otimes W_O W_V) \cdot x$
- And the attention pattern is $A = \text{softmax}(x^T W_Q^T W_K x)$
- $W_Q$ and $W_K$ always operate together. They’re never independent. Similarly, $W_O$ and $W_V$ always operate together as well.
- An attention head is really applying two linear operations, $A$ and $W_O W_V$, which operate on different dimensions and act independently.
- $A$ governs which token’s information is moved from and to.
- $W_O W_V$ governs which information is read from the source token and how it is written to the destination token.
- Products of attention heads behave much like attention heads themselves. By the distributive property, $(A^{h_1} \otimes W_{OV}^{h_1}) \cdot (A^{h_2} \otimes W_{OV}^{h_2}) = (A^{h_1} A^{h_2}) \otimes (W_{OV}^{h_1} W_{OV}^{h_2})$
- These are called virtual attention heads
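A numerical sketch of these observations (NumPy; single heads with random weights, no masking or scaling, both patterns computed from the same input for simplicity, all shapes assumed): the pattern $A$ acts on the token dimension while the OV circuit acts on the embedding dimension, and chaining two heads' OV circuits reproduces the virtual-head form.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_model, d_head = 6, 16, 4
x = rng.normal(size=(n_tokens, d_model))   # hypothetical residual stream

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def make_head():
    """One attention head with random weights (no masking or scaling)."""
    W_Q, W_K, W_V = (rng.normal(size=(d_model, d_head)) for _ in range(3))
    W_O = rng.normal(size=(d_head, d_model))
    A = softmax((x @ W_Q) @ (x @ W_K).T)   # attention pattern: where to move information
    W_OV = W_V @ W_O                       # OV circuit (the paper's W_O W_V in row-vector convention)
    return A, W_OV

A1, W_OV1 = make_head()
A2, W_OV2 = make_head()

# A acts on the token dimension, W_OV on the embedding dimension: they commute.
head1 = A1 @ (x @ W_OV1)
assert np.allclose(head1, (A1 @ x) @ W_OV1)

# Chaining head 2 on head 1's output behaves like a single "virtual head"
# with pattern A2 @ A1 and OV circuit W_OV1 @ W_OV2.
composed = A2 @ (head1 @ W_OV2)
virtual = (A2 @ A1) @ x @ (W_OV1 @ W_OV2)
assert np.allclose(composed, virtual)
print("virtual attention head identity holds")
```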
MLP layers are KV memories 5
Link to original
- MLP layers are Unnormalized Key-Value Memories
- MLP layers 6 : $\mathrm{FF}(x) = f(x \cdot K^\top) \cdot V$
- Neural Memory 7 : $\mathrm{MN}(x) = \mathrm{softmax}(x \cdot K^\top) \cdot V$
- MLP layers are almost identical to key-value neural memories (see the sketch after this section). The only difference is
- neural memory uses $\mathrm{softmax}$ as the non-linearity
- while MLPs in transformers don’t use a normalizing function
- Intra-layer and inter-layer memory composition
“the layer-level prediction is typically not the result of a single dominant memory cell, but a composition of multiple memories.”
Link to original
“the model uses the sequential composition apparatus as a means to refine its prediction from layer to layer, often deciding what the prediction will be at one of the lower layers.”
Link to original
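A side-by-side sketch of the two formulas above (NumPy; sizes and weights are hypothetical): a transformer FF layer $\mathrm{FF}(x) = f(x K^\top) V$ next to a key-value neural memory $\mathrm{MN}(x) = \mathrm{softmax}(x K^\top) V$; structurally they differ only in the non-linearity applied to the memory coefficients.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_mem = 16, 64   # d_mem = number of memory cells = FF hidden width

K = rng.normal(size=(d_mem, d_model))   # "keys": rows of the first FF weight matrix
V = rng.normal(size=(d_mem, d_model))   # "values": rows of the second FF weight matrix

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def ff_layer(x):
    # FF(x) = f(x K^T) V with an elementwise non-linearity (ReLU here),
    # so the memory coefficients are unnormalized.
    return np.maximum(x @ K.T, 0.0) @ V

def neural_memory(x):
    # MN(x) = softmax(x K^T) V -- same structure, but coefficients are normalized.
    return softmax(x @ K.T) @ V

x = rng.normal(size=(3, d_model))                  # a few token representations
print(ff_layer(x).shape, neural_memory(x).shape)   # both (3, 16)
```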
my observations, hypotheses and findings
literature
papers
- Refusal in Language Models Is Mediated by a Single Direction
- Toy Models of Superposition
- The Linear Representation Hypothesis and the Geometry of Large Language Models
- Representation Engineering: A Top-Down Approach to AI Transparency
- Steering Language Models With Activation Engineering
- A Language Model’s Guide Through Latent Space
- Linear Representations of Sentiment in Large Language Models
- Universal and Transferable Adversarial Attacks on Aligned Language Models
- Transformer Feed-Forward Layers Are Key-Value Memories
- Mechanistic Interpretability for AI Safety — A Review
- A Mathematical Framework for Transformer Circuits
articles
- Towards Monosemanticity: Decomposing Language Models With Dictionary Learning
- Transformer Circuits Thread
- Thread: Circuits