- Most of the computation in LLMs happens in the self-attention and MLP blocks.
- But the inputs to these blocks are normalized (LayerNorm/RMSNorm), so they have a fixed length (norm).
- That means the vector's length is a degree of freedom that is not used for representational capacity inside these blocks. So what is it used for? (See the sketch below.)
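To make the "fixed length" point concrete, here is a minimal sketch, assuming a standard pre-norm transformer where an RMSNorm sits in front of each block (the `rms_norm` function below is a simplified stand-in for the real layer, without the learned gain): no matter how large or small the residual-stream vector is, the block receives an input of fixed norm.

```python
import torch

def rms_norm(x, eps=1e-6):
    # Simplified RMSNorm: rescale x to unit RMS.
    # Real implementations also multiply by a learned per-channel gain.
    return x / torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)

d_model = 16
v = torch.randn(d_model)

for scale in [0.1, 1.0, 10.0]:
    x = scale * v    # residual-stream vector with a different length each time
    y = rms_norm(x)  # what the attention / MLP block actually receives
    print(f"||x|| = {x.norm().item():7.2f}  ->  ||rms_norm(x)|| = {y.norm().item():.4f}")

# The normalized input always has norm ~sqrt(d_model) = 4.0 here,
# so the block cannot see the original vector's length.
```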
Hypothesis: It’s a filter mechanism. Suppose that an activation is a weighted sum of related features: