LLM research idea: hypothesis

  • Most of the computation in LLMs is in the self-attention and MLP blocks.
  • But the inputs to these blocks have a fixed length (the model's hidden dimension).
  • That means the vector length is a degree of freedom that is not used for representational capacity (see the sketch after this list). So what is it used for?
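
A rough illustration of the fixed width (a minimal sketch, not the proposed mechanism: single-head attention, no LayerNorm, and hypothetical sizes d_model=768, d_ff=3072, seq_len=16):

```python
import numpy as np

# Hypothetical sizes: the hidden width d_model is fixed at architecture
# time and does not change with the input.
d_model, d_ff, seq_len = 768, 3072, 16

W_q = np.random.randn(d_model, d_model) / np.sqrt(d_model)
W_k = np.random.randn(d_model, d_model) / np.sqrt(d_model)
W_v = np.random.randn(d_model, d_model) / np.sqrt(d_model)
W_1 = np.random.randn(d_model, d_ff) / np.sqrt(d_model)
W_2 = np.random.randn(d_ff, d_model) / np.sqrt(d_ff)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x):
    # x: (seq_len, d_model) -> (seq_len, d_model)
    q, k, v = x @ W_q, x @ W_k, x @ W_v
    scores = softmax(q @ k.T / np.sqrt(d_model))
    return scores @ v

def mlp(x):
    # x: (seq_len, d_model) -> (seq_len, d_model)
    return np.maximum(x @ W_1, 0) @ W_2

x = np.random.randn(seq_len, d_model)
y = x + self_attention(x)   # residual connection keeps the width fixed
y = y + mlp(y)
print(x.shape, y.shape)     # both (16, 768): the per-token vector length never varies
```

Because every sublayer reads from and writes back to the same residual stream, each token's vector has the same fixed width throughout the network, regardless of what the input contains.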

Hypothesis: It’s a filter mechanism. Suppose that an activation is a weighted sum of related features: