In Transformer Feed-Forward Layers Are Key-Value Memories, the interaction between the input $x$ and each column $k_i$ of the first parameter matrix is viewed as a conditional distribution of $k_i$ given $x$:

$$p(k_i \mid x) \propto \exp(x \cdot k_i)$$
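As a reference point, here is a minimal NumPy sketch of that reading (sizes and names are illustrative, not from the paper's code). The rows of `K` stand in for the keys $k_i$ (the columns of the first parameter matrix), and softmax-normalizing the interactions $x \cdot k_i$ is what lets them be read as $p(k_i \mid x)$; the FFN itself uses unnormalized activation coefficients rather than a softmax.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Toy sizes, for illustration only.
d, d_m = 8, 32                      # model dim, number of memories (MLP hidden dim)
rng = np.random.default_rng(0)
K = rng.normal(size=(d_m, d))       # row i is the key k_i (a column of the first matrix, stored as a row)
V = rng.normal(size=(d_m, d))       # row i is the value v_i
x = rng.normal(size=d)              # an input hidden state

# Interactions x . k_i, normalized so they can be read as p(k_i | x).
p_k_given_x = softmax(K @ x)        # shape (d_m,), non-negative, sums to 1
ffn_out = p_k_given_x @ V           # memory readout: expected value under p(k_i | x)
```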
What if we employ the idea of VAE and parameterize $p(x \mid k_i)$ as a Gaussian, something like:

$$p(x \mid k_i) = \mathcal{N}(x;\, k_i,\, \sigma^2 I),$$

which is a Gaussian with mean given by $k_i$ and a fixed std.
The intuition is: given an input $x$ that produces the value $v_i$, if $x$ is distributed near $k_i$ then it is more likely to be related to memory $i$. This will force the model to learn $k_i$ such that it is distributed near the inputs $x$ that should activate that memory.
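A minimal sketch of what that objective might look like, assuming a hypothetical per-memory auxiliary loss $-\log p(x \mid k_i)$ for inputs that should activate memory $i$ (how inputs are paired with memories is left unspecified here):

```python
import numpy as np

def log_p_x_given_k(x, k_i, sigma=1.0):
    """log N(x; mean=k_i, cov=sigma^2 I): the proposed p(x | k_i)."""
    d = x.shape[-1]
    diff = x - k_i
    return -0.5 * (diff @ diff) / sigma**2 - 0.5 * d * np.log(2 * np.pi * sigma**2)

# Hypothetical auxiliary loss for an input representation x assumed to be related
# to memory i: with a fixed std, maximizing log p(x | k_i) only penalizes
# ||x - k_i||^2, pulling x and k_i toward each other.
rng = np.random.default_rng(0)
x, k_i = rng.normal(size=8), rng.normal(size=8)
aux_loss = -log_p_x_given_k(x, k_i)
```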
But perhaps the MLP already has this property? As $x \cdot k_i$ is bigger when $x$ and $k_i$ are more co-linear, the “almost co-linear” region surrounding $k_i$ can be seen as a distribution parameterized (partly or entirely) by $k_i$.
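This hunch can be made precise under two assumptions that do not hold exactly for the actual MLP (softmax-normalized coefficients and keys of equal norm): since $x \cdot k_i = \tfrac{1}{2}\left(\lVert x\rVert^2 + \lVert k_i\rVert^2 - \lVert x - k_i\rVert^2\right)$, the softmax over the interactions is then exactly the posterior over components of a uniform Gaussian mixture with means $k_i$ and unit covariance, i.e. each key already defines a Gaussian around itself. A quick numerical check of this identity:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
d, d_m = 8, 32
K = rng.normal(size=(d_m, d))
K = K / np.linalg.norm(K, axis=1, keepdims=True)   # assumption: all keys share the same norm
x = rng.normal(size=d)

# Softmax over the dot-product interactions x . k_i.
p_dot = softmax(K @ x)

# Posterior over components of a uniform Gaussian mixture with means k_i and
# unit covariance: p(i | x) proportional to exp(-||x - k_i||^2 / 2).
sq_dist = ((x - K) ** 2).sum(axis=1)
p_gauss = softmax(-0.5 * sq_dist)

# Identical, because x.k_i = (||x||^2 + ||k_i||^2 - ||x - k_i||^2)/2 and the
# norm terms cancel in the softmax when all ||k_i|| are equal.
assert np.allclose(p_dot, p_gauss)
```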