idea
noising process
- remove least semantically significant tokens first
- the most semantically significant tokens (in sequence) are the ones that most effectively summarize the meaning of the input. E.g.:
a quick brown fox jumps over the lazy dog
_ quick brown fox jumps over ___ lazy dog
_ _____ brown fox jumps over ___ ____ dog
_ _____ _____ fox jumps ____ ___ ____ dog
- What can the selection be based on ? question Some ideas:
- a statistical model
- an objective that minimizes the divergence/distance of the sentence embedding between 2 consecutive steps. Some ideas are:
- could use the distance between the 2 embeddings experiment (a greedy version of this is sketched after this list)
- learn the metric on the go idea
- we can apply this “semantic distance” metric with RL for preference tuning idea
- KL-divergence/OT distance between the prob dist of newly generated tokens idea
- an encoder model to compute embeddings and use them to compute a metric
- If the encoder is an AR LLM, this can be framed as a method to distill from an AR LM into a diffusion LM
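A minimal sketch of the embedding-distance selection idea from the list above: greedily drop the token whose removal moves the sentence embedding the least from the previous step. The `encode` callable, the function names, and the choice of cosine distance are assumptions for illustration, not part of the original notes.

```python
import numpy as np
from typing import Callable, List

# `encode` is a hypothetical sentence encoder (e.g. a sentence-embedding model
# or a frozen AR LM's pooled hidden state); its choice is an assumption.
def noising_order(tokens: List[str], encode: Callable[[str], np.ndarray]) -> List[int]:
    """Greedy noising schedule: at each step, drop the token whose removal
    moves the sentence embedding the least (cosine distance) away from the
    previous step, i.e. remove the least semantically significant token first."""
    def cos_dist(a: np.ndarray, b: np.ndarray) -> float:
        return 1.0 - float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

    remaining = list(range(len(tokens)))   # indices of tokens still present
    order: List[int] = []                  # removal order, least significant first
    prev_emb = encode(" ".join(tokens))
    while len(remaining) > 1:
        best_i, best_d, best_emb = None, float("inf"), None
        for i in remaining:
            cand = " ".join(tokens[j] for j in remaining if j != i)
            emb = encode(cand)
            d = cos_dist(prev_emb, emb)
            if d < best_d:
                best_i, best_d, best_emb = i, d, emb
        remaining.remove(best_i)
        order.append(best_i)
        prev_emb = best_emb                # next step compares against this one
    order.extend(remaining)                # most significant token is removed last
    return order
```

On "a quick brown fox jumps over the lazy dog" this should tend to drop the articles first, roughly matching the noising example above, though how well it does depends entirely on the encoder.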
implications
- This would be highly impactful, as we could leverage existing powerful LLMs instead of starting from scratch
- If it works, it means we can unify the activation spaces of AR LMs and diffusion LMs
- This is supported by Harnessing the Universal Geometry of Embeddings
- We could then apply all the analysis and steering techniques from AR LLMs to diffusion LMs
- It would also unify the 2 views (AR and diffusion)
- Even if we can only distill from a 70B AR LM into a 7B diffusion LM, as long as it performs as well as a 7B AR LM but faster, it's still a win
- However, training would be heavy, as we need to serve 2 models and wait for the scorer to process each step.
- But hopefully there are engineering tricks to improve this, and the compute saved by distillation would make up for the heavy training
denoising process
- each step generates new tokens that don't change some metric (e.g. semantics) too much
- restrict generation to only positions next to existing tokens. How ? question
- Keep only positions that are next to a token ? experiment
- Make the generate-able positions dynamic idea e.g. for the current sequence:
a quick brown fox
→ generate-able slots: _ a _ _ quick _ _ brown _ _ fox _
(2 slots in between every 2 consecutive tokens) → the sequence length is not fixed. How to implement this ? question (one possible slot layout is sketched below, after this list)
- Always add generate-able positions at the beginning and end of the sequence to make generation expandable on both ends idea
- Again, the sequence length is not fixed. → Needs some smart engineering for the case of dynamic sequence length question
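A sketch of one possible slot layout for the dynamic generate-able positions above. The function name, the "_" marker, and the `slots_between` parameter are illustrative assumptions.

```python
from typing import List

def with_generateable_slots(tokens: List[str], slots_between: int = 2) -> List[str]:
    """Lay out the current partial sequence with blank slots ("_") that the
    denoiser may fill: `slots_between` slots between every 2 consecutive
    tokens, plus one slot at each end so the sequence can grow in both
    directions."""
    out: List[str] = ["_"]                      # leading slot
    for i, tok in enumerate(tokens):
        out.append(tok)
        if i < len(tokens) - 1:
            out.extend(["_"] * slots_between)   # slots between consecutive tokens
    out.append("_")                             # trailing slot
    return out

# "a quick brown fox" -> "_ a _ _ quick _ _ brown _ _ fox _"
print(" ".join(with_generateable_slots("a quick brown fox".split())))
```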
- How to measure the change in semantics ? question
- Use the same metric as in the noising process ? (a gate based on it is sketched below)
- This can be used to control the “creativeness” of generation idea
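A sketch of a semantic-change gate for a denoising step, reusing the same kind of embedding-distance metric as in the noising sketch above. `accept_step`, `max_semantic_change`, and the use of cosine distance are assumptions.

```python
import numpy as np
from typing import Callable, List

def accept_step(
    current: List[str],
    proposed: List[str],
    encode: Callable[[str], np.ndarray],   # same hypothetical encoder as in the noising sketch
    max_semantic_change: float = 0.1,      # larger value => more "creative" generation
) -> bool:
    """Keep the newly generated tokens only if the sentence embedding does not
    move too far (cosine distance) from the previous denoising step."""
    a = encode(" ".join(current))
    b = encode(" ".join(proposed))
    cos_dist = 1.0 - float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))
    return cos_dist <= max_semantic_change
```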
- How to stop generation when the sequence length is not fixed ? question some ideas:
- Use a special token like in AR models ? → doesn’t really make sense for this case as generation can be extended in both directions
- Set a threshold on predicted probs: if the probabilities of all candidate new tokens are lower than the threshold, stop generation. idea (sketched at the end of these notes)
- Set a threshold on the “semantic” change of the sequence: if it is higher than a certain value, stop generation. idea
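A sketch of the probability-threshold stopping rule from the list above. The tensor layout ([num_open_slots, vocab_size]) and the `min_prob` value are assumptions.

```python
import torch

def should_stop(slot_logits: torch.Tensor, min_prob: float = 0.2) -> bool:
    """Probability-threshold stopping rule. `slot_logits` is assumed to hold
    the model's logits for every open slot, shape [num_open_slots, vocab_size].
    If no open slot has any token whose predicted probability reaches
    `min_prob`, the model is not confident about adding anything anywhere,
    so generation stops."""
    if slot_logits.numel() == 0:           # no open slots left
        return True
    probs = torch.softmax(slot_logits, dim=-1)
    return bool(probs.max().item() < min_prob)
```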