implications
-
This would be highly impactful, since we could leverage existing powerful LLMs instead of starting from scratch
-
If it works, it means that we can unify the activation spaces of AR LMs and diffusion LMs
- This is supported by "Harnessing the Universal Geometry of Embeddings"
- We can then apply all the analysis and steering techniques developed for AR LLMs to diffusion LMs
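
As a rough illustration of what carrying a steering technique across a unified activation space could look like, here is a minimal sketch. It assumes we already have paired activations from the two models for the same inputs, and that the two spaces differ only by a rotation. Both are simplifying assumptions made here, not something established above. The arrays are synthetic stand-ins, `fit_orthogonal_map` is a hypothetical helper, and orthogonal Procrustes is swapped in as a simple supervised alignment rather than the method of the cited paper.

```python
import numpy as np

def fit_orthogonal_map(src, tgt):
    """Orthogonal Procrustes: orthogonal W minimizing ||src @ W - tgt||_F."""
    u, _, vt = np.linalg.svd(src.T @ tgt)
    return u @ vt

rng = np.random.default_rng(0)
d = 16  # hypothetical hidden size, kept small for the sketch

# Synthetic stand-ins for paired activations (assumption: same inputs fed to both models).
ar_acts = rng.normal(size=(200, d))
true_rot, _ = np.linalg.qr(rng.normal(size=(d, d)))  # pretend the spaces differ by a rotation
diff_acts = ar_acts @ true_rot

# Fit the map from the AR activation space to the diffusion activation space.
W = fit_orthogonal_map(ar_acts, diff_acts)

# A steering direction found in the AR model (random here) carried into the diffusion space.
steer_ar = rng.normal(size=d)
steer_diff = steer_ar @ W
```

In practice the relation between the two spaces may not be a pure rotation, so a learned or nonlinear map might be needed; the sketch only shows where such a map would sit in the workflow.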
-
It will also unify the 2 views:
-
Even if we can only distill from a 70B AR LM into a 7B diffusion LM, it's a win as long as the result performs as well as a 7B LM but runs faster