How are smaller versions of the same LLM (3B, 7B, 13B, etc.) trained?

Lone's notes
