Question (LLM): do they use the biggest version as the teacher in a teacher-student training scheme? Distill from the biggest version? Are the sizes modularly stacked? Or trained separately?