questionllm

  • do they use the biggest version in a teacher-student training scheme?
  • do they distil from the biggest version?
  • are the versions modularly stacked?
  • are they trained separately?
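
For reference, the teacher-student scheme asked about in the first two bullets is usually realized as knowledge distillation: the smaller model is trained to match the temperature-softened output distribution of the larger one. The sketch below only illustrates that loss for a single prediction. It is not a claim about how any particular model family is trained, and the function names, the temperature value, and the example logits are all hypothetical.

```python
import math

def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax of logits divided by a temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - log_sum_exp for x in scaled]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions.

    The temperature ** 2 factor is the usual rescaling that keeps gradient
    magnitudes comparable across temperatures (an assumption of this sketch).
    """
    log_p = log_softmax(teacher_logits, temperature)  # teacher (big version)
    log_q = log_softmax(student_logits, temperature)  # student (small version)
    kl = sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
    return temperature ** 2 * kl

# Hypothetical logits over three classes for one input.
teacher = [1.0, 2.0, 3.0]
student = [3.0, 2.0, 1.0]
loss = distillation_loss(teacher, student)
```

The loss is zero when the student reproduces the teacher's logits exactly and positive otherwise. In the "trained separately" case no such term exists, and each version is fit to the data on its own.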