ideanlp

  • CNN models have auxiliary loss at different depths of the model
    • original motivation was to reduce vanishing gradient
    • these losses can be the same
    • which means the model can perform the same task at different depth, albeit with different performance level
  • Can we apply the same idea for language models ?question
    • different depths of the model can perform the same task