idea

Given that:

  • The bias-variance tradeoff suggests that high bias tends to come with low variance, and high variance with low bias.
  • Overfitting is a sign of high variance and low bias.

Then, given a large amount of computation, I can train many overfit models and average their predictions to get a low-variance, low-bias ensemble. This works because each overfit model keeps bias low, while averaging cancels variance: if the models' errors are at least partly decorrelated (e.g., each is trained on a different bootstrap sample of the data), the variance of the average shrinks roughly like 1/N. This is the idea behind bagging.
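A minimal sketch of that intuition, using scikit-learn decision trees grown to full depth; the synthetic dataset and all constants here are illustrative assumptions, not from any particular source:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Noisy 1-D toy problem: y = sin(x) + noise.
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=200)
X_test = np.linspace(-3, 3, 500).reshape(-1, 1)
y_true = np.sin(X_test[:, 0])

# Each unpruned tree overfits its bootstrap sample: low bias, high variance.
n_models = 100
preds = []
for _ in range(n_models):
    idx = rng.integers(0, len(X), size=len(X))  # bootstrap resample
    tree = DecisionTreeRegressor()              # grown to full depth: overfits
    tree.fit(X[idx], y[idx])
    preds.append(tree.predict(X_test))

# Averaging the overfit predictors cancels much of their variance.
ensemble = np.mean(preds, axis=0)

print(f"single overfit tree MSE: {np.mean((preds[0] - y_true) ** 2):.4f}")
print(f"averaged ensemble MSE:   {np.mean((ensemble - y_true) ** 2):.4f}")
```

On data like this the averaged predictor typically shows a clearly lower test error than any single tree, even though every individual tree badly overfits its own resample.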

Perhaps this is why these methods work:

  • Stacking and ensemble techniques (see the stacking sketch after this list)
  • Training large models on randomly sampled batches (a toy sketch of this reading also follows the list)
    • Different pathways act as different overfit learners on different subsets of the data
    • This might be considered a learning paradigm distinct from another paradigm, Mixture-of-Distribution learning
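
For the stacking bullet, here is a minimal sketch with scikit-learn's StackingRegressor: several deliberately diverse, overfit-prone trees as base learners and a simple ridge meta-learner on top. All names and constants are illustrative:

```python
import numpy as np
from sklearn.ensemble import StackingRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Same kind of noisy toy problem as above.
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=300)

# Base learners: fully grown trees with random split thresholds, so each
# one overfits in its own way (diverse high-variance, low-bias models).
base = [(f"tree{i}", DecisionTreeRegressor(splitter="random", random_state=i))
        for i in range(5)]

# The meta-learner is fit on out-of-fold base predictions, which is what
# distinguishes stacking from plain averaging.
stack = StackingRegressor(estimators=base, final_estimator=Ridge())

print("single tree CV R^2:",
      cross_val_score(DecisionTreeRegressor(random_state=0), X, y, cv=5).mean())
print("stacked     CV R^2:", cross_val_score(stack, X, y, cv=5).mean())
```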
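And for the random-batches bullet, a toy numpy sketch of the "pathways as an implicit ensemble" reading: each SGD step trains a random dropout subnetwork on a random mini-batch, and at test time predictions from many sampled subnetworks are averaged, Monte-Carlo-dropout style. The architecture and every constant here are illustrative assumptions, not a claim about how large models actually behave:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: y = sin(x) + noise.
X = rng.uniform(-3, 3, size=(256, 1))
y = np.sin(X) + rng.normal(scale=0.3, size=X.shape)

# One-hidden-layer net trained with SGD on random mini-batches + dropout,
# so each update fits a random subnetwork ("pathway") to a random subset.
H = 64
W1 = rng.normal(size=(1, H));                  b1 = np.zeros(H)
W2 = rng.normal(scale=1 / np.sqrt(H), size=(H, 1)); b2 = np.zeros(1)
p_keep, lr = 0.5, 0.05

for step in range(3000):
    idx = rng.integers(0, len(X), size=32)      # random mini-batch
    xb, yb = X[idx], y[idx]
    mask = (rng.random(H) < p_keep) / p_keep    # random pathway
    z = np.tanh(xb @ W1 + b1)
    h = z * mask
    err = (h @ W2 + b2) - yb                    # gradient of squared error
    gW2 = h.T @ err / len(xb);  gb2 = err.mean(0)
    dh = (err @ W2.T) * mask * (1 - z ** 2)
    gW1 = xb.T @ dh / len(xb);  gb1 = dh.mean(0)
    W1 -= lr * gW1; b1 -= lr * gb1; W2 -= lr * gW2; b2 -= lr * gb2

# Monte Carlo dropout at test time: average many random pathways, i.e.
# explicitly treat the trained network as an ensemble of subnetworks.
X_test = np.linspace(-3, 3, 200).reshape(-1, 1)
samples = []
for _ in range(100):
    mask = (rng.random(H) < p_keep) / p_keep
    samples.append(np.tanh(X_test @ W1 + b1) * mask @ W2 + b2)
mse = np.mean((np.mean(samples, axis=0) - np.sin(X_test)) ** 2)
print(f"MC-dropout ensemble test MSE: {mse:.4f}")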