Given that:
- The Bias-Variance tradeoff suggests that models with high bias tend to have low variance, and models with high variance tend to have low bias.
- Overfitting is a sign of high variance and low bias.
Then, if I have a large amount of computation, I can train many overfitting models and average their predictions to get a low-variance, low-bias ensemble. The key step is that averaging reduces variance without increasing bias, provided the individual models' errors are at least partially decorrelated (e.g., each model is fit on a different bootstrap sample or data subset).
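
A quick way to see why this should work (a standard decomposition; it assumes the $N$ overfit predictors $\hat f_i$ share the same bias, have equal variance $\sigma^2$, and pairwise correlation $\rho$, which is an idealisation rather than anything guaranteed in practice):

$$
\operatorname{Var}\!\left(\frac{1}{N}\sum_{i=1}^{N}\hat f_i(x)\right)
= \rho\,\sigma^{2} + \frac{1-\rho}{N}\,\sigma^{2},
\qquad
\operatorname{Bias}\!\left(\frac{1}{N}\sum_{i=1}^{N}\hat f_i(x)\right)
= \operatorname{Bias}\!\left(\hat f_1(x)\right)
$$

As $N$ grows, the $\tfrac{1-\rho}{N}\sigma^{2}$ term vanishes and only the correlated part $\rho\sigma^{2}$ of the variance remains, while the bias stays at whatever (low) level the individual overfit learners already achieve. This is why the methods below all try to decorrelate the individual learners.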
Perhaps this is why these methods work:
- Ensemble techniques such as bagging and stacking (see the sketch after this list)
- Training large models on randomly sampled batches
  - Different pathways in the model act as different overfitted learners, each fit to a different subset of the data
- This might be considered a learning paradigm distinct from another paradigm, Mixture-of-Distribution learning
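
As a concrete illustration of the first bullet, here is a minimal bagging sketch, assuming scikit-learn and NumPy are available; the synthetic dataset, the number of trees, and the fully grown (unpruned) trees are illustrative choices, not anything the argument above prescribes. Each tree badly overfits its own bootstrap sample, and their predictions are simply averaged.

```python
# Minimal bagging sketch: average many deliberately overfit decision trees.
# Assumes scikit-learn and NumPy; dataset and hyperparameters are illustrative.
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)

# Noisy 1-D regression problem: a fully grown tree will memorise the noise.
X = rng.uniform(0.0, 6.0, size=(2000, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=2000)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

def overfit_tree(X, y, seed):
    """Fit an unpruned (deliberately overfitting) tree on a bootstrap sample."""
    idx = np.random.default_rng(seed).integers(0, len(X), size=len(X))
    return DecisionTreeRegressor(random_state=seed).fit(X[idx], y[idx])

# One overfit tree vs. the average of many overfit trees (bagging).
single = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)
trees = [overfit_tree(X_train, y_train, seed) for seed in range(100)]
bagged_pred = np.mean([t.predict(X_test) for t in trees], axis=0)

print("single overfit tree test MSE:", mean_squared_error(y_test, single.predict(X_test)))
print("bagged ensemble test MSE:    ", mean_squared_error(y_test, bagged_pred))
```

On a run like this the bagged ensemble's test MSE is typically well below the single tree's: each tree keeps its low bias, and the bootstrap sampling decorrelates the trees enough for the averaging to cancel much of the variance.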