Overfitting

The model memorises the training data instead of learning generalisable patterns. Training loss continues to decrease while validation loss increases — the model is getting better at the training set and worse at everything else. The original machine learning problem.

Imagine a student who memorises every practice exam word-for-word instead of understanding the underlying concepts. They score 100% on practice exams but fail the real one because the questions are slightly different. The student’s “learning” is brittle — tied to specific examples rather than transferable principles.

Neural networks can memorise training data because they have enormous capacity. Zhang et al. (2017) famously showed that standard architectures can fit random labels with 100% training accuracy — the model doesn’t need to find patterns because it has enough parameters to store each example individually. This means that low training loss tells you nothing about whether the model has learned anything useful. Only validation loss on held-out data reveals generalisation.
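
The capacity argument can be seen in miniature with a plain linear model: give it more parameters than training examples and it can fit any labels exactly, even random ones. A toy sketch using only numpy (the sizes are arbitrary, and a linear model stands in here for the neural networks Zhang et al. studied):

```python
import numpy as np

rng = np.random.default_rng(0)

n_samples, n_features = 20, 100   # more parameters than examples
X = rng.normal(size=(n_samples, n_features))
y = rng.integers(0, 2, size=n_samples).astype(float)  # random labels

# Least-squares fit: with n_features > n_samples the system is
# underdetermined, so a zero-residual solution almost surely exists.
w, *_ = np.linalg.lstsq(X, y, rcond=None)

train_error = np.max(np.abs(X @ w - y))
print(f"max training error: {train_error:.2e}")  # effectively zero
```

The training error is zero to numerical precision, yet the "model" has learned nothing — the labels were random. Perfect training fit is uninformative about generalisation, which is the same point Zhang et al. make for neural networks.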

The surprising twist in modern deep learning is that overparameterised models often generalise well despite having the capacity to memorise. This is because SGD, architecture choices, and implicit regularisation act as inductive biases that steer the model toward simple, generalisable solutions. But this only works up to a point — on small datasets, limited domains, or without regularisation, overfitting remains the primary failure mode.

  • Growing train-val gap: training loss decreases while validation loss increases or plateaus — the most classic signal
  • Training accuracy reaches nearly 100% on complex tasks where that shouldn’t be possible
  • Model performs poorly on slightly different data — good on the test set but fails on real-world inputs that differ from the training distribution
  • Model is overconfident — predicts extreme probabilities (0.99, 0.01) rather than calibrated ones
  • Performance improves dramatically with more data — if doubling your dataset gives a large improvement, the model was overfitting to the smaller set
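
The first signal above — a growing train-val gap — is easy to check mechanically from logged losses. A minimal sketch (the window size is illustrative, not a standard value):

```python
def trainval_gap_growing(train_losses, val_losses, window=3):
    """Flag overfitting: training loss still falling while validation
    loss has risen over the last `window` epochs."""
    if len(val_losses) < window + 1:
        return False  # not enough history to judge
    train_falling = train_losses[-1] < train_losses[-1 - window]
    val_rising = val_losses[-1] > val_losses[-1 - window]
    return train_falling and val_rising

# Classic overfitting curve: train keeps improving, val turns around.
train = [1.0, 0.7, 0.5, 0.35, 0.25, 0.18]
val   = [1.1, 0.9, 0.8, 0.82, 0.88, 0.95]
print(trainval_gap_growing(train, val))  # True
```

In practice you would run a check like this against your training logs (or let your framework's early-stopping callback do it for you).
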
  • NN training (nn-training/): the primary concern for supervised learning — weight decay, dropout, and early stopping are standard mitigations
  • Q-learning (q-learning/): Q-networks can overfit to frequently-visited states while performing poorly on rare but important states — larger replay buffers and data diversity help
  • Contrastive learning (contrastive-self-supervising/): representation overfitting — the encoder learns features specific to the augmentation strategy rather than semantically meaningful features
  • GANs (gans/): the discriminator can overfit to the current generator distribution, providing uninformative gradients — this is partly why discriminator regularisation (spectral norm, gradient penalty) matters
| Solution | Mechanism | Where documented |
| --- | --- | --- |
| Weight decay | Penalises large weights, reducing effective model capacity | atomic-concepts/regularisation/weight-decay.md |
| Dropout | Randomly zeros activations during training, preventing co-adaptation | atomic-concepts/regularisation/dropout.md |
| Label smoothing | Softens one-hot targets, preventing the model from becoming overconfident | atomic-concepts/regularisation/label-smoothing.md |
| Early stopping | Stop training when validation loss starts increasing | nn-training/ |
| Data augmentation | Increases the effective training set size by applying random transformations | (standard practice) |
| More data | The most reliable cure — overfitting is fundamentally a data scarcity problem | (standard practice) |

Overfitting has been understood since the early days of statistics — it’s closely related to the bias-variance tradeoff of classical statistics. In machine learning, the term became central through the 1990s and 2000s when model selection and regularisation were the primary research concerns. The deep learning revolution initially seemed to sidestep the problem — dropout (Srivastava et al., 2014), batch normalisation, and massive datasets reduced overfitting enough that practitioners could focus on architecture design. Zhang et al. (2017) sharpened the question by showing that networks can memorise but usually don’t, prompting a rich line of work on implicit regularisation and why SGD favours generalisable solutions.