Overfitting

The model memorises the training data instead of learning generalisable patterns. Training loss continues to decrease while validation loss increases — the model is getting better at the training set and worse at everything else. The original machine learning problem.

Imagine a student who memorises every practice exam word-for-word instead of understanding the underlying concepts. They score 100% on practice exams but fail the real one because the questions are slightly different. The student’s “learning” is brittle — tied to specific examples rather than transferable principles.

Neural networks can memorise training data because they have enormous capacity. Zhang et al. (2017) famously showed that standard architectures can fit random labels with 100% training accuracy — the model doesn’t need to find patterns because it has enough parameters to store each example individually. This means that low training loss tells you nothing about whether the model has learned anything useful. Only validation loss on held-out data reveals generalisation.
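
The capacity argument can be seen in miniature with a plain linear model: give it more parameters than training examples and it can fit any labels exactly, even random ones. A toy sketch using only numpy (the sizes are arbitrary, and a linear model stands in here for the neural networks Zhang et al. studied):

```python
import numpy as np

rng = np.random.default_rng(0)

n_samples, n_features = 20, 100   # more parameters than examples
X = rng.normal(size=(n_samples, n_features))
y = rng.integers(0, 2, size=n_samples).astype(float)  # random labels

# Least-squares fit: with n_features > n_samples the system is
# underdetermined, so a zero-residual solution almost surely exists.
w, *_ = np.linalg.lstsq(X, y, rcond=None)

train_error = np.max(np.abs(X @ w - y))
print(f"max training error: {train_error:.2e}")  # effectively zero
```

The training error is zero to numerical precision, yet the "model" has learned nothing — the labels were random. Perfect training fit is uninformative about generalisation, which is the same point Zhang et al. make for neural networks.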

The surprising twist in modern deep learning is that overparameterised models often generalise well despite having the capacity to memorise. This is because SGD, architecture choices, and implicit regularisation act as inductive biases that steer the model toward simple, generalisable solutions. But this only works up to a point — on small datasets, limited domains, or without regularisation, overfitting remains the primary failure mode.

  • Growing train-val gap: training loss decreases while validation loss increases or plateaus — the most classic signal
  • Training accuracy reaches nearly 100% on complex tasks where that shouldn’t be possible
  • Model performs poorly on slightly different data — good on the test set but fails on real-world inputs that differ from the training distribution
  • Model is overconfident — predicts extreme probabilities (0.99, 0.01) rather than calibrated ones
  • Performance improves dramatically with more data — if doubling your dataset gives a large improvement, the model was overfitting to the smaller set
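
The first signal above — a growing train-val gap — is easy to check mechanically from logged losses. A minimal sketch (the window size is illustrative, not a standard value):

```python
def trainval_gap_growing(train_losses, val_losses, window=3):
    """Flag overfitting: training loss still falling while validation
    loss has risen over the last `window` epochs."""
    if len(val_losses) < window + 1:
        return False  # not enough history to judge
    train_falling = train_losses[-1] < train_losses[-1 - window]
    val_rising = val_losses[-1] > val_losses[-1 - window]
    return train_falling and val_rising

# Classic overfitting curve: train keeps improving, val turns around.
train = [1.0, 0.7, 0.5, 0.35, 0.25, 0.18]
val   = [1.1, 0.9, 0.8, 0.82, 0.88, 0.95]
print(trainval_gap_growing(train, val))  # True
```

In practice you would run a check like this against your training logs (or let your framework's early-stopping callback do it for you).
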
  • NN training (nn-training/): the primary concern for supervised learning — weight decay, dropout, and early stopping are standard mitigations
  • Q-learning (q-learning/): Q-networks can overfit to frequently-visited states while performing poorly on rare but important states — larger replay buffers and data diversity help
  • Contrastive learning (contrastive-self-supervising/): representation overfitting — the encoder learns features specific to the augmentation strategy rather than semantically meaningful features
  • GANs (gans/): the discriminator can overfit to the current generator distribution, providing uninformative gradients — this is partly why discriminator regularisation (spectral norm, gradient penalty) matters
| Solution | Mechanism | Where documented |
| --- | --- | --- |
| Weight decay | Penalises large weights, reducing effective model capacity | atomic-concepts/regularisation/weight-decay.md |
| Dropout | Randomly zeros activations during training, preventing co-adaptation | atomic-concepts/regularisation/dropout.md |
| Label smoothing | Softens one-hot targets, preventing the model from becoming overconfident | atomic-concepts/regularisation/label-smoothing.md |
| Early stopping | Stop training when validation loss starts increasing | nn-training/ |
| Data augmentation | Increases the effective training set size by applying random transformations | (standard practice) |
| More data | The most reliable cure — overfitting is fundamentally a data scarcity problem | (standard practice) |

Overfitting has been understood since the early days of statistics — it’s closely related to the bias-variance tradeoff of classical statistics. In machine learning, the term became central through the 1990s and 2000s when model selection and regularisation were the primary research concerns. The deep learning revolution initially seemed to sidestep the problem — dropout (Srivastava et al., 2014), batch normalisation, and massive datasets reduced overfitting enough that practitioners could focus on architecture design. Zhang et al. (2017) sharpened the question by showing that networks can memorise but usually don’t, prompting a rich line of work on implicit regularisation and why SGD favours generalisable solutions.