Bias-Variance Tradeoff

Reducing bias (underfitting) tends to increase variance (sensitivity to the specific training set), and vice versa. Every modelling choice navigates this tension: more capacity reduces bias but increases variance; more regularisation reduces variance but increases bias. This is the foundational tradeoff in statistical learning.

Imagine fitting a curve to 10 noisy data points. A straight line (high bias) will miss the true pattern but give similar answers no matter which 10 points you happened to sample. A degree-9 polynomial (high variance) will hit every point perfectly but look completely different with a different sample of 10 points. Neither extreme generalises well — the line can’t capture the pattern, and the polynomial captures the noise.
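This thought experiment can be run numerically. A minimal sketch, assuming a sine as the "true pattern" and Gaussian noise (both choices are arbitrary for illustration): refitting each model on fresh 10-point samples shows the line is stable but wrong on average, while the degree-9 polynomial is right on average but unstable.

```python
import numpy as np

rng = np.random.default_rng(0)
true_f = lambda x: np.sin(2 * np.pi * x)  # assumed "true pattern"
x_grid = np.linspace(0, 1, 10)            # the 10 sample locations
x_test = 0.25                             # where we compare predictions

def fit_and_predict(degree, n_resamples=500):
    """Refit a polynomial on fresh noisy samples; collect predictions at x_test."""
    preds = []
    for _ in range(n_resamples):
        y = true_f(x_grid) + rng.normal(0, 0.3, size=x_grid.size)
        preds.append(np.polyval(np.polyfit(x_grid, y, degree), x_test))
    return np.array(preds)

line_preds = fit_and_predict(1)   # high bias: consistent but misses the pattern
poly_preds = fit_and_predict(9)   # high variance: fits the noise of each sample
print(f"true value: {true_f(x_test):+.2f}")
print(f"degree 1: mean {line_preds.mean():+.2f}, std {line_preds.std():.2f}")
print(f"degree 9: mean {poly_preds.mean():+.2f}, std {poly_preds.std():.2f}")
```

The line's predictions cluster tightly around the wrong value (bias); the polynomial's average is close to the truth but individual fits scatter widely (variance).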

The expected error on new data decomposes cleanly: error = bias² + variance + irreducible noise. You can’t reduce both bias and variance simultaneously without more data or a better inductive bias. This is why simply making a model bigger doesn’t always help — you reduce bias but increase variance, and at some point the variance increase dominates.

In deep learning, the classical picture is complicated by the “double descent” phenomenon: very large overparameterised models can have low bias AND low variance, seemingly violating the tradeoff. The resolution is that implicit regularisation from SGD, architecture, and early stopping acts as a strong inductive bias that keeps variance in check even as capacity grows. The tradeoff still holds — it’s just that modern deep learning operates in a regime where implicit regularisation does most of the work.

For a prediction $\hat{f}(x)$ trained on dataset $D$, the expected squared error at a point $x$ decomposes as:

$$\mathbb{E}_D\left[(y - \hat{f}(x))^2\right] = \underbrace{\left(\mathbb{E}_D[\hat{f}(x)] - f(x)\right)^2}_{\text{bias}^2} + \underbrace{\mathbb{E}_D\left[(\hat{f}(x) - \mathbb{E}_D[\hat{f}(x)])^2\right]}_{\text{variance}} + \underbrace{\sigma^2_\epsilon}_{\text{noise}}$$

where $f(x)$ is the true function and $\sigma^2_\epsilon$ is the irreducible noise.
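The decomposition can be verified empirically by simulating many training sets. A sketch, assuming a cubic model class, a sine target, and Gaussian noise (all invented for the demo): the Monte Carlo estimate of the left-hand side matches bias² + variance + noise.

```python
import numpy as np

rng = np.random.default_rng(1)
sigma = 0.3                          # irreducible noise level
f = lambda x: np.sin(2 * np.pi * x)  # true function
x_train = np.linspace(0, 1, 10)
x0 = 0.25                            # evaluate the decomposition at this point

preds, sq_errors = [], []
for _ in range(20000):
    # Draw a fresh training set D and fit the model on it
    y_train = f(x_train) + rng.normal(0, sigma, size=x_train.size)
    pred = np.polyval(np.polyfit(x_train, y_train, 3), x0)
    # Fresh noisy target at x0, as encountered at test time
    y0 = f(x0) + rng.normal(0, sigma)
    preds.append(pred)
    sq_errors.append((y0 - pred) ** 2)

preds = np.array(preds)
bias_sq = (preds.mean() - f(x0)) ** 2   # (E_D[f_hat] - f)^2
variance = preds.var()                  # spread of f_hat across training sets
lhs = np.mean(sq_errors)                # E_D[(y - f_hat)^2]
rhs = bias_sq + variance + sigma**2
print(f"E[(y - f_hat)^2]     = {lhs:.4f}")
print(f"bias^2 + var + noise = {rhs:.4f}")
```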

  • Bias: how far the average prediction (across all possible training sets) is from the truth
  • Variance: how much the prediction varies across different training sets
  • Noise: inherent randomness in the data that no model can capture
  • High bias (underfitting): training loss AND validation loss are both high; the model can’t fit even the training data well
  • High variance (overfitting): training loss is low but validation loss is much higher; large gap between train and val curves
  • Unstable predictions: retrain the same architecture on different random seeds or data splits and get wildly different models (high variance)
  • In RL (GAE λ): low λ gives low-variance but biased advantage estimates (relies on learned value function); high λ gives unbiased but high-variance estimates (relies on actual returns)
  • In ensembles: if individual models disagree strongly, variance is high; if they all make the same wrong prediction, bias is high
  • Policy gradient (policy-gradient/): GAE λ is an explicit bias-variance dial — λ=0 is pure bootstrapping (low variance, high bias from value function errors); λ=1 is Monte Carlo (unbiased, high variance)
  • Q-learning (q-learning/): n-step returns trade off the same way — 1-step has low variance but high bias from bootstrapping; full-episode returns are unbiased but high variance
  • NN training (nn-training/): regularisation strength (weight decay, dropout rate) directly controls the tradeoff — too much regularisation → underfitting (bias), too little → overfitting (variance)
  • Diffusion (diffusion/): the number of denoising steps at inference trades compute for variance — fewer steps are faster but noisier
  • Contrastive learning (contrastive-self-supervising/): batch size controls the quality of the negative sample estimate — small batches give high-variance gradient estimates
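The GAE λ dial from the policy-gradient and Q-learning bullets can be made concrete. A minimal sketch (the reward and value numbers are made up, and the function name is illustrative):

```python
import numpy as np

def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalised Advantage Estimation over one episode.

    lam=0 -> one-step TD advantage: low variance, biased by value-function error.
    lam=1 -> Monte Carlo return minus baseline: unbiased, high variance.
    `values` holds V(s_0)..V(s_T), one more entry than `rewards`.
    """
    advantages = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD error
        running = delta + gamma * lam * running                 # exp-weighted sum
        advantages[t] = running
    return advantages

rewards = np.array([1.0, 0.0, 1.0, 1.0])
values = np.array([0.5, 0.4, 0.6, 0.3, 0.0])  # made-up V estimates; terminal V = 0
for lam in (0.0, 0.5, 1.0):
    print(f"lambda={lam}: {np.round(gae(rewards, values, lam=lam), 3)}")
```

At λ=0 each advantage is a single TD error (it trusts the value function); at λ=1 the recursion telescopes to the discounted return minus the baseline (it trusts only observed rewards).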
| Solution | Mechanism | Where documented |
| --- | --- | --- |
| GAE (λ parameter) | Exponentially-weighted mix of n-step returns — λ interpolates between bias and variance extremes | atomic-concepts/rl-specific/generalised-advantage-estimation.md |
| Dropout | Randomly drops units during training, reducing co-adaptation (variance) at cost of slight underfitting (bias) | atomic-concepts/regularisation/dropout.md |
| Weight decay | Shrinks weights toward zero, reducing the model's effective capacity | atomic-concepts/regularisation/weight-decay.md |
| Ensemble methods | Average multiple models to reduce variance without increasing bias | (standard ML practice) |
| Early stopping | Stop training when validation loss starts increasing — implicitly limits effective capacity | nn-training/ |
| Importance sampling correction | Corrects for off-policy bias in RL at the cost of increased variance | atomic-concepts/mathematical-tricks/importance-sampling.md |
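The ensemble-methods entry above can be illustrated with a small bagging sketch, using 1-nearest-neighbour regression as a deliberately unstable base model (the data-generating setup is invented for the demo): averaging members trained on bootstrap resamples shrinks the prediction spread while leaving the average prediction essentially unchanged.

```python
import numpy as np

rng = np.random.default_rng(3)
f = lambda x: np.sin(2 * np.pi * x)   # assumed true function
x_train = np.linspace(0, 1, 20)
x0 = 0.25
n = x_train.size

def one_nn(xs, ys, x):
    """1-nearest-neighbour regression: an unstable, high-variance estimator."""
    return ys[np.argmin(np.abs(xs - x))]

single, bagged = [], []
for _ in range(2000):
    y = f(x_train) + rng.normal(0, 0.3, size=n)
    single.append(one_nn(x_train, y, x0))
    # Bagging: average 25 members, each fit on a bootstrap resample of this dataset
    idx = rng.integers(0, n, size=(25, n))
    bagged.append(np.mean([one_nn(x_train[i], y[i], x0) for i in idx]))

single, bagged = np.array(single), np.array(bagged)
true = f(x0)
print(f"single 1-NN: bias {single.mean() - true:+.3f}, std {single.std():.3f}")
print(f"bagged 1-NN: bias {bagged.mean() - true:+.3f}, std {bagged.std():.3f}")
```

Bagging helps precisely because 1-NN is unstable: different resamples pick different neighbours, so averaging spreads the prediction over several nearby points' noise instead of a single one.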

The bias-variance decomposition comes from statistical decision theory (Geman, Bienenstock & Doursat, 1992, though the ideas trace to earlier work by Stein and James in the 1960s). It was the dominant framework for understanding generalisation in machine learning through the 1990s and 2000s. The arrival of deep learning initially seemed to violate it — massively overparameterised networks generalised well despite having enough capacity to memorise the training set. Belkin et al. (2019) reconciled this with the “double descent” curve, showing that the classical U-shaped test error curve (bias-variance tradeoff) appears in the underparameterised regime, but performance improves again in the heavily overparameterised regime. The tradeoff remains the correct mental model for most practical decisions (regularisation strength, model size, data augmentation), even if the extreme overparameterised regime adds nuance.