Bias-Variance Tradeoff

Reducing bias (underfitting) tends to increase variance (sensitivity to the specific training set), and vice versa. Every modelling choice navigates this tension: more capacity reduces bias but increases variance; more regularisation reduces variance but increases bias. This is the foundational tradeoff in statistical learning.

Imagine fitting a curve to 10 noisy data points. A straight line (high bias) will miss the true pattern but give similar answers no matter which 10 points you happened to sample. A degree-9 polynomial (high variance) will hit every point perfectly but look completely different with a different sample of 10 points. Neither extreme generalises well — the line can’t capture the pattern, and the polynomial captures the noise.
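This thought experiment can be run numerically. A minimal sketch, assuming a sine as the "true pattern" and Gaussian noise (both choices are arbitrary for illustration): refitting each model on fresh 10-point samples shows the line is stable but wrong on average, while the degree-9 polynomial is right on average but unstable.

```python
import numpy as np

rng = np.random.default_rng(0)
true_f = lambda x: np.sin(2 * np.pi * x)  # assumed "true pattern"
x_grid = np.linspace(0, 1, 10)            # the 10 sample locations
x_test = 0.25                             # where we compare predictions

def fit_and_predict(degree, n_resamples=500):
    """Refit a polynomial on fresh noisy samples; collect predictions at x_test."""
    preds = []
    for _ in range(n_resamples):
        y = true_f(x_grid) + rng.normal(0, 0.3, size=x_grid.size)
        preds.append(np.polyval(np.polyfit(x_grid, y, degree), x_test))
    return np.array(preds)

line_preds = fit_and_predict(1)   # high bias: consistent but misses the pattern
poly_preds = fit_and_predict(9)   # high variance: fits the noise of each sample
print(f"true value: {true_f(x_test):+.2f}")
print(f"degree 1: mean {line_preds.mean():+.2f}, std {line_preds.std():.2f}")
print(f"degree 9: mean {poly_preds.mean():+.2f}, std {poly_preds.std():.2f}")
```

The line's predictions cluster tightly around the wrong value (bias); the polynomial's average is close to the truth but individual fits scatter widely (variance).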

The expected error on new data decomposes cleanly: error = bias² + variance + irreducible noise. You can’t reduce both bias and variance simultaneously without more data or a better inductive bias. This is why simply making a model bigger doesn’t always help — you reduce bias but increase variance, and at some point the variance increase dominates.

In deep learning, the classical picture is complicated by the “double descent” phenomenon: very large overparameterised models can have low bias AND low variance, seemingly violating the tradeoff. The resolution is that implicit regularisation from SGD, architecture, and early stopping acts as a strong inductive bias that keeps variance in check even as capacity grows. The tradeoff still holds — it’s just that modern deep learning operates in a regime where implicit regularisation does most of the work.

For a prediction $\hat{f}(x)$ trained on dataset $D$, the expected squared error at a point $x$ decomposes as:

$$\mathbb{E}_D\left[(y - \hat{f}(x))^2\right] = \underbrace{\left(\mathbb{E}_D[\hat{f}(x)] - f(x)\right)^2}_{\text{bias}^2} + \underbrace{\mathbb{E}_D\left[(\hat{f}(x) - \mathbb{E}_D[\hat{f}(x)])^2\right]}_{\text{variance}} + \underbrace{\sigma^2_\epsilon}_{\text{noise}}$$

where $f(x)$ is the true function and $\sigma^2_\epsilon$ is the irreducible noise.
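The decomposition can be verified empirically by simulating many training sets. A sketch, assuming a cubic model class, a sine target, and Gaussian noise (all invented for the demo): the Monte Carlo estimate of the left-hand side matches bias² + variance + noise.

```python
import numpy as np

rng = np.random.default_rng(1)
sigma = 0.3                          # irreducible noise level
f = lambda x: np.sin(2 * np.pi * x)  # true function
x_train = np.linspace(0, 1, 10)
x0 = 0.25                            # evaluate the decomposition at this point

preds, sq_errors = [], []
for _ in range(20000):
    # Draw a fresh training set D and fit the model on it
    y_train = f(x_train) + rng.normal(0, sigma, size=x_train.size)
    pred = np.polyval(np.polyfit(x_train, y_train, 3), x0)
    # Fresh noisy target at x0, as encountered at test time
    y0 = f(x0) + rng.normal(0, sigma)
    preds.append(pred)
    sq_errors.append((y0 - pred) ** 2)

preds = np.array(preds)
bias_sq = (preds.mean() - f(x0)) ** 2   # (E_D[f_hat] - f)^2
variance = preds.var()                  # spread of f_hat across training sets
lhs = np.mean(sq_errors)                # E_D[(y - f_hat)^2]
rhs = bias_sq + variance + sigma**2
print(f"E[(y - f_hat)^2]     = {lhs:.4f}")
print(f"bias^2 + var + noise = {rhs:.4f}")
```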

  • Bias: how far the average prediction (across all possible training sets) is from the truth
  • Variance: how much the prediction varies across different training sets
  • Noise: inherent randomness in the data that no model can capture
  • High bias (underfitting): training loss AND validation loss are both high; the model can’t fit even the training data well
  • High variance (overfitting): training loss is low but validation loss is much higher; large gap between train and val curves
  • Unstable predictions: retrain the same architecture on different random seeds or data splits and get wildly different models (high variance)
  • In RL (GAE λ): low λ gives low-variance but biased advantage estimates (relies on learned value function); high λ gives unbiased but high-variance estimates (relies on actual returns)
  • In ensembles: if individual models disagree strongly, variance is high; if they all make the same wrong prediction, bias is high
  • Policy gradient (policy-gradient/): GAE λ is an explicit bias-variance dial — λ=0 is pure bootstrapping (low variance, high bias from value function errors); λ=1 is Monte Carlo (unbiased, high variance)
  • Q-learning (q-learning/): n-step returns trade off the same way — 1-step has low variance but high bias from bootstrapping; full-episode returns are unbiased but high variance
  • NN training (nn-training/): regularisation strength (weight decay, dropout rate) directly controls the tradeoff — too much regularisation → underfitting (bias), too little → overfitting (variance)
  • Diffusion (diffusion/): the number of denoising steps at inference trades compute for variance — fewer steps are faster but noisier
  • Contrastive learning (contrastive-self-supervising/): batch size controls the quality of the negative sample estimate — small batches give high-variance gradient estimates
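The GAE λ dial from the policy-gradient and Q-learning bullets can be made concrete. A minimal sketch (the reward and value numbers are made up, and the function name is illustrative):

```python
import numpy as np

def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalised Advantage Estimation over one episode.

    lam=0 -> one-step TD advantage: low variance, biased by value-function error.
    lam=1 -> Monte Carlo return minus baseline: unbiased, high variance.
    `values` holds V(s_0)..V(s_T), one more entry than `rewards`.
    """
    advantages = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD error
        running = delta + gamma * lam * running                 # exp-weighted sum
        advantages[t] = running
    return advantages

rewards = np.array([1.0, 0.0, 1.0, 1.0])
values = np.array([0.5, 0.4, 0.6, 0.3, 0.0])  # made-up V estimates; terminal V = 0
for lam in (0.0, 0.5, 1.0):
    print(f"lambda={lam}: {np.round(gae(rewards, values, lam=lam), 3)}")
```

At λ=0 each advantage is a single TD error (it trusts the value function); at λ=1 the recursion telescopes to the discounted return minus the baseline (it trusts only observed rewards).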
| Solution | Mechanism | Where documented |
| --- | --- | --- |
| GAE (λ parameter) | Exponentially-weighted mix of n-step returns — λ interpolates between bias and variance extremes | atomic-concepts/rl-specific/generalised-advantage-estimation.md |
| Dropout | Randomly drops units during training, reducing co-adaptation (variance) at cost of slight underfitting (bias) | atomic-concepts/regularisation/dropout.md |
| Weight decay | Shrinks weights toward zero, reducing the model's effective capacity | atomic-concepts/regularisation/weight-decay.md |
| Ensemble methods | Average multiple models to reduce variance without increasing bias | (standard ML practice) |
| Early stopping | Stop training when validation loss starts increasing — implicitly limits effective capacity | nn-training/ |
| Importance sampling correction | Corrects for off-policy bias in RL at the cost of increased variance | atomic-concepts/mathematical-tricks/importance-sampling.md |
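The ensemble-methods entry above can be illustrated with a small bagging sketch, using 1-nearest-neighbour regression as a deliberately unstable base model (the data-generating setup is invented for the demo): averaging members trained on bootstrap resamples shrinks the prediction spread while leaving the average prediction essentially unchanged.

```python
import numpy as np

rng = np.random.default_rng(3)
f = lambda x: np.sin(2 * np.pi * x)   # assumed true function
x_train = np.linspace(0, 1, 20)
x0 = 0.25
n = x_train.size

def one_nn(xs, ys, x):
    """1-nearest-neighbour regression: an unstable, high-variance estimator."""
    return ys[np.argmin(np.abs(xs - x))]

single, bagged = [], []
for _ in range(2000):
    y = f(x_train) + rng.normal(0, 0.3, size=n)
    single.append(one_nn(x_train, y, x0))
    # Bagging: average 25 members, each fit on a bootstrap resample of this dataset
    idx = rng.integers(0, n, size=(25, n))
    bagged.append(np.mean([one_nn(x_train[i], y[i], x0) for i in idx]))

single, bagged = np.array(single), np.array(bagged)
true = f(x0)
print(f"single 1-NN: bias {single.mean() - true:+.3f}, std {single.std():.3f}")
print(f"bagged 1-NN: bias {bagged.mean() - true:+.3f}, std {bagged.std():.3f}")
```

Bagging helps precisely because 1-NN is unstable: different resamples pick different neighbours, so averaging spreads the prediction over several nearby points' noise instead of a single one.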

The bias-variance decomposition comes from statistical decision theory (Geman, Bienenstock & Doursat, 1992, though the ideas trace to earlier work by Stein and James in the 1960s). It was the dominant framework for understanding generalisation in machine learning through the 1990s and 2000s. The arrival of deep learning initially seemed to violate it — massively overparameterised networks generalised well despite having enough capacity to memorise the training set. Belkin et al. (2019) reconciled this with the “double descent” curve, showing that the classical U-shaped test error curve (bias-variance tradeoff) appears in the underparameterised regime, but performance improves again in the heavily overparameterised regime. The tradeoff remains the correct mental model for most practical decisions (regularisation strength, model size, data augmentation), even if the extreme overparameterised regime adds nuance.