Maximisation Bias
Taking the maximum of noisy estimates produces a systematically optimistic result: $\mathbb{E}[\max_i X_i] \ge \max_i \mathbb{E}[X_i]$. When you use the same noisy estimates to both select the best option and evaluate how good it is, you get a positively biased answer every time.
Intuition
Imagine 10 fair coins, each flipped 3 times. On average each shows 1.5 heads, but if you pick the coin that happened to show the most heads, you’ll often pick one that showed 3/3 — and you’ll estimate that coin’s true rate as 100%. You used the same data to choose the winner and to measure it, so luck in being selected and luck in appearing good are the same luck, counted twice.
This is not a small-sample fluke. It happens whenever estimates have variance and you take a max. More options make it worse (more lottery tickets = higher “winning” score). More noise makes it worse (wider variance = more room for upward flukes). It only disappears when your estimates are perfect — i.e., zero variance.
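The coin example is easy to check by simulation. A minimal sketch (function name hypothetical), using only the standard library:

```python
import random

random.seed(0)

def best_coin_rate(n_coins=10, flips=3, trials=100_000):
    """Flip each fair coin a few times, pick the coin with the most heads,
    and report its apparent heads-rate -- measured on the same flips used
    to pick it. The true rate of every coin is 0.5."""
    total = 0.0
    for _ in range(trials):
        heads = [sum(random.random() < 0.5 for _ in range(flips))
                 for _ in range(n_coins)]
        total += max(heads) / flips  # winner scored on its selection data
    return total / trials

rate = best_coin_rate()
print(f"apparent rate of the 'best' coin: {rate:.2f}")  # far above 0.5
```

With these numbers the winner shows 3/3 most of the time, so the apparent rate lands around 0.9 even though every coin's true rate is 0.5.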
In Q-learning, this matters because the Bellman target uses $\max_{a'} Q(s', a')$: the agent queries its own noisy Q-estimates, picks the action with the highest value, and uses that (overestimated) value as a training target. The overestimation feeds back into training, creating a vicious cycle: overestimated values → overoptimistic targets → even more overestimated values.
For random variables $X_1, \dots, X_n$ with $\mathbb{E}[X_i] = \mu_i$ for all $i$, Jensen’s inequality (the max is convex) gives:

$$\mathbb{E}\Big[\max_i X_i\Big] \;\ge\; \max_i \mathbb{E}[X_i] = \max_i \mu_i$$
The gap grows with (a) the number of variables $n$ and (b) the variance of each $X_i$. For $n$ i.i.d. Gaussians $\mathcal{N}(\mu, \sigma^2)$, the expected maximum grows as $\mu + \sigma\sqrt{2\ln n}$.
In Q-learning, the target is:

$$y = r + \gamma \max_{a'} Q_\theta(s', a')$$
The same network both selects $a^* = \arg\max_{a'} Q_\theta(s', a')$ and evaluates $Q_\theta(s', a^*)$. Double DQN decouples these: select with the online network $\theta$, evaluate with the target network $\theta^-$.
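The effect of decoupling can be checked in a toy setting. A sketch under stated assumptions: one state where every action’s true value is 0, and the “online” and “target” networks modelled as two independent noisy estimates:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model: 10 actions, all with true value 0. The "online" and "target"
# networks are stand-ins: two independent noisy estimates of those values.
n_states, n_actions = 100_000, 10
q_online = rng.normal(0.0, 1.0, size=(n_states, n_actions))
q_target = rng.normal(0.0, 1.0, size=(n_states, n_actions))

# Standard DQN bootstrap term: select AND evaluate with the same estimates.
standard = q_online.max(axis=1).mean()

# Double DQN: select with the online estimates, evaluate with the target ones.
chosen = q_online.argmax(axis=1)
double = q_target[np.arange(n_states), chosen].mean()

print(f"standard DQN term: {standard:+.2f} (true value is 0)")
print(f"double DQN term:   {double:+.2f}")
```

The standard term comes out around +1.5 (the expected max of 10 standard normals), while the double estimate sits near 0: selection noise and evaluation noise are now independent, so the luck no longer counts twice.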
Manifestation
- Q-values grow unrealistically large — if the true optimal return is ~100 but your Q-values climb to 500+, maximisation bias is likely the cause
- Learning is unstable with periodic spikes in Q-values followed by crashes in actual performance
- The agent becomes overconfident about actions it has tried infrequently — rare actions accumulate less data, so their Q-estimates have higher variance, making them more susceptible to the max operator
- Divergence between estimated and actual returns — plot the mean Q-value alongside the true episodic return; a growing gap signals bias
Where It Appears
- Q-learning (q-learning/): the canonical setting — $\max_{a'} Q(s', a')$ in the Bellman target is directly biased → Double DQN decouples selection and evaluation
- Policy gradient (policy-gradient/): less direct, but advantage estimation using a learned value function can inherit overestimation if the value function is used to select among rollouts
- Hyperparameter tuning (general): selecting the configuration with the best validation score from many trials overestimates true performance — same statistical phenomenon, different domain
- Model selection: choosing the best model from a set based on test performance inflates reported accuracy
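The tuning and model-selection cases are the same arithmetic outside RL. A sketch with made-up numbers: 50 configurations that all have the same true accuracy of 0.80, with validation and test scores as independent noisy measurements of it:

```python
import numpy as np

rng = np.random.default_rng(1)

n_runs, n_configs, noise, true_acc = 10_000, 50, 0.03, 0.80
val = true_acc + rng.normal(0.0, noise, size=(n_runs, n_configs))
test = true_acc + rng.normal(0.0, noise, size=(n_runs, n_configs))

best = val.argmax(axis=1)          # pick the config with the best val score
rows = np.arange(n_runs)
reported = val[rows, best].mean()  # the number you would report
actual = test[rows, best].mean()   # what that config really achieves

print(f"reported (val score of winner):  {reported:.3f}")
print(f"actual   (test score of winner): {actual:.3f}")
```

Selecting and scoring on the same validation split inflates the reported accuracy well above 0.80, while the held-out test score of the winner stays at the true value — which is exactly why evaluation data must be separate from selection data.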
Solutions at a Glance
| Solution | Mechanism | Where documented |
|---|---|---|
| Double DQN | Select action with online network, evaluate with target network | q-learning/ |
| Clipped Double Q (SAC, TD3) | Take the minimum of two independent Q-estimates as the target | q-learning/ (SAC variant) |
| Target networks | Slow-moving Q-target reduces variance of the evaluation | atomic-concepts/rl-specific/polyak-averaging.md |
| CQL (Conservative Q-Learning) | Explicitly penalises high Q-values on out-of-distribution actions | q-learning/ |
| Cross-validation | Use held-out data for evaluation, not the same data used for selection | (standard ML practice) |
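As a sanity check on the “minimum of two estimates” row, a sketch in the same toy setting as the coin example (true action values all 0, the two critics modelled as independent noise, and action selection proxied by an argmax over the first critic):

```python
import numpy as np

rng = np.random.default_rng(2)

# Two independent critics over 10 actions whose true values are all 0.
n_states, n_actions = 100_000, 10
q1 = rng.normal(0.0, 1.0, size=(n_states, n_actions))
q2 = rng.normal(0.0, 1.0, size=(n_states, n_actions))

# Single-critic target term: plain max, biased upward.
single = q1.max(axis=1).mean()

# Clipped double Q (TD3/SAC style): pick the action via one critic,
# then bootstrap from the minimum of the two critics at that action.
rows = np.arange(n_states)
chosen = q1.argmax(axis=1)
clipped = np.minimum(q1[rows, chosen], q2[rows, chosen]).mean()

print(f"single critic: {single:+.2f}, clipped double Q: {clipped:+.2f}")
```

The clipped estimate lands near zero — in fact slightly below it: clipped double Q trades the overestimation for a mild underestimation bias, which is generally the safer direction for bootstrapped targets.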
Historical Context
Maximisation bias was identified in the statistics literature in the 1960s as a property of order statistics. Van Hasselt (2010) brought it into the RL spotlight with the Double Q-learning paper, showing that standard Q-learning’s systematic overestimation was not just a theoretical curiosity but a practical training failure. Van Hasselt, Guez & Silver (2016) scaled the idea to deep networks as Double DQN, which became the standard baseline for value-based deep RL. The insight is now considered foundational — every modern off-policy method (SAC, TD3, CQL) includes some mechanism to counteract it.