Maximisation Bias

Taking the maximum of noisy estimates produces a systematically optimistic result: $\mathbb{E}[\max(X_1, \ldots, X_n)] \geq \max(\mathbb{E}[X_1], \ldots, \mathbb{E}[X_n])$. When you use the same noisy estimates both to select the best option and to evaluate how good it is, you get a positively biased answer every time.

Imagine 10 fair coins, each flipped 3 times. On average each shows 1.5 heads, but if you pick the coin that happened to show the most heads, you’ll often pick one that showed 3/3 — and you’ll estimate that coin’s true rate as 100%. You used the same data to choose the winner and to measure it, so luck in being selected and luck in appearing good are the same luck, counted twice.
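The coin experiment is easy to reproduce in a few lines. This sketch (coin counts, trial count, and helper name are illustrative) measures the winner on the very flips that selected it:

```python
import random

random.seed(0)

def pick_and_score(n_coins=10, flips=3, trials=50_000):
    """Flip each coin `flips` times, pick the coin with the most heads,
    and report the winner's empirical head rate, measured on the same flips."""
    total = 0.0
    for _ in range(trials):
        heads = [sum(random.random() < 0.5 for _ in range(flips))
                 for _ in range(n_coins)]
        total += max(heads) / flips  # luck in selection = luck in measurement
    return total / trials

est = pick_and_score()
print(f"selected coin's apparent head rate: {est:.3f}")
# Every coin is fair (true rate 0.5), yet the selected coin looks far better.
```

Scoring the winner on a *fresh* set of flips instead would give an unbiased estimate near 0.5, which is exactly the decoupling trick used later by Double DQN.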

This is not a small-sample fluke. It happens whenever estimates have variance and you take a max. More options make it worse (more lottery tickets = higher “winning” score). More noise makes it worse (wider variance = more room for upward flukes). It only disappears when your estimates are perfect — i.e., zero variance.

In Q-learning, this matters because the Bellman target uses $\max_{a'} Q(s', a')$ — the agent queries its own noisy Q-estimates, picks the action with the highest value, and uses that (overestimated) value as a training target. The overestimation feeds back into training, creating a vicious cycle: overestimated values → overoptimistic targets → even more overestimated values.

For random variables $X_1, \ldots, X_n$ with $\mathbb{E}[X_i] = \mu$ for all $i$, Jensen’s inequality gives:

$$\mathbb{E}\left[\max_i X_i\right] \geq \max_i \mathbb{E}[X_i] = \mu$$

The gap grows with (a) the number of variables $n$ and (b) the variance of each $X_i$. For $n$ i.i.d. Gaussians $\mathcal{N}(\mu, \sigma^2)$, the expected maximum grows as $\mu + \sigma \cdot \Theta(\sqrt{\log n})$.
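A quick Monte Carlo check of that growth rate, assuming standard normals ($\mu = 0$, $\sigma = 1$); $\sqrt{2 \ln n}$ is shown alongside as the classical asymptotic scale for the maximum of $n$ Gaussians:

```python
import numpy as np

rng = np.random.default_rng(0)

def expected_max(n, trials=5_000):
    """Monte Carlo estimate of E[max of n i.i.d. N(0, 1) draws]."""
    return rng.standard_normal((trials, n)).max(axis=1).mean()

for n in (2, 10, 100, 1000):
    print(f"n={n:5d}  E[max] ~ {expected_max(n):.3f}   "
          f"sqrt(2 ln n) = {np.sqrt(2 * np.log(n)):.3f}")
```

The estimated maximum climbs steadily with $n$ even though every variable has mean zero — each extra "lottery ticket" pushes the winning score higher.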

In Q-learning, the target is:

$$y = r + \gamma \max_{a'} Q_\theta(s', a')$$

The same network $Q_\theta$ both selects the action $a^* = \arg\max_{a'} Q_\theta(s', a')$ and evaluates it via $Q_\theta(s', a^*)$. Double DQN decouples these two roles: select with the online network $Q_\theta$, evaluate with the target network $Q_{\theta^-}$.
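A minimal NumPy sketch of the two targets, with small Q-tables standing in for $Q_\theta$ and $Q_{\theta^-}$ (all numbers are illustrative):

```python
import numpy as np

gamma = 0.99
r = np.array([1.0, 0.0])          # rewards for a batch of two transitions
q_online = np.array([[1.0, 3.0],  # Q_theta(s', a') for each transition
                     [2.0, 2.5]])
q_target = np.array([[0.8, 2.0],  # Q_{theta^-}(s', a'), the target network
                     [2.2, 1.5]])

# Vanilla DQN: the same (online) estimates both select and evaluate.
y_dqn = r + gamma * q_online.max(axis=1)

# Double DQN: online network selects a*, target network evaluates it.
a_star = q_online.argmax(axis=1)
y_ddqn = r + gamma * q_target[np.arange(len(r)), a_star]

print(y_dqn)   # trusts the online network's own maxima
print(y_ddqn)  # target network's value at the online argmax
```

Whenever the online network's maximum is an upward fluke, the target network's independent estimate at that action pulls the target back down.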

Warning signs that maximisation bias is at work:

  • Q-values grow unrealistically large — if the true optimal return is ~100 but your Q-values climb to 500+, maximisation bias is likely the cause
  • Learning is unstable, with periodic spikes in Q-values followed by crashes in actual performance
  • The agent becomes overconfident about actions it has tried infrequently — rare actions accumulate less data, so their Q-estimates have higher variance, making them more susceptible to the max operator
  • Divergence between estimated and actual returns — plot the mean Q-value alongside the true episodic return; a growing gap signals bias
The same bias shows up across settings:

  • Q-learning (q-learning/): the canonical setting — $\max_{a'} Q(s', a')$ in the Bellman target is directly biased → Double DQN decouples selection and evaluation
  • Policy gradient (policy-gradient/): less direct, but advantage estimation using a learned value function can inherit overestimation if the value function is used to select among rollouts
  • Hyperparameter tuning (general): selecting the configuration with the best validation score from many trials overestimates true performance — same statistical phenomenon, different domain
  • Model selection: choosing the best model from a set based on test performance inflates reported accuracy
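
The tuning and model-selection versions are the coin experiment in different clothes. A sketch with hypothetical scores — every configuration is equally good, yet the winner's validation score looks inflated:

```python
import random

random.seed(1)

n_configs, true_score, noise = 50, 0.70, 0.05

# Every configuration has the same true quality; validation scores are noisy.
val_scores = [random.gauss(true_score, noise) for _ in range(n_configs)]
best = max(range(n_configs), key=lambda i: val_scores[i])

# Evaluating the winner on fresh held-out data gives an unbiased estimate.
heldout = random.gauss(true_score, noise)

print(f"winner's validation score: {val_scores[best]:.3f}")  # inflated by selection
print(f"winner's held-out score:   {heldout:.3f}")           # centred on 0.70
```
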
| Solution | Mechanism | Where documented |
| --- | --- | --- |
| Double DQN | Select action with online network, evaluate with target network | q-learning/ |
| Clipped Double Q (SAC, TD3) | Take the minimum of two independent Q-estimates as the target | q-learning/ (SAC variant) |
| Target networks | Slow-moving Q-target reduces variance of the evaluation | atomic-concepts/rl-specific/polyak-averaging.md |
| CQL (Conservative Q-Learning) | Explicitly penalises high Q-values on out-of-distribution actions | q-learning/ |
| Cross-validation | Use held-out data for evaluation, not the same data used for selection | (standard ML practice) |
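The clipped double Q row can be sketched for a discrete-action case (TD3 and SAC apply the minimum at the action chosen by the policy; the numbers here are illustrative):

```python
import numpy as np

gamma, r = 0.99, 0.5
q1_next = np.array([1.0, 2.6])   # critic 1's estimates of Q(s', a')
q2_next = np.array([1.4, 2.1])   # critic 2's estimates of Q(s', a')

# Clipped double Q: take the elementwise minimum of the two critics before
# maximising, so an upward fluke in either critic alone cannot set the target.
y_clipped = r + gamma * np.minimum(q1_next, q2_next).max()

# For comparison, a single critic would trust its own fluky 2.6 estimate.
y_single = r + gamma * q1_next.max()
print(y_clipped, y_single)
```

Because independent critics rarely share the same upward fluke, the minimum is a pessimistic (and in practice far more stable) estimate of the true maximum.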

Maximisation bias was identified in the statistics literature in the 1960s as a property of order statistics. Van Hasselt (2010) brought it into the RL spotlight with the Double Q-learning paper, showing that standard Q-learning’s systematic overestimation was not just a theoretical curiosity but a practical training failure. Van Hasselt, Guez & Silver (2016) scaled the idea to deep networks as Double DQN, which became the standard baseline for value-based deep RL. The insight is now considered foundational — every modern off-policy method (SAC, TD3, CQL) includes some mechanism to counteract it.