Maximisation Bias

Taking the maximum of noisy estimates produces a systematically optimistic result: $\mathbb{E}[\max(X_1, \ldots, X_n)] \geq \max(\mathbb{E}[X_1], \ldots, \mathbb{E}[X_n])$. When you use the same noisy estimates both to select the best option and to evaluate how good it is, you get a positively biased answer every time.

Imagine 10 fair coins, each flipped 3 times. On average each shows 1.5 heads, but if you pick the coin that happened to show the most heads, you’ll often pick one that showed 3/3 — and you’ll estimate that coin’s true rate as 100%. You used the same data to choose the winner and to measure it, so luck in being selected and luck in appearing good are the same luck, counted twice.
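The coin experiment is easy to reproduce in a few lines. This sketch (coin counts, trial count, and helper name are illustrative) measures the winner on the very flips that selected it:

```python
import random

random.seed(0)

def pick_and_score(n_coins=10, flips=3, trials=50_000):
    """Flip each coin `flips` times, pick the coin with the most heads,
    and report the winner's empirical head rate, measured on the same flips."""
    total = 0.0
    for _ in range(trials):
        heads = [sum(random.random() < 0.5 for _ in range(flips))
                 for _ in range(n_coins)]
        total += max(heads) / flips  # luck in selection = luck in measurement
    return total / trials

est = pick_and_score()
print(f"selected coin's apparent head rate: {est:.3f}")
# Every coin is fair (true rate 0.5), yet the selected coin looks far better.
```

Scoring the winner on a *fresh* set of flips instead would give an unbiased estimate near 0.5, which is exactly the decoupling trick used later by Double DQN.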

This is not a small-sample fluke. It happens whenever estimates have variance and you take a max. More options make it worse (more lottery tickets = higher “winning” score). More noise makes it worse (wider variance = more room for upward flukes). It only disappears when your estimates are perfect — i.e., zero variance.

In Q-learning, this matters because the Bellman target uses $\max_{a'} Q(s', a')$ — the agent queries its own noisy Q-estimates, picks the action with the highest value, and uses that (overestimated) value as a training target. The overestimation feeds back into training, creating a vicious cycle: overestimated values → overoptimistic targets → even more overestimated values.

For random variables $X_1, \ldots, X_n$ with $\mathbb{E}[X_i] = \mu$ for all $i$, Jensen’s inequality gives:

$$\mathbb{E}\left[\max_i X_i\right] \geq \max_i \mathbb{E}[X_i] = \mu$$

The gap grows with (a) the number of variables $n$ and (b) the variance of each $X_i$. For $n$ i.i.d. Gaussians $\mathcal{N}(\mu, \sigma^2)$, the expected maximum grows as $\mu + \sigma \cdot \Theta(\sqrt{\log n})$.
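A quick Monte Carlo check of that growth rate, assuming standard normals ($\mu = 0$, $\sigma = 1$); $\sqrt{2 \ln n}$ is shown alongside as the classical asymptotic scale for the maximum of $n$ Gaussians:

```python
import numpy as np

rng = np.random.default_rng(0)

def expected_max(n, trials=5_000):
    """Monte Carlo estimate of E[max of n i.i.d. N(0, 1) draws]."""
    return rng.standard_normal((trials, n)).max(axis=1).mean()

for n in (2, 10, 100, 1000):
    print(f"n={n:5d}  E[max] ~ {expected_max(n):.3f}   "
          f"sqrt(2 ln n) = {np.sqrt(2 * np.log(n)):.3f}")
```

The estimated maximum climbs steadily with $n$ even though every variable has mean zero — each extra "lottery ticket" pushes the winning score higher.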

In Q-learning, the target is:

$$y = r + \gamma \max_{a'} Q_\theta(s', a')$$

The same network $Q_\theta$ both selects the action $a^* = \arg\max_{a'} Q_\theta(s', a')$ and evaluates it via $Q_\theta(s', a^*)$. Double DQN decouples these two roles: select with the online network $Q_\theta$, evaluate with the target network $Q_{\theta^-}$.
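A minimal NumPy sketch of the two targets, with small Q-tables standing in for $Q_\theta$ and $Q_{\theta^-}$ (all numbers are illustrative):

```python
import numpy as np

gamma = 0.99
r = np.array([1.0, 0.0])          # rewards for a batch of two transitions
q_online = np.array([[1.0, 3.0],  # Q_theta(s', a') for each transition
                     [2.0, 2.5]])
q_target = np.array([[0.8, 2.0],  # Q_{theta^-}(s', a'), the target network
                     [2.2, 1.5]])

# Vanilla DQN: the same (online) estimates both select and evaluate.
y_dqn = r + gamma * q_online.max(axis=1)

# Double DQN: online network selects a*, target network evaluates it.
a_star = q_online.argmax(axis=1)
y_ddqn = r + gamma * q_target[np.arange(len(r)), a_star]

print(y_dqn)   # trusts the online network's own maxima
print(y_ddqn)  # target network's value at the online argmax
```

Whenever the online network's maximum is an upward fluke, the target network's independent estimate at that action pulls the target back down.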

Warning signs that maximisation bias is at work:

  • Q-values grow unrealistically large — if the true optimal return is ~100 but your Q-values climb to 500+, maximisation bias is likely the cause
  • Learning is unstable, with periodic spikes in Q-values followed by crashes in actual performance
  • The agent becomes overconfident about actions it has tried infrequently — rare actions accumulate less data, so their Q-estimates have higher variance, making them more susceptible to the max operator
  • Divergence between estimated and actual returns — plot the mean Q-value alongside the true episodic return; a growing gap signals bias
The same bias shows up across settings:

  • Q-learning (q-learning/): the canonical setting — $\max_{a'} Q(s', a')$ in the Bellman target is directly biased → Double DQN decouples selection and evaluation
  • Policy gradient (policy-gradient/): less direct, but advantage estimation using a learned value function can inherit overestimation if the value function is used to select among rollouts
  • Hyperparameter tuning (general): selecting the configuration with the best validation score from many trials overestimates true performance — same statistical phenomenon, different domain
  • Model selection: choosing the best model from a set based on test performance inflates reported accuracy
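
The tuning and model-selection versions are the coin experiment in different clothes. A sketch with hypothetical scores — every configuration is equally good, yet the winner's validation score looks inflated:

```python
import random

random.seed(1)

n_configs, true_score, noise = 50, 0.70, 0.05

# Every configuration has the same true quality; validation scores are noisy.
val_scores = [random.gauss(true_score, noise) for _ in range(n_configs)]
best = max(range(n_configs), key=lambda i: val_scores[i])

# Evaluating the winner on fresh held-out data gives an unbiased estimate.
heldout = random.gauss(true_score, noise)

print(f"winner's validation score: {val_scores[best]:.3f}")  # inflated by selection
print(f"winner's held-out score:   {heldout:.3f}")           # centred on 0.70
```
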
| Solution | Mechanism | Where documented |
| --- | --- | --- |
| Double DQN | Select action with online network, evaluate with target network | q-learning/ |
| Clipped Double Q (SAC, TD3) | Take the minimum of two independent Q-estimates as the target | q-learning/ (SAC variant) |
| Target networks | Slow-moving Q-target reduces variance of the evaluation | atomic-concepts/rl-specific/polyak-averaging.md |
| CQL (Conservative Q-Learning) | Explicitly penalises high Q-values on out-of-distribution actions | q-learning/ |
| Cross-validation | Use held-out data for evaluation, not the same data used for selection | (standard ML practice) |
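The clipped double Q row can be sketched for a discrete-action case (TD3 and SAC apply the minimum at the action chosen by the policy; the numbers here are illustrative):

```python
import numpy as np

gamma, r = 0.99, 0.5
q1_next = np.array([1.0, 2.6])   # critic 1's estimates of Q(s', a')
q2_next = np.array([1.4, 2.1])   # critic 2's estimates of Q(s', a')

# Clipped double Q: take the elementwise minimum of the two critics before
# maximising, so an upward fluke in either critic alone cannot set the target.
y_clipped = r + gamma * np.minimum(q1_next, q2_next).max()

# For comparison, a single critic would trust its own fluky 2.6 estimate.
y_single = r + gamma * q1_next.max()
print(y_clipped, y_single)
```

Because independent critics rarely share the same upward fluke, the minimum is a pessimistic (and in practice far more stable) estimate of the true maximum.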

Maximisation bias was identified in the statistics literature in the 1960s as a property of order statistics. Van Hasselt (2010) brought it into the RL spotlight with the Double Q-learning paper, showing that standard Q-learning’s systematic overestimation was not just a theoretical curiosity but a practical training failure. Van Hasselt, Guez & Silver (2016) scaled the idea to deep networks as Double DQN, which became the standard baseline for value-based deep RL. The insight is now considered foundational — every modern off-policy method (SAC, TD3, CQL) includes some mechanism to counteract it.