Bootstrapping Bias
Using your own predictions as regression targets introduces circular bias: you're training the model to match its own (wrong) estimates. The error in the target propagates into the update, which changes the target, which changes the update. Unlike supervised learning, where targets are fixed ground truth, bootstrapped targets move with the model.
Intuition
Imagine a student who grades their own exams. They confidently mark wrong answers as correct, study those wrong answers, and become even more confident in them. That's bootstrapping bias: the model's errors feed back into its training signal, and there's no external correction.
In temporal difference (TD) learning, the target for V(s_t) is r_t + γV(s_{t+1}): it includes the model's own estimate of the next state's value. If V(s_{t+1}) is wrong, the target is wrong, and the update makes V(s_t) wrong in a way that's correlated with the original error. This is fundamentally different from supervised learning, where the labels don't change when you update the model.
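A minimal tabular sketch of this moving target, assuming a toy two-state chain with illustrative step sizes (not taken from any linked page):

```python
import numpy as np

def td0_update(V, s, r, s_next, done, alpha=0.1, gamma=0.99):
    """One tabular TD(0) update. The target bootstraps from V itself,
    so any error in V[s_next] leaks directly into V[s]."""
    target = r if done else r + gamma * V[s_next]  # bootstrapped target
    V[s] += alpha * (target - V[s])
    return V

# Toy chain: s0 -> s1 -> terminal, reward 1.0 on the final step.
V = np.zeros(2)
for _ in range(500):
    V = td0_update(V, 0, 0.0, 1, done=False)
    V = td0_update(V, 1, 1.0, 1, done=True)
print(V)  # V[1] approaches 1.0, V[0] approaches gamma * V[1]
```

Note that the target for s0 keeps moving as V[1] is learned: early updates for s0 chase a target that is still wrong.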
The bias doesn't necessarily cause divergence; TD learning has convergence guarantees in the tabular case. But with function approximation (neural networks), the bias combines with generalisation to create a dangerous feedback loop: updating V toward a biased target also changes V at nearby states (because the network generalises), which changes their targets, which changes nearby states' values, and so on. This is one leg of the "deadly triad" (bootstrapping + function approximation + off-policy learning).
Manifestation
- Q-values drift from true values in a correlated way: entire regions of state space become systematically over- or under-estimated
- Training is less stable than equivalent supervised learning — loss oscillates or diverges where a fixed-target version would converge
- Errors propagate backward through the state space: if Q is wrong at the terminal state, the bias spreads to all states that can reach it
- Performance is sensitive to target update frequency — update the target network too fast and bootstrap bias runs away; too slow and you train on stale targets
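The last point is what target networks address. A sketch of Polyak (soft) target updates, with an illustrative τ and placeholder parameter shapes:

```python
import numpy as np

def polyak_update(target_params, online_params, tau=0.005):
    """Soft target update: theta_target <- (1 - tau)*theta_target + tau*theta_online.
    Small tau gives a slow-moving target that damps the bootstrap feedback loop;
    tau = 1 recovers a hard copy every step (fastest-moving, least stable)."""
    return [(1.0 - tau) * t + tau * o for t, o in zip(target_params, online_params)]

online = [np.ones((4, 4))]   # stand-in for the online network's weights
target = [np.zeros((4, 4))]  # target starts from a stale copy
for _ in range(1000):
    target = polyak_update(target, online, tau=0.005)
print(target[0][0, 0])  # has drifted most of the way toward the online weights
```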
Where It Appears
- Q-learning (q-learning/): the Bellman target bootstraps from Q's own estimates → target networks and Polyak averaging provide a slower-moving, more stable target
- Policy gradient (policy-gradient/): when using a value function baseline, the advantage estimate bootstraps from the value function → GAE with λ=1 eliminates bootstrap bias entirely (at the cost of high variance)
- Diffusion (diffusion/): not directly present; diffusion avoids bootstrapping by using a fixed noise schedule as the target, which is one reason it's more stable than RL-based generative methods
- TD learning (atomic-concepts/rl-specific/temporal-difference-learning.md): the foundational concept: TD(0) has maximum bootstrap bias, and TD(λ) with λ→1 reduces it toward zero
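The λ knob mentioned in the policy-gradient bullet can be sketched directly. The recursion below follows the standard GAE definition; the reward and value numbers are illustrative:

```python
import numpy as np

def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalised Advantage Estimation. lam=0 is the one-step TD residual
    (maximum bootstrap bias); lam=1 telescopes into the Monte Carlo
    advantage (no bootstrap bias, high variance). `values` must have one
    more entry than `rewards` (the bootstrap value for the final state)."""
    adv = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD residual
        running = delta + gamma * lam * running
        adv[t] = running
    return adv

rewards = [0.0, 0.0, 1.0]
values = [0.5, 0.5, 0.5, 0.0]  # final 0.0 = terminal bootstrap value
print(gae(rewards, values, lam=1.0))  # discounted return minus V(s_t) at each t
```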
Solutions at a Glance
| Solution | Mechanism | Where documented |
|---|---|---|
| Target networks | Use a frozen or slow-moving copy of the model for computing targets | atomic-concepts/rl-specific/polyak-averaging.md |
| GAE (high λ) | Blend TD targets with Monte Carlo returns — higher λ reduces bootstrap reliance | atomic-concepts/rl-specific/generalised-advantage-estimation.md |
| n-step returns | Use actual rewards for n steps before bootstrapping — reduces bias at cost of variance | atomic-concepts/rl-specific/temporal-difference-learning.md |
| Monte Carlo returns | No bootstrapping at all (λ=1) — unbiased but high variance | policy-gradient/ |
| Double Q-learning | Decouples selection from evaluation, reducing bias amplification | q-learning/ |
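To make the n-step row concrete, here is a sketch of an n-step target with made-up rewards and a made-up bootstrap value:

```python
def n_step_target(rewards, v_bootstrap, gamma=0.99):
    """n-step target: G = r_0 + γ·r_1 + ... + γ^(n-1)·r_{n-1} + γ^n·V(s_n).
    n = 1 is TD(0) (maximum bootstrap bias); as n grows to cover the whole
    episode the bootstrap term vanishes and this becomes a Monte Carlo return."""
    g = (gamma ** len(rewards)) * v_bootstrap  # single bootstrap, discounted n steps
    for i, r in enumerate(rewards):
        g += (gamma ** i) * r
    return g

# 3-step target: three real rewards, then one bootstrapped value estimate
print(n_step_target([1.0, 0.0, 1.0], v_bootstrap=2.0, gamma=0.9))  # ≈ 3.268
```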
Historical Context
Bootstrapping in RL was introduced by Sutton (1988) with TD(λ), building on earlier ideas from Samuel's checkers player (1959). The bias was understood theoretically from the start: Sutton proved convergence only for the tabular case. Tsitsiklis & Van Roy (1997) showed that combining bootstrapping with function approximation could diverge, and Baird (1995) gave concrete counterexamples. The "deadly triad" framing (bootstrapping + function approximation + off-policy learning) was popularised by Sutton & Barto's textbook. Despite the bias, bootstrapping remains essential in practice because the alternative (Monte Carlo) has impractically high variance for most problems. Modern methods don't eliminate the bias; they manage it through target networks, n-step returns, and conservative value estimation.