Bootstrapping Bias
Using your own predictions as regression targets introduces circular bias: you're training the model to match its own (wrong) estimates. The error in the target propagates into the update, which changes the target, which changes the update. Unlike supervised learning, where targets are fixed ground truth, bootstrapped targets move with the model.
Intuition
Imagine a student who grades their own exams. They confidently mark wrong answers as correct, study those wrong answers, and become even more confident in them. That's bootstrapping bias: the model's errors feed back into its training signal, and there's no external correction.
In temporal difference (TD) learning, the target for V(s_t) is r_t + γV(s_{t+1}): it includes the model's own estimate of the next state's value. If V(s_{t+1}) is wrong, the target is wrong, and the update makes V(s_t) wrong in a way that's correlated with the original error. This is fundamentally different from supervised learning, where the labels don't change when you update the model.
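A minimal tabular sketch of this moving target, assuming a toy two-state chain with illustrative step sizes (not taken from any linked page):

```python
import numpy as np

def td0_update(V, s, r, s_next, done, alpha=0.1, gamma=0.99):
    """One tabular TD(0) update. The target bootstraps from V itself,
    so any error in V[s_next] leaks directly into V[s]."""
    target = r if done else r + gamma * V[s_next]  # bootstrapped target
    V[s] += alpha * (target - V[s])
    return V

# Toy chain: s0 -> s1 -> terminal, reward 1.0 on the final step.
V = np.zeros(2)
for _ in range(500):
    V = td0_update(V, 0, 0.0, 1, done=False)
    V = td0_update(V, 1, 1.0, 1, done=True)
print(V)  # V[1] approaches 1.0, V[0] approaches gamma * V[1]
```

Note that the target for s0 keeps moving as V[1] is learned: early updates for s0 chase a target that is still wrong.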
The bias doesn't necessarily cause divergence; TD learning has convergence guarantees in the tabular case. But with function approximation (neural networks), the bias combines with generalisation to create a dangerous feedback loop: updating V toward a biased target also changes V at nearby states (because the network generalises), which changes their targets, which changes nearby states' values, and so on. This is one leg of the "deadly triad" (bootstrapping + function approximation + off-policy learning).
Manifestation
- Q-values drift from true values in a correlated way: entire regions of state space become systematically over- or under-estimated
- Training is less stable than equivalent supervised learning — loss oscillates or diverges where a fixed-target version would converge
- Errors propagate backward through the state space: if Q is wrong at the terminal state, the bias spreads to all states that can reach it
- Performance is sensitive to target update frequency — update the target network too fast and bootstrap bias runs away; too slow and you train on stale targets
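The last point is what target networks address. A sketch of Polyak (soft) target updates, with an illustrative τ and placeholder parameter shapes:

```python
import numpy as np

def polyak_update(target_params, online_params, tau=0.005):
    """Soft target update: theta_target <- (1 - tau)*theta_target + tau*theta_online.
    Small tau gives a slow-moving target that damps the bootstrap feedback loop;
    tau = 1 recovers a hard copy every step (fastest-moving, least stable)."""
    return [(1.0 - tau) * t + tau * o for t, o in zip(target_params, online_params)]

online = [np.ones((4, 4))]   # stand-in for the online network's weights
target = [np.zeros((4, 4))]  # target starts from a stale copy
for _ in range(1000):
    target = polyak_update(target, online, tau=0.005)
print(target[0][0, 0])  # has drifted most of the way toward the online weights
```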
Where It Appears
- Q-learning (q-learning/): the Bellman target bootstraps from Q's own estimates → target networks and Polyak averaging provide a slower-moving, more stable target
- Policy gradient (policy-gradient/): when using a value function baseline, the advantage estimate bootstraps from the value function → GAE with λ=1 eliminates bootstrap bias entirely (at the cost of high variance)
- Diffusion (diffusion/): not directly present; diffusion avoids bootstrapping by using a fixed noise schedule as the target, which is one reason it's more stable than RL-based generative methods
- TD learning (atomic-concepts/rl-specific/temporal-difference-learning.md): the foundational concept: TD(0) has maximum bootstrap bias, and TD(λ) with λ→1 reduces it toward zero
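The λ knob mentioned in the policy-gradient bullet can be sketched directly. The recursion below follows the standard GAE definition; the reward and value numbers are illustrative:

```python
import numpy as np

def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalised Advantage Estimation. lam=0 is the one-step TD residual
    (maximum bootstrap bias); lam=1 telescopes into the Monte Carlo
    advantage (no bootstrap bias, high variance). `values` must have one
    more entry than `rewards` (the bootstrap value for the final state)."""
    adv = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD residual
        running = delta + gamma * lam * running
        adv[t] = running
    return adv

rewards = [0.0, 0.0, 1.0]
values = [0.5, 0.5, 0.5, 0.0]  # final 0.0 = terminal bootstrap value
print(gae(rewards, values, lam=1.0))  # discounted return minus V(s_t) at each t
```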
Solutions at a Glance
| Solution | Mechanism | Where documented |
|---|---|---|
| Target networks | Use a frozen or slow-moving copy of the model for computing targets | atomic-concepts/rl-specific/polyak-averaging.md |
| GAE (high λ) | Blend TD targets with Monte Carlo returns — higher λ reduces bootstrap reliance | atomic-concepts/rl-specific/generalised-advantage-estimation.md |
| n-step returns | Use actual rewards for n steps before bootstrapping — reduces bias at cost of variance | atomic-concepts/rl-specific/temporal-difference-learning.md |
| Monte Carlo returns | No bootstrapping at all (λ=1) — unbiased but high variance | policy-gradient/ |
| Double Q-learning | Decouples selection from evaluation, reducing bias amplification | q-learning/ |
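To make the n-step row concrete, here is a sketch of an n-step target with made-up rewards and a made-up bootstrap value:

```python
def n_step_target(rewards, v_bootstrap, gamma=0.99):
    """n-step target: G = r_0 + γ·r_1 + ... + γ^(n-1)·r_{n-1} + γ^n·V(s_n).
    n = 1 is TD(0) (maximum bootstrap bias); as n grows to cover the whole
    episode the bootstrap term vanishes and this becomes a Monte Carlo return."""
    g = (gamma ** len(rewards)) * v_bootstrap  # single bootstrap, discounted n steps
    for i, r in enumerate(rewards):
        g += (gamma ** i) * r
    return g

# 3-step target: three real rewards, then one bootstrapped value estimate
print(n_step_target([1.0, 0.0, 1.0], v_bootstrap=2.0, gamma=0.9))  # ≈ 3.268
```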
Historical Context
Bootstrapping in RL was introduced by Sutton (1988) with TD(λ), building on earlier ideas from Samuel's checkers player (1959). The bias was understood theoretically from the start: Sutton proved convergence only for the tabular case. Tsitsiklis & Van Roy (1997) showed that combining bootstrapping with function approximation could diverge, and Baird (1995) gave concrete counterexamples. The "deadly triad" framing (bootstrapping + function approximation + off-policy learning) was popularised by Sutton & Barto's textbook. Despite the bias, bootstrapping remains essential in practice because the alternative (Monte Carlo) has impractically high variance for most problems. Modern methods don't eliminate the bias; they manage it through target networks, n-step returns, and conservative value estimation.