Off-Policy Extrapolation Error
Querying a learned value function on state-action pairs outside its training distribution yields unreliable, often wildly overestimated values. The model confidently predicts Q-values for actions it has never seen — and those predictions are wrong in a systematically dangerous direction (too high), because maximisation bias amplifies the errors.
Intuition
Imagine asking a food critic who has only reviewed Italian restaurants to rate a Thai restaurant. They have no basis for an accurate rating, but they’ll produce a number anyway — and that number could be anything. Now imagine a policy that chooses restaurants based on these ratings: it’ll happily pick the Thai restaurant if the critic’s made-up number is high enough, even though the rating is meaningless.
That’s extrapolation error. A Q-function trained on data from a specific behaviour policy (the “Italian restaurants”) is asked to evaluate actions from a different policy (the “Thai restaurants”). Neural networks don’t output “I don’t know” — they extrapolate, and the extrapolated values have no reason to be accurate. Worse, the max operator in Q-learning preferentially selects the actions with the highest extrapolated values, which are the ones most likely to be overestimated.
This is especially devastating in offline RL, where the agent must learn from a fixed dataset and can never collect new data to correct its mistakes. The agent learns that some unseen action has a high Q-value, takes it, gets a bad outcome — but there’s no new data coming in to correct the overestimate.
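The interaction between noisy estimates and the max operator can be seen in a few lines of NumPy. In this toy sketch (not from the source, purely illustrative), every action has a true Q-value of zero and every estimate is individually unbiased, but rarely-seen actions get noisier estimates — and taking the max over actions still produces a systematic overestimate:

```python
import numpy as np

rng = np.random.default_rng(0)

# True Q-values for 10 actions: all exactly 0, so the true max is 0.
true_q = np.zeros(10)

# A learned critic's estimates are truth plus noise; out-of-distribution
# actions effectively get larger noise (less data, less reliable estimates).
noise_scale = np.linspace(0.1, 2.0, 10)  # action 9 is "furthest" from the data

max_estimates = []
picked_noisy = 0
for _ in range(10_000):
    q_hat = true_q + rng.normal(0.0, noise_scale)
    max_estimates.append(q_hat.max())          # Q-learning's max over actions
    picked_noisy += q_hat.argmax() >= 7        # argmax lands on a noisy action?

# Every individual estimate is unbiased, yet the max is biased upward,
# and the argmax overwhelmingly selects the least-reliable actions.
print("mean max estimate:", np.mean(max_estimates))   # clearly above the true max of 0
print("argmax in noisiest 3 actions:", picked_noisy / 10_000)
```

This is exactly the "systematically dangerous direction" above: the noisier an action's estimate, the more likely it is to win the max, so extrapolation error and maximisation bias compound.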
Manifestation
- Q-values for out-of-distribution actions are unreasonably high — the model outputs values far exceeding the maximum possible return
- The learned policy strongly prefers actions rarely seen in the data — precisely the actions with the least reliable Q-estimates
- Performance degrades as the learned policy diverges from the data-collection policy — the further off-policy, the worse the extrapolation
- In offline RL: the agent performs much worse than the behaviour policy that collected the data, despite having higher estimated Q-values
Where It Appears
- Q-learning (q-learning/): CQL explicitly penalises Q-values on out-of-distribution actions to counteract extrapolation; IQL avoids querying unseen actions entirely by learning from in-sample maxima
- Q-learning (q-learning/): even in online settings, the replay buffer creates mild off-policy-ness — target networks help by slowing down the feedback loop
- Policy gradient (policy-gradient/): PPO’s clipping mechanism limits how far the new policy can deviate from the old one, indirectly limiting extrapolation
- Contrastive learning (contrastive-self-supervising/): not directly applicable, but the general principle — neural networks extrapolate unreliably — motivates augmentation strategies that stay close to the training distribution
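The PPO clipping mentioned above can be sketched in a few lines. This is a hedged illustration of the clipped surrogate objective, not an implementation from the source; the function name and signature are my own:

```python
import numpy as np

def ppo_clip_objective(ratio, advantage, eps=0.2):
    """PPO's clipped surrogate (sketch). `ratio` is pi_new(a|s) / pi_old(a|s).
    Clipping the ratio to [1 - eps, 1 + eps] removes the gradient incentive to
    push the new policy far from the data-collecting policy."""
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return np.minimum(unclipped, clipped)  # pessimistic choice of the two

# A large ratio (the new policy strongly prefers an action the old policy
# rarely took) is capped at 1 + eps: no reward for moving further off-distribution.
print(ppo_clip_objective(np.array([3.0]), np.array([1.0])))  # -> [1.2]
```

The connection to extrapolation error is indirect but real: by bounding how far each update can move the policy, clipping keeps the states and actions the critic is asked about close to those it was trained on.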
Solutions at a Glance
| Solution | Mechanism | Where documented |
|---|---|---|
| CQL (Conservative Q-Learning) | Adds a regulariser that pushes down Q-values on all actions, especially unseen ones | q-learning/ |
| IQL (Implicit Q-Learning) | Learns Q from in-sample maxima only — never queries the Q-function on unseen actions | q-learning/ |
| Policy constraint (BCQ, TD3+BC) | Constrains the learned policy to stay close to the behaviour policy | (offline RL methods) |
| PPO clipping | Limits the policy update magnitude, keeping the new policy close to the old one | policy-gradient/ |
| Importance sampling | Re-weights off-policy data to correct for distribution mismatch | atomic-concepts/mathematical-tricks/importance-sampling.md |
| Replay buffers | In online settings, keeps recent on-policy data available to limit staleness | atomic-concepts/rl-specific/replay-buffers.md |
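To make the CQL row concrete, here is a minimal sketch of its conservative regulariser, assuming a discrete action space (the function name and the per-state form are illustrative, not the paper's full objective):

```python
import numpy as np

def cql_penalty(q_values, data_action, alpha=1.0):
    """CQL-style conservative regulariser for one state (sketch).
    Pushes down a soft maximum (logsumexp) over Q for *all* actions while
    pushing up Q on the action actually present in the dataset. Unseen
    actions contribute to the logsumexp but never to the data term, so
    they are penalised the most."""
    m = q_values.max()                                   # stabilise the exp
    soft_max = m + np.log(np.sum(np.exp(q_values - m)))  # logsumexp over actions
    return alpha * (soft_max - q_values[data_action])

# An out-of-distribution action with an inflated Q-value dominates the soft
# maximum, so the penalty grows with the size of the overestimate.
q = np.array([1.0, 1.2, 5.0])  # action 2 is unseen but wildly overestimated
print(cql_penalty(q, data_action=0))
```

Minimising this alongside the usual TD loss is what "pushes down Q-values on all actions, especially unseen ones": the gradient through the logsumexp falls hardest on whichever action currently has the highest (often extrapolated) value.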
Historical Context
Extrapolation error was understood informally in RL for decades, but Fujimoto, Meger & Precup (2019) in the BCQ paper gave it a clear articulation and name in the context of offline (batch) RL. They showed that standard off-policy algorithms (DQN, DDPG) fail catastrophically when trained on fixed datasets, even when the data comes from a good policy. Kumar et al. (2020) with CQL and Kostrikov et al. (2022) with IQL provided two influential solutions with different philosophies — CQL pessimistically penalises all Q-values, while IQL avoids the problem entirely by never evaluating out-of-distribution actions. The problem remains the central challenge in offline RL.