Off-Policy Extrapolation Error

Querying a learned value function on state-action pairs outside its training distribution yields unreliable, often wildly overestimated values. The model confidently predicts Q-values for actions it has never seen — and those predictions are wrong in a systematically dangerous direction (too high), because maximisation bias amplifies the errors.

Imagine asking a food critic who has only reviewed Italian restaurants to rate a Thai restaurant. They have no basis for an accurate rating, but they’ll produce a number anyway — and that number could be anything. Now imagine a policy that chooses restaurants based on these ratings: it’ll happily pick the Thai restaurant if the critic’s made-up number is high enough, even though the rating is meaningless.

That’s extrapolation error. A Q-function trained on data from a specific behaviour policy (the “Italian restaurants”) is asked to evaluate actions from a different policy (the “Thai restaurants”). Neural networks don’t output “I don’t know” — they extrapolate, and the extrapolated values have no reason to be accurate. Worse, the max operator in Q-learning preferentially selects the actions with the highest extrapolated values, which are the ones most likely to be overestimated.

This is especially devastating in offline RL, where the agent must learn from a fixed dataset and can never collect new data to correct its mistakes. The agent learns that some unseen action has a high Q-value, takes it, gets a bad outcome — but there’s no new data coming in to correct the overestimate.
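The maximisation-bias half of the story can be seen with nothing but noise. In the toy simulation below (illustrative, not from the source), every action’s true Q-value is zero, yet taking the max over noisy estimates lands far above zero:

```python
import random

random.seed(0)

n_actions = 10      # actions the critic has never seen
n_trials = 10_000

# True Q-value of every action is 0; the learned estimates are just
# zero-mean noise (pure extrapolation, no signal).
overestimates = []
for _ in range(n_trials):
    estimates = [random.gauss(0.0, 1.0) for _ in range(n_actions)]
    overestimates.append(max(estimates))   # Q-learning's max operator

bias = sum(overestimates) / n_trials
print(f"mean of max over noisy estimates: {bias:.2f}")  # ~1.5, not 0
```

Even though no single estimate is biased, selecting the maximum is: the action that looks best is usually the one whose noise happened to be largest.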

Typical symptoms:

  • Q-values for out-of-distribution actions are unreasonably high — the model outputs values far exceeding the maximum possible return
  • The learned policy strongly prefers actions rarely seen in the data — precisely the actions with the least reliable Q-estimates
  • Performance degrades as the learned policy diverges from the data-collection policy — the further off-policy, the worse the extrapolation
  • In offline RL: the agent performs much worse than the behaviour policy that collected the data, despite having higher estimated Q-values
Related concepts:

  • Q-learning (q-learning/): CQL explicitly penalises Q-values on out-of-distribution actions to counteract extrapolation; IQL avoids querying unseen actions entirely by learning from in-sample maxima
  • Q-learning (q-learning/): even in online settings, the replay buffer creates mild off-policy-ness — target networks help by slowing down the feedback loop
  • Policy gradient (policy-gradient/): PPO’s clipping mechanism limits how far the new policy can deviate from the old one, indirectly limiting extrapolation
  • Contrastive learning (contrastive-self-supervising/): not directly applicable, but the general principle — neural networks extrapolate unreliably — motivates augmentation strategies that stay close to the training distribution
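CQL’s penalty can be sketched in a few lines. Below is a minimal single-state, discrete-action version (the function name `cql_penalty` is mine, not from any library):

```python
import math

def cql_penalty(q_values, data_action):
    """CQL-style regulariser for one state: log-sum-exp over all
    actions (pushed down) minus the Q-value of the action actually
    seen in the dataset (pushed back up)."""
    lse = math.log(sum(math.exp(q) for q in q_values))
    return lse - q_values[data_action]

# An out-of-distribution action with an inflated Q-value makes the
# penalty large; if the dataset action is already the max, the
# penalty is near zero.
q = [1.0, 5.0, 0.0]                   # action 1 has a suspiciously high Q
print(cql_penalty(q, data_action=1))  # small: the max is in-sample
print(cql_penalty(q, data_action=2))  # large: pushes the stray Q down
```

The log-sum-exp acts as a soft max over all actions, so the penalty is only small when the largest Q-value belongs to an action the dataset actually contains.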
| Solution | Mechanism | Where documented |
| --- | --- | --- |
| CQL (Conservative Q-Learning) | Adds a regulariser that pushes down Q-values on all actions, especially unseen ones | q-learning/ |
| IQL (Implicit Q-Learning) | Learns Q from in-sample maxima only — never queries the Q-function on unseen actions | q-learning/ |
| Policy constraint (BCQ, TD3+BC) | Constrains the learned policy to stay close to the behaviour policy | (offline RL methods) |
| PPO clipping | Limits the policy update magnitude, keeping the new policy close to the old one | policy-gradient/ |
| Importance sampling | Re-weights off-policy data to correct for distribution mismatch | atomic-concepts/mathematical-tricks/importance-sampling.md |
| Replay buffers | In online settings, keeps recent on-policy data available to limit staleness | atomic-concepts/rl-specific/replay-buffers.md |
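IQL’s in-sample trick works via expectile regression: an asymmetric squared loss that, as τ approaches 1, makes the fitted value approach the maximum over actions that actually appear in the data. A hedged sketch of the loss (illustrative, not IQL’s actual implementation):

```python
def expectile_loss(diff, tau=0.9):
    """Asymmetric squared error: under-predictions (diff > 0) are
    weighted by tau, over-predictions by 1 - tau. For tau near 1 the
    minimiser approaches the in-sample maximum, so the Q-function is
    never evaluated on unseen actions."""
    weight = tau if diff > 0 else 1.0 - tau
    return weight * diff ** 2

# Under-predicting a return costs 9x more than over-predicting it,
# pulling the value estimate toward the high end of observed returns.
print(expectile_loss(1.0))   # 0.9
print(expectile_loss(-1.0))  # ~0.1
```

Because the target is an expectile of returns seen in the dataset, there is no max over arbitrary actions and hence nothing to extrapolate.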

Extrapolation error was understood informally in RL for decades, but Fujimoto, Meger & Precup (2019) in the BCQ paper gave it a clear articulation and name in the context of offline (batch) RL. They showed that standard off-policy algorithms (DQN, DDPG) fail catastrophically when trained on fixed datasets, even when the data comes from a good policy. Kumar et al. (2020) with CQL and Kostrikov et al. (2022) with IQL provided two influential solutions with different philosophies — CQL pessimistically penalises all Q-values, while IQL avoids the problem entirely by never evaluating out-of-distribution actions. The problem remains the central challenge in offline RL.