
Unified Q-Learning

A single skeleton that covers Q-Learning, DQN, Double DQN, Dueling DQN, CQL, IQL, SAC (Q-critic), and more.

The core Q-learning loop is identical across all variants:

  • q_a = Q(s).gather(a) — evaluate the current Q-values
  • targets = compute_target(batch) — build the bootstrap target (PLUGGABLE)
  • loss = compute_loss(q_a, targets) — compute the objective (PLUGGABLE)
  • optimizer step
  • target-network update (hard copy or Polyak averaging)

The data regime only changes where batches come from:

  • Online → interact with the environment, fill a replay buffer, sample batches
  • Offline → just sample batches from a fixed dataset

Only three things vary across variants:

  1. compute_target() — how the bootstrap target y is built
  2. compute_loss() — the objective (MSE, Huber, + regularizers)
  3. Data source — online replay buffer vs. offline dataset
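The skeleton above can be sketched as a small class with the two pluggable methods. This is a minimal tabular/NumPy stand-in (class and method names are illustrative, not taken from `unified-q-learning.py`); the real implementations use a neural network and an optimizer in place of the direct table update.

```python
import numpy as np

class QLearner:
    """Shared skeleton: subclass and override the PLUGGABLE methods."""

    def __init__(self, n_states, n_actions, gamma=0.99, lr=0.5):
        self.Q = np.zeros((n_states, n_actions))      # online network
        self.Q_tgt = np.zeros((n_states, n_actions))  # frozen target network
        self.gamma, self.lr = gamma, lr

    def compute_target(self, batch):
        # PLUGGABLE: vanilla DQN target r + gamma * max_a' Q_tgt(s', a')
        s2, r, done = batch["s2"], batch["r"], batch["done"]
        return r + self.gamma * (1 - done) * self.Q_tgt[s2].max(axis=1)

    def compute_loss(self, q_a, targets):
        # PLUGGABLE: plain MSE objective
        return np.mean((q_a - targets) ** 2)

    def update(self, batch):
        s, a = batch["s"], batch["a"]
        q_a = self.Q[s, a]                          # q_a = Q(s).gather(a)
        targets = self.compute_target(batch)        # build target
        loss = self.compute_loss(q_a, targets)      # compute loss
        self.Q[s, a] += self.lr * (targets - q_a)   # "optimizer step" (tabular)
        return loss

    def sync_target(self, tau=1.0):
        # target update: hard copy (tau=1) or Polyak averaging (tau < 1)
        self.Q_tgt = tau * self.Q + (1 - tau) * self.Q_tgt
```

Swapping variants then means overriding `compute_target` and/or `compute_loss` while `update` stays untouched.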
| Variant | compute_target() | compute_loss() extra |
| --- | --- | --- |
| DQN | r + γ max Q_tgt(s’, ·) | MSE |
| Double DQN | r + γ Q_tgt(s’, argmax Q(s’, ·)) | MSE |
| CQL | r + γ max Q_tgt(s’, ·) | MSE + α·CQL penalty |
| IQL | r + γ V(s’) | MSE (+ V expectile) |
| SAC / SoftQ | r + γ (Q_tgt(s’, a’) − α log π) | MSE |
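The DQN and Double DQN rows differ only in who picks the bootstrap action. A small NumPy sketch (hypothetical function names) makes the decoupling explicit:

```python
import numpy as np

def dqn_target(r, gamma, q_tgt_s2, done):
    # r + gamma * max_a' Q_tgt(s', a'):
    # the target net both SELECTS and EVALUATES the action -> maximisation bias
    return r + gamma * (1 - done) * q_tgt_s2.max(axis=1)

def double_dqn_target(r, gamma, q_s2, q_tgt_s2, done):
    # Double DQN: the online net selects, the target net evaluates
    a_star = q_s2.argmax(axis=1)
    return r + gamma * (1 - done) * q_tgt_s2[np.arange(len(a_star)), a_star]
```

When the two networks disagree on the best action, Double DQN bootstraps from a smaller, less inflated value.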
| Variant | compute_target() | compute_loss() extra |
| --- | --- | --- |
| Dueling DQN | r + γ max Q_tgt(s’, ·) | MSE (but Q = V(s) + A(s,a) − mean A) |
| Distributional (C51) | Projected distributional Bellman target | Cross-entropy over atom probabilities |
| QR-DQN | r + γ quantiles of Q_tgt(s’, ·) | Quantile Huber loss |
| IQN | r + γ sampled quantiles of Q_tgt(s’, ·) | Quantile Huber (implicit quantile sampling) |
| Rainbow | Combines Double + Dueling + Distributional + PER + noisy nets + n-step | Cross-entropy (distributional) |
| N-step DQN | Σ γⁱrᵢ + γⁿ max Q_tgt(sₙ, ·) | MSE |
| Prioritized (PER) | Same as base (DQN/Double/etc.) | MSE, but samples weighted by TD-error priority |
| Noisy DQN | Same as base | MSE (parametric noise in network weights replaces ε-greedy) |
| HER | r + γ max Q_tgt(s’‖g, ·) | MSE (replays with relabeled goals) |
| BCQ | r + γ max_{a∈filtered} Q_tgt(s’, a) | MSE + generative model constrains actions to data support |
| TD3 (continuous) | r + γ min(Q_tgt1, Q_tgt2)(s’, π_tgt(s’) + clip noise) | MSE (clipped double Q, delayed policy update) |
| REDQ | r + γ min over M random subset of N Q-targets | MSE (high update-to-data ratio) |
| Munchausen RL | (r + α log π(a\|s)) + γ (Q_tgt − α log π) | MSE (adds scaled log-policy to reward) |
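The N-step row is the simplest target change in the table: replace the 1-step bootstrap with a discounted reward sum plus an n-step bootstrap. A minimal sketch (hypothetical function name):

```python
def n_step_target(rewards, gamma, q_tgt_bootstrap):
    """Sum_{i=0}^{n-1} gamma^i * r_i  +  gamma^n * max_a Q_tgt(s_n, a).

    `rewards` is the list [r_0, ..., r_{n-1}] along the sampled trajectory;
    `q_tgt_bootstrap` is the max target-Q at the n-th state (0 if terminal).
    """
    n = len(rewards)
    discounted = sum(gamma ** i * r for i, r in enumerate(rewards))
    return discounted + gamma ** n * q_tgt_bootstrap
```

With n = 1 this collapses to the standard DQN target, which is why n-step slots into the same skeleton.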

Key groupings:

  • Representation: Dueling (separate V/A streams)
  • Distributional: C51, QR-DQN, IQN (model return distribution, not just mean)
  • Sampling/training tricks: PER, N-step, Noisy nets, Rainbow (combines many)
  • Offline/batch RL: BCQ (constrains to data support), pairs with CQL/IQL above
  • Continuous action: TD3 (extends DDPG with clipped double-Q), REDQ
  • Goal-conditioned: HER (relabels goals in hindsight)
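For the representation grouping, the dueling aggregation mentioned in the table (Q = V(s) + A(s,a) − mean A) is a one-liner. A NumPy sketch (hypothetical function name; in the network this sits after two separate output heads):

```python
import numpy as np

def dueling_q(v, a):
    """Combine value and advantage streams into Q-values.

    v: (batch,) state values; a: (batch, n_actions) advantages.
    Subtracting the per-state mean advantage makes the V/A split
    identifiable (otherwise any constant could shift between them).
    """
    return v[:, None] + a - a.mean(axis=1, keepdims=True)
```

The resulting Q-values feed into the standard loop unchanged, which is why dueling only alters the network, not the targets.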
| Variant | Problem Solved | Intuition for Solution |
| --- | --- | --- |
| DQN | Q-learning fails with high-dim state spaces (e.g. pixels) and is unstable with function approximation | Use a neural net as Q, plus a frozen target network and replay buffer to decorrelate updates and stabilise training |
| Double DQN | Maximisation bias: the same net both selects and evaluates the best action, causing systematic Q overestimation | Decouple action selection (online net) from action evaluation (target net) so noise in Q doesn’t consistently inflate the target |
| CQL | Offline RL: Q-values on out-of-distribution (OOD) actions are unconstrained and drift to delusional overestimates | Add a regulariser that pushes Q down on all actions (via logsumexp) and up on dataset actions, producing a learned lower bound that is conservative on OOD |
| IQL | Offline RL: even querying max_a Q(s’,a’) during training requires evaluating Q on unseen actions, enabling OOD extrapolation errors | Avoid the max entirely — learn V(s) via expectile regression on dataset (s,a) pairs, which approximates an in-sample max without ever evaluating Q on actions not in the data |
| SAC / SoftQ | Pure reward maximisation leads to brittle, deterministic policies that exploit narrow peaks and explore poorly | Augment the target with an entropy bonus (−α log π), so the agent maximises reward AND action entropy jointly, encouraging robust, multi-modal policies |
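The CQL regulariser described in the table is compact enough to sketch directly. A NumPy version for discrete actions (hypothetical function name; the full loss adds this to the TD error with weight α):

```python
import numpy as np

def cql_penalty(q_s, a_data):
    """Conservative Q-learning penalty for discrete actions.

    q_s: (batch, n_actions) Q-values; a_data: (batch,) dataset actions.
    logsumexp over all actions pushes every Q down; subtracting Q on the
    dataset action pushes in-distribution values back up, so only OOD
    actions end up suppressed.
    """
    lse = np.log(np.exp(q_s).sum(axis=1))              # logsumexp_a Q(s, a)
    q_data = q_s[np.arange(len(a_data)), a_data]       # Q(s, a_data)
    return (lse - q_data).mean()
```

Since logsumexp upper-bounds the max, the penalty is always non-negative and vanishes only when the dataset action dominates all others.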
| Variant | Problem Solved | Intuition for Solution |
| --- | --- | --- |
| Dueling DQN | In many states, the value of being there matters more than which action you pick, but standard DQN entangles state-value and action-advantage | Split Q into V(s) + A(s,a) streams so the network can learn “this state is good/bad” without needing to experience every action in it |
| Distributional (C51) | A single expected Q-value discards information about risk and multi-modal outcomes | Model the full distribution of returns as a categorical over fixed atoms; richer gradients and risk-awareness improve learning |
| QR-DQN | C51 requires a fixed support range and atom count chosen a priori, which can clip or waste resolution | Learn quantile locations directly — no fixed support needed — and use quantile Huber loss for asymmetric regression on each quantile |
| IQN | QR-DQN fixes the number of quantiles, limiting resolution of the return distribution | Sample quantile fractions from U(0,1) at each forward pass, implicitly representing the full continuous distribution |
| Rainbow | Each DQN improvement helps independently, but they address orthogonal failure modes | Combine six complementary ideas (Double, Dueling, Distributional, PER, Noisy Nets, N-step) — gains compound because they fix different bottlenecks |
| N-step DQN | 1-step bootstrapping propagates reward information slowly, especially in sparse-reward tasks | Use n-step returns so reward signal reaches earlier states faster, at the cost of some bias from the longer bootstrap horizon |
| Prioritized (PER) | Uniform replay wastes most updates on already-learned transitions | Sample transitions proportional to their TD-error so the network focuses on surprising or poorly-predicted experiences |
| Noisy DQN | ε-greedy explores uniformly across actions regardless of state, which is inefficient and state-agnostic | Replace ε-greedy with learnable noise injected into network weights, so exploration is state-dependent and fades automatically as the network becomes confident |
| HER | In sparse-reward goal-conditioned tasks, the agent almost never reaches the goal, so it gets no learning signal | After an episode, relabel the trajectory with goals the agent actually reached — every trajectory becomes a success story for some goal, providing dense reward signal |
| BCQ | Offline RL with unconstrained action selection queries Q on actions never seen in the dataset, causing extrapolation errors | Train a generative model on the dataset’s action distribution and only allow the policy to choose from actions the generative model considers plausible |
| TD3 (continuous) | DDPG suffers from Q overestimation, brittle convergence, and sensitivity to hyperparameters in continuous control | Take the min of two Q-networks (clipped double-Q), add noise to target actions, and delay policy updates — all reduce overestimation and stabilise training |
| REDQ | High sample efficiency requires many gradient updates per environment step, but this amplifies Q overestimation | Maintain a large ensemble of Q-networks and take the min over a random subset for each target, allowing high update-to-data ratios without runaway overestimation |
| Munchausen RL | Standard bootstrapping treats all transitions equally regardless of how likely the policy was to take that action | Add the agent’s own log-policy to the reward (a “self-referential” bonus), which implicitly performs a form of KL regularisation and consistently improves DQN and SAC baselines with one line of code |
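The clipped double-Q trick shared by TD3 and REDQ fits the same `compute_target` slot. A minimal NumPy sketch (hypothetical function name; target-action noise and the ensemble subsetting of REDQ are omitted for brevity):

```python
import numpy as np

def clipped_double_q_target(r, gamma, q1_tgt, q2_tgt, done):
    """TD3-style target: bootstrap from the pessimistic of two critics.

    q1_tgt, q2_tgt: (batch,) values of the two target critics evaluated
    at the (noisy) target-policy action. Taking the elementwise min
    counteracts the overestimation that a single critic accumulates.
    """
    return r + gamma * (1 - done) * np.minimum(q1_tgt, q2_tgt)
```

REDQ generalises this by taking the min over a random subset of a larger critic ensemble, which is what makes its high update-to-data ratios tolerable.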
  • unified-q-learning.py — all variant implementations in a single file, showing how they share one core loop