Unified Q-Learning
A single skeleton that covers Q-Learning, DQN, Double DQN, Dueling DQN, CQL, IQL, SAC (Q-critic), and more.
Key Idea
The core Q-learning loop is identical across all variants.
Always the same (core loop):
- q_a = Q(s).gather(a) — evaluate current Q
- targets = compute_target(batch) — build target (PLUGGABLE)
- loss = compute_loss(q_a, targets) — compute loss (PLUGGABLE)
- optimizer step
- target network update (hard copy or Polyak)
Outer Loop:
- Online → interact with env, fill replay buffer, sample batches
- Offline → just sample batches from a fixed dataset
Three pluggable components change:
- compute_target() — how the bootstrap target y is built
- compute_loss() — the objective (MSE, Huber, + regularizers)
- Data source — online replay buffer vs. offline dataset
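As a rough sketch of that shared loop (tabular and in NumPy for brevity; the names `train_step`, `polyak_update`, and `dqn_target` are illustrative, not taken from the accompanying file):

```python
import numpy as np

def train_step(Q, Q_tgt, batch, compute_target, lr=0.1):
    """One generic update: gather Q(s, a), build a pluggable target, step."""
    s, a, r, s_next, done = batch
    y = compute_target(Q, Q_tgt, r, s_next, done)  # PLUGGABLE: the bootstrap target
    td_error = y - Q[s, a]                         # gradient of the MSE loss (tabular case)
    Q[s, a] += lr * td_error
    return td_error

def polyak_update(Q_tgt, Q, tau=0.005):
    """Soft target update; tau=1.0 recovers a hard copy."""
    Q_tgt[:] = tau * Q + (1.0 - tau) * Q_tgt

# Example plug-in: the vanilla Q-learning / DQN target.
def dqn_target(Q, Q_tgt, r, s_next, done, gamma=0.99):
    return r + gamma * (1.0 - done) * Q_tgt[s_next].max()
```

Swapping `compute_target` (and the loss, in the function-approximation case) is the only change needed to move between the variants below.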
Variants
| Variant | compute_target() | compute_loss() extra |
|---|---|---|
| DQN | r + γ max Q_tgt(s’, ·) | MSE |
| Double DQN | r + γ Q_tgt(s’, argmax Q(s’, ·)) | MSE |
| CQL | r + γ max Q_tgt(s’, ·) | MSE + α·CQL penalty |
| IQL | r + γ V(s’) | MSE (+ V expectile) |
| SAC / SoftQ | r + γ (Q_tgt(s’, a’) − α log π) | MSE |
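The first two rows differ only in which network picks a’. A minimal sketch of both targets (NumPy, tabular; function names are mine):

```python
import numpy as np

GAMMA = 0.99

def dqn_target(r, s_next, done, Q, Q_tgt):
    # DQN: the target net both SELECTS and EVALUATES the next action,
    # so noise in Q_tgt inflates the max (maximisation bias).
    return r + GAMMA * (1.0 - done) * Q_tgt[s_next].max()

def double_dqn_target(r, s_next, done, Q, Q_tgt):
    # Double DQN: online net selects, target net evaluates,
    # decorrelating the selection noise from the evaluation noise.
    a_star = int(Q[s_next].argmax())
    return r + GAMMA * (1.0 - done) * Q_tgt[s_next, a_star]
```

With a target table that happens to overestimate an action the online net does not prefer, the Double DQN target is noticeably lower, which is exactly the bias it was designed to remove.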
Other Variants
| Variant | compute_target() | compute_loss() extra |
|---|---|---|
| Dueling DQN | r + γ max Q_tgt(s’, ·) | MSE (but Q = V(s) + A(s,a) − mean A) |
| Distributional (C51) | Projected distributional Bellman target | Cross-entropy over atom probabilities |
| QR-DQN | r + γ quantiles of Q_tgt(s’, ·) | Quantile Huber loss |
| IQN | r + γ sampled quantiles of Q_tgt(s’, ·) | Quantile Huber (implicit quantile sampling) |
| Rainbow | Combines Double + Dueling + Distributional + PER + noisy nets + n-step | Cross-entropy (distributional) |
| N-step DQN | Σ γⁱrᵢ + γⁿ max Q_tgt(sₙ, ·) | MSE |
| Prioritized (PER) | Same as base (DQN/Double/etc.) | MSE, but samples weighted by TD-error priority |
| Noisy DQN | Same as base | MSE (parametric noise in network weights replaces ε-greedy) |
| HER | r + γ max Q_tgt(s’‖g, ·) | MSE (replays with relabeled goals) |
| BCQ | r + γ max_{a∈filtered} Q_tgt(s’, a) | MSE + generative model constrains actions to data support |
| TD3 (continuous) | r + γ min(Q_tgt1, Q_tgt2)(s’, π_tgt(s’)+clip noise) | MSE (clipped double Q, delayed policy update) |
| REDQ | r + γ min over M random subset of N Q-targets | MSE (high update-to-data ratio) |
| Munchausen RL | (r + α log π(a|s)) + γ (Q_tgt − α log π) | MSE (adds scaled log-policy to reward) |
Key groupings:
- Representation: Dueling (separate V/A streams)
- Distributional: C51, QR-DQN, IQN (model return distribution, not just mean)
- Sampling/training tricks: PER, N-step, Noisy nets, Rainbow (combines many)
- Offline/batch RL: BCQ (constrains to data support), pairs with CQL/IQL above
- Continuous action: TD3 (extends DDPG with clipped double-Q), REDQ
- Goal-conditioned: HER (relabels goals in hindsight)
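The N-step row is the simplest of these to write down: sum discounted rewards along the window, then bootstrap once at the end. A sketch (pure Python; the name `n_step_target` is illustrative):

```python
def n_step_target(rewards, bootstrap_q, done, gamma=0.99):
    """Compute sum_i gamma^i * r_i + gamma^n * max_a Q_tgt(s_n, a).
    The bootstrap term is skipped if the episode terminated inside
    the n-step window."""
    target = 0.0
    for i, r in enumerate(rewards):
        target += (gamma ** i) * r
    if not done:
        target += (gamma ** len(rewards)) * bootstrap_q
    return target
```

The longer window propagates reward faster at the cost of some off-policy bias, as noted in the table below.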
Motives for Each Variant
| Variant | Problem Solved | Intuition for Solution |
|---|---|---|
| DQN | Q-learning fails with high-dim state spaces (e.g. pixels) and is unstable with function approximation | Use a neural net as Q, plus a frozen target network and replay buffer to decorrelate updates and stabilise training |
| Double DQN | Maximisation bias: the same net both selects and evaluates the best action, causing systematic Q overestimation | Decouple action selection (online net) from action evaluation (target net) so noise in Q doesn’t consistently inflate the target |
| CQL | Offline RL: Q-values on out-of-distribution (OOD) actions are unconstrained and drift to delusional overestimates | Add a regulariser that pushes Q down on all actions (via logsumexp) and up on dataset actions, producing a learned lower bound that is conservative on OOD |
| IQL | Offline RL: even querying max_a Q(s’,a’) during training requires evaluating Q on unseen actions, enabling OOD extrapolation errors | Avoid the max entirely — learn V(s) via expectile regression on dataset (s,a) pairs, which approximates an in-sample max without ever evaluating Q on actions not in the data |
| SAC / SoftQ | Pure reward maximisation leads to brittle, deterministic policies that exploit narrow peaks and explore poorly | Augment the target with an entropy bonus (−α log π), so the agent maximises reward AND action entropy jointly, encouraging robust, multi-modal policies |
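The two offline regularisers above are short enough to sketch directly (NumPy; function names and the τ = 0.7 default are illustrative choices, not prescriptions):

```python
import numpy as np

def cql_penalty(q_row, a_data):
    # logsumexp over all actions pushes Q DOWN everywhere; subtracting
    # Q(s, a_data) pushes the dataset action back UP, so only OOD
    # actions end up conservatively underestimated.
    m = q_row.max()
    lse = np.log(np.exp(q_row - m).sum()) + m  # numerically stable logsumexp
    return lse - q_row[a_data]

def expectile_loss(td, tau=0.7):
    # IQL's asymmetric squared loss: tau > 0.5 weights positive errors
    # more, so V(s) drifts toward an upper expectile of in-sample Q,
    # approximating a max without querying Q on unseen actions.
    weight = np.where(td > 0, tau, 1.0 - tau)
    return weight * td ** 2
```

For a uniform Q row the CQL penalty reduces to log |A|, and at τ = 0.5 the expectile loss reduces to ordinary (half-weighted) squared error.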
Motives for the Additional Variants
| Variant | Problem Solved | Intuition for Solution |
|---|---|---|
| Dueling DQN | In many states, the value of being there matters more than which action you pick, but standard DQN entangles state-value and action-advantage | Split Q into V(s) + A(s,a) streams so the network can learn “this state is good/bad” without needing to experience every action in it |
| Distributional (C51) | A single expected Q-value discards information about risk and multi-modal outcomes | Model the full distribution of returns as a categorical over fixed atoms; richer gradients and risk-awareness improve learning |
| QR-DQN | C51 requires a fixed support range and atom count chosen a priori, which can clip or waste resolution | Learn quantile locations directly — no fixed support needed — and use quantile Huber loss for asymmetric regression on each quantile |
| IQN | QR-DQN fixes the number of quantiles, limiting resolution of the return distribution | Sample quantile fractions from U(0,1) at each forward pass, implicitly representing the full continuous distribution |
| Rainbow | Each DQN improvement helps independently, but they address orthogonal failure modes | Combine six complementary ideas (Double, Dueling, Distributional, PER, Noisy Nets, N-step) — gains compound because they fix different bottlenecks |
| N-step DQN | 1-step bootstrapping propagates reward information slowly, especially in sparse-reward tasks | Use n-step returns so reward signal reaches earlier states faster, at the cost of some bias from the longer bootstrap horizon |
| Prioritized (PER) | Uniform replay wastes most updates on already-learned transitions | Sample transitions proportional to their TD-error so the network focuses on surprising or poorly-predicted experiences |
| Noisy DQN | ε-greedy explores uniformly across actions regardless of state, which is inefficient and state-agnostic | Replace ε-greedy with learnable noise injected into network weights, so exploration is state-dependent and fades automatically as the network becomes confident |
| HER | In sparse-reward goal-conditioned tasks, the agent almost never reaches the goal, so it gets no learning signal | After an episode, relabel the trajectory with goals the agent actually reached — every trajectory becomes a success story for some goal, providing dense reward signal |
| BCQ | Offline RL with unconstrained action selection queries Q on actions never seen in the dataset, causing extrapolation errors | Train a generative model on the dataset’s action distribution and only allow the policy to choose from actions the generative model considers plausible |
| TD3 (continuous) | DDPG suffers from Q overestimation, brittle convergence, and sensitivity to hyperparameters in continuous control | Take the min of two Q-networks (clipped double-Q), add noise to target actions, and delay policy updates — all reduce overestimation and stabilise training |
| REDQ | High sample efficiency requires many gradient updates per environment step, but this amplifies Q overestimation | Maintain a large ensemble of Q-networks and take the min over a random subset for each target, allowing high update-to-data ratios without runaway overestimation |
| Munchausen RL | Standard bootstrapping treats all transitions equally regardless of how likely the policy was to take that action | Add the agent’s own log-policy to the reward (a “self-referential” bonus), which implicitly performs a form of KL regularisation and consistently improves DQN and SAC baselines with one line of code |
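Two of these fixes are literally one-liners, which is worth seeing (NumPy; names are mine):

```python
import numpy as np

def dueling_q(v, adv):
    # Dueling: Q(s, .) = V(s) + A(s, .) - mean_a A(s, .); subtracting the
    # mean makes the V/A split identifiable (otherwise a constant could
    # shift freely between the two streams).
    return v + adv - adv.mean()

def clipped_double_q_target(r, q1_next, q2_next, done, gamma=0.99):
    # TD3: take the min of two target critics so that overestimation by
    # either one is damped rather than propagated.
    return r + gamma * (1.0 - done) * min(q1_next, q2_next)
```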
unified-q-learning.py — all variant implementations in a single file, showing how they share one core loop