
Unified Q-Learning

A single skeleton that covers Q-Learning, DQN, Double DQN, Dueling DQN, CQL, IQL, SAC (Q-critic), and more.

The core Q-learning loop is identical across all variants:

  • q_a = Q(s).gather(a) — evaluate the current Q-values
  • targets = compute_target(batch) — build the bootstrap target (PLUGGABLE)
  • loss = compute_loss(q_a, targets) — compute the objective (PLUGGABLE)
  • optimizer step
  • target-network update (hard copy or Polyak averaging)

The data regime only changes where batches come from:

  • Online → interact with the environment, fill a replay buffer, sample batches
  • Offline → just sample batches from a fixed dataset

Only three things vary across variants:

  1. compute_target() — how the bootstrap target y is built
  2. compute_loss() — the objective (MSE, Huber, + regularizers)
  3. Data source — online replay buffer vs. offline dataset
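The skeleton above can be sketched as a small class with the two pluggable methods. This is a minimal tabular/NumPy stand-in (class and method names are illustrative, not taken from `unified-q-learning.py`); the real implementations use a neural network and an optimizer in place of the direct table update.

```python
import numpy as np

class QLearner:
    """Shared skeleton: subclass and override the PLUGGABLE methods."""

    def __init__(self, n_states, n_actions, gamma=0.99, lr=0.5):
        self.Q = np.zeros((n_states, n_actions))      # online network
        self.Q_tgt = np.zeros((n_states, n_actions))  # frozen target network
        self.gamma, self.lr = gamma, lr

    def compute_target(self, batch):
        # PLUGGABLE: vanilla DQN target r + gamma * max_a' Q_tgt(s', a')
        s2, r, done = batch["s2"], batch["r"], batch["done"]
        return r + self.gamma * (1 - done) * self.Q_tgt[s2].max(axis=1)

    def compute_loss(self, q_a, targets):
        # PLUGGABLE: plain MSE objective
        return np.mean((q_a - targets) ** 2)

    def update(self, batch):
        s, a = batch["s"], batch["a"]
        q_a = self.Q[s, a]                          # q_a = Q(s).gather(a)
        targets = self.compute_target(batch)        # build target
        loss = self.compute_loss(q_a, targets)      # compute loss
        self.Q[s, a] += self.lr * (targets - q_a)   # "optimizer step" (tabular)
        return loss

    def sync_target(self, tau=1.0):
        # target update: hard copy (tau=1) or Polyak averaging (tau < 1)
        self.Q_tgt = tau * self.Q + (1 - tau) * self.Q_tgt
```

Swapping variants then means overriding `compute_target` and/or `compute_loss` while `update` stays untouched.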
| Variant | compute_target() | compute_loss() extra |
| --- | --- | --- |
| DQN | r + γ max Q_tgt(s’, ·) | MSE |
| Double DQN | r + γ Q_tgt(s’, argmax Q(s’, ·)) | MSE |
| CQL | r + γ max Q_tgt(s’, ·) | MSE + α·CQL penalty |
| IQL | r + γ V(s’) | MSE (+ V expectile) |
| SAC / SoftQ | r + γ (Q_tgt(s’, a’) − α log π) | MSE |
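The DQN and Double DQN rows differ only in who picks the bootstrap action. A small NumPy sketch (hypothetical function names) makes the decoupling explicit:

```python
import numpy as np

def dqn_target(r, gamma, q_tgt_s2, done):
    # r + gamma * max_a' Q_tgt(s', a'):
    # the target net both SELECTS and EVALUATES the action -> maximisation bias
    return r + gamma * (1 - done) * q_tgt_s2.max(axis=1)

def double_dqn_target(r, gamma, q_s2, q_tgt_s2, done):
    # Double DQN: the online net selects, the target net evaluates
    a_star = q_s2.argmax(axis=1)
    return r + gamma * (1 - done) * q_tgt_s2[np.arange(len(a_star)), a_star]
```

When the two networks disagree on the best action, Double DQN bootstraps from a smaller, less inflated value.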
| Variant | compute_target() | compute_loss() extra |
| --- | --- | --- |
| Dueling DQN | r + γ max Q_tgt(s’, ·) | MSE (but Q = V(s) + A(s,a) − mean A) |
| Distributional (C51) | Projected distributional Bellman target | Cross-entropy over atom probabilities |
| QR-DQN | r + γ quantiles of Q_tgt(s’, ·) | Quantile Huber loss |
| IQN | r + γ sampled quantiles of Q_tgt(s’, ·) | Quantile Huber (implicit quantile sampling) |
| Rainbow | Combines Double + Dueling + Distributional + PER + noisy nets + n-step | Cross-entropy (distributional) |
| N-step DQN | Σ γⁱrᵢ + γⁿ max Q_tgt(sₙ, ·) | MSE |
| Prioritized (PER) | Same as base (DQN/Double/etc.) | MSE, but samples weighted by TD-error priority |
| Noisy DQN | Same as base | MSE (parametric noise in network weights replaces ε-greedy) |
| HER | r + γ max Q_tgt(s’‖g, ·) | MSE (replays with relabeled goals) |
| BCQ | r + γ max_{a∈filtered} Q_tgt(s’, a) | MSE + generative model constrains actions to data support |
| TD3 (continuous) | r + γ min(Q_tgt1, Q_tgt2)(s’, π_tgt(s’) + clip noise) | MSE (clipped double Q, delayed policy update) |
| REDQ | r + γ min over M random subset of N Q-targets | MSE (high update-to-data ratio) |
| Munchausen RL | (r + α log π(a\|s)) + γ (Q_tgt − α log π) | MSE (adds scaled log-policy to reward) |
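The N-step row is the simplest target change in the table: replace the 1-step bootstrap with a discounted reward sum plus an n-step bootstrap. A minimal sketch (hypothetical function name):

```python
def n_step_target(rewards, gamma, q_tgt_bootstrap):
    """Sum_{i=0}^{n-1} gamma^i * r_i  +  gamma^n * max_a Q_tgt(s_n, a).

    `rewards` is the list [r_0, ..., r_{n-1}] along the sampled trajectory;
    `q_tgt_bootstrap` is the max target-Q at the n-th state (0 if terminal).
    """
    n = len(rewards)
    discounted = sum(gamma ** i * r for i, r in enumerate(rewards))
    return discounted + gamma ** n * q_tgt_bootstrap
```

With n = 1 this collapses to the standard DQN target, which is why n-step slots into the same skeleton.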

Key groupings:

  • Representation: Dueling (separate V/A streams)
  • Distributional: C51, QR-DQN, IQN (model return distribution, not just mean)
  • Sampling/training tricks: PER, N-step, Noisy nets, Rainbow (combines many)
  • Offline/batch RL: BCQ (constrains to data support), pairs with CQL/IQL above
  • Continuous action: TD3 (extends DDPG with clipped double-Q), REDQ
  • Goal-conditioned: HER (relabels goals in hindsight)
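For the representation grouping, the dueling aggregation mentioned in the table (Q = V(s) + A(s,a) − mean A) is a one-liner. A NumPy sketch (hypothetical function name; in the network this sits after two separate output heads):

```python
import numpy as np

def dueling_q(v, a):
    """Combine value and advantage streams into Q-values.

    v: (batch,) state values; a: (batch, n_actions) advantages.
    Subtracting the per-state mean advantage makes the V/A split
    identifiable (otherwise any constant could shift between them).
    """
    return v[:, None] + a - a.mean(axis=1, keepdims=True)
```

The resulting Q-values feed into the standard loop unchanged, which is why dueling only alters the network, not the targets.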
| Variant | Problem Solved | Intuition for Solution |
| --- | --- | --- |
| DQN | Q-learning fails with high-dim state spaces (e.g. pixels) and is unstable with function approximation | Use a neural net as Q, plus a frozen target network and replay buffer to decorrelate updates and stabilise training |
| Double DQN | Maximisation bias: the same net both selects and evaluates the best action, causing systematic Q overestimation | Decouple action selection (online net) from action evaluation (target net) so noise in Q doesn’t consistently inflate the target |
| CQL | Offline RL: Q-values on out-of-distribution (OOD) actions are unconstrained and drift to delusional overestimates | Add a regulariser that pushes Q down on all actions (via logsumexp) and up on dataset actions, producing a learned lower bound that is conservative on OOD |
| IQL | Offline RL: even querying max_a Q(s’,a’) during training requires evaluating Q on unseen actions, enabling OOD extrapolation errors | Avoid the max entirely — learn V(s) via expectile regression on dataset (s,a) pairs, which approximates an in-sample max without ever evaluating Q on actions not in the data |
| SAC / SoftQ | Pure reward maximisation leads to brittle, deterministic policies that exploit narrow peaks and explore poorly | Augment the target with an entropy bonus (−α log π), so the agent maximises reward AND action entropy jointly, encouraging robust, multi-modal policies |
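The CQL regulariser described in the table is compact enough to sketch directly. A NumPy version for discrete actions (hypothetical function name; the full loss adds this to the TD error with weight α):

```python
import numpy as np

def cql_penalty(q_s, a_data):
    """Conservative Q-learning penalty for discrete actions.

    q_s: (batch, n_actions) Q-values; a_data: (batch,) dataset actions.
    logsumexp over all actions pushes every Q down; subtracting Q on the
    dataset action pushes in-distribution values back up, so only OOD
    actions end up suppressed.
    """
    lse = np.log(np.exp(q_s).sum(axis=1))              # logsumexp_a Q(s, a)
    q_data = q_s[np.arange(len(a_data)), a_data]       # Q(s, a_data)
    return (lse - q_data).mean()
```

Since logsumexp upper-bounds the max, the penalty is always non-negative and vanishes only when the dataset action dominates all others.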
| Variant | Problem Solved | Intuition for Solution |
| --- | --- | --- |
| Dueling DQN | In many states, the value of being there matters more than which action you pick, but standard DQN entangles state-value and action-advantage | Split Q into V(s) + A(s,a) streams so the network can learn “this state is good/bad” without needing to experience every action in it |
| Distributional (C51) | A single expected Q-value discards information about risk and multi-modal outcomes | Model the full distribution of returns as a categorical over fixed atoms; richer gradients and risk-awareness improve learning |
| QR-DQN | C51 requires a fixed support range and atom count chosen a priori, which can clip or waste resolution | Learn quantile locations directly — no fixed support needed — and use quantile Huber loss for asymmetric regression on each quantile |
| IQN | QR-DQN fixes the number of quantiles, limiting resolution of the return distribution | Sample quantile fractions from U(0,1) at each forward pass, implicitly representing the full continuous distribution |
| Rainbow | Each DQN improvement helps independently, but they address orthogonal failure modes | Combine six complementary ideas (Double, Dueling, Distributional, PER, Noisy Nets, N-step) — gains compound because they fix different bottlenecks |
| N-step DQN | 1-step bootstrapping propagates reward information slowly, especially in sparse-reward tasks | Use n-step returns so reward signal reaches earlier states faster, at the cost of some bias from the longer bootstrap horizon |
| Prioritized (PER) | Uniform replay wastes most updates on already-learned transitions | Sample transitions proportional to their TD-error so the network focuses on surprising or poorly-predicted experiences |
| Noisy DQN | ε-greedy explores uniformly across actions regardless of state, which is inefficient and state-agnostic | Replace ε-greedy with learnable noise injected into network weights, so exploration is state-dependent and fades automatically as the network becomes confident |
| HER | In sparse-reward goal-conditioned tasks, the agent almost never reaches the goal, so it gets no learning signal | After an episode, relabel the trajectory with goals the agent actually reached — every trajectory becomes a success story for some goal, providing dense reward signal |
| BCQ | Offline RL with unconstrained action selection queries Q on actions never seen in the dataset, causing extrapolation errors | Train a generative model on the dataset’s action distribution and only allow the policy to choose from actions the generative model considers plausible |
| TD3 (continuous) | DDPG suffers from Q overestimation, brittle convergence, and sensitivity to hyperparameters in continuous control | Take the min of two Q-networks (clipped double-Q), add noise to target actions, and delay policy updates — all reduce overestimation and stabilise training |
| REDQ | High sample efficiency requires many gradient updates per environment step, but this amplifies Q overestimation | Maintain a large ensemble of Q-networks and take the min over a random subset for each target, allowing high update-to-data ratios without runaway overestimation |
| Munchausen RL | Standard bootstrapping treats all transitions equally regardless of how likely the policy was to take that action | Add the agent’s own log-policy to the reward (a “self-referential” bonus), which implicitly performs a form of KL regularisation and consistently improves DQN and SAC baselines with one line of code |
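The clipped double-Q trick shared by TD3 and REDQ fits the same `compute_target` slot. A minimal NumPy sketch (hypothetical function name; target-action noise and the ensemble subsetting of REDQ are omitted for brevity):

```python
import numpy as np

def clipped_double_q_target(r, gamma, q1_tgt, q2_tgt, done):
    """TD3-style target: bootstrap from the pessimistic of two critics.

    q1_tgt, q2_tgt: (batch,) values of the two target critics evaluated
    at the (noisy) target-policy action. Taking the elementwise min
    counteracts the overestimation that a single critic accumulates.
    """
    return r + gamma * (1 - done) * np.minimum(q1_tgt, q2_tgt)
```

REDQ generalises this by taking the min over a random subset of a larger critic ensemble, which is what makes its high update-to-data ratios tolerable.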
  • unified-q-learning.py — all variant implementations in a single file, showing how they share one core loop