Replay Buffers
Store past (s, a, r, s’, done) transitions and sample random minibatches for training. Breaks temporal correlation between consecutive samples, enables sample reuse, and is essential for all off-policy methods (DQN, SAC, DDPG, TD3). Also called “experience replay” or “replay memory.” Two main variants: uniform sampling and prioritised replay.
Intuition
Neural networks trained on correlated sequential data are unstable — if the agent spends 100 steps in one room, the network overfits to that room and forgets everything else. A replay buffer fixes this by storing transitions as they arrive and sampling random minibatches for training. Random sampling breaks the temporal correlation: a single batch might contain transitions from many different episodes and states, giving the network a diverse training signal.
The second benefit is sample efficiency. In on-policy methods (PPO, A2C), you collect data, use it once, then throw it away. With a replay buffer, every transition can be reused many times. DQN typically replays each transition ~8 times over its lifetime in the buffer. This is why off-policy methods are far more sample-efficient than on-policy ones.
Prioritised replay goes further: instead of sampling uniformly, it samples transitions proportional to their TD error. High TD error means the network’s prediction was far off — these are the most “surprising” transitions and learning from them yields the largest gradient updates. The catch is that this introduces bias (over-representing surprising transitions changes the effective distribution), so importance sampling weights are needed to correct it.
Uniform replay: sample each transition with probability $1/N$, where $N$ is the buffer size.

Prioritised replay (Schaul et al., 2016):

Priority of transition $i$: $p_i = |\delta_i| + \epsilon$, where $\delta_i$ is the TD error and $\epsilon$ is a small constant preventing zero priority.

Sampling probability:

$$P(i) = \frac{p_i^\alpha}{\sum_k p_k^\alpha}$$

$\alpha$ controls how much prioritisation is used: $\alpha = 0$ is uniform, $\alpha = 1$ is fully proportional to priority.

Importance sampling correction (to remove the bias from non-uniform sampling):

$$w_i = \left( \frac{1}{N} \cdot \frac{1}{P(i)} \right)^{\beta}$$

$\beta$ is annealed from a small value to 1.0 over training. Weights are normalised by $\max_i w_i$ for stability. The loss becomes:

$$\mathcal{L}_i = w_i \, \delta_i^2$$
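The prioritised scheme above can be sketched in plain numpy. This is a minimal illustration (the class name `PrioritisedReplay` is ours, not from a library), and it uses a linear scan over priorities; real implementations use a sum-tree so sampling is O(log N) instead of O(N):

```python
import numpy as np

class PrioritisedReplay:
    """Proportional prioritised replay sketch (linear scan, not a sum-tree)."""

    def __init__(self, capacity, alpha=0.6, eps=1e-5):
        self.capacity = capacity
        self.alpha = alpha          # prioritisation exponent
        self.eps = eps              # keeps priorities strictly positive
        self.data = []
        self.priorities = np.zeros(capacity, dtype=np.float64)
        self.idx = 0

    def push(self, transition):
        # New transitions get the current max priority so they are
        # guaranteed to be sampled at least once before being down-weighted.
        max_p = self.priorities.max() if self.data else 1.0
        if len(self.data) < self.capacity:
            self.data.append(transition)
        else:
            self.data[self.idx] = transition
        self.priorities[self.idx] = max_p
        self.idx = (self.idx + 1) % self.capacity

    def sample(self, batch_size, beta=0.4):
        p = self.priorities[:len(self.data)] ** self.alpha
        probs = p / p.sum()                      # P(i) = p_i^alpha / sum_k p_k^alpha
        indices = np.random.choice(len(self.data), batch_size, p=probs)
        # Importance-sampling weights, normalised by the max for stability
        weights = (len(self.data) * probs[indices]) ** (-beta)
        weights /= weights.max()
        batch = [self.data[i] for i in indices]
        return batch, indices, weights

    def update_priorities(self, indices, td_errors):
        # Called after the learning step, once new TD errors are known
        self.priorities[indices] = np.abs(td_errors) + self.eps
```

After each gradient step, the caller feeds the fresh TD errors back via `update_priorities`, closing the loop between sampling and learning.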
```python
import torch
import numpy as np
from collections import deque

# ── Uniform Replay Buffer ───────────────────────────────────────
class ReplayBuffer:
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        indices = np.random.choice(len(self.buffer), batch_size, replace=False)
        batch = [self.buffer[i] for i in indices]
        states, actions, rewards, next_states, dones = zip(*batch)
        return (
            torch.stack(states),                         # (B, *state_shape)
            torch.tensor(actions),                       # (B,)
            torch.tensor(rewards, dtype=torch.float32),  # (B,)
            torch.stack(next_states),                    # (B, *state_shape)
            torch.tensor(dones, dtype=torch.float32),    # (B,)
        )

    def __len__(self):
        return len(self.buffer)

# ── Usage in DQN training loop ──────────────────────────────────
# NEVER start training until the buffer has enough transitions.
# Common minimum: 10,000-50,000 random transitions before learning.
if len(buffer) >= min_buffer_size:
    batch = buffer.sample(batch_size=256)
```

Manual Implementation
```python
import numpy as np

class ReplayBufferNumpy:
    """
    Fixed-size circular buffer with uniform sampling, numpy-only.
    More memory-efficient than a deque of tuples for large buffers.
    """
    def __init__(self, capacity, state_dim, action_dim=1):
        self.capacity = capacity
        self.idx = 0   # next write position
        self.size = 0  # current number of transitions

        # Pre-allocate contiguous arrays — avoids per-transition allocation
        self.states = np.zeros((capacity, state_dim), dtype=np.float32)
        self.actions = np.zeros((capacity, action_dim), dtype=np.int64)
        self.rewards = np.zeros(capacity, dtype=np.float32)
        self.next_states = np.zeros((capacity, state_dim), dtype=np.float32)
        self.dones = np.zeros(capacity, dtype=np.float32)

    def push(self, s, a, r, s_next, done):
        self.states[self.idx] = s
        self.actions[self.idx] = a
        self.rewards[self.idx] = r
        self.next_states[self.idx] = s_next
        self.dones[self.idx] = float(done)
        self.idx = (self.idx + 1) % self.capacity  # circular overwrite
        self.size = min(self.size + 1, self.capacity)

    def sample(self, batch_size):
        indices = np.random.choice(self.size, batch_size, replace=False)
        return (
            self.states[indices],       # (B, state_dim)
            self.actions[indices],      # (B, action_dim)
            self.rewards[indices],      # (B,)
            self.next_states[indices],  # (B, state_dim)
            self.dones[indices],        # (B,)
        )
```

Popular Uses
- DQN (see q-learning/): the original replay buffer — 1M transitions, uniform sampling, batch size 32. One of the two key innovations (alongside target networks) that made deep Q-learning work
- SAC, DDPG, TD3 (see q-learning/): all continuous-control off-policy methods use replay buffers, typically 1M capacity
- Prioritised DQN (Rainbow): prioritised replay was one of six improvements combined in Rainbow DQN
- Offline RL (CQL, IQL — see q-learning/): the “replay buffer” is a fixed dataset collected by a previous policy. No new transitions are added during training
- HER (Hindsight Experience Replay): relabels goals in stored transitions after the fact, turning failures into successes for goal-conditioned RL
Alternatives
| Alternative | When to use | Tradeoff |
|---|---|---|
| On-policy data (PPO, A2C) | Policy gradient methods that require fresh trajectories | No buffer needed; data is discarded after one use; less sample-efficient but avoids off-policy bias |
| Prioritised replay | Want faster learning on hard transitions | ~40% better sample efficiency on Atari; adds complexity, needs importance sampling correction |
| Hindsight replay (HER) | Sparse reward, goal-conditioned tasks | Relabels goals to create “free” positive transitions; only applicable to goal-conditioned setups |
| N-step replay | Want multi-step returns in off-policy methods | Store (s_t, a_t, R_{t:t+n}, s_{t+n}); lower-bias targets but slightly stale n-step returns |
| Reservoir sampling | Streaming data, unknown total size | Maintains uniform sample from a stream; not commonly used in deep RL |
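The n-step row in the table stores accumulated returns R_{t:t+n} instead of single rewards. A minimal sketch of assembling such transitions from a finished episode (the function name and tuple layout are illustrative assumptions):

```python
def nstep_transitions(episode, n=3, gamma=0.99):
    """episode: list of (s, a, r, s_next, done) tuples.
    Returns (s_t, a_t, R_{t:t+n}, s_{t+n}, done) tuples, truncating
    the lookahead at episode end."""
    out = []
    for t in range(len(episode)):
        R, s_n, done_n = 0.0, None, False
        for k in range(n):
            if t + k >= len(episode):
                break
            s, a, r, s_next, done = episode[t + k]
            R += (gamma ** k) * r      # accumulate discounted reward
            s_n, done_n = s_next, done
            if done:                   # stop the lookahead at terminal states
                break
        out.append((episode[t][0], episode[t][1], R, s_n, done_n))
    return out
```

In an online setting the same accumulation is usually done with a small sliding window (e.g. a deque of the last n raw transitions) rather than over a whole stored episode.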
Historical Context
Experience replay was introduced by Long-Ji Lin in 1992 as a technique for speeding up credit assignment in reinforcement learning. It remained a niche idea until Mnih et al. (2013, 2015) used it as a critical component of DQN. The DQN paper showed that without replay, deep Q-learning diverges — the temporal correlation in sequential data causes catastrophic forgetting and instability.
Prioritised experience replay (Schaul et al., 2016) improved on uniform sampling by focusing on high-TD-error transitions. It was incorporated into Rainbow (Hessel et al., 2018) as one of six complementary improvements. Modern frameworks like Reverb (DeepMind) and TorchRL provide production-quality replay buffer implementations with distributed sampling, compression, and rate limiting. The concept has also influenced supervised learning through curriculum learning and hard example mining.