
Replay Buffers

Store past (s, a, r, s', done) transitions and sample random minibatches for training. Breaks temporal correlation between consecutive samples, enables sample reuse, and is essential for all off-policy methods (DQN, SAC, DDPG, TD3). Also called "experience replay" or "replay memory." Two main variants: uniform sampling and prioritised replay.

Neural networks trained on correlated sequential data are unstable — if the agent spends 100 steps in one room, the network overfits to that room and forgets everything else. A replay buffer fixes this by storing transitions as they arrive and sampling random minibatches for training. Random sampling breaks the temporal correlation: a single batch might contain transitions from many different episodes and states, giving the network a diverse training signal.

The second benefit is sample efficiency. In on-policy methods (PPO, A2C), you collect data, use it once, then throw it away. With a replay buffer, every transition can be reused many times. DQN typically replays each transition ~8 times over its lifetime in the buffer. This is why off-policy methods are far more sample-efficient than on-policy ones.

Prioritised replay goes further: instead of sampling uniformly, it samples transitions proportional to their TD error. High TD error means the network’s prediction was far off — these are the most “surprising” transitions and learning from them yields the largest gradient updates. The catch is that this introduces bias (over-representing surprising transitions changes the effective distribution), so importance sampling weights are needed to correct it.

Uniform replay: sample each transition with probability $\frac{1}{N}$, where $N$ is the buffer size.

Prioritised replay (Schaul et al., 2016):

Priority of transition $i$: $p_i = |\delta_i| + \epsilon$, where $\delta_i$ is the TD error and $\epsilon$ is a small constant preventing zero priority.

Sampling probability:

$$P(i) = \frac{p_i^\alpha}{\sum_k p_k^\alpha}$$

$\alpha$ controls how much prioritisation is used: $\alpha = 0$ is uniform, $\alpha = 1$ is fully proportional to priority.
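This sampling rule is straightforward to sketch in numpy (the function name `priority_probs` is illustrative):

```python
import numpy as np

def priority_probs(td_errors, alpha=0.6, eps=1e-6):
    """Turn TD errors into sampling probabilities P(i) = p_i^alpha / sum_k p_k^alpha."""
    p = np.abs(td_errors) + eps   # p_i = |delta_i| + eps, so no transition gets zero priority
    scaled = p ** alpha
    return scaled / scaled.sum()

td = np.array([0.1, 2.0, 0.5, 0.0])
probs = priority_probs(td, alpha=0.6)    # transition with |delta| = 2.0 gets the largest share
uniform = priority_probs(td, alpha=0.0)  # alpha = 0 recovers uniform sampling
```

Note how $\alpha$ interpolates between the two regimes: at $\alpha = 0$ every transition gets probability $1/N$ regardless of its TD error.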

Importance sampling correction (to remove the bias from non-uniform sampling):

$$w_i = \left(\frac{1}{N \cdot P(i)}\right)^\beta$$

$\beta$ is annealed from a small value to 1.0 over training. Weights are normalised by $\max_i w_i$ for stability. The loss becomes:

$$\mathcal{L} = \frac{1}{B} \sum_{i \in \text{batch}} w_i \cdot \ell(Q(s_i, a_i), y_i)$$
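A weighted loss along these lines might look like the following sketch (the name `prioritised_loss` and its argument layout are assumptions; Huber loss stands in for $\ell$, as is common in DQN):

```python
import torch
import torch.nn.functional as F

def prioritised_loss(q_pred, targets, probs, n_total, beta=0.4):
    """Importance-sampling-weighted TD loss for a prioritised batch.
    probs: sampling probabilities P(i) of the batch items; n_total: buffer size N."""
    w = (1.0 / (n_total * probs)) ** beta  # w_i = (1 / (N * P(i)))^beta
    w = w / w.max()                        # normalise by max_i w_i for stability
    per_elem = F.smooth_l1_loss(q_pred, targets, reduction="none")
    return (w * per_elem).mean()
```

With uniform sampling probabilities ($P(i) = 1/N$) every weight is exactly 1 and this reduces to the ordinary mean loss, which is a useful sanity check.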

import torch
import numpy as np
from collections import deque

# ── Uniform Replay Buffer ───────────────────────────────────────
class ReplayBuffer:
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        indices = np.random.choice(len(self.buffer), batch_size, replace=False)
        batch = [self.buffer[i] for i in indices]
        states, actions, rewards, next_states, dones = zip(*batch)
        return (
            torch.stack(states),                         # (B, *state_shape)
            torch.tensor(actions),                       # (B,)
            torch.tensor(rewards, dtype=torch.float32),  # (B,)
            torch.stack(next_states),                    # (B, *state_shape)
            torch.tensor(dones, dtype=torch.float32),    # (B,)
        )

    def __len__(self):
        return len(self.buffer)

# ── Usage in DQN training loop ──────────────────────────────────
# NEVER start training until the buffer has enough transitions.
# Common minimum: 10,000-50,000 random transitions before learning.
if len(buffer) >= min_buffer_size:
    batch = buffer.sample(batch_size=256)
import numpy as np

class ReplayBufferNumpy:
    """
    Fixed-size circular buffer with uniform sampling, numpy-only.
    More memory-efficient than a deque of tuples for large buffers.
    """
    def __init__(self, capacity, state_dim, action_dim=1):
        self.capacity = capacity
        self.idx = 0   # next write position
        self.size = 0  # current number of transitions
        # Pre-allocate contiguous arrays — avoids per-transition allocation
        self.states = np.zeros((capacity, state_dim), dtype=np.float32)
        self.actions = np.zeros((capacity, action_dim), dtype=np.int64)
        self.rewards = np.zeros(capacity, dtype=np.float32)
        self.next_states = np.zeros((capacity, state_dim), dtype=np.float32)
        self.dones = np.zeros(capacity, dtype=np.float32)

    def push(self, s, a, r, s_next, done):
        self.states[self.idx] = s
        self.actions[self.idx] = a
        self.rewards[self.idx] = r
        self.next_states[self.idx] = s_next
        self.dones[self.idx] = float(done)
        self.idx = (self.idx + 1) % self.capacity  # circular overwrite
        self.size = min(self.size + 1, self.capacity)

    def sample(self, batch_size):
        indices = np.random.choice(self.size, batch_size, replace=False)
        return (
            self.states[indices],       # (B, state_dim)
            self.actions[indices],      # (B, action_dim)
            self.rewards[indices],      # (B,)
            self.next_states[indices],  # (B, state_dim)
            self.dones[indices],        # (B,)
        )
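Putting the prioritised-replay formulas together, a minimal proportional buffer might look like the sketch below. This is illustrative, not a production implementation: it samples in O(N) via `np.random.choice`, whereas real implementations use a sum-tree for O(log N) sampling at the 1M-transition scale.

```python
import numpy as np

class PrioritisedReplayBuffer:
    """Proportional prioritised replay (Schaul et al., 2016), simplified sketch."""
    def __init__(self, capacity, alpha=0.6, eps=1e-6):
        self.capacity, self.alpha, self.eps = capacity, alpha, eps
        self.data = [None] * capacity
        self.priorities = np.zeros(capacity, dtype=np.float64)
        self.idx, self.size = 0, 0

    def push(self, transition):
        # New transitions get max priority so each is sampled at least once
        max_p = self.priorities[:self.size].max() if self.size else 1.0
        self.data[self.idx] = transition
        self.priorities[self.idx] = max_p
        self.idx = (self.idx + 1) % self.capacity
        self.size = min(self.size + 1, self.capacity)

    def sample(self, batch_size, beta=0.4):
        scaled = self.priorities[:self.size] ** self.alpha
        probs = scaled / scaled.sum()                 # P(i) = p_i^alpha / sum_k p_k^alpha
        indices = np.random.choice(self.size, batch_size, p=probs)
        weights = (self.size * probs[indices]) ** -beta  # w_i = (1 / (N * P(i)))^beta
        weights /= weights.max()                         # normalise by max_i w_i
        batch = [self.data[i] for i in indices]
        return batch, indices, weights

    def update_priorities(self, indices, td_errors):
        # Called after the learning step with the fresh TD errors of the batch
        self.priorities[indices] = np.abs(td_errors) + self.eps
```

The training loop samples a batch, computes TD errors, applies the importance weights to the loss, then calls `update_priorities` so the next sampling pass reflects the new errors.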
  • DQN (see q-learning/): the original replay buffer — 1M transitions, uniform sampling, batch size 32. One of the two key innovations (alongside target networks) that made deep Q-learning work
  • SAC, DDPG, TD3 (see q-learning/): all continuous-control off-policy methods use replay buffers, typically 1M capacity
  • Prioritised DQN (Rainbow): prioritised replay was one of six improvements combined in Rainbow DQN
  • Offline RL (CQL, IQL — see q-learning/): the “replay buffer” is a fixed dataset collected by a previous policy. No new transitions are added during training
  • HER (Hindsight Experience Replay): relabels goals in stored transitions after the fact, turning failures into successes for goal-conditioned RL
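The HER relabelling idea from the last bullet can be sketched as a toy "future"-strategy version (the tuple layout and `her_relabel` name are assumptions, and real HER uses a goal-distance test rather than exact state equality):

```python
import random

def her_relabel(episode, k=4, rng=None):
    """For each stored transition, also store k copies whose goal is a state
    actually reached later in the episode — turning failures into successes.
    episode: list of (state, action, reward, next_state, goal) tuples."""
    rng = rng or random.Random(0)
    relabelled = []
    for t, (s, a, _, s_next, _) in enumerate(episode):
        future = [episode[j][3] for j in range(t, len(episode))]  # achieved states
        for _ in range(k):
            new_goal = rng.choice(future)
            # Reward 1 if this transition reaches the relabelled goal, else 0
            r = 1.0 if s_next == new_goal else 0.0
            relabelled.append((s, a, r, s_next, new_goal))
    return relabelled
```

Even an episode that never reaches the original goal yields positive-reward transitions under the relabelled goals, which is what makes HER effective in sparse-reward settings.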
Alternative | When to use | Tradeoff
On-policy data (PPO, A2C) | Policy-gradient methods that require fresh trajectories | No buffer needed; data is discarded after one use; less sample-efficient but avoids off-policy bias
Prioritised replay | Want faster learning on hard transitions | ~40% better sample efficiency on Atari; adds complexity, needs importance-sampling correction
Hindsight replay (HER) | Sparse-reward, goal-conditioned tasks | Relabels goals to create "free" positive transitions; only applicable to goal-conditioned setups
N-step replay | Want multi-step returns in off-policy methods | Store (s_t, a_t, R_{t:t+n}, s_{t+n}); lower-bias targets but slightly stale n-step returns
Reservoir sampling | Streaming data, unknown total size | Maintains a uniform sample from a stream; not commonly used in deep RL
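The n-step row above stores collapsed transitions; one way to collapse a sliding window of 1-step transitions into the stored tuple is sketched below (names assumed, return truncated at episode end):

```python
from collections import deque

def nstep_transition(window, gamma=0.99):
    """Collapse n consecutive 1-step transitions (s, a, r, s_next, done)
    into one n-step transition (s_t, a_t, R_{t:t+n}, s_{t+n}, done)."""
    R, done = 0.0, False
    for i, (_, _, r, _, d) in enumerate(window):
        R += (gamma ** i) * r  # discounted n-step return R_{t:t+n}
        if d:
            done = True
            break
    s0, a0 = window[0][0], window[0][1]
    s_n = window[i][3]  # next state where the window (or episode) ended
    return (s0, a0, R, s_n, done)
```

The agent keeps a `deque(maxlen=n)` of recent transitions and pushes the collapsed tuple into the replay buffer once the window is full or the episode terminates.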

Experience replay was introduced by Long-Ji Lin in 1992 as a technique for speeding up credit assignment in reinforcement learning. It remained a niche idea until Mnih et al. (2013, 2015) used it as a critical component of DQN. The DQN paper showed that without replay, deep Q-learning diverges — the temporal correlation in sequential data causes catastrophic forgetting and instability.

Prioritised experience replay (Schaul et al., 2016) improved on uniform sampling by focusing on high-TD-error transitions. It was incorporated into Rainbow (Hessel et al., 2018) as one of six complementary improvements. Modern frameworks like Reverb (DeepMind) and TorchRL provide production-quality replay buffer implementations with distributed sampling, compression, and rate limiting. The concept has also influenced supervised learning through curriculum learning and hard example mining.