
Replay Buffers

Store past (s, a, r, s', done) transitions and sample random minibatches for training. Breaks temporal correlation between consecutive samples, enables sample reuse, and is essential for all off-policy methods (DQN, SAC, DDPG, TD3). Also called "experience replay" or "replay memory." Two main variants: uniform sampling and prioritised replay.

Neural networks trained on correlated sequential data are unstable — if the agent spends 100 steps in one room, the network overfits to that room and forgets everything else. A replay buffer fixes this by storing transitions as they arrive and sampling random minibatches for training. Random sampling breaks the temporal correlation: a single batch might contain transitions from many different episodes and states, giving the network a diverse training signal.

The second benefit is sample efficiency. In on-policy methods (PPO, A2C), you collect data, use it once, then throw it away. With a replay buffer, every transition can be reused many times. DQN typically replays each transition ~8 times over its lifetime in the buffer. This is why off-policy methods are far more sample-efficient than on-policy ones.

Prioritised replay goes further: instead of sampling uniformly, it samples transitions proportional to their TD error. High TD error means the network’s prediction was far off — these are the most “surprising” transitions and learning from them yields the largest gradient updates. The catch is that this introduces bias (over-representing surprising transitions changes the effective distribution), so importance sampling weights are needed to correct it.

Uniform replay: sample each transition with probability $\frac{1}{N}$, where $N$ is the buffer size.

Prioritised replay (Schaul et al., 2016):

Priority of transition $i$: $p_i = |\delta_i| + \epsilon$, where $\delta_i$ is the TD error and $\epsilon$ is a small constant preventing zero priority.

Sampling probability:

$$P(i) = \frac{p_i^\alpha}{\sum_k p_k^\alpha}$$

$\alpha$ controls how much prioritisation is used: $\alpha = 0$ is uniform, $\alpha = 1$ is fully proportional to priority.
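This sampling rule is straightforward to sketch in numpy (the function name `priority_probs` is illustrative):

```python
import numpy as np

def priority_probs(td_errors, alpha=0.6, eps=1e-6):
    """Turn TD errors into sampling probabilities P(i) = p_i^alpha / sum_k p_k^alpha."""
    p = np.abs(td_errors) + eps   # p_i = |delta_i| + eps, so no transition gets zero priority
    scaled = p ** alpha
    return scaled / scaled.sum()

td = np.array([0.1, 2.0, 0.5, 0.0])
probs = priority_probs(td, alpha=0.6)    # transition with |delta| = 2.0 gets the largest share
uniform = priority_probs(td, alpha=0.0)  # alpha = 0 recovers uniform sampling
```

Note how $\alpha$ interpolates between the two regimes: at $\alpha = 0$ every transition gets probability $1/N$ regardless of its TD error.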

Importance sampling correction (to remove the bias from non-uniform sampling):

$$w_i = \left(\frac{1}{N \cdot P(i)}\right)^\beta$$

$\beta$ is annealed from a small value to 1.0 over training. Weights are normalised by $\max_i w_i$ for stability. The loss becomes:

$$\mathcal{L} = \frac{1}{B} \sum_{i \in \text{batch}} w_i \cdot \ell(Q(s_i, a_i), y_i)$$
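A weighted loss along these lines might look like the following sketch (the name `prioritised_loss` and its argument layout are assumptions; Huber loss stands in for $\ell$, as is common in DQN):

```python
import torch
import torch.nn.functional as F

def prioritised_loss(q_pred, targets, probs, n_total, beta=0.4):
    """Importance-sampling-weighted TD loss for a prioritised batch.
    probs: sampling probabilities P(i) of the batch items; n_total: buffer size N."""
    w = (1.0 / (n_total * probs)) ** beta  # w_i = (1 / (N * P(i)))^beta
    w = w / w.max()                        # normalise by max_i w_i for stability
    per_elem = F.smooth_l1_loss(q_pred, targets, reduction="none")
    return (w * per_elem).mean()
```

With uniform sampling probabilities ($P(i) = 1/N$) every weight is exactly 1 and this reduces to the ordinary mean loss, which is a useful sanity check.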

import torch
import numpy as np
from collections import deque

# ── Uniform Replay Buffer ───────────────────────────────────────
class ReplayBuffer:
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        indices = np.random.choice(len(self.buffer), batch_size, replace=False)
        batch = [self.buffer[i] for i in indices]
        states, actions, rewards, next_states, dones = zip(*batch)
        return (
            torch.stack(states),                         # (B, *state_shape)
            torch.tensor(actions),                       # (B,)
            torch.tensor(rewards, dtype=torch.float32),  # (B,)
            torch.stack(next_states),                    # (B, *state_shape)
            torch.tensor(dones, dtype=torch.float32),    # (B,)
        )

    def __len__(self):
        return len(self.buffer)

# ── Usage in DQN training loop ──────────────────────────────────
# NEVER start training until the buffer has enough transitions.
# Common minimum: 10,000-50,000 random transitions before learning.
if len(buffer) >= min_buffer_size:
    batch = buffer.sample(batch_size=256)
import numpy as np

class ReplayBufferNumpy:
    """
    Fixed-size circular buffer with uniform sampling, numpy-only.
    More memory-efficient than a deque of tuples for large buffers.
    """
    def __init__(self, capacity, state_dim, action_dim=1):
        self.capacity = capacity
        self.idx = 0   # next write position
        self.size = 0  # current number of transitions
        # Pre-allocate contiguous arrays — avoids per-transition allocation
        self.states = np.zeros((capacity, state_dim), dtype=np.float32)
        self.actions = np.zeros((capacity, action_dim), dtype=np.int64)
        self.rewards = np.zeros(capacity, dtype=np.float32)
        self.next_states = np.zeros((capacity, state_dim), dtype=np.float32)
        self.dones = np.zeros(capacity, dtype=np.float32)

    def push(self, s, a, r, s_next, done):
        self.states[self.idx] = s
        self.actions[self.idx] = a
        self.rewards[self.idx] = r
        self.next_states[self.idx] = s_next
        self.dones[self.idx] = float(done)
        self.idx = (self.idx + 1) % self.capacity  # circular overwrite
        self.size = min(self.size + 1, self.capacity)

    def sample(self, batch_size):
        indices = np.random.choice(self.size, batch_size, replace=False)
        return (
            self.states[indices],       # (B, state_dim)
            self.actions[indices],      # (B, action_dim)
            self.rewards[indices],      # (B,)
            self.next_states[indices],  # (B, state_dim)
            self.dones[indices],        # (B,)
        )
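Putting the prioritised-replay formulas together, a minimal proportional buffer might look like the sketch below. This is illustrative, not a production implementation: it samples in O(N) via `np.random.choice`, whereas real implementations use a sum-tree for O(log N) sampling at the 1M-transition scale.

```python
import numpy as np

class PrioritisedReplayBuffer:
    """Proportional prioritised replay (Schaul et al., 2016), simplified sketch."""
    def __init__(self, capacity, alpha=0.6, eps=1e-6):
        self.capacity, self.alpha, self.eps = capacity, alpha, eps
        self.data = [None] * capacity
        self.priorities = np.zeros(capacity, dtype=np.float64)
        self.idx, self.size = 0, 0

    def push(self, transition):
        # New transitions get max priority so each is sampled at least once
        max_p = self.priorities[:self.size].max() if self.size else 1.0
        self.data[self.idx] = transition
        self.priorities[self.idx] = max_p
        self.idx = (self.idx + 1) % self.capacity
        self.size = min(self.size + 1, self.capacity)

    def sample(self, batch_size, beta=0.4):
        scaled = self.priorities[:self.size] ** self.alpha
        probs = scaled / scaled.sum()                 # P(i) = p_i^alpha / sum_k p_k^alpha
        indices = np.random.choice(self.size, batch_size, p=probs)
        weights = (self.size * probs[indices]) ** -beta  # w_i = (1 / (N * P(i)))^beta
        weights /= weights.max()                         # normalise by max_i w_i
        batch = [self.data[i] for i in indices]
        return batch, indices, weights

    def update_priorities(self, indices, td_errors):
        # Called after the learning step with the fresh TD errors of the batch
        self.priorities[indices] = np.abs(td_errors) + self.eps
```

The training loop samples a batch, computes TD errors, applies the importance weights to the loss, then calls `update_priorities` so the next sampling pass reflects the new errors.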
  • DQN (see q-learning/): the original replay buffer — 1M transitions, uniform sampling, batch size 32. One of the two key innovations (alongside target networks) that made deep Q-learning work
  • SAC, DDPG, TD3 (see q-learning/): all continuous-control off-policy methods use replay buffers, typically 1M capacity
  • Prioritised DQN (Rainbow): prioritised replay was one of six improvements combined in Rainbow DQN
  • Offline RL (CQL, IQL — see q-learning/): the “replay buffer” is a fixed dataset collected by a previous policy. No new transitions are added during training
  • HER (Hindsight Experience Replay): relabels goals in stored transitions after the fact, turning failures into successes for goal-conditioned RL
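The HER relabelling idea from the last bullet can be sketched as a toy "future"-strategy version (the tuple layout and `her_relabel` name are assumptions, and real HER uses a goal-distance test rather than exact state equality):

```python
import random

def her_relabel(episode, k=4, rng=None):
    """For each stored transition, also store k copies whose goal is a state
    actually reached later in the episode — turning failures into successes.
    episode: list of (state, action, reward, next_state, goal) tuples."""
    rng = rng or random.Random(0)
    relabelled = []
    for t, (s, a, _, s_next, _) in enumerate(episode):
        future = [episode[j][3] for j in range(t, len(episode))]  # achieved states
        for _ in range(k):
            new_goal = rng.choice(future)
            # Reward 1 if this transition reaches the relabelled goal, else 0
            r = 1.0 if s_next == new_goal else 0.0
            relabelled.append((s, a, r, s_next, new_goal))
    return relabelled
```

Even an episode that never reaches the original goal yields positive-reward transitions under the relabelled goals, which is what makes HER effective in sparse-reward settings.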
Alternative | When to use | Tradeoff
On-policy data (PPO, A2C) | Policy-gradient methods that require fresh trajectories | No buffer needed; data is discarded after one use; less sample-efficient but avoids off-policy bias
Prioritised replay | Want faster learning on hard transitions | ~40% better sample efficiency on Atari; adds complexity, needs importance-sampling correction
Hindsight replay (HER) | Sparse-reward, goal-conditioned tasks | Relabels goals to create "free" positive transitions; only applicable to goal-conditioned setups
N-step replay | Want multi-step returns in off-policy methods | Store (s_t, a_t, R_{t:t+n}, s_{t+n}); lower-bias targets but slightly stale n-step returns
Reservoir sampling | Streaming data, unknown total size | Maintains a uniform sample from a stream; not commonly used in deep RL
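The n-step row above stores collapsed transitions; one way to collapse a sliding window of 1-step transitions into the stored tuple is sketched below (names assumed, return truncated at episode end):

```python
from collections import deque

def nstep_transition(window, gamma=0.99):
    """Collapse n consecutive 1-step transitions (s, a, r, s_next, done)
    into one n-step transition (s_t, a_t, R_{t:t+n}, s_{t+n}, done)."""
    R, done = 0.0, False
    for i, (_, _, r, _, d) in enumerate(window):
        R += (gamma ** i) * r  # discounted n-step return R_{t:t+n}
        if d:
            done = True
            break
    s0, a0 = window[0][0], window[0][1]
    s_n = window[i][3]  # next state where the window (or episode) ended
    return (s0, a0, R, s_n, done)
```

The agent keeps a `deque(maxlen=n)` of recent transitions and pushes the collapsed tuple into the replay buffer once the window is full or the episode terminates.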

Experience replay was introduced by Long-Ji Lin in 1992 as a technique for speeding up credit assignment in reinforcement learning. It remained a niche idea until Mnih et al. (2013, 2015) used it as a critical component of DQN. The DQN paper showed that without replay, deep Q-learning diverges — the temporal correlation in sequential data causes catastrophic forgetting and instability.

Prioritised experience replay (Schaul et al., 2016) improved on uniform sampling by focusing on high-TD-error transitions. It was incorporated into Rainbow (Hessel et al., 2018) as one of six complementary improvements. Modern frameworks like Reverb (DeepMind) and TorchRL provide production-quality replay buffer implementations with distributed sampling, compression, and rate limiting. The concept has also influenced supervised learning through curriculum learning and hard example mining.