Random Network Distillation (RND)
Measure novelty by how poorly a predictor network can mimic a fixed random network on each observation: high prediction error means the state is unfamiliar. RND adds this prediction error as an intrinsic reward bonus, driving the agent to seek out states it has rarely or never visited. Introduced by Burda et al. (2018) at OpenAI, RND was the first method to exceed average human performance on Montezuma’s Revenge without using demonstrations or access to the underlying game state.
Intuition
Imagine you have a friend who outputs random but consistent answers to any question — always the same random answer for the same question. You train yourself to predict your friend’s answers. For questions you’ve practiced on many times, you can predict the answer well. For questions you’ve never heard before, your prediction is terrible.
RND exploits exactly this. A target network (your friend) is a fixed, randomly initialised neural network — it maps observations to embeddings deterministically but arbitrarily. A predictor network (you) is trained to match the target’s output on observations the agent actually visits. When the agent encounters a familiar state, the predictor has been trained on similar inputs and the prediction error is low. When the agent reaches a genuinely novel state, the predictor has never seen anything like it and the error is high.
This prediction error becomes an intrinsic reward — a bonus added to the environment’s (extrinsic) reward. The agent is effectively rewarded for finding states that surprise its predictor. Over time, as the predictor trains on more data, previously novel states become familiar and their bonus decays naturally. This is crucial: unlike count-based methods, RND works in high-dimensional observation spaces (images) where explicit state counting is impossible.
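To see this decay concretely, here is a tiny self-contained sketch (PyTorch; the network sizes and random observations are made up purely for illustration): the predictor is trained repeatedly on one observation, while a second observation is never shown to it.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Tiny stand-ins for the target (frozen) and predictor (trained) networks.
target = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 16))
predictor = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 16))
for p in target.parameters():
    p.requires_grad = False

opt = torch.optim.Adam(predictor.parameters(), lr=1e-3)

visited = torch.randn(1, 8)  # an observation the agent keeps seeing
novel = torch.randn(1, 8)    # an observation it never visits

def bonus(obs):
    with torch.no_grad():
        return (predictor(obs) - target(obs)).pow(2).sum().item()

for step in range(2001):
    # Train the predictor only on the visited observation.
    loss = (predictor(visited) - target(visited)).pow(2).sum()
    opt.zero_grad()
    loss.backward()
    opt.step()
    if step % 500 == 0:
        print(f"step {step:4d}  visited bonus {bonus(visited):.4f}  novel bonus {bonus(novel):.4f}")

# The visited bonus decays towards zero; the novel bonus stays high.
```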
The elegance of RND is its simplicity compared to other curiosity methods. ICM (Intrinsic Curiosity Module) requires learning a forward dynamics model and an inverse model. Count-based methods require density estimation. RND needs only a single extra network to train, the target network is frozen from initialisation, and the whole thing is a simple regression loss. The tradeoff is that RND can still be distracted by sources of endless observation novelty (a noisy TV showing random static keeps producing unseen inputs), although because the target is a deterministic function of the observation, it avoids the stochastic-transition trap that affects forward-dynamics curiosity, and observation normalisation helps keep the bonus well behaved in practice.
Target network (fixed random weights $\bar{\theta}$, drawn once at init):

$$f_{\bar{\theta}} : \mathcal{O} \to \mathbb{R}^{k}$$

Predictor network (trained weights $\theta$):

$$\hat{f}_{\theta} : \mathcal{O} \to \mathbb{R}^{k}$$

Intrinsic reward for observation $o_t$:

$$r^{i}_{t} = \big\| \hat{f}_{\theta}(o_t) - f_{\bar{\theta}}(o_t) \big\|^{2}$$

Predictor loss (minimised during training):

$$\mathcal{L}(\theta) = \mathbb{E}_{o \sim \mathcal{D}} \big\| \hat{f}_{\theta}(o) - f_{\bar{\theta}}(o) \big\|^{2}$$

Combined reward used by the RL agent:

$$r_t = r^{e}_{t} + \beta \, r^{i}_{t}$$
where $\beta$ controls the exploration-exploitation balance. In practice, the Burda et al. paper keeps the two reward streams separate and uses two value heads rather than summing rewards:

$$V(s) = V_{E}(s) + V_{I}(s)$$

with $\gamma_{E} = 0.999$ (long horizon for sparse extrinsic reward) and $\gamma_{I} = 0.99$ (shorter horizon because intrinsic reward is non-stationary: novelty decays).
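As a sketch of what the dual-head setup can look like in code (the module structure, head names, and the 2.0/1.0 advantage coefficients below are illustrative choices based on the paper's reported hyperparameters, not an exact reproduction of its code):

```python
import torch
import torch.nn as nn

class DualHeadCritic(nn.Module):
    """One shared torso, two value heads: V_ext (gamma = 0.999) and V_int (gamma = 0.99)."""
    def __init__(self, obs_dim, hidden_dim=256):
        super().__init__()
        self.torso = nn.Sequential(nn.Linear(obs_dim, hidden_dim), nn.ReLU())
        self.v_ext = nn.Linear(hidden_dim, 1)  # extrinsic value head
        self.v_int = nn.Linear(hidden_dim, 1)  # intrinsic value head

    def forward(self, obs):
        h = self.torso(obs)
        return self.v_ext(h).squeeze(-1), self.v_int(h).squeeze(-1)

def combined_advantage(adv_ext, adv_int, c_ext=2.0, c_int=1.0):
    """Each reward stream gets its own advantage estimate (with its own gamma);
    the two are then summed with per-stream coefficients before the PPO update.
    The 2.0 / 1.0 weighting follows the extrinsic/intrinsic coefficients reported
    in the RND paper's hyperparameters (assumption: adjust for your setup)."""
    return c_ext * adv_ext + c_int * adv_int
```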
Observation normalisation (critical for stability):

$$\tilde{o} = \operatorname{clip}\!\left(\frac{o - \mu_{o}}{\sigma_{o}},\, -5,\, 5\right)$$

where $\mu_{o}$ and $\sigma_{o}$ are running estimates over the observations seen so far. The intrinsic reward is also normalised by a running estimate of its standard deviation to keep the bonus scale stable across training.
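One possible implementation of that reward normalisation, sketched here as a running standard deviation of the discounted intrinsic return (the class name and exact update rule are assumptions, not code from the paper):

```python
import torch

class IntrinsicReturnNormaliser:
    """Running std of the discounted intrinsic return, used to rescale raw
    intrinsic rewards (a sketch; names and details are assumptions)."""

    def __init__(self, gamma_int=0.99):
        self.gamma = gamma_int
        self.ret = 0.0   # discounted running return of intrinsic reward
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0    # sum of squared deviations (Welford)

    def update(self, int_rewards):
        """int_rewards: 1-D tensor of intrinsic rewards from one rollout, in time order."""
        for r in int_rewards.tolist():
            self.ret = self.gamma * self.ret + r
            self.count += 1
            delta = self.ret - self.mean
            self.mean += delta / self.count
            self.m2 += delta * (self.ret - self.mean)

    def normalise(self, int_rewards):
        std = (self.m2 / max(self.count, 1)) ** 0.5
        return int_rewards / (std + 1e-8)
```

Call `update` on each rollout's intrinsic rewards before `normalise`, otherwise the first batch is divided by a near-zero std. The implementation below uses a simpler variant that divides by a running std of the raw intrinsic rewards instead.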
```python
import torch
import torch.nn as nn


class RNDModel(nn.Module):
    """Random Network Distillation for exploration bonuses."""

    def __init__(self, obs_dim, embed_dim=128, hidden_dim=256):
        super().__init__()
        # Target: fixed random network (never trained)
        self.target = nn.Sequential(
            nn.Linear(obs_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, embed_dim),
        )
        for p in self.target.parameters():
            p.requires_grad = False

        # Predictor: trained to match target output
        self.predictor = nn.Sequential(
            nn.Linear(obs_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, embed_dim),
        )

    def forward(self, obs):
        """Returns (predictor_embed, target_embed)."""
        with torch.no_grad():
            target_embed = self.target(obs)
        predictor_embed = self.predictor(obs)
        return predictor_embed, target_embed

    def intrinsic_reward(self, obs):
        """Prediction error as exploration bonus. obs: (B, obs_dim)."""
        pred, target = self.forward(obs)
        return (pred - target).pow(2).sum(dim=-1)  # (B,)

    def predictor_loss(self, obs):
        """MSE loss to train the predictor. obs: (B, obs_dim)."""
        pred, target = self.forward(obs)
        return (pred - target).pow(2).sum(dim=-1).mean()


# ── Running normalisation for observations and rewards ────────
class RunningStats:
    """Welford's online algorithm for running mean/std."""

    def __init__(self, shape):
        self.mean = torch.zeros(shape)
        self.var = torch.ones(shape)
        self.count = 1e-4

    @torch.no_grad()
    def update(self, x):
        batch_mean = x.mean(dim=0)
        batch_var = x.var(dim=0)
        batch_count = x.shape[0]
        delta = batch_mean - self.mean
        total = self.count + batch_count
        self.mean += delta * batch_count / total
        self.var = (self.var * self.count + batch_var * batch_count
                    + delta.pow(2) * self.count * batch_count / total) / total
        self.count = total

    def normalise(self, x):
        return ((x - self.mean) / (self.var.sqrt() + 1e-8)).clamp(-5, 5)


# ── Integration with PPO training loop ───────────────────────
def compute_combined_reward(ext_reward, obs, rnd_model, reward_stats):
    """
    ext_reward: (B,) extrinsic reward from environment
    obs:        (B, obs_dim) normalised observations
    Returns: (ext_reward, int_reward) kept separate for dual value heads
    """
    int_reward = rnd_model.intrinsic_reward(obs)  # (B,)
    reward_stats.update(int_reward.unsqueeze(-1))
    int_reward = int_reward / (reward_stats.var.sqrt() + 1e-8)  # normalise
    return ext_reward, int_reward.detach()
```
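A rough sketch of how these pieces could be wired into a rollout loop; the `collect_rollout` and `ppo_update` stubs, dimensions, and loop bounds below are placeholders standing in for whatever training code you already have:

```python
import torch

# Dummy stand-ins so the loop runs end to end; in real training these would be
# your environment rollout and PPO update (both are placeholders).
obs_dim, rollout_len, num_iterations = 64, 256, 3

def collect_rollout():
    obs = torch.randn(rollout_len, obs_dim)   # fake observations
    ext_rewards = torch.zeros(rollout_len)    # sparse extrinsic reward
    return obs, ext_rewards

def ppo_update(obs, ext_r, int_r):
    pass                                      # placeholder for the PPO step

rnd = RNDModel(obs_dim)
obs_stats = RunningStats((obs_dim,))
reward_stats = RunningStats((1,))
rnd_optim = torch.optim.Adam(rnd.predictor.parameters(), lr=1e-4)

for iteration in range(num_iterations):
    obs_batch, ext_rewards = collect_rollout()
    obs_stats.update(obs_batch)
    obs_norm = obs_stats.normalise(obs_batch)

    # Keep the two reward streams separate for the dual value heads
    ext_r, int_r = compute_combined_reward(ext_rewards, obs_norm, rnd, reward_stats)

    # Train the predictor so that states it has seen stop looking novel
    rnd_loss = rnd.predictor_loss(obs_norm)
    rnd_optim.zero_grad()
    rnd_loss.backward()
    rnd_optim.step()

    ppo_update(obs_batch, ext_r, int_r)
```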
Manual Implementation

```python
import numpy as np


class RNDNumpy:
    """
    Minimal RND in pure numpy. Uses simple MLPs with ReLU.
    For understanding the algorithm — use the PyTorch version for real training.
    """

    def __init__(self, obs_dim, embed_dim=64, hidden_dim=128, lr=1e-3, seed=42):
        rng = np.random.default_rng(seed)

        # He-style init (scale sqrt(2 / fan_in)) for both networks
        def make_weights(dims):
            layers = []
            for fan_in, fan_out in zip(dims[:-1], dims[1:]):
                scale = np.sqrt(2.0 / fan_in)
                W = rng.normal(0, scale, (fan_in, fan_out)).astype(np.float32)
                b = np.zeros(fan_out, dtype=np.float32)
                layers.append((W, b))
            return layers

        arch = [obs_dim, hidden_dim, hidden_dim, embed_dim]
        self.target_layers = make_weights(arch)  # frozen
        self.pred_layers = make_weights(arch)    # trainable
        self.lr = lr

        # Running stats for observation normalisation
        self.obs_mean = np.zeros(obs_dim, dtype=np.float32)
        self.obs_var = np.ones(obs_dim, dtype=np.float32)
        self.obs_count = 1e-4

    def _forward(self, x, layers):
        """MLP forward pass with ReLU (no activation on final layer)."""
        for i, (W, b) in enumerate(layers):
            x = x @ W + b
            if i < len(layers) - 1:
                x = np.maximum(x, 0)  # ReLU
        return x

    def normalise_obs(self, obs):
        return np.clip((obs - self.obs_mean) / (np.sqrt(self.obs_var) + 1e-8), -5, 5)

    def update_obs_stats(self, obs_batch):
        """Welford-style online update for observation normalisation."""
        batch_mean = obs_batch.mean(axis=0)
        batch_var = obs_batch.var(axis=0)
        batch_count = obs_batch.shape[0]
        delta = batch_mean - self.obs_mean
        total = self.obs_count + batch_count
        self.obs_mean += delta * batch_count / total
        self.obs_var = (self.obs_var * self.obs_count + batch_var * batch_count
                        + delta ** 2 * self.obs_count * batch_count / total) / total
        self.obs_count = total

    def intrinsic_reward(self, obs):
        """
        obs: (B, obs_dim) or (obs_dim,), raw observations
        Returns: (B,) or scalar, prediction error as intrinsic reward
        """
        single = obs.ndim == 1
        if single:
            obs = obs[np.newaxis]

        obs_norm = self.normalise_obs(obs)
        target_out = self._forward(obs_norm, self.target_layers)
        pred_out = self._forward(obs_norm, self.pred_layers)
        error = np.sum((pred_out - target_out) ** 2, axis=-1)

        return error[0] if single else error

    def train_predictor(self, obs_batch):
        """
        One gradient step on the predictor network (manual backprop).
        obs_batch: (B, obs_dim), raw observations
        Returns: float, mean prediction error (loss)
        """
        obs_norm = self.normalise_obs(obs_batch)
        target_out = self._forward(obs_norm, self.target_layers)

        # Forward pass through predictor, caching activations for backprop
        activations = [obs_norm]
        x = obs_norm
        for i, (W, b) in enumerate(self.pred_layers):
            x = x @ W + b
            if i < len(self.pred_layers) - 1:
                x = np.maximum(x, 0)
            activations.append(x)

        pred_out = activations[-1]
        error = pred_out - target_out  # (B, embed_dim)
        loss = np.mean(np.sum(error ** 2, axis=-1))

        # Backprop through predictor (plain SGD on each layer)
        grad = 2.0 * error / obs_batch.shape[0]  # d(loss)/d(pred_out)
        for i in range(len(self.pred_layers) - 1, -1, -1):
            W, b = self.pred_layers[i]
            act = activations[i]

            dW = act.T @ grad
            db = grad.sum(axis=0)

            self.pred_layers[i] = (W - self.lr * dW, b - self.lr * db)

            if i > 0:
                grad = grad @ W.T
                # ReLU backward
                grad = grad * (activations[i] > 0)

        return loss
```
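A quick sanity check for the numpy version (the observations are random and purely illustrative): fit the predictor on one batch of "visited" states and compare bonuses against shifted, unseen states.

```python
import numpy as np

rnd = RNDNumpy(obs_dim=16, seed=0)
rng = np.random.default_rng(0)

seen = rng.normal(size=(64, 16)).astype(np.float32)            # states the agent visits
unseen = rng.normal(size=(64, 16)).astype(np.float32) + 5.0    # far-away states

rnd.update_obs_stats(seen)
print("before:", rnd.intrinsic_reward(seen).mean(), rnd.intrinsic_reward(unseen).mean())

for _ in range(1000):
    rnd.train_predictor(seen)

# The bonus on visited states should now be clearly lower than on the unseen ones.
print("after: ", rnd.intrinsic_reward(seen).mean(), rnd.intrinsic_reward(unseen).mean())
```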
Popular Uses

- PPO + RND (Burda et al., 2018): the original paper pairs RND with PPO using 128 parallel environments, achieving a mean score of 6,600+ on Montezuma’s Revenge — the first method to clear the first level without human demonstrations
- Large-scale exploration (OpenAI): the same paper trains on 128 parallel workers with a total of ~1.6 billion frames, showing that RND scales well with compute
- Go-Explore (Ecoffet et al., 2021): uses RND-style novelty detection as one component of its exploration strategy for hard-exploration Atari games
- MiniGrid / procedurally generated environments: RND is a standard exploration baseline in environments where reward is extremely sparse and the observation space is high-dimensional
Alternatives
| Alternative | When to use | Tradeoff |
|---|---|---|
| Epsilon-greedy | Simple discrete action spaces with shaped rewards | Zero overhead but completely undirected — fails in sparse reward settings |
| ICM (Intrinsic Curiosity Module) | Want exploration driven by dynamics prediction error | Uses forward/inverse dynamics models; robust to visual distractors via learned feature space; more complex than RND |
| Count-based exploration (hash, pseudo-counts) | Low-dimensional or discretisable state spaces | Principled (upper confidence bounds); breaks down in high-dimensional observation spaces |
| Noisy Networks (NoisyNet) | Want exploration without separate bonus | Learned noise in network weights; simpler but less directed than curiosity methods |
| Never Give Up (NGU) | Need both episodic and life-long novelty | Combines RND-like life-long novelty with episodic counts; significantly more complex |
| Agent57 | Need SOTA on all 57 Atari games | Builds on NGU with meta-controller for exploration-exploitation; very heavy |
| RE3 (Random Encoders for Efficient Exploration) | Want a simpler random-encoder bonus | Uses k-NN in random encoder space instead of prediction error; no predictor to train |
Historical Context
RND emerged from a line of work on curiosity-driven exploration. Schmidhuber (1991) first proposed using prediction error as intrinsic motivation. Pathak et al. (2017) revived this idea as ICM, which learns a feature space via inverse dynamics to make prediction error robust to irrelevant visual details. However, ICM requires two auxiliary models (forward and inverse) and can still struggle with stochastic environments.
Burda et al. (2018) made two key observations: (1) a fixed random network already provides a useful feature space — no need to learn one, and (2) prediction error in this random feature space serves as an effective novelty signal. The resulting method (RND) is dramatically simpler than ICM while being more robust. Their paper also carefully analysed the sources of prediction error, including the "noisy TV problem" where stochastic environment elements keep prediction error permanently high, and pointed out that because the target network is deterministic and lies within the predictor's model class, RND sidesteps the error caused by stochastic dynamics and model misspecification.
RND’s success on Montezuma’s Revenge (a game requiring long sequences of specific actions with no intermediate reward) was a landmark result. It demonstrated that pure curiosity, without demonstrations or handcrafted rewards, could drive deep exploration in hard problems. This influenced subsequent work like Never Give Up (Badia et al., 2020) and Agent57 (Badia et al., 2020), which combined RND-style life-long novelty with episodic novelty to achieve superhuman performance across all 57 Atari games.