
Random Network Distillation (RND)

RND measures the novelty of a state by how poorly a predictor network can mimic a fixed random network on that observation — high prediction error means the state is unfamiliar. RND adds this prediction error as an intrinsic reward bonus, driving the agent to seek out states it has rarely or never visited. Introduced by Burda et al. (2018) at OpenAI, RND was the first method to exceed average human performance on Montezuma’s Revenge without demonstrations or access to the underlying game state.

Imagine you have a friend who outputs random but consistent answers to any question — always the same random answer for the same question. You train yourself to predict your friend’s answers. For questions you’ve practiced on many times, you can predict the answer well. For questions you’ve never heard before, your prediction is terrible.

RND exploits exactly this. A target network (your friend) is a fixed, randomly initialised neural network — it maps observations to embeddings deterministically but arbitrarily. A predictor network (you) is trained to match the target’s output on observations the agent actually visits. When the agent encounters a familiar state, the predictor has been trained on similar inputs and the prediction error is low. When the agent reaches a genuinely novel state, the predictor has never seen anything like it and the error is high.

This prediction error becomes an intrinsic reward — a bonus added to the environment’s (extrinsic) reward. The agent is effectively rewarded for finding states that surprise its predictor. Over time, as the predictor trains on more data, previously novel states become familiar and their bonus decays naturally. Crucially, unlike count-based methods, RND works in high-dimensional observation spaces (such as images) where explicit state counting is impossible.

The elegance of RND is its simplicity compared to other curiosity methods. ICM (Intrinsic Curiosity Module) requires learning both a forward dynamics model and an inverse model. Count-based methods require density estimation. RND trains only a single extra network: the target is frozen at initialisation, and the objective is a plain regression loss. The tradeoff is that RND can be distracted by stochastic elements in the environment (a noisy TV showing random static is always “novel”), though observation normalisation mitigates this significantly in practice.

Target network (fixed, random weights $\theta^*$ drawn once at init):

$$f^* : \mathcal{O} \to \mathbb{R}^k, \qquad \theta^* \sim \text{init distribution (never updated)}$$

Predictor network (trained weights $\theta$):

$$\hat{f} : \mathcal{O} \to \mathbb{R}^k, \qquad \theta \text{ updated by gradient descent}$$

Intrinsic reward for observation $o_t$:

$$r_t^{\text{int}} = \bigl\lVert \hat{f}(o_t; \theta) - f^*(o_t; \theta^*) \bigr\rVert^2$$

Predictor loss (minimised during training):

$$L(\theta) = \mathbb{E}_{o_t \sim \pi}\bigl[\lVert \hat{f}(o_t; \theta) - f^*(o_t; \theta^*) \rVert^2\bigr]$$

Combined reward used by the RL agent:

$$r_t = r_t^{\text{ext}} + \beta \, r_t^{\text{int}}$$

where $\beta$ controls the exploration-exploitation balance. In practice, the Burda et al. paper uses two separate value heads rather than summing rewards:

$$V^{\text{ext}}(s_t) \approx \mathbb{E}\Bigl[\sum_{k=0}^{\infty} \gamma_{\text{ext}}^k \, r_{t+k}^{\text{ext}}\Bigr], \qquad V^{\text{int}}(s_t) \approx \mathbb{E}\Bigl[\sum_{k=0}^{\infty} \gamma_{\text{int}}^k \, r_{t+k}^{\text{int}}\Bigr]$$

with $\gamma_{\text{ext}} = 0.999$ (long horizon for sparse extrinsic reward) and $\gamma_{\text{int}} = 0.99$ (shorter horizon because intrinsic reward is non-stationary — novelty decays).
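The dual-head scheme can be sketched with two independent advantage streams, each discounted by its own $\gamma$ and then combined. This is a minimal illustration, not the paper's exact implementation: the `gae` helper, the placeholder zero value estimates, and the weighting of 1.0 (playing the role of $\beta$) are all illustrative.

```python
import numpy as np

def gae(rewards, values, gamma, lam=0.95):
    """Generalised advantage estimation for one reward stream.
    rewards: (T,); values: (T+1,) bootstrapped value estimates."""
    T = len(rewards)
    adv = np.zeros(T)
    last = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        last = delta + gamma * lam * last
        adv[t] = last
    return adv

# Two advantage streams with their own discounts, then a weighted sum.
T = 5
ext_r = np.array([0., 0., 0., 0., 1.])   # sparse extrinsic reward
int_r = np.array([.5, .4, .3, .2, .1])   # decaying intrinsic bonus
v_ext = np.zeros(T + 1)                  # placeholder value-head outputs
v_int = np.zeros(T + 1)

# Each stream uses its own gamma; the 1.0 weight stands in for beta.
adv = gae(ext_r, v_ext, gamma=0.999) + 1.0 * gae(int_r, v_int, gamma=0.99)
```

Keeping the streams separate until this final combination is what lets the intrinsic return use a shorter horizon without truncating the extrinsic one.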

Observation normalisation (critical for stability):

$$\hat{o}_t = \operatorname{clip}\!\left(\frac{o_t - \mu_{\text{running}}}{\sigma_{\text{running}}},\, -5,\, 5\right)$$

The intrinsic reward is also normalised by a running estimate of its standard deviation to keep the bonus scale stable across training.

```python
import torch
import torch.nn as nn


class RNDModel(nn.Module):
    """Random Network Distillation for exploration bonuses."""

    def __init__(self, obs_dim, embed_dim=128, hidden_dim=256):
        super().__init__()
        # Target: fixed random network (never trained)
        self.target = nn.Sequential(
            nn.Linear(obs_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, embed_dim),
        )
        for p in self.target.parameters():
            p.requires_grad = False
        # Predictor: trained to match target output
        self.predictor = nn.Sequential(
            nn.Linear(obs_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, embed_dim),
        )

    def forward(self, obs):
        """Returns (predictor_embed, target_embed)."""
        with torch.no_grad():
            target_embed = self.target(obs)
        predictor_embed = self.predictor(obs)
        return predictor_embed, target_embed

    def intrinsic_reward(self, obs):
        """Prediction error as exploration bonus. obs: (B, obs_dim)."""
        pred, target = self.forward(obs)
        return (pred - target).pow(2).sum(dim=-1)  # (B,)

    def predictor_loss(self, obs):
        """MSE loss to train the predictor. obs: (B, obs_dim)."""
        pred, target = self.forward(obs)
        return (pred - target).pow(2).sum(dim=-1).mean()


# ── Running normalisation for observations and rewards ────────
class RunningStats:
    """Welford-style online estimates of running mean/std."""

    def __init__(self, shape):
        self.mean = torch.zeros(shape)
        self.var = torch.ones(shape)
        self.count = 1e-4

    @torch.no_grad()
    def update(self, x):
        batch_mean = x.mean(dim=0)
        # Population variance (unbiased=False) so the parallel merge below is exact
        batch_var = x.var(dim=0, unbiased=False)
        batch_count = x.shape[0]
        delta = batch_mean - self.mean
        total = self.count + batch_count
        self.mean += delta * batch_count / total
        self.var = (self.var * self.count + batch_var * batch_count
                    + delta.pow(2) * self.count * batch_count / total) / total
        self.count = total

    def normalise(self, x):
        return ((x - self.mean) / (self.var.sqrt() + 1e-8)).clamp(-5, 5)


# ── Integration with PPO training loop ───────────────────────
def compute_combined_reward(ext_reward, obs, rnd_model, reward_stats):
    """
    ext_reward: (B,) extrinsic reward from environment
    obs: (B, obs_dim) normalised observations
    Returns: (ext_reward, int_reward) kept separate for dual value heads
    """
    int_reward = rnd_model.intrinsic_reward(obs)  # (B,)
    reward_stats.update(int_reward.unsqueeze(-1))
    int_reward = int_reward / (reward_stats.var.sqrt() + 1e-8)  # scale normalisation
    return ext_reward, int_reward.detach()
```
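To see the novelty signal end to end, here is a self-contained toy run. The network sizes are illustrative, and batch statistics stand in for the running normaliser: a small predictor is trained to match a frozen random target on one batch of "visited" states, so its error falls there while staying comparatively high on states drawn from a wider, unvisited region.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
obs_dim, embed_dim = 8, 4

# Frozen random target and a trainable predictor (toy sizes)
target = nn.Sequential(nn.Linear(obs_dim, 32), nn.ReLU(), nn.Linear(32, embed_dim))
for p in target.parameters():
    p.requires_grad = False
predictor = nn.Sequential(nn.Linear(obs_dim, 32), nn.ReLU(), nn.Linear(32, embed_dim))
opt = torch.optim.Adam(predictor.parameters(), lr=1e-3)

# "Visited" states, normalised and clipped as in the text
visited = torch.randn(512, obs_dim)
visited = ((visited - visited.mean(0)) / (visited.std(0) + 1e-8)).clamp(-5, 5)

losses = []
for _ in range(500):  # repeated predictor updates on the visited states
    with torch.no_grad():
        t_out = target(visited)
    loss = (predictor(visited) - t_out).pow(2).sum(-1).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    losses.append(loss.item())

# The bonus on familiar states decays as the predictor catches up
print(losses[0], losses[-1])

# States from a wider region the predictor never saw retain a larger error
novel = (3 * torch.randn(512, obs_dim)).clamp(-5, 5)
with torch.no_grad():
    novel_err = (predictor(novel) - target(novel)).pow(2).sum(-1).mean()
print(novel_err.item())
```

The gap between the final training loss and the error on the wider distribution is exactly the signal RND turns into an exploration bonus.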
```python
import numpy as np


class RNDNumpy:
    """
    Minimal RND in pure numpy. Uses simple MLPs with ReLU.
    For understanding the algorithm — use the PyTorch version for real training.
    """

    def __init__(self, obs_dim, embed_dim=64, hidden_dim=128, lr=1e-3, seed=42):
        rng = np.random.default_rng(seed)

        # He-style init (scale sqrt(2/fan_in), suited to ReLU) for both networks
        def make_weights(dims):
            layers = []
            for fan_in, fan_out in zip(dims[:-1], dims[1:]):
                scale = np.sqrt(2.0 / fan_in)
                W = rng.normal(0, scale, (fan_in, fan_out)).astype(np.float32)
                b = np.zeros(fan_out, dtype=np.float32)
                layers.append((W, b))
            return layers

        arch = [obs_dim, hidden_dim, hidden_dim, embed_dim]
        self.target_layers = make_weights(arch)  # frozen
        self.pred_layers = make_weights(arch)    # trainable
        self.lr = lr
        # Running stats for observation normalisation
        self.obs_mean = np.zeros(obs_dim, dtype=np.float32)
        self.obs_var = np.ones(obs_dim, dtype=np.float32)
        self.obs_count = 1e-4

    def _forward(self, x, layers):
        """MLP forward pass with ReLU (no activation on final layer)."""
        for i, (W, b) in enumerate(layers):
            x = x @ W + b
            if i < len(layers) - 1:
                x = np.maximum(x, 0)  # ReLU
        return x

    def normalise_obs(self, obs):
        return np.clip((obs - self.obs_mean) / (np.sqrt(self.obs_var) + 1e-8),
                       -5, 5)

    def update_obs_stats(self, obs_batch):
        """Online (parallel-merge Welford) update for observation normalisation."""
        batch_mean = obs_batch.mean(axis=0)
        batch_var = obs_batch.var(axis=0)
        batch_count = obs_batch.shape[0]
        delta = batch_mean - self.obs_mean
        total = self.obs_count + batch_count
        self.obs_mean += delta * batch_count / total
        self.obs_var = (self.obs_var * self.obs_count
                        + batch_var * batch_count
                        + delta ** 2 * self.obs_count * batch_count / total) / total
        self.obs_count = total

    def intrinsic_reward(self, obs):
        """
        obs: (B, obs_dim) or (obs_dim,), raw observations
        Returns: (B,) or scalar, prediction error as intrinsic reward
        """
        single = obs.ndim == 1
        if single:
            obs = obs[np.newaxis]
        obs_norm = self.normalise_obs(obs)
        target_out = self._forward(obs_norm, self.target_layers)
        pred_out = self._forward(obs_norm, self.pred_layers)
        error = np.sum((pred_out - target_out) ** 2, axis=-1)
        return error[0] if single else error

    def train_predictor(self, obs_batch):
        """
        One gradient step on the predictor network (manual backprop).
        obs_batch: (B, obs_dim), raw observations
        Returns: float, mean prediction error (loss)
        """
        obs_norm = self.normalise_obs(obs_batch)
        target_out = self._forward(obs_norm, self.target_layers)
        # Forward pass through predictor, caching activations for backprop
        activations = [obs_norm]
        x = obs_norm
        for i, (W, b) in enumerate(self.pred_layers):
            x = x @ W + b
            if i < len(self.pred_layers) - 1:
                x = np.maximum(x, 0)
            activations.append(x)
        pred_out = activations[-1]
        error = pred_out - target_out  # (B, embed_dim)
        loss = np.mean(np.sum(error ** 2, axis=-1))
        # Backprop through predictor, applying an SGD step layer by layer
        grad = 2.0 * error / obs_batch.shape[0]  # d(loss)/d(pred_out)
        for i in range(len(self.pred_layers) - 1, -1, -1):
            W, b = self.pred_layers[i]
            act = activations[i]
            dW = act.T @ grad
            db = grad.sum(axis=0)
            self.pred_layers[i] = (W - self.lr * dW, b - self.lr * db)
            if i > 0:
                grad = grad @ W.T
                # ReLU backward: gradient flows only where the activation was positive
                grad = grad * (activations[i] > 0)
        return loss
```
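One detail worth sanity-checking is the running-statistics merge used by `update_obs_stats` (and `RunningStats` in the PyTorch version): it is the parallel-variance formula, and streaming batches through it should reproduce the full-batch mean and variance up to the negligible bias from the tiny initial count. A standalone check (the `update` helper and data here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

mean = np.zeros(3)
var = np.ones(3)
count = 1e-4  # tiny pseudo-count, as in the classes above


def update(mean, var, count, batch):
    """Merge batch statistics into running mean/var (parallel-variance form)."""
    b_mean = batch.mean(axis=0)
    b_var = batch.var(axis=0)  # population variance
    b_count = batch.shape[0]
    delta = b_mean - mean
    total = count + b_count
    new_mean = mean + delta * b_count / total
    new_var = (var * count + b_var * b_count
               + delta ** 2 * count * b_count / total) / total
    return new_mean, new_var, total


data = rng.normal(2.0, 3.0, size=(1000, 3))
for batch in np.split(data, 10):  # stream the data in 10 batches of 100
    mean, var, count = update(mean, var, count, batch)

# Running estimates match the full-batch statistics
# (the count=1e-4 prior contributes only ~1e-7 bias).
assert np.allclose(mean, data.mean(axis=0), atol=1e-2)
assert np.allclose(var, data.var(axis=0), atol=1e-1)
```

Because the merge is exact for population variances, the result does not depend on how the stream is batched, which is what makes this normaliser safe to use across many parallel environments.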
  • PPO + RND (Burda et al., 2018): the original paper pairs RND with PPO using 128 parallel environments, achieving a mean score of 6,600+ on Montezuma’s Revenge — the first method to clear the first level without human demonstrations
  • Large-scale exploration (OpenAI): the same paper trains on 128 parallel workers with a total of ~1.6 billion frames, showing that RND scales well with compute
  • Go-Explore (Ecoffet et al., 2021): uses RND-style novelty detection as one component of its exploration strategy for hard-exploration Atari games
  • MiniGrid / procedurally generated environments: RND is a standard exploration baseline in environments where reward is extremely sparse and the observation space is high-dimensional
| Alternative | When to use | Tradeoff |
| --- | --- | --- |
| Epsilon-greedy | Simple discrete action spaces with shaped rewards | Zero overhead but completely undirected — fails in sparse-reward settings |
| ICM (Intrinsic Curiosity Module) | Want exploration driven by dynamics prediction error | Uses forward/inverse dynamics models; robust to visual distractors via learned feature space; more complex than RND |
| Count-based exploration (hash, pseudo-counts) | Low-dimensional or discretisable state spaces | Principled (upper confidence bounds); breaks down in high-dimensional observation spaces |
| Noisy Networks (NoisyNet) | Want exploration without a separate bonus | Learned noise in network weights; simpler but less directed than curiosity methods |
| Never Give Up (NGU) | Need both episodic and life-long novelty | Combines RND-like life-long novelty with episodic counts; significantly more complex |
| Agent57 | Need SOTA on all 57 Atari games | Builds on NGU with a meta-controller for exploration-exploitation; very heavy |
| RE3 (Random Encoders for Efficient Exploration) | Want a simpler random-encoder bonus | Uses k-NN in random encoder space instead of prediction error; no predictor to train |

RND emerged from a line of work on curiosity-driven exploration. Schmidhuber (1991) first proposed using prediction error as intrinsic motivation. Pathak et al. (2017) revived this idea as ICM, which learns a feature space via inverse dynamics to make prediction error robust to irrelevant visual details. However, ICM requires two auxiliary models (forward and inverse) and can still struggle with stochastic environments.

Burda et al. (2018) made two key observations: (1) a fixed random network already provides a useful feature space — no need to learn one, and (2) prediction error in this random feature space serves as an effective novelty signal. The resulting method (RND) is dramatically simpler than ICM while being more robust. Their paper also carefully analysed the “noisy TV problem” — where stochastic environment elements create permanently high prediction error — and showed that observation normalisation largely solves it.

RND’s success on Montezuma’s Revenge (a game requiring long sequences of specific actions with no intermediate reward) was a landmark result. It demonstrated that pure curiosity, without demonstrations or handcrafted rewards, could drive deep exploration in hard problems. This influenced subsequent work like Never Give Up (Badia et al., 2020) and Agent57 (Badia et al., 2020), which combined RND-style life-long novelty with episodic novelty to achieve superhuman performance across all 57 Atari games.