
Random Network Distillation (RND)

RND measures the novelty of a state by how poorly a predictor network can mimic a fixed random network on that observation — high prediction error means the state is unfamiliar. RND adds this prediction error as an intrinsic reward bonus, driving the agent to seek out states it has rarely or never visited. Introduced by Burda et al. (2018) at OpenAI, RND was the first method to exceed average human performance on Montezuma’s Revenge without demonstrations or access to the underlying game state.

Imagine you have a friend who outputs random but consistent answers to any question — always the same random answer for the same question. You train yourself to predict your friend’s answers. For questions you’ve practiced on many times, you can predict the answer well. For questions you’ve never heard before, your prediction is terrible.

RND exploits exactly this. A target network (your friend) is a fixed, randomly initialised neural network — it maps observations to embeddings deterministically but arbitrarily. A predictor network (you) is trained to match the target’s output on observations the agent actually visits. When the agent encounters a familiar state, the predictor has been trained on similar inputs and the prediction error is low. When the agent reaches a genuinely novel state, the predictor has never seen anything like it and the error is high.

This prediction error becomes an intrinsic reward — a bonus added to the environment’s (extrinsic) reward. The agent is effectively rewarded for finding states that surprise its predictor. Over time, as the predictor trains on more data, previously novel states become familiar and their bonus decays naturally. Crucially, unlike count-based methods, RND works in high-dimensional observation spaces (such as images) where explicit state counting is impossible.

The elegance of RND is its simplicity compared to other curiosity methods. ICM (Intrinsic Curiosity Module) requires learning both a forward dynamics model and an inverse model. Count-based methods require density estimation. RND trains only a single extra network: the target is frozen at initialisation, and the objective is a plain regression loss. The tradeoff is that RND can be distracted by stochastic elements in the environment (a noisy TV showing random static is always “novel”), though observation normalisation mitigates this significantly in practice.

Target network (fixed, random weights $\theta^*$ drawn once at init):

$$f^* : \mathcal{O} \to \mathbb{R}^k, \qquad \theta^* \sim \text{init distribution (never updated)}$$

Predictor network (trained weights $\theta$):

$$\hat{f} : \mathcal{O} \to \mathbb{R}^k, \qquad \theta \text{ updated by gradient descent}$$

Intrinsic reward for observation $o_t$:

$$r_t^{\text{int}} = \bigl\lVert \hat{f}(o_t; \theta) - f^*(o_t; \theta^*) \bigr\rVert^2$$

Predictor loss (minimised during training):

$$L(\theta) = \mathbb{E}_{o_t \sim \pi}\bigl[\lVert \hat{f}(o_t; \theta) - f^*(o_t; \theta^*) \rVert^2\bigr]$$

Combined reward used by the RL agent:

$$r_t = r_t^{\text{ext}} + \beta \, r_t^{\text{int}}$$

where $\beta$ controls the exploration-exploitation balance. In practice, the Burda et al. paper uses two separate value heads rather than summing rewards:

$$V^{\text{ext}}(s_t) \approx \mathbb{E}\Bigl[\sum_{k=0}^{\infty} \gamma_{\text{ext}}^k \, r_{t+k}^{\text{ext}}\Bigr], \qquad V^{\text{int}}(s_t) \approx \mathbb{E}\Bigl[\sum_{k=0}^{\infty} \gamma_{\text{int}}^k \, r_{t+k}^{\text{int}}\Bigr]$$

with $\gamma_{\text{ext}} = 0.999$ (long horizon for sparse extrinsic reward) and $\gamma_{\text{int}} = 0.99$ (shorter horizon because intrinsic reward is non-stationary — novelty decays).
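The dual-head scheme can be sketched with two independent advantage streams, each discounted by its own $\gamma$ and then combined. This is a minimal illustration, not the paper's exact implementation: the `gae` helper, the placeholder zero value estimates, and the weighting of 1.0 (playing the role of $\beta$) are all illustrative.

```python
import numpy as np

def gae(rewards, values, gamma, lam=0.95):
    """Generalised advantage estimation for one reward stream.
    rewards: (T,); values: (T+1,) bootstrapped value estimates."""
    T = len(rewards)
    adv = np.zeros(T)
    last = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        last = delta + gamma * lam * last
        adv[t] = last
    return adv

# Two advantage streams with their own discounts, then a weighted sum.
T = 5
ext_r = np.array([0., 0., 0., 0., 1.])   # sparse extrinsic reward
int_r = np.array([.5, .4, .3, .2, .1])   # decaying intrinsic bonus
v_ext = np.zeros(T + 1)                  # placeholder value-head outputs
v_int = np.zeros(T + 1)

# Each stream uses its own gamma; the 1.0 weight stands in for beta.
adv = gae(ext_r, v_ext, gamma=0.999) + 1.0 * gae(int_r, v_int, gamma=0.99)
```

Keeping the streams separate until this final combination is what lets the intrinsic return use a shorter horizon without truncating the extrinsic one.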

Observation normalisation (critical for stability):

$$\hat{o}_t = \operatorname{clip}\!\left(\frac{o_t - \mu_{\text{running}}}{\sigma_{\text{running}}},\, -5,\, 5\right)$$

The intrinsic reward is also normalised by a running estimate of its standard deviation to keep the bonus scale stable across training.

```python
import torch
import torch.nn as nn


class RNDModel(nn.Module):
    """Random Network Distillation for exploration bonuses."""

    def __init__(self, obs_dim, embed_dim=128, hidden_dim=256):
        super().__init__()
        # Target: fixed random network (never trained)
        self.target = nn.Sequential(
            nn.Linear(obs_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, embed_dim),
        )
        for p in self.target.parameters():
            p.requires_grad = False
        # Predictor: trained to match target output
        self.predictor = nn.Sequential(
            nn.Linear(obs_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, embed_dim),
        )

    def forward(self, obs):
        """Returns (predictor_embed, target_embed)."""
        with torch.no_grad():
            target_embed = self.target(obs)
        predictor_embed = self.predictor(obs)
        return predictor_embed, target_embed

    def intrinsic_reward(self, obs):
        """Prediction error as exploration bonus. obs: (B, obs_dim)."""
        pred, target = self.forward(obs)
        return (pred - target).pow(2).sum(dim=-1)  # (B,)

    def predictor_loss(self, obs):
        """MSE loss to train the predictor. obs: (B, obs_dim)."""
        pred, target = self.forward(obs)
        return (pred - target).pow(2).sum(dim=-1).mean()


# ── Running normalisation for observations and rewards ────────
class RunningStats:
    """Welford-style online estimates of running mean/std."""

    def __init__(self, shape):
        self.mean = torch.zeros(shape)
        self.var = torch.ones(shape)
        self.count = 1e-4

    @torch.no_grad()
    def update(self, x):
        batch_mean = x.mean(dim=0)
        # Population variance (unbiased=False) so the parallel merge below is exact
        batch_var = x.var(dim=0, unbiased=False)
        batch_count = x.shape[0]
        delta = batch_mean - self.mean
        total = self.count + batch_count
        self.mean += delta * batch_count / total
        self.var = (self.var * self.count + batch_var * batch_count
                    + delta.pow(2) * self.count * batch_count / total) / total
        self.count = total

    def normalise(self, x):
        return ((x - self.mean) / (self.var.sqrt() + 1e-8)).clamp(-5, 5)


# ── Integration with PPO training loop ───────────────────────
def compute_combined_reward(ext_reward, obs, rnd_model, reward_stats):
    """
    ext_reward: (B,) extrinsic reward from environment
    obs: (B, obs_dim) normalised observations
    Returns: (ext_reward, int_reward) kept separate for dual value heads
    """
    int_reward = rnd_model.intrinsic_reward(obs)  # (B,)
    reward_stats.update(int_reward.unsqueeze(-1))
    int_reward = int_reward / (reward_stats.var.sqrt() + 1e-8)  # scale normalisation
    return ext_reward, int_reward.detach()
```
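To see the novelty signal end to end, here is a self-contained toy run. The network sizes are illustrative, and batch statistics stand in for the running normaliser: a small predictor is trained to match a frozen random target on one batch of "visited" states, so its error falls there while staying comparatively high on states drawn from a wider, unvisited region.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
obs_dim, embed_dim = 8, 4

# Frozen random target and a trainable predictor (toy sizes)
target = nn.Sequential(nn.Linear(obs_dim, 32), nn.ReLU(), nn.Linear(32, embed_dim))
for p in target.parameters():
    p.requires_grad = False
predictor = nn.Sequential(nn.Linear(obs_dim, 32), nn.ReLU(), nn.Linear(32, embed_dim))
opt = torch.optim.Adam(predictor.parameters(), lr=1e-3)

# "Visited" states, normalised and clipped as in the text
visited = torch.randn(512, obs_dim)
visited = ((visited - visited.mean(0)) / (visited.std(0) + 1e-8)).clamp(-5, 5)

losses = []
for _ in range(500):  # repeated predictor updates on the visited states
    with torch.no_grad():
        t_out = target(visited)
    loss = (predictor(visited) - t_out).pow(2).sum(-1).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    losses.append(loss.item())

# The bonus on familiar states decays as the predictor catches up
print(losses[0], losses[-1])

# States from a wider region the predictor never saw retain a larger error
novel = (3 * torch.randn(512, obs_dim)).clamp(-5, 5)
with torch.no_grad():
    novel_err = (predictor(novel) - target(novel)).pow(2).sum(-1).mean()
print(novel_err.item())
```

The gap between the final training loss and the error on the wider distribution is exactly the signal RND turns into an exploration bonus.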
```python
import numpy as np


class RNDNumpy:
    """
    Minimal RND in pure numpy. Uses simple MLPs with ReLU.
    For understanding the algorithm — use the PyTorch version for real training.
    """

    def __init__(self, obs_dim, embed_dim=64, hidden_dim=128, lr=1e-3, seed=42):
        rng = np.random.default_rng(seed)

        # He-style init (scale sqrt(2/fan_in), suited to ReLU) for both networks
        def make_weights(dims):
            layers = []
            for fan_in, fan_out in zip(dims[:-1], dims[1:]):
                scale = np.sqrt(2.0 / fan_in)
                W = rng.normal(0, scale, (fan_in, fan_out)).astype(np.float32)
                b = np.zeros(fan_out, dtype=np.float32)
                layers.append((W, b))
            return layers

        arch = [obs_dim, hidden_dim, hidden_dim, embed_dim]
        self.target_layers = make_weights(arch)  # frozen
        self.pred_layers = make_weights(arch)    # trainable
        self.lr = lr
        # Running stats for observation normalisation
        self.obs_mean = np.zeros(obs_dim, dtype=np.float32)
        self.obs_var = np.ones(obs_dim, dtype=np.float32)
        self.obs_count = 1e-4

    def _forward(self, x, layers):
        """MLP forward pass with ReLU (no activation on final layer)."""
        for i, (W, b) in enumerate(layers):
            x = x @ W + b
            if i < len(layers) - 1:
                x = np.maximum(x, 0)  # ReLU
        return x

    def normalise_obs(self, obs):
        return np.clip((obs - self.obs_mean) / (np.sqrt(self.obs_var) + 1e-8),
                       -5, 5)

    def update_obs_stats(self, obs_batch):
        """Online (parallel-merge Welford) update for observation normalisation."""
        batch_mean = obs_batch.mean(axis=0)
        batch_var = obs_batch.var(axis=0)
        batch_count = obs_batch.shape[0]
        delta = batch_mean - self.obs_mean
        total = self.obs_count + batch_count
        self.obs_mean += delta * batch_count / total
        self.obs_var = (self.obs_var * self.obs_count
                        + batch_var * batch_count
                        + delta ** 2 * self.obs_count * batch_count / total) / total
        self.obs_count = total

    def intrinsic_reward(self, obs):
        """
        obs: (B, obs_dim) or (obs_dim,), raw observations
        Returns: (B,) or scalar, prediction error as intrinsic reward
        """
        single = obs.ndim == 1
        if single:
            obs = obs[np.newaxis]
        obs_norm = self.normalise_obs(obs)
        target_out = self._forward(obs_norm, self.target_layers)
        pred_out = self._forward(obs_norm, self.pred_layers)
        error = np.sum((pred_out - target_out) ** 2, axis=-1)
        return error[0] if single else error

    def train_predictor(self, obs_batch):
        """
        One gradient step on the predictor network (manual backprop).
        obs_batch: (B, obs_dim), raw observations
        Returns: float, mean prediction error (loss)
        """
        obs_norm = self.normalise_obs(obs_batch)
        target_out = self._forward(obs_norm, self.target_layers)
        # Forward pass through predictor, caching activations for backprop
        activations = [obs_norm]
        x = obs_norm
        for i, (W, b) in enumerate(self.pred_layers):
            x = x @ W + b
            if i < len(self.pred_layers) - 1:
                x = np.maximum(x, 0)
            activations.append(x)
        pred_out = activations[-1]
        error = pred_out - target_out  # (B, embed_dim)
        loss = np.mean(np.sum(error ** 2, axis=-1))
        # Backprop through predictor, applying an SGD step layer by layer
        grad = 2.0 * error / obs_batch.shape[0]  # d(loss)/d(pred_out)
        for i in range(len(self.pred_layers) - 1, -1, -1):
            W, b = self.pred_layers[i]
            act = activations[i]
            dW = act.T @ grad
            db = grad.sum(axis=0)
            self.pred_layers[i] = (W - self.lr * dW, b - self.lr * db)
            if i > 0:
                grad = grad @ W.T
                # ReLU backward: gradient flows only where the activation was positive
                grad = grad * (activations[i] > 0)
        return loss
```
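One detail worth sanity-checking is the running-statistics merge used by `update_obs_stats` (and `RunningStats` in the PyTorch version): it is the parallel-variance formula, and streaming batches through it should reproduce the full-batch mean and variance up to the negligible bias from the tiny initial count. A standalone check (the `update` helper and data here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

mean = np.zeros(3)
var = np.ones(3)
count = 1e-4  # tiny pseudo-count, as in the classes above


def update(mean, var, count, batch):
    """Merge batch statistics into running mean/var (parallel-variance form)."""
    b_mean = batch.mean(axis=0)
    b_var = batch.var(axis=0)  # population variance
    b_count = batch.shape[0]
    delta = b_mean - mean
    total = count + b_count
    new_mean = mean + delta * b_count / total
    new_var = (var * count + b_var * b_count
               + delta ** 2 * count * b_count / total) / total
    return new_mean, new_var, total


data = rng.normal(2.0, 3.0, size=(1000, 3))
for batch in np.split(data, 10):  # stream the data in 10 batches of 100
    mean, var, count = update(mean, var, count, batch)

# Running estimates match the full-batch statistics
# (the count=1e-4 prior contributes only ~1e-7 bias).
assert np.allclose(mean, data.mean(axis=0), atol=1e-2)
assert np.allclose(var, data.var(axis=0), atol=1e-1)
```

Because the merge is exact for population variances, the result does not depend on how the stream is batched, which is what makes this normaliser safe to use across many parallel environments.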
  • PPO + RND (Burda et al., 2018): the original paper pairs RND with PPO using 128 parallel environments, achieving a mean score of 6,600+ on Montezuma’s Revenge — the first method to clear the first level without human demonstrations
  • Large-scale exploration (OpenAI): the same paper trains on 128 parallel workers with a total of ~1.6 billion frames, showing that RND scales well with compute
  • Go-Explore (Ecoffet et al., 2021): uses RND-style novelty detection as one component of its exploration strategy for hard-exploration Atari games
  • MiniGrid / procedurally generated environments: RND is a standard exploration baseline in environments where reward is extremely sparse and the observation space is high-dimensional
| Alternative | When to use | Tradeoff |
| --- | --- | --- |
| Epsilon-greedy | Simple discrete action spaces with shaped rewards | Zero overhead but completely undirected — fails in sparse-reward settings |
| ICM (Intrinsic Curiosity Module) | Want exploration driven by dynamics prediction error | Uses forward/inverse dynamics models; robust to visual distractors via learned feature space; more complex than RND |
| Count-based exploration (hash, pseudo-counts) | Low-dimensional or discretisable state spaces | Principled (upper confidence bounds); breaks down in high-dimensional observation spaces |
| Noisy Networks (NoisyNet) | Want exploration without a separate bonus | Learned noise in network weights; simpler but less directed than curiosity methods |
| Never Give Up (NGU) | Need both episodic and life-long novelty | Combines RND-like life-long novelty with episodic counts; significantly more complex |
| Agent57 | Need SOTA on all 57 Atari games | Builds on NGU with a meta-controller for exploration-exploitation; very heavy |
| RE3 (Random Encoders for Efficient Exploration) | Want a simpler random-encoder bonus | Uses k-NN in random encoder space instead of prediction error; no predictor to train |

RND emerged from a line of work on curiosity-driven exploration. Schmidhuber (1991) first proposed using prediction error as intrinsic motivation. Pathak et al. (2017) revived this idea as ICM, which learns a feature space via inverse dynamics to make prediction error robust to irrelevant visual details. However, ICM requires two auxiliary models (forward and inverse) and can still struggle with stochastic environments.

Burda et al. (2018) made two key observations: (1) a fixed random network already provides a useful feature space — no need to learn one, and (2) prediction error in this random feature space serves as an effective novelty signal. The resulting method (RND) is dramatically simpler than ICM while being more robust. Their paper also carefully analysed the “noisy TV problem” — where stochastic environment elements create permanently high prediction error — and showed that observation normalisation largely solves it.

RND’s success on Montezuma’s Revenge (a game requiring long sequences of specific actions with no intermediate reward) was a landmark result. It demonstrated that pure curiosity, without demonstrations or handcrafted rewards, could drive deep exploration in hard problems. This influenced subsequent work like Never Give Up (Badia et al., 2020) and Agent57 (Badia et al., 2020), which combined RND-style life-long novelty with episodic novelty to achieve superhuman performance across all 57 Atari games.