Dropout

Randomly zeros activations during training with probability $p$, then scales surviving activations by $1/(1-p)$. Prevents co-adaptation of neurons by forcing the network to learn redundant representations. Standard in MLPs and older architectures; largely replaced by other regularisation in modern transformers.

Imagine a team where one member is so strong that everyone else stops contributing — they just relay to the star player. Dropout randomly benches players each round, forcing everyone to develop independent skills. At test time, the full team plays together and is stronger for it.

The rescaling by $1/(1-p)$ is critical: if you drop 50% of neurons during training, the surviving ones produce outputs that are 2x too small at test time (since all neurons are now active). Multiplying by $1/(1-p) = 2$ during training keeps the expected value of each activation identical at train and test time. This is called “inverted dropout” and is what every modern framework implements.
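This expectation-preserving property is easy to check numerically. The following sketch (my own illustration, not from the original text) draws a Bernoulli mask over a large vector of ones and compares the mean before and after inverted dropout:

```python
import numpy as np

rng = np.random.default_rng(0)
h = np.ones(1_000_000)            # constant activations, so the mean is clean
p = 0.5                           # drop probability

mask = (rng.random(h.shape) >= p).astype(h.dtype)
h_dropped = (h * mask) / (1 - p)  # inverted dropout: survivors scaled by 2

print(h.mean())                   # 1.0
print(h_dropped.mean())           # ≈ 1.0 — equal in expectation
```

With a million samples the empirical mean lands within a fraction of a percent of 1.0, which is exactly why no rescaling is needed at test time.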

A useful mental model: dropout trains an implicit ensemble of $2^n$ subnetworks (all possible masks for $n$ neurons), with shared weights. At test time, you’re using the geometric average of all these subnetworks. This ensemble interpretation explains why dropout reduces overfitting — it’s variance reduction through averaging.
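For a single linear layer the averaging claim can be made exact under the arithmetic mean: the probability-weighted average over all $2^n$ masked outputs equals the full-network output, since each mask element has expectation $1-p$ and inverted dropout divides it back out. A tiny enumeration (my own sketch, with illustrative sizes $n=4$, $p=0.5$) confirms this:

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
n, p = 4, 0.5
h = rng.standard_normal(n)        # activations
W = rng.standard_normal((2, n))   # a small linear layer

outs = []
for bits in itertools.product([0, 1], repeat=n):   # all 2^n masks
    m = np.array(bits, dtype=float)
    prob = (1 - p) ** m.sum() * p ** (n - m.sum())  # probability of this mask
    outs.append(prob * (W @ (m * h) / (1 - p)))     # inverted-dropout output
avg = np.sum(outs, axis=0)

print(np.allclose(avg, W @ h))    # True: the ensemble average is the full net
```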

Forward pass (training):

$$\mathbf{m} \sim \text{Bernoulli}(1 - p), \quad \mathbf{m} \in \{0, 1\}^d$$

$$\tilde{\mathbf{h}} = \frac{\mathbf{m} \odot \mathbf{h}}{1 - p}$$

where $\mathbf{h}$ is the activation vector, $p$ is the drop probability, and $\odot$ is element-wise multiplication.

Forward pass (inference): no-op. $\tilde{\mathbf{h}} = \mathbf{h}$.

Gradient (training): the mask gates gradients too:

$$\frac{\partial \mathcal{L}}{\partial \mathbf{h}} = \frac{\mathbf{m}}{1 - p} \odot \frac{\partial \mathcal{L}}{\partial \tilde{\mathbf{h}}}$$

Dropped neurons receive zero gradient — they don’t learn on that step.
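This gating is directly observable in autograd. The following sketch (my construction, not from the original text) backprops a sum through `nn.Dropout` and checks that dropped positions get exactly zero gradient while survivors get $1/(1-p) = 2$:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
h = torch.randn(4, 8, requires_grad=True)
drop = nn.Dropout(p=0.5)          # modules default to training mode

out = drop(h)
out.sum().backward()

# h.grad is m/(1-p): 0.0 where dropped, 2.0 where kept (since p = 0.5).
print(h.grad)
```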

```python
import torch
import torch.nn as nn

# ── Standard usage ──────────────────────────────────────────────
layer = nn.Dropout(p=0.5)          # p = probability of ZEROING
h = torch.randn(32, 512)           # (B, d)
out = layer(h)                     # (B, d) — same shape, some zeros

# WARNING: Dropout behaves differently in train vs eval mode.
# Always call model.train() / model.eval() — forgetting this is a
# very common bug that causes silent accuracy drops at inference.
model = nn.Sequential(nn.Linear(512, 512), nn.Dropout(p=0.5))  # any nn.Module
model.train()                      # dropout active
model.eval()                       # dropout disabled (identity)

# ── Dropout on attention/sequences ──────────────────────────────
# For 3D+ tensors, nn.Dropout zeros individual elements.
# For spatial dropout (drop entire channels), use Dropout2d.
attn = torch.randn(32, 8, 64, 64)  # (B, heads, T, T)
drop = nn.Dropout(p=0.1)
attn = drop(attn)                  # zeros individual entries

# ── Dropout in transformer blocks ───────────────────────────────
# Modern transformers (GPT-2, BERT) apply dropout at:
#   1. the attention weights (before multiplying with V)
#   2. the output projection of each sublayer
#   3. the output of each FFN sublayer
# Typical p = 0.1. Many modern LLMs (LLaMA, Mistral) use p = 0.0.
```
```python
import numpy as np

def dropout_forward(h, p, training=True, rng=None):
    """
    Inverted dropout.
    h: (B, d) activations
    p: drop probability (0.0 = no dropout)
    training: if False, return h unchanged
    rng: optional np.random.Generator for reproducibility
    """
    if not training or p == 0.0:
        return h, None
    rng = rng or np.random.default_rng()
    mask = (rng.random(h.shape) >= p).astype(h.dtype)  # (B, d) of 0s and 1s
    return (h * mask) / (1 - p), mask                  # (B, d) scaled

def dropout_backward(grad_out, mask, p):
    """Gradient flows only through surviving neurons."""
    return (grad_out * mask) / (1 - p)                 # (B, d)
```
  • MLPs and fully-connected layers: the original and still most common application — p=0.5 for hidden layers, lower for input
  • Transformer attention (GPT-2, BERT): p=0.1 dropout on attention weights and sublayer outputs, though modern LLMs often set p=0.0
  • CNN classifiers (VGG, AlexNet): dropout before final FC layers, now mostly replaced by batch norm and data augmentation
  • Variational dropout (Kingma et al. 2015): learned per-weight drop rates, connects dropout to variational inference
  • DropConnect (Wan et al. 2013): drops weights instead of activations — more general but rarely used in practice
  • Monte Carlo dropout (Gal & Ghahramani 2016): keep dropout ON at test time, run multiple forward passes to estimate uncertainty
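The Monte Carlo dropout idea in the last bullet is simple to sketch: keep dropout active at test time and average several stochastic forward passes. This is my own minimal version (the toy model and the number of passes `T = 50` are illustrative assumptions):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(
    nn.Linear(16, 64), nn.ReLU(),
    nn.Dropout(p=0.5),
    nn.Linear(64, 1),
)
model.train()                      # deliberately keep dropout ON at test time
x = torch.randn(8, 16)             # (B, d_in)

with torch.no_grad():
    samples = torch.stack([model(x) for _ in range(50)])  # (T, B, 1)

mean = samples.mean(dim=0)         # predictive mean, (B, 1)
std = samples.std(dim=0)           # per-example uncertainty estimate, (B, 1)
print(mean.shape, std.shape)
```

The spread of `std` across examples is the uncertainty signal; a deterministic `model.eval()` pass would give zero spread.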
| Alternative | When to use | Tradeoff |
|---|---|---|
| Weight decay | Always applicable, modern default | Regularises weights, not activations; complementary to dropout, not a replacement |
| Batch normalisation | CNNs | Provides implicit regularisation through mini-batch noise; largely replaced dropout in vision models |
| Data augmentation | When you can define meaningful augmentations | Regularises via input diversity; no train/test discrepancy |
| DropPath / Stochastic Depth | Deep residual networks, ViT | Drops entire residual blocks instead of individual neurons; better for very deep models |
| Label smoothing | Classification | Regularises the target side rather than activations; addresses overconfidence specifically |
| Early stopping | Always applicable | Stops before overfitting; no architectural change, but requires validation monitoring |
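The DropPath / stochastic depth row above can be sketched in a few lines. This is my own minimal version (the function name and shapes are illustrative, not a library API); note it uses the same inverted-scaling convention as dropout, but draws one Bernoulli per *sample* and zeros the whole residual branch:

```python
import torch

def drop_path(x: torch.Tensor, p: float, training: bool) -> torch.Tensor:
    """Drop the entire residual branch per sample, scaled by 1/(1-p)."""
    if not training or p == 0.0:
        return x
    keep = 1.0 - p
    # One draw per sample, broadcast over all remaining dims.
    shape = (x.shape[0],) + (1,) * (x.ndim - 1)
    mask = torch.bernoulli(torch.full(shape, keep, device=x.device))
    return x * mask / keep

# In a residual block: y = x + drop_path(branch(x), p=0.1, training=True)
x = torch.randn(4, 3, 8, 8)        # (B, C, H, W)
y = drop_path(x, p=0.5, training=True)
```

Each sample's branch output is either entirely zeroed or scaled by $1/(1-p)$, so the expected contribution of the branch is unchanged, exactly as with per-neuron dropout.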

Dropout was introduced by Hinton et al. (2012) and formalised by Srivastava et al. (2014, “Dropout: A Simple Way to Prevent Neural Networks from Overfitting”). It was motivated by biological intuition — neurons in the brain don’t reliably fire together — and by the idea that preventing co-adaptation would force robust feature learning. It was transformative for its era, making deep networks trainable on small datasets without severe overfitting.

The “inverted dropout” trick (scaling during training rather than at test time) became standard because it means the test-time model is unchanged — no special handling needed. Dropout’s importance has waned in modern architectures: batch normalisation, better optimisers, and massive datasets have reduced the need for explicit activation noise. Most modern LLMs (LLaMA, Mistral, Gemma) use zero dropout, relying on weight decay, data scale, and other regularisers instead.