Dropout

Randomly zeros activations during training with probability $p$, then scales surviving activations by $1/(1-p)$. Prevents co-adaptation of neurons by forcing the network to learn redundant representations. Standard in MLPs and older architectures; largely replaced by other regularisation in modern transformers.

Imagine a team where one member is so strong that everyone else stops contributing — they just relay to the star player. Dropout randomly benches players each round, forcing everyone to develop independent skills. At test time, the full team plays together and is stronger for it.

The rescaling by $1/(1-p)$ is critical: if you drop 50% of neurons during training, the surviving ones produce outputs that are 2x too small at test time (since all neurons are now active). Multiplying by $1/(1-p) = 2$ during training keeps the expected value of each activation identical at train and test time. This is called “inverted dropout” and is what every modern framework implements.
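This expectation-preserving property is easy to check numerically. The following sketch (my own illustration, not from the original text) draws a Bernoulli mask over a large vector of ones and compares the mean before and after inverted dropout:

```python
import numpy as np

rng = np.random.default_rng(0)
h = np.ones(1_000_000)            # constant activations, so the mean is clean
p = 0.5                           # drop probability

mask = (rng.random(h.shape) >= p).astype(h.dtype)
h_dropped = (h * mask) / (1 - p)  # inverted dropout: survivors scaled by 2

print(h.mean())                   # 1.0
print(h_dropped.mean())           # ≈ 1.0 — equal in expectation
```

With a million samples the empirical mean lands within a fraction of a percent of 1.0, which is exactly why no rescaling is needed at test time.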

A useful mental model: dropout trains an implicit ensemble of $2^n$ subnetworks (all possible masks for $n$ neurons), with shared weights. At test time, you’re using the geometric average of all these subnetworks. This ensemble interpretation explains why dropout reduces overfitting — it’s variance reduction through averaging.
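For a single linear layer the averaging claim can be made exact under the arithmetic mean: the probability-weighted average over all $2^n$ masked outputs equals the full-network output, since each mask element has expectation $1-p$ and inverted dropout divides it back out. A tiny enumeration (my own sketch, with illustrative sizes $n=4$, $p=0.5$) confirms this:

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
n, p = 4, 0.5
h = rng.standard_normal(n)        # activations
W = rng.standard_normal((2, n))   # a small linear layer

outs = []
for bits in itertools.product([0, 1], repeat=n):   # all 2^n masks
    m = np.array(bits, dtype=float)
    prob = (1 - p) ** m.sum() * p ** (n - m.sum())  # probability of this mask
    outs.append(prob * (W @ (m * h) / (1 - p)))     # inverted-dropout output
avg = np.sum(outs, axis=0)

print(np.allclose(avg, W @ h))    # True: the ensemble average is the full net
```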

Forward pass (training):

$$\mathbf{m} \sim \text{Bernoulli}(1 - p), \quad \mathbf{m} \in \{0, 1\}^d$$

$$\tilde{\mathbf{h}} = \frac{\mathbf{m} \odot \mathbf{h}}{1 - p}$$

where $\mathbf{h}$ is the activation vector, $p$ is the drop probability, and $\odot$ is element-wise multiplication.

Forward pass (inference): no-op. $\tilde{\mathbf{h}} = \mathbf{h}$.

Gradient (training): the mask gates gradients too:

$$\frac{\partial \mathcal{L}}{\partial \mathbf{h}} = \frac{\mathbf{m}}{1 - p} \odot \frac{\partial \mathcal{L}}{\partial \tilde{\mathbf{h}}}$$

Dropped neurons receive zero gradient — they don’t learn on that step.
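This gating is directly observable in autograd. The following sketch (my construction, not from the original text) backprops a sum through `nn.Dropout` and checks that dropped positions get exactly zero gradient while survivors get $1/(1-p) = 2$:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
h = torch.randn(4, 8, requires_grad=True)
drop = nn.Dropout(p=0.5)          # modules default to training mode

out = drop(h)
out.sum().backward()

# h.grad is m/(1-p): 0.0 where dropped, 2.0 where kept (since p = 0.5).
print(h.grad)
```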

```python
import torch
import torch.nn as nn

# ── Standard usage ──────────────────────────────────────────────
layer = nn.Dropout(p=0.5)          # p = probability of ZEROING
h = torch.randn(32, 512)           # (B, d)
out = layer(h)                     # (B, d) — same shape, some zeros

# WARNING: Dropout behaves differently in train vs eval mode.
# Always call model.train() / model.eval() — forgetting this is a
# very common bug that causes silent accuracy drops at inference.
model = nn.Sequential(nn.Linear(512, 512), nn.Dropout(p=0.5))  # any nn.Module
model.train()                      # dropout active
model.eval()                       # dropout disabled (identity)

# ── Dropout on attention/sequences ──────────────────────────────
# For 3D+ tensors, nn.Dropout zeros individual elements.
# For spatial dropout (drop entire channels), use Dropout2d.
attn = torch.randn(32, 8, 64, 64)  # (B, heads, T, T)
drop = nn.Dropout(p=0.1)
attn = drop(attn)                  # zeros individual entries

# ── Dropout in transformer blocks ───────────────────────────────
# Modern transformers (GPT-2, BERT) apply dropout at:
#   1. the attention weights (before multiplying with V)
#   2. the output projection of each sublayer
#   3. the output of each FFN sublayer
# Typical p = 0.1. Many modern LLMs (LLaMA, Mistral) use p = 0.0.
```
```python
import numpy as np

def dropout_forward(h, p, training=True, rng=None):
    """
    Inverted dropout.
    h: (B, d) activations
    p: drop probability (0.0 = no dropout)
    training: if False, return h unchanged
    rng: optional np.random.Generator for reproducibility
    """
    if not training or p == 0.0:
        return h, None
    rng = rng or np.random.default_rng()
    mask = (rng.random(h.shape) >= p).astype(h.dtype)  # (B, d) of 0s and 1s
    return (h * mask) / (1 - p), mask                  # (B, d) scaled

def dropout_backward(grad_out, mask, p):
    """Gradient flows only through surviving neurons."""
    return (grad_out * mask) / (1 - p)                 # (B, d)
```
  • MLPs and fully-connected layers: the original and still most common application — p=0.5 for hidden layers, lower for input
  • Transformer attention (GPT-2, BERT): p=0.1 dropout on attention weights and sublayer outputs, though modern LLMs often set p=0.0
  • CNN classifiers (VGG, AlexNet): dropout before final FC layers, now mostly replaced by batch norm and data augmentation
  • Variational dropout (Kingma et al. 2015): learned per-weight drop rates, connects dropout to variational inference
  • DropConnect (Wan et al. 2013): drops weights instead of activations — more general but rarely used in practice
  • Monte Carlo dropout (Gal & Ghahramani 2016): keep dropout ON at test time, run multiple forward passes to estimate uncertainty
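The Monte Carlo dropout idea in the last bullet is simple to sketch: keep dropout active at test time and average several stochastic forward passes. This is my own minimal version (the toy model and the number of passes `T = 50` are illustrative assumptions):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(
    nn.Linear(16, 64), nn.ReLU(),
    nn.Dropout(p=0.5),
    nn.Linear(64, 1),
)
model.train()                      # deliberately keep dropout ON at test time
x = torch.randn(8, 16)             # (B, d_in)

with torch.no_grad():
    samples = torch.stack([model(x) for _ in range(50)])  # (T, B, 1)

mean = samples.mean(dim=0)         # predictive mean, (B, 1)
std = samples.std(dim=0)           # per-example uncertainty estimate, (B, 1)
print(mean.shape, std.shape)
```

The spread of `std` across examples is the uncertainty signal; a deterministic `model.eval()` pass would give zero spread.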
| Alternative | When to use | Tradeoff |
|---|---|---|
| Weight decay | Always applicable, modern default | Regularises weights, not activations; complementary to dropout, not a replacement |
| Batch normalisation | CNNs | Provides implicit regularisation through mini-batch noise; largely replaced dropout in vision models |
| Data augmentation | When you can define meaningful augmentations | Regularises via input diversity; no train/test discrepancy |
| DropPath / Stochastic Depth | Deep residual networks, ViT | Drops entire residual blocks instead of individual neurons; better for very deep models |
| Label smoothing | Classification | Regularises the target side rather than activations; addresses overconfidence specifically |
| Early stopping | Always applicable | Stops before overfitting; no architectural change, but requires validation monitoring |
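The DropPath / stochastic depth row above can be sketched in a few lines. This is my own minimal version (the function name and shapes are illustrative, not a library API); note it uses the same inverted-scaling convention as dropout, but draws one Bernoulli per *sample* and zeros the whole residual branch:

```python
import torch

def drop_path(x: torch.Tensor, p: float, training: bool) -> torch.Tensor:
    """Drop the entire residual branch per sample, scaled by 1/(1-p)."""
    if not training or p == 0.0:
        return x
    keep = 1.0 - p
    # One draw per sample, broadcast over all remaining dims.
    shape = (x.shape[0],) + (1,) * (x.ndim - 1)
    mask = torch.bernoulli(torch.full(shape, keep, device=x.device))
    return x * mask / keep

# In a residual block: y = x + drop_path(branch(x), p=0.1, training=True)
x = torch.randn(4, 3, 8, 8)        # (B, C, H, W)
y = drop_path(x, p=0.5, training=True)
```

Each sample's branch output is either entirely zeroed or scaled by $1/(1-p)$, so the expected contribution of the branch is unchanged, exactly as with per-neuron dropout.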

Dropout was introduced by Hinton et al. (2012) and formalised by Srivastava et al. (2014, “Dropout: A Simple Way to Prevent Neural Networks from Overfitting”). It was motivated by biological intuition — neurons in the brain don’t reliably fire together — and by the idea that preventing co-adaptation would force robust feature learning. It was transformative for its era, making deep networks trainable on small datasets without severe overfitting.

The “inverted dropout” trick (scaling during training rather than at test time) became standard because it means the test-time model is unchanged — no special handling needed. Dropout’s importance has waned in modern architectures: batch normalisation, better optimisers, and massive datasets have reduced the need for explicit activation noise. Most modern LLMs (LLaMA, Mistral, Gemma) use zero dropout, relying on weight decay, data scale, and other regularisers instead.