
Vanishing Gradients

Gradients shrink exponentially as they propagate backward through layers, starving early layers of learning signal. The deeper the network, the worse it gets — deep networks without mitigation effectively only train their last few layers.

Think of a game of telephone where each person whispers 90% of the original message. After 10 people, only 35% survives. After 50 people, less than 1%. That’s what happens to gradients in a deep network: each layer multiplies the gradient by its local Jacobian, and if those multipliers are consistently less than 1, the signal decays exponentially.

The culprit is usually the activation function. Sigmoid and tanh both squash their inputs into a bounded range, and their derivatives are always less than 1 (sigmoid’s max derivative is 0.25). The activation term alone therefore shrinks the gradient by at least 4x per sigmoid layer, unless the weights are large enough to compensate. After 10 layers of sigmoid activations, the gradient reaching the first layer is on the order of $10^{-6}$ — effectively zero.
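A quick back-of-the-envelope check of those figures, in plain Python (no framework needed):

```python
# Upper bound from the activations alone: sigmoid'(z) <= 0.25, so ten
# sigmoid layers multiply the gradient by at most 0.25**10 (weight
# matrices ignored for this bound).
activation_bound = 0.25 ** 10
print(activation_bound)  # ~9.5e-07, i.e. on the order of 1e-6

# The telephone-game analogy uses the same arithmetic:
print(0.9 ** 10)  # ~0.349, about 35% survives after 10 people
print(0.9 ** 50)  # ~0.005, under 1% after 50
```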

This isn’t just a “slow training” problem. Vanishing gradients create a qualitative failure: the first layers freeze at near-random values while the last layers learn on top of garbage features. The network might still reduce training loss (the last layers are learning), but it never develops the hierarchical representations that make deep networks powerful.

A common misunderstanding is that vanishing gradients only affect RNNs. They affect ANY deep network — RNNs just suffer more because unrolling through time creates very deep effective networks (one “layer” per timestep, often hundreds).

For a network with $L$ layers, the gradient of the loss with respect to the first layer’s parameters involves a chain of matrix products:

$$\frac{\partial \mathcal{L}}{\partial W_1} = \frac{\partial \mathcal{L}}{\partial h_L} \left( \prod_{l=2}^{L} \frac{\partial h_l}{\partial h_{l-1}} \right) \frac{\partial h_1}{\partial W_1}$$

Each Jacobian is $\frac{\partial h_l}{\partial h_{l-1}} = \text{diag}(\sigma'(z_l)) \, W_l$, where $\sigma'$ is the activation derivative. If the spectral norms of these Jacobians are consistently $< 1$, the product shrinks exponentially with depth $L$.

For sigmoid: $\sigma'(z) \leq 0.25$, so each layer shrinks the gradient by at least 4x unless the weight matrix has spectral norm greater than 4.
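The shrinking product can be observed directly. A minimal sketch (assumes NumPy; the 50-layer depth, width 64, and $N(0, 1/\text{width})$ weight scale are hypothetical toy settings):

```python
import numpy as np

# Toy version of the Jacobian chain above: a 50-layer sigmoid MLP of
# width 64 with N(0, 1/width) weights.
rng = np.random.default_rng(0)
width, depth = 64, 50

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

h = rng.standard_normal(width)  # h_0: random input
J = np.eye(width)               # running product of layer Jacobians
for _ in range(depth):
    W = rng.standard_normal((width, width)) / np.sqrt(width)
    z = W @ h
    # dh_l/dh_{l-1} = diag(sigmoid'(z_l)) @ W_l, left-multiplied onto the chain
    J = np.diag(sigmoid(z) * (1.0 - sigmoid(z))) @ W @ J
    h = sigmoid(z)

# Norm of dh_L/dh_0: every factor has spectral norm <= 0.25 * ||W||,
# so the product collapses with depth.
print(np.linalg.norm(J))
```

With these settings the printed norm is many orders of magnitude below 1, which is exactly the exponential decay the product formula predicts.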

Symptoms to look for:

  • Gradient norms per layer decrease exponentially from output to input — plot grad.norm() for each layer’s parameters and you’ll see orders-of-magnitude differences
  • Early layer weights barely change between epochs while later layers update normally
  • Loss decreases slowly then plateaus at a value well above what a shallower network achieves
  • Activations in early layers look random and don’t develop meaningful features even after extended training
  • In RNNs: the model fails to capture long-range dependencies — it can predict the next word but not maintain context from 50 tokens ago
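The first symptom takes only a few lines to check. A diagnostic sketch (assumes PyTorch; the 20-layer sigmoid MLP, width 64, and batch size are arbitrary choices for illustration):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Deep sigmoid MLP: the worst case described above.
layers = []
for _ in range(20):
    layers += [nn.Linear(64, 64), nn.Sigmoid()]
model = nn.Sequential(*layers, nn.Linear(64, 1))

# One forward/backward pass on random data.
loss = model(torch.randn(32, 64)).pow(2).mean()
loss.backward()

# Per-layer gradient norms, input side first: expect them to fall by
# orders of magnitude from the output layer back to the first layer.
for i, m in enumerate(model):
    if isinstance(m, nn.Linear):
        print(f"layer {i:2d}: grad norm = {m.weight.grad.norm().item():.3e}")
```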
Where it appears in this series:

  • MLP training (nn-training/): deep MLPs with sigmoid/tanh fail to train — motivates ReLU activation and proper weight initialisation
  • Transformer (transformer/): without residual connections, a 12-layer transformer would suffer severe gradient vanishing — pre-norm + residuals keep gradients flowing
  • GANs (gans/): deep discriminators can suffer vanishing gradients, especially with saturating losses — one motivation for using spectral normalisation
  • Policy gradient (policy-gradient/): when the policy network is deep, vanishing gradients interact with high-variance gradient estimates to make training nearly impossible
  • RNNs (not yet in series): the original context where the problem was identified — LSTM gates were designed specifically to create gradient highways through time
| Solution | Mechanism | Where documented |
| --- | --- | --- |
| ReLU / GELU / SiLU | Derivative is 1 (or near 1) for positive inputs — no shrinkage | atomic-concepts/activation-functions/relu.md, gelu.md |
| Residual connections | Gradient has a direct path that skips layers: $\frac{\partial}{\partial x}(x + f(x)) = 1 + f'(x)$ | atomic-concepts/architectural-primitives/residual-connections.md |
| Pre-norm (LayerNorm before attention/FFN) | Normalises inputs to each sublayer, stabilising Jacobian magnitudes | transformer/ |
| Proper weight initialisation | Sets initial weight variance so signal magnitude is preserved layer-to-layer | atomic-concepts/optimisation-primitives/weight-initialisation.md |
| LSTM / GRU gates | Multiplicative gates that can pass gradients through time unchanged | (not yet in series) |
| Gradient clipping | Doesn’t fix vanishing directly, but prevents the oscillation that sometimes accompanies attempts to fix it | atomic-concepts/optimisation-primitives/gradient-clipping.md |
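The residual-connection mechanism can be checked numerically with the same kind of toy Jacobian product (assumes NumPy; sigmoid activation, width 64, depth 50, and the $N(0, 1/\text{width})$ weight scale are all hypothetical settings):

```python
import numpy as np

rng = np.random.default_rng(0)
width, depth = 64, 50

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

h_plain = rng.standard_normal(width)
h_res = h_plain.copy()
J_plain = np.eye(width)  # Jacobian product for h_l = sigmoid(W_l h_{l-1})
J_res = np.eye(width)    # Jacobian product for h_l = h_{l-1} + sigmoid(W_l h_{l-1})
for _ in range(depth):
    W = rng.standard_normal((width, width)) / np.sqrt(width)

    z = W @ h_plain
    J_plain = np.diag(sigmoid(z) * (1 - sigmoid(z))) @ W @ J_plain
    h_plain = sigmoid(z)

    z = W @ h_res
    # d/dh (h + f(h)) = I + diag(sigmoid'(z)) W: the identity term keeps
    # a direct gradient path open through every layer.
    J_res = (np.eye(width) + np.diag(sigmoid(z) * (1 - sigmoid(z))) @ W) @ J_res
    h_res = h_res + sigmoid(z)

print(np.linalg.norm(J_plain), np.linalg.norm(J_res))
```

The plain chain collapses toward zero while the residual chain stays orders of magnitude larger, which is the $1 + f'(x)$ term in the table at work.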

Hochreiter identified the vanishing gradient problem in his 1991 diploma thesis, and Hochreiter & Schmidhuber formalised it in their 1997 LSTM paper. The problem was already known informally — training deep networks had been frustratingly difficult since the 1980s — but Hochreiter gave it a precise mathematical characterisation. For over a decade, the consensus was that very deep networks simply couldn’t be trained, which is why most practical work used shallow architectures (1-3 hidden layers). The modern era of deep learning was unlocked by the combination of ReLU (Nair & Hinton, 2010), proper initialisation (Glorot & Bengio, 2010; He et al., 2015), and residual connections (He et al., 2015) — each attacking a different aspect of the gradient flow problem.