Vanishing Gradients
Gradients shrink exponentially as they propagate backward through layers, starving early layers of learning signal. The deeper the network, the worse it gets — deep networks without mitigation effectively only train their last few layers.
Intuition
Think of a game of telephone where each person whispers 90% of the original message. After 10 people, only 35% survives. After 50 people, less than 1%. That’s what happens to gradients in a deep network: each layer multiplies the gradient by its local Jacobian, and if those multipliers are consistently less than 1, the signal decays exponentially.
The culprit is usually the activation function. Sigmoid and tanh both squash their inputs into a bounded range, and their derivatives are always less than 1 (sigmoid’s max derivative is 0.25). So each sigmoid layer shrinks the gradient by a factor of at least 4. After 10 layers of sigmoid activations, the gradient reaching the first layer is on the order of $0.25^{10} \approx 10^{-6}$ — effectively zero.
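That arithmetic is easy to verify. A minimal sketch in plain Python, using only the worst-case sigmoid derivative (real networks also multiply by weight matrices, which this deliberately ignores):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_deriv(x):
    s = sigmoid(x)
    return s * (1.0 - s)

# sigmoid' peaks at x = 0 with value exactly 0.25
best_case = sigmoid_deriv(0.0)
print(best_case)  # 0.25

# Even in this best case, the product over layers collapses fast
for depth in (1, 5, 10):
    print(depth, best_case ** depth)
# at depth 10 the factor is 0.25**10 ≈ 9.5e-07
```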
This isn’t just a “slow training” problem. Vanishing gradients create a qualitative failure: the first layers freeze at near-random values while the last layers learn on top of garbage features. The network might still reduce training loss (the last layers are learning), but it never develops the hierarchical representations that make deep networks powerful.
A common misunderstanding is that vanishing gradients only affect RNNs. They affect ANY deep network — RNNs just suffer more because unrolling through time creates very deep effective networks (one “layer” per timestep, often hundreds).
For a network with $L$ layers, the gradient of the loss with respect to the first layer’s parameters involves a chain of matrix products:

$$\frac{\partial \mathcal{L}}{\partial \theta_1} = \frac{\partial \mathcal{L}}{\partial h_L} \left(\prod_{l=2}^{L} \frac{\partial h_l}{\partial h_{l-1}}\right) \frac{\partial h_1}{\partial \theta_1}$$

Each Jacobian $\frac{\partial h_l}{\partial h_{l-1}} = \mathrm{diag}\!\left(\sigma'(z_l)\right) W_l$, where $\sigma'$ is the activation derivative. If the spectral radius of these Jacobians is consistently $< 1$, the product shrinks exponentially with depth $L$.

For sigmoid: $\sigma'(x) = \sigma(x)\,(1 - \sigma(x)) \le \tfrac{1}{4}$, so the activation term alone shrinks the gradient by a factor of at least 4 per layer, regardless of the weights.
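This decay can be simulated directly. The sketch below backpropagates a gradient vector through a stack of random sigmoid-layer Jacobians $\mathrm{diag}(\sigma'(z_l))\,W_l$ and tracks its norm; the dimensions, weight scale, and seed are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(0)
dim, depth = 64, 30

grad = np.ones(dim)               # stand-in for the gradient at the output
norms = []
for _ in range(depth):
    # One layer's Jacobian: diag(sigmoid'(z)) @ W
    W = rng.normal(0.0, 1.0 / np.sqrt(dim), (dim, dim))
    z = rng.normal(0.0, 1.0, dim)
    s = 1.0 / (1.0 + np.exp(-z))
    J = (s * (1.0 - s))[:, None] * W
    grad = J.T @ grad             # backpropagate one layer
    norms.append(np.linalg.norm(grad))

print(f"after 1 layer:   {norms[0]:.3e}")
print(f"after {depth} layers: {norms[-1]:.3e}")
# the norm shrinks by many orders of magnitude
```

Because $\sigma' \le 0.25$ everywhere, each factor is strongly contractive here even though the weight matrices themselves roughly preserve norm.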
Manifestation
- Gradient norms per layer decrease exponentially from output to input — plot `grad.norm()` for each layer’s parameters and you’ll see orders-of-magnitude differences
- Early layer weights barely change between epochs while later layers update normally
- Loss decreases slowly, then plateaus at a value well above what a shallower network achieves
- Activations in early layers look random and don’t develop meaningful features even after extended training
- In RNNs: the model fails to capture long-range dependencies — it can predict the next word but not maintain context from 50 tokens ago
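The first symptom is easy to check in practice. A hedged PyTorch sketch — the 10-layer sigmoid MLP, sizes, and random data are hypothetical, chosen only to exhibit the problem:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
depth, width = 10, 32

# Deep sigmoid MLP, deliberately built to exhibit vanishing gradients
blocks = []
for _ in range(depth):
    blocks += [nn.Linear(width, width), nn.Sigmoid()]
model = nn.Sequential(*blocks, nn.Linear(width, 1))

x = torch.randn(64, width)
model(x).pow(2).mean().backward()

# Per-layer gradient norms, input side first
norms = [m.weight.grad.norm().item()
         for m in model if isinstance(m, nn.Linear)]
for i, n in enumerate(norms):
    print(f"layer {i:2d}: grad norm {n:.3e}")
# expect orders-of-magnitude decay from the last layer back to the first
```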
Where It Appears
- MLP training (`nn-training/`): deep MLPs with sigmoid/tanh fail to train — motivates ReLU activation and proper weight initialisation
- Transformer (`transformer/`): without residual connections, a 12-layer transformer would suffer severe gradient vanishing — pre-norm + residuals keep gradients flowing
- GANs (`gans/`): deep discriminators can suffer vanishing gradients, especially with saturating losses — one motivation for using spectral normalisation
- Policy gradient (`policy-gradient/`): when the policy network is deep, vanishing gradients interact with high-variance gradient estimates to make training nearly impossible
- RNNs (not yet in series): the original context where the problem was identified — LSTM gates were designed specifically to create gradient highways through time
Solutions at a Glance
| Solution | Mechanism | Where documented |
|---|---|---|
| ReLU / GELU / SiLU | Derivative is 1 (or near 1) for positive inputs — no shrinkage | atomic-concepts/activation-functions/relu.md, gelu.md |
| Residual connections | Gradient has a direct path that skips layers: $\frac{\partial}{\partial x}\bigl(x + F(x)\bigr) = I + \frac{\partial F}{\partial x}$ | atomic-concepts/architectural-primitives/residual-connections.md |
| Pre-norm (LayerNorm before attention/FFN) | Normalises inputs to each sublayer, stabilising Jacobian magnitudes | transformer/ |
| Proper weight initialisation | Sets initial weight variance so signal magnitude is preserved layer-to-layer | atomic-concepts/optimisation-primitives/weight-initialisation.md |
| LSTM / GRU gates | Multiplicative gates that can pass gradients through time unchanged | (not yet in series) |
| Gradient clipping | Doesn’t fix vanishing directly, but prevents the oscillation that sometimes accompanies attempts to fix it | atomic-concepts/optimisation-primitives/gradient-clipping.md |
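To see why residual connections appear in the table, compare gradient flow through a plain chain of contractive layers with the same chain plus identity skips, where the backward factor becomes $I + J_f$ instead of $J_f$. A sketch assuming random sigmoid-layer Jacobians (dimensions and seed are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
dim, depth = 32, 20

def layer_jacobian():
    # Jacobian of one sigmoid layer: diag(sigmoid'(z)) @ W
    W = rng.normal(0.0, 1.0 / np.sqrt(dim), (dim, dim))
    z = rng.normal(0.0, 1.0, dim)
    s = 1.0 / (1.0 + np.exp(-z))
    return (s * (1.0 - s))[:, None] * W

plain = np.ones(dim)
resid = np.ones(dim)
for _ in range(depth):
    J = layer_jacobian()
    plain = J.T @ plain             # plain chain: gradient through f only
    resid = resid + J.T @ resid     # residual: gradient through (I + J_f)

print(f"plain:    {np.linalg.norm(plain):.3e}")   # collapses toward zero
print(f"residual: {np.linalg.norm(resid):.3e}")   # stays O(1)
```

The identity term gives the gradient a path that no contractive Jacobian can shrink, which is exactly the “direct path that skips layers” in the table.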
Historical Context
Hochreiter identified and analysed the vanishing gradient problem in his 1991 diploma thesis, and Hochreiter & Schmidhuber’s 1997 LSTM paper proposed the LSTM as a direct remedy. The problem was already known informally — training deep networks had been frustratingly difficult since the 1980s — but Hochreiter gave it a precise mathematical characterisation. For over a decade, the consensus was that very deep networks simply couldn’t be trained, which is why most practical work used shallow architectures (1-3 hidden layers). The modern era of deep learning was unlocked by the combination of ReLU (Nair & Hinton, 2010), proper initialisation (Glorot & Bengio, 2010; He et al., 2015), and residual connections (He et al., 2015) — each attacking a different aspect of the gradient flow problem.