
Vanishing Gradients

Gradients shrink exponentially as they propagate backward through layers, starving early layers of learning signal. The deeper the network, the worse it gets — deep networks without mitigation effectively only train their last few layers.

Think of a game of telephone where each person whispers 90% of the original message. After 10 people, only 35% survives. After 50 people, less than 1%. That’s what happens to gradients in a deep network: each layer multiplies the gradient by its local Jacobian, and if those multipliers are consistently less than 1, the signal decays exponentially.

The culprit is usually the activation function. Sigmoid and tanh both squash their inputs into a bounded range, and their derivatives are always less than 1 (sigmoid’s max derivative is 0.25). The activation term alone therefore shrinks the gradient by at least 4x per sigmoid layer, unless the weights are large enough to compensate. After 10 layers of sigmoid activations, the gradient reaching the first layer is on the order of $10^{-6}$ — effectively zero.
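A quick back-of-the-envelope check of those figures, in plain Python (no framework needed):

```python
# Upper bound from the activations alone: sigmoid'(z) <= 0.25, so ten
# sigmoid layers multiply the gradient by at most 0.25**10 (weight
# matrices ignored for this bound).
activation_bound = 0.25 ** 10
print(activation_bound)  # ~9.5e-07, i.e. on the order of 1e-6

# The telephone-game analogy uses the same arithmetic:
print(0.9 ** 10)  # ~0.349, about 35% survives after 10 people
print(0.9 ** 50)  # ~0.005, under 1% after 50
```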

This isn’t just a “slow training” problem. Vanishing gradients create a qualitative failure: the first layers freeze at near-random values while the last layers learn on top of garbage features. The network might still reduce training loss (the last layers are learning), but it never develops the hierarchical representations that make deep networks powerful.

A common misunderstanding is that vanishing gradients only affect RNNs. They affect ANY deep network — RNNs just suffer more because unrolling through time creates very deep effective networks (one “layer” per timestep, often hundreds).

For a network with $L$ layers, the gradient of the loss with respect to the first layer’s parameters involves a chain of matrix products:

$$\frac{\partial \mathcal{L}}{\partial W_1} = \frac{\partial \mathcal{L}}{\partial h_L} \left( \prod_{l=2}^{L} \frac{\partial h_l}{\partial h_{l-1}} \right) \frac{\partial h_1}{\partial W_1}$$

Each Jacobian is $\frac{\partial h_l}{\partial h_{l-1}} = \text{diag}(\sigma'(z_l)) \, W_l$, where $\sigma'$ is the activation derivative. If the spectral norms of these Jacobians are consistently $< 1$, the product shrinks exponentially with depth $L$.

For sigmoid: $\sigma'(z) \leq 0.25$, so each layer shrinks the gradient by at least 4x unless the weight matrix has spectral norm greater than 4.
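The shrinking product can be observed directly. A minimal sketch (assumes NumPy; the 50-layer depth, width 64, and $N(0, 1/\text{width})$ weight scale are hypothetical toy settings):

```python
import numpy as np

# Toy version of the Jacobian chain above: a 50-layer sigmoid MLP of
# width 64 with N(0, 1/width) weights.
rng = np.random.default_rng(0)
width, depth = 64, 50

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

h = rng.standard_normal(width)  # h_0: random input
J = np.eye(width)               # running product of layer Jacobians
for _ in range(depth):
    W = rng.standard_normal((width, width)) / np.sqrt(width)
    z = W @ h
    # dh_l/dh_{l-1} = diag(sigmoid'(z_l)) @ W_l, left-multiplied onto the chain
    J = np.diag(sigmoid(z) * (1.0 - sigmoid(z))) @ W @ J
    h = sigmoid(z)

# Norm of dh_L/dh_0: every factor has spectral norm <= 0.25 * ||W||,
# so the product collapses with depth.
print(np.linalg.norm(J))
```

With these settings the printed norm is many orders of magnitude below 1, which is exactly the exponential decay the product formula predicts.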

Symptoms to look for:

  • Gradient norms per layer decrease exponentially from output to input — plot grad.norm() for each layer’s parameters and you’ll see orders-of-magnitude differences
  • Early layer weights barely change between epochs while later layers update normally
  • Loss decreases slowly then plateaus at a value well above what a shallower network achieves
  • Activations in early layers look random and don’t develop meaningful features even after extended training
  • In RNNs: the model fails to capture long-range dependencies — it can predict the next word but not maintain context from 50 tokens ago
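The first symptom takes only a few lines to check. A diagnostic sketch (assumes PyTorch; the 20-layer sigmoid MLP, width 64, and batch size are arbitrary choices for illustration):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Deep sigmoid MLP: the worst case described above.
layers = []
for _ in range(20):
    layers += [nn.Linear(64, 64), nn.Sigmoid()]
model = nn.Sequential(*layers, nn.Linear(64, 1))

# One forward/backward pass on random data.
loss = model(torch.randn(32, 64)).pow(2).mean()
loss.backward()

# Per-layer gradient norms, input side first: expect them to fall by
# orders of magnitude from the output layer back to the first layer.
for i, m in enumerate(model):
    if isinstance(m, nn.Linear):
        print(f"layer {i:2d}: grad norm = {m.weight.grad.norm().item():.3e}")
```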
Where it appears in this series:

  • MLP training (nn-training/): deep MLPs with sigmoid/tanh fail to train — motivates ReLU activation and proper weight initialisation
  • Transformer (transformer/): without residual connections, a 12-layer transformer would suffer severe gradient vanishing — pre-norm + residuals keep gradients flowing
  • GANs (gans/): deep discriminators can suffer vanishing gradients, especially with saturating losses — one motivation for using spectral normalisation
  • Policy gradient (policy-gradient/): when the policy network is deep, vanishing gradients interact with high-variance gradient estimates to make training nearly impossible
  • RNNs (not yet in series): the original context where the problem was identified — LSTM gates were designed specifically to create gradient highways through time
| Solution | Mechanism | Where documented |
| --- | --- | --- |
| ReLU / GELU / SiLU | Derivative is 1 (or near 1) for positive inputs — no shrinkage | atomic-concepts/activation-functions/relu.md, gelu.md |
| Residual connections | Gradient has a direct path that skips layers: $\frac{\partial}{\partial x}(x + f(x)) = 1 + f'(x)$ | atomic-concepts/architectural-primitives/residual-connections.md |
| Pre-norm (LayerNorm before attention/FFN) | Normalises inputs to each sublayer, stabilising Jacobian magnitudes | transformer/ |
| Proper weight initialisation | Sets initial weight variance so signal magnitude is preserved layer-to-layer | atomic-concepts/optimisation-primitives/weight-initialisation.md |
| LSTM / GRU gates | Multiplicative gates that can pass gradients through time unchanged | (not yet in series) |
| Gradient clipping | Doesn’t fix vanishing directly, but prevents the oscillation that sometimes accompanies attempts to fix it | atomic-concepts/optimisation-primitives/gradient-clipping.md |
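The residual-connection mechanism can be checked numerically with the same kind of toy Jacobian product (assumes NumPy; sigmoid activation, width 64, depth 50, and the $N(0, 1/\text{width})$ weight scale are all hypothetical settings):

```python
import numpy as np

rng = np.random.default_rng(0)
width, depth = 64, 50

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

h_plain = rng.standard_normal(width)
h_res = h_plain.copy()
J_plain = np.eye(width)  # Jacobian product for h_l = sigmoid(W_l h_{l-1})
J_res = np.eye(width)    # Jacobian product for h_l = h_{l-1} + sigmoid(W_l h_{l-1})
for _ in range(depth):
    W = rng.standard_normal((width, width)) / np.sqrt(width)

    z = W @ h_plain
    J_plain = np.diag(sigmoid(z) * (1 - sigmoid(z))) @ W @ J_plain
    h_plain = sigmoid(z)

    z = W @ h_res
    # d/dh (h + f(h)) = I + diag(sigmoid'(z)) W: the identity term keeps
    # a direct gradient path open through every layer.
    J_res = (np.eye(width) + np.diag(sigmoid(z) * (1 - sigmoid(z))) @ W) @ J_res
    h_res = h_res + sigmoid(z)

print(np.linalg.norm(J_plain), np.linalg.norm(J_res))
```

The plain chain collapses toward zero while the residual chain stays orders of magnitude larger, which is the $1 + f'(x)$ term in the table at work.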

Hochreiter identified the vanishing gradient problem in his 1991 diploma thesis, and Hochreiter & Schmidhuber formalised it in their 1997 LSTM paper. The problem was already known informally — training deep networks had been frustratingly difficult since the 1980s — but Hochreiter gave it a precise mathematical characterisation. For over a decade, the consensus was that very deep networks simply couldn’t be trained, which is why most practical work used shallow architectures (1-3 hidden layers). The modern era of deep learning was unlocked by the combination of ReLU (Nair & Hinton, 2010), proper initialisation (Glorot & Bengio, 2010; He et al., 2015), and residual connections (He et al., 2015) — each attacking a different aspect of the gradient flow problem.