Exploding Gradients
Gradients grow exponentially as they propagate backward through layers, causing weight updates so large they destroy the model. The mirror image of vanishing gradients — where vanishing gradients cause slow death, exploding gradients cause sudden death (NaN weights, loss jumping to infinity).
Intuition
If vanishing gradients are a game of telephone where each person whispers quieter, exploding gradients are a game where each person shouts louder. Each layer multiplies the gradient by its Jacobian, and if those multipliers are consistently greater than 1, the signal grows exponentially. After enough layers, the gradient reaching early layers is astronomically large, producing weight updates that overshoot by miles.
The result is dramatic: one step the model is training normally, the next step the loss spikes to infinity or the weights become NaN. Unlike vanishing gradients (which cause a slow, silent failure), exploding gradients are loud and obvious — but can be hard to prevent without also causing vanishing gradients. The two problems are linked: if the spectral radius of the Jacobians fluctuates around 1, some backward paths explode while others vanish.
RNNs are especially vulnerable because the same weight matrix is multiplied at every timestep. If the largest singular value of that matrix is even slightly above 1 — say $\sigma_{\max} = 1.1$ — gradients after 100 timesteps are $1.1^{100} \approx 1.4 \times 10^4$ times the original — catastrophic.
Same chain rule as vanishing gradients, but with spectral radius $\rho > 1$:

$\frac{\partial L}{\partial h_t} = \frac{\partial L}{\partial h_T} \prod_{k=t}^{T-1} \frac{\partial h_{k+1}}{\partial h_k}$

If $\left\| \frac{\partial h_{k+1}}{\partial h_k} \right\| \geq c$ consistently, the product grows as $c^{\,T-t}$ for some $c > 1$.

For an RNN with weight matrix $W_{hh}$: $\left\| \frac{\partial h_T}{\partial h_t} \right\| \sim \sigma_{\max}(W_{hh})^{T-t}$, where $T$ is sequence length and $\sigma_{\max}$ is the largest singular value.
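The exponential growth is easy to see numerically. A minimal NumPy sketch (the 16-dimensional matrix and the 1.1 scale are illustrative choices, not from the text; the matrix is made symmetric so the largest singular value equals the spectral radius, i.e. the true growth rate under repeated multiplication):

```python
import numpy as np

rng = np.random.default_rng(0)

# Symmetric "recurrent" matrix, scaled so its largest singular value is 1.1
A = rng.standard_normal((16, 16))
W = (A + A.T) / 2
W *= 1.1 / np.max(np.abs(np.linalg.eigvalsh(W)))

# Backprop through T timesteps multiplies the gradient by W^T at every step
grad = rng.standard_normal(16)
norms = []
for t in range(100):
    grad = W.T @ grad
    norms.append(float(np.linalg.norm(grad)))

print(f"after 10 steps: {norms[9]:.2f}, after 100 steps: {norms[99]:.2f}")
```

With the scale just above 1 the norm grows roughly like $1.1^t$; drop the scale just below 1 and the same loop demonstrates vanishing gradients instead.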
Manifestation
- Loss suddenly spikes to infinity or NaN — often after training seemed to be progressing normally
- Gradient norms per layer increase from output to input (the reverse of vanishing gradients)
- Weight values grow very large — inspect `param.data.norm()` across layers
- Training is unstable even with small learning rates — occasional NaN spikes interrupt otherwise normal training
- In RNNs: instability gets worse with longer sequences (more timesteps = deeper effective network)
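The per-layer pattern — gradient norms growing from output back toward input — can be reproduced with a hand-rolled backward pass through a deep linear stack. A NumPy sketch (the depth, width, and 1.5 weight scale are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
depth, width = 12, 32

# Deliberately over-scaled weights: each layer multiplies the
# backward signal by roughly 1.5 on average
Ws = [1.5 / np.sqrt(width) * rng.standard_normal((width, width))
      for _ in range(depth)]

# Backward pass through a linear stack: g_l = W_l^T g_{l+1}
g = np.ones(width)  # gradient arriving at the output layer
grad_norms = [float(np.linalg.norm(g))]
for W in reversed(Ws):
    g = W.T @ g
    grad_norms.append(float(np.linalg.norm(g)))

# grad_norms runs from the output layer back to the input layer
print(f"output-layer norm {grad_norms[0]:.1f}, "
      f"input-layer norm {grad_norms[-1]:.1f}")
```

In a real PyTorch model the equivalent diagnostic is to print `param.grad.norm()` for each named parameter after `loss.backward()` and look for the same output-to-input growth.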
Where It Appears
- MLP training (`nn-training/`): deep networks with large weight initialisation or no normalisation — motivates careful initialisation schemes (He, Xavier)
- Transformer (`transformer/`): post-norm architectures (original Transformer) are more prone than pre-norm — pre-norm stabilises gradient magnitudes across layers
- Policy gradient (`policy-gradient/`): large advantage estimates can produce enormous policy gradients, especially early in training — gradient clipping is standard practice
- RNNs (not yet in series): the classic setting — LSTM gates and gradient clipping were both motivated by this problem
Solutions at a Glance
| Solution | Mechanism | Where documented |
|---|---|---|
| Gradient clipping | Cap the gradient norm to a maximum value before the optimizer step | atomic-concepts/optimisation-primitives/gradient-clipping.md |
| Proper weight initialisation | Set initial weight scale so forward/backward signal magnitude is preserved | atomic-concepts/optimisation-primitives/weight-initialisation.md |
| Residual connections | Gradient has a skip path, preventing exponential growth through the transform branch | atomic-concepts/architectural-primitives/residual-connections.md |
| Pre-norm (LayerNorm before sublayer) | Normalises inputs to each layer, bounding the Jacobian | transformer/ |
| LSTM / GRU gates | Gating mechanism bounds the gradient flow through time | (not yet in series) |
| Learning rate warmup | Start with tiny learning rate so early large gradients don’t destroy weights | atomic-concepts/optimisation-primitives/learning-rate-warmup.md |
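Gradient clipping, the most widely used of these, is only a few lines. A NumPy sketch of clipping by global norm (PyTorch's `torch.nn.utils.clip_grad_norm_` implements the same idea in place on a model's parameters; the `1e-6` epsilon guards against division by zero):

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Rescale a list of gradient arrays so their combined L2 norm <= max_norm."""
    total_norm = float(np.sqrt(sum(np.sum(g ** 2) for g in grads)))
    scale = max_norm / (total_norm + 1e-6)
    if scale < 1.0:  # only rescale when the norm exceeds the threshold
        grads = [g * scale for g in grads]
    return grads, total_norm

# An "exploded" gradient: global norm ~224, far above the threshold
grads = [np.full(4, 100.0), np.full(4, -50.0)]
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
```

Crucially, clipping rescales the whole gradient vector rather than clamping each element, so the update direction is preserved — only its magnitude is capped.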
Historical Context
Exploding gradients were characterised alongside vanishing gradients by Bengio, Simard & Frasconi (1994), who showed that RNN gradient flow was fundamentally unstable — gradients either vanish or explode, with a stable middle ground that’s a set of measure zero. Pascanu, Mikolov & Bengio (2013) provided a thorough analysis and proposed gradient clipping as a practical solution — simply rescale the gradient vector if its norm exceeds a threshold. This turns out to be one of the most universally effective tricks in deep learning: nearly every modern training recipe includes gradient clipping, from LLM pretraining to RL.