
Exploding Gradients

Gradients grow exponentially as they propagate backward through layers, causing weight updates so large they destroy the model. This is the mirror image of vanishing gradients: where vanishing gradients cause a slow death, exploding gradients cause a sudden one (NaN weights, loss jumping to infinity).

If vanishing gradients are a game of telephone where each person whispers quieter, exploding gradients are a game where each person shouts louder. Each layer multiplies the gradient by its Jacobian, and if those multipliers are consistently greater than 1, the signal grows exponentially. After enough layers, the gradient reaching early layers is astronomically large, producing weight updates that overshoot by miles.

The result is dramatic: one step the model is training normally, the next step the loss spikes to infinity or the weights become NaN. Unlike vanishing gradients (which cause a slow, silent failure), exploding gradients are loud and obvious — but can be hard to prevent without also causing vanishing gradients. The two problems are linked: if the spectral radius of the Jacobians fluctuates around 1, some backward paths explode while others vanish.

RNNs are especially vulnerable because the same weight matrix is multiplied at every timestep. If the largest singular value of that matrix is even slightly above 1, say 1.1, gradients after 100 timesteps are roughly $1.1^{100} \approx 1.4 \times 10^4$ times their original size: catastrophic.

Same chain rule as vanishing gradients, but with spectral radius $> 1$:

$$\frac{\partial \mathcal{L}}{\partial W_1} = \frac{\partial \mathcal{L}}{\partial h_L} \prod_{l=2}^{L} \frac{\partial h_l}{\partial h_{l-1}} \cdot \frac{\partial h_1}{\partial W_1}$$

If $\left\|\frac{\partial h_l}{\partial h_{l-1}}\right\| > 1$ consistently, the product grows as $\mathcal{O}(c^L)$ for some $c > 1$.

For an RNN with weight matrix $W$: $\left\|(\text{diag}(\sigma') \cdot W)^T\right\| \approx \lambda_{\max}(W)^T$, where $T$ is the sequence length and $\lambda_{\max}$ is the largest singular value of $W$.

  • Loss suddenly spikes to infinity or NaN — often after training seemed to be progressing normally
  • Gradient norms per layer increase from output to input (the reverse of vanishing gradients)
  • Weight values grow very large — inspect param.data.norm() across layers
  • Training is unstable even with small learning rates — occasional NaN spikes interrupt otherwise normal training
  • In RNNs: instability gets worse with longer sequences (more timesteps = deeper effective network)
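The layer-wise check can be sketched without any framework: backpropagate a dummy gradient through a deep linear stack and record the norm arriving at each layer (the depth, width, and the deliberately too-large 1.5 gain are arbitrary illustrative values):

```python
import numpy as np

rng = np.random.default_rng(0)

# A deep stack of linear layers whose weights are scaled too large
# (gain 1.5), so each backward Jacobian inflates the gradient.
depth, width = 10, 64
Ws = [1.5 * rng.standard_normal((width, width)) / np.sqrt(width)
      for _ in range(depth)]

# Backpropagate a unit gradient from the output toward the input,
# recording the gradient norm that reaches each layer.
g = np.ones(width) / np.sqrt(width)
norms = []
for W in reversed(Ws):
    g = W.T @ g
    norms.append(np.linalg.norm(g))

# norms[0] is nearest the output, norms[-1] nearest the input; with
# exploding gradients the sequence trends sharply upward.
```

In a real training loop the analogous diagnostic is to log each parameter's gradient norm after `backward()` and watch for the output-to-input growth pattern.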
  • MLP training (nn-training/): deep networks with large weight initialisation or no normalisation — motivates careful initialisation schemes (He, Xavier)
  • Transformer (transformer/): post-norm architectures (original Transformer) are more prone than pre-norm — pre-norm stabilises gradient magnitudes across layers
  • Policy gradient (policy-gradient/): large advantage estimates can produce enormous policy gradients, especially early in training — gradient clipping is standard practice
  • RNNs (not yet in series): the classic setting — LSTM gates and gradient clipping were both motivated by this problem
| Solution | Mechanism | Where documented |
|---|---|---|
| Gradient clipping | Cap the gradient norm to a maximum value before the optimizer step | atomic-concepts/optimisation-primitives/gradient-clipping.md |
| Proper weight initialisation | Set initial weight scale so forward/backward signal magnitude is preserved | atomic-concepts/optimisation-primitives/weight-initialisation.md |
| Residual connections | Gradient has a skip path, preventing exponential growth through the transform branch | atomic-concepts/architectural-primitives/residual-connections.md |
| Pre-norm (LayerNorm before sublayer) | Normalises inputs to each layer, bounding the Jacobian | transformer/ |
| LSTM / GRU gates | Gating mechanism bounds the gradient flow through time | (not yet in series) |
| Learning rate warmup | Start with a tiny learning rate so early large gradients don't destroy weights | atomic-concepts/optimisation-primitives/learning-rate-warmup.md |
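Gradient clipping itself is only a few lines; here is a framework-free numpy sketch of global-norm clipping (in PyTorch the equivalent built-in is `torch.nn.utils.clip_grad_norm_`; the `max_norm=1.0` threshold below is an illustrative choice):

```python
import numpy as np

def clip_grad_norm(grads, max_norm):
    """Rescale a list of gradient arrays so their combined (global)
    L2 norm is at most max_norm, preserving their direction."""
    total_norm = np.sqrt(sum(float(np.sum(g * g)) for g in grads))
    if total_norm > max_norm:
        grads = [g * (max_norm / total_norm) for g in grads]
    return grads, total_norm

# An "exploded" set of gradients with a huge global norm.
grads = [np.full((4, 4), 50.0), np.full(8, 50.0)]
clipped, before = clip_grad_norm(grads, max_norm=1.0)
after = np.sqrt(sum(float(np.sum(g * g)) for g in clipped))
# `before` is large; `after` equals max_norm -- only the magnitude
# is capped, the update direction is unchanged.
```

Clipping the global norm (rather than each tensor separately) is the standard scheme because it rescales the whole gradient vector uniformly, leaving the relative contribution of each layer intact.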

Exploding gradients were characterised alongside vanishing gradients by Bengio, Simard & Frasconi (1994), who showed that RNN gradient flow was fundamentally unstable — gradients either vanish or explode, with a stable middle ground that’s a set of measure zero. Pascanu, Mikolov & Bengio (2013) provided a thorough analysis and proposed gradient clipping as a practical solution — simply rescale the gradient vector if its norm exceeds a threshold. This turns out to be one of the most universally effective tricks in deep learning: nearly every modern training recipe includes gradient clipping, from LLM pretraining to RL.