
Exploding Gradients

Gradients grow exponentially as they propagate backward through layers, causing weight updates so large they destroy the model. This is the mirror image of vanishing gradients: where vanishing gradients cause a slow death, exploding gradients cause a sudden one (NaN weights, loss jumping to infinity).

If vanishing gradients are a game of telephone where each person whispers quieter, exploding gradients are a game where each person shouts louder. Each layer multiplies the gradient by its Jacobian, and if those multipliers are consistently greater than 1, the signal grows exponentially. After enough layers, the gradient reaching early layers is astronomically large, producing weight updates that overshoot by miles.

The result is dramatic: one step the model is training normally, the next step the loss spikes to infinity or the weights become NaN. Unlike vanishing gradients (which cause a slow, silent failure), exploding gradients are loud and obvious — but can be hard to prevent without also causing vanishing gradients. The two problems are linked: if the spectral radius of the Jacobians fluctuates around 1, some backward paths explode while others vanish.

RNNs are especially vulnerable because the same weight matrix is multiplied at every timestep. If the largest singular value of that matrix is even slightly above 1, say 1.1, gradients after 100 timesteps are roughly $1.1^{100} \approx 1.4 \times 10^4$ times their original size: catastrophic.

Same chain rule as vanishing gradients, but with spectral radius $> 1$:

$$\frac{\partial \mathcal{L}}{\partial W_1} = \frac{\partial \mathcal{L}}{\partial h_L} \prod_{l=2}^{L} \frac{\partial h_l}{\partial h_{l-1}} \cdot \frac{\partial h_1}{\partial W_1}$$

If $\left\|\frac{\partial h_l}{\partial h_{l-1}}\right\| > 1$ consistently, the product grows as $\mathcal{O}(c^L)$ for some $c > 1$.

For an RNN with weight matrix $W$: $\left\|(\text{diag}(\sigma') \cdot W)^T\right\| \approx \lambda_{\max}(W)^T$, where $T$ is the sequence length and $\lambda_{\max}$ is the largest singular value of $W$.

  • Loss suddenly spikes to infinity or NaN — often after training seemed to be progressing normally
  • Gradient norms per layer increase from output to input (the reverse of vanishing gradients)
  • Weight values grow very large — inspect param.data.norm() across layers
  • Training is unstable even with small learning rates — occasional NaN spikes interrupt otherwise normal training
  • In RNNs: instability gets worse with longer sequences (more timesteps = deeper effective network)
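The layer-wise check can be sketched without any framework: backpropagate a dummy gradient through a deep linear stack and record the norm arriving at each layer (the depth, width, and the deliberately too-large 1.5 gain are arbitrary illustrative values):

```python
import numpy as np

rng = np.random.default_rng(0)

# A deep stack of linear layers whose weights are scaled too large
# (gain 1.5), so each backward Jacobian inflates the gradient.
depth, width = 10, 64
Ws = [1.5 * rng.standard_normal((width, width)) / np.sqrt(width)
      for _ in range(depth)]

# Backpropagate a unit gradient from the output toward the input,
# recording the gradient norm that reaches each layer.
g = np.ones(width) / np.sqrt(width)
norms = []
for W in reversed(Ws):
    g = W.T @ g
    norms.append(np.linalg.norm(g))

# norms[0] is nearest the output, norms[-1] nearest the input; with
# exploding gradients the sequence trends sharply upward.
```

In a real training loop the analogous diagnostic is to log each parameter's gradient norm after `backward()` and watch for the output-to-input growth pattern.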
  • MLP training (nn-training/): deep networks with large weight initialisation or no normalisation — motivates careful initialisation schemes (He, Xavier)
  • Transformer (transformer/): post-norm architectures (original Transformer) are more prone than pre-norm — pre-norm stabilises gradient magnitudes across layers
  • Policy gradient (policy-gradient/): large advantage estimates can produce enormous policy gradients, especially early in training — gradient clipping is standard practice
  • RNNs (not yet in series): the classic setting — LSTM gates and gradient clipping were both motivated by this problem
| Solution | Mechanism | Where documented |
|---|---|---|
| Gradient clipping | Cap the gradient norm to a maximum value before the optimizer step | atomic-concepts/optimisation-primitives/gradient-clipping.md |
| Proper weight initialisation | Set initial weight scale so forward/backward signal magnitude is preserved | atomic-concepts/optimisation-primitives/weight-initialisation.md |
| Residual connections | Gradient has a skip path, preventing exponential growth through the transform branch | atomic-concepts/architectural-primitives/residual-connections.md |
| Pre-norm (LayerNorm before sublayer) | Normalises inputs to each layer, bounding the Jacobian | transformer/ |
| LSTM / GRU gates | Gating mechanism bounds the gradient flow through time | (not yet in series) |
| Learning rate warmup | Start with a tiny learning rate so early large gradients don't destroy weights | atomic-concepts/optimisation-primitives/learning-rate-warmup.md |
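Gradient clipping itself is only a few lines; here is a framework-free numpy sketch of global-norm clipping (in PyTorch the equivalent built-in is `torch.nn.utils.clip_grad_norm_`; the `max_norm=1.0` threshold below is an illustrative choice):

```python
import numpy as np

def clip_grad_norm(grads, max_norm):
    """Rescale a list of gradient arrays so their combined (global)
    L2 norm is at most max_norm, preserving their direction."""
    total_norm = np.sqrt(sum(float(np.sum(g * g)) for g in grads))
    if total_norm > max_norm:
        grads = [g * (max_norm / total_norm) for g in grads]
    return grads, total_norm

# An "exploded" set of gradients with a huge global norm.
grads = [np.full((4, 4), 50.0), np.full(8, 50.0)]
clipped, before = clip_grad_norm(grads, max_norm=1.0)
after = np.sqrt(sum(float(np.sum(g * g)) for g in clipped))
# `before` is large; `after` equals max_norm -- only the magnitude
# is capped, the update direction is unchanged.
```

Clipping the global norm (rather than each tensor separately) is the standard scheme because it rescales the whole gradient vector uniformly, leaving the relative contribution of each layer intact.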

Exploding gradients were characterised alongside vanishing gradients by Bengio, Simard & Frasconi (1994), who showed that RNN gradient flow was fundamentally unstable — gradients either vanish or explode, with a stable middle ground that’s a set of measure zero. Pascanu, Mikolov & Bengio (2013) provided a thorough analysis and proposed gradient clipping as a practical solution — simply rescale the gradient vector if its norm exceeds a threshold. This turns out to be one of the most universally effective tricks in deep learning: nearly every modern training recipe includes gradient clipping, from LLM pretraining to RL.