Dying ReLU
Neurons with ReLU activation become permanently stuck at zero output when their weights shift so that the input is always negative. Once dead, a ReLU neuron produces zero gradient and can never recover — it is effectively removed from the network for the rest of training. Also called “dead neurons.”
Intuition
ReLU is a one-way door: `ReLU(x) = max(0, x)`. When `x > 0`, the gradient is 1 (perfect signal). When `x ≤ 0`, the gradient is exactly 0 (no signal at all). If a neuron’s bias and incoming weights shift so that `w·x + b < 0` for every input in the training set, the gradient is zero for every training example. Zero gradient means zero weight update, which means the neuron stays dead permanently.
This can happen suddenly. A single large gradient update can push the bias far enough negative that the neuron “falls off a cliff” into permanent zero territory. Learning rates that are too high make this more likely — they allow large jumps that land on the wrong side of the ReLU threshold. But it can also happen gradually through the accumulation of small updates that slowly shift the neuron’s operating point into negative territory.
The problem scales with network depth and learning rate. In a deep network, a large fraction of neurons can die early in training (sometimes 20-40%), permanently reducing the network’s effective capacity. The network still trains, but with fewer parameters than intended.
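The mechanism above can be demonstrated in a few lines. This is a toy sketch with illustrative values: a single ReLU neuron whose bias has been pushed far negative (as a single oversized update might do), so its pre-activation is negative for every input in the batch and all of its gradients are exactly zero.

```python
import torch

# A single ReLU neuron: y = relu(w @ x + b).
# The bias is set far negative (hypothetical value) so that the
# pre-activation is below zero for every input -- a "dead" neuron.
torch.manual_seed(0)
w = torch.randn(4, requires_grad=True)
b = torch.tensor(-100.0, requires_grad=True)  # e.g. after one huge update

x = torch.randn(32, 4)      # a batch of 32 inputs
pre = x @ w + b             # pre-activation: every entry is negative
out = torch.relu(pre)       # output is exactly 0 for every example

out.sum().backward()

print(out.abs().max().item())     # the neuron outputs zero everywhere
print(w.grad.abs().max().item())  # zero gradient -> w never updates
print(b.grad.item())              # zero gradient -> the neuron cannot recover
```

Because the gradient of `relu` is zero wherever its input is negative, no training signal ever reaches `w` or `b` again, which is exactly the permanence described above.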
Manifestation
- Activation statistics: a growing fraction of neurons output exactly 0.0 for every input in a batch — track `(activations == 0).float().mean()` per layer
- Dead neurons never recover — once the fraction starts climbing, it doesn’t come back down
- Training loss plateaus at a value higher than expected, as effective model capacity has been reduced
- Weight norms for dead neurons stop changing entirely (zero gradient = zero update)
- More common with high learning rates and in early training when the optimiser makes large steps
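The per-layer tracking mentioned in the first bullet can be wired up with forward hooks. A minimal sketch, assuming a small illustrative `nn.Sequential` model (the architecture and names are not from the original):

```python
import torch
import torch.nn as nn

# Illustrative model; in practice attach hooks to your own network's ReLUs.
model = nn.Sequential(
    nn.Linear(16, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 10),
)

zero_fraction = {}

def make_hook(name):
    def hook(module, inputs, output):
        # Fraction of activations that are exactly zero in this batch,
        # i.e. the (activations == 0).float().mean() statistic from above.
        zero_fraction[name] = (output == 0).float().mean().item()
    return hook

for name, module in model.named_modules():
    if isinstance(module, nn.ReLU):
        module.register_forward_hook(make_hook(name))

x = torch.randn(128, 16)
model(x)
for name, frac in zero_fraction.items():
    print(f"layer {name}: {frac:.2%} of activations are zero")
```

Some zeros are normal for ReLU; the warning sign is a fraction that climbs over training and never comes back down. To test for fully dead units specifically, check `(output == 0).all(dim=0)` — a neuron that is zero for every input in the batch.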
Where It Appears
- MLP training (`nn-training/`): the most direct setting — deep MLPs with ReLU can lose significant capacity to dead neurons; motivates GELU/SiLU as the modern default
- GANs (`gans/`): the discriminator can suffer from dying ReLU, especially with aggressive learning rates — contributes to training instability
- Q-learning (`q-learning/`): Q-networks can gradually lose active neurons, reducing approximation capacity over long training runs
Solutions at a Glance
| Solution | Mechanism | Where documented |
|---|---|---|
| Leaky ReLU | Small negative slope (`αx` for `x < 0`) ensures gradient is never exactly zero | (standard activation) |
| GELU / SiLU | Smooth approximation to ReLU with non-zero gradient everywhere | `atomic-concepts/activation-functions/gelu.md`, `silu-swish.md` |
| Proper initialisation | He initialisation keeps activations in a good range at the start of training | `atomic-concepts/optimisation-primitives/weight-initialisation.md` |
| Lower learning rate / warmup | Prevents the large early updates that push neurons into permanently negative territory | `atomic-concepts/optimisation-primitives/learning-rate-warmup.md` |
| Batch normalisation | Re-centres activations, making it harder for an entire neuron to be stuck negative | (standard technique) |
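The first two rows of the table can be verified directly by comparing gradients at a negative input. A quick sketch (the choice of `x = -2` is arbitrary): ReLU gives exactly zero, while Leaky ReLU and GELU keep a non-zero gradient, so a stuck neuron still receives a training signal.

```python
import torch
import torch.nn.functional as F

# Gradient of each activation at a negative input, x = -2.
grads = {}
for name, fn in [
    ("relu", F.relu),
    ("leaky_relu", lambda t: F.leaky_relu(t, negative_slope=0.01)),
    ("gelu", F.gelu),
]:
    x = torch.tensor([-2.0], requires_grad=True)
    fn(x).sum().backward()
    grads[name] = x.grad.item()
    print(f"{name:>10}: gradient at x = -2 is {grads[name]:+.4f}")
```

For Leaky ReLU the gradient on the negative side is exactly the `negative_slope` `α` (here 0.01); for GELU it is small but non-zero everywhere, which is what lets these activations escape the dead-neuron trap.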
Historical Context
The dying ReLU problem was observed empirically soon after ReLU became the standard activation (around 2011-2012). Maas et al. (2013) proposed Leaky ReLU as a direct fix. Despite the problem being well-known, ReLU remained the default for years due to its simplicity and the fact that dying neurons don’t prevent training — they just reduce capacity. The shift away from ReLU toward smooth alternatives (GELU in BERT/GPT, SiLU/Swish in EfficientNet, SwiGLU in LLaMA) was motivated by a combination of dying ReLU concerns and the observation that smooth activations train slightly better in practice.