Residual Connections
Adding the input of a block to its output: y = F(x) + x. Also called “skip connections.” The single most important architectural innovation for training deep networks — used in every modern architecture from ResNet to GPT to Stable Diffusion.
Intuition
Without residual connections, a 100-layer network must learn the entire transformation from input to output as one long chain of matrix multiplications. Gradients must flow backwards through every layer, and they tend to either vanish (multiply by numbers < 1 repeatedly) or explode (multiply by numbers > 1 repeatedly). By the time the gradient reaches early layers, it’s either negligibly small or catastrophically large.
Residual connections fix this by giving the gradient a highway. The addition y = F(x) + x means the gradient of y with respect to x is dF/dx + I, where I is the identity matrix. That “+ I” term means the gradient always has a direct path back to earlier layers, regardless of what F does. Even if dF/dx vanishes entirely, the gradient still flows through the identity branch. This is why you can train networks with hundreds or thousands of layers.
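This is easy to demonstrate directly: backpropagate through a deep stack of layers with and without the skip connection, and compare the gradient norm that reaches the input. A minimal PyTorch sketch (the depth, width, and tanh nonlinearity are arbitrary choices for illustration):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
depth, dim = 50, 64
layers = [nn.Linear(dim, dim) for _ in range(depth)]

def input_grad_norm(residual: bool) -> float:
    """Gradient norm at the input after backprop through `depth` layers."""
    x = torch.randn(8, dim, requires_grad=True)
    h = x
    for layer in layers:
        out = torch.tanh(layer(h))
        h = h + out if residual else out   # the only difference
    h.sum().backward()
    return x.grad.norm().item()

plain_norm = input_grad_norm(residual=False)
res_norm = input_grad_norm(residual=True)
print(f"plain:    {plain_norm:.3e}")   # vanishingly small
print(f"residual: {res_norm:.3e}")     # healthy, thanks to the identity path
```

With 50 plain layers the gradient norm collapses by many orders of magnitude; with the skip connections it stays on the order of 1, because the identity branch carries it through unchanged.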
There’s a deeper insight: residual connections change what each layer learns. Instead of learning the full mapping H(x), each layer only needs to learn the residual F(x) = H(x) - x, i.e. the delta from the identity. If the optimal transformation is close to identity (which it often is in deep networks), learning a small residual is much easier than learning the full mapping from scratch. This is why the paper is called “Deep Residual Learning.”
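One way to see this: if F’s weights are zero, the block is exactly the identity, and small weights give a small delta from the identity. A numpy sketch (the ReLU layer and the 0.01 weight scale are illustrative assumptions):

```python
import numpy as np

def residual_layer(x, W, b):
    return x + np.maximum(0.0, x @ W + b)   # y = F(x) + x, with a ReLU layer as F

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 16))

# F = 0  =>  the block is exactly the identity
W0, b0 = np.zeros((16, 16)), np.zeros(16)
assert np.allclose(residual_layer(x, W0, b0), x)

# Small weights  =>  output stays close to the input (a small residual delta)
Ws, bs = 0.01 * rng.normal(size=(16, 16)), np.zeros(16)
delta = np.abs(residual_layer(x, Ws, bs) - x).max()
print(f"max deviation from identity: {delta:.4f}")
```

Without the skip connection, the zero-weight network would instead map everything to zero — the worst possible starting point.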
Residual block:

y = F(x) + x

where F is any parameterised function (one or more layers). The gradient flows as:

dy/dx = dF/dx + I

The + I term guarantees gradient flow regardless of dF/dx.

With a projection (when input and output dimensions differ):

y = F(x) + W_s x

where W_s is a linear projection to match dimensions. In practice this is a 1x1 convolution (ResNet) or a linear layer (transformers).

Pre-norm variant (the modern standard, used in GPT, LLaMA):

y = x + F(LayerNorm(x))

Normalisation is applied before the block, so the residual stream stays unnormalised. This is more stable for very deep networks because the residual stream isn’t repeatedly normalised.
```python
import torch
import torch.nn as nn

# ── Basic residual block ────────────────────────────────────────
class ResidualBlock(nn.Module):
    def __init__(self, d_model: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),  # expand
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),  # project back
        )

    def forward(self, x):                  # x: (B, T, d_model)
        return x + self.mlp(self.norm(x))  # (B, T, d_model)
        # That's it. The entire point is the "+ x".

# ── With dimension mismatch (e.g. ResNet downsampling) ──────────
class ResidualBlockWithProjection(nn.Module):
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.block = nn.Sequential(
            nn.Linear(in_dim, out_dim),
            nn.ReLU(),
            nn.Linear(out_dim, out_dim),
        )
        # Project the skip path to match output dimension
        self.shortcut = nn.Linear(in_dim, out_dim) if in_dim != out_dim else nn.Identity()

    def forward(self, x):                        # x: (B, in_dim)
        return self.block(x) + self.shortcut(x)  # (B, out_dim)
```

Manual Implementation
```python
import numpy as np

def residual_block_forward(x, W1, b1, W2, b2):
    """
    Pre-norm residual MLP block, numpy only.
    x: (B, D) input
    W1: (D, 4D), b1: (4D,), W2: (4D, D), b2: (D,)
    Returns: (B, D)
    """
    # Layer norm (simplified: per-sample, per-feature)
    mu = x.mean(axis=-1, keepdims=True)      # (B, 1)
    var = x.var(axis=-1, keepdims=True)      # (B, 1)
    x_norm = (x - mu) / np.sqrt(var + 1e-5)  # (B, D)

    # MLP: expand -> GELU -> project back
    h = x_norm @ W1 + b1                     # (B, 4D)
    h = h * 0.5 * (1.0 + np.tanh(            # GELU approximation
        np.sqrt(2.0 / np.pi) * (h + 0.044715 * h ** 3)
    ))                                       # (B, 4D)
    out = h @ W2 + b2                        # (B, D)

    return x + out  # <-- the residual connection: input + block output
```

Popular Uses
- Transformers (GPT, LLaMA, BERT, ViT): every attention and MLP sub-layer uses a residual connection. The “residual stream” is the backbone of transformer computation (see transformer/)
- ResNet (He et al., 2015): the original application — enabled training 152-layer CNNs for ImageNet, up from ~20 layers without skip connections
- Diffusion U-Nets (see diffusion/): skip connections between encoder and decoder at matching resolutions, plus residual blocks within each resolution level
- Policy and value networks (see policy-gradient/, q-learning/): deeper RL networks use residual blocks to stabilise training
- GANs (see gans/): both generators and discriminators in modern GANs (StyleGAN, BigGAN) use residual blocks extensively
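To make the transformer case concrete, here is a minimal pre-norm block in PyTorch with its two residual connections, one around attention and one around the MLP (a sketch; the d_model and n_heads values are arbitrary):

```python
import torch
import torch.nn as nn

class PreNormTransformerBlock(nn.Module):
    """Pre-norm block: the residual stream x is only ever added to."""
    def __init__(self, d_model: int = 64, n_heads: int = 4):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x):                                  # x: (B, T, d_model)
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]  # residual 1: attention
        x = x + self.mlp(self.norm2(x))                    # residual 2: MLP
        return x

block = PreNormTransformerBlock()
x = torch.randn(2, 10, 64)
y = block(x)
print(y.shape)  # torch.Size([2, 10, 64])
```

Stacking N of these blocks gives the skeleton of a GPT-style model: the shape of the residual stream never changes, each block only adds a delta to it.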
Alternatives
Section titled “Alternatives”| Alternative | When to use | Tradeoff |
|---|---|---|
| Dense connections (DenseNet) | Need maximum feature reuse in CNNs | Concatenates all previous outputs instead of adding; much higher memory cost |
| Highway networks | Predecessor to ResNets; historical interest | Learned gating T(x) controls how much signal passes through; more parameters, no clear benefit over simple addition |
| No skip connection | Very shallow networks (< 5 layers) | Simpler, but gradient flow degrades rapidly with depth |
| Weighted residual | Fine-grained control over skip strength | y = F(x) + alpha * x with learnable alpha; used in some diffusion architectures (e.g. progressive training) |
| ReZero | Faster early training | y = x + alpha * F(x), with alpha initialised to 0; each layer starts as identity |
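The ReZero row above is a one-line change to the standard block; a minimal sketch (the 4x MLP shape mirrors the earlier examples and is an arbitrary choice):

```python
import torch
import torch.nn as nn

class ReZeroBlock(nn.Module):
    """ReZero: y = x + alpha * F(x), with alpha initialised to 0."""
    def __init__(self, d_model: int):
        super().__init__()
        self.alpha = nn.Parameter(torch.zeros(1))  # learnable scalar, starts at 0
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x):
        return x + self.alpha * self.mlp(x)        # identity at initialisation

block = ReZeroBlock(32)
x = torch.randn(4, 32)
y = block(x)
print(torch.equal(y, x))  # True: alpha = 0, so the block starts as the identity
```

Because every block starts as an exact identity, the network begins training as a shallow map and gradually "switches on" depth as each alpha grows.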
Historical Context
Residual connections were introduced by He et al. in “Deep Residual Learning for Image Recognition” (2015), which won the ImageNet competition by training a 152-layer CNN — dramatically deeper than anything before. The key insight was reframing each layer as learning a residual delta rather than a full transformation, which they showed eliminated the “degradation problem” where adding more layers to a sufficiently deep network actually increased training error.
The idea was quickly adopted by the transformer architecture (Vaswani et al., 2017), where it became even more critical — transformers stack dozens of attention and MLP blocks, and without residual connections, training diverges. The shift from “post-norm” (original transformer: normalise after the addition) to “pre-norm” (GPT-2 onward: normalise before the block) further improved training stability and is now the universal default. The concept has become so fundamental that modern architectures are often described in terms of their “residual stream” — the data flowing through the skip connections, modified incrementally by each block.