Residual Connections
Adding the input of a block to its output: y = F(x) + x. Also called “skip connections.” The single most important architectural innovation for training deep networks — used in every modern architecture from ResNet to GPT to Stable Diffusion.
Intuition
Without residual connections, a 100-layer network must learn the entire transformation from input to output as one long chain of matrix multiplications. Gradients must flow backwards through every layer, and they tend to either vanish (multiply by numbers < 1 repeatedly) or explode (multiply by numbers > 1 repeatedly). By the time the gradient reaches early layers, it’s either negligibly small or catastrophically large.
Residual connections fix this by giving the gradient a highway. The addition y = F(x) + x means the gradient of y with respect to x is dF/dx + I, where I is the identity matrix. That “+ I” term means the gradient always has a direct path back to earlier layers, regardless of what F does. Even if dF/dx vanishes entirely, the gradient still flows through the identity branch. This is why you can train networks with hundreds or thousands of layers.
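This is easy to demonstrate directly: backpropagate through a deep stack of layers with and without the skip connection, and compare the gradient norm that reaches the input. A minimal PyTorch sketch (the depth, width, and tanh nonlinearity are arbitrary choices for illustration):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
depth, dim = 50, 64
layers = [nn.Linear(dim, dim) for _ in range(depth)]

def input_grad_norm(residual: bool) -> float:
    """Gradient norm at the input after backprop through `depth` layers."""
    x = torch.randn(8, dim, requires_grad=True)
    h = x
    for layer in layers:
        out = torch.tanh(layer(h))
        h = h + out if residual else out   # the only difference
    h.sum().backward()
    return x.grad.norm().item()

plain_norm = input_grad_norm(residual=False)
res_norm = input_grad_norm(residual=True)
print(f"plain:    {plain_norm:.3e}")   # vanishingly small
print(f"residual: {res_norm:.3e}")     # healthy, thanks to the identity path
```

With 50 plain layers the gradient norm collapses by many orders of magnitude; with the skip connections it stays on the order of 1, because the identity branch carries it through unchanged.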
There’s a deeper insight: residual connections change what each layer learns. Instead of learning the full mapping H(x), each layer only needs to learn the residual F(x) = H(x) - x, i.e. the delta from the identity. If the optimal transformation is close to identity (which it often is in deep networks), learning a small residual is much easier than learning the full mapping from scratch. This is why the paper is called “Deep Residual Learning.”
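One way to see this: if F’s weights are zero, the block is exactly the identity, and small weights give a small delta from the identity. A numpy sketch (the ReLU layer and the 0.01 weight scale are illustrative assumptions):

```python
import numpy as np

def residual_layer(x, W, b):
    return x + np.maximum(0.0, x @ W + b)   # y = F(x) + x, with a ReLU layer as F

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 16))

# F = 0  =>  the block is exactly the identity
W0, b0 = np.zeros((16, 16)), np.zeros(16)
assert np.allclose(residual_layer(x, W0, b0), x)

# Small weights  =>  output stays close to the input (a small residual delta)
Ws, bs = 0.01 * rng.normal(size=(16, 16)), np.zeros(16)
delta = np.abs(residual_layer(x, Ws, bs) - x).max()
print(f"max deviation from identity: {delta:.4f}")
```

Without the skip connection, the zero-weight network would instead map everything to zero — the worst possible starting point.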
Residual block:

y = F(x) + x

where F is any parameterised function (one or more layers). The gradient flows as:

dy/dx = dF/dx + I

The + I term guarantees gradient flow regardless of dF/dx.

With a projection (when input and output dimensions differ):

y = F(x) + W_s x

where W_s is a linear projection to match dimensions. In practice this is a 1x1 convolution (ResNet) or a linear layer (transformers).

Pre-norm variant (the modern standard, used in GPT, LLaMA):

y = x + F(LayerNorm(x))

Normalisation is applied before the block, so the residual stream stays unnormalised. This is more stable for very deep networks because the residual stream isn’t repeatedly normalised.
```python
import torch
import torch.nn as nn

# ── Basic residual block ────────────────────────────────────────
class ResidualBlock(nn.Module):
    def __init__(self, d_model: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),  # expand
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),  # project back
        )

    def forward(self, x):                  # x: (B, T, d_model)
        return x + self.mlp(self.norm(x))  # (B, T, d_model)
        # That's it. The entire point is the "+ x".

# ── With dimension mismatch (e.g. ResNet downsampling) ──────────
class ResidualBlockWithProjection(nn.Module):
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.block = nn.Sequential(
            nn.Linear(in_dim, out_dim),
            nn.ReLU(),
            nn.Linear(out_dim, out_dim),
        )
        # Project the skip path to match output dimension
        self.shortcut = nn.Linear(in_dim, out_dim) if in_dim != out_dim else nn.Identity()

    def forward(self, x):                        # x: (B, in_dim)
        return self.block(x) + self.shortcut(x)  # (B, out_dim)
```

Manual Implementation
```python
import numpy as np

def residual_block_forward(x, W1, b1, W2, b2):
    """
    Pre-norm residual MLP block, numpy only.
    x: (B, D) input
    W1: (D, 4D), b1: (4D,), W2: (4D, D), b2: (D,)
    Returns: (B, D)
    """
    # Layer norm (simplified: per-sample, per-feature)
    mu = x.mean(axis=-1, keepdims=True)      # (B, 1)
    var = x.var(axis=-1, keepdims=True)      # (B, 1)
    x_norm = (x - mu) / np.sqrt(var + 1e-5)  # (B, D)

    # MLP: expand -> GELU -> project back
    h = x_norm @ W1 + b1                     # (B, 4D)
    h = h * 0.5 * (1.0 + np.tanh(            # GELU approximation
        np.sqrt(2.0 / np.pi) * (h + 0.044715 * h ** 3)
    ))                                       # (B, 4D)
    out = h @ W2 + b2                        # (B, D)

    return x + out  # <-- the residual connection: input + block output
```

Popular Uses
- Transformers (GPT, LLaMA, BERT, ViT): every attention and MLP sub-layer uses a residual connection. The “residual stream” is the backbone of transformer computation (see transformer/)
- ResNet (He et al., 2015): the original application — enabled training 152-layer CNNs for ImageNet, up from ~20 layers without skip connections
- Diffusion U-Nets (see diffusion/): skip connections between encoder and decoder at matching resolutions, plus residual blocks within each resolution level
- Policy and value networks (see policy-gradient/, q-learning/): deeper RL networks use residual blocks to stabilise training
- GANs (see gans/): both generators and discriminators in modern GANs (StyleGAN, BigGAN) use residual blocks extensively
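To make the transformer case concrete, here is a minimal pre-norm block in PyTorch with its two residual connections, one around attention and one around the MLP (a sketch; the d_model and n_heads values are arbitrary):

```python
import torch
import torch.nn as nn

class PreNormTransformerBlock(nn.Module):
    """Pre-norm block: the residual stream x is only ever added to."""
    def __init__(self, d_model: int = 64, n_heads: int = 4):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x):                                  # x: (B, T, d_model)
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]  # residual 1: attention
        x = x + self.mlp(self.norm2(x))                    # residual 2: MLP
        return x

block = PreNormTransformerBlock()
x = torch.randn(2, 10, 64)
y = block(x)
print(y.shape)  # torch.Size([2, 10, 64])
```

Stacking N of these blocks gives the skeleton of a GPT-style model: the shape of the residual stream never changes, each block only adds a delta to it.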
Alternatives
Section titled “Alternatives”| Alternative | When to use | Tradeoff |
|---|---|---|
| Dense connections (DenseNet) | Need maximum feature reuse in CNNs | Concatenates all previous outputs instead of adding; much higher memory cost |
| Highway networks | Predecessor to ResNets; historical interest | Learned gating T(x) controls how much signal passes through; more parameters, no clear benefit over simple addition |
| No skip connection | Very shallow networks (< 5 layers) | Simpler, but gradient flow degrades rapidly with depth |
| Weighted residual | Fine-grained control over skip strength | y = F(x) + alpha * x with learnable alpha; used in some diffusion architectures (e.g. progressive training) |
| ReZero | Faster early training | y = x + alpha * F(x), with alpha initialised to 0; each layer starts as identity |
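The ReZero row above is a one-line change to the standard block; a minimal sketch (the 4x MLP shape mirrors the earlier examples and is an arbitrary choice):

```python
import torch
import torch.nn as nn

class ReZeroBlock(nn.Module):
    """ReZero: y = x + alpha * F(x), with alpha initialised to 0."""
    def __init__(self, d_model: int):
        super().__init__()
        self.alpha = nn.Parameter(torch.zeros(1))  # learnable scalar, starts at 0
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x):
        return x + self.alpha * self.mlp(x)        # identity at initialisation

block = ReZeroBlock(32)
x = torch.randn(4, 32)
y = block(x)
print(torch.equal(y, x))  # True: alpha = 0, so the block starts as the identity
```

Because every block starts as an exact identity, the network begins training as a shallow map and gradually "switches on" depth as each alpha grows.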
Historical Context
Residual connections were introduced by He et al. in “Deep Residual Learning for Image Recognition” (2015), which won the ImageNet competition by training a 152-layer CNN — dramatically deeper than anything before. The key insight was reframing each layer as learning a residual delta rather than a full transformation, which they showed eliminated the “degradation problem” where adding more layers to a sufficiently deep network actually increased training error.
The idea was quickly adopted by the transformer architecture (Vaswani et al., 2017), where it became even more critical — transformers stack dozens of attention and MLP blocks, and without residual connections, training diverges. The shift from “post-norm” (original transformer: normalise after the addition) to “pre-norm” (GPT-2 onward: normalise before the block) further improved training stability and is now the universal default. The concept has become so fundamental that modern architectures are often described in terms of their “residual stream” — the data flowing through the skip connections, modified incrementally by each block.