
Gradient Penalty

Adds a penalty on the norm of the discriminator’s gradients with respect to its inputs: $\lambda (\|\nabla_{\hat{x}} D(\hat{x})\| - 1)^2$. Enforces the Lipschitz constraint required by Wasserstein GANs without weight clipping. The defining feature of WGAN-GP.

The Wasserstein distance requires the discriminator (critic) to be 1-Lipschitz — its output can’t change faster than its input. Mathematically, this means $\|\nabla_x D(x)\| \leq 1$ everywhere. Rather than constraining the weights directly (clipping, spectral norm), gradient penalty enforces this constraint where it matters: in the input space, on the data the discriminator actually sees.

The penalty is computed on interpolated points $\hat{x} = \epsilon x_{\text{real}} + (1-\epsilon) x_{\text{fake}}$ — random points along straight lines between real and fake samples. The theory says the optimal WGAN critic has gradient norm exactly 1 almost everywhere along these lines, so the penalty targets $(\|\nabla D(\hat{x})\| - 1)^2$ rather than just $\|\nabla D(\hat{x})\|^2$. This two-sided penalty pushes gradients toward 1, not just below 1.
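The two-sided behaviour can be checked on a toy linear critic $D(x) = w \cdot x$, whose gradient with respect to $x$ is $w$ everywhere, so the penalty depends only on $\|w\|$ (a sketch; the linear critic and $\lambda = 10$ are just for illustration):

```python
import numpy as np

def toy_penalty(w, lambda_gp=10.0):
    # For D(x) = w . x the gradient w.r.t. x is w everywhere,
    # so the penalty reduces to lambda * (||w|| - 1)^2.
    grad_norm = np.linalg.norm(w)
    return lambda_gp * (grad_norm - 1.0) ** 2

print(toy_penalty(np.array([1.0, 0.0])))  # ||w|| = 1   -> 0.0
print(toy_penalty(np.array([2.0, 0.0])))  # ||w|| = 2   -> 10.0
print(toy_penalty(np.array([0.5, 0.0])))  # ||w|| = 0.5 -> 2.5  (penalised from below too)
```

Norms both above and below 1 are penalised, which is what distinguishes the two-sided penalty from a plain gradient-magnitude penalty.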

The cost: you must compute gradients of the discriminator output with respect to its input, which requires a second backward pass through the discriminator. This makes each step roughly 2-3x more expensive than standard training or spectral normalisation.

Gradient penalty term:

$$\mathcal{L}_{\text{GP}} = \lambda \, \mathbb{E}_{\hat{x} \sim \mathcal{P}_{\hat{x}}} \left[ \left( \|\nabla_{\hat{x}} D(\hat{x})\|_2 - 1 \right)^2 \right]$$

where $\hat{x}$ is sampled along lines between real and fake data:

$$\hat{x} = \epsilon \, x_{\text{real}} + (1 - \epsilon) \, x_{\text{fake}}, \quad \epsilon \sim \text{Uniform}(0, 1)$$

Full WGAN-GP discriminator loss:

$$\mathcal{L}_D = \underbrace{D(x_{\text{fake}}) - D(x_{\text{real}})}_{\text{Wasserstein estimate}} + \underbrace{\lambda \left( \|\nabla_{\hat{x}} D(\hat{x})\|_2 - 1 \right)^2}_{\text{gradient penalty}}$$

Typical $\lambda$: 10 (from the original paper, rarely changed).

R1 penalty (simplified variant, used in StyleGAN):

$$\mathcal{L}_{\text{R1}} = \frac{\gamma}{2} \, \mathbb{E}_{x_{\text{real}}} \left[ \|\nabla_x D(x)\|_2^2 \right]$$

Applied only on real data, penalises gradient magnitude (not the deviation from 1). Simpler and sufficient for non-Wasserstein GANs.
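A minimal PyTorch sketch of the R1 penalty, assuming a discriminator that maps `(B, d)` inputs to per-sample scores; `gamma=10.0` is a placeholder default (StyleGAN-family models tune $\gamma$ per dataset):

```python
import torch

def r1_penalty(discriminator, real, gamma=10.0):
    # Penalty on real data only -- no interpolation, no "-1" target
    real = real.detach().requires_grad_(True)
    d_out = discriminator(real)
    # Gradient of summed scores w.r.t. the real inputs
    grads = torch.autograd.grad(d_out.sum(), real, create_graph=True)[0]
    # gamma/2 * E[ ||grad_x D(x)||^2 ]
    return (gamma / 2) * grads.reshape(real.size(0), -1).pow(2).sum(dim=1).mean()
```

Note that the squared norm is penalised directly: unlike WGAN-GP, gradients are pushed toward 0 on real data, not toward norm 1.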

import torch
import torch.autograd as autograd

def gradient_penalty(discriminator, real, fake, device, lambda_gp=10.0):
    """
    Compute WGAN-GP gradient penalty.
    real: (B, *) real samples
    fake: (B, *) generated samples (detached)
    """
    B = real.size(0)
    # Random interpolation factor per sample
    eps = torch.rand(B, *([1] * (real.dim() - 1)), device=device)  # (B, 1, ..., 1)
    interpolated = (eps * real + (1 - eps) * fake).requires_grad_(True)  # (B, *)
    d_out = discriminator(interpolated)  # (B, 1)
    # Compute gradients of D output w.r.t. interpolated input
    grads = autograd.grad(
        outputs=d_out,
        inputs=interpolated,
        grad_outputs=torch.ones_like(d_out),
        create_graph=True,  # need second-order gradients for backprop
        retain_graph=True,
    )[0]  # (B, *)
    grads = grads.view(B, -1)  # (B, flat_dim)
    penalty = ((grads.norm(2, dim=1) - 1) ** 2).mean()  # scalar
    return lambda_gp * penalty

# WARNING: create_graph=True is essential — without it, the penalty
# term won't receive gradients and the constraint won't be enforced.
# WARNING: Do NOT use batch normalisation in the discriminator with
# gradient penalty. BatchNorm creates dependencies between samples in
# a batch, making the per-sample gradient computation invalid.
# Use layer norm or instance norm instead.
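To show where the penalty sits in training, here is a minimal sketch of one critic update. The critic, generator, optimiser, and all shapes are illustrative stand-ins, and the penalty is repeated in compact form so the sketch is self-contained:

```python
import torch
import torch.nn as nn

def gp(D, real, fake, lambda_gp=10.0):
    # Compact restatement of the gradient penalty above
    eps = torch.rand(real.size(0), *([1] * (real.dim() - 1)))
    x_hat = (eps * real + (1 - eps) * fake).requires_grad_(True)
    grads = torch.autograd.grad(D(x_hat).sum(), x_hat, create_graph=True)[0]
    return lambda_gp * ((grads.flatten(1).norm(2, dim=1) - 1) ** 2).mean()

# Illustrative critic/generator; real architectures will differ
critic = nn.Sequential(nn.Linear(8, 32), nn.LeakyReLU(0.2), nn.Linear(32, 1))
generator = nn.Linear(4, 8)
opt_D = torch.optim.Adam(critic.parameters(), lr=1e-4, betas=(0.0, 0.9))

real_batch = torch.randn(16, 8)
fake = generator(torch.randn(16, 4)).detach()  # critic step: no grads into G

# Wasserstein estimate + gradient penalty (matches the loss above)
d_loss = critic(fake).mean() - critic(real_batch).mean()
d_loss = d_loss + gp(critic, real_batch, fake)

opt_D.zero_grad()
d_loss.backward()
opt_D.step()
```

The `d_loss.backward()` call is where the second backward pass through the critic happens, which is the source of the 2-3x cost mentioned earlier.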
import numpy as np

def gradient_penalty_manual(D_grad_fn, real, fake, lambda_gp=10.0):
    """
    Compute WGAN-GP gradient penalty (forward only, no autograd).
    D_grad_fn: function returning dD/dx at a given x (B, d) -> (B, d)
    real: (B, d) real samples
    fake: (B, d) fake samples
    """
    B, d = real.shape
    eps = np.random.uniform(0, 1, size=(B, 1))  # (B, 1)
    interpolated = eps * real + (1 - eps) * fake  # (B, d)
    grads = D_grad_fn(interpolated)  # (B, d)
    grad_norms = np.sqrt((grads ** 2).sum(axis=1) + 1e-12)  # (B,)
    penalty = ((grad_norms - 1) ** 2).mean()  # scalar
    return lambda_gp * penalty
  • WGAN-GP (Gulrajani et al. 2017): the defining application — replaced weight clipping as the standard way to enforce the Lipschitz constraint
  • Progressive GAN (Karras et al. 2018): used GP for stable training during progressive resolution increases
  • R1 penalty in StyleGAN (Karras et al. 2019): simplified gradient penalty on real data only, became standard for style-based generators
  • Wasserstein autoencoders (WAE): gradient penalty on the discriminator in the latent space
  • Domain adaptation: gradient penalty on domain discriminators to enforce smooth class boundaries
| Alternative | When to use | Tradeoff |
| --- | --- | --- |
| Spectral normalisation | Default for GAN discriminators | Much cheaper (no second backward pass); constrains weight space, not input space |
| Weight clipping | Never (superseded) | Causes capacity underuse, vanishing/exploding gradients; strictly worse |
| R1 penalty | Non-Wasserstein GANs (StyleGAN) | Simpler — only penalises on real data, no interpolation; doesn’t enforce Lipschitz-1 exactly |
| DRAGAN penalty | Alternative gradient regularisation | Penalises around real data only, with added noise; less theoretically motivated but sometimes effective |
| Consistency regularisation | When you want smooth D outputs for similar inputs | Penalises output differences directly rather than gradients; no second backward pass |
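For contrast, the spectral-normalisation alternative is a one-line wrapper per weight layer, enforcing the constraint in weight space with no extra backward pass (a sketch; layer sizes are illustrative):

```python
import torch
import torch.nn as nn
from torch.nn.utils import spectral_norm

# Each wrapped layer has its weight matrix rescaled by its largest
# singular value, bounding the layer's Lipschitz constant.
critic = nn.Sequential(
    spectral_norm(nn.Linear(8, 32)),
    nn.LeakyReLU(0.2),
    spectral_norm(nn.Linear(32, 1)),
)
scores = critic(torch.randn(4, 8))  # (4, 1)
```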

Gradient penalty was introduced by Gulrajani et al. (2017, “Improved Training of Wasserstein GANs”) as a direct fix for the problems with weight clipping in the original WGAN (Arjovsky et al. 2017). Weight clipping caused the critic to learn very simple functions (weights pushed to the clip boundary), wasting model capacity. Gradient penalty solved this elegantly by enforcing the constraint in input space rather than weight space.

The paper’s choice of λ=10\lambda = 10 and interpolation between real and fake samples became standard without much subsequent tuning. However, the computational cost — roughly 2-3x per discriminator step due to the second backward pass — motivated the development of cheaper alternatives. Spectral normalisation (Miyato et al. 2018) largely replaced gradient penalty for standard GAN training, while the simpler R1 penalty (Mescheder et al. 2018) became the default for StyleGAN-family models. Gradient penalty remains important conceptually as the clearest implementation of input-space Lipschitz regularisation.