Spectral Normalisation
Divides each weight matrix by its largest singular value (its spectral norm) to constrain the Lipschitz constant of the network to 1. The key stabiliser for GAN discriminators, ensuring smooth gradients and preventing training collapse.
Intuition
A function is Lipschitz-1 if its output changes by at most 1 unit for every 1-unit change in its input. For a linear layer $y = Wx$, the maximum factor by which the output can change relative to the input is exactly the largest singular value of $W$ — the spectral norm $\sigma(W)$. By dividing $W$ by $\sigma(W)$, each layer becomes Lipschitz-1, and the whole network (a composition of Lipschitz-1 layers) is also Lipschitz-1.
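This bound is easy to check numerically. The following standalone NumPy sketch (illustrative, not part of any library API) verifies that a linear map's output change is bounded by its largest singular value times the input change, and that dividing by that value yields a Lipschitz-1 map:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(128, 256))      # a linear layer y = W x
sigma = np.linalg.norm(W, 2)         # spectral norm = largest singular value

# For any two inputs: ||W x1 - W x2|| <= sigma * ||x1 - x2||
x1, x2 = rng.normal(size=256), rng.normal(size=256)
lhs = np.linalg.norm(W @ x1 - W @ x2)
rhs = sigma * np.linalg.norm(x1 - x2)
assert lhs <= rhs + 1e-9

# After dividing by sigma, the layer's spectral norm is exactly 1,
# so the map is Lipschitz-1.
W_sn = W / sigma
assert abs(np.linalg.norm(W_sn, 2) - 1.0) < 1e-9
```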
Why does this matter for GANs? The Wasserstein distance between real and generated distributions requires the discriminator (critic) to be Lipschitz-continuous. WGAN’s original solution — hard weight clipping — works but has pathological behaviour: it pushes weights toward the clipping boundary, wasting capacity and causing gradients to either vanish or explode. Spectral normalisation achieves the Lipschitz constraint smoothly, without distorting the weight distribution.
The practical magic is that you don’t need to compute a full SVD every step. A single step of power iteration gives a good approximation of $\sigma(W)$, and since weights change slowly between steps, the approximation stays accurate. This makes spectral normalisation nearly free — it adds only a few matrix-vector multiplies per layer per step.
Spectral norm of a weight matrix $W$:

$$\sigma(W) = \max_{h \neq 0} \frac{\|Wh\|_2}{\|h\|_2}$$

This is the largest singular value of $W$.

Normalised weight:

$$\bar{W}_{\text{SN}} = \frac{W}{\sigma(W)}$$

This ensures $\sigma(\bar{W}_{\text{SN}}) = 1$, so the layer is Lipschitz-1.
Power iteration (to approximate $\sigma(W)$ without a full SVD):

Start with a random $u$, then iterate:

$$v \leftarrow \frac{W^\top u}{\|W^\top u\|_2}, \qquad u \leftarrow \frac{W v}{\|W v\|_2}$$

After convergence: $\sigma(W) \approx u^\top W v$. In practice, one iteration per training step is sufficient because $W$ changes slowly.
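The “one iteration per step” claim can be illustrated with a small NumPy simulation (all names here are illustrative): a warm-started single power-iteration step tracks the spectral norm of a slowly drifting matrix, mimicking weights during training.

```python
import numpy as np

rng = np.random.default_rng(0)

def power_iter_step(W, u):
    """One power-iteration step; returns (sigma estimate, updated u)."""
    v = W.T @ u
    v = v / (np.linalg.norm(v) + 1e-12)
    u = W @ v
    u = u / (np.linalg.norm(u) + 1e-12)
    return u @ W @ v, u

W = rng.normal(size=(64, 128))
u = rng.normal(size=64)

# Simulate training: W drifts slowly, and we run only ONE power
# iteration per "update". The persistent u keeps the estimate
# close to the true spectral norm throughout.
for _ in range(200):
    W += 0.001 * rng.normal(size=W.shape)   # small weight update
    sigma_est, u = power_iter_step(W, u)

sigma_true = np.linalg.norm(W, 2)           # exact, via full SVD
print(sigma_est / sigma_true)               # close to 1
```

Note that the estimate approaches the true value from below, since $u^\top W v \le \sigma(W)$ for any unit vectors $u, v$.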
```python
import torch
import torch.nn as nn
from torch.nn.utils import spectral_norm

# ── Apply to a single layer ────────────────────────────────────
layer = spectral_norm(nn.Linear(256, 128))
# This wraps the layer: before each forward(), it computes the
# spectral norm via power iteration and normalises the weight.

# ── GAN discriminator (typical usage) ──────────────────────────
class Discriminator(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            spectral_norm(nn.Linear(784, 256)),
            nn.LeakyReLU(0.2),
            spectral_norm(nn.Linear(256, 256)),
            nn.LeakyReLU(0.2),
            spectral_norm(nn.Linear(256, 1)),
        )

    def forward(self, x):   # (B, 784)
        return self.net(x)  # (B, 1)

# WARNING: spectral_norm is typically applied ONLY to the
# discriminator, not the generator. Constraining the generator's
# Lipschitz constant can hurt sample quality.

# ── Remove spectral norm (e.g. for fine-tuning) ────────────────
layer = nn.utils.remove_spectral_norm(layer)
```

Manual Implementation
```python
import numpy as np

def power_iteration(W, u, n_iters=1):
    """
    Estimate the largest singular value of W via power iteration.
    W: (out, in) weight matrix
    u: (out,) left singular vector estimate (maintained across steps)
    Returns: sigma, updated u, v
    """
    for _ in range(n_iters):
        v = W.T @ u                          # (in,)
        v = v / (np.linalg.norm(v) + 1e-12)  # (in,) normalised
        u = W @ v                            # (out,)
        u = u / (np.linalg.norm(u) + 1e-12)  # (out,) normalised
    sigma = u @ W @ v                        # scalar — spectral norm
    return sigma, u, v

def spectral_norm_forward(W, u):
    """
    Normalise W by its spectral norm.
    W: (out, in) weight matrix
    u: (out,) persistent left singular vector
    Returns: normalised W, updated u
    """
    sigma, u_new, _ = power_iteration(W, u, n_iters=1)
    W_bar = W / sigma                        # (out, in) — Lipschitz-1
    return W_bar, u_new
```
Popular Uses

- GAN discriminators (SNGAN, Miyato et al. 2018): the canonical application — stabilises WGAN and hinge-loss GANs without weight clipping or gradient penalty
- BigGAN (Brock et al. 2019): spectral norm on both generator and discriminator, enabling stable training at large scale
- StyleGAN-family models: some variants apply spectral normalisation in the discriminator
- Diffusion model discriminators: when GANs are combined with diffusion (e.g. adversarial diffusion distillation)
- Lipschitz-constrained networks: normalising flows, Wasserstein autoencoders, and any architecture requiring bounded gradients
Alternatives
| Alternative | When to use | Tradeoff |
|---|---|---|
| Gradient penalty (WGAN-GP) | When you need input-space Lipschitz constraint | Per-sample gradient computation is expensive; 2-3x slower than spectral norm |
| Weight clipping | Original WGAN | Causes capacity underuse and gradient issues; spectral norm is strictly better |
| Layer normalisation | Transformers, RNNs | Normalises activations not weights; doesn’t enforce Lipschitz but stabilises training |
| Orthogonal regularisation | When you want all singular values near 1 | Stronger constraint — preserves all singular values, not just the max. More expensive |
| R1 gradient penalty | StyleGAN-family discriminators | Penalises gradient norm on real data only; simpler and effective for some GAN variants |
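To make the contrast with orthogonal regularisation concrete, here is a minimal NumPy sketch of the generic penalty $\|W^\top W - I\|_F^2$ (BigGAN uses a relaxed variant of this). The penalty is zero only when all singular values are 1, whereas spectral normalisation rescales only the largest:

```python
import numpy as np

def orth_penalty(W):
    """Orthogonal regularisation: || W^T W - I ||_F^2.
    Zero iff the columns of W are orthonormal, i.e. ALL singular
    values of W equal 1 -- a stronger condition than sigma(W) = 1."""
    k = W.shape[1]
    G = W.T @ W - np.eye(k)
    return np.sum(G ** 2)

rng = np.random.default_rng(0)

# A matrix with orthonormal columns gets (near-)zero penalty...
Q, _ = np.linalg.qr(rng.normal(size=(64, 32)))
print(orth_penalty(Q))            # ~0

# ...while spectral normalisation fixes only the LARGEST singular
# value at 1, leaving the rest untouched, so the penalty stays large.
W = rng.normal(size=(64, 32))
W_sn = W / np.linalg.norm(W, 2)   # spectrally normalised
print(orth_penalty(W_sn))         # clearly nonzero
```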
Historical Context
Spectral normalisation was introduced by Miyato et al. (2018, “Spectral Normalization for Generative Adversarial Networks”). It solved a critical practical problem: WGAN needed a Lipschitz discriminator, but the original weight clipping solution was crude and caused well-documented training pathologies. Gradient penalty (WGAN-GP) was effective but expensive — requiring a backward pass per sample to compute input gradients.
The key contribution was recognising that power iteration — a classical numerical method — could cheaply approximate the spectral norm with a single iteration per step, since weights change slowly during training. This made Lipschitz-constrained discriminators essentially free. Spectral normalisation quickly became the default for GAN discriminators and enabled subsequent work like BigGAN to scale GANs to unprecedented resolutions.