Spectral Normalisation

Divides each weight matrix by its largest singular value (spectral norm) to constrain the Lipschitz constant of the network to 1. The key stabiliser for GAN discriminators, ensuring smooth gradients and preventing training collapse.

A function is Lipschitz-1 if its output changes by at most 1 unit for every 1 unit change in its input. For a neural network layer $f(\mathbf{x}) = W\mathbf{x}$, the maximum amount the output can change relative to the input is exactly the largest singular value $\sigma_1(W)$, the spectral norm. By dividing $W$ by $\sigma_1(W)$, each layer becomes Lipschitz-1, and the whole network is also Lipschitz-1, since it is a composition of Lipschitz-1 layers and the usual activations (ReLU, LeakyReLU) are themselves Lipschitz-1.
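To make the composition property concrete, here is a small NumPy check (an illustrative sketch, not from the original): two randomly initialised layers are each divided by their exact spectral norm, and the output of their composition then never moves more than the input does.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two layers, each divided by its exact spectral norm (largest singular value)
W1 = rng.standard_normal((64, 32))
W2 = rng.standard_normal((16, 64))
W1_bar = W1 / np.linalg.svd(W1, compute_uv=False)[0]
W2_bar = W2 / np.linalg.svd(W2, compute_uv=False)[0]

def f(x):
    # ReLU is itself Lipschitz-1, so the whole composition is Lipschitz-1
    return W2_bar @ np.maximum(W1_bar @ x, 0.0)

x1 = rng.standard_normal(32)
x2 = rng.standard_normal(32)

# Output change divided by input change: guaranteed <= 1
ratio = np.linalg.norm(f(x1) - f(x2)) / np.linalg.norm(x1 - x2)
print(ratio)
```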

Why does this matter for GANs? The Wasserstein distance between real and generated distributions requires the discriminator (critic) to be Lipschitz-continuous. WGAN’s original solution — hard weight clipping — works but has pathological behaviour: it pushes weights toward the clipping boundary, wasting capacity and causing gradients to either vanish or explode. Spectral normalisation achieves the Lipschitz constraint smoothly, without distorting the weight distribution.

The practical magic is that you don’t need to compute a full SVD every step. A single step of power iteration gives a good approximation of $\sigma_1$, and since weights change slowly between steps, the approximation stays accurate. This makes spectral normalisation nearly free: it adds only a couple of matrix-vector multiplies per layer per step.

Spectral norm of a weight matrix $W$:

$$\sigma_1(W) = \max_{\|\mathbf{v}\| = 1} \|W\mathbf{v}\| = \text{largest singular value of } W$$

Normalised weight:

$$\bar{W} = \frac{W}{\sigma_1(W)}$$

This ensures $\sigma_1(\bar{W}) = 1$, so the layer $\mathbf{x} \mapsto \bar{W}\mathbf{x}$ is Lipschitz-1.
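A quick NumPy sanity check of this identity (illustrative; it uses a full SVD rather than power iteration, since exactness is the point here):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((128, 64))

sigma1 = np.linalg.svd(W, compute_uv=False)[0]   # largest singular value
W_bar = W / sigma1

# The normalised matrix has spectral norm exactly 1 (up to float error)
print(np.linalg.svd(W_bar, compute_uv=False)[0])
```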

Power iteration (to approximate $\sigma_1$ without a full SVD):

Start with random $\mathbf{u}$, then iterate:

$$\mathbf{v} \leftarrow \frac{W^T \mathbf{u}}{\|W^T \mathbf{u}\|}, \quad \mathbf{u} \leftarrow \frac{W\mathbf{v}}{\|W\mathbf{v}\|}$$

After convergence, $\sigma_1(W) \approx \mathbf{u}^T W \mathbf{v}$. In practice, one iteration per training step is sufficient because $W$ changes slowly.
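The "one iteration per step" claim can be checked with a small simulation (a sketch with made-up drift dynamics, not from the original): while $W$ drifts slowly, the persistent vector $\mathbf{u}$ keeps tracking the top singular direction, so a single iteration per step suffices.

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.standard_normal((128, 256))
u = rng.standard_normal(128)

# Simulate training: small weight updates, ONE power iteration per "step"
for step in range(200):
    W += 0.01 * rng.standard_normal(W.shape)   # slow drift, like SGD updates
    v = W.T @ u
    v /= np.linalg.norm(v)
    u = W @ v
    u /= np.linalg.norm(u)

sigma_est = u @ W @ v                           # running estimate of sigma_1
sigma_true = np.linalg.svd(W, compute_uv=False)[0]
print(sigma_est / sigma_true)                   # close to 1: estimate tracks truth
```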

```python
import torch
import torch.nn as nn
from torch.nn.utils import spectral_norm

# ── Apply to a single layer ────────────────────────────────────
layer = spectral_norm(nn.Linear(256, 128))
# This wraps the layer: before each forward(), it computes the
# spectral norm via power iteration and normalises the weight.

# ── GAN discriminator (typical usage) ──────────────────────────
class Discriminator(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            spectral_norm(nn.Linear(784, 256)),
            nn.LeakyReLU(0.2),
            spectral_norm(nn.Linear(256, 256)),
            nn.LeakyReLU(0.2),
            spectral_norm(nn.Linear(256, 1)),
        )

    def forward(self, x):   # (B, 784)
        return self.net(x)  # (B, 1)

# WARNING: spectral_norm is typically applied ONLY to the
# discriminator, not the generator. Constraining the generator's
# Lipschitz constant can hurt sample quality.

# ── Remove spectral norm (e.g. for fine-tuning) ────────────────
layer = nn.utils.remove_spectral_norm(layer)
```

```python
import numpy as np

def power_iteration(W, u, n_iters=1):
    """
    Estimate the largest singular value of W via power iteration.
    W: (out, in) weight matrix
    u: (out,) left singular vector estimate (maintained across steps)
    Returns: sigma, updated u, v
    """
    for _ in range(n_iters):
        v = W.T @ u                          # (in,)
        v = v / (np.linalg.norm(v) + 1e-12)  # (in,) normalised
        u = W @ v                            # (out,)
        u = u / (np.linalg.norm(u) + 1e-12)  # (out,) normalised
    sigma = u @ W @ v                        # scalar: spectral norm
    return sigma, u, v

def spectral_norm_forward(W, u):
    """
    Normalise W by its spectral norm.
    W: (out, in) weight matrix
    u: (out,) persistent left singular vector
    Returns: normalised W, updated u
    """
    sigma, u_new, _ = power_iteration(W, u, n_iters=1)
    W_bar = W / sigma                        # (out, in), Lipschitz-1
    return W_bar, u_new
```
  • GAN discriminators (SNGAN, Miyato et al. 2018): the canonical application — stabilises WGAN and hinge-loss GANs without weight clipping or gradient penalty
  • BigGAN (Brock et al. 2019): spectral norm on both generator and discriminator, enabling stable training at large scale
  • StyleGAN family: relies primarily on R1 gradient penalty in the discriminator, with spectral normalisation variants appearing in some implementations
  • Diffusion model discriminators: when GANs are combined with diffusion (e.g. adversarial diffusion distillation)
  • Lipschitz-constrained networks: normalising flows, Wasserstein autoencoders, and any architecture requiring bounded gradients
| Alternative | When to use | Tradeoff |
|---|---|---|
| Gradient penalty (WGAN-GP) | When you need an input-space Lipschitz constraint | Per-sample gradient computation is expensive; 2-3x slower than spectral norm |
| Weight clipping | Original WGAN | Causes capacity underuse and gradient issues; spectral norm is strictly better |
| Layer normalisation | Transformers, RNNs | Normalises activations, not weights; doesn’t enforce Lipschitz but stabilises training |
| Orthogonal regularisation | When you want all singular values near 1 | Stronger constraint: preserves all singular values, not just the max. More expensive |
| R1 gradient penalty | StyleGAN-family discriminators | Penalises gradient norm on real data only; simpler and effective for some GAN variants |

Spectral normalisation was introduced by Miyato et al. (2018, “Spectral Normalization for Generative Adversarial Networks”). It solved a critical practical problem: WGAN needed a Lipschitz discriminator, but the original weight clipping solution was crude and caused well-documented training pathologies. Gradient penalty (WGAN-GP) was effective but expensive — requiring a backward pass per sample to compute input gradients.

The key contribution was recognising that power iteration — a classical numerical method — could cheaply approximate the spectral norm with a single iteration per step, since weights change slowly during training. This made Lipschitz-constrained discriminators essentially free. Spectral normalisation quickly became the default for GAN discriminators and enabled subsequent work like BigGAN to scale GANs to unprecedented resolutions.