# Unified Diffusion Algorithm

## Introduction
The structure follows the same pattern. The training step is arguably the simplest of all four files — it’s literally “add noise, predict something, MSE.” The complexity lives entirely in sampling.
The key insight this file tries to make clear:
Training is identical across DDPM, DDIM, and v-prediction. DDPM and DDIM have the exact same training — both predict ε with MSE loss. They only differ at sampling time. This is surprisingly under-appreciated. You can train once and sample with either method.
The three prediction targets (ε, x_0, v) are mathematically interchangeable. The “RELATIONSHIP BETWEEN PREDICTION TARGETS” section at the bottom shows the algebra — you can convert freely between them. The choice matters only because it changes the loss landscape: ε-prediction gives a loss dominated by high-noise timesteps, v-prediction balances the difficulty evenly, and x_0-prediction emphasises low-noise timesteps.
Classifier-free guidance is the critical practical addition. It’s what turned diffusion from “nice research” into “DALL-E / Stable Diffusion.” The trick — train with random conditioning dropout, then extrapolate away from the unconditional prediction at inference — is elegant but doesn’t fit neatly into the core algorithm’s pluggable methods, so it’s a separate composable component (the same way EMA is).
The progression tells a similar story to the other files: DDPM solves the core problem (stable training, full distribution coverage), DDIM solves speed, v-prediction solves numerical stability, and classifier-free guidance solves controllability. Each one changes exactly one piece.
## Summary: What changes vs. what stays the same

### Always the same (training)
Section titled “Always the same (training)”- Sample random timestep t
- x_t = sqrt(alpha_bar_t) · x_0 + sqrt(1 − alpha_bar_t) · ε (add noise)
- loss = || target − model(x_t, t) ||² (MSE loss)
- Gradient step
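The four training steps above fit in a few lines. A minimal NumPy sketch for the ε-prediction case — `model`, `alpha_bar`, and `training_step` are illustrative names, not any particular library’s API:

```python
import numpy as np

def training_step(model, x0, alpha_bar, rng):
    """One diffusion training step: add noise, predict epsilon, MSE.

    alpha_bar is the per-timestep cumulative product of (1 - beta_t).
    Returning the loss stands in for the gradient step a real trainer
    would take.
    """
    T = len(alpha_bar)
    t = rng.integers(0, T)                            # sample random timestep t
    eps = rng.standard_normal(x0.shape)               # fresh Gaussian noise
    ab = alpha_bar[t]
    x_t = np.sqrt(ab) * x0 + np.sqrt(1 - ab) * eps    # forward noising
    pred = model(x_t, t)                              # network predicts epsilon
    return np.mean((eps - pred) ** 2)                 # plain MSE
```

Swapping the target from `eps` to `x0` or `v` changes only the last two lines; the noising step is untouched.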
### Always the same (sampling structure)

- Start from x_T ~ N(0, I)
- Loop: x_t → x_{t−1} via denoise_step() (PLUGGABLE)
- Return x_0
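The shared sampling skeleton can be sketched the same way. The `denoise_step` signature here is an assumption, chosen so both DDPM and DDIM steps slot in unchanged:

```python
import numpy as np

def sample(denoise_step, shape, T, rng):
    """Generic reverse loop; denoise_step is the pluggable part.

    Any update with signature (x_t, t, rng) -> x_{t-1} works here,
    stochastic (DDPM) or deterministic (DDIM) alike.
    """
    x = rng.standard_normal(shape)     # x_T ~ N(0, I)
    for t in range(T - 1, -1, -1):     # t = T-1, ..., 0
        x = denoise_step(x, t, rng)    # x_t -> x_{t-1}
    return x                           # approximate x_0
```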
### What varies by variant

| Variant | Prediction target | Sampling | Speed |
|---|---|---|---|
| DDPM | noise ε | Stochastic (+ σ·z) | T steps (slow) |
| DDIM | noise ε | Deterministic (η=0) | ~50 steps (fast) |
| v-prediction | velocity v | Deterministic | ~50 steps (fast) |
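The DDPM/DDIM split in the table comes down to a single σ term. A sketch of one denoise step (names assumed, not a library API): η = 0 gives the deterministic DDIM update, while η = 1 recovers DDPM-style stochastic sampling.

```python
import numpy as np

def ddim_step(x_t, eps_pred, ab_t, ab_prev, eta=0.0, rng=None):
    """One x_t -> x_{t-1} update; eta interpolates DDIM (0) and DDPM (1).

    ab_t, ab_prev are alpha_bar at the current and previous timesteps;
    eps_pred is the model's noise prediction.
    """
    # recover x_0 from the epsilon prediction
    x0_hat = (x_t - np.sqrt(1 - ab_t) * eps_pred) / np.sqrt(ab_t)
    # injected-noise scale (zero when eta = 0)
    sigma = eta * np.sqrt((1 - ab_prev) / (1 - ab_t)) * np.sqrt(1 - ab_t / ab_prev)
    # deterministic direction pointing toward x_{t-1}
    dir_xt = np.sqrt(np.clip(1 - ab_prev - sigma**2, 0.0, None)) * eps_pred
    noise = sigma * rng.standard_normal(x_t.shape) if eta > 0 else 0.0
    return np.sqrt(ab_prev) * x0_hat + dir_xt + noise
```

Because the step only reads `ab_t` and `ab_prev`, “previous” need not mean t−1 — passing a strided schedule of ~50 timesteps is exactly how DDIM skips steps.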
### Motives for each variant

| Variant | Problem Solved | Intuition for Solution |
|---|---|---|
| DDPM | GANs are unstable to train (mode collapse, training oscillation) and don’t cover the full data distribution | Instead of adversarial training, learn to reverse a simple noise process. Training is just MSE on noise prediction — as stable as any regression. Covers all modes because the forward process does |
| DDIM | DDPM needs T=1000 sequential steps to generate one sample — extremely slow | Rewrite the reverse process as a non-Markovian chain that gives the same marginals but allows skipping steps. 50 steps ≈ 1000-step quality. Determinism enables interpolation and inversion |
| v-prediction | ε-prediction is numerically unstable at extremes: at t≈0 the noise is tiny (hard to predict), at t≈T the signal is gone (prediction is meaningless) | Predict v = √ᾱ·ε − √(1−ᾱ)·x_0, which stays well-scaled at all timesteps. The network’s job is equally difficult at every t, preventing the loss from being dominated by certain timesteps |
| Cosine schedule | Linear schedule destroys coarse structure too early — the model wastes capacity denoising already-destroyed images | Shape the noise curve so ᾱ_t follows a cosine — gentle at first, steep at the end. Coarse structure survives longer, giving the model useful signal at more timesteps |
| Classifier-free guidance | Need conditional generation but classifier guidance needs a separate trained classifier and back-propagation through it at each sampling step | Train one model for both conditional and unconditional (random cond drop). At inference, extrapolate AWAY from unconditional: ε̃ = ε_u + w(ε_c − ε_u). w>1 amplifies the condition. Simple, no extra models needed |
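The guidance trick in the last row is tiny in code. A sketch of both halves (training-time conditioning dropout, inference-time extrapolation), with all names illustrative:

```python
import numpy as np

def maybe_drop_condition(cond, null_cond, p_drop, rng):
    """Training-time dropout: with probability p_drop, replace the
    condition with a 'null' token so one model learns both conditional
    and unconditional prediction."""
    return null_cond if rng.random() < p_drop else cond

def cfg_epsilon(eps_uncond, eps_cond, w):
    """Inference-time extrapolation away from the unconditional prediction:
    eps_tilde = eps_u + w * (eps_c - eps_u).
    w = 1 is plain conditional; w > 1 amplifies the condition."""
    return eps_uncond + w * (eps_cond - eps_uncond)
```

At sampling time this costs two forward passes per step (one conditional, one unconditional) but no extra model.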
## Relationship between prediction targets

Given: x_t = √ᾱ · x_0 + √(1−ᾱ) · ε
You can recover any target from any other:
- x_0 from ε: x_0 = (x_t − √(1−ᾱ)·ε) / √ᾱ
- ε from x_0: ε = (x_t − √ᾱ·x_0) / √(1−ᾱ)
- v from both: v = √ᾱ·ε − √(1−ᾱ)·x_0
- x_0 from v: x_0 = √ᾱ·x_t − √(1−ᾱ)·v
- ε from v: ε = √(1−ᾱ)·x_t + √ᾱ·v
The choice of target changes what the network learns to be good at — the mathematical content is equivalent, but the loss landscape and gradient magnitudes differ.
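The identities above are easy to sanity-check numerically; a quick NumPy sketch:

```python
import numpy as np

# Check the conversions under the forward relation
# x_t = sqrt(ab) * x0 + sqrt(1 - ab) * eps, with ab = alpha_bar_t.
rng = np.random.default_rng(0)
x0 = rng.standard_normal((4, 4))
eps = rng.standard_normal((4, 4))
ab = 0.37                          # any alpha_bar in (0, 1)
s, c = np.sqrt(ab), np.sqrt(1 - ab)

x_t = s * x0 + c * eps
v = s * eps - c * x0

assert np.allclose((x_t - c * eps) / s, x0)   # x_0 from eps
assert np.allclose((x_t - s * x0) / c, eps)   # eps from x_0
assert np.allclose(s * x_t - c * v, x0)       # x_0 from v
assert np.allclose(c * x_t + s * v, eps)      # eps from v
```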