# Training Oscillation
Two adversarially coupled components chase each other without converging: the generator adapts to fool the discriminator, the discriminator adapts to catch the generator, and the cycle repeats without either reaching a stable equilibrium. Also called “oscillatory dynamics” or “non-convergence” in game-theoretic settings.
## Intuition
Imagine two chess players who can only learn by playing each other. Player A discovers a winning strategy. Player B adapts to counter it. Player A then adapts to counter the counter. Neither player ever settles into a stable strategy — they keep cycling through a sequence of moves and counter-moves. If they’re evenly matched, this cycle continues forever.
GANs are a two-player minimax game, and gradient descent on such games doesn’t have the same convergence guarantees as gradient descent on a single loss function. In a single-objective setting, the loss landscape has a clear “downhill” direction. In a minimax game, there may be no fixed point that both players converge to — the gradient field can rotate around the equilibrium rather than pointing toward it. Each player’s update makes the other player’s current state worse, and the combined dynamics spiral.
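This rotation shows up even in the simplest possible minimax problem. The sketch below (plain Python, purely illustrative; the function name is mine) runs simultaneous gradient descent/ascent on the bilinear game f(x, y) = x·y, whose only equilibrium is (0, 0). Instead of converging, the iterates circle the equilibrium and slowly spiral outward.

```python
# Simultaneous gradient descent/ascent on f(x, y) = x * y.
# x minimises f, y maximises f; the unique equilibrium is (0, 0).
import math

def simulate(steps=200, lr=0.1, x=1.0, y=0.0):
    radii = []
    for _ in range(steps):
        gx, gy = y, x                     # df/dx = y, df/dy = x
        x, y = x - lr * gx, y + lr * gy   # both players step at once
        radii.append(math.hypot(x, y))    # distance from equilibrium
    return radii

radii = simulate()
print(radii[0], radii[-1])  # the distance grows: a diverging spiral
```

A short calculation confirms it: each simultaneous step multiplies the squared distance from the origin by (1 + lr²), so the dynamics rotate around the equilibrium and drift away from it, never toward it.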
The problem is amplified when one player is much stronger than the other. If the discriminator is too powerful, it provides no useful gradient to the generator (gradients vanish or point in erratic directions). If the generator is too strong, the discriminator can’t keep up and the generator overfits to its current weaknesses.
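The vanishing-gradient half of this claim can be checked numerically. With the original minimax generator loss log(1 − D(G(z))), the gradient with respect to the discriminator’s logit collapses once D confidently rejects fakes; the standard non-saturating alternative −log D(G(z)) keeps a usable gradient. A minimal sketch (plain Python; the function names are illustrative):

```python
# Gradients of the two classic generator losses w.r.t. the
# discriminator logit on a fake sample.
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def grad_saturating(logit):
    # d/dlogit of log(1 - sigmoid(logit)) = -sigmoid(logit)
    return -sigmoid(logit)

def grad_non_saturating(logit):
    # d/dlogit of -log(sigmoid(logit)) = sigmoid(logit) - 1
    return sigmoid(logit) - 1.0

# A very negative logit means D confidently rejects the fake.
for logit in (-10.0, -2.0, 0.0):
    print(logit, grad_saturating(logit), grad_non_saturating(logit))
```

At logit = −10 the saturating loss gives a gradient of roughly −5e-5 (effectively no learning signal), while the non-saturating loss still gives a gradient near −1.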
## Manifestation
- Generator and discriminator losses oscillate in anti-phase — when one improves, the other degrades, and vice versa
- Generated sample quality fluctuates — good epochs alternate with bad epochs rather than steadily improving
- FID/IS metrics oscillate rather than decreasing monotonically
- No clear “convergence” — training can run for millions of steps without reaching a stable state
- Learning rate sensitivity: small changes in learning rate ratios between the two networks dramatically change training dynamics
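One rough way to quantify the anti-phase pattern above is to correlate the two loss curves: a strongly negative correlation suggests the players are trading progress rather than jointly converging. A toy sketch on synthetic curves (all names are illustrative, and real loss curves are far noisier than this):

```python
# Pearson correlation between two synthetic, perfectly anti-phase
# loss curves: when G's loss falls, D's rises, and vice versa.
import math

def pearson(a, b):
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = math.sqrt(sum((x - ma) ** 2 for x in a))
    vb = math.sqrt(sum((y - mb) ** 2 for y in b))
    return cov / (va * vb)

g_loss = [1.0 + 0.3 * math.sin(0.2 * t) for t in range(500)]
d_loss = [1.0 - 0.3 * math.sin(0.2 * t) for t in range(500)]
print(pearson(g_loss, d_loss))  # close to -1: anti-phase oscillation
```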
## Where It Appears
- GANs (gans/): the canonical setting — the G–D minimax game has oscillatory dynamics by nature; WGAN’s Wasserstein objective and hinge loss both help by providing smoother, more informative gradients
- Policy gradient (policy-gradient/): not directly adversarial, but the interplay between policy and value-function updates can create oscillation, especially when the value function can’t keep up with rapid policy changes
- Multi-agent RL: when multiple agents learn simultaneously, each agent’s environment is non-stationary (the other agents are changing), creating oscillatory dynamics similar to GANs
## Solutions at a Glance
| Solution | Mechanism | Where documented |
|---|---|---|
| WGAN / Wasserstein distance | Smoother loss landscape with less rotational dynamics | gans/ |
| Hinge loss | Saturates the discriminator at a margin, preventing it from becoming too strong | gans/, atomic-concepts/loss-functions/hinge-loss.md |
| Spectral normalisation | Constrains discriminator Lipschitz constant, balancing the two players | atomic-concepts/regularisation/spectral-normalisation.md |
| Two-timescale updates | Train discriminator multiple steps per generator step (or use different learning rates) | gans/ |
| Gradient penalty | Regularises the discriminator’s gradient, smoothing the loss landscape | atomic-concepts/regularisation/gradient-penalty.md |
| EMA of generator weights | Average generator weights over time to smooth oscillatory weight trajectories | atomic-concepts/optimisation-primitives/exponential-moving-average.md |
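As one concrete example from the table, EMA of generator weights takes only a few lines. The sketch below is a minimal, library-free illustration; the decay value and names are illustrative. The EMA copy changes slowly, averaging out the oscillatory weight trajectory, and samples are drawn from the EMA copy at evaluation time.

```python
# Exponential moving average over a list of weights. The raw weight
# oscillates around its "true" value 1.0; the EMA copy stays close to 1.0.
import math

def ema_update(ema_weights, weights, decay=0.99):
    return [decay * e + (1 - decay) * w
            for e, w in zip(ema_weights, weights)]

w_ema = [0.0]                                # EMA copy, warm-started at 0
for step in range(5000):
    raw = [1.0 + 0.5 * math.sin(step)]       # oscillating "trained" weight
    w_ema = ema_update(w_ema, raw)

print(w_ema[0])  # near 1.0, despite the raw weight swinging by ±0.5
```

The same idea underlies weight-averaging utilities in common frameworks; a higher decay gives a smoother but more slowly adapting average.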
## Historical Context
Oscillatory dynamics in adversarial training were observed from the earliest GAN experiments (Goodfellow et al., 2014). The theoretical analysis came from game theory: simultaneous gradient descent on minimax games was known to be non-convergent in general (the dynamics rotate around Nash equilibria rather than converging to them). Mescheder et al. (2018) provided a thorough spectral analysis showing that GAN training dynamics have eigenvalues with large imaginary components, which corresponds to rotational (oscillatory) dynamics. This understanding motivated the shift toward regularised objectives (WGAN-GP, spectral normalisation) that reduce the rotational component. The instability of adversarial training was ultimately one of the major motivations for the field’s migration toward diffusion models, which replace the adversarial game with a stable regression objective.