Cosine Annealing
Decaying the learning rate along a cosine curve from a maximum to a minimum over a fixed number of steps. It provides a smooth, gradual decay that spends more time at moderate learning rates than linear or step schedules, and is often combined with warm restarts (SGDR) for cyclic training.
Intuition
Think of training as exploring a loss landscape. Early on, you want a high learning rate to cross ridges and escape bad basins. Late in training, you want a low learning rate to settle into a sharp minimum. The question is: how do you transition?
Step decay (divide the LR by 10 at epochs 30, 60, and 90) creates jarring transitions — the model is suddenly learning 10x slower. Linear decay wastes time at very low learning rates where progress is negligible. Cosine annealing is the sweet spot: it decays slowly at first (spending time near the peak where learning is fast), accelerates through the middle, then slows again near zero (giving the model a long, gentle landing).
The cosine shape is not derived from any optimality principle — it just works well empirically. The key property is that it’s smooth and concave in the first half: you stay near the high learning rate longer than linear decay would, squeezing more useful optimization out of those steps.
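The concavity claim is easy to check numerically. A minimal sketch (helper names are ours, and lr_min is taken as 0 for simplicity) comparing the fraction of the peak LR each schedule retains at fraction p of training:

```python
import math

def cosine_frac(p):
    """Fraction of peak LR under cosine decay at training fraction p (lr_min = 0)."""
    return 0.5 * (1 + math.cos(math.pi * p))

def linear_frac(p):
    """Fraction of peak LR under linear decay at training fraction p (lr_min = 0)."""
    return 1.0 - p

for p in (0.1, 0.25, 0.5, 0.75, 0.9):
    print(f"p={p:.2f}  cosine={cosine_frac(p):.3f}  linear={linear_frac(p):.3f}")
```

In the first half cosine stays above linear (e.g. at p = 0.25 it retains about 85% of the peak vs. 75% for linear); the ordering flips after the midpoint, where both schedules cross at 0.5.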
Warm restarts take this further: instead of one long cosine, use several shorter cosines back-to-back, snapping the learning rate back to the maximum each time. Each restart lets the optimizer escape a local minimum it may have settled into, then re-converge to a (potentially better) one.
Standard cosine annealing (step $t$, total steps $T$):

$$\eta_t = \eta_{\min} + \frac{1}{2}\left(\eta_{\max} - \eta_{\min}\right)\left(1 + \cos\left(\frac{\pi t}{T}\right)\right)$$

At $t = 0$: $\eta_t = \eta_{\max}$. At $t = T$: $\eta_t = \eta_{\min}$. The curve is symmetric around $t = T/2$.

With warm restarts (SGDR, Loshchilov & Hutter, 2017), where $T_i$ is the restart period for the $i$-th cycle:

$$\eta_t = \eta_{\min} + \frac{1}{2}\left(\eta_{\max} - \eta_{\min}\right)\left(1 + \cos\left(\frac{\pi T_{\mathrm{cur}}}{T_i}\right)\right)$$

where $T_{\mathrm{cur}}$ is the number of steps since the last restart. It is common to double the period each cycle: $T_{i+1} = 2T_i$.
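With period doubling, the restart times are easy to compute. A quick sketch (variable names are ours) for the common setting of a 10k-step first cycle with a multiplier of 2:

```python
# Cycle lengths and cumulative restart boundaries for t_0 = 10_000, t_mult = 2.
t_0, t_mult, n_cycles = 10_000, 2, 4
lengths = [t_0 * t_mult**i for i in range(n_cycles)]       # length of each cycle
boundaries = [sum(lengths[:i + 1]) for i in range(n_cycles)]  # step at which each restart fires
print(lengths)     # [10000, 20000, 40000, 80000]
print(boundaries)  # [10000, 30000, 70000, 150000]
```

Note how quickly the cycles consume steps: four cycles already span 150k steps, which is one reason long runs usually stick to a single cosine.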
Common defaults: $\eta_{\min} = 0$ or $\eta_{\min} = 0.1\,\eta_{\max}$. LLaMA uses $\eta_{\min} = 0.1\,\eta_{\max}$ (decay to 10% of the peak learning rate).
```python
import torch

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

# ── Standard cosine decay over total_steps ──────────────────────
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=100000, eta_min=1e-5  # decay from 3e-4 → 1e-5
)

# ── With warm restarts ──────────────────────────────────────────
# T_0 = first cycle length, T_mult = cycle length multiplier
scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(
    optimizer, T_0=10000, T_mult=2, eta_min=1e-5
)
# Cycle lengths: 10k, 20k, 40k, ...

# ── Warmup + cosine (the standard LLM recipe) ──────────────────
warmup = torch.optim.lr_scheduler.LinearLR(
    optimizer, start_factor=1e-8/3e-4, total_iters=2000
)
cosine = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=98000, eta_min=3e-5
)
scheduler = torch.optim.lr_scheduler.SequentialLR(
    optimizer, schedulers=[warmup, cosine], milestones=[2000]
)

# ── In training loop ───────────────────────────────────────────
for step, batch in enumerate(dataloader):
    loss = model(batch).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    scheduler.step()  # call AFTER optimizer.step()
```

Manual Implementation
```python
import numpy as np

def cosine_annealing_lr(step, lr_max, lr_min, total_steps):
    """
    Standard cosine decay from lr_max to lr_min over total_steps.

    step: current step (0-indexed)
    lr_max: peak learning rate
    lr_min: floor learning rate
    total_steps: total number of training steps
    """
    # Clamp to [0, total_steps] so we don't go negative
    t = min(step, total_steps)
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + np.cos(np.pi * t / total_steps))

def cosine_warm_restarts_lr(step, lr_max, lr_min, t_0, t_mult=2):
    """
    Cosine annealing with warm restarts (SGDR).

    t_0: first cycle length in steps
    t_mult: multiply cycle length by this after each restart
    """
    t_cur = step
    t_i = t_0
    while t_cur >= t_i:          # find which cycle we're in
        t_cur -= t_i
        t_i = int(t_i * t_mult)  # next cycle is longer
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + np.cos(np.pi * t_cur / t_i))

# Example: 100k steps, cosine from 3e-4 → 1e-5
lrs = [cosine_annealing_lr(t, 3e-4, 1e-5, 100000) for t in range(100000)]
# lrs[0] = 3e-4, lrs[50000] ≈ 1.55e-4, lrs[99999] ≈ 1e-5
```

Popular Uses
- LLM pre-training (GPT-3, LLaMA, Chinchilla): warmup + cosine is the universal schedule; the Chinchilla paper specifically validated that the cosine cycle length should match the total number of training steps
- Vision transformers (ViT, DeiT): cosine annealing replaced step decay as the default for image classification
- Diffusion model training (Stable Diffusion): long cosine schedules over millions of steps
- Fine-tuning (LoRA, full fine-tune): cosine over a short training run keeps the final LR low for stable convergence
- SGDR / snapshot ensembles: warm restarts produce multiple converged models (one per cycle) that can be ensembled cheaply
Alternatives
| Alternative | When to use | Tradeoff |
|---|---|---|
| Step decay | Legacy CNN training (ResNet-style) | Simpler to implement; jarring LR drops can cause training instability |
| Linear decay | Short fine-tuning runs | Predictable, but spends too much time at very low LRs on long runs |
| Inverse square root | Original transformer schedule | Used in “Attention Is All You Need”; largely replaced by cosine |
| Constant LR | Quick experiments, RL | No schedule overhead; leaves performance on the table for long runs |
| One-cycle policy | Fast training (super-convergence) | Warmup + cosine decay in a single cycle tuned for max speed; needs careful LR range finding |
| WSD (warmup-stable-decay) | Continual pre-training (MiniCPM) | Warmup, hold constant for most of training, then rapid decay. Allows extending training without knowing total steps upfront |
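For contrast with the single cosine curve, the WSD shape from the table can be sketched in a few lines. This is an illustrative piecewise-linear version (the function name and the linear decay shape are our assumptions; MiniCPM's exact decay function may differ):

```python
def wsd_lr(step, lr_max, warmup_steps, decay_start, decay_steps, lr_min=0.0):
    """Warmup-stable-decay sketch: linear warmup, constant hold, rapid linear decay."""
    if step < warmup_steps:                   # warmup phase
        return lr_max * step / warmup_steps
    if step < decay_start:                    # stable phase: flat at lr_max
        return lr_max
    t = min(step - decay_start, decay_steps)  # decay phase: short linear ramp down
    return lr_max + (lr_min - lr_max) * t / decay_steps
```

Because the stable phase makes no commitment to a total step count, a run can be extended mid-training by simply pushing `decay_start` out, which is exactly the property the cosine schedule lacks.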
Historical Context
Cosine annealing was introduced by Loshchilov and Hutter (2017, “SGDR: Stochastic Gradient Descent with Warm Restarts”), originally as part of the warm restarts scheme. The cosine shape was chosen as a smooth alternative to step decay, and the restarts were inspired by simulated annealing in combinatorial optimization.
The warm restarts idea fell somewhat out of favour for large-scale training (the restart can waste compute), but the bare cosine schedule became the dominant choice. Its adoption was cemented by GPT-2/GPT-3 and later by the Chinchilla scaling laws paper, which used cosine annealing in all experiments. Today, “warmup + cosine decay to ~10% of peak LR” is effectively the default recipe for any transformer-based training run.