Label Smoothing

Softens one-hot target distributions by mixing with a uniform distribution: P'(i) = (1 - \alpha) \cdot P(i) + \alpha/K. Prevents the model from producing overconfident predictions by never asking it to assign probability 1.0 to any class. Improves calibration and generalisation with zero architectural cost.
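Concretely, with K = 5 classes and α = 0.1, the one-hot target for class 1 softens as follows (a quick numerical sketch):

```python
import numpy as np

K, alpha = 5, 0.1
one_hot = np.array([0., 1., 0., 0., 0.])  # correct class is index 1
uniform = np.full(K, 1.0 / K)

# Mix the one-hot target with the uniform distribution
smoothed = (1 - alpha) * one_hot + alpha * uniform
print(smoothed)  # [0.02 0.92 0.02 0.02 0.02]
assert np.isclose(smoothed.sum(), 1.0)  # still a valid distribution
```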

Without label smoothing, the loss drives the model to put 100% probability on the correct class. Achieving this requires pushing the correct logit to +\infty — which means the logits grow without bound, the model becomes infinitely confident, and the softmax saturates. At that point, the model has memorised “this is definitely a cat” rather than learning “this is probably a cat because of these features.”

Label smoothing says: “don’t aim for 100%. Aim for 90% on the correct class and spread the remaining 10% evenly across all others.” This caps how large the logits need to grow, keeping the model in a regime where softmax outputs are informative rather than saturated. The model learns to be confident but not certain — and that hedge turns out to generalise better.
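The cap on logit growth can be made precise: softmax reproduces the smoothed target exactly once the correct logit exceeds the others by log((1 - α + α/K)/(α/K)), so the optimum is at a finite logit gap rather than at infinity. A quick numerical check (sketch):

```python
import numpy as np

K, alpha = 5, 0.1
p_correct = 1 - alpha + alpha / K  # 0.92
p_other = alpha / K                # 0.02

# Logit gap at which softmax output exactly matches the smoothed target
gap = np.log(p_correct / p_other)
print(round(gap, 2))  # 3.83 — finite, unlike the one-hot case

logits = np.array([gap, 0., 0., 0., 0.])
probs = np.exp(logits) / np.exp(logits).sum()
assert np.isclose(probs[0], p_correct)  # loss is already minimised here
```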

A useful side effect: label smoothing improves calibration. A well-calibrated model’s confidence matches its accuracy — when it says “80% sure,” it’s right 80% of the time. By preventing extreme confidences during training, label smoothing nudges the model toward this property without any explicit calibration objective.

Smoothed target distribution (class y is correct, K classes total):

P'(i) = \begin{cases} 1 - \alpha + \frac{\alpha}{K} & \text{if } i = y \\ \frac{\alpha}{K} & \text{otherwise} \end{cases}

This is equivalent to P' = (1 - \alpha) \cdot \mathbf{e}_y + \alpha \cdot \mathbf{u}, where \mathbf{e}_y is one-hot and \mathbf{u} = [1/K, \ldots, 1/K] is uniform.
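The two forms are the same vector, which is easy to confirm numerically (a sketch):

```python
import numpy as np

K, alpha, y = 4, 0.1, 2

# Form 1: piecewise definition
piecewise = np.full(K, alpha / K)
piecewise[y] = 1 - alpha + alpha / K

# Form 2: mixture of one-hot and uniform
e_y = np.eye(K)[y]
u = np.full(K, 1.0 / K)
mixture = (1 - alpha) * e_y + alpha * u

assert np.allclose(piecewise, mixture)
```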

Cross-entropy with smoothed targets:

\mathcal{L} = -\sum_i P'(i) \log Q(i) = (1 - \alpha)(-\log Q(y)) + \alpha \cdot \frac{1}{K} \sum_i (-\log Q(i))

This decomposes as (1 - \alpha) \cdot \text{CE}(\mathbf{e}_y, Q) + \alpha \cdot \text{CE}(\mathbf{u}, Q). The second term equals \text{KL}(\mathbf{u} \| Q) up to an additive constant (\log K), so it acts as a penalty pushing the model’s output toward uniform — preventing any class from dominating too strongly.

Typical α: 0.1 (the near-universal default). Values above roughly 0.2 typically hurt accuracy.

import torch
import torch.nn.functional as F
# ── Built-in PyTorch support (since 1.10) ──────────────────────
logits = model(x) # (B, K)
targets = labels # (B,)
loss = F.cross_entropy(logits, targets, label_smoothing=0.1)
# WARNING: label_smoothing expects integer targets (class indices),
# not soft targets. If you already have soft targets, don't use this
# flag — it will smooth your already-soft distribution.
# ── Manual computation (useful for understanding) ──────────────
K = logits.size(-1)
log_probs = F.log_softmax(logits, dim=-1) # (B, K)
nll = -log_probs.gather(dim=-1, index=targets.unsqueeze(-1)).squeeze(-1) # (B,)
smooth_loss = -log_probs.mean(dim=-1) # (B,)
loss = (1 - 0.1) * nll + 0.1 * smooth_loss # (B,)
loss = loss.mean() # scalar
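The manual computation above should agree with the built-in flag; a quick consistency check on random data (a sketch, assuming PyTorch ≥ 1.10):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(8, 10)           # (B, K) fake logits
targets = torch.randint(0, 10, (8,))  # (B,) fake labels
alpha = 0.1

builtin = F.cross_entropy(logits, targets, label_smoothing=alpha)

# Manual decomposition: (1 - alpha) * NLL + alpha * uniform penalty
log_probs = F.log_softmax(logits, dim=-1)
nll = -log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
smooth = -log_probs.mean(dim=-1)
manual = ((1 - alpha) * nll + alpha * smooth).mean()

assert torch.allclose(builtin, manual, atol=1e-6)
```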
import numpy as np

def label_smoothing_cross_entropy(logits, targets, alpha=0.1):
    """
    Cross-entropy with label smoothing.
    logits: (B, K) raw scores
    targets: (B,) integer class indices
    alpha: smoothing factor (0.0 = standard CE)
    """
    B, K = logits.shape
    # Stable log-softmax
    shifted = logits - logits.max(axis=1, keepdims=True)              # (B, K)
    log_sum_exp = np.log(np.exp(shifted).sum(axis=1, keepdims=True))  # (B, 1)
    log_probs = shifted - log_sum_exp                                 # (B, K)
    # NLL on correct class
    nll = -log_probs[np.arange(B), targets]                           # (B,)
    # Uniform penalty: mean of all log-probs
    smooth_loss = -log_probs.mean(axis=1)                             # (B,)
    return ((1 - alpha) * nll + alpha * smooth_loss).mean()
  • Image classification (Inception v3, EfficientNet): the original application; α = 0.1 is the default in most vision recipes
  • LLM pretraining: some models use light smoothing to improve calibration of next-token predictions
  • Machine translation (Transformer, “Attention Is All You Need”): α = 0.1 was part of the original Transformer recipe and remains standard
  • Knowledge distillation: soft targets from a teacher already provide implicit smoothing; additional label smoothing is usually unnecessary
  • Speech recognition: commonly applied with α = 0.1 in CTC and attention-based models
| Alternative | When to use | Tradeoff |
| --- | --- | --- |
| Temperature scaling | Post-hoc calibration | Applied after training, doesn’t affect training dynamics; fixes calibration but doesn’t regularise |
| Mixup / CutMix | When you want stronger regularisation + data augmentation | Interpolates entire inputs and labels; more powerful but changes the data distribution |
| Focal loss | Class-imbalanced settings | Down-weights easy/confident examples; addresses imbalance rather than overconfidence |
| Knowledge distillation | When a teacher model is available | Soft targets from the teacher are a richer form of smoothing tailored to the data |
| Confidence penalty | Explicit entropy regularisation | Adds -\beta H(Q) to the loss directly; more flexible but another hyperparameter |
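For comparison, the confidence penalty mentioned above regularises the model's output entropy directly instead of smoothing the targets. A minimal sketch (the function name and default beta are illustrative, not from a library):

```python
import torch
import torch.nn.functional as F

def confidence_penalty_loss(logits, targets, beta=0.1):
    """Standard cross-entropy minus beta times the output entropy."""
    ce = F.cross_entropy(logits, targets)
    log_probs = F.log_softmax(logits, dim=-1)
    # Entropy of the model's predicted distribution, averaged over the batch
    entropy = -(log_probs.exp() * log_probs).sum(dim=-1).mean()
    # Subtracting entropy rewards higher-entropy (less confident) outputs
    return ce - beta * entropy
```

Unlike label smoothing, the penalty depends on the model's current output distribution, so it adapts over training rather than fixing a static target.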

Label smoothing was introduced by Szegedy et al. (2016, “Rethinking the Inception Architecture”) as one of several training tricks for the Inception v3 model. It was a minor note in that paper but became ubiquitous after Vaswani et al. (2017, “Attention Is All You Need”) included it in the Transformer training recipe.

Müller et al. (2019, “When Does Label Smoothing Help?”) provided deeper analysis, showing that label smoothing makes representations of different classes more tightly clustered and equidistant in embedding space. They also noted a surprising downside: label smoothing can hurt knowledge distillation because the teacher’s softened outputs carry less information about inter-class relationships. Despite this edge case, α = 0.1 remains a near-universal default — one of the cheapest regularisers to apply with consistent benefits.