Label Smoothing

Softens one-hot target distributions by mixing with a uniform distribution: P'(i) = (1 - \alpha) \cdot P(i) + \alpha/K. Prevents the model from producing overconfident predictions by never asking it to assign probability 1.0 to any class. Improves calibration and generalisation with zero architectural cost.
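Concretely, with K = 5 classes and α = 0.1, the one-hot target for class 1 softens as follows (a quick numerical sketch):

```python
import numpy as np

K, alpha = 5, 0.1
one_hot = np.array([0., 1., 0., 0., 0.])  # correct class is index 1
uniform = np.full(K, 1.0 / K)

# Mix the one-hot target with the uniform distribution
smoothed = (1 - alpha) * one_hot + alpha * uniform
print(smoothed)  # [0.02 0.92 0.02 0.02 0.02]
assert np.isclose(smoothed.sum(), 1.0)  # still a valid distribution
```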

Without label smoothing, the loss drives the model to put 100% probability on the correct class. Achieving this requires pushing the correct logit to +\infty — which means the logits grow without bound, the model becomes infinitely confident, and the softmax saturates. At that point, the model has memorised “this is definitely a cat” rather than learning “this is probably a cat because of these features.”

Label smoothing says: “don’t aim for 100%. Aim for 90% on the correct class and spread the remaining 10% evenly across all others.” This caps how large the logits need to grow, keeping the model in a regime where softmax outputs are informative rather than saturated. The model learns to be confident but not certain — and that hedge turns out to generalise better.
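The cap on logit growth can be made precise: softmax reproduces the smoothed target exactly once the correct logit exceeds the others by log((1 - α + α/K)/(α/K)), so the optimum is at a finite logit gap rather than at infinity. A quick numerical check (sketch):

```python
import numpy as np

K, alpha = 5, 0.1
p_correct = 1 - alpha + alpha / K  # 0.92
p_other = alpha / K                # 0.02

# Logit gap at which softmax output exactly matches the smoothed target
gap = np.log(p_correct / p_other)
print(round(gap, 2))  # 3.83 — finite, unlike the one-hot case

logits = np.array([gap, 0., 0., 0., 0.])
probs = np.exp(logits) / np.exp(logits).sum()
assert np.isclose(probs[0], p_correct)  # loss is already minimised here
```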

A useful side effect: label smoothing improves calibration. A well-calibrated model’s confidence matches its accuracy — when it says “80% sure,” it’s right 80% of the time. By preventing extreme confidences during training, label smoothing nudges the model toward this property without any explicit calibration objective.

Smoothed target distribution (class y is correct, K classes total):

P'(i) = \begin{cases} 1 - \alpha + \frac{\alpha}{K} & \text{if } i = y \\ \frac{\alpha}{K} & \text{otherwise} \end{cases}

This is equivalent to P' = (1 - \alpha) \cdot \mathbf{e}_y + \alpha \cdot \mathbf{u}, where \mathbf{e}_y is one-hot and \mathbf{u} = [1/K, \ldots, 1/K] is uniform.
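The two forms are the same vector, which is easy to confirm numerically (a sketch):

```python
import numpy as np

K, alpha, y = 4, 0.1, 2

# Form 1: piecewise definition
piecewise = np.full(K, alpha / K)
piecewise[y] = 1 - alpha + alpha / K

# Form 2: mixture of one-hot and uniform
e_y = np.eye(K)[y]
u = np.full(K, 1.0 / K)
mixture = (1 - alpha) * e_y + alpha * u

assert np.allclose(piecewise, mixture)
```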

Cross-entropy with smoothed targets:

\mathcal{L} = -\sum_i P'(i) \log Q(i) = (1 - \alpha)(-\log Q(y)) + \alpha \cdot \frac{1}{K} \sum_i (-\log Q(i))

This decomposes as (1 - \alpha) \cdot \text{CE}(\mathbf{e}_y, Q) + \alpha \cdot \text{CE}(\mathbf{u}, Q). The second term equals \text{KL}(\mathbf{u} \| Q) up to an additive constant (\log K), so it acts as a penalty pushing the model’s output toward uniform — preventing any class from dominating too strongly.

Typical α: 0.1 (the near-universal default). Values above roughly 0.2 typically hurt accuracy.

import torch
import torch.nn.functional as F
# ── Built-in PyTorch support (since 1.10) ──────────────────────
logits = model(x) # (B, K)
targets = labels # (B,)
loss = F.cross_entropy(logits, targets, label_smoothing=0.1)
# WARNING: label_smoothing expects integer targets (class indices),
# not soft targets. If you already have soft targets, don't use this
# flag — it will smooth your already-soft distribution.
# ── Manual computation (useful for understanding) ──────────────
K = logits.size(-1)
log_probs = F.log_softmax(logits, dim=-1) # (B, K)
nll = -log_probs.gather(dim=-1, index=targets.unsqueeze(-1)).squeeze(-1) # (B,)
smooth_loss = -log_probs.mean(dim=-1) # (B,)
loss = (1 - 0.1) * nll + 0.1 * smooth_loss # (B,)
loss = loss.mean() # scalar
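The manual computation above should agree with the built-in flag; a quick consistency check on random data (a sketch, assuming PyTorch ≥ 1.10):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(8, 10)           # (B, K) fake logits
targets = torch.randint(0, 10, (8,))  # (B,) fake labels
alpha = 0.1

builtin = F.cross_entropy(logits, targets, label_smoothing=alpha)

# Manual decomposition: (1 - alpha) * NLL + alpha * uniform penalty
log_probs = F.log_softmax(logits, dim=-1)
nll = -log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
smooth = -log_probs.mean(dim=-1)
manual = ((1 - alpha) * nll + alpha * smooth).mean()

assert torch.allclose(builtin, manual, atol=1e-6)
```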
import numpy as np

def label_smoothing_cross_entropy(logits, targets, alpha=0.1):
    """
    Cross-entropy with label smoothing.
    logits: (B, K) raw scores
    targets: (B,) integer class indices
    alpha: smoothing factor (0.0 = standard CE)
    """
    B, K = logits.shape
    # Stable log-softmax
    shifted = logits - logits.max(axis=1, keepdims=True)              # (B, K)
    log_sum_exp = np.log(np.exp(shifted).sum(axis=1, keepdims=True))  # (B, 1)
    log_probs = shifted - log_sum_exp                                 # (B, K)
    # NLL on correct class
    nll = -log_probs[np.arange(B), targets]                           # (B,)
    # Uniform penalty: mean of all log-probs
    smooth_loss = -log_probs.mean(axis=1)                             # (B,)
    return ((1 - alpha) * nll + alpha * smooth_loss).mean()
  • Image classification (Inception v3, EfficientNet): the original application; α = 0.1 is the default in most vision recipes
  • LLM pretraining: some models use light smoothing to improve calibration of next-token predictions
  • Machine translation (Transformer, “Attention Is All You Need”): α = 0.1 was part of the original Transformer recipe and remains standard
  • Knowledge distillation: soft targets from a teacher already provide implicit smoothing; additional label smoothing is usually unnecessary
  • Speech recognition: commonly applied with α = 0.1 in CTC and attention-based models
| Alternative | When to use | Tradeoff |
| --- | --- | --- |
| Temperature scaling | Post-hoc calibration | Applied after training, doesn’t affect training dynamics; fixes calibration but doesn’t regularise |
| Mixup / CutMix | When you want stronger regularisation + data augmentation | Interpolates entire inputs and labels; more powerful but changes the data distribution |
| Focal loss | Class-imbalanced settings | Down-weights easy/confident examples; addresses imbalance rather than overconfidence |
| Knowledge distillation | When a teacher model is available | Soft targets from the teacher are a richer form of smoothing tailored to the data |
| Confidence penalty | Explicit entropy regularisation | Adds -\beta H(Q) to the loss directly; more flexible but another hyperparameter |
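For comparison, the confidence penalty mentioned above regularises the model's output entropy directly instead of smoothing the targets. A minimal sketch (the function name and default beta are illustrative, not from a library):

```python
import torch
import torch.nn.functional as F

def confidence_penalty_loss(logits, targets, beta=0.1):
    """Standard cross-entropy minus beta times the output entropy."""
    ce = F.cross_entropy(logits, targets)
    log_probs = F.log_softmax(logits, dim=-1)
    # Entropy of the model's predicted distribution, averaged over the batch
    entropy = -(log_probs.exp() * log_probs).sum(dim=-1).mean()
    # Subtracting entropy rewards higher-entropy (less confident) outputs
    return ce - beta * entropy
```

Unlike label smoothing, the penalty depends on the model's current output distribution, so it adapts over training rather than fixing a static target.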

Label smoothing was introduced by Szegedy et al. (2016, “Rethinking the Inception Architecture”) as one of several training tricks for the Inception v3 model. It was a minor note in that paper but became ubiquitous after Vaswani et al. (2017, “Attention Is All You Need”) included it in the Transformer training recipe.

Müller et al. (2019, “When Does Label Smoothing Help?”) provided deeper analysis, showing that label smoothing makes representations of different classes more tightly clustered and equidistant in embedding space. They also noted a surprising downside: label smoothing can hurt knowledge distillation because the teacher’s softened outputs carry less information about inter-class relationships. Despite this edge case, α = 0.1 remains a near-universal default — one of the cheapest regularisers to apply with consistent benefits.