Softmax
Converts a vector of raw logits into a probability distribution: each output is positive and they sum to 1. The bridge between neural network outputs and probabilities. Used in attention mechanisms (to compute weights), classification heads (to predict classes), and policy networks (to select actions).
Intuition
A neural network’s raw outputs (logits) can be any real number — positive, negative, large, small. Softmax turns these into a valid probability distribution through two steps: exponentiation makes everything positive, then division by the sum normalises everything to sum to 1.
The exponential is the key design choice. It preserves ordering (larger logits get larger probabilities) but amplifies differences: a logit of 5 vs 3 becomes $e^{5-3} = e^2 \approx 7.4$ times more probable, not just $5/3 \approx 1.7$ times. This amplification means softmax naturally produces “peaky” distributions — one or two classes dominate, which is usually what you want for classification.
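To make the amplification concrete, a quick NumPy check (a sketch; the numbers follow directly from the definition):

```python
import numpy as np

logits = np.array([5.0, 3.0])
probs = np.exp(logits) / np.exp(logits).sum()

# The probability ratio is exp(5 - 3) = e^2, about 7.39,
# not the raw logit ratio 5 / 3, about 1.67
ratio = probs[0] / probs[1]
```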
Temperature ($\tau$) controls this peakiness. Dividing logits by $\tau$ before softmax scales the amplification: $\tau < 1$ makes the distribution sharper (more confident), $\tau > 1$ makes it flatter (more uncertain), $\tau \to 0$ converges to argmax (one-hot), and $\tau \to \infty$ converges to uniform. Temperature is used in knowledge distillation (soften teacher outputs), LLM sampling (control creativity), and contrastive learning (sharpen similarity scores).
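The limiting behaviour is easy to verify numerically (a minimal sketch):

```python
import numpy as np

def softmax_t(z, tau):
    z = z / tau
    z = z - z.max()              # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.5])

sharp = softmax_t(logits, tau=0.1)    # low tau: approaches one-hot on the argmax
flat = softmax_t(logits, tau=100.0)   # high tau: approaches uniform (1/3 each)
```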
A critical numerical issue: computing $e^{z_i}$ for large logits overflows. The standard fix is to subtract $\max_j z_j$ from all logits before exponentiating. This doesn’t change the output (it cancels in the numerator and denominator) but keeps the numbers in a safe range.
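A small demonstration of the overflow and the max-subtraction fix (sketch):

```python
import numpy as np

z = np.array([1000.0, 1001.0, 1002.0])

# Naive: exp(1000) overflows to inf, and inf/inf gives nan
with np.errstate(over='ignore', invalid='ignore'):
    naive = np.exp(z) / np.exp(z).sum()

# Stable: subtract the max first; mathematically identical result
shifted = z - z.max()    # [-2, -1, 0], safe to exponentiate
stable = np.exp(shifted) / np.exp(shifted).sum()
```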
Definition:

$$\mathrm{softmax}(\mathbf{z})_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}$$

With temperature:

$$\mathrm{softmax}_\tau(\mathbf{z})_i = \frac{e^{z_i/\tau}}{\sum_{j=1}^{K} e^{z_j/\tau}}$$

Log-softmax (numerically stable, used for cross-entropy):

$$\log \mathrm{softmax}(\mathbf{z})_i = z_i - \log \sum_{j=1}^{K} e^{z_j}$$

Jacobian (gradient of softmax output w.r.t. input):

$$\frac{\partial\, \mathrm{softmax}(\mathbf{z})_i}{\partial z_j} = \mathrm{softmax}(\mathbf{z})_i \left(\delta_{ij} - \mathrm{softmax}(\mathbf{z})_j\right)$$

where $\delta_{ij}$ is the Kronecker delta. Note: the gradient of softmax output $i$ depends on ALL outputs, not just the $i$-th — this is why softmax + cross-entropy is fused for efficiency.
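A quick way to check the Jacobian formula is to compare it against central finite differences (a sketch, NumPy only):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

z = np.array([0.5, -1.0, 2.0])
p = softmax(z)

# Analytic Jacobian: J[i, j] = p_i * (delta_ij - p_j)
J = np.diag(p) - np.outer(p, p)

# Finite-difference approximation, one column per input
eps = 1e-6
J_fd = np.zeros((3, 3))
for j in range(3):
    dz = np.zeros(3)
    dz[j] = eps
    J_fd[:, j] = (softmax(z + dz) - softmax(z - dz)) / (2 * eps)

print(np.abs(J - J_fd).max())  # max discrepancy; small if the formula holds
```

Each column of the Jacobian sums to zero, reflecting that the outputs are constrained to sum to 1.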
```python
import torch
import torch.nn.functional as F

logits = torch.randn(B, K)  # (B, K) — raw scores for K classes

# ── Classification ──────────────────────────────────────────────
probs = F.softmax(logits, dim=-1)  # (B, K) — sums to 1 along last dim
# WARNING: Almost never use softmax directly for training loss.
# Use F.cross_entropy(logits, targets) which fuses log-softmax + NLL.

# ── Attention weights ──────────────────────────────────────────
# scores: (B, n_heads, T_q, T_k)
attn_weights = F.softmax(scores, dim=-1)  # (B, n_heads, T_q, T_k) — each query sums to 1

# ── With temperature ───────────────────────────────────────────
tau = 0.07  # low temperature = sharp distribution
probs = F.softmax(logits / tau, dim=-1)  # (B, K)

# ── Log-softmax (for NLL loss, numerically stable) ─────────────
log_probs = F.log_softmax(logits, dim=-1)  # (B, K)
loss = F.nll_loss(log_probs, targets)  # equivalent to F.cross_entropy

# WARNING: Always specify dim= explicitly. The default dim differs between
# versions and has caused subtle bugs. dim=-1 is correct for (B, K) logits.
```

Manual Implementation
```python
import numpy as np

def softmax(logits, axis=-1):
    """
    Numerically stable softmax.
    logits: any shape array. Softmax is computed along `axis`.
    """
    # Subtract max for numerical stability — doesn't change the result
    shifted = logits - logits.max(axis=axis, keepdims=True)  # prevent overflow
    exp_z = np.exp(shifted)  # all values now ≤ 1
    return exp_z / exp_z.sum(axis=axis, keepdims=True)  # normalise to sum to 1

def log_softmax(logits, axis=-1):
    """Log-softmax using the log-sum-exp trick."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    log_sum_exp = np.log(np.exp(shifted).sum(axis=axis, keepdims=True))
    return shifted - log_sum_exp

def softmax_with_temperature(logits, tau=1.0, axis=-1):
    """Softmax with temperature scaling. tau < 1 = sharper, tau > 1 = flatter."""
    return softmax(logits / tau, axis=axis)
```

Popular Uses
- Attention mechanisms (transformer/): softmax over key-query dot products produces attention weights — this is the core of self-attention
- Classification heads: final layer in image classifiers (ResNet, ViT), language models (next-token prediction), and any multi-class prediction
- Policy networks in RL (policy-gradient/): softmax over action logits produces a categorical policy
- Contrastive learning (contrastive-self-supervising/): InfoNCE loss applies softmax with temperature over similarity scores
- Knowledge distillation: softmax with high temperature ($\tau > 1$) exposes “dark knowledge” — relative similarities between non-target classes
- LLM token sampling: temperature-scaled softmax controls randomness in text generation
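The token-sampling use can be sketched in a few lines (the logits here are hypothetical stand-ins for next-token scores; `torch.multinomial` draws from the resulting categorical distribution):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.tensor([2.0, 1.0, 0.1])  # hypothetical next-token scores

def sample(logits, tau, n=1000):
    probs = F.softmax(logits / tau, dim=-1)
    return torch.multinomial(probs, n, replacement=True)

greedy_ish = sample(logits, tau=0.1)  # sharp: almost always picks token 0
creative = sample(logits, tau=2.0)    # flat: spreads mass across all tokens
```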
Alternatives
| Alternative | When to use | Tradeoff |
|---|---|---|
| Sigmoid | Multi-label classification (multiple classes can be true) | Independent per-element; doesn’t enforce mutual exclusivity or sum-to-one |
| Sparsemax | When you want truly sparse probability distributions | Outputs exact zeros for low-scoring classes; harder to differentiate through |
| Gumbel-Softmax | Differentiable sampling from categorical distributions (VAE discrete latents) | Adds Gumbel noise for reparameterised discrete sampling; temperature-annealed |
| Log-softmax | When you only need log-probabilities (NLL loss) | More numerically stable than log(softmax()); use F.log_softmax directly |
| Hardmax (argmax) | Inference only (not differentiable) | Returns one-hot vector; used at test time but not trainable |
| Normalised exponential with margin | Large-scale face recognition (ArcFace, CosFace) | Adds angular or additive margin to the correct class logit before softmax |
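The sigmoid row in particular trips people up. A minimal contrast (sketch): sigmoid treats each logit independently, so the outputs need not sum to 1, which is exactly what multi-label classification requires.

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([2.0, 2.0, -1.0])

multi_class = F.softmax(logits, dim=-1)  # mutually exclusive: sums to 1
multi_label = torch.sigmoid(logits)      # independent: each value in (0, 1)

print(multi_class.sum())  # 1.0
print(multi_label.sum())  # no sum-to-one constraint (here, well above 1)
```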
Historical Context
Softmax originated in statistical mechanics as the Boltzmann distribution (with temperature corresponding to physical temperature) and entered machine learning through the work of Bridle (1990), who proposed it as the output layer for neural network classifiers. The name “softmax” reflects that it’s a smooth (differentiable) approximation to the argmax function.
The combination of softmax outputs with cross-entropy loss — equivalent to maximum likelihood estimation for categorical distributions — became the standard training objective for classification networks. The modern significance of softmax expanded dramatically with the Transformer (Vaswani et al., 2017), where it serves a completely different role: computing attention weights. The temperature-scaling idea, while rooted in Boltzmann distributions, was repopularised by Hinton et al. (2015, “Distilling the Knowledge in a Neural Network”) for knowledge distillation and has since become essential in contrastive learning (SimCLR, CLIP) where temperature is a critical hyperparameter.