Softmax

Converts a vector of raw logits into a probability distribution: each output is positive and they sum to 1. The bridge between neural network outputs and probabilities. Used in attention mechanisms (to compute weights), classification heads (to predict classes), and policy networks (to select actions).

A neural network’s raw outputs (logits) can be any real number — positive, negative, large, small. Softmax turns these into a valid probability distribution through two steps: exponentiation makes everything positive, then division by the sum normalises everything to sum to 1.

The exponential is the key design choice. It preserves ordering (larger logits get larger probabilities) but amplifies differences: a logit of 5 vs 3 becomes e^5 / e^3 \approx 7.4 times more probable, not just 5/3 \approx 1.7 times. This amplification means softmax naturally produces “peaky” distributions — one or two classes dominate, which is usually what you want for classification.
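To make the amplification concrete, a quick NumPy check (values small enough that no stability trick is needed):

```python
import numpy as np

logits = np.array([5.0, 3.0])
probs = np.exp(logits) / np.exp(logits).sum()  # plain softmax over two logits
ratio = probs[0] / probs[1]
print(ratio)  # ≈ 7.389 = e^2 — probability ratios depend only on logit differences
```

Note that the ratio equals e^{5-3}: softmax is shift-invariant, so only differences between logits matter.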

Temperature (\tau) controls this peakiness. Dividing logits by \tau before softmax scales the amplification: \tau < 1 makes the distribution sharper (more confident), \tau > 1 makes it flatter (more uncertain), \tau \to 0 converges to argmax (one-hot), and \tau \to \infty converges to uniform. Temperature is used in knowledge distillation (soften teacher outputs), LLM sampling (control creativity), and contrastive learning (sharpen similarity scores).
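The limiting behaviour is easy to verify with a small temperature sweep (toy logits, stable shifted implementation):

```python
import numpy as np

def softmax_t(z, tau):
    z = np.asarray(z, dtype=float) / tau
    z = z - z.max()                  # stability shift; doesn't change the result
    e = np.exp(z)
    return e / e.sum()

logits = [2.0, 1.0, 0.0]
for tau in (0.1, 1.0, 10.0):
    print(tau, np.round(softmax_t(logits, tau), 3))
# tau = 0.1  → nearly one-hot on the largest logit
# tau = 1.0  → standard softmax
# tau = 10.0 → nearly uniform
```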

A critical numerical issue: computing e^{z_i} for large logits overflows. The standard fix is to subtract \max(z) from all logits before exponentiating. This doesn’t change the output (the constant factor cancels between numerator and denominator) but keeps the numbers in a safe range.
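The failure mode is easy to reproduce (warnings suppressed so the NaNs are visible in the output rather than raised as noise):

```python
import numpy as np

z = np.array([1000.0, 990.0, 0.0])

with np.errstate(over="ignore", invalid="ignore"):
    naive = np.exp(z) / np.exp(z).sum()      # exp(1000) overflows to inf → inf/inf = nan

shifted = z - z.max()                        # largest entry becomes 0
stable = np.exp(shifted) / np.exp(shifted).sum()

print(naive)   # [nan nan  0.] — overflow has destroyed the result
print(stable)  # finite values that sum to 1
```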

Definition:

\text{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}

With temperature:

\text{softmax}(z / \tau)_i = \frac{e^{z_i / \tau}}{\sum_{j=1}^{K} e^{z_j / \tau}}

Log-softmax (numerically stable, used for cross-entropy):

\log \text{softmax}(z)_i = z_i - \log \sum_{j=1}^{K} e^{z_j}

Jacobian (gradient of softmax output w.r.t. input):

\frac{\partial \text{softmax}(z)_i}{\partial z_j} = \text{softmax}(z)_i \bigl(\delta_{ij} - \text{softmax}(z)_j\bigr)

where \delta_{ij} is the Kronecker delta. Note: the gradient of softmax depends on ALL outputs, not just the i-th — this is why softmax + cross-entropy is fused for efficiency.
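In matrix form the Jacobian is diag(p) − p p^T, which can be checked against autograd (a quick sketch):

```python
import torch
import torch.nn.functional as F

z = torch.randn(5)
p = F.softmax(z, dim=0)

# Analytic Jacobian: J_ij = p_i (delta_ij - p_j)  =  diag(p) - p p^T
J_analytic = torch.diag(p) - torch.outer(p, p)

# Autograd Jacobian of the same map, for comparison
J_autograd = torch.autograd.functional.jacobian(lambda x: F.softmax(x, dim=0), z)

print(torch.allclose(J_analytic, J_autograd, atol=1e-6))  # True
```

Each row of the Jacobian sums to zero, reflecting the sum-to-one constraint on the outputs.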

import torch
import torch.nn.functional as F

B, K = 32, 10                                # batch size, number of classes
logits = torch.randn(B, K)                   # (B, K) — raw scores for K classes

# ── Classification ──────────────────────────────────────────────
probs = F.softmax(logits, dim=-1)            # (B, K) — sums to 1 along last dim
# WARNING: Almost never use softmax directly for training loss.
# Use F.cross_entropy(logits, targets) which fuses log-softmax + NLL.

# ── Attention weights ──────────────────────────────────────────
n_heads, T_q, T_k = 8, 16, 16
scores = torch.randn(B, n_heads, T_q, T_k)   # query-key dot products
attn_weights = F.softmax(scores, dim=-1)     # (B, n_heads, T_q, T_k) — each query sums to 1

# ── With temperature ───────────────────────────────────────────
tau = 0.07                                   # low temperature = sharp distribution
probs = F.softmax(logits / tau, dim=-1)      # (B, K)

# ── Log-softmax (for NLL loss, numerically stable) ─────────────
targets = torch.randint(0, K, (B,))          # (B,) — ground-truth class indices
log_probs = F.log_softmax(logits, dim=-1)    # (B, K)
loss = F.nll_loss(log_probs, targets)        # equivalent to F.cross_entropy
# WARNING: Always specify dim= explicitly. The implicit default has differed
# between versions and caused subtle bugs. dim=-1 is correct for (B, K) logits.
import numpy as np

def softmax(logits, axis=-1):
    """
    Numerically stable softmax.
    logits: any shape array. Softmax is computed along `axis`.
    """
    # Subtract max for numerical stability — doesn't change the result
    shifted = logits - logits.max(axis=axis, keepdims=True)  # prevent overflow
    exp_z = np.exp(shifted)                                  # all values now ≤ 1
    return exp_z / exp_z.sum(axis=axis, keepdims=True)       # normalise to sum to 1

def log_softmax(logits, axis=-1):
    """Log-softmax using the log-sum-exp trick."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    log_sum_exp = np.log(np.exp(shifted).sum(axis=axis, keepdims=True))
    return shifted - log_sum_exp

def softmax_with_temperature(logits, tau=1.0, axis=-1):
    """Softmax with temperature scaling. tau < 1 = sharper, tau > 1 = flatter."""
    return softmax(logits / tau, axis=axis)
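A quick sanity check of the stable implementation’s key properties (the `softmax` here repeats the same stable definition so the snippet runs standalone):

```python
import numpy as np

def softmax(logits, axis=-1):
    # Numerically stable softmax along `axis` (same recipe as above)
    shifted = logits - logits.max(axis=axis, keepdims=True)
    exp_z = np.exp(shifted)
    return exp_z / exp_z.sum(axis=axis, keepdims=True)

z = np.array([[1000.0, 1001.0, 1002.0],   # huge logits — naive exp would overflow
              [-5.0, 0.0, 5.0]])
p = softmax(z)

print(p.sum(axis=-1))                            # each row sums to 1
print(np.allclose(softmax(z), softmax(z + 100))) # shift-invariance: True
```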
  • Attention mechanisms (transformer/): softmax over key-query dot products produces attention weights — this is the core of self-attention
  • Classification heads: final layer in image classifiers (ResNet, ViT), language models (next-token prediction), and any multi-class prediction
  • Policy networks in RL (policy-gradient/): softmax over action logits produces a categorical policy \pi(a|s)
  • Contrastive learning (contrastive-self-supervising/): InfoNCE loss applies softmax with temperature over similarity scores
  • Knowledge distillation: softmax with high temperature (\tau = 4–20) exposes “dark knowledge” — relative similarities between non-target classes
  • LLM token sampling: temperature-scaled softmax controls randomness in text generation
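Temperature-scaled token sampling can be sketched as follows (toy logits standing in for an LLM’s next-token scores; counts over repeated draws show the effect):

```python
import numpy as np

rng = np.random.default_rng(0)
logits = np.array([3.0, 1.0, 0.5, -1.0])   # toy next-token scores

def sample(logits, tau, n=10_000):
    z = logits / tau
    p = np.exp(z - z.max())                # stable temperature-scaled softmax
    p /= p.sum()
    return rng.choice(len(logits), size=n, p=p)

for tau in (0.5, 1.0, 2.0):
    counts = np.bincount(sample(logits, tau), minlength=len(logits))
    print(tau, counts / counts.sum())
# lower tau → samples concentrate on the top-scoring token (less "creative")
```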
| Alternative | When to use | Tradeoff |
| --- | --- | --- |
| Sigmoid | Multi-label classification (multiple classes can be true) | Independent per-element; doesn’t enforce mutual exclusivity or sum-to-one |
| Sparsemax | When you want truly sparse probability distributions | Outputs exact zeros for low-scoring classes; harder to differentiate through |
| Gumbel-Softmax | Differentiable sampling from categorical distributions (VAE discrete latents) | Adds Gumbel noise for reparameterised discrete sampling; temperature-annealed |
| Log-softmax | When you only need log-probabilities (NLL loss) | More numerically stable than log(softmax()); use F.log_softmax directly |
| Hardmax (argmax) | Inference only (not differentiable) | Returns one-hot vector; used at test time but not trainable |
| Normalised exponential with margin | Large-scale face recognition (ArcFace, CosFace) | Adds angular or additive margin to the correct class logit before softmax |

Softmax originated in statistical mechanics as the Boltzmann distribution (with temperature corresponding to physical temperature) and entered machine learning through the work of Bridle (1990), who proposed it as the output layer for neural network classifiers. The name “softmax” reflects that it’s a smooth (differentiable) approximation to the argmax function.

The combination of softmax outputs with cross-entropy loss — equivalent to maximum likelihood estimation for categorical distributions — became the standard training objective for classification networks. The modern significance of softmax expanded dramatically with the Transformer (Vaswani et al., 2017), where it serves a completely different role: computing attention weights. The temperature-scaling idea, while rooted in Boltzmann distributions, was repopularised by Hinton et al. (2015, “Distilling the Knowledge in a Neural Network”) for knowledge distillation and has since become essential in contrastive learning (SimCLR, CLIP) where temperature is a critical hyperparameter.