Softmax
Converts a vector of raw logits into a probability distribution: each output is positive and they sum to 1. The bridge between neural network outputs and probabilities. Used in attention mechanisms (to compute weights), classification heads (to predict classes), and policy networks (to select actions).
Intuition
A neural network’s raw outputs (logits) can be any real number — positive, negative, large, small. Softmax turns these into a valid probability distribution through two steps: exponentiation makes everything positive, then division by the sum normalises everything to sum to 1.
The exponential is the key design choice. It preserves ordering (larger logits get larger probabilities) but amplifies differences: a logit of 5 vs 3 becomes $e^{5-3} = e^2 \approx 7.4$ times more probable, not just $5/3 \approx 1.7$ times. This amplification means softmax naturally produces “peaky” distributions — one or two classes dominate, which is usually what you want for classification.
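To make the amplification concrete, a quick NumPy check (a sketch; the numbers follow directly from the definition):

```python
import numpy as np

logits = np.array([5.0, 3.0])
probs = np.exp(logits) / np.exp(logits).sum()

# The probability ratio is exp(5 - 3) = e^2, about 7.39,
# not the raw logit ratio 5 / 3, about 1.67
ratio = probs[0] / probs[1]
```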
Temperature ($\tau$) controls this peakiness. Dividing logits by $\tau$ before softmax scales the amplification: $\tau < 1$ makes the distribution sharper (more confident), $\tau > 1$ makes it flatter (more uncertain), $\tau \to 0$ converges to argmax (one-hot), and $\tau \to \infty$ converges to uniform. Temperature is used in knowledge distillation (soften teacher outputs), LLM sampling (control creativity), and contrastive learning (sharpen similarity scores).
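The limiting behaviour is easy to verify numerically (a minimal sketch):

```python
import numpy as np

def softmax_t(z, tau):
    z = z / tau
    z = z - z.max()              # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.5])

sharp = softmax_t(logits, tau=0.1)    # low tau: approaches one-hot on the argmax
flat = softmax_t(logits, tau=100.0)   # high tau: approaches uniform (1/3 each)
```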
A critical numerical issue: computing $e^{z_i}$ for large logits overflows. The standard fix is to subtract $\max_j z_j$ from all logits before exponentiating. This doesn’t change the output (it cancels in the numerator and denominator) but keeps the numbers in a safe range.
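A small demonstration of the overflow and the max-subtraction fix (sketch):

```python
import numpy as np

z = np.array([1000.0, 1001.0, 1002.0])

# Naive: exp(1000) overflows to inf, and inf/inf gives nan
with np.errstate(over='ignore', invalid='ignore'):
    naive = np.exp(z) / np.exp(z).sum()

# Stable: subtract the max first; mathematically identical result
shifted = z - z.max()    # [-2, -1, 0], safe to exponentiate
stable = np.exp(shifted) / np.exp(shifted).sum()
```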
Definition:

$$\mathrm{softmax}(\mathbf{z})_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}$$

With temperature:

$$\mathrm{softmax}_\tau(\mathbf{z})_i = \frac{e^{z_i/\tau}}{\sum_{j=1}^{K} e^{z_j/\tau}}$$

Log-softmax (numerically stable, used for cross-entropy):

$$\log \mathrm{softmax}(\mathbf{z})_i = z_i - \log \sum_{j=1}^{K} e^{z_j}$$

Jacobian (gradient of softmax output w.r.t. input):

$$\frac{\partial\, \mathrm{softmax}(\mathbf{z})_i}{\partial z_j} = \mathrm{softmax}(\mathbf{z})_i \left(\delta_{ij} - \mathrm{softmax}(\mathbf{z})_j\right)$$

where $\delta_{ij}$ is the Kronecker delta. Note: the gradient of softmax output $i$ depends on ALL outputs, not just the $i$-th — this is why softmax + cross-entropy is fused for efficiency.
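A quick way to check the Jacobian formula is to compare it against central finite differences (a sketch, NumPy only):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

z = np.array([0.5, -1.0, 2.0])
p = softmax(z)

# Analytic Jacobian: J[i, j] = p_i * (delta_ij - p_j)
J = np.diag(p) - np.outer(p, p)

# Finite-difference approximation, one column per input
eps = 1e-6
J_fd = np.zeros((3, 3))
for j in range(3):
    dz = np.zeros(3)
    dz[j] = eps
    J_fd[:, j] = (softmax(z + dz) - softmax(z - dz)) / (2 * eps)

print(np.abs(J - J_fd).max())  # max discrepancy; small if the formula holds
```

Each column of the Jacobian sums to zero, reflecting that the outputs are constrained to sum to 1.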
```python
import torch
import torch.nn.functional as F

logits = torch.randn(B, K)  # (B, K) — raw scores for K classes

# ── Classification ──────────────────────────────────────────────
probs = F.softmax(logits, dim=-1)  # (B, K) — sums to 1 along last dim
# WARNING: Almost never use softmax directly for training loss.
# Use F.cross_entropy(logits, targets) which fuses log-softmax + NLL.

# ── Attention weights ──────────────────────────────────────────
# scores: (B, n_heads, T_q, T_k)
attn_weights = F.softmax(scores, dim=-1)  # (B, n_heads, T_q, T_k) — each query sums to 1

# ── With temperature ───────────────────────────────────────────
tau = 0.07  # low temperature = sharp distribution
probs = F.softmax(logits / tau, dim=-1)  # (B, K)

# ── Log-softmax (for NLL loss, numerically stable) ─────────────
log_probs = F.log_softmax(logits, dim=-1)  # (B, K)
loss = F.nll_loss(log_probs, targets)  # equivalent to F.cross_entropy

# WARNING: Always specify dim= explicitly. The default dim differs between
# versions and has caused subtle bugs. dim=-1 is correct for (B, K) logits.
```

Manual Implementation
```python
import numpy as np

def softmax(logits, axis=-1):
    """
    Numerically stable softmax.
    logits: any shape array. Softmax is computed along `axis`.
    """
    # Subtract max for numerical stability — doesn't change the result
    shifted = logits - logits.max(axis=axis, keepdims=True)  # prevent overflow
    exp_z = np.exp(shifted)  # all values now ≤ 1
    return exp_z / exp_z.sum(axis=axis, keepdims=True)  # normalise to sum to 1

def log_softmax(logits, axis=-1):
    """Log-softmax using the log-sum-exp trick."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    log_sum_exp = np.log(np.exp(shifted).sum(axis=axis, keepdims=True))
    return shifted - log_sum_exp

def softmax_with_temperature(logits, tau=1.0, axis=-1):
    """Softmax with temperature scaling. tau < 1 = sharper, tau > 1 = flatter."""
    return softmax(logits / tau, axis=axis)
```

Popular Uses
- Attention mechanisms (transformer/): softmax over key-query dot products produces attention weights — this is the core of self-attention
- Classification heads: final layer in image classifiers (ResNet, ViT), language models (next-token prediction), and any multi-class prediction
- Policy networks in RL (policy-gradient/): softmax over action logits produces a categorical policy
- Contrastive learning (contrastive-self-supervising/): InfoNCE loss applies softmax with temperature over similarity scores
- Knowledge distillation: softmax with high temperature ($\tau > 1$) exposes “dark knowledge” — relative similarities between non-target classes
- LLM token sampling: temperature-scaled softmax controls randomness in text generation
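The token-sampling use can be sketched in a few lines (the logits here are hypothetical stand-ins for next-token scores; `torch.multinomial` draws from the resulting categorical distribution):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.tensor([2.0, 1.0, 0.1])  # hypothetical next-token scores

def sample(logits, tau, n=1000):
    probs = F.softmax(logits / tau, dim=-1)
    return torch.multinomial(probs, n, replacement=True)

greedy_ish = sample(logits, tau=0.1)  # sharp: almost always picks token 0
creative = sample(logits, tau=2.0)    # flat: spreads mass across all tokens
```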
Alternatives
| Alternative | When to use | Tradeoff |
|---|---|---|
| Sigmoid | Multi-label classification (multiple classes can be true) | Independent per-element; doesn’t enforce mutual exclusivity or sum-to-one |
| Sparsemax | When you want truly sparse probability distributions | Outputs exact zeros for low-scoring classes; harder to differentiate through |
| Gumbel-Softmax | Differentiable sampling from categorical distributions (VAE discrete latents) | Adds Gumbel noise for reparameterised discrete sampling; temperature-annealed |
| Log-softmax | When you only need log-probabilities (NLL loss) | More numerically stable than log(softmax()); use F.log_softmax directly |
| Hardmax (argmax) | Inference only (not differentiable) | Returns one-hot vector; used at test time but not trainable |
| Normalised exponential with margin | Large-scale face recognition (ArcFace, CosFace) | Adds angular or additive margin to the correct class logit before softmax |
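The sigmoid row in particular trips people up. A minimal contrast (sketch): sigmoid treats each logit independently, so the outputs need not sum to 1, which is exactly what multi-label classification requires.

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([2.0, 2.0, -1.0])

multi_class = F.softmax(logits, dim=-1)  # mutually exclusive: sums to 1
multi_label = torch.sigmoid(logits)      # independent: each value in (0, 1)

print(multi_class.sum())  # 1.0
print(multi_label.sum())  # no sum-to-one constraint (here, well above 1)
```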
Historical Context
Softmax originated in statistical mechanics as the Boltzmann distribution (with temperature corresponding to physical temperature) and entered machine learning through the work of Bridle (1990), who proposed it as the output layer for neural network classifiers. The name “softmax” reflects that it’s a smooth (differentiable) approximation to the argmax function.
The combination of softmax outputs with cross-entropy loss — equivalent to maximum likelihood estimation for categorical distributions — became the standard training objective for classification networks. The modern significance of softmax expanded dramatically with the Transformer (Vaswani et al., 2017), where it serves a completely different role: computing attention weights. The temperature-scaling idea, while rooted in Boltzmann distributions, was repopularised by Hinton et al. (2015, “Distilling the Knowledge in a Neural Network”) for knowledge distillation and has since become essential in contrastive learning (SimCLR, CLIP) where temperature is a critical hyperparameter.