
Entropy

Measures the uncertainty or information content of a probability distribution. Maximum for uniform distributions (every outcome equally likely), zero for deterministic ones (outcome is certain). The foundation of information theory — cross-entropy loss, KL divergence, and mutual information are all built on top of it.

Entropy answers: “how many yes/no questions do I need to ask, on average, to determine the outcome?” If a coin is fair, you need exactly 1 bit (one question). If it has four equally likely outcomes, you need 2 bits. If the outcome is already certain, you need 0 bits — there’s nothing to learn.
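This question-counting view can be checked directly with base-2 logs; a minimal sketch (`entropy_bits` is a hypothetical helper, not a library function):

```python
import numpy as np

def entropy_bits(probs):
    """Shannon entropy in bits (base-2 log)."""
    probs = np.asarray(probs, dtype=float)
    nz = probs[probs > 0]          # convention: 0 log 0 = 0
    return float(-(nz * np.log2(nz)).sum())

print(entropy_bits([0.5, 0.5]))    # fair coin: 1 bit
print(entropy_bits([0.25] * 4))    # four equal outcomes: 2 bits
print(entropy_bits([1.0]))         # certain outcome: 0 bits
```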

The key insight is the logarithm. A rare event (probability 0.01) carries a lot of information when it occurs — “it’s snowing in July” is much more informative than “it’s sunny in July.” The log converts multiplicative probabilities into additive information content: two independent events that each need 3 bits to describe need 6 bits together. Entropy is just the expected (average) information content across all possible outcomes.
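Surprisal and its additivity can be illustrated numerically (`info` is a hypothetical one-liner for the information content of a single event):

```python
import numpy as np

def info(p):
    """Information content (surprisal) of an event with probability p, in bits."""
    return -np.log2(p)

# A rare event carries more information than a common one
print(info(0.01))   # roughly 6.64 bits ("snow in July")
print(info(0.9))    # roughly 0.15 bits ("sun in July")

# Independent events: probabilities multiply, information adds
p1, p2 = 1 / 8, 1 / 8                         # each needs 3 bits to describe
assert info(p1 * p2) == info(p1) + info(p2)   # together: 6 bits
```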

Why does this matter for deep learning? Cross-entropy loss decomposes as H(P,Q) = H(P) + D_KL(P||Q). Since H(P) is constant for fixed labels, minimising cross-entropy is equivalent to minimising KL divergence. Entropy also appears directly in reinforcement learning as an exploration bonus — adding H(pi) to the reward encourages the policy to stay stochastic and explore, which is the core idea behind SAC.

General form (discrete distribution over K outcomes):

H(X) = -\sum_{i=1}^{K} p(x_i) \log p(x_i)

with the convention 0 \log 0 = 0 (the limit is well-defined).

Binary entropy (Bernoulli with parameter p):

H(p) = -p \log p - (1-p) \log(1-p)

Maximum at p = 0.5, where H = \log 2 \approx 0.693 nats (or 1 bit).

Differential entropy (continuous distribution with density p(x)):

h(X) = -\int p(x) \log p(x) \, dx

Warning: differential entropy can be negative (unlike discrete entropy). A Gaussian with small variance has negative differential entropy.
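For a Gaussian the differential entropy has the closed form 0.5 \log(2 \pi e \sigma^2), which makes the sign flip easy to see (`gaussian_diff_entropy` is a hypothetical helper):

```python
import numpy as np

def gaussian_diff_entropy(sigma):
    """Differential entropy of N(mu, sigma^2) in nats: 0.5 * log(2*pi*e*sigma^2)."""
    return 0.5 * np.log(2 * np.pi * np.e * sigma**2)

print(gaussian_diff_entropy(1.0))   # roughly 1.42 nats
print(gaussian_diff_entropy(0.1))   # roughly -0.88 nats: negative for small variance
```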

Maximum entropy: among all distributions on K outcomes, the uniform distribution p(x_i) = 1/K maximises entropy at H = \log K. This is the principle behind maximum entropy models — assume maximum uncertainty subject to constraints.
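A quick sanity check of this claim, assuming the discrete entropy definition above (`entropy_nats` is a hypothetical helper):

```python
import numpy as np

def entropy_nats(p):
    """Shannon entropy in nats, with the 0 log 0 = 0 convention."""
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]
    return float(-(nz * np.log(nz)).sum())

K = 4
uniform = np.full(K, 1 / K)
skewed = np.array([0.7, 0.1, 0.1, 0.1])

print(entropy_nats(uniform))   # log(4), roughly 1.386: the maximum
print(entropy_nats(skewed))    # roughly 0.94: strictly lower
```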

Relationship to cross-entropy and KL divergence:

H(P, Q) = H(P) + D_{KL}(P \| Q)
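The identity holds for any pair of distributions and is easy to verify numerically (the distributions below are made-up examples):

```python
import numpy as np

P = np.array([0.7, 0.2, 0.1])   # "true" label distribution
Q = np.array([0.5, 0.3, 0.2])   # model prediction

H_P = -(P * np.log(P)).sum()              # entropy of P
cross_entropy = -(P * np.log(Q)).sum()    # H(P, Q)
kl = (P * np.log(P / Q)).sum()            # D_KL(P || Q), always >= 0

assert np.isclose(cross_entropy, H_P + kl)   # H(P, Q) = H(P) + D_KL(P || Q)
```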

import torch
import torch.nn.functional as F

# ── Entropy of a categorical distribution from logits ───────────
logits = model(x)                           # (B, K) raw scores
probs = F.softmax(logits, dim=-1)           # (B, K)
log_probs = F.log_softmax(logits, dim=-1)   # (B, K) — use log_softmax, not log(softmax)
entropy = -(probs * log_probs).sum(dim=-1)  # (B,) — per-sample entropy

# ── Entropy regularisation in RL (SAC-style) ────────────────────
# Minimising alpha * log_prob maximises policy entropy (exploration bonus)
alpha = 0.2                                 # temperature coefficient
policy_loss = (alpha * log_probs - q_values).mean()

# ── Binary entropy ──────────────────────────────────────────────
p = torch.sigmoid(binary_logits)            # (B,) probabilities from binary logits
binary_entropy = F.binary_cross_entropy(p, p, reduction='none')  # (B,) — H(p) = CE(p, p)
# Equivalent: -(p * p.log() + (1-p) * (1-p).log())
# WARNING: use binary_cross_entropy (not with_logits) since both args are probs here;
# the default reduction='mean' would collapse to a scalar
import numpy as np

def entropy(probs):
    """
    Entropy of a discrete distribution.
    probs: (B, K) — each row sums to 1
    Returns: (B,) — entropy in nats
    """
    # Clip to avoid log(0) = -inf
    safe_probs = np.clip(probs, 1e-12, 1.0)  # (B, K)
    return -(probs * np.log(safe_probs)).sum(axis=-1)  # (B,)

def entropy_from_logits(logits):
    """
    Entropy from raw logits (numerically stable).
    logits: (B, K)
    Returns: (B,)
    """
    # Stable log-softmax: subtract max to prevent overflow
    shifted = logits - logits.max(axis=-1, keepdims=True)  # (B, K)
    log_sum_exp = np.log(np.exp(shifted).sum(axis=-1, keepdims=True))  # (B, 1)
    log_probs = shifted - log_sum_exp  # (B, K)
    probs = np.exp(log_probs)  # (B, K)
    return -(probs * log_probs).sum(axis=-1)  # (B,)

def binary_entropy(p):
    """
    Entropy of Bernoulli(p).
    p: (B,) probabilities in [0, 1]
    """
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -p * np.log(p) - (1 - p) * np.log(1 - p)  # (B,)
  • Entropy regularisation in RL (SAC, A2C): add H(pi) as a reward bonus to encourage exploration and prevent premature convergence to a deterministic policy
  • Cross-entropy loss decomposition: understanding that minimising CE means minimising KL (since H(P) is constant) — the fundamental insight behind why cross-entropy works for classification
  • Maximum entropy models: constrain a distribution to match observed statistics while being maximally uncertain otherwise (MaxEnt RL, exponential family distributions)
  • Bits-per-dimension evaluation: convert NLL from nats to bits using log2, then normalise by dimensionality — entropy provides the theoretical lower bound
  • Information bottleneck: compress representations by minimising mutual information with input while maximising it with labels — entropy quantifies the compression
  • Entropy coding (arithmetic coding, ANS): lossless compression that achieves rates approaching the entropy — used in neural compression models
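The bits-per-dimension conversion mentioned above is just a division by log 2 and by the dimensionality; a minimal sketch with made-up numbers:

```python
import numpy as np

def bits_per_dim(nll_nats, num_dims):
    """Convert a per-example negative log-likelihood in nats to bits per dimension."""
    return nll_nats / (np.log(2) * num_dims)

# Hypothetical example: an image model with NLL of 6500 nats on a 32x32x3 image
print(bits_per_dim(6500.0, 32 * 32 * 3))   # roughly 3.05 bits/dim
```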
| Alternative | When to use | Tradeoff |
| --- | --- | --- |
| Rényi entropy | When you need to weight rare vs. common events differently | Parameterised by order alpha; alpha = 1 recovers Shannon entropy. Higher alpha focuses on high-probability events |
| Gini impurity | Decision trees (CART) | 1 - \sum_i p_i^2. Computationally cheaper than entropy (no log), nearly identical splits in practice |
| Variance | Continuous distributions where you want a simple uncertainty measure | Only captures second-order information; doesn't fully characterise the distribution shape |
| Calibration metrics (ECE) | When you care about whether predicted probabilities are reliable | Measures calibration directly rather than information content; orthogonal to entropy |

Shannon (1948) introduced entropy in “A Mathematical Theory of Communication,” borrowing the name from thermodynamics on von Neumann’s suggestion. Shannon proved that entropy is the fundamental limit on lossless compression — you cannot encode messages from a source with fewer than H(X) bits per symbol on average. This established the field of information theory.

Jaynes (1957) extended entropy to inference with the Maximum Entropy Principle: when building a probability model, choose the distribution with the highest entropy subject to your known constraints. This principle underlies exponential family distributions and connects information theory to statistical mechanics. In modern deep learning, entropy regularisation (Ziebart et al., 2008, “Maximum Entropy Inverse Reinforcement Learning”; Haarnoja et al., 2018, SAC) uses this same idea — maximise entropy subject to getting high reward — producing robust, exploration-friendly policies.