
Mutual Information

Measures how much knowing one variable reduces uncertainty about another. I(X;Y) = 0 means X and Y are independent; higher values mean more shared information. The core quantity behind representation learning (InfoMax principle) and the theoretical motivation for contrastive losses like InfoNCE.

Mutual information asks: “how surprised would I be to learn that X and Y co-occur, compared to if they were independent?” If X is an image and Y is a caption, high MI means the caption tells you a lot about the image. If X is a noisy copy of Y, MI measures how much signal survives the noise.

Think of a Venn diagram of information. H(X) is one circle, H(Y) is the other. Their overlap is I(X;Y) — the information they share. The non-overlapping parts are what’s unique to each. Conditioning on Y removes its circle, leaving only H(X|Y) — the residual uncertainty. So I(X;Y) = H(X) - H(X|Y): knowing Y “explains away” exactly I(X;Y) nats of uncertainty about X.

The catch: MI is intractable for high-dimensional continuous variables. You’d need the full joint density p(x,y), which is exactly what you don’t have. This is why contrastive learning exists — InfoNCE provides a lower bound on MI that you can estimate from samples alone, by training a critic to distinguish real (x,y) pairs from shuffled ones. The better the critic, the tighter the bound.

Definition (three equivalent forms):

I(X;Y) = H(X) - H(X|Y) = H(Y) - H(Y|X) = H(X) + H(Y) - H(X,Y)

As KL divergence (most fundamental form):

I(X;Y) = D_{KL}\bigl(p(x,y) \;\|\; p(x)p(y)\bigr)

This measures how far the joint distribution is from the product of marginals. Zero iff X and Y are independent.
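The equivalence of the entropy form and the KL form is easy to check numerically on a small joint table. A minimal sketch with made-up numbers (a 2x2 binary joint):

```python
import numpy as np

# Toy joint over two binary variables (hypothetical numbers; sums to 1)
p_xy = np.array([[0.4, 0.1],
                 [0.1, 0.4]])

p_x = p_xy.sum(axis=1)   # marginal p(x)
p_y = p_xy.sum(axis=0)   # marginal p(y)

def entropy(p):
    p = p[p > 0]
    return -(p * np.log(p)).sum()   # nats

# Entropy form: H(X) + H(Y) - H(X,Y)
mi_entropy = entropy(p_x) + entropy(p_y) - entropy(p_xy.ravel())

# KL form: sum_{x,y} p(x,y) log[ p(x,y) / (p(x) p(y)) ]
product = np.outer(p_x, p_y)
mi_kl = (p_xy * np.log(p_xy / product)).sum()

# An independent joint (the product of its own marginals) has zero MI
mi_indep = (product * np.log(product / product)).sum()
```

Both forms agree to machine precision, and the independent joint gives exactly zero, matching the "zero iff independent" property.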

Conditional mutual information:

I(X;Y|Z) = H(X|Z) - H(X|Y,Z)
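For discrete variables this can be computed directly from a three-way joint table, using the identity H(A|B) = H(A,B) - H(B) for each conditional entropy. A sketch with a random (hypothetical) joint:

```python
import numpy as np

# Hypothetical joint table p(x, y, z) over three binary variables
rng = np.random.default_rng(0)
p = rng.random((2, 2, 2))
p /= p.sum()   # axis order: (x, y, z)

def entropy(p):
    p = p[p > 0]
    return -(p * np.log(p)).sum()   # nats

# H(X|Z) = H(X,Z) - H(Z)
H_x_given_z = entropy(p.sum(axis=1)) - entropy(p.sum(axis=(0, 1)))

# H(X|Y,Z) = H(X,Y,Z) - H(Y,Z)
H_x_given_yz = entropy(p) - entropy(p.sum(axis=0))

cmi = H_x_given_z - H_x_given_yz   # I(X;Y|Z), always >= 0
```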

InfoNCE lower bound (Oord et al., 2018):

I(X;Y) \geq \log N - \mathcal{L}_{\text{InfoNCE}}

where N is the number of negative samples and the InfoNCE loss is:

\mathcal{L}_{\text{InfoNCE}} = -\mathbb{E}\left[\log \frac{e^{f(x,y^+)}}{\sum_{j=1}^{N} e^{f(x,y_j)}}\right]

The bound is tight when f(x,y) = \log \frac{p(y|x)}{p(y)} + c. In practice, the bound can never exceed \log N (the loss is non-negative), so more negatives raise the ceiling and tighten the bound.

import torch
import torch.nn.functional as F

# ── InfoNCE loss (contrastive MI estimation) ────────────────────
# This is the standard way to maximise MI in practice.
# z_x and z_y are paired embeddings from two views of the same data
# (encoder_x, encoder_y, x, y, B and temperature assumed defined elsewhere).
z_x = encoder_x(x)                              # (B, D) embeddings
z_y = encoder_y(y)                              # (B, D)
z_x = F.normalize(z_x, dim=-1)                  # unit norm
z_y = F.normalize(z_y, dim=-1)                  # unit norm

# Cosine similarity matrix scaled by temperature
logits = z_x @ z_y.T / temperature              # (B, B) similarity scores
labels = torch.arange(B, device=logits.device)  # (B,) diagonal is the positive pair

# InfoNCE = cross-entropy where the "correct class" is the matching pair
loss = F.cross_entropy(logits, labels)          # scalar

# ── MINE estimator (Belghazi et al., 2018) ──────────────────────
# Direct MI estimation via a learned statistics network T(x, y)
joint_scores = T(x, y)                          # (B,) scores for real pairs
marginal_scores = T(x, y[torch.randperm(B)])    # (B,) scores for shuffled pairs

# Donsker-Varadhan bound: I(X;Y) >= E[T(x,y)] - log E[exp(T(x,y_shuffled))]
# log E[exp(.)] over B samples = logsumexp(.) - log B
mi_lower_bound = (
    joint_scores.mean()
    - torch.logsumexp(marginal_scores, 0)
    + torch.log(torch.tensor(float(B)))
)
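A common refinement in CLIP-style training is to symmetrise the InfoNCE loss: classify the similarity matrix along both rows (x to its matching y) and columns (y to its matching x), then average the two cross-entropies. A self-contained sketch with random stand-in embeddings (toy sizes, made-up values):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, D, temperature = 8, 16, 0.07   # toy batch/dim; real models use larger values

# Stand-in embeddings for two views (random; a real encoder would produce these)
z_x = F.normalize(torch.randn(B, D), dim=-1)
z_y = F.normalize(torch.randn(B, D), dim=-1)

logits = z_x @ z_y.T / temperature   # (B, B) similarity matrix
labels = torch.arange(B)             # diagonal entries are the positives

# Symmetric InfoNCE: rows (x -> matching y) plus columns (y -> matching x)
loss = 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels))
```

Each direction is a lower bound on the same MI; averaging them tends to stabilise training.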
import numpy as np

def mutual_information_discrete(joint_probs):
    """
    MI from a joint probability table.
    joint_probs: (K_x, K_y) — p(x,y), must sum to 1
    Returns: scalar MI in nats
    """
    p_xy = np.clip(joint_probs, 1e-12, 1.0)  # (K_x, K_y)
    p_x = p_xy.sum(axis=1, keepdims=True)    # (K_x, 1) — marginal
    p_y = p_xy.sum(axis=0, keepdims=True)    # (1, K_y) — marginal
    # I(X;Y) = sum p(x,y) * log(p(x,y) / (p(x)*p(y)))
    return (p_xy * np.log(p_xy / (p_x * p_y))).sum()  # scalar

def infonce_loss(z_x, z_y, temperature=0.07):
    """
    InfoNCE contrastive loss (lower bound on MI).
    z_x: (B, D) — L2-normalised embeddings
    z_y: (B, D) — L2-normalised embeddings
    Returns: scalar loss
    """
    B = z_x.shape[0]
    # Cosine similarity matrix
    logits = (z_x @ z_y.T) / temperature                  # (B, B)
    # Numerically stable cross-entropy with labels = diagonal
    shifted = logits - logits.max(axis=1, keepdims=True)  # (B, B)
    log_sum_exp = np.log(np.exp(shifted).sum(axis=1))     # (B,)
    log_probs_diag = shifted[np.arange(B), np.arange(B)]  # (B,) — positive pairs
    return -(log_probs_diag - log_sum_exp).mean()         # scalar
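The log N ceiling on the InfoNCE bound can be seen empirically. The sketch below (hypothetical helper `infonce_bound`) uses correlated scalar Gaussians, where the true MI is known in closed form and the optimal critic f(x,y) = log p(y|x)/p(y) can be written down exactly: with batch size 2 the estimate can never exceed log 2 ≈ 0.69 nats, even though the true MI is ≈ 1.16 nats.

```python
import numpy as np

rng = np.random.default_rng(0)
rho = 0.95
true_mi = -0.5 * np.log(1 - rho**2)   # closed form for correlated Gaussians

def infonce_bound(B, trials=300):
    """Monte-Carlo InfoNCE MI bound using the optimal critic."""
    vals = []
    for _ in range(trials):
        x = rng.standard_normal(B)
        y = rho * x + np.sqrt(1 - rho**2) * rng.standard_normal(B)
        # log p(y|x)/p(y), up to a per-row constant (cancels in the softmax)
        logits = (rho * np.outer(x, y) - 0.5 * rho**2 * y[None, :] ** 2) / (1 - rho**2)
        shifted = logits - logits.max(axis=1, keepdims=True)
        lse = np.log(np.exp(shifted).sum(axis=1))
        diag = shifted[np.arange(B), np.arange(B)]
        vals.append(np.log(B) + (diag - lse).mean())   # per-batch bound <= log B
    return float(np.mean(vals))

b2, b64 = infonce_bound(2), infonce_bound(64)
# b2 is capped at log 2, well below true_mi; b64 gets much closer
```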
  • Contrastive learning (SimCLR, MoCo, CLIP): InfoNCE maximises a lower bound on MI between two augmented views — this is why contrastive learning works as a representation learning method
  • InfoMax principle (Deep InfoMax, CPC): learn representations that maximise MI between input and encoding — “preserve as much information as possible”
  • Information bottleneck (VIB): compress representations by minimising I(X;Z) while maximising I(Z;Y) — keep only the label-relevant information
  • MINE / neural MI estimation (Belghazi et al., 2018): train a neural network to estimate MI directly using variational bounds — useful for analysing learned representations
  • Disentangled representations (beta-VAE): total correlation TC(Z) = KL(q(z) || prod q(z_i)) is a multi-variable generalisation of MI — minimising it encourages independent latent dimensions
| Alternative | When to use | Tradeoff |
| --- | --- | --- |
| Correlation / cosine similarity | Quick check for linear relationships | Only captures linear dependence; MI captures arbitrary nonlinear relationships |
| KL divergence | When you have two distributions over the same variable, not two variables | Asymmetric; MI is symmetric and measures shared information between different variables |
| CKA (Centered Kernel Alignment) | Comparing learned representations across models | Practical and stable, but no information-theoretic interpretation |
| Hilbert-Schmidt Independence Criterion | Kernel-based independence testing | Consistent estimator without density estimation, but harder to optimise as a training objective |
| Wasserstein distance | When you care about the geometry of distributions, not just dependence | Accounts for the metric structure of the space; MI treats all mismatches equally |

Mutual information was defined by Shannon (1948) alongside entropy as part of the foundational framework of information theory. It remained primarily a theoretical tool in statistics and communications until Linsker (1988) proposed the InfoMax principle: the optimal representation of input data is one that maximises the mutual information between the input and the representation, subject to constraints.

The deep learning revival of MI came from two directions. Oord et al. (2018, “Representation Learning with Contrastive Predictive Coding”) introduced InfoNCE, showing that a contrastive classification loss provides a tractable lower bound on MI — this directly motivated SimCLR, MoCo, and CLIP. Simultaneously, Belghazi et al. (2018, MINE) showed MI could be estimated with neural networks using variational bounds. The practical difficulty of MI estimation (bounds can be loose, high-variance) has led the field to treat contrastive losses as effective objectives in their own right, somewhat independent of the MI interpretation.