
Mutual Information

Measures how much knowing one variable reduces uncertainty about another. I(X;Y) = 0 means X and Y are independent; higher values mean more shared information. The core quantity behind representation learning (InfoMax principle) and the theoretical motivation for contrastive losses like InfoNCE.

Mutual information asks: “how surprised would I be to learn that X and Y co-occur, compared to if they were independent?” If X is an image and Y is a caption, high MI means the caption tells you a lot about the image. If X is a noisy copy of Y, MI measures how much signal survives the noise.

Think of a Venn diagram of information. H(X) is one circle, H(Y) is the other. Their overlap is I(X;Y) — the information they share. The non-overlapping parts are what’s unique to each. Conditioning on Y removes its circle, leaving only H(X|Y) — the residual uncertainty. So I(X;Y) = H(X) - H(X|Y): knowing Y “explains away” exactly I(X;Y) nats of uncertainty about X.

The catch: MI is intractable for high-dimensional continuous variables. You’d need the full joint density p(x,y), which is exactly what you don’t have. This is why contrastive learning exists — InfoNCE provides a lower bound on MI that you can estimate from samples alone, by training a critic to distinguish real (x,y) pairs from shuffled ones. The better the critic, the tighter the bound.

Definition (three equivalent forms):

I(X;Y) = H(X) - H(X|Y) = H(Y) - H(Y|X) = H(X) + H(Y) - H(X,Y)

As KL divergence (most fundamental form):

I(X;Y) = D_{KL}\bigl(p(x,y) \;\|\; p(x)p(y)\bigr)

This measures how far the joint distribution is from the product of marginals. Zero iff X and Y are independent.
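The equivalence of the entropy form and the KL form is easy to check numerically on a small joint table. A minimal sketch with made-up numbers (a 2x2 binary joint):

```python
import numpy as np

# Toy joint over two binary variables (hypothetical numbers; sums to 1)
p_xy = np.array([[0.4, 0.1],
                 [0.1, 0.4]])

p_x = p_xy.sum(axis=1)   # marginal p(x)
p_y = p_xy.sum(axis=0)   # marginal p(y)

def entropy(p):
    p = p[p > 0]
    return -(p * np.log(p)).sum()   # nats

# Entropy form: H(X) + H(Y) - H(X,Y)
mi_entropy = entropy(p_x) + entropy(p_y) - entropy(p_xy.ravel())

# KL form: sum_{x,y} p(x,y) log[ p(x,y) / (p(x) p(y)) ]
product = np.outer(p_x, p_y)
mi_kl = (p_xy * np.log(p_xy / product)).sum()

# An independent joint (the product of its own marginals) has zero MI
mi_indep = (product * np.log(product / product)).sum()
```

Both forms agree to machine precision, and the independent joint gives exactly zero, matching the "zero iff independent" property.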

Conditional mutual information:

I(X;Y|Z) = H(X|Z) - H(X|Y,Z)
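For discrete variables this can be computed directly from a three-way joint table, using the identity H(A|B) = H(A,B) - H(B) for each conditional entropy. A sketch with a random (hypothetical) joint:

```python
import numpy as np

# Hypothetical joint table p(x, y, z) over three binary variables
rng = np.random.default_rng(0)
p = rng.random((2, 2, 2))
p /= p.sum()   # axis order: (x, y, z)

def entropy(p):
    p = p[p > 0]
    return -(p * np.log(p)).sum()   # nats

# H(X|Z) = H(X,Z) - H(Z)
H_x_given_z = entropy(p.sum(axis=1)) - entropy(p.sum(axis=(0, 1)))

# H(X|Y,Z) = H(X,Y,Z) - H(Y,Z)
H_x_given_yz = entropy(p) - entropy(p.sum(axis=0))

cmi = H_x_given_z - H_x_given_yz   # I(X;Y|Z), always >= 0
```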

InfoNCE lower bound (Oord et al., 2018):

I(X;Y) \geq \log N - \mathcal{L}_{\text{InfoNCE}}

where N is the number of negative samples and the InfoNCE loss is:

\mathcal{L}_{\text{InfoNCE}} = -\mathbb{E}\left[\log \frac{e^{f(x,y^+)}}{\sum_{j=1}^{N} e^{f(x,y_j)}}\right]

The bound is tight when f(x,y) = \log \frac{p(y|x)}{p(y)} + c. In practice, the bound can never exceed \log N (the loss is non-negative), so more negatives raise the ceiling and tighten the bound.

import torch
import torch.nn.functional as F

# ── InfoNCE loss (contrastive MI estimation) ────────────────────
# This is the standard way to maximise MI in practice.
# z_x and z_y are paired embeddings from two views of the same data
# (encoder_x, encoder_y, x, y, B and temperature assumed defined elsewhere).
z_x = encoder_x(x)                              # (B, D) embeddings
z_y = encoder_y(y)                              # (B, D)
z_x = F.normalize(z_x, dim=-1)                  # unit norm
z_y = F.normalize(z_y, dim=-1)                  # unit norm

# Cosine similarity matrix scaled by temperature
logits = z_x @ z_y.T / temperature              # (B, B) similarity scores
labels = torch.arange(B, device=logits.device)  # (B,) diagonal is the positive pair

# InfoNCE = cross-entropy where the "correct class" is the matching pair
loss = F.cross_entropy(logits, labels)          # scalar

# ── MINE estimator (Belghazi et al., 2018) ──────────────────────
# Direct MI estimation via a learned statistics network T(x, y)
joint_scores = T(x, y)                          # (B,) scores for real pairs
marginal_scores = T(x, y[torch.randperm(B)])    # (B,) scores for shuffled pairs

# Donsker-Varadhan bound: I(X;Y) >= E[T(x,y)] - log E[exp(T(x,y_shuffled))]
# log E[exp(.)] over B samples = logsumexp(.) - log B
mi_lower_bound = (
    joint_scores.mean()
    - torch.logsumexp(marginal_scores, 0)
    + torch.log(torch.tensor(float(B)))
)
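A common refinement in CLIP-style training is to symmetrise the InfoNCE loss: classify the similarity matrix along both rows (x to its matching y) and columns (y to its matching x), then average the two cross-entropies. A self-contained sketch with random stand-in embeddings (toy sizes, made-up values):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, D, temperature = 8, 16, 0.07   # toy batch/dim; real models use larger values

# Stand-in embeddings for two views (random; a real encoder would produce these)
z_x = F.normalize(torch.randn(B, D), dim=-1)
z_y = F.normalize(torch.randn(B, D), dim=-1)

logits = z_x @ z_y.T / temperature   # (B, B) similarity matrix
labels = torch.arange(B)             # diagonal entries are the positives

# Symmetric InfoNCE: rows (x -> matching y) plus columns (y -> matching x)
loss = 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels))
```

Each direction is a lower bound on the same MI; averaging them tends to stabilise training.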
import numpy as np

def mutual_information_discrete(joint_probs):
    """
    MI from a joint probability table.
    joint_probs: (K_x, K_y) — p(x,y), must sum to 1
    Returns: scalar MI in nats
    """
    p_xy = np.clip(joint_probs, 1e-12, 1.0)  # (K_x, K_y)
    p_x = p_xy.sum(axis=1, keepdims=True)    # (K_x, 1) — marginal
    p_y = p_xy.sum(axis=0, keepdims=True)    # (1, K_y) — marginal
    # I(X;Y) = sum p(x,y) * log(p(x,y) / (p(x)*p(y)))
    return (p_xy * np.log(p_xy / (p_x * p_y))).sum()  # scalar

def infonce_loss(z_x, z_y, temperature=0.07):
    """
    InfoNCE contrastive loss (lower bound on MI).
    z_x: (B, D) — L2-normalised embeddings
    z_y: (B, D) — L2-normalised embeddings
    Returns: scalar loss
    """
    B = z_x.shape[0]
    # Cosine similarity matrix
    logits = (z_x @ z_y.T) / temperature                  # (B, B)
    # Numerically stable cross-entropy with labels = diagonal
    shifted = logits - logits.max(axis=1, keepdims=True)  # (B, B)
    log_sum_exp = np.log(np.exp(shifted).sum(axis=1))     # (B,)
    log_probs_diag = shifted[np.arange(B), np.arange(B)]  # (B,) — positive pairs
    return -(log_probs_diag - log_sum_exp).mean()         # scalar
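The log N ceiling on the InfoNCE bound can be seen empirically. The sketch below (hypothetical helper `infonce_bound`) uses correlated scalar Gaussians, where the true MI is known in closed form and the optimal critic f(x,y) = log p(y|x)/p(y) can be written down exactly: with batch size 2 the estimate can never exceed log 2 ≈ 0.69 nats, even though the true MI is ≈ 1.16 nats.

```python
import numpy as np

rng = np.random.default_rng(0)
rho = 0.95
true_mi = -0.5 * np.log(1 - rho**2)   # closed form for correlated Gaussians

def infonce_bound(B, trials=300):
    """Monte-Carlo InfoNCE MI bound using the optimal critic."""
    vals = []
    for _ in range(trials):
        x = rng.standard_normal(B)
        y = rho * x + np.sqrt(1 - rho**2) * rng.standard_normal(B)
        # log p(y|x)/p(y), up to a per-row constant (cancels in the softmax)
        logits = (rho * np.outer(x, y) - 0.5 * rho**2 * y[None, :] ** 2) / (1 - rho**2)
        shifted = logits - logits.max(axis=1, keepdims=True)
        lse = np.log(np.exp(shifted).sum(axis=1))
        diag = shifted[np.arange(B), np.arange(B)]
        vals.append(np.log(B) + (diag - lse).mean())   # per-batch bound <= log B
    return float(np.mean(vals))

b2, b64 = infonce_bound(2), infonce_bound(64)
# b2 is capped at log 2, well below true_mi; b64 gets much closer
```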
  • Contrastive learning (SimCLR, MoCo, CLIP): InfoNCE maximises a lower bound on MI between two augmented views — this is why contrastive learning works as a representation learning method
  • InfoMax principle (Deep InfoMax, CPC): learn representations that maximise MI between input and encoding — “preserve as much information as possible”
  • Information bottleneck (VIB): compress representations by minimising I(X;Z) while maximising I(Z;Y) — keep only the label-relevant information
  • MINE / neural MI estimation (Belghazi et al., 2018): train a neural network to estimate MI directly using variational bounds — useful for analysing learned representations
  • Disentangled representations (beta-VAE): total correlation TC(Z) = KL(q(z) || prod q(z_i)) is a multi-variable generalisation of MI — minimising it encourages independent latent dimensions
| Alternative | When to use | Tradeoff |
| --- | --- | --- |
| Correlation / cosine similarity | Quick check for linear relationships | Only captures linear dependence; MI captures arbitrary nonlinear relationships |
| KL divergence | When you have two distributions over the same variable, not two variables | Asymmetric; MI is symmetric and measures shared information between different variables |
| CKA (Centered Kernel Alignment) | Comparing learned representations across models | Practical and stable, but no information-theoretic interpretation |
| Hilbert-Schmidt Independence Criterion | Kernel-based independence testing | Consistent estimator without density estimation, but harder to optimise as a training objective |
| Wasserstein distance | When you care about the geometry of distributions, not just dependence | Accounts for the metric structure of the space; MI treats all mismatches equally |

Mutual information was defined by Shannon (1948) alongside entropy as part of the foundational framework of information theory. It remained primarily a theoretical tool in statistics and communications until Linsker (1988) proposed the InfoMax principle: the optimal representation of input data is one that maximises the mutual information between the input and the representation, subject to constraints.

The deep learning revival of MI came from two directions. Oord et al. (2018, “Representation Learning with Contrastive Predictive Coding”) introduced InfoNCE, showing that a contrastive classification loss provides a tractable lower bound on MI — this directly motivated SimCLR, MoCo, and CLIP. Simultaneously, Belghazi et al. (2018, MINE) showed MI could be estimated with neural networks using variational bounds. The practical difficulty of MI estimation (bounds can be loose, high-variance) has led the field to treat contrastive losses as effective objectives in their own right, somewhat independent of the MI interpretation.