InfoNCE Loss

A contrastive loss that treats representation learning as a classification problem: “which of these N candidates is the true positive?” The core training objective for SimCLR, CLIP, MoCo, and most modern contrastive learning methods. Also called NT-Xent (normalised temperature-scaled cross-entropy) in the SimCLR paper — they are the same thing.
Intuition
You have an anchor (e.g. an augmented image), one positive (a different augmentation of the same image), and many negatives (augmentations of other images). InfoNCE computes the similarity of the anchor to every candidate and asks: “can you pick the positive out of the lineup?” The loss is literally cross-entropy over this N-way classification problem, where the “classes” are the candidates and the “correct class” is the positive.
Why does this work for learning representations? Because the only way to reliably pick the positive from a large set of negatives is to encode the semantic content that makes the positive similar to the anchor. Surface-level features (colour, orientation) vary across augmentations, so they can’t help. The model must learn features that are invariant to augmentation but discriminative across different inputs.
Temperature controls how “sharp” the classification is. Low temperature (e.g. 0.07) makes the softmax peakier, forcing the model to produce very distinct embeddings. High temperature (e.g. 1.0) is more forgiving. SimCLR and CLIP both use learned or tuned temperatures around 0.07-0.1. Setting temperature too low causes training instability; too high makes the task too easy and representations become less discriminative.
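To make the temperature effect concrete, here is a small standalone sketch (the similarity values below are made up for illustration):

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax
    e = np.exp(x - x.max())
    return e / e.sum()

# Made-up cosine similarities of one anchor to 4 candidates;
# candidate 0 is the positive.
sims = np.array([0.8, 0.6, 0.5, 0.4])

sharp = softmax(sims / 0.07)  # low temperature: near one-hot on the positive
soft = softmax(sims / 1.0)    # high temperature: close to uniform
```

At τ = 0.07 the positive absorbs almost all of the probability mass, so small similarity gaps produce large gradients; at τ = 1.0 the distribution stays close to uniform and the classification pressure is weak.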
General form (anchor $q$, positive $k^+$, negatives $k^-_1, \dots, k^-_{N-1}$):

$$\mathcal{L}_{\text{InfoNCE}} = -\log \frac{\exp(\mathrm{sim}(q, k^+)/\tau)}{\exp(\mathrm{sim}(q, k^+)/\tau) + \sum_{i=1}^{N-1} \exp(\mathrm{sim}(q, k^-_i)/\tau)}$$

where $\mathrm{sim}(u, v) = \frac{u \cdot v}{\|u\|\,\|v\|}$ (cosine similarity) and $\tau$ is the temperature.
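The general form can be transcribed directly for a single anchor (a hypothetical standalone helper written for illustration, not part of any library):

```python
import numpy as np

def infonce_single(q, k_pos, k_negs, temperature=0.07):
    # One anchor q (D,), one positive k_pos (D,), negatives k_negs (N-1, D)
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    # Candidate 0 is the positive, the rest are negatives
    logits = np.array([cos(q, k_pos)] + [cos(q, k) for k in k_negs]) / temperature
    logits -= logits.max()  # numerical stability
    # Cross-entropy with the positive as the "correct class"
    return -np.log(np.exp(logits[0]) / np.exp(logits).sum())

rng = np.random.default_rng(0)
q = rng.normal(size=8)
loss = infonce_single(q, q, rng.normal(size=(5, 8)))  # positive = anchor itself
```

When the positive is the most similar candidate, the loss is bounded above by log N, and it approaches zero as the similarity gap between positive and negatives widens.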
Batch form (SimCLR-style, all pairs in a batch of $N$ pairs, i.e. $2N$ augmented views):

For each positive pair $(i, j)$ from the same source, with the remaining $2N - 2$ samples as negatives:

$$\ell_{i,j} = -\log \frac{\exp(\mathrm{sim}(z_i, z_j)/\tau)}{\sum_{k \neq i} \exp(\mathrm{sim}(z_i, z_k)/\tau)}$$

Total loss averages $\ell_{i,j}$ over all $2N$ anchors.
Connection to mutual information: InfoNCE is a lower bound on the mutual information between the anchor and positive representations: $I(x; y) \ge \log N - \mathcal{L}_{\text{InfoNCE}}$. More negatives give a tighter bound, which is why larger batch sizes help.
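A rough numerical illustration of the bound, using synthetic Gaussian data and an inline batch InfoNCE (the batch size, dimension, noise scale, and τ = 0.1 here are arbitrary choices for the demo):

```python
import numpy as np

def infonce(z1, z2, temperature=0.1):
    # Batch InfoNCE with in-batch negatives (row i's positive is column i)
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / temperature
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = np.diag(shifted) - np.log(np.exp(shifted).sum(axis=1))
    return -log_probs.mean()

rng = np.random.default_rng(0)
B, D = 256, 32

# Independent views share no information: the bound sits near zero
mi_indep = np.log(B) - infonce(rng.normal(size=(B, D)), rng.normal(size=(B, D)))

# Correlated views (positive = anchor + small noise): bound rises toward log(B)
x = rng.normal(size=(B, D))
mi_corr = np.log(B) - infonce(x, x + 0.1 * rng.normal(size=(B, D)))
```

Note that the estimate can never exceed log(B) ≈ 5.55 here, which is exactly why more negatives (larger B) are needed to certify high mutual information.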
```python
import torch
import torch.nn.functional as F

# ── SimCLR-style InfoNCE (in-batch negatives) ────────────────────
def infonce_loss(z1, z2, temperature=0.07):
    """
    z1, z2: (B, D) — L2-normalised embeddings of two augmentations
    Returns scalar loss.
    """
    z1 = F.normalize(z1, dim=-1)  # (B, D)
    z2 = F.normalize(z2, dim=-1)  # (B, D)

    # All pairwise cosine similarities, scaled by temperature
    # Each row i should match column i (the positive pair)
    logits = z1 @ z2.T / temperature  # (B, B)

    # Labels: sample i's positive is at index i
    labels = torch.arange(z1.size(0), device=z1.device)  # (B,)
    loss = F.cross_entropy(logits, labels)  # scalar
    return loss

# ── CLIP-style (symmetric) ──────────────────────────────────────
# CLIP averages image→text and text→image directions:
logits = image_emb @ text_emb.T / temperature  # (B, B)
labels = torch.arange(B, device=logits.device)
loss = (F.cross_entropy(logits, labels)
        + F.cross_entropy(logits.T, labels)) / 2
```

Manual Implementation
```python
import numpy as np

def infonce_loss(z1, z2, temperature=0.07):
    """
    InfoNCE / NT-Xent loss for a batch of positive pairs.
    Equivalent to SimCLR's contrastive loss.
    z1: (B, D) normalised embeddings (view 1)
    z2: (B, D) normalised embeddings (view 2)
    """
    B = z1.shape[0]

    # L2 normalise
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)  # (B, D)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)  # (B, D)

    # Cosine similarity matrix, scaled by temperature
    logits = z1 @ z2.T / temperature  # (B, B)

    # Numerically stable log-softmax along rows
    shifted = logits - logits.max(axis=1, keepdims=True)  # (B, B)
    log_sum_exp = np.log(np.exp(shifted).sum(axis=1))  # (B,)
    log_probs = shifted[np.arange(B), np.arange(B)] - log_sum_exp  # (B,)

    return -log_probs.mean()
```
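A quick sanity check of the loss at its two extremes (the function is repeated here so the snippet runs on its own; batch size and dimension are arbitrary):

```python
import numpy as np

def infonce_loss(z1, z2, temperature=0.07):
    # Same manual implementation as above, repeated for self-containment
    B = z1.shape[0]
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / temperature
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted[np.arange(B), np.arange(B)] - np.log(np.exp(shifted).sum(axis=1))
    return -log_probs.mean()

rng = np.random.default_rng(0)
z = rng.normal(size=(64, 16))

aligned = infonce_loss(z, z)                            # identical views
unrelated = infonce_loss(z, rng.normal(size=(64, 16)))  # unrelated views
```

Identical views make each positive the clear winner, so the loss is near zero; unrelated views make the positive indistinguishable from the negatives, so the loss sits near log B (≈ 4.16 for B = 64), the chance level of the B-way classification.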
Popular Uses

- SimCLR (see contrastive-self-supervising/): InfoNCE with in-batch negatives; requires large batch sizes (4096+) to provide enough negatives
- CLIP (OpenAI): symmetric InfoNCE between image and text embeddings — the loss that enables zero-shot image classification
- MoCo (see contrastive-self-supervising/): InfoNCE with a momentum-updated queue of negatives, decoupling batch size from negative count
- Audio-visual learning (AudioCLIP, ImageBind): InfoNCE across modalities — aligning audio, image, and text in a shared embedding space
- Dense retrieval (DPR, ColBERT): InfoNCE between query and document embeddings for search
Alternatives
| Alternative | When to use | Tradeoff |
|---|---|---|
| Triplet loss | Small number of negatives, fine-grained retrieval | Only considers one negative at a time; less efficient use of batch |
| Contrastive loss (pairwise) | Binary same/different pairs, simple setup | No temperature, no softmax — just margin-based on pairs. Less effective with many negatives |
| BYOL / non-contrastive | Want to avoid needing large batches of negatives | No negatives at all — uses EMA and prediction head instead. Risk of collapse without careful design |
| SupCon (supervised contrastive) | Have labels, want to leverage them | Multiple positives per anchor; typically outperforms plain InfoNCE when labels are available |
| VICReg | Want explicit control over representation properties | Variance/invariance/covariance terms replace the implicit pressure from negatives. No temperature to tune |
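The SupCon row above differs from InfoNCE only in allowing multiple positives per anchor. A minimal sketch after Khosla et al. (2020), with a hypothetical helper name and synthetic two-class data:

```python
import numpy as np

def supcon_loss(z, labels, temperature=0.07):
    # z: (B, D) embeddings; labels: (B,) ints. Every other sample sharing an
    # anchor's label counts as a positive (each anchor needs at least one).
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    B = z.shape[0]

    logits = z @ z.T / temperature
    np.fill_diagonal(logits, -np.inf)  # never contrast a sample with itself

    # Row-wise log-softmax over all other samples
    row_max = logits.max(axis=1, keepdims=True)
    log_probs = logits - row_max - np.log(np.exp(logits - row_max).sum(axis=1, keepdims=True))

    # Average the positives' log-probabilities per anchor, then over anchors
    pos_mask = (labels[:, None] == labels[None, :]) & ~np.eye(B, dtype=bool)
    per_anchor = np.where(pos_mask, log_probs, 0.0).sum(axis=1) / pos_mask.sum(axis=1)
    return -per_anchor.mean()

rng = np.random.default_rng(0)
labels = np.repeat([0, 1], 8)  # two classes, 8 samples each
z = rng.normal(size=(2, 8))[labels] * 3 + 0.1 * rng.normal(size=(16, 8))

good = supcon_loss(z, labels)             # true labels: positives cluster together
bad = supcon_loss(z, np.tile([0, 1], 8))  # scrambled labels: positives scattered
```

With labels that match the cluster structure, every positive sits near its anchor and the loss is low; scrambling the labels forces cross-cluster "positives" and the loss rises.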
Historical Context
InfoNCE was introduced by van den Oord et al. (2018) in “Representation Learning with Contrastive Predictive Coding” (CPC). The name combines “Info” (mutual information) with NCE (noise-contrastive estimation) — the loss lower-bounds the mutual information between the anchor and positive. The theoretical grounding came from noise-contrastive estimation (Gutmann & Hyvärinen, 2010).
The loss became dominant in self-supervised learning through SimCLR (Chen et al., 2020) and MoCo (He et al., 2020), which showed that InfoNCE with the right data augmentations could match supervised pretraining on ImageNet. CLIP (Radford et al., 2021) extended it to multimodal learning, applying InfoNCE between image and text to create one of the most widely used vision-language models. The key practical discovery was the importance of temperature — both SimCLR and CLIP found that a low temperature (around 0.07) was critical for learning good representations.