InfoNCE Loss

A contrastive loss that treats representation learning as a classification problem: “which of these N candidates is the true positive?” The core training objective for SimCLR, CLIP, MoCo, and most modern contrastive learning methods. Also called NT-Xent (normalised temperature-scaled cross-entropy) in the SimCLR paper — they are the same thing.
Intuition
You have an anchor (e.g. an augmented image), one positive (a different augmentation of the same image), and many negatives (augmentations of other images). InfoNCE computes the similarity of the anchor to every candidate and asks: “can you pick the positive out of the lineup?” The loss is literally cross-entropy over this N-way classification problem, where the “classes” are the candidates and the “correct class” is the positive.
Why does this work for learning representations? Because the only way to reliably pick the positive from a large set of negatives is to encode the semantic content that makes the positive similar to the anchor. Surface-level features (colour, orientation) vary across augmentations, so they can’t help. The model must learn features that are invariant to augmentation but discriminative across different inputs.
Temperature controls how “sharp” the classification is. Low temperature (e.g. 0.07) makes the softmax peakier, forcing the model to produce very distinct embeddings. High temperature (e.g. 1.0) is more forgiving. SimCLR and CLIP both use learned or tuned temperatures around 0.07-0.1. Setting temperature too low causes training instability; too high makes the task too easy and representations become less discriminative.
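To make the temperature effect concrete, here is a small standalone sketch (the similarity values below are made up for illustration):

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax
    e = np.exp(x - x.max())
    return e / e.sum()

# Made-up cosine similarities of one anchor to 4 candidates;
# candidate 0 is the positive.
sims = np.array([0.8, 0.6, 0.5, 0.4])

sharp = softmax(sims / 0.07)  # low temperature: near one-hot on the positive
soft = softmax(sims / 1.0)    # high temperature: close to uniform
```

At τ = 0.07 the positive absorbs almost all of the probability mass, so small similarity gaps produce large gradients; at τ = 1.0 the distribution stays close to uniform and the classification pressure is weak.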
General form (anchor $q$, positive $k^+$, negatives $k^-_1, \dots, k^-_{N-1}$):

$$\mathcal{L}_{\text{InfoNCE}} = -\log \frac{\exp(\mathrm{sim}(q, k^+)/\tau)}{\exp(\mathrm{sim}(q, k^+)/\tau) + \sum_{i=1}^{N-1} \exp(\mathrm{sim}(q, k^-_i)/\tau)}$$

where $\mathrm{sim}(u, v) = \frac{u \cdot v}{\|u\|\,\|v\|}$ (cosine similarity) and $\tau$ is the temperature.
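The general form can be transcribed directly for a single anchor (a hypothetical standalone helper written for illustration, not part of any library):

```python
import numpy as np

def infonce_single(q, k_pos, k_negs, temperature=0.07):
    # One anchor q (D,), one positive k_pos (D,), negatives k_negs (N-1, D)
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    # Candidate 0 is the positive, the rest are negatives
    logits = np.array([cos(q, k_pos)] + [cos(q, k) for k in k_negs]) / temperature
    logits -= logits.max()  # numerical stability
    # Cross-entropy with the positive as the "correct class"
    return -np.log(np.exp(logits[0]) / np.exp(logits).sum())

rng = np.random.default_rng(0)
q = rng.normal(size=8)
loss = infonce_single(q, q, rng.normal(size=(5, 8)))  # positive = anchor itself
```

When the positive is the most similar candidate, the loss is bounded above by log N, and it approaches zero as the similarity gap between positive and negatives widens.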
Batch form (SimCLR-style, all pairs in a batch of $N$ pairs, i.e. $2N$ augmented views):

For each positive pair $(i, j)$ from the same source, with the remaining $2N - 2$ samples as negatives:

$$\ell_{i,j} = -\log \frac{\exp(\mathrm{sim}(z_i, z_j)/\tau)}{\sum_{k \neq i} \exp(\mathrm{sim}(z_i, z_k)/\tau)}$$

Total loss averages $\ell_{i,j}$ over all $2N$ anchors.
Connection to mutual information: InfoNCE is a lower bound on the mutual information between the anchor and positive representations: $I(x; y) \ge \log N - \mathcal{L}_{\text{InfoNCE}}$. More negatives give a tighter bound, which is why larger batch sizes help.
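A rough numerical illustration of the bound, using synthetic Gaussian data and an inline batch InfoNCE (the batch size, dimension, noise scale, and τ = 0.1 here are arbitrary choices for the demo):

```python
import numpy as np

def infonce(z1, z2, temperature=0.1):
    # Batch InfoNCE with in-batch negatives (row i's positive is column i)
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / temperature
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = np.diag(shifted) - np.log(np.exp(shifted).sum(axis=1))
    return -log_probs.mean()

rng = np.random.default_rng(0)
B, D = 256, 32

# Independent views share no information: the bound sits near zero
mi_indep = np.log(B) - infonce(rng.normal(size=(B, D)), rng.normal(size=(B, D)))

# Correlated views (positive = anchor + small noise): bound rises toward log(B)
x = rng.normal(size=(B, D))
mi_corr = np.log(B) - infonce(x, x + 0.1 * rng.normal(size=(B, D)))
```

Note that the estimate can never exceed log(B) ≈ 5.55 here, which is exactly why more negatives (larger B) are needed to certify high mutual information.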
```python
import torch
import torch.nn.functional as F

# ── SimCLR-style InfoNCE (in-batch negatives) ────────────────────
def infonce_loss(z1, z2, temperature=0.07):
    """
    z1, z2: (B, D) — L2-normalised embeddings of two augmentations
    Returns scalar loss.
    """
    z1 = F.normalize(z1, dim=-1)  # (B, D)
    z2 = F.normalize(z2, dim=-1)  # (B, D)

    # All pairwise cosine similarities, scaled by temperature
    # Each row i should match column i (the positive pair)
    logits = z1 @ z2.T / temperature  # (B, B)

    # Labels: sample i's positive is at index i
    labels = torch.arange(z1.size(0), device=z1.device)  # (B,)
    loss = F.cross_entropy(logits, labels)  # scalar
    return loss

# ── CLIP-style (symmetric) ──────────────────────────────────────
# CLIP averages image→text and text→image directions:
logits = image_emb @ text_emb.T / temperature  # (B, B)
labels = torch.arange(B, device=logits.device)
loss = (F.cross_entropy(logits, labels)
        + F.cross_entropy(logits.T, labels)) / 2
```

Manual Implementation
```python
import numpy as np

def infonce_loss(z1, z2, temperature=0.07):
    """
    InfoNCE / NT-Xent loss for a batch of positive pairs.
    Equivalent to SimCLR's contrastive loss.
    z1: (B, D) normalised embeddings (view 1)
    z2: (B, D) normalised embeddings (view 2)
    """
    B = z1.shape[0]

    # L2 normalise
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)  # (B, D)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)  # (B, D)

    # Cosine similarity matrix, scaled by temperature
    logits = z1 @ z2.T / temperature  # (B, B)

    # Numerically stable log-softmax along rows
    shifted = logits - logits.max(axis=1, keepdims=True)  # (B, B)
    log_sum_exp = np.log(np.exp(shifted).sum(axis=1))  # (B,)
    log_probs = shifted[np.arange(B), np.arange(B)] - log_sum_exp  # (B,)

    return -log_probs.mean()
```
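A quick sanity check of the loss at its two extremes (the function is repeated here so the snippet runs on its own; batch size and dimension are arbitrary):

```python
import numpy as np

def infonce_loss(z1, z2, temperature=0.07):
    # Same manual implementation as above, repeated for self-containment
    B = z1.shape[0]
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / temperature
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted[np.arange(B), np.arange(B)] - np.log(np.exp(shifted).sum(axis=1))
    return -log_probs.mean()

rng = np.random.default_rng(0)
z = rng.normal(size=(64, 16))

aligned = infonce_loss(z, z)                            # identical views
unrelated = infonce_loss(z, rng.normal(size=(64, 16)))  # unrelated views
```

Identical views make each positive the clear winner, so the loss is near zero; unrelated views make the positive indistinguishable from the negatives, so the loss sits near log B (≈ 4.16 for B = 64), the chance level of the B-way classification.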
Popular Uses

- SimCLR (see contrastive-self-supervising/): InfoNCE with in-batch negatives; requires large batch sizes (4096+) to provide enough negatives
- CLIP (OpenAI): symmetric InfoNCE between image and text embeddings — the loss that enables zero-shot image classification
- MoCo (see contrastive-self-supervising/): InfoNCE with a momentum-updated queue of negatives, decoupling batch size from negative count
- Audio-visual learning (AudioCLIP, ImageBind): InfoNCE across modalities — aligning audio, image, and text in a shared embedding space
- Dense retrieval (DPR, ColBERT): InfoNCE between query and document embeddings for search
Alternatives
| Alternative | When to use | Tradeoff |
|---|---|---|
| Triplet loss | Small number of negatives, fine-grained retrieval | Only considers one negative at a time; less efficient use of batch |
| Contrastive loss (pairwise) | Binary same/different pairs, simple setup | No temperature, no softmax — just margin-based on pairs. Less effective with many negatives |
| BYOL / non-contrastive | Want to avoid needing large batches of negatives | No negatives at all — uses EMA and prediction head instead. Risk of collapse without careful design |
| SupCon (supervised contrastive) | Have labels, want to leverage them | Multiple positives per anchor; typically outperforms plain InfoNCE when labels are available |
| VICReg | Want explicit control over representation properties | Variance/invariance/covariance terms replace the implicit pressure from negatives. No temperature to tune |
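The SupCon row above differs from InfoNCE only in allowing multiple positives per anchor. A minimal sketch after Khosla et al. (2020), with a hypothetical helper name and synthetic two-class data:

```python
import numpy as np

def supcon_loss(z, labels, temperature=0.07):
    # z: (B, D) embeddings; labels: (B,) ints. Every other sample sharing an
    # anchor's label counts as a positive (each anchor needs at least one).
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    B = z.shape[0]

    logits = z @ z.T / temperature
    np.fill_diagonal(logits, -np.inf)  # never contrast a sample with itself

    # Row-wise log-softmax over all other samples
    row_max = logits.max(axis=1, keepdims=True)
    log_probs = logits - row_max - np.log(np.exp(logits - row_max).sum(axis=1, keepdims=True))

    # Average the positives' log-probabilities per anchor, then over anchors
    pos_mask = (labels[:, None] == labels[None, :]) & ~np.eye(B, dtype=bool)
    per_anchor = np.where(pos_mask, log_probs, 0.0).sum(axis=1) / pos_mask.sum(axis=1)
    return -per_anchor.mean()

rng = np.random.default_rng(0)
labels = np.repeat([0, 1], 8)  # two classes, 8 samples each
z = rng.normal(size=(2, 8))[labels] * 3 + 0.1 * rng.normal(size=(16, 8))

good = supcon_loss(z, labels)             # true labels: positives cluster together
bad = supcon_loss(z, np.tile([0, 1], 8))  # scrambled labels: positives scattered
```

With labels that match the cluster structure, every positive sits near its anchor and the loss is low; scrambling the labels forces cross-cluster "positives" and the loss rises.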
Historical Context
InfoNCE was introduced by van den Oord et al. (2018) in “Representation Learning with Contrastive Predictive Coding” (CPC). The name combines “Info” (mutual information) with NCE (noise-contrastive estimation) — the loss lower-bounds the mutual information between the anchor and positive. The theoretical grounding came from noise-contrastive estimation (Gutmann & Hyvärinen, 2010).
The loss became dominant in self-supervised learning through SimCLR (Chen et al., 2020) and MoCo (He et al., 2020), which showed that InfoNCE with the right data augmentations could match supervised pretraining on ImageNet. CLIP (Radford et al., 2021) extended it to multimodal learning, applying InfoNCE between image and text to create one of the most widely used vision-language models. The key practical discovery was the importance of temperature — both SimCLR and CLIP found that a low temperature (around 0.07) was critical for learning good representations.