Contrastive Loss
The original pairwise contrastive loss that learns embeddings from same/different pairs. Given two inputs, the loss pulls same-class pairs together and pushes different-class pairs apart beyond a margin. Introduced before triplet loss and InfoNCE — simpler but less powerful. Not to be confused with InfoNCE, which is often loosely called “contrastive loss” in modern papers.
Intuition
You have two items and a label: “same” or “different.” For same-class pairs, the loss is simply the squared distance between their embeddings — pull them as close as possible. For different-class pairs, the loss activates only when they are closer than a margin m: push them apart until they are at least m apart, then stop.
The margin is essential. Without it, the model would try to push all negative pairs infinitely far apart, which conflicts with the goal of creating a compact, useful embedding space. The margin says “different-class items just need to be distinguishable, not maximally separated.” This creates a structured space where same-class items cluster tightly and different clusters maintain at least a margin of separation.
The limitation compared to modern approaches is that each pair provides weak signal. A positive pair says “these two should be close” but gives no information about where they should be relative to everything else. InfoNCE, by contrast, positions the anchor relative to many negatives simultaneously. This is why pairwise contrastive loss requires more training iterations and careful pair selection to achieve comparable representation quality.
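That pair selection step can be as simple as enumerating every pair in a labelled batch. A naive offline sketch (the `make_pairs` helper is illustrative, not from the article):

```python
import numpy as np

def make_pairs(labels):
    """Enumerate all index pairs (i, j), i < j, with a same/different flag."""
    labels = np.asarray(labels)
    idx_a, idx_b, pair_y = [], [], []
    n = len(labels)
    for i in range(n):
        for j in range(i + 1, n):
            idx_a.append(i)
            idx_b.append(j)
            pair_y.append(1.0 if labels[i] == labels[j] else 0.0)
    return np.array(idx_a), np.array(idx_b), np.array(pair_y)

# For class labels [0, 0, 1] this yields pairs (0,1) same, (0,2) and (1,2) different
a, b, y = make_pairs([0, 0, 1])
```

In practice this O(n²) enumeration is usually pruned (e.g. keeping only hard negatives), which is exactly the "careful pair selection" burden that batch-wise losses like InfoNCE sidestep.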
Standard form (Hadsell, Chopra & LeCun, 2006):

L(d, y) = ½ · [ y · d² + (1 − y) · max(0, m − d)² ]
where:
- d is the Euclidean distance between the two embeddings
- y = 1 if same class (positive pair), y = 0 if different class (negative pair)
- m is the margin (typically 1.0–2.0)
Breakdown:
- Same-class (y = 1): loss = ½ d² — minimise distance, always active
- Different-class (y = 0): loss = ½ max(0, m − d)² — only penalise if distance d < m
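Plugging a few numbers into the breakdown above makes the margin behaviour concrete (margin m = 1.0; the helper name is hypothetical):

```python
def pairwise_contrastive(d, y, m=1.0):
    # loss for a single pair at distance d with label y (1 = same, 0 = different)
    return 0.5 * (y * d**2 + (1 - y) * max(0.0, m - d)**2)

# Positive pair at d = 0.4: always active, loss = 0.5 * 0.4^2 = 0.08
# Negative pair at d = 0.4: inside the margin, loss = 0.5 * (1.0 - 0.4)^2 = 0.18
# Negative pair at d = 1.5: beyond the margin, loss = 0.0 (already far enough)
losses = [pairwise_contrastive(0.4, 1),
          pairwise_contrastive(0.4, 0),
          pairwise_contrastive(1.5, 0)]
```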
With cosine similarity (alternative formulation):

L(s, y) = ½ · [ y · (1 − s)² + (1 − y) · max(0, s − m)² ]

where s is the cosine similarity between the two embeddings and m is the cosine margin.
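The cosine variant behaves the same way. A quick worked check with made-up similarity values and margin m = 0.5 (helper name hypothetical):

```python
def cosine_pair_loss(s, y, m=0.5):
    # loss for one pair with cosine similarity s and label y (1 = same, 0 = different)
    return 0.5 * (y * (1 - s)**2 + (1 - y) * max(0.0, s - m)**2)

# Positive pair, already similar (s = 0.9): loss = 0.5 * 0.1^2 = 0.005
# Negative pair, too similar (s = 0.9):     loss = 0.5 * 0.4^2 = 0.08
# Negative pair below the margin (s = 0.2): loss = 0.0
losses = [cosine_pair_loss(0.9, 1),
          cosine_pair_loss(0.9, 0),
          cosine_pair_loss(0.2, 0)]
```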
```python
import torch
import torch.nn.functional as F

# ── Pairwise contrastive loss ────────────────────────────────────
def contrastive_loss(emb1, emb2, labels, margin=1.0):
    """
    emb1, emb2: (B, D) embeddings of the two items in each pair
    labels: (B,) float — 1.0 for same class, 0.0 for different
    margin: float — minimum distance for negative pairs
    """
    dist = F.pairwise_distance(emb1, emb2, p=2)             # (B,) Euclidean distance
    pos_loss = labels * dist.pow(2)                         # (B,) pull same together
    neg_loss = (1 - labels) * F.relu(margin - dist).pow(2)  # (B,) push diff apart
    return (pos_loss + neg_loss).mean() * 0.5

# ── Using PyTorch built-in ───────────────────────────────────────
# PyTorch's CosineEmbeddingLoss uses cosine similarity with labels in {-1, +1}:
loss_fn = torch.nn.CosineEmbeddingLoss(margin=0.5)
# labels: +1 for same, -1 for different (NOT 0/1)
labels_pm1 = 2 * labels - 1             # map {0, 1} labels to {-1, +1}
loss = loss_fn(emb1, emb2, labels_pm1)  # scalar

# WARNING: CosineEmbeddingLoss uses {-1, +1} labels, while the
# Euclidean version above uses {0, 1}. Mixing these up silently
# produces garbage gradients.
```

Manual Implementation
```python
import numpy as np

def contrastive_loss(emb1, emb2, labels, margin=1.0):
    """
    Pairwise contrastive loss (Euclidean).
    emb1:   (B, D) embeddings
    emb2:   (B, D) embeddings
    labels: (B,) 1.0 = same class, 0.0 = different class
    margin: minimum distance for negative pairs
    """
    # Euclidean distance between paired embeddings
    diff = emb1 - emb2                              # (B, D)
    dist = np.sqrt((diff ** 2).sum(axis=1) + 1e-8)  # (B,) add eps for grad stability

    # Positive pairs: minimise squared distance
    pos_loss = labels * dist ** 2                   # (B,)

    # Negative pairs: penalise only if closer than margin
    neg_loss = (1 - labels) * np.maximum(0, margin - dist) ** 2  # (B,)

    return ((pos_loss + neg_loss) * 0.5).mean()

def cosine_contrastive_loss(emb1, emb2, labels, margin=0.5):
    """
    Pairwise contrastive loss using cosine similarity.
    labels: (B,) 1.0 = same, 0.0 = different
    """
    # Cosine similarity
    norm1 = emb1 / (np.linalg.norm(emb1, axis=1, keepdims=True) + 1e-8)
    norm2 = emb2 / (np.linalg.norm(emb2, axis=1, keepdims=True) + 1e-8)
    sim = (norm1 * norm2).sum(axis=1)  # (B,)

    pos_loss = labels * (1 - sim) ** 2                          # (B,)
    neg_loss = (1 - labels) * np.maximum(0, sim - margin) ** 2  # (B,)
    return ((pos_loss + neg_loss) * 0.5).mean()
```

Popular Uses
- Siamese networks (signature/face verification): the original application. Two networks with shared weights compare pairs of inputs
- Sentence similarity (Sentence-BERT): learn sentence embeddings where paraphrases are close and unrelated sentences are far
- One-shot / few-shot learning: compare a query to support examples using learned distance
- Change detection (satellite imagery): determine if two images of the same location show changes
- Duplicate detection: find near-duplicate images, documents, or products by embedding distance
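The shared-weights idea behind a Siamese network can be sketched in a few lines of NumPy (toy dimensions; a random weight matrix stands in for a trained encoder):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(32, 8))   # ONE weight matrix, shared by both branches

def embed(x):
    # the same mapping is applied to each item of the pair
    return np.tanh(x @ W)

x1 = rng.normal(size=(4, 32))  # first item of each of 4 pairs
x2 = rng.normal(size=(4, 32))  # second item of each pair
e1, e2 = embed(x1), embed(x2)  # two forward passes, identical weights
dist = np.linalg.norm(e1 - e2, axis=1)  # (4,) distances fed to the pairwise loss
```

Because both branches share W, a gradient step that moves one embedding also reshapes the space the other lives in; that weight sharing is what makes the learned distance symmetric.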
Alternatives
| Alternative | When to use | Tradeoff |
|---|---|---|
| InfoNCE | Have large batches, want strong representations | Uses all batch negatives simultaneously — much more efficient. The modern default |
| Triplet loss | Have anchor-positive-negative triplets | Slightly more expressive than pairs — considers relative distances. Still less efficient than InfoNCE |
| ArcFace / CosFace | Closed-set recognition with class labels | Angular margin directly on classification logits. Better when class labels are available at training |
| BYOL / SimSiam | Want to avoid needing negative pairs entirely | Only uses positive pairs with asymmetric architecture. Avoids margin tuning and negative mining |
| Cosine embedding loss | Working with normalised embeddings | Same principle but in cosine space. Built into PyTorch as nn.CosineEmbeddingLoss |
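To make the table's first row concrete, here is a minimal NumPy sketch of InfoNCE with in-batch negatives (an illustrative implementation, not the article's; the `temperature` parameter and diagonal targets follow the usual convention):

```python
import numpy as np

def info_nce(anchors, positives, temperature=0.1):
    """Each anchor's positive is the matching row of `positives`;
    all other rows in the batch act as negatives."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = (a @ p.T) / temperature                     # (B, B) cosine similarities
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))                  # targets on the diagonal

# Matched pairs score a low loss; mismatched pairs a high one:
ident = np.eye(4)
low = info_nce(ident, ident)                      # near 0: every anchor finds its positive
high = info_nce(ident, np.roll(ident, 1, axis=0)) # large: positives point at wrong rows
```

Note how one forward pass scores each anchor against B candidates at once, versus the single comparison per pair in the pairwise loss.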
Historical Context
Pairwise contrastive loss was introduced by Hadsell, Chopra, and LeCun (2006) in “Dimensionality Reduction by Learning an Invariant Mapping,” where they used Siamese networks to learn a mapping that preserved neighbourhood relationships. This was one of the first demonstrations that a neural network could learn useful general-purpose embeddings (as opposed to task-specific classifiers).
The approach was the dominant paradigm for metric learning and verification tasks (face verification, signature verification) through the early 2010s. Triplet loss (FaceNet, 2015) extended the idea from pairs to triplets, providing richer learning signal. InfoNCE (CPC, 2018) generalised further to N-way comparisons, effectively making pairwise contrastive loss obsolete for representation learning. The pairwise formulation remains useful in domains where the data naturally comes as labelled pairs (e.g. paraphrase detection, duplicate finding) or where simplicity is valued over representation quality.