Contrastive Loss

The original pairwise contrastive loss that learns embeddings from same/different pairs. Given two inputs, the loss pulls same-class pairs together and pushes different-class pairs apart beyond a margin. Introduced before triplet loss and InfoNCE — simpler but less powerful. Not to be confused with InfoNCE, which is often loosely called “contrastive loss” in modern papers.

You have two items and a label: “same” or “different.” For same-class pairs, the loss is simply the squared distance between their embeddings — pull them as close as possible. For different-class pairs, the loss activates only when they are closer than a margin m: push them apart until they are at least m apart, then stop.

The margin is essential. Without it, the model would try to push all negative pairs infinitely far apart, which conflicts with the goal of creating a compact, useful embedding space. The margin says “different-class items just need to be distinguishable, not maximally separated.” This creates a structured space where same-class items cluster tightly and different clusters maintain at least margin m of separation.
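To make the margin's clipping behaviour concrete, here is a small sketch evaluating just the negative-pair term of the standard loss (defined below) at a few distances, with margin 1.0:

```python
def neg_pair_loss(d, margin=1.0):
    """Negative-pair term of the contrastive loss: max(0, m - d)^2 / 2."""
    return 0.5 * max(0.0, margin - d) ** 2

for d in [0.0, 0.5, 1.0, 1.5]:
    print(f"d = {d:.1f} -> loss = {neg_pair_loss(d):.3f}")
# d = 0.0 -> loss = 0.500   (strong push apart)
# d = 0.5 -> loss = 0.125   (still pushed, weaker gradient)
# d = 1.0 -> loss = 0.000   (at the margin: done)
# d = 1.5 -> loss = 0.000   (beyond the margin: no gradient at all)
```

The gradient vanishes exactly at the margin, which is what stops the model from scattering negatives indefinitely.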

The limitation compared to modern approaches is that each pair provides weak signal. A positive pair says “these two should be close” but gives no information about where they should be relative to everything else. InfoNCE, by contrast, positions the anchor relative to many negatives simultaneously. This is why pairwise contrastive loss requires more training iterations and careful pair selection to achieve comparable representation quality.

Standard form (Hadsell, Chopra & LeCun, 2006):

\mathcal{L} = \frac{1}{2}\bigl[y \cdot d^2 + (1 - y) \cdot \max(0, m - d)^2\bigr]

where:

  • d = \|f(x_1) - f(x_2)\|_2 is the Euclidean distance between embeddings
  • y = 1 if same class (positive pair), y = 0 if different class (negative pair)
  • m is the margin (typically 1.0–2.0)

Breakdown:

  • Same-class (y = 1): loss = d^2 — minimise distance, always active
  • Different-class (y = 0): loss = \max(0, m - d)^2 — only penalise if distance < m

With cosine similarity (alternative formulation):

\mathcal{L} = \frac{1}{2}\bigl[y \cdot (1 - s)^2 + (1 - y) \cdot \max(0, s - \epsilon)^2\bigr]

where s = \cos(f(x_1), f(x_2)) and \epsilon is the cosine margin.

import torch
import torch.nn.functional as F

# ── Pairwise contrastive loss ────────────────────────────────────
def contrastive_loss(emb1, emb2, labels, margin=1.0):
    """
    emb1, emb2: (B, D) embeddings of the two items in each pair
    labels: (B,) float — 1.0 for same class, 0.0 for different
    margin: float — minimum distance for negative pairs
    """
    dist = F.pairwise_distance(emb1, emb2, p=2)             # (B,) Euclidean distance
    pos_loss = labels * dist.pow(2)                         # (B,) pull same together
    neg_loss = (1 - labels) * F.relu(margin - dist).pow(2)  # (B,) push diff apart
    return (pos_loss + neg_loss).mean() * 0.5
# ── Using PyTorch built-in ───────────────────────────────────────
# PyTorch's CosineEmbeddingLoss uses cosine similarity with labels in {-1, +1}:
loss_fn = torch.nn.CosineEmbeddingLoss(margin=0.5)
# labels: +1 for same, -1 for different (NOT 0/1)
loss = loss_fn(emb1, emb2, labels_pm1) # scalar
# WARNING: CosineEmbeddingLoss uses {-1, +1} labels, while the
# Euclidean version above uses {0, 1}. Mixing these up silently
# produces garbage gradients.
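To guard against that mix-up, {0, 1} labels (as used by the Euclidean `contrastive_loss` above) can be converted in one line; a minimal sketch with a toy label tensor:

```python
import torch

labels = torch.tensor([1.0, 0.0, 1.0])   # 1 = same, 0 = different
labels_pm1 = 2 * labels - 1              # 1 -> +1, 0 -> -1, for CosineEmbeddingLoss
```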
# ── NumPy reference implementation ───────────────────────────────
import numpy as np

def contrastive_loss(emb1, emb2, labels, margin=1.0):
    """
    Pairwise contrastive loss (Euclidean).
    emb1: (B, D) embeddings
    emb2: (B, D) embeddings
    labels: (B,) 1.0 = same class, 0.0 = different class
    margin: minimum distance for negative pairs
    """
    # Euclidean distance between paired embeddings
    diff = emb1 - emb2                                   # (B, D)
    dist = np.sqrt((diff ** 2).sum(axis=1) + 1e-8)       # (B,) add eps for grad stability
    # Positive pairs: minimise squared distance
    pos_loss = labels * dist ** 2                        # (B,)
    # Negative pairs: penalise only if closer than margin
    neg_loss = (1 - labels) * np.maximum(0, margin - dist) ** 2  # (B,)
    return ((pos_loss + neg_loss) * 0.5).mean()

def cosine_contrastive_loss(emb1, emb2, labels, margin=0.5):
    """
    Pairwise contrastive loss using cosine similarity.
    labels: (B,) 1.0 = same, 0.0 = different
    """
    # Cosine similarity
    norm1 = emb1 / (np.linalg.norm(emb1, axis=1, keepdims=True) + 1e-8)
    norm2 = emb2 / (np.linalg.norm(emb2, axis=1, keepdims=True) + 1e-8)
    sim = (norm1 * norm2).sum(axis=1)                    # (B,)
    pos_loss = labels * (1 - sim) ** 2                   # (B,)
    neg_loss = (1 - labels) * np.maximum(0, sim - margin) ** 2
    return ((pos_loss + neg_loss) * 0.5).mean()
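A quick sanity check of the Euclidean version (the same formulation is repeated inline so the snippet runs on its own): an identical positive pair gives a loss of essentially zero, and a negative pair already farther apart than the margin gives exactly zero.

```python
import numpy as np

def contrastive_loss(emb1, emb2, labels, margin=1.0):
    # Same Euclidean formulation as above, repeated for self-containment
    dist = np.sqrt(((emb1 - emb2) ** 2).sum(axis=1) + 1e-8)
    pos = labels * dist ** 2
    neg = (1 - labels) * np.maximum(0, margin - dist) ** 2
    return ((pos + neg) * 0.5).mean()

a = np.ones((1, 4))
identical = contrastive_loss(a, a, np.array([1.0]))          # ~0 (only the eps term)
far_negative = contrastive_loss(a, a + 10.0, np.array([0.0]))  # exactly 0.0
```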
Applications:

  • Siamese networks (signature/face verification): the original application. Two networks with shared weights compare pairs of inputs
  • Sentence similarity (Sentence-BERT): learn sentence embeddings where paraphrases are close and unrelated sentences are far
  • One-shot / few-shot learning: compare a query to support examples using learned distance
  • Change detection (satellite imagery): determine if two images of the same location show changes
  • Duplicate detection: find near-duplicate images, documents, or products by embedding distance
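A minimal Siamese-style training step might look like the following sketch: the encoder is a toy two-layer network (its sizes are illustrative, not from the source), and the same module embeds both items of each pair, which is what "shared weights" means in practice.

```python
import torch
import torch.nn.functional as F

# Toy encoder standing in for a real Siamese branch; the SAME module
# embeds both items of each pair (shared weights).
encoder = torch.nn.Sequential(
    torch.nn.Linear(16, 32), torch.nn.ReLU(), torch.nn.Linear(32, 8)
)
opt = torch.optim.SGD(encoder.parameters(), lr=0.1)

def contrastive_loss(emb1, emb2, labels, margin=1.0):
    dist = F.pairwise_distance(emb1, emb2, p=2)
    return 0.5 * (labels * dist.pow(2)
                  + (1 - labels) * F.relu(margin - dist).pow(2)).mean()

x1, x2 = torch.randn(4, 16), torch.randn(4, 16)  # a batch of 4 input pairs
labels = torch.tensor([1.0, 0.0, 1.0, 0.0])      # same / different per pair

loss = contrastive_loss(encoder(x1), encoder(x2), labels)
opt.zero_grad()
loss.backward()
opt.step()
```

Because both branches share parameters, gradients from the two embeddings accumulate into the same weights in a single backward pass.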
Alternatives:

  • InfoNCE — when you have large batches and want strong representations. Uses all batch negatives simultaneously — much more efficient. The modern default
  • Triplet loss — when you have anchor-positive-negative triplets. Slightly more expressive than pairs — considers relative distances. Still less efficient than InfoNCE
  • ArcFace / CosFace — for closed-set recognition with class labels. Angular margin directly on classification logits. Better when class labels are available at training
  • BYOL / SimSiam — when you want to avoid needing negative pairs entirely. Only uses positive pairs with an asymmetric architecture. Avoids margin tuning and negative mining
  • Cosine embedding loss — when working with normalised embeddings. Same principle but in cosine space. Built into PyTorch as nn.CosineEmbeddingLoss

Pairwise contrastive loss was introduced by Hadsell, Chopra, and LeCun (2006) in “Dimensionality Reduction by Learning an Invariant Mapping,” where they used Siamese networks to learn a mapping that preserved neighbourhood relationships. This was one of the first demonstrations that a neural network could learn useful general-purpose embeddings (as opposed to task-specific classifiers).

The approach was the dominant paradigm for metric learning and verification tasks (face verification, signature verification) through the early 2010s. Triplet loss (FaceNet, 2015) extended the idea from pairs to triplets, providing richer learning signal. InfoNCE (CPC, 2018) generalised further to N-way comparisons, effectively making pairwise contrastive loss obsolete for representation learning. The pairwise formulation remains useful in domains where the data naturally comes as labelled pairs (e.g. paraphrase detection, duplicate finding) or where simplicity is valued over representation quality.