Contrastive Loss

The original pairwise contrastive loss that learns embeddings from same/different pairs. Given two inputs, the loss pulls same-class pairs together and pushes different-class pairs apart beyond a margin. Introduced before triplet loss and InfoNCE — simpler but less powerful. Not to be confused with InfoNCE, which is often loosely called “contrastive loss” in modern papers.

You have two items and a label: “same” or “different.” For same-class pairs, the loss is simply the squared distance between their embeddings — pull them as close as possible. For different-class pairs, the loss activates only when they are closer than a margin m: push them apart until they are at least m apart, then stop.

The margin is essential. Without it, the model would try to push all negative pairs infinitely far apart, which conflicts with the goal of creating a compact, useful embedding space. The margin says “different-class items just need to be distinguishable, not maximally separated.” This creates a structured space where same-class items cluster tightly and different clusters maintain at least margin m of separation.
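To make the margin's clipping behaviour concrete, here is a small sketch evaluating just the negative-pair term of the standard loss (defined below) at a few distances, with margin 1.0:

```python
def neg_pair_loss(d, margin=1.0):
    """Negative-pair term of the contrastive loss: max(0, m - d)^2 / 2."""
    return 0.5 * max(0.0, margin - d) ** 2

for d in [0.0, 0.5, 1.0, 1.5]:
    print(f"d = {d:.1f} -> loss = {neg_pair_loss(d):.3f}")
# d = 0.0 -> loss = 0.500   (strong push apart)
# d = 0.5 -> loss = 0.125   (still pushed, weaker gradient)
# d = 1.0 -> loss = 0.000   (at the margin: done)
# d = 1.5 -> loss = 0.000   (beyond the margin: no gradient at all)
```

The gradient vanishes exactly at the margin, which is what stops the model from scattering negatives indefinitely.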

The limitation compared to modern approaches is that each pair provides weak signal. A positive pair says “these two should be close” but gives no information about where they should be relative to everything else. InfoNCE, by contrast, positions the anchor relative to many negatives simultaneously. This is why pairwise contrastive loss requires more training iterations and careful pair selection to achieve comparable representation quality.

Standard form (Hadsell, Chopra & LeCun, 2006):

\mathcal{L} = \frac{1}{2}\bigl[y \cdot d^2 + (1 - y) \cdot \max(0, m - d)^2\bigr]

where:

  • d = \|f(x_1) - f(x_2)\|_2 is the Euclidean distance between embeddings
  • y = 1 if same class (positive pair), y = 0 if different class (negative pair)
  • m is the margin (typically 1.0–2.0)

Breakdown:

  • Same-class (y = 1): loss = d^2 — minimise distance, always active
  • Different-class (y = 0): loss = \max(0, m - d)^2 — only penalise if distance < m

With cosine similarity (alternative formulation):

\mathcal{L} = \frac{1}{2}\bigl[y \cdot (1 - s)^2 + (1 - y) \cdot \max(0, s - \epsilon)^2\bigr]

where s = \cos(f(x_1), f(x_2)) and \epsilon is the cosine margin.

import torch
import torch.nn.functional as F

# ── Pairwise contrastive loss ────────────────────────────────────
def contrastive_loss(emb1, emb2, labels, margin=1.0):
    """
    emb1, emb2: (B, D) embeddings of the two items in each pair
    labels: (B,) float — 1.0 for same class, 0.0 for different
    margin: float — minimum distance for negative pairs
    """
    dist = F.pairwise_distance(emb1, emb2, p=2)             # (B,) Euclidean distance
    pos_loss = labels * dist.pow(2)                         # (B,) pull same together
    neg_loss = (1 - labels) * F.relu(margin - dist).pow(2)  # (B,) push diff apart
    return (pos_loss + neg_loss).mean() * 0.5
# ── Using PyTorch built-in ───────────────────────────────────────
# PyTorch's CosineEmbeddingLoss uses cosine similarity with labels in {-1, +1}:
loss_fn = torch.nn.CosineEmbeddingLoss(margin=0.5)
# labels: +1 for same, -1 for different (NOT 0/1)
loss = loss_fn(emb1, emb2, labels_pm1) # scalar
# WARNING: CosineEmbeddingLoss uses {-1, +1} labels, while the
# Euclidean version above uses {0, 1}. Mixing these up silently
# produces garbage gradients.
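To guard against that mix-up, {0, 1} labels (as used by the Euclidean `contrastive_loss` above) can be converted in one line; a minimal sketch with a toy label tensor:

```python
import torch

labels = torch.tensor([1.0, 0.0, 1.0])   # 1 = same, 0 = different
labels_pm1 = 2 * labels - 1              # 1 -> +1, 0 -> -1, for CosineEmbeddingLoss
```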
# ── NumPy reference implementation ───────────────────────────────
import numpy as np

def contrastive_loss(emb1, emb2, labels, margin=1.0):
    """
    Pairwise contrastive loss (Euclidean).
    emb1: (B, D) embeddings
    emb2: (B, D) embeddings
    labels: (B,) 1.0 = same class, 0.0 = different class
    margin: minimum distance for negative pairs
    """
    # Euclidean distance between paired embeddings
    diff = emb1 - emb2                                   # (B, D)
    dist = np.sqrt((diff ** 2).sum(axis=1) + 1e-8)       # (B,) add eps for grad stability
    # Positive pairs: minimise squared distance
    pos_loss = labels * dist ** 2                        # (B,)
    # Negative pairs: penalise only if closer than margin
    neg_loss = (1 - labels) * np.maximum(0, margin - dist) ** 2  # (B,)
    return ((pos_loss + neg_loss) * 0.5).mean()

def cosine_contrastive_loss(emb1, emb2, labels, margin=0.5):
    """
    Pairwise contrastive loss using cosine similarity.
    labels: (B,) 1.0 = same, 0.0 = different
    """
    # Cosine similarity
    norm1 = emb1 / (np.linalg.norm(emb1, axis=1, keepdims=True) + 1e-8)
    norm2 = emb2 / (np.linalg.norm(emb2, axis=1, keepdims=True) + 1e-8)
    sim = (norm1 * norm2).sum(axis=1)                    # (B,)
    pos_loss = labels * (1 - sim) ** 2                   # (B,)
    neg_loss = (1 - labels) * np.maximum(0, sim - margin) ** 2
    return ((pos_loss + neg_loss) * 0.5).mean()
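A quick sanity check of the Euclidean version (the same formulation is repeated inline so the snippet runs on its own): an identical positive pair gives a loss of essentially zero, and a negative pair already farther apart than the margin gives exactly zero.

```python
import numpy as np

def contrastive_loss(emb1, emb2, labels, margin=1.0):
    # Same Euclidean formulation as above, repeated for self-containment
    dist = np.sqrt(((emb1 - emb2) ** 2).sum(axis=1) + 1e-8)
    pos = labels * dist ** 2
    neg = (1 - labels) * np.maximum(0, margin - dist) ** 2
    return ((pos + neg) * 0.5).mean()

a = np.ones((1, 4))
identical = contrastive_loss(a, a, np.array([1.0]))          # ~0 (only the eps term)
far_negative = contrastive_loss(a, a + 10.0, np.array([0.0]))  # exactly 0.0
```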
Applications:

  • Siamese networks (signature/face verification): the original application. Two networks with shared weights compare pairs of inputs
  • Sentence similarity (Sentence-BERT): learn sentence embeddings where paraphrases are close and unrelated sentences are far
  • One-shot / few-shot learning: compare a query to support examples using learned distance
  • Change detection (satellite imagery): determine if two images of the same location show changes
  • Duplicate detection: find near-duplicate images, documents, or products by embedding distance
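A minimal Siamese-style training step might look like the following sketch: the encoder is a toy two-layer network (its sizes are illustrative, not from the source), and the same module embeds both items of each pair, which is what "shared weights" means in practice.

```python
import torch
import torch.nn.functional as F

# Toy encoder standing in for a real Siamese branch; the SAME module
# embeds both items of each pair (shared weights).
encoder = torch.nn.Sequential(
    torch.nn.Linear(16, 32), torch.nn.ReLU(), torch.nn.Linear(32, 8)
)
opt = torch.optim.SGD(encoder.parameters(), lr=0.1)

def contrastive_loss(emb1, emb2, labels, margin=1.0):
    dist = F.pairwise_distance(emb1, emb2, p=2)
    return 0.5 * (labels * dist.pow(2)
                  + (1 - labels) * F.relu(margin - dist).pow(2)).mean()

x1, x2 = torch.randn(4, 16), torch.randn(4, 16)  # a batch of 4 input pairs
labels = torch.tensor([1.0, 0.0, 1.0, 0.0])      # same / different per pair

loss = contrastive_loss(encoder(x1), encoder(x2), labels)
opt.zero_grad()
loss.backward()
opt.step()
```

Because both branches share parameters, gradients from the two embeddings accumulate into the same weights in a single backward pass.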
Alternatives:

  • InfoNCE — when you have large batches and want strong representations. Uses all batch negatives simultaneously — much more efficient. The modern default
  • Triplet loss — when you have anchor-positive-negative triplets. Slightly more expressive than pairs — considers relative distances. Still less efficient than InfoNCE
  • ArcFace / CosFace — for closed-set recognition with class labels. Angular margin directly on classification logits. Better when class labels are available at training
  • BYOL / SimSiam — when you want to avoid needing negative pairs entirely. Only uses positive pairs with an asymmetric architecture. Avoids margin tuning and negative mining
  • Cosine embedding loss — when working with normalised embeddings. Same principle but in cosine space. Built into PyTorch as nn.CosineEmbeddingLoss

Pairwise contrastive loss was introduced by Hadsell, Chopra, and LeCun (2006) in “Dimensionality Reduction by Learning an Invariant Mapping,” where they used Siamese networks to learn a mapping that preserved neighbourhood relationships. This was one of the first demonstrations that a neural network could learn useful general-purpose embeddings (as opposed to task-specific classifiers).

The approach was the dominant paradigm for metric learning and verification tasks (face verification, signature verification) through the early 2010s. Triplet loss (FaceNet, 2015) extended the idea from pairs to triplets, providing richer learning signal. InfoNCE (CPC, 2018) generalised further to N-way comparisons, effectively making pairwise contrastive loss obsolete for representation learning. The pairwise formulation remains useful in domains where the data naturally comes as labelled pairs (e.g. paraphrase detection, duplicate finding) or where simplicity is valued over representation quality.