Triplet Loss

Learns embeddings by enforcing that an anchor is closer to a positive (same class) than to a negative (different class) by at least a fixed margin. The foundational loss for metric learning and face recognition (FaceNet). Operates on distances in embedding space, not class scores.

Triplet loss works on trios: an anchor, a positive, and a negative. It says: “the distance from anchor to positive should be smaller than the distance from anchor to negative, by at least margin m.” If it already is, the loss is zero and no gradient flows. If not, the model is pushed to pull the positive closer and push the negative further away.

The critical challenge is triplet selection (mining). Most triplets in a large dataset are “easy” — the negative is already far away, contributing zero loss and zero learning. Training only makes progress on “hard” or “semi-hard” triplets. Hard negatives are closer to the anchor than the positive (the model is wrong). Semi-hard negatives are farther than the positive but within the margin (the model is right but not confident enough). FaceNet showed that using semi-hard triplets produces the most stable training; hard triplets alone can cause collapse.

This is the fundamental limitation of triplet loss compared to InfoNCE: each update uses only one negative. InfoNCE uses all other samples in the batch as negatives simultaneously, extracting much more signal per forward pass. Triplet loss requires careful mining to compensate.

Standard form (using squared Euclidean distance):

\mathcal{L} = \max\bigl(0,\; \|f(a) - f(p)\|^2 - \|f(a) - f(n)\|^2 + m\bigr)

where f(\cdot) is the embedding function, a is the anchor, p is the positive, n is the negative, and m is the margin (typically 0.2-1.0).

Equivalently: \mathcal{L} = \max(0,\; d_{ap} - d_{an} + m), where d denotes squared distance.
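Plugging in a few illustrative numbers (chosen arbitrarily, not from any paper) shows how the hinge behaves:

```python
def hinge(d_ap, d_an, m):
    """Triplet hinge on (squared) distances: max(0, d_ap - d_an + m)."""
    return max(0.0, d_ap - d_an + m)

# Easy: negative already beyond the margin -> loss 0, no gradient
hinge(d_ap=0.5, d_an=1.0, m=0.2)  # 0.0
# Semi-hard: correct ordering but inside the margin -> small loss
hinge(d_ap=0.5, d_an=0.6, m=0.2)  # ≈ 0.1
# Hard: negative closer than positive -> larger loss
hinge(d_ap=0.5, d_an=0.3, m=0.2)  # ≈ 0.4
```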

With cosine similarity (common for normalised embeddings):

\mathcal{L} = \max\bigl(0,\; \mathrm{sim}(a, n) - \mathrm{sim}(a, p) + m\bigr)

Note the sign flip: we want similarity to be higher for the positive.
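A minimal NumPy sketch of the cosine form (the 0.3 margin is just an illustrative default; embeddings are assumed to be rows of a (B, D) array):

```python
import numpy as np

def cosine_triplet_loss(a, p, n, margin=0.3):
    """Triplet loss on cosine similarity: max(0, sim(a,n) - sim(a,p) + m).
    a, p, n: (B, D) anchor/positive/negative embeddings."""
    # Normalise rows so the dot product equals cosine similarity
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    p = p / np.linalg.norm(p, axis=1, keepdims=True)
    n = n / np.linalg.norm(n, axis=1, keepdims=True)
    sim_ap = np.sum(a * p, axis=1)  # (B,)
    sim_an = np.sum(a * n, axis=1)  # (B,)
    return np.maximum(0.0, sim_an - sim_ap + margin).mean()
```

The loss reaches zero only once the positive's similarity exceeds the negative's by the full margin.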

Triplet categories (for mining):

  • Easy: d_{an} > d_{ap} + m — loss is zero, no gradient
  • Semi-hard: d_{ap} < d_{an} < d_{ap} + m — correct ordering but within the margin
  • Hard: d_{an} < d_{ap} — the model is wrong, the negative is closer than the positive
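These three categories can be written directly as a predicate on the two distances. A small illustrative helper (not part of any library API):

```python
def triplet_category(d_ap, d_an, margin):
    """Classify a triplet by its anchor-positive and anchor-negative
    (squared) distances, following the definitions above."""
    if d_an < d_ap:
        return "hard"       # negative closer than positive: model is wrong
    if d_an < d_ap + margin:
        return "semi-hard"  # correct ordering, but inside the margin
    return "easy"           # beyond the margin: zero loss, no gradient
```

Only the hard and semi-hard cases produce a nonzero loss; mining amounts to filtering each batch down to these.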
import torch
import torch.nn.functional as F

# ── Standard triplet margin loss ─────────────────────────────────
anchor_emb = model(anchors)    # (B, D)
pos_emb = model(positives)     # (B, D)
neg_emb = model(negatives)     # (B, D)

loss = F.triplet_margin_loss(
    anchor_emb, pos_emb, neg_emb,
    margin=1.0, p=2,  # p=2 → Euclidean (not squared) distance
)

# ── With cosine distance (normalised embeddings) ─────────────────
loss = F.triplet_margin_with_distance_loss(
    anchor_emb, pos_emb, neg_emb,
    distance_function=lambda a, b: 1 - F.cosine_similarity(a, b),
    margin=0.3,
)

# ── Online semi-hard mining within a batch ───────────────────────
# Compute all pairwise distances, then select valid triplets.
# This is the standard approach — mine from the batch, don't
# pre-construct triplets.
emb = model(images)                # (B, D)
dist = torch.cdist(emb, emb, p=2)  # (B, B)
# Then select (anchor, positive, negative) indices where:
#   same_label[a, p] and not same_label[a, n] and
#   dist[a, p] < dist[a, n] < dist[a, p] + margin
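That selection can be sketched vectorised as a boolean (B, B, B) mask over (anchor, positive, negative) index triples (O(B³) memory, so only practical for modest batch sizes); `labels` is assumed to be a (B,) tensor of integer class ids:

```python
import torch

def semi_hard_mask(dist, labels, margin):
    """Boolean (B, B, B) mask: entry [a, p, n] is True when (a, p, n)
    is a valid semi-hard triplet under the conditions above."""
    same = labels[:, None] == labels[None, :]               # (B, B)
    pos = same & ~torch.eye(len(labels), dtype=torch.bool)  # exclude a == p
    neg = ~same
    d_ap = dist[:, :, None]  # dist[a, p], broadcast over n
    d_an = dist[:, None, :]  # dist[a, n], broadcast over p
    return (pos[:, :, None] & neg[:, None, :]
            & (d_an > d_ap) & (d_an < d_ap + margin))
```

`mask.nonzero()` then yields the (anchor, positive, negative) index triples directly.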
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """
    Triplet margin loss with squared Euclidean distance.
    Like F.triplet_margin_loss, but squares the distances
    (matching the equation above).
    anchor:   (B, D) embeddings
    positive: (B, D) embeddings (same class as anchor)
    negative: (B, D) embeddings (different class from anchor)
    """
    d_ap = np.sum((anchor - positive) ** 2, axis=1)  # (B,) squared dist
    d_an = np.sum((anchor - negative) ** 2, axis=1)  # (B,)
    losses = np.maximum(0, d_ap - d_an + margin)     # (B,)
    return losses.mean()

def mine_semi_hard(embeddings, labels, margin=1.0):
    """
    Find semi-hard triplets within a batch.
    embeddings: (B, D)
    labels:     (B,) integer class labels
    Returns: list of (anchor, positive, negative) index triples
    """
    B = embeddings.shape[0]
    dist = np.sum((embeddings[:, None] - embeddings[None, :]) ** 2, axis=2)  # (B, B)
    triplets = []
    for i in range(B):
        pos_mask = labels == labels[i]
        neg_mask = labels != labels[i]
        pos_mask[i] = False  # exclude the anchor itself
        for p in np.where(pos_mask)[0]:
            d_ap = dist[i, p]
            # Semi-hard: d_ap < d_an < d_ap + margin
            valid = neg_mask & (dist[i] > d_ap) & (dist[i] < d_ap + margin)
            if valid.any():
                n = np.where(valid)[0][0]  # take the first valid negative
                triplets.append((i, p, n))
    return triplets
  • Face recognition (FaceNet): the paper that popularised triplet loss for deep learning. Learns face embeddings where same-person faces cluster together
  • Person re-identification: matching pedestrians across camera views in surveillance
  • Image retrieval (Google Landmark): find visually similar images by embedding distance
  • Few-shot learning (Siamese networks): learn a similarity function from very few examples per class
  • Speaker verification: determine if two audio clips are from the same speaker
| Alternative | When to use | Tradeoff |
| --- | --- | --- |
| InfoNCE | Large batches available, many negatives per anchor | Much more efficient — uses all batch negatives simultaneously. The modern default for contrastive learning |
| Contrastive loss (pairwise) | Only have pairs, not triplets | Simpler — just same/different pairs with a margin. Less expressive than triplets |
| ArcFace / CosFace | Closed-set recognition (known classes at train time) | Adds angular margin to classification loss. Better than triplet for closed-set; doesn’t generalise to open-set as naturally |
| Proxy-NCA / ProxyAnchor | Want efficiency without mining | Learns proxy embeddings per class, compares to proxies instead of mining triplets. Much faster |
| SupCon | Have labels, want contrastive learning | Supervised contrastive loss leverages all positives and negatives per class. Strictly better than triplet when labels are available |

Triplet loss was introduced for metric learning in the mid-2000s (Weinberger & Saul, 2006, “Distance Metric Learning for Large Margin Nearest Neighbor Classification”) but became famous through FaceNet (Schroff et al., 2015), which used it to learn face embeddings achieving near-human verification accuracy. The key practical innovation in FaceNet was online semi-hard negative mining within large batches, which made triplet loss trainable at scale.

The approach has been largely superseded by InfoNCE-based methods (SimCLR, CLIP) for self-supervised learning and by angular margin losses (ArcFace, CosFace) for face recognition. The fundamental issue is efficiency: each triplet provides one bit of information (is the negative farther than the positive?), while InfoNCE provides \log_2(N) bits per anchor by classifying among N candidates. Triplet loss remains relevant in settings where the data naturally comes as triplets or where fine-grained ranking within a margin matters more than coarse discrimination.