Triplet Loss

Learns embeddings by enforcing that an anchor is closer to a positive (same class) than to a negative (different class) by at least a fixed margin. The foundational loss for metric learning and face recognition (FaceNet). Operates on distances in embedding space, not class scores.

Triplet loss works on trios: an anchor, a positive, and a negative. It says: “the distance from anchor to positive should be smaller than the distance from anchor to negative, by at least margin m.” If it already is, the loss is zero and no gradient flows. If not, the model is pushed to pull the positive closer and push the negative further away.

The critical challenge is triplet selection (mining). Most triplets in a large dataset are “easy” — the negative is already far away, contributing zero loss and zero learning. Training only makes progress on “hard” or “semi-hard” triplets. Hard negatives are closer to the anchor than the positive (the model is wrong). Semi-hard negatives are farther than the positive but within the margin (the model is right but not confident enough). FaceNet showed that using semi-hard triplets produces the most stable training; hard triplets alone can cause collapse.

This is the fundamental limitation of triplet loss compared to InfoNCE: each update uses only one negative. InfoNCE uses all other samples in the batch as negatives simultaneously, extracting much more signal per forward pass. Triplet loss requires careful mining to compensate.

Standard form (using squared Euclidean distance):

\mathcal{L} = \max\bigl(0,\; \|f(a) - f(p)\|^2 - \|f(a) - f(n)\|^2 + m\bigr)

where f(\cdot) is the embedding function, a is the anchor, p is the positive, n is the negative, and m is the margin (typically 0.2-1.0).

Equivalently: \mathcal{L} = \max(0,\; d_{ap} - d_{an} + m), where d denotes squared distance.
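Plugging in a few illustrative numbers (chosen arbitrarily, not from any paper) shows how the hinge behaves:

```python
def hinge(d_ap, d_an, m):
    """Triplet hinge on (squared) distances: max(0, d_ap - d_an + m)."""
    return max(0.0, d_ap - d_an + m)

# Easy: negative already beyond the margin -> loss 0, no gradient
hinge(d_ap=0.5, d_an=1.0, m=0.2)  # 0.0
# Semi-hard: correct ordering but inside the margin -> small loss
hinge(d_ap=0.5, d_an=0.6, m=0.2)  # ≈ 0.1
# Hard: negative closer than positive -> larger loss
hinge(d_ap=0.5, d_an=0.3, m=0.2)  # ≈ 0.4
```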

With cosine similarity (common for normalised embeddings):

\mathcal{L} = \max\bigl(0,\; \mathrm{sim}(a, n) - \mathrm{sim}(a, p) + m\bigr)

Note the sign flip: we want similarity to be higher for the positive.
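A minimal NumPy sketch of the cosine form (the 0.3 margin is just an illustrative default; embeddings are assumed to be rows of a (B, D) array):

```python
import numpy as np

def cosine_triplet_loss(a, p, n, margin=0.3):
    """Triplet loss on cosine similarity: max(0, sim(a,n) - sim(a,p) + m).
    a, p, n: (B, D) anchor/positive/negative embeddings."""
    # Normalise rows so the dot product equals cosine similarity
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    p = p / np.linalg.norm(p, axis=1, keepdims=True)
    n = n / np.linalg.norm(n, axis=1, keepdims=True)
    sim_ap = np.sum(a * p, axis=1)  # (B,)
    sim_an = np.sum(a * n, axis=1)  # (B,)
    return np.maximum(0.0, sim_an - sim_ap + margin).mean()
```

The loss reaches zero only once the positive's similarity exceeds the negative's by the full margin.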

Triplet categories (for mining):

  • Easy: d_{an} > d_{ap} + m — loss is zero, no gradient
  • Semi-hard: d_{ap} < d_{an} < d_{ap} + m — correct ordering but within the margin
  • Hard: d_{an} < d_{ap} — the model is wrong, the negative is closer than the positive
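These three categories can be written directly as a predicate on the two distances. A small illustrative helper (not part of any library API):

```python
def triplet_category(d_ap, d_an, margin):
    """Classify a triplet by its anchor-positive and anchor-negative
    (squared) distances, following the definitions above."""
    if d_an < d_ap:
        return "hard"       # negative closer than positive: model is wrong
    if d_an < d_ap + margin:
        return "semi-hard"  # correct ordering, but inside the margin
    return "easy"           # beyond the margin: zero loss, no gradient
```

Only the hard and semi-hard cases produce a nonzero loss; mining amounts to filtering each batch down to these.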
import torch
import torch.nn.functional as F

# ── Standard triplet margin loss ─────────────────────────────────
anchor_emb = model(anchors)    # (B, D)
pos_emb = model(positives)     # (B, D)
neg_emb = model(negatives)     # (B, D)

loss = F.triplet_margin_loss(
    anchor_emb, pos_emb, neg_emb,
    margin=1.0, p=2,  # p=2 → Euclidean (not squared) distance
)

# ── With cosine distance (normalised embeddings) ─────────────────
loss = F.triplet_margin_with_distance_loss(
    anchor_emb, pos_emb, neg_emb,
    distance_function=lambda a, b: 1 - F.cosine_similarity(a, b),
    margin=0.3,
)

# ── Online semi-hard mining within a batch ───────────────────────
# Compute all pairwise distances, then select valid triplets.
# This is the standard approach — mine from the batch, don't
# pre-construct triplets.
emb = model(images)                # (B, D)
dist = torch.cdist(emb, emb, p=2)  # (B, B)
# Then select (anchor, positive, negative) indices where:
#   same_label[a, p] and not same_label[a, n] and
#   dist[a, p] < dist[a, n] < dist[a, p] + margin
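That selection can be sketched vectorised as a boolean (B, B, B) mask over (anchor, positive, negative) index triples (O(B³) memory, so only practical for modest batch sizes); `labels` is assumed to be a (B,) tensor of integer class ids:

```python
import torch

def semi_hard_mask(dist, labels, margin):
    """Boolean (B, B, B) mask: entry [a, p, n] is True when (a, p, n)
    is a valid semi-hard triplet under the conditions above."""
    same = labels[:, None] == labels[None, :]               # (B, B)
    pos = same & ~torch.eye(len(labels), dtype=torch.bool)  # exclude a == p
    neg = ~same
    d_ap = dist[:, :, None]  # dist[a, p], broadcast over n
    d_an = dist[:, None, :]  # dist[a, n], broadcast over p
    return (pos[:, :, None] & neg[:, None, :]
            & (d_an > d_ap) & (d_an < d_ap + margin))
```

`mask.nonzero()` then yields the (anchor, positive, negative) index triples directly.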
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """
    Triplet margin loss with squared Euclidean distance.
    Like F.triplet_margin_loss, but squares the distances
    (matching the equation above).
    anchor:   (B, D) embeddings
    positive: (B, D) embeddings (same class as anchor)
    negative: (B, D) embeddings (different class from anchor)
    """
    d_ap = np.sum((anchor - positive) ** 2, axis=1)  # (B,) squared dist
    d_an = np.sum((anchor - negative) ** 2, axis=1)  # (B,)
    losses = np.maximum(0, d_ap - d_an + margin)     # (B,)
    return losses.mean()

def mine_semi_hard(embeddings, labels, margin=1.0):
    """
    Find semi-hard triplets within a batch.
    embeddings: (B, D)
    labels:     (B,) integer class labels
    Returns: list of (anchor, positive, negative) index triples
    """
    B = embeddings.shape[0]
    dist = np.sum((embeddings[:, None] - embeddings[None, :]) ** 2, axis=2)  # (B, B)
    triplets = []
    for i in range(B):
        pos_mask = labels == labels[i]
        neg_mask = labels != labels[i]
        pos_mask[i] = False  # exclude the anchor itself
        for p in np.where(pos_mask)[0]:
            d_ap = dist[i, p]
            # Semi-hard: d_ap < d_an < d_ap + margin
            valid = neg_mask & (dist[i] > d_ap) & (dist[i] < d_ap + margin)
            if valid.any():
                n = np.where(valid)[0][0]  # take the first valid negative
                triplets.append((i, p, n))
    return triplets
  • Face recognition (FaceNet): the paper that popularised triplet loss for deep learning. Learns face embeddings where same-person faces cluster together
  • Person re-identification: matching pedestrians across camera views in surveillance
  • Image retrieval (Google Landmark): find visually similar images by embedding distance
  • Few-shot learning (Siamese networks): learn a similarity function from very few examples per class
  • Speaker verification: determine if two audio clips are from the same speaker
| Alternative | When to use | Tradeoff |
| --- | --- | --- |
| InfoNCE | Large batches available, many negatives per anchor | Much more efficient — uses all batch negatives simultaneously. The modern default for contrastive learning |
| Contrastive loss (pairwise) | Only have pairs, not triplets | Simpler — just same/different pairs with a margin. Less expressive than triplets |
| ArcFace / CosFace | Closed-set recognition (known classes at train time) | Adds angular margin to classification loss. Better than triplet for closed-set; doesn’t generalise to open-set as naturally |
| Proxy-NCA / ProxyAnchor | Want efficiency without mining | Learns proxy embeddings per class, compares to proxies instead of mining triplets. Much faster |
| SupCon | Have labels, want contrastive learning | Supervised contrastive loss leverages all positives and negatives per class. Strictly better than triplet when labels are available |

Triplet loss was introduced for metric learning in the mid-2000s (Weinberger & Saul, 2006, “Distance Metric Learning for Large Margin Nearest Neighbor Classification”) but became famous through FaceNet (Schroff et al., 2015), which used it to learn face embeddings achieving near-human verification accuracy. The key practical innovation in FaceNet was online semi-hard negative mining within large batches, which made triplet loss trainable at scale.

The approach has been largely superseded by InfoNCE-based methods (SimCLR, CLIP) for self-supervised learning and by angular margin losses (ArcFace, CosFace) for face recognition. The fundamental issue is efficiency: each triplet provides one bit of information (is the negative farther than the positive?), while InfoNCE provides \log_2(N) bits per anchor by classifying among N candidates. Triplet loss remains relevant in settings where the data naturally comes as triplets or where fine-grained ranking within a margin matters more than coarse discrimination.