
Cosine Similarity

Measures directional alignment between two vectors, ignoring their magnitudes. Two vectors pointing the same way have cosine similarity 1, orthogonal vectors have 0, and opposite vectors have -1. This is THE similarity measure for contrastive learning and embedding-based retrieval.

Imagine two arrows pinned at the origin. Cosine similarity measures the angle between them — not how long they are. A short arrow and a long arrow pointing in the same direction are “maximally similar.” This magnitude-invariance is crucial for learned embeddings, where the norm of a representation can vary for uninteresting reasons (layer scale, batch statistics) but the direction encodes semantic meaning.
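
A quick sanity check of this scale-invariance (a minimal sketch; the vectors are arbitrary examples):

import torch
import torch.nn.functional as F

a = torch.tensor([[1.0, 2.0, 3.0]])
b = 10 * a                                   # same direction, 10x the length
c = -a                                       # opposite direction
print(F.cosine_similarity(a, b, dim=1))      # tensor([1.]): magnitude is ignored
print(F.cosine_similarity(a, c, dim=1))      # tensor([-1.]): opposite vectors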

In contrastive learning (SimCLR, CLIP, MoCo), the model learns to push representations of similar things into the same direction and representations of dissimilar things into different directions. Cosine similarity is the natural metric because L2-normalising the embeddings onto the unit hypersphere forces the model to use direction, not magnitude, to encode information. This closes off a degenerate solution in which the model simply inflates embedding norms instead of learning discriminative directions.

The temperature parameter τ controls how sharp the similarity distribution is. With low τ, small differences in cosine similarity become large differences in the softmax, so the model must be very precise. With high τ, the distribution is smoother and more forgiving. Typical values: τ = 0.07 (SimCLR); CLIP uses a learnable τ initialised to 0.07.
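
A small sketch of the effect (the similarity values below are made up for illustration): the same three candidate similarities give a near one-hot distribution at τ = 0.07 but a much softer one at τ = 0.5.

import torch

sims = torch.tensor([0.9, 0.8, 0.1])          # cosine similarities to three candidates
print(torch.softmax(sims / 0.07, dim=0))      # ~[0.81, 0.19, 0.00]: sharp at low tau
print(torch.softmax(sims / 0.5, dim=0))       # ~[0.49, 0.41, 0.10]: smooth at high tau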

Cosine similarity:

\cos(\theta) = \frac{\mathbf{a} \cdot \mathbf{b}}{\|\mathbf{a}\| \|\mathbf{b}\|} = \frac{\sum_i a_i b_i}{\sqrt{\sum_i a_i^2} \cdot \sqrt{\sum_i b_i^2}}

With L2-normalised vectors (\hat{\mathbf{a}} = \mathbf{a}/\|\mathbf{a}\|):

\cos(\theta) = \hat{\mathbf{a}} \cdot \hat{\mathbf{b}}

Normalise once, then all similarities are just dot products — this is why normalisation is the standard approach in practice.
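
A minimal sketch of that equivalence (random vectors, arbitrary shapes):

import torch
import torch.nn.functional as F

a, b = torch.randn(4, 16), torch.randn(4, 16)
a_hat, b_hat = F.normalize(a, dim=1), F.normalize(b, dim=1)     # L2-normalise once
dots = (a_hat * b_hat).sum(dim=1)                               # plain dot products afterwards
print(torch.allclose(dots, F.cosine_similarity(a, b, dim=1), atol=1e-6))   # True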

Temperature-scaled (InfoNCE / contrastive loss):

\ell_{i,j} = -\log \frac{\exp(\text{sim}(z_i, z_j) / \tau)}{\sum_{k \neq i} \exp(\text{sim}(z_i, z_k) / \tau)}

The temperature τ scales the logits before the softmax. Lower τ sharpens the distribution.

Cosine distance (for use as a loss):

d = 1 - \cos(\theta) \in [0, 2]

import torch
import torch.nn.functional as F

B, D = 256, 128                                    # batch size and embedding dim (example values)

# ── Built-in cosine similarity ───────────────────────────────────
a = torch.randn(B, D)                              # (B, D)
b = torch.randn(B, D)                              # (B, D)
sim = F.cosine_similarity(a, b, dim=1)             # (B,) — per-pair similarity

# ── Pairwise similarity matrix (contrastive learning pattern) ────
features = torch.randn(B, D)                       # stand-in for raw encoder outputs
z = F.normalize(features, dim=1)                   # (B, D) — L2-normalise
sim_matrix = z @ z.T                               # (B, B) — all-pairs cosine sim
# sim_matrix[i, j] = cosine similarity between sample i and sample j

# ── Temperature-scaled (InfoNCE) ─────────────────────────────────
tau = 0.07
logits = sim_matrix / tau                          # (B, B) — scaled for softmax

# ── Cosine similarity loss (push two embeddings together) ────────
anchor, positive = torch.randn(B, D), torch.randn(B, D)    # stand-ins for paired embeddings
loss = 1 - F.cosine_similarity(anchor, positive, dim=1).mean()

# ── CLIP-style contrastive loss ──────────────────────────────────
# image_encoder / text_encoder are the two modality encoders; images / texts are a batch of B paired inputs
image_emb = F.normalize(image_encoder(images), dim=1)      # (B, D)
text_emb = F.normalize(text_encoder(texts), dim=1)         # (B, D)
logits = image_emb @ text_emb.T / tau                      # (B, B)
labels = torch.arange(B, device=logits.device)             # (B,) — diagonal is positive
loss = (F.cross_entropy(logits, labels)                    # image→text
        + F.cross_entropy(logits.T, labels)) / 2           # text→image

Warning: Always normalise before computing the similarity matrix. Without normalisation, embeddings with large norms dominate and the similarity is no longer purely directional.
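
A small sketch of the failure mode (the 100x scale factor is arbitrary): one large-norm embedding dominates the unnormalised dot products, while the normalised similarities stay bounded.

import torch
import torch.nn.functional as F

z = torch.randn(8, 32)
z[0] *= 100                                        # one embedding with a huge norm
raw = z @ z.T                                      # unnormalised dot products
z_hat = F.normalize(z, dim=1)
cos = z_hat @ z_hat.T                              # proper cosine similarities
print(raw[0].abs().mean(), raw[1:].abs().mean())   # row 0 dwarfs the rest because of z[0]'s norm
print(cos.abs().max())                             # bounded by 1 regardless of norms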

import numpy as np

def cosine_similarity(a, b):
    """
    Pairwise cosine similarity between two sets of vectors.
    a: (N, D), b: (M, D)
    Returns: (N, M) similarity matrix
    """
    a_norm = a / np.linalg.norm(a, axis=1, keepdims=True)   # (N, D)
    b_norm = b / np.linalg.norm(b, axis=1, keepdims=True)   # (M, D)
    return a_norm @ b_norm.T                                # (N, M)

def infonce_loss(features, tau=0.07):
    """
    InfoNCE contrastive loss. Assumes features[2i] and features[2i+1]
    are positive pairs (e.g., two augmented views of the same image).
    features: (2B, D)
    """
    B2, D = features.shape
    z = features / np.linalg.norm(features, axis=1, keepdims=True)   # (2B, D)
    sim = z @ z.T / tau                                              # (2B, 2B)
    # Mask out self-similarity (diagonal)
    np.fill_diagonal(sim, -1e9)
    # For each sample i, its positive is i+1 if even, i-1 if odd
    labels = np.array([i + 1 if i % 2 == 0 else i - 1 for i in range(B2)])
    # Softmax cross-entropy
    exp_sim = np.exp(sim - sim.max(axis=1, keepdims=True))            # (2B, 2B)
    log_probs = np.log(exp_sim / exp_sim.sum(axis=1, keepdims=True))  # (2B, 2B)
    return -log_probs[np.arange(B2), labels].mean()
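
A usage sketch for infonce_loss, assuming the interleaved pair layout from the docstring (the data here is synthetic: odd rows are noisy copies of the even rows):

rng = np.random.default_rng(0)
views = rng.normal(size=(2 * 128, 64))                          # 128 positive pairs, 64-dim
views[1::2] = views[0::2] + 0.1 * rng.normal(size=(128, 64))    # odd rows ≈ even rows (positives)
print(infonce_loss(views, tau=0.07))                            # near 0; ≈ log(255) ≈ 5.5 for fully random pairs
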
  • SimCLR: cosine similarity between augmented views, with temperature τ = 0.07, inside the InfoNCE loss
  • CLIP: cosine similarity between image and text embeddings; the training objective is a symmetric cross-entropy on the similarity matrix
  • MoCo: cosine similarity between query and key embeddings from online and momentum encoders
  • Sentence embeddings (Sentence-BERT): cosine similarity for semantic text search and retrieval
  • Nearest-neighbour search (FAISS, vector databases): L2-normalised embeddings + dot product = cosine similarity at scale (see the retrieval sketch after this list)
  • Recommendation systems: user and item embeddings compared via cosine similarity
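
A minimal retrieval sketch of that normalise-once, dot-product pattern in plain NumPy (no FAISS; the corpus and query are random stand-ins):

import numpy as np

corpus = np.random.randn(1000, 128)                                  # document embeddings
query = np.random.randn(128)                                         # query embedding

corpus_n = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)    # normalise once, offline
query_n = query / np.linalg.norm(query)

scores = corpus_n @ query_n                  # (1000,) cosine similarities via dot products
top5 = np.argsort(-scores)[:5]               # indices of the 5 nearest neighbours
print(top5, scores[top5])
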
| Alternative | When to use | Tradeoff |
| --- | --- | --- |
| Euclidean (L2) distance | When magnitude matters (pixel space, diffusion) | Sensitive to scale; doesn't work well for high-dimensional embeddings |
| Dot product | Already-normalised embeddings, attention mechanisms | Equivalent to cosine sim after normalisation; without it, biased toward large norms |
| Mahalanobis distance | Correlated features, anomaly detection | Accounts for covariance structure; requires estimating a covariance matrix |
| Manhattan (L1) distance | Sparse features, robust to outliers | Less sensitive to large differences in single dimensions |
| Learned distance (Siamese networks) | When the right metric is unknown | More flexible but requires training; cosine sim is a strong default |
| Negative L2 on normalised vectors | Mathematical equivalence | -\lVert\hat{a} - \hat{b}\rVert^2 = 2(\cos\theta - 1); same ranking as cosine sim (see the sketch below) |
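
A quick numerical check of that identity (random normalised vectors):

import torch
import torch.nn.functional as F

a_hat = F.normalize(torch.randn(5, 32), dim=1)
b_hat = F.normalize(torch.randn(5, 32), dim=1)
lhs = -((a_hat - b_hat) ** 2).sum(dim=1)             # negative squared L2 distance
rhs = 2 * ((a_hat * b_hat).sum(dim=1) - 1)           # 2(cos(theta) - 1)
print(torch.allclose(lhs, rhs, atol=1e-6))           # True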

Cosine similarity originated in information retrieval (Salton & McGill, 1983) for comparing TF-IDF document vectors, where magnitude (document length) was a nuisance factor. It became the default metric for word embeddings (Word2Vec, Mikolov et al., 2013) and was adopted wholesale by the contrastive learning revolution.

The combination of L2-normalisation + temperature scaling was formalised by SimCLR (Chen et al., 2020), which showed that both components are critical: normalisation prevents collapse and temperature controls the hardness of the contrastive task. CLIP (Radford et al., 2021) made the temperature learnable, finding that the optimal value varies during training.