# Cosine Similarity

Measures directional alignment between two vectors, ignoring their magnitudes. Two vectors pointing the same way have cosine similarity 1, orthogonal vectors have 0, and opposite vectors have -1. This is *the* distance metric for contrastive learning and embedding-based retrieval.
## Intuition

Imagine two arrows pinned at the origin. Cosine similarity measures the angle between them — not how long they are. A short arrow and a long arrow pointing in the same direction are “maximally similar.” This magnitude-invariance is crucial for learned embeddings, where the norm of a representation can vary for uninteresting reasons (layer scale, batch statistics) but the direction encodes semantic meaning.
In contrastive learning (SimCLR, CLIP, MoCo), the model learns to push representations of similar things into the same direction and dissimilar things into different directions. Cosine similarity is the natural metric because L2-normalising the embeddings onto a unit hypersphere forces the model to use direction — not magnitude — to encode information. This prevents a collapse mode where the model makes all embeddings large in the same direction.
The temperature parameter $\tau$ controls how sharp the similarity distribution is. With low $\tau$, small differences in cosine similarity become large differences in the softmax — the model must be very precise. With high $\tau$, the distribution is smoother and more forgiving. Typical values: $\tau = 0.5$ (SimCLR), $\tau = 0.07$ (CLIP, learnable).
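To make the sharpening concrete, here is a small sketch comparing the softmax over the same similarities at a low and a high temperature (the similarity values are made up for illustration, not from any model):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())  # subtract max for numerical stability
    return e / e.sum()

# One anchor's cosine similarities to four candidates (made-up values)
sims = np.array([0.9, 0.8, 0.1, 0.0])

sharp = softmax(sims / 0.07)  # low tau: the small gap at the top dominates
smooth = softmax(sims / 0.5)  # high tau: mass spread across candidates

print(sharp.round(3))
print(smooth.round(3))
```

At $\tau = 0.07$ the best candidate takes most of the probability mass; at $\tau = 0.5$ the top two candidates stay close, so the model is penalised far less for near-misses.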
Cosine similarity:

$$\text{sim}(a, b) = \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}$$

With L2-normalised vectors ($\lVert a \rVert = \lVert b \rVert = 1$):

$$\text{sim}(a, b) = a \cdot b$$

Normalise once, then all similarities are just dot products — this is why normalisation is the standard approach in practice.

Temperature-scaled (InfoNCE / contrastive loss):

$$\ell_{i,j} = -\log \frac{\exp(\text{sim}(z_i, z_j)/\tau)}{\sum_{k \neq i} \exp(\text{sim}(z_i, z_k)/\tau)}$$

The temperature $\tau$ scales the logits before the softmax. Lower $\tau$ sharpens the distribution.

Cosine distance (for use as a loss):

$$d(a, b) = 1 - \text{sim}(a, b)$$
```python
import torch
import torch.nn.functional as F

B, D = 32, 128                               # batch size, embedding dim

# ── Built-in cosine similarity ───────────────────────────────────
a = torch.randn(B, D)                        # (B, D)
b = torch.randn(B, D)                        # (B, D)
sim = F.cosine_similarity(a, b, dim=1)       # (B,) — per-pair similarity

# ── Pairwise similarity matrix (contrastive learning pattern) ────
features = torch.randn(B, D)                 # raw (unnormalised) embeddings
z = F.normalize(features, dim=1)             # (B, D) — L2-normalise
sim_matrix = z @ z.T                         # (B, B) — all-pairs cosine sim
# sim_matrix[i, j] = cosine similarity between sample i and sample j

# ── Temperature-scaled (InfoNCE) ─────────────────────────────────
tau = 0.07
logits = sim_matrix / tau                    # (B, B) — scaled for softmax

# ── Cosine similarity loss (push two embeddings together) ────────
anchor, positive = torch.randn(B, D), torch.randn(B, D)
loss = 1 - F.cosine_similarity(anchor, positive, dim=1).mean()

# ── CLIP-style contrastive loss ──────────────────────────────────
# image_encoder / text_encoder are the two modality encoders
image_emb = F.normalize(image_encoder(images), dim=1)  # (B, D)
text_emb = F.normalize(text_encoder(texts), dim=1)     # (B, D)
logits = image_emb @ text_emb.T / tau                  # (B, B)
labels = torch.arange(B, device=logits.device)         # (B,) — diagonal is positive
loss = (F.cross_entropy(logits, labels)                # image→text
        + F.cross_entropy(logits.T, labels)) / 2       # text→image
```

**Warning:** Always normalise before computing the similarity matrix. Without normalisation, embeddings with large norms dominate and the similarity is no longer purely directional.
## Manual Implementation

```python
import numpy as np

def cosine_similarity(a, b):
    """
    Pairwise cosine similarity between two sets of vectors.
    a: (N, D), b: (M, D)
    Returns: (N, M) similarity matrix
    """
    a_norm = a / np.linalg.norm(a, axis=1, keepdims=True)  # (N, D)
    b_norm = b / np.linalg.norm(b, axis=1, keepdims=True)  # (M, D)
    return a_norm @ b_norm.T                               # (N, M)

def infonce_loss(features, tau=0.07):
    """
    InfoNCE contrastive loss.
    Assumes features[2i] and features[2i+1] are positive pairs
    (e.g., two augmented views of the same image).
    features: (2B, D)
    """
    B2, D = features.shape
    z = features / np.linalg.norm(features, axis=1, keepdims=True)  # (2B, D)
    sim = z @ z.T / tau                                             # (2B, 2B)

    # Mask out self-similarity (diagonal)
    np.fill_diagonal(sim, -1e9)

    # For each sample i, its positive is i+1 if even, i-1 if odd
    labels = np.array([i + 1 if i % 2 == 0 else i - 1 for i in range(B2)])

    # Softmax cross-entropy
    exp_sim = np.exp(sim - sim.max(axis=1, keepdims=True))            # (2B, 2B)
    log_probs = np.log(exp_sim / exp_sim.sum(axis=1, keepdims=True))  # (2B, 2B)
    return -log_probs[np.arange(B2), labels].mean()
```

## Popular Uses
- SimCLR: cosine similarity between augmented views, with temperature $\tau = 0.5$, inside the InfoNCE loss
- CLIP: cosine similarity between image and text embeddings; the training objective is a symmetric cross-entropy on the similarity matrix
- MoCo: cosine similarity between query and key embeddings from online and momentum encoders
- Sentence embeddings (Sentence-BERT): cosine similarity for semantic text search and retrieval
- Nearest-neighbour search (FAISS, vector databases): L2-normalised embeddings + dot product = cosine similarity at scale
- Recommendation systems: user and item embeddings compared via cosine similarity
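The retrieval pattern above reduces to “normalise once, then dot product.” A minimal sketch with random stand-in embeddings (plain NumPy, no FAISS — the vector-database version is the same computation at scale):

```python
import numpy as np

rng = np.random.default_rng(0)
corpus = rng.standard_normal((1000, 64))  # stand-in document embeddings
query = rng.standard_normal(64)           # stand-in query embedding

# Normalise once at index time; every later similarity is a dot product
corpus = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
query = query / np.linalg.norm(query)

scores = corpus @ query          # (1000,) cosine similarities in [-1, 1]
top5 = np.argsort(-scores)[:5]   # indices of the 5 most similar documents
```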
## Alternatives

| Alternative | When to use | Tradeoff |
|---|---|---|
| Euclidean (L2) distance | When magnitude matters (pixel space, diffusion) | Sensitive to scale; doesn’t work well for high-dimensional embeddings |
| Dot product | Already-normalised embeddings, attention mechanisms | Equivalent to cosine sim after normalisation; without it, biased toward large norms |
| Mahalanobis distance | Correlated features, anomaly detection | Accounts for covariance structure; requires estimating covariance matrix |
| Manhattan (L1) distance | Sparse features, robust to outliers | Less sensitive to large differences in single dimensions |
| Learned distance (Siamese networks) | When the right metric is unknown | More flexible but requires training; cosine sim is a strong default |
| Negative L2 on normalised vectors | Mathematical equivalence | $\lVert a - b \rVert^2 = 2 - 2\,\text{sim}(a, b)$; same ranking as cosine sim |
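The negative-L2 equivalence in the table is worth internalising: on the unit sphere, squared Euclidean distance and cosine similarity carry exactly the same information. A quick numerical check:

```python
import numpy as np

rng = np.random.default_rng(0)
a, b = rng.standard_normal(16), rng.standard_normal(16)
a, b = a / np.linalg.norm(a), b / np.linalg.norm(b)  # project onto unit sphere

cos_sim = a @ b
sq_dist = np.sum((a - b) ** 2)

# ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b = 2 - 2 cos(a, b) for unit vectors
assert np.isclose(sq_dist, 2 - 2 * cos_sim)
```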
## Historical Context

Cosine similarity originated in information retrieval (Salton & McGill, 1983) for comparing TF-IDF document vectors, where magnitude (document length) was a nuisance factor. It became the default metric for word embeddings (Word2Vec, Mikolov et al., 2013) and was adopted wholesale by the contrastive learning revolution.
The combination of L2-normalisation + temperature scaling was formalised by SimCLR (Chen et al., 2020), which showed that both components are critical: normalisation prevents collapse and temperature controls the hardness of the contrastive task. CLIP (Radford et al., 2021) made the temperature learnable, finding that the optimal value varies during training.