# Cosine Similarity

Measures directional alignment between two vectors, ignoring their magnitudes. Two vectors pointing the same way have cosine similarity 1, orthogonal vectors have 0, and opposite vectors have -1. This is *the* distance metric for contrastive learning and embedding-based retrieval.
## Intuition

Imagine two arrows pinned at the origin. Cosine similarity measures the angle between them — not how long they are. A short arrow and a long arrow pointing in the same direction are “maximally similar.” This magnitude-invariance is crucial for learned embeddings, where the norm of a representation can vary for uninteresting reasons (layer scale, batch statistics) but the direction encodes semantic meaning.
In contrastive learning (SimCLR, CLIP, MoCo), the model learns to push representations of similar things into the same direction and dissimilar things into different directions. Cosine similarity is the natural metric because L2-normalising the embeddings onto a unit hypersphere forces the model to use direction — not magnitude — to encode information. This prevents a collapse mode where the model makes all embeddings large in the same direction.
The temperature parameter $\tau$ controls how sharp the similarity distribution is. With low $\tau$, small differences in cosine similarity become large differences in the softmax — the model must be very precise. With high $\tau$, the distribution is smoother and more forgiving. Typical values: $\tau = 0.5$ (SimCLR), $\tau = 0.07$ (CLIP, learnable).
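To make the sharpening concrete, here is a small sketch comparing the softmax over the same similarities at a low and a high temperature (the similarity values are made up for illustration, not from any model):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())  # subtract max for numerical stability
    return e / e.sum()

# One anchor's cosine similarities to four candidates (made-up values)
sims = np.array([0.9, 0.8, 0.1, 0.0])

sharp = softmax(sims / 0.07)  # low tau: the small gap at the top dominates
smooth = softmax(sims / 0.5)  # high tau: mass spread across candidates

print(sharp.round(3))
print(smooth.round(3))
```

At $\tau = 0.07$ the best candidate takes most of the probability mass; at $\tau = 0.5$ the top two candidates stay close, so the model is penalised far less for near-misses.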
Cosine similarity:

$$\text{sim}(a, b) = \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}$$

With L2-normalised vectors ($\lVert a \rVert = \lVert b \rVert = 1$):

$$\text{sim}(a, b) = a \cdot b$$

Normalise once, then all similarities are just dot products — this is why normalisation is the standard approach in practice.

Temperature-scaled (InfoNCE / contrastive loss):

$$\ell_{i,j} = -\log \frac{\exp(\text{sim}(z_i, z_j)/\tau)}{\sum_{k \neq i} \exp(\text{sim}(z_i, z_k)/\tau)}$$

The temperature $\tau$ scales the logits before the softmax. Lower $\tau$ sharpens the distribution.

Cosine distance (for use as a loss):

$$d(a, b) = 1 - \text{sim}(a, b)$$
```python
import torch
import torch.nn.functional as F

B, D = 32, 128                               # batch size, embedding dim

# ── Built-in cosine similarity ───────────────────────────────────
a = torch.randn(B, D)                        # (B, D)
b = torch.randn(B, D)                        # (B, D)
sim = F.cosine_similarity(a, b, dim=1)       # (B,) — per-pair similarity

# ── Pairwise similarity matrix (contrastive learning pattern) ────
features = torch.randn(B, D)                 # raw (unnormalised) embeddings
z = F.normalize(features, dim=1)             # (B, D) — L2-normalise
sim_matrix = z @ z.T                         # (B, B) — all-pairs cosine sim
# sim_matrix[i, j] = cosine similarity between sample i and sample j

# ── Temperature-scaled (InfoNCE) ─────────────────────────────────
tau = 0.07
logits = sim_matrix / tau                    # (B, B) — scaled for softmax

# ── Cosine similarity loss (push two embeddings together) ────────
anchor, positive = torch.randn(B, D), torch.randn(B, D)
loss = 1 - F.cosine_similarity(anchor, positive, dim=1).mean()

# ── CLIP-style contrastive loss ──────────────────────────────────
# image_encoder / text_encoder are the two modality encoders
image_emb = F.normalize(image_encoder(images), dim=1)  # (B, D)
text_emb = F.normalize(text_encoder(texts), dim=1)     # (B, D)
logits = image_emb @ text_emb.T / tau                  # (B, B)
labels = torch.arange(B, device=logits.device)         # (B,) — diagonal is positive
loss = (F.cross_entropy(logits, labels)                # image→text
        + F.cross_entropy(logits.T, labels)) / 2       # text→image
```

**Warning:** Always normalise before computing the similarity matrix. Without normalisation, embeddings with large norms dominate and the similarity is no longer purely directional.
## Manual Implementation

```python
import numpy as np

def cosine_similarity(a, b):
    """
    Pairwise cosine similarity between two sets of vectors.
    a: (N, D), b: (M, D)
    Returns: (N, M) similarity matrix
    """
    a_norm = a / np.linalg.norm(a, axis=1, keepdims=True)  # (N, D)
    b_norm = b / np.linalg.norm(b, axis=1, keepdims=True)  # (M, D)
    return a_norm @ b_norm.T                               # (N, M)

def infonce_loss(features, tau=0.07):
    """
    InfoNCE contrastive loss.
    Assumes features[2i] and features[2i+1] are positive pairs
    (e.g., two augmented views of the same image).
    features: (2B, D)
    """
    B2, D = features.shape
    z = features / np.linalg.norm(features, axis=1, keepdims=True)  # (2B, D)
    sim = z @ z.T / tau                                             # (2B, 2B)

    # Mask out self-similarity (diagonal)
    np.fill_diagonal(sim, -1e9)

    # For each sample i, its positive is i+1 if even, i-1 if odd
    labels = np.array([i + 1 if i % 2 == 0 else i - 1 for i in range(B2)])

    # Softmax cross-entropy
    exp_sim = np.exp(sim - sim.max(axis=1, keepdims=True))            # (2B, 2B)
    log_probs = np.log(exp_sim / exp_sim.sum(axis=1, keepdims=True))  # (2B, 2B)
    return -log_probs[np.arange(B2), labels].mean()
```

## Popular Uses
- SimCLR: cosine similarity between augmented views, with temperature $\tau = 0.5$, inside the InfoNCE loss
- CLIP: cosine similarity between image and text embeddings; the training objective is a symmetric cross-entropy on the similarity matrix
- MoCo: cosine similarity between query and key embeddings from online and momentum encoders
- Sentence embeddings (Sentence-BERT): cosine similarity for semantic text search and retrieval
- Nearest-neighbour search (FAISS, vector databases): L2-normalised embeddings + dot product = cosine similarity at scale
- Recommendation systems: user and item embeddings compared via cosine similarity
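The retrieval pattern above reduces to “normalise once, then dot product.” A minimal sketch with random stand-in embeddings (plain NumPy, no FAISS — the vector-database version is the same computation at scale):

```python
import numpy as np

rng = np.random.default_rng(0)
corpus = rng.standard_normal((1000, 64))  # stand-in document embeddings
query = rng.standard_normal(64)           # stand-in query embedding

# Normalise once at index time; every later similarity is a dot product
corpus = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
query = query / np.linalg.norm(query)

scores = corpus @ query          # (1000,) cosine similarities in [-1, 1]
top5 = np.argsort(-scores)[:5]   # indices of the 5 most similar documents
```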
## Alternatives

| Alternative | When to use | Tradeoff |
|---|---|---|
| Euclidean (L2) distance | When magnitude matters (pixel space, diffusion) | Sensitive to scale; doesn’t work well for high-dimensional embeddings |
| Dot product | Already-normalised embeddings, attention mechanisms | Equivalent to cosine sim after normalisation; without it, biased toward large norms |
| Mahalanobis distance | Correlated features, anomaly detection | Accounts for covariance structure; requires estimating covariance matrix |
| Manhattan (L1) distance | Sparse features, robust to outliers | Less sensitive to large differences in single dimensions |
| Learned distance (Siamese networks) | When the right metric is unknown | More flexible but requires training; cosine sim is a strong default |
| Negative L2 on normalised vectors | Mathematical equivalence | $\lVert a - b \rVert^2 = 2 - 2\,\text{sim}(a, b)$; same ranking as cosine sim |
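The negative-L2 equivalence in the table is worth internalising: on the unit sphere, squared Euclidean distance and cosine similarity carry exactly the same information. A quick numerical check:

```python
import numpy as np

rng = np.random.default_rng(0)
a, b = rng.standard_normal(16), rng.standard_normal(16)
a, b = a / np.linalg.norm(a), b / np.linalg.norm(b)  # project onto unit sphere

cos_sim = a @ b
sq_dist = np.sum((a - b) ** 2)

# ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b = 2 - 2 cos(a, b) for unit vectors
assert np.isclose(sq_dist, 2 - 2 * cos_sim)
```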
## Historical Context

Cosine similarity originated in information retrieval (Salton & McGill, 1983) for comparing TF-IDF document vectors, where magnitude (document length) was a nuisance factor. It became the default metric for word embeddings (Word2Vec, Mikolov et al., 2013) and was adopted wholesale by the contrastive learning revolution.
The combination of L2-normalisation + temperature scaling was formalised by SimCLR (Chen et al., 2020), which showed that both components are critical: normalisation prevents collapse and temperature controls the hardness of the contrastive task. CLIP (Radford et al., 2021) made the temperature learnable, finding that the optimal value varies during training.