Unified Contrastive / Self-Supervised Learning

This one has a slightly different shape than the sibling examples because contrastive learning’s core tension — preventing collapse — manifests differently across variants rather than just swapping a loss or target calculation.

The key narrative:

SimCLR is the purest expression of the idea. Two augmentations, pull together, push everything else apart. The catch is you need a huge batch to get enough negatives — 4096 is standard, which means multi-GPU setups.
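
The core loss can be sketched in a few lines. This is a minimal, one-directional NumPy sketch with illustrative names and defaults; SimCLR's actual NT-Xent pools both views into a single 2N-sample batch and treats every view as an anchor, but the shape of the computation is the same.

```python
import numpy as np

def info_nce(z1, z2, temperature=0.5):
    """In-batch InfoNCE: z1[i] and z2[i] are the two views of sample i.
    Both inputs are assumed L2-normalized, shape (N, D)."""
    logits = z1 @ z2.T / temperature  # (N, N) scaled cosine similarities
    # Row i: the positive is column i; the other N-1 columns are negatives.
    log_softmax = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_softmax))
```

With more samples in the batch, the denominator gains more negatives, which is exactly why large batches help.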

MoCo solves that with an elegant trick: maintain a FIFO queue of 65K negatives from recent batches. But stale negatives (encoded by an old network) would be inconsistent, so the negative encoder is a slowly-updating EMA copy. Now batch size and number of negatives are decoupled.
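
The two mechanisms, FIFO queue and EMA key encoder, can be sketched independently of any real network. This is a toy NumPy sketch with illustrative names; the real MoCo stores the queue as a fixed-size tensor and applies the EMA update to every parameter of the key encoder.

```python
import numpy as np

class MoCoNegatives:
    """FIFO queue of keys plus a momentum (EMA) update for the key encoder.
    The 'encoder weights' here are plain arrays, purely for illustration."""
    def __init__(self, dim, queue_size=65536, momentum=0.999):
        self.queue = np.zeros((0, dim))
        self.queue_size = queue_size
        self.momentum = momentum

    def update_key_encoder(self, w_query, w_key):
        # EMA update: the key encoder drifts slowly toward the query encoder,
        # so keys from recent batches stay mutually consistent.
        return self.momentum * w_key + (1 - self.momentum) * w_query

    def enqueue(self, keys):
        # Append the newest keys, drop the oldest once the queue is full.
        self.queue = np.concatenate([self.queue, keys])[-self.queue_size:]
```

Because the queue is refreshed a batch at a time, the number of negatives is fixed by `queue_size`, not by the batch size.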

CLIP takes the same InfoNCE structure but applies it across modalities — images and text. The loss is symmetric (image→text and text→image) and the temperature is learned rather than fixed. This is what enabled zero-shot image classification, text-based image search, and the entire text-to-image pipeline (CLIP feeds into Stable Diffusion’s conditioning).
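
The symmetric loss is just two cross-entropies over the same logit matrix, one per direction. A NumPy sketch with illustrative names; CLIP parameterizes the learned temperature as a log-scale, so the logits are multiplied by the exponential of a learnable scalar (an inverse temperature).

```python
import numpy as np

def clip_loss(img_emb, txt_emb, log_scale):
    """Symmetric InfoNCE over an image-text batch. Both inputs are assumed
    L2-normalized, shape (N, D); pair (img_emb[i], txt_emb[i]) is the match.
    log_scale is a learned scalar (log of the inverse temperature)."""
    logits = img_emb @ txt_emb.T * np.exp(log_scale)  # (N, N)

    def ce_rows(l):
        # Cross-entropy with the correct match on the diagonal of each row.
        return -np.mean(np.diag(l) - np.log(np.exp(l).sum(axis=1)))

    # Average the image->text and text->image directions.
    return 0.5 * (ce_rows(logits) + ce_rows(logits.T))
```

Transposing the logit matrix is what gives the second direction: rows become "for this caption, which image?" instead of "for this image, which caption?".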

BYOL is the most surprising — it works with no negatives at all. The collapse section at the bottom explains why this doesn’t immediately degenerate. The asymmetry between the online branch (has a predictor) and the target branch (no predictor, momentum-updated) creates a bootstrapping dynamic where collapse is an unstable equilibrium rather than an attractor.
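
The objective itself is just a negative cosine similarity between the predictor's output and a stop-gradient target. A NumPy sketch; `stop_gradient` here is a placeholder for a real framework's detach, and the momentum update of the target network is the same EMA idea as in MoCo.

```python
import numpy as np

def stop_gradient(x):
    # Placeholder: in a real framework this detaches x from autodiff,
    # so no gradient flows into the target branch.
    return x.copy()

def byol_loss(online_pred, target_proj):
    """Negative cosine similarity between the online predictor's output and
    the target projection. Inputs shape (N, D); result lies in [0, 4]."""
    p = online_pred / np.linalg.norm(online_pred, axis=1, keepdims=True)
    z = stop_gradient(target_proj)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    return 2 - 2 * np.mean(np.sum(p * z, axis=1))
```

Note there is no repulsive term anywhere; the asymmetry described above is doing all the anti-collapse work.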

SupCon bridges self-supervised and supervised: same contrastive machinery, but labels define the positive set. Instead of just “two augmentations of image_i are positive,” it’s “all images of class c are positive for each other.” This learns much more structured embedding spaces than cross-entropy.
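
Label-defined positives change only the mask, not the machinery. A NumPy sketch with illustrative names; for simplicity it averages the log-probabilities over all positive pairs at once, where the paper normalizes per anchor first.

```python
import numpy as np

def supcon_loss(z, labels, temperature=0.1):
    """Supervised contrastive loss: every same-label sample is a positive.
    z: L2-normalized embeddings (N, D); labels: (N,) integer class ids."""
    n = len(labels)
    logits = z @ z.T / temperature
    np.fill_diagonal(logits, -np.inf)  # never contrast a sample with itself
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Positives: same label, excluding the anchor itself.
    pos_mask = (labels[:, None] == labels[None, :]) & ~np.eye(n, dtype=bool)
    return -np.mean(log_prob[pos_mask])
```

Swapping `pos_mask` for "the other view of the same sample" recovers the self-supervised case, which is exactly the sense in which SupCon bridges the two regimes.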

The collapse problem section at the bottom is arguably the most important part — it’s the single unifying concern that explains why each variant makes the choices it does.

Summary: What changes vs. what stays the same

  • Create two views of each sample
  • Encode both views → embeddings on the unit sphere
  • Compute contrastive loss (PLUGGABLE)
  • Gradient step
  • Post-step hooks (momentum, queue, …) (PLUGGABLE)
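
The skeleton above can be sketched as a single training-step function where only two arguments change between variants. All names here are illustrative stand-ins, and the actual gradient step is elided.

```python
import numpy as np

def normalize(z):
    # Project embeddings onto the unit sphere.
    return z / np.linalg.norm(z, axis=1, keepdims=True)

def train_step(batch, augment, encode, loss_fn, post_step_hooks):
    """One pass through the shared skeleton. `loss_fn` and `post_step_hooks`
    are the two pluggable pieces."""
    v1, v2 = augment(batch), augment(batch)                # two views per sample
    z1, z2 = normalize(encode(v1)), normalize(encode(v2))  # embeddings on the sphere
    loss = loss_fn(z1, z2)                                 # pluggable contrastive loss
    # ... backprop + optimizer step would happen here ...
    for hook in post_step_hooks:  # momentum update, queue push, ...
        hook(z2)
    return loss
```

SimCLR passes no hooks at all; MoCo and BYOL pass the EMA update (and, for MoCo, the queue push); CLIP swaps `augment` for a second modality.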
| Variant | View source | Negatives | Loss | Extra |
| --- | --- | --- | --- | --- |
| SimCLR | Augmentation | In-batch | InfoNCE | |
| MoCo | Augmentation | Queue (EMA) | InfoNCE | Momentum enc. |
| CLIP | Cross-modal | In-batch | Symmetric CE | Two encoders |
| BYOL | Augmentation | None | Cosine | Momentum enc. + predictor |
| SupCon | Augmentation | In-batch (diff. class) | Supervised InfoNCE | Uses labels |


| Variant | Problem solved | Intuition for solution |
| --- | --- | --- |
| SimCLR | Supervised learning needs expensive labels. Can we learn representations from unlabelled data? | Two augmentations of the same image are “positives”: pull them together. All other images are “negatives”: push them apart. The encoder learns semantic features invariant to augmentation. |
| MoCo | SimCLR needs enormous batch sizes (4096+) for enough negatives, which requires many GPUs and huge memory. | Maintain a FIFO queue of negatives from recent batches, encoded by a momentum (EMA) copy of the encoder. The queue gives 65K negatives regardless of batch size; the EMA keeps encodings consistent. |
| CLIP | Image representations don’t understand language: you can’t search images by text or do zero-shot recognition. | Contrastive learning across modalities: encode images and captions into a shared space where matching pairs are close. The symmetric loss aligns both directions; the learned temperature adapts the sharpness of matching. |
| BYOL | Contrastive methods need negatives, which means large batches or queues, and are sensitive to negative quality (false negatives hurt). | Use asymmetry instead: the online net has a predictor head, the target net doesn’t. The online net must PREDICT the target’s output, not just match a fixed point. Momentum update + stop-gradient prevent collapse without any negatives at all. |
| SupCon | Standard cross-entropy uses labels only as one-hot targets; it doesn’t exploit within-class structure or learn a meaningful embedding space. | Make all same-class samples positives: the encoder pulls together the CLUSTER of each class, not just a single prototype. This learns a more structured embedding space and is more robust to noisy labels. |

The collapse problem (why contrastive learning is tricky)

The trivial solution to “pull positives together” is to map EVERYTHING to the same point. Every method must prevent this:

  • SimCLR, MoCo, SupCon: negatives push embeddings apart. Collapse means all negatives are identical → huge loss.
  • CLIP: negatives + symmetric loss. Collapse means the similarity matrix is all-ones → cross-entropy explodes.
  • BYOL: no negatives — relies on architectural asymmetry (predictor head + momentum + stop-gradient). The online net can’t “cheat” by collapsing because the target moves slowly and independently, and the predictor must actively transform, not just copy.
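
A quick numeric check of the first bullet: if every embedding collapses to the same unit vector, all similarities are equal, the softmax over each row is uniform, and in-batch InfoNCE sits at log N instead of going to zero (using the same one-directional InfoNCE sketch as earlier, repeated here so the snippet is self-contained).

```python
import numpy as np

def info_nce(z1, z2, temperature=0.5):
    logits = z1 @ z2.T / temperature
    log_softmax = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_softmax))

N, D = 128, 64
# Collapse: every sample maps to the same unit vector.
collapsed = np.tile(np.eye(1, D), (N, 1))
loss = info_nce(collapsed, collapsed)
# All similarities are equal, so each row's softmax is uniform
# and the loss is exactly log(N) = log(128), about 4.85.
```

So collapse is not a low-loss solution for InfoNCE; it is pinned at a large constant that grows with the number of negatives.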

This is the central tension of contrastive learning: you need a mechanism to prevent collapse, and the choice of mechanism is what most distinguishes the variants.