Unified Contrastive / Self-Supervised Learning

This one has a slightly different shape than the sibling examples because contrastive learning’s core tension — preventing collapse — manifests differently across variants rather than just swapping a loss or target calculation.

The key narrative:

SimCLR is the purest expression of the idea. Two augmentations, pull together, push everything else apart. The catch is you need a huge batch to get enough negatives — 4096 is standard, which means multi-GPU setups.
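
The core loss can be sketched in a few lines. This is a minimal, one-directional NumPy sketch with illustrative names and defaults; SimCLR's actual NT-Xent pools both views into a single 2N-sample batch and treats every view as an anchor, but the shape of the computation is the same.

```python
import numpy as np

def info_nce(z1, z2, temperature=0.5):
    """In-batch InfoNCE: z1[i] and z2[i] are the two views of sample i.
    Both inputs are assumed L2-normalized, shape (N, D)."""
    logits = z1 @ z2.T / temperature  # (N, N) scaled cosine similarities
    # Row i: the positive is column i; the other N-1 columns are negatives.
    log_softmax = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_softmax))
```

With more samples in the batch, the denominator gains more negatives, which is exactly why large batches help.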

MoCo solves that with an elegant trick: maintain a FIFO queue of 65K negatives from recent batches. But stale negatives (encoded by an old network) would be inconsistent, so the negative encoder is a slowly-updating EMA copy. Now batch size and number of negatives are decoupled.
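
The two mechanisms, FIFO queue and EMA key encoder, can be sketched independently of any real network. This is a toy NumPy sketch with illustrative names; the real MoCo stores the queue as a fixed-size tensor and applies the EMA update to every parameter of the key encoder.

```python
import numpy as np

class MoCoNegatives:
    """FIFO queue of keys plus a momentum (EMA) update for the key encoder.
    The 'encoder weights' here are plain arrays, purely for illustration."""
    def __init__(self, dim, queue_size=65536, momentum=0.999):
        self.queue = np.zeros((0, dim))
        self.queue_size = queue_size
        self.momentum = momentum

    def update_key_encoder(self, w_query, w_key):
        # EMA update: the key encoder drifts slowly toward the query encoder,
        # so keys from recent batches stay mutually consistent.
        return self.momentum * w_key + (1 - self.momentum) * w_query

    def enqueue(self, keys):
        # Append the newest keys, drop the oldest once the queue is full.
        self.queue = np.concatenate([self.queue, keys])[-self.queue_size:]
```

Because the queue is refreshed a batch at a time, the number of negatives is fixed by `queue_size`, not by the batch size.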

CLIP takes the same InfoNCE structure but applies it across modalities — images and text. The loss is symmetric (image→text and text→image) and the temperature is learned rather than fixed. This is what enabled zero-shot image classification, text-based image search, and the entire text-to-image pipeline (CLIP feeds into Stable Diffusion’s conditioning).
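
The symmetric loss is just two cross-entropies over the same logit matrix, one per direction. A NumPy sketch with illustrative names; CLIP parameterizes the learned temperature as a log-scale, so the logits are multiplied by the exponential of a learnable scalar (an inverse temperature).

```python
import numpy as np

def clip_loss(img_emb, txt_emb, log_scale):
    """Symmetric InfoNCE over an image-text batch. Both inputs are assumed
    L2-normalized, shape (N, D); pair (img_emb[i], txt_emb[i]) is the match.
    log_scale is a learned scalar (log of the inverse temperature)."""
    logits = img_emb @ txt_emb.T * np.exp(log_scale)  # (N, N)

    def ce_rows(l):
        # Cross-entropy with the correct match on the diagonal of each row.
        return -np.mean(np.diag(l) - np.log(np.exp(l).sum(axis=1)))

    # Average the image->text and text->image directions.
    return 0.5 * (ce_rows(logits) + ce_rows(logits.T))
```

Transposing the logit matrix is what gives the second direction: rows become "for this caption, which image?" instead of "for this image, which caption?".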

BYOL is the most surprising — it works with no negatives at all. The collapse section at the bottom explains why this doesn’t immediately degenerate. The asymmetry between the online branch (has a predictor) and the target branch (no predictor, momentum-updated) creates a bootstrapping dynamic where collapse is an unstable equilibrium rather than an attractor.
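
The objective itself is just a negative cosine similarity between the predictor's output and a stop-gradient target. A NumPy sketch; `stop_gradient` here is a placeholder for a real framework's detach, and the momentum update of the target network is the same EMA idea as in MoCo.

```python
import numpy as np

def stop_gradient(x):
    # Placeholder: in a real framework this detaches x from autodiff,
    # so no gradient flows into the target branch.
    return x.copy()

def byol_loss(online_pred, target_proj):
    """Negative cosine similarity between the online predictor's output and
    the target projection. Inputs shape (N, D); result lies in [0, 4]."""
    p = online_pred / np.linalg.norm(online_pred, axis=1, keepdims=True)
    z = stop_gradient(target_proj)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    return 2 - 2 * np.mean(np.sum(p * z, axis=1))
```

Note there is no repulsive term anywhere; the asymmetry described above is doing all the anti-collapse work.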

SupCon bridges self-supervised and supervised: same contrastive machinery, but labels define the positive set. Instead of just “two augmentations of image_i are positive,” it’s “all images of class c are positive for each other.” This learns much more structured embedding spaces than cross-entropy.
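
Label-defined positives change only the mask, not the machinery. A NumPy sketch with illustrative names; for simplicity it averages the log-probabilities over all positive pairs at once, where the paper normalizes per anchor first.

```python
import numpy as np

def supcon_loss(z, labels, temperature=0.1):
    """Supervised contrastive loss: every same-label sample is a positive.
    z: L2-normalized embeddings (N, D); labels: (N,) integer class ids."""
    n = len(labels)
    logits = z @ z.T / temperature
    np.fill_diagonal(logits, -np.inf)  # never contrast a sample with itself
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Positives: same label, excluding the anchor itself.
    pos_mask = (labels[:, None] == labels[None, :]) & ~np.eye(n, dtype=bool)
    return -np.mean(log_prob[pos_mask])
```

Swapping `pos_mask` for "the other view of the same sample" recovers the self-supervised case, which is exactly the sense in which SupCon bridges the two regimes.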

The collapse problem section at the bottom is arguably the most important part — it’s the single unifying concern that explains why each variant makes the choices it does.

Summary: What changes vs. what stays the same

  • Create two views of each sample
  • Encode both views → embeddings on the unit sphere
  • Compute contrastive loss (PLUGGABLE)
  • Gradient step
  • Post-step hooks (momentum, queue, …) (PLUGGABLE)
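
The skeleton above can be sketched as a single training-step function where only two arguments change between variants. All names here are illustrative stand-ins, and the actual gradient step is elided.

```python
import numpy as np

def normalize(z):
    # Project embeddings onto the unit sphere.
    return z / np.linalg.norm(z, axis=1, keepdims=True)

def train_step(batch, augment, encode, loss_fn, post_step_hooks):
    """One pass through the shared skeleton. `loss_fn` and `post_step_hooks`
    are the two pluggable pieces."""
    v1, v2 = augment(batch), augment(batch)                # two views per sample
    z1, z2 = normalize(encode(v1)), normalize(encode(v2))  # embeddings on the sphere
    loss = loss_fn(z1, z2)                                 # pluggable contrastive loss
    # ... backprop + optimizer step would happen here ...
    for hook in post_step_hooks:  # momentum update, queue push, ...
        hook(z2)
    return loss
```

SimCLR passes no hooks at all; MoCo and BYOL pass the EMA update (and, for MoCo, the queue push); CLIP swaps `augment` for a second modality.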
| Variant | View source | Negatives | Loss | Extra |
| --- | --- | --- | --- | --- |
| SimCLR | Augmentation | In-batch | InfoNCE | |
| MoCo | Augmentation | Queue (EMA) | InfoNCE | Momentum enc. |
| CLIP | Cross-modal | In-batch | Symmetric CE | Two encoders |
| BYOL | Augmentation | None | Cosine | Momentum enc. + predictor |
| SupCon | Augmentation | In-batch (diff. class) | Supervised InfoNCE | Uses labels |


| Variant | Problem solved | Intuition for solution |
| --- | --- | --- |
| SimCLR | Supervised learning needs expensive labels. Can we learn representations from unlabelled data? | Two augmentations of the same image are “positives”: pull them together. All other images are “negatives”: push them apart. The encoder learns semantic features invariant to augmentation. |
| MoCo | SimCLR needs enormous batch sizes (4096+) for enough negatives, which requires many GPUs and huge memory. | Maintain a FIFO queue of negatives from recent batches, encoded by a momentum (EMA) copy of the encoder. The queue gives 65K negatives regardless of batch size; the EMA keeps encodings consistent. |
| CLIP | Image representations don’t understand language: you can’t search images by text or do zero-shot recognition. | Contrastive learning across modalities: encode images and captions into a shared space where matching pairs are close. The symmetric loss aligns both directions; the learned temperature adapts the sharpness of matching. |
| BYOL | Contrastive methods need negatives, which means large batches or queues, and are sensitive to negative quality (false negatives hurt). | Use asymmetry instead: the online net has a predictor head, the target net doesn’t. The online net must PREDICT the target’s output, not just match a fixed point. Momentum update + stop-gradient prevent collapse without any negatives at all. |
| SupCon | Standard cross-entropy uses labels only as one-hot targets; it doesn’t exploit within-class structure or learn a meaningful embedding space. | Make all same-class samples positives: the encoder pulls together the CLUSTER of each class, not just a single prototype. This learns a more structured embedding space and is more robust to noisy labels. |

The collapse problem (why contrastive learning is tricky)

The trivial solution to “pull positives together” is to map EVERYTHING to the same point. Every method must prevent this:

  • SimCLR, MoCo, SupCon: negatives push embeddings apart. Collapse means all negatives are identical → huge loss.
  • CLIP: negatives + symmetric loss. Collapse means the similarity matrix is all-ones → cross-entropy explodes.
  • BYOL: no negatives — relies on architectural asymmetry (predictor head + momentum + stop-gradient). The online net can’t “cheat” by collapsing because the target moves slowly and independently, and the predictor must actively transform, not just copy.
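
A quick numeric check of the first bullet: if every embedding collapses to the same unit vector, all similarities are equal, the softmax over each row is uniform, and in-batch InfoNCE sits at log N instead of going to zero (using the same one-directional InfoNCE sketch as earlier, repeated here so the snippet is self-contained).

```python
import numpy as np

def info_nce(z1, z2, temperature=0.5):
    logits = z1 @ z2.T / temperature
    log_softmax = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_softmax))

N, D = 128, 64
# Collapse: every sample maps to the same unit vector.
collapsed = np.tile(np.eye(1, D), (N, 1))
loss = info_nce(collapsed, collapsed)
# All similarities are equal, so each row's softmax is uniform
# and the loss is exactly log(N) = log(128), about 4.85.
```

So collapse is not a low-loss solution for InfoNCE; it is pinned at a large constant that grows with the number of negatives.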

This is the central tension of contrastive learning: you need a mechanism to prevent collapse, and the choice of mechanism is what most distinguishes the variants.