Hinge Loss
A margin-based loss that penalises predictions only when they fall within or on the wrong side of a decision margin. The classic SVM loss, now widely used in GAN discriminators (hinge GAN, SNGAN) and ranking tasks. It does not produce probability estimates; it only enforces a margin.
Intuition
Cross-entropy says "put as much probability as possible on the correct class — always." Hinge loss says "get the correct class score at least margin M above the others, and after that I don't care." Once a sample is correctly classified with sufficient confidence, its loss is exactly zero and it produces no gradient. The model stops spending capacity on easy examples and focuses entirely on the hard ones near the boundary.
This “stop once you’re confident enough” property has a practical consequence: hinge loss produces sparser gradients. In a well-trained model, most samples contribute zero loss. This makes hinge loss appealing for GAN discriminators — you don’t want the discriminator to keep pushing its outputs to infinity on easy real/fake samples, which can cause vanishing gradients for the generator. The hinge formulation naturally saturates, giving the generator useful gradients throughout training.
The margin is typically set to 1. Rescaling the margin is equivalent to rescaling the weights, and hence the regularisation strength, so there is no reason to tune it separately.
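A quick numeric check of this equivalence (toy data and names, plain NumPy): hinge loss with margin M on weights w equals M times the margin-1 loss on rescaled weights w / M.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(100, 5))       # toy features
y = rng.choice([-1.0, 1.0], 100)    # binary labels in {-1, +1}
w = rng.normal(size=5)              # toy weight vector

def hinge(w, margin):
    """Binary hinge loss with an explicit margin."""
    return np.maximum(0.0, margin - y * (x @ w)).mean()

M = 3.0
# max(0, M - y*x.w) == M * max(0, 1 - y*x.(w/M)), elementwise,
# so the two losses agree up to the constant factor M:
assert np.isclose(hinge(w, M), M * hinge(w / M, 1.0))
```

With an L2 penalty λ‖w‖², the rescaling of w is absorbed into λ, which is why tuning the margin on top of the regularisation strength is redundant.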
Binary hinge loss (SVM-style, labels $y \in \{-1, +1\}$):

$$L(y, s) = \max(0,\ 1 - y \cdot s)$$

where $s$ is the raw score (not a probability). If $y \cdot s \ge 1$, the loss is zero.
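To make the zero-loss region concrete, a minimal sketch evaluating the binary hinge at a few scores (names are illustrative):

```python
import numpy as np

def binary_hinge(y, s):
    """max(0, 1 - y*s) for a label y in {-1, +1} and raw score s."""
    return np.maximum(0.0, 1.0 - y * s)

# A positive example (y = +1) at increasing scores:
losses = [binary_hinge(1.0, s) for s in (-0.5, 0.0, 0.5, 1.0, 2.0)]
# The loss falls linearly as the score approaches the margin, then
# clamps to exactly zero for s >= 1: the sample contributes no
# gradient and stops influencing training.
```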
Multi-class hinge loss (Weston-Watkins formulation), for class scores $s$ and correct class $y$:

$$L(s, y) = \sum_{j \ne y} \max(0,\ s_j - s_y + 1)$$

Penalises every incorrect class whose score is within margin 1 of the correct class score.
GAN hinge loss (Miyato et al., SNGAN). For the discriminator $D$, with real samples $x \sim p_{\text{data}}$ and fake samples $G(z),\ z \sim p_z$:

$$L_D = \mathbb{E}_{x \sim p_{\text{data}}}\big[\max(0,\ 1 - D(x))\big] + \mathbb{E}_{z \sim p_z}\big[\max(0,\ 1 + D(G(z)))\big]$$

For the generator:

$$L_G = -\,\mathbb{E}_{z \sim p_z}\big[D(G(z))\big]$$

The discriminator only trains on samples it isn't already classifying correctly with margin 1.
```python
import torch
import torch.nn.functional as F

# ── Multi-class hinge (SVM-style) ────────────────────────────────
# PyTorch's built-in multi-class hinge is multi_margin_loss:
logits = model(x)        # (B, C)
targets = labels         # (B,) integer indices
loss = F.multi_margin_loss(logits, targets, margin=1)  # scalar

# ── GAN hinge loss (discriminator) ───────────────────────────────
d_real = discriminator(real_images)           # (B,) or (B, 1)
d_fake = discriminator(fake_images.detach())  # (B,) or (B, 1)
loss_d = (F.relu(1 - d_real) + F.relu(1 + d_fake)).mean()

# ── GAN hinge loss (generator) ───────────────────────────────────
d_fake = discriminator(generator(z))          # (B,)
loss_g = -d_fake.mean()
# Note: the generator loss is just "make the discriminator output
# large values for fakes." No hinge here; only the discriminator
# uses the margin.
```

Manual Implementation
```python
import numpy as np

def multiclass_hinge_loss(scores, targets):
    """
    Multi-class hinge loss (Weston-Watkins).
    Equivalent to F.multi_margin_loss with margin=1.

    scores:  (B, C) raw class scores
    targets: (B,)   integer class indices
    """
    B, C = scores.shape
    correct_scores = scores[np.arange(B), targets]        # (B,)
    margins = scores - correct_scores[:, None] + 1.0      # (B, C)
    margins[np.arange(B), targets] = 0.0                  # don't count correct class
    loss_per_sample = np.maximum(0, margins).sum(axis=1)  # (B,)
    return loss_per_sample.mean()

def gan_hinge_loss_d(d_real, d_fake):
    """
    Hinge loss for GAN discriminator.
    d_real: (B,) discriminator output on real samples
    d_fake: (B,) discriminator output on fake samples
    """
    return (np.maximum(0, 1 - d_real) + np.maximum(0, 1 + d_fake)).mean()
```

Popular Uses
- GAN discriminators (see gans/): SNGAN and BigGAN use hinge loss; it stabilises training by preventing the discriminator from "over-winning"
- SVMs: the original application. Still used in final layers of some classical ML pipelines
- Ranking / retrieval: hinge loss enforces that relevant items score above irrelevant ones by a margin
- Object detection (older architectures like R-CNN): multi-class hinge was used before cross-entropy became standard
- Structured prediction: max-margin approaches in NLP (structured SVMs) use hinge-style losses on structured outputs
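The ranking use above can be sketched as a pairwise margin loss in NumPy (the scores and function name are illustrative; PyTorch's `F.margin_ranking_loss` implements the same idea):

```python
import numpy as np

def ranking_hinge(s_pos, s_neg, margin=1.0):
    """
    Pairwise ranking hinge: penalise whenever a relevant item's score
    does not exceed its paired irrelevant item's score by `margin`.

    s_pos: (N,) scores of relevant items
    s_neg: (N,) scores of irrelevant items, paired with s_pos
    """
    return np.maximum(0.0, margin - (s_pos - s_neg)).mean()

s_pos = np.array([2.0, 0.5, 3.0])   # illustrative relevant-item scores
s_neg = np.array([0.5, 0.8, 1.0])   # illustrative irrelevant-item scores
loss = ranking_hinge(s_pos, s_neg)
# Pairs already separated by more than the margin contribute zero,
# so a well-trained ranker only gets gradient from hard pairs.
```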
Alternatives
| Alternative | When to use | Tradeoff |
|---|---|---|
| Cross-entropy | Need probability estimates, standard classification | Smooth everywhere, always produces gradients; no margin property |
| WGAN / Wasserstein loss | GAN training with gradient penalty | Requires enforcing Lipschitz constraint (GP or spectral norm); theoretically cleaner gradients |
| Binary cross-entropy | GAN training (vanilla GAN) | Produces probabilities but can saturate — vanishing generator gradients when D is confident |
| Focal loss | Classification with severe class imbalance | Down-weights easy examples like hinge, but smoothly rather than with a hard cutoff |
| Contrastive / triplet loss | Metric learning with embeddings | Also margin-based but operates on distances in embedding space, not class scores |
Historical Context
Hinge loss originated in the support vector machine (SVM) framework, formalised by Vapnik and colleagues in the 1990s. The max-margin principle, finding the decision boundary with the largest gap between classes, was the dominant paradigm in machine learning before deep learning. SVMs with hinge loss were state-of-the-art on many benchmarks through the 2000s.
Hinge loss entered deep learning primarily through GANs. Lim and Ye (2017, “Geometric GAN”) provided theoretical motivation, and Miyato et al. (2018, “Spectral Normalisation for GANs”) demonstrated that combining hinge loss with spectral normalisation produced stable, high-quality GAN training. This combination became the default for BigGAN (Brock et al., 2019) and influenced most subsequent high-resolution GAN architectures.