
Hinge Loss

A margin-based loss that penalises predictions only when they fall within or on the wrong side of a decision margin. The classic SVM loss, now widely used in GAN discriminators (hinge GAN, SNGAN) and ranking tasks. Does not produce probability estimates — only enforces a margin.

Cross-entropy says “put as much probability as possible on the correct class — always.” Hinge loss says “get the correct class score at least margin M above the others, and after that I don’t care.” Once a sample is correctly classified with sufficient confidence, its loss is exactly zero and it produces no gradient. The model stops spending capacity on easy examples and focuses entirely on the hard ones near the boundary.
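This zero-loss, zero-gradient behaviour is easy to verify directly. A minimal sketch with a single scalar score (the value 2.5 is made up for illustration):

```python
import torch

# One sample with label y = +1 and raw score f(x) = 2.5,
# comfortably beyond the margin of 1.
f = torch.tensor([2.5], requires_grad=True)
y = 1.0

loss = torch.clamp(1 - y * f, min=0).sum()  # binary hinge: max(0, 1 - y*f(x))
loss.backward()

print(loss.item())  # 0.0 -- margin satisfied, loss is exactly zero
print(f.grad)       # tensor([0.]) -- the sample contributes no gradient
```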

This “stop once you’re confident enough” property has a practical consequence: hinge loss produces sparser gradients. In a well-trained model, most samples contribute zero loss. This makes hinge loss appealing for GAN discriminators — you don’t want the discriminator to keep pushing its outputs to infinity on easy real/fake samples, which can cause vanishing gradients for the generator. The hinge formulation naturally saturates, giving the generator useful gradients throughout training.

The margin is typically set to 1. Scaling it is equivalent to scaling the regularisation, so there is no reason to tune it separately.
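To see why, consider the regularised objective with a general margin $M$ (a standard argument, sketched here for the linear case $f(x) = w \cdot x$):

$$\min_w \; \sum_i \max(0, M - y_i \, w \cdot x_i) + \lambda \|w\|^2$$

Substituting $w = M \tilde{w}$ gives

$$M \left[ \sum_i \max(0, 1 - y_i \, \tilde{w} \cdot x_i) + \lambda M \|\tilde{w}\|^2 \right],$$

which is the margin-1 objective with regularisation strength $\lambda M$ (the overall factor $M$ does not change the minimiser). Tuning $M$ and tuning $\lambda$ are therefore redundant.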

Binary hinge loss (SVM-style, labels $y \in \{-1, +1\}$):

$$\mathcal{L} = \max(0, 1 - y \cdot f(x))$$

where $f(x)$ is the raw score (not a probability). If $y \cdot f(x) \geq 1$, the loss is zero.
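A minimal vectorised PyTorch sketch of this formula (scores and labels are illustrative):

```python
import torch

scores = torch.tensor([2.0, 0.3, -1.5])   # raw scores f(x), no sigmoid
labels = torch.tensor([1.0, 1.0, -1.0])   # y in {-1, +1}

losses = torch.clamp(1 - labels * scores, min=0)
# sample 0: y*f = 2.0 >= 1 -> loss 0
# sample 1: y*f = 0.3 <  1 -> loss 0.7
# sample 2: y*f = 1.5 >= 1 -> loss 0
print(losses)         # tensor([0.0000, 0.7000, 0.0000])
print(losses.mean())  # ~0.2333
```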

Multi-class hinge loss (Weston-Watkins formulation):

$$\mathcal{L} = \sum_{j \neq y} \max(0, f_j(x) - f_y(x) + 1)$$

Penalises every incorrect class $j$ whose score is within margin 1 of the correct class score.

GAN hinge loss (Miyato et al., SNGAN):

For the discriminator $D$ with real samples $x_r$ and fake samples $x_f$:

$$\mathcal{L}_D = \mathbb{E}[\max(0, 1 - D(x_r))] + \mathbb{E}[\max(0, 1 + D(x_f))]$$

For the generator:

$$\mathcal{L}_G = -\mathbb{E}[D(G(z))]$$

The discriminator only trains on samples it isn’t already classifying correctly with margin 1.

```python
import torch
import torch.nn.functional as F

# ── Multi-class hinge (SVM-style) ────────────────────────────────
# PyTorch has no built-in multi-class hinge module. Use multi_margin_loss:
logits = model(x)                                        # (B, C)
targets = labels                                         # (B,) integer indices
loss = F.multi_margin_loss(logits, targets, margin=1.0)  # scalar

# ── GAN hinge loss (discriminator) ───────────────────────────────
d_real = discriminator(real_images)           # (B,) or (B, 1)
d_fake = discriminator(fake_images.detach())  # (B,) or (B, 1)
loss_d = (F.relu(1 - d_real) + F.relu(1 + d_fake)).mean()

# ── GAN hinge loss (generator) ──────────────────────────────────
d_fake = discriminator(generator(z))          # (B,)
loss_g = -d_fake.mean()

# Note: the generator loss is just "make the discriminator output
# large values for fakes." No hinge here — only the discriminator
# uses the margin.
```
```python
import numpy as np

def multiclass_hinge_loss(scores, targets):
    """
    Multi-class hinge loss (Weston-Watkins).

    Matches F.multi_margin_loss with margin=1, except that PyTorch
    also divides each sample's loss by the number of classes C.

    scores:  (B, C) raw class scores
    targets: (B,) integer class indices
    """
    B, C = scores.shape
    correct_scores = scores[np.arange(B), targets]        # (B,)
    margins = scores - correct_scores[:, None] + 1.0      # (B, C)
    margins[np.arange(B), targets] = 0.0                  # don't count correct class
    loss_per_sample = np.maximum(0, margins).sum(axis=1)  # (B,)
    return loss_per_sample.mean()

def gan_hinge_loss_d(d_real, d_fake):
    """
    Hinge loss for GAN discriminator.

    d_real: (B,) discriminator output on real samples
    d_fake: (B,) discriminator output on fake samples
    """
    return (np.maximum(0, 1 - d_real) + np.maximum(0, 1 + d_fake)).mean()
```
- GAN discriminators (see gans/): SNGAN and BigGAN use hinge loss — it stabilises training by preventing the discriminator from “over-winning”
- SVMs: the original application. Still used in final layers of some classical ML pipelines
- Ranking / retrieval: hinge loss enforces that relevant items score above irrelevant ones by a margin
- Object detection (older architectures like R-CNN): multi-class hinge was used before cross-entropy became standard
- Structured prediction: max-margin approaches in NLP (structured SVMs) use hinge-style losses on structured outputs
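For the ranking case, PyTorch exposes the margin constraint directly as `F.margin_ranking_loss`. A small sketch (the scores are made up):

```python
import torch
import torch.nn.functional as F

pos = torch.tensor([2.0, 1.2])  # scores for relevant items
neg = torch.tensor([0.5, 1.0])  # scores for irrelevant items
target = torch.ones_like(pos)   # +1 means "pos should rank above neg"

# element-wise max(0, -target * (pos - neg) + margin), then mean
loss = F.margin_ranking_loss(pos, neg, target, margin=1.0)
# pair 0: gap 1.5 >= 1 -> 0;  pair 1: gap 0.2 < 1 -> 0.8;  mean = 0.4
print(loss)  # ~0.4
```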
| Alternative | When to use | Tradeoff |
| --- | --- | --- |
| Cross-entropy | Need probability estimates, standard classification | Smooth everywhere, always produces gradients; no margin property |
| WGAN / Wasserstein loss | GAN training with gradient penalty | Requires enforcing Lipschitz constraint (GP or spectral norm); theoretically cleaner gradients |
| Binary cross-entropy | GAN training (vanilla GAN) | Produces probabilities but can saturate — vanishing generator gradients when D is confident |
| Focal loss | Classification with severe class imbalance | Down-weights easy examples like hinge, but smoothly rather than with a hard cutoff |
| Contrastive / triplet loss | Metric learning with embeddings | Also margin-based but operates on distances in embedding space, not class scores |

Hinge loss originated in the support vector machine (SVM) framework, formalised by Vapnik and colleagues in the 1990s. The max-margin principle — finding the decision boundary with the largest gap between classes — was the dominant paradigm in machine learning before deep learning. SVMs with hinge loss were state-of-the-art on many benchmarks through the 2000s.

Hinge loss entered deep learning primarily through GANs. Lim and Ye (2017, “Geometric GAN”) provided theoretical motivation, and Miyato et al. (2018, “Spectral Normalisation for GANs”) demonstrated that combining hinge loss with spectral normalisation produced stable, high-quality GAN training. This combination became the default for BigGAN (Brock et al., 2019) and influenced most subsequent high-resolution GAN architectures.