
Hinge Loss

A margin-based loss that penalises predictions only when they fall within or on the wrong side of a decision margin. The classic SVM loss, now widely used in GAN discriminators (hinge GAN, SNGAN) and ranking tasks. Does not produce probability estimates — only enforces a margin.

Cross-entropy says “put as much probability as possible on the correct class — always.” Hinge loss says “get the correct class score at least margin M above the others, and after that I don’t care.” Once a sample is correctly classified with sufficient confidence, its loss is exactly zero and it produces no gradient. The model stops spending capacity on easy examples and focuses entirely on the hard ones near the boundary.
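This zero-loss, zero-gradient behaviour is easy to verify directly. A minimal sketch with a single scalar score (the value 2.5 is made up for illustration):

```python
import torch

# One sample with label y = +1 and raw score f(x) = 2.5,
# comfortably beyond the margin of 1.
f = torch.tensor([2.5], requires_grad=True)
y = 1.0

loss = torch.clamp(1 - y * f, min=0).sum()  # binary hinge: max(0, 1 - y*f(x))
loss.backward()

print(loss.item())  # 0.0 -- margin satisfied, loss is exactly zero
print(f.grad)       # tensor([0.]) -- the sample contributes no gradient
```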

This “stop once you’re confident enough” property has a practical consequence: hinge loss produces sparser gradients. In a well-trained model, most samples contribute zero loss. This makes hinge loss appealing for GAN discriminators — you don’t want the discriminator to keep pushing its outputs to infinity on easy real/fake samples, which can cause vanishing gradients for the generator. The hinge formulation naturally saturates, giving the generator useful gradients throughout training.

The margin is typically set to 1. Scaling it is equivalent to scaling the regularisation, so there is no reason to tune it separately.
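To see why, consider the regularised objective with a general margin $M$ (a standard argument, sketched here for the linear case $f(x) = w \cdot x$):

$$\min_w \; \sum_i \max(0, M - y_i \, w \cdot x_i) + \lambda \|w\|^2$$

Substituting $w = M \tilde{w}$ gives

$$M \left[ \sum_i \max(0, 1 - y_i \, \tilde{w} \cdot x_i) + \lambda M \|\tilde{w}\|^2 \right],$$

which is the margin-1 objective with regularisation strength $\lambda M$ (the overall factor $M$ does not change the minimiser). Tuning $M$ and tuning $\lambda$ are therefore redundant.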

Binary hinge loss (SVM-style, labels $y \in \{-1, +1\}$):

$$\mathcal{L} = \max(0, 1 - y \cdot f(x))$$

where $f(x)$ is the raw score (not a probability). If $y \cdot f(x) \geq 1$, the loss is zero.
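A minimal vectorised PyTorch sketch of this formula (scores and labels are illustrative):

```python
import torch

scores = torch.tensor([2.0, 0.3, -1.5])   # raw scores f(x), no sigmoid
labels = torch.tensor([1.0, 1.0, -1.0])   # y in {-1, +1}

losses = torch.clamp(1 - labels * scores, min=0)
# sample 0: y*f = 2.0 >= 1 -> loss 0
# sample 1: y*f = 0.3 <  1 -> loss 0.7
# sample 2: y*f = 1.5 >= 1 -> loss 0
print(losses)         # tensor([0.0000, 0.7000, 0.0000])
print(losses.mean())  # ~0.2333
```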

Multi-class hinge loss (Weston-Watkins formulation):

$$\mathcal{L} = \sum_{j \neq y} \max(0, f_j(x) - f_y(x) + 1)$$

Penalises every incorrect class $j$ whose score is within margin 1 of the correct class score.

GAN hinge loss (Miyato et al., SNGAN):

For the discriminator $D$ with real samples $x_r$ and fake samples $x_f$:

$$\mathcal{L}_D = \mathbb{E}[\max(0, 1 - D(x_r))] + \mathbb{E}[\max(0, 1 + D(x_f))]$$

For the generator:

$$\mathcal{L}_G = -\mathbb{E}[D(G(z))]$$

The discriminator only trains on samples it isn’t already classifying correctly with margin 1.

```python
import torch
import torch.nn.functional as F

# ── Multi-class hinge (SVM-style) ────────────────────────────────
# PyTorch has no built-in multi-class hinge module. Use multi_margin_loss:
logits = model(x)                                        # (B, C)
targets = labels                                         # (B,) integer indices
loss = F.multi_margin_loss(logits, targets, margin=1.0)  # scalar

# ── GAN hinge loss (discriminator) ───────────────────────────────
d_real = discriminator(real_images)           # (B,) or (B, 1)
d_fake = discriminator(fake_images.detach())  # (B,) or (B, 1)
loss_d = (F.relu(1 - d_real) + F.relu(1 + d_fake)).mean()

# ── GAN hinge loss (generator) ──────────────────────────────────
d_fake = discriminator(generator(z))          # (B,)
loss_g = -d_fake.mean()

# Note: the generator loss is just "make the discriminator output
# large values for fakes." No hinge here — only the discriminator
# uses the margin.
```
```python
import numpy as np

def multiclass_hinge_loss(scores, targets):
    """
    Multi-class hinge loss (Weston-Watkins).

    Matches F.multi_margin_loss with margin=1, except that PyTorch
    also divides each sample's loss by the number of classes C.

    scores:  (B, C) raw class scores
    targets: (B,) integer class indices
    """
    B, C = scores.shape
    correct_scores = scores[np.arange(B), targets]        # (B,)
    margins = scores - correct_scores[:, None] + 1.0      # (B, C)
    margins[np.arange(B), targets] = 0.0                  # don't count correct class
    loss_per_sample = np.maximum(0, margins).sum(axis=1)  # (B,)
    return loss_per_sample.mean()

def gan_hinge_loss_d(d_real, d_fake):
    """
    Hinge loss for GAN discriminator.

    d_real: (B,) discriminator output on real samples
    d_fake: (B,) discriminator output on fake samples
    """
    return (np.maximum(0, 1 - d_real) + np.maximum(0, 1 + d_fake)).mean()
```
- GAN discriminators (see gans/): SNGAN and BigGAN use hinge loss — it stabilises training by preventing the discriminator from “over-winning”
- SVMs: the original application. Still used in final layers of some classical ML pipelines
- Ranking / retrieval: hinge loss enforces that relevant items score above irrelevant ones by a margin
- Object detection (older architectures like R-CNN): multi-class hinge was used before cross-entropy became standard
- Structured prediction: max-margin approaches in NLP (structured SVMs) use hinge-style losses on structured outputs
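For the ranking case, PyTorch exposes the margin constraint directly as `F.margin_ranking_loss`. A small sketch (the scores are made up):

```python
import torch
import torch.nn.functional as F

pos = torch.tensor([2.0, 1.2])  # scores for relevant items
neg = torch.tensor([0.5, 1.0])  # scores for irrelevant items
target = torch.ones_like(pos)   # +1 means "pos should rank above neg"

# element-wise max(0, -target * (pos - neg) + margin), then mean
loss = F.margin_ranking_loss(pos, neg, target, margin=1.0)
# pair 0: gap 1.5 >= 1 -> 0;  pair 1: gap 0.2 < 1 -> 0.8;  mean = 0.4
print(loss)  # ~0.4
```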
| Alternative | When to use | Tradeoff |
| --- | --- | --- |
| Cross-entropy | Need probability estimates, standard classification | Smooth everywhere, always produces gradients; no margin property |
| WGAN / Wasserstein loss | GAN training with gradient penalty | Requires enforcing Lipschitz constraint (GP or spectral norm); theoretically cleaner gradients |
| Binary cross-entropy | GAN training (vanilla GAN) | Produces probabilities but can saturate — vanishing generator gradients when D is confident |
| Focal loss | Classification with severe class imbalance | Down-weights easy examples like hinge, but smoothly rather than with a hard cutoff |
| Contrastive / triplet loss | Metric learning with embeddings | Also margin-based but operates on distances in embedding space, not class scores |

Hinge loss originated in the support vector machine (SVM) framework, formalised by Vapnik and colleagues in the 1990s. The max-margin principle — finding the decision boundary with the largest gap between classes — was the dominant paradigm in machine learning before deep learning. SVMs with hinge loss were state-of-the-art on many benchmarks through the 2000s.

Hinge loss entered deep learning primarily through GANs. Lim and Ye (2017, “Geometric GAN”) provided theoretical motivation, and Miyato et al. (2018, “Spectral Normalisation for GANs”) demonstrated that combining hinge loss with spectral normalisation produced stable, high-quality GAN training. This combination became the default for BigGAN (Brock et al., 2019) and influenced most subsequent high-resolution GAN architectures.