MSE / Huber Loss

Mean squared error (MSE), also called L2 loss, and its robust cousin Huber loss (smooth L1) are the standard losses for regression tasks: predicting continuous values like bounding box coordinates, noise in diffusion models, or Q-values in reinforcement learning. (Not to be confused with L2 regularisation / weight decay, which penalises weight magnitudes; see regularisation/weight-decay.md.)

MSE asks: “how far off were you, squared?” Squaring has two effects. First, it penalises large errors much more than small ones — an error of 10 costs 100x more than an error of 1. This makes MSE aggressively chase outliers. Second, it gives a smooth, everywhere-differentiable loss surface with a clean gradient: just the error itself (times 2).
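The "gradient is just the error" claim is easy to check numerically; a minimal sketch (the values here are illustrative, not from the text):

```python
import numpy as np

# Gradient of mean((pred - target)**2) w.r.t. pred is 2/N * (pred - target).
pred = np.array([3.0, 0.5])
target = np.array([1.0, 0.0])
error = pred - target                # [2.0, 0.5]

loss = np.mean(error ** 2)           # (4.0 + 0.25) / 2 = 2.125
grad = 2 * error / error.size        # scales with the error: [2.0, 0.5]
```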

The problem is that squaring cuts both ways. If your data has outliers or noisy targets, MSE will warp the entire model to reduce those few extreme errors. Huber loss fixes this by being quadratic for small errors (behaving like MSE) and linear for large errors (behaving like MAE/L1). The transition point, delta, controls where “small” ends and “large” begins (PyTorch’s smooth_l1_loss calls it beta and defaults to 1). Below delta, you get MSE-like gradients that shrink as you approach zero (good for precise convergence). Above delta, you get constant-magnitude gradients that don’t explode on outliers.

In deep RL, this distinction matters: Q-learning targets are notoriously noisy (they depend on the max over a changing network), so DQN uses Huber loss instead of MSE to avoid destabilising gradient spikes.
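The capping effect is visible on a single sample; a sketch with an outlier error of 10 and delta = 1 (illustrative values):

```python
import numpy as np

delta, e = 1.0, 10.0                     # one outlier error
mse_grad = 2 * e                         # 20.0: grows linearly with the error
huber_grad = np.clip(e, -delta, delta)   # 1.0: capped at +/- delta
```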

MSE / L2 loss (mean squared error):

$$\mathcal{L}_{\text{MSE}} = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2$$

Gradient with respect to prediction $\hat{y}_i$: $\frac{\partial \mathcal{L}}{\partial \hat{y}_i} = \frac{2}{N}(\hat{y}_i - y_i)$. The gradient is proportional to the error — large errors get large gradients.

Huber loss (smooth L1):

$$L_\delta(e) = \begin{cases} \frac{1}{2} e^2 & \text{if } |e| \leq \delta \\ \delta \cdot (|e| - \frac{1}{2}\delta) & \text{if } |e| > \delta \end{cases}$$

where $e = y - \hat{y}$. For $|e| \leq \delta$, the gradient is $e$ (like MSE). For $|e| > \delta$, the gradient is $\pm\delta$ (constant magnitude, like L1).

MAE / L1 loss (for comparison):

$$\mathcal{L}_{\text{MAE}} = \frac{1}{N} \sum_{i=1}^{N} |y_i - \hat{y}_i|$$

Gradient is $\pm 1$ everywhere (except at zero, where it’s undefined). Robust to outliers but doesn’t converge as precisely near zero because the gradient never shrinks.

```python
import torch
import torch.nn.functional as F

# ── MSE loss ─────────────────────────────────────────────────────
pred = model(x)                        # (B, D) — predictions
target = y                             # (B, D) — ground truth
loss = F.mse_loss(pred, target)        # scalar, reduction='mean'

# ── Huber / smooth L1 (default beta=1.0) ─────────────────────────
loss = F.smooth_l1_loss(pred, target)  # scalar

# Custom transition point (called 'beta' in PyTorch):
loss = F.smooth_l1_loss(pred, target, beta=0.5)

# ── Huber loss (explicit, equivalent but different API) ──────────
loss = F.huber_loss(pred, target, delta=1.0)

# WARNING: when delta != 1, F.huber_loss(delta=d) equals
# d * F.smooth_l1_loss(beta=d) — the two conventions differ by a
# factor of delta in both regions. Use huber_loss if you want the
# standard textbook Huber; use smooth_l1_loss if following
# DQN-style papers that expect the smooth L1 convention.
```
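The factor between the two conventions can be checked numerically from their definitions; a small sketch with an arbitrarily chosen delta = beta = 0.5 and one error in each region:

```python
import numpy as np

def smooth_l1(e, beta):
    # PyTorch smooth_l1_loss convention: quadratic part divided by beta
    return np.where(np.abs(e) < beta, 0.5 * e**2 / beta, np.abs(e) - 0.5 * beta)

def huber(e, delta):
    # Textbook Huber convention (what F.huber_loss implements)
    return np.where(np.abs(e) <= delta, 0.5 * e**2, delta * (np.abs(e) - 0.5 * delta))

e = np.array([0.2, 3.0])          # quadratic region, linear region
a = huber(e, 0.5)                 # [0.02, 1.375]
b = 0.5 * smooth_l1(e, 0.5)       # same values: the gap is exactly the factor delta
```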
```python
import numpy as np

def mse_loss(pred, target):
    """
    Equivalent to F.mse_loss(pred, target, reduction='mean').
    pred:   (B, D) or (B,) predictions
    target: same shape as pred
    """
    return ((pred - target) ** 2).mean()

def huber_loss(pred, target, delta=1.0):
    """
    Equivalent to F.huber_loss(pred, target, delta=delta).
    Quadratic for |error| <= delta, linear beyond.
    pred:   (B, D) predictions
    target: (B, D) ground truth
    """
    error = pred - target                                   # (B, D)
    abs_error = np.abs(error)                               # (B, D)
    quadratic = 0.5 * error ** 2                            # (B, D)
    linear = delta * (abs_error - 0.5 * delta)              # (B, D)
    loss = np.where(abs_error <= delta, quadratic, linear)  # (B, D)
    return loss.mean()
```
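A hand-checked sanity example for the Huber definition, inlined so it stands alone (delta = 1, one error in each region; inputs are illustrative):

```python
import numpy as np

pred, target = np.array([0.5, 2.0]), np.zeros(2)
abs_error = np.abs(pred - target)
# per-element: 0.5 * 0.5**2 = 0.125 (quadratic) and 2.0 - 0.5 = 1.5 (linear)
loss = np.where(abs_error <= 1.0, 0.5 * abs_error**2, abs_error - 0.5).mean()  # 0.8125
```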
  • Diffusion noise prediction (see diffusion/): MSE between predicted and actual noise is the core DDPM training objective — $\|\epsilon - \epsilon_\theta(x_t, t)\|^2$
  • Q-learning (see q-learning/): Huber loss between Q-values and bootstrap targets. DQN switched from MSE to Huber to tame noisy target gradients
  • Bounding box regression (Faster R-CNN, YOLO): smooth L1 for predicting box coordinates — robust to annotation noise
  • Autoencoders / VAEs (see variational-inference-vae/): MSE reconstruction loss when pixel-level fidelity matters (vs. cross-entropy for binary images)
  • Regression heads in multi-task models (e.g. predicting age, price, temperature)
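To make the Q-learning case concrete, a toy TD-error computation with Huber loss (all shapes and values here are hypothetical, not taken from DQN itself):

```python
import numpy as np

# Hypothetical batch of two transitions.
q_pred = np.array([1.0, 5.0])                  # Q(s, a) from the online network
q_next = np.array([2.0, 0.0])                  # max_a' Q_target(s', a')
reward = np.array([1.0, 1.0])
done = np.array([0.0, 1.0])                    # 1.0 where the episode ended
gamma = 0.99

td_target = reward + gamma * q_next * (1.0 - done)                       # [2.98, 1.0]
abs_err = np.abs(q_pred - td_target)                                     # [1.98, 4.0]
loss = np.where(abs_err <= 1.0, 0.5 * abs_err**2, abs_err - 0.5).mean()  # Huber, delta=1
```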
| Alternative | When to use | Tradeoff |
| --- | --- | --- |
| Cross-entropy | Classification (discrete targets) | Probabilistic interpretation; not applicable to continuous targets |
| MAE / L1 loss | Need maximum outlier robustness | Constant gradient doesn’t shrink near the optimum — slower final convergence |
| Log-cosh loss | Want smooth L1-like behaviour without the piecewise definition | Approximately MSE for small errors, L1 for large; twice differentiable everywhere |
| Quantile loss | Predicting intervals or specific percentiles | Asymmetric — penalises over/under-prediction differently based on the quantile |
| Cosine similarity loss | Comparing directions, not magnitudes (embeddings) | Ignores scale entirely; only measures angular distance |
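The log-cosh alternative from the table can be sketched directly: $\log\cosh(e) \approx \frac{1}{2}e^2$ for small errors and $\approx |e| - \log 2$ for large ones. A naive version (note that np.cosh overflows for |e| above roughly 710, so a numerically stable form is preferable in practice):

```python
import numpy as np

def log_cosh(e):
    # Naive per-element log-cosh; reduce with .mean() as usual
    return np.log(np.cosh(e))

small = log_cosh(0.1)    # ~ 0.5 * 0.1**2 = 0.005  (MSE-like)
large = log_cosh(10.0)   # ~ 10 - log(2) ~= 9.307  (L1-like)
```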

MSE traces back to Gauss and Legendre (early 1800s) as the foundation of least-squares estimation. It became the default neural network loss for regression because it corresponds to maximum likelihood under Gaussian noise assumptions and has clean, well-behaved gradients.

Huber loss was introduced by Peter Huber in 1964 in robust statistics, specifically to reduce the influence of outliers in estimation. It entered deep learning through DQN (Mnih et al., 2015), where the smooth L1 variant stabilised Q-learning by capping gradient magnitudes from noisy bootstrap targets. The distinction between the Huber and smooth L1 conventions (a factor of delta between the two) continues to cause minor confusion across frameworks, so always check which convention your codebase uses.