Bits-per-Dimension
Normalised evaluation metric for generative models: the negative log-likelihood, converted to bits and divided by the number of dimensions, BPD = NLL_nats / (D * ln 2). Allows fair comparison across different image resolutions and between different model families (diffusion, VAEs, normalising flows). Lower is better — a BPD of 3.0 on 8-bit images means the model needs 3 bits per pixel on average.
Intuition
Imagine compressing an image pixel by pixel. Each pixel in an 8-bit image has 256 possible values, so a naive encoding needs 8 bits per pixel. If a generative model assigns higher probability to the true pixel values, it can compress more efficiently. BPD measures exactly this: how many bits per pixel (or per dimension) would an optimal arithmetic coder need, using the model’s predicted distribution as the code?
A BPD of 8.0 means the model is no better than uniform random — it assigns equal probability to all 256 values for each pixel. A BPD of 3.0 means the model has learned enough structure (edges, textures, semantics) to compress each pixel into about 3 bits on average. State-of-the-art models on CIFAR-10 achieve around 2.5 BPD, and natural images have an estimated entropy around 1-2 BPD depending on content.
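The uniform baseline can be checked directly: a model that assigns probability 1/256 to every value of every pixel has a total NLL of D * ln(256) nats, which converts to exactly 8 bits per dimension. A minimal sketch:

```python
import numpy as np

# Baseline: uniform model over 256 pixel values (the no-learning worst case).
D = 3 * 32 * 32                      # CIFAR-10 dimensions
nll_uniform = D * np.log(256.0)      # total NLL in nats: each pixel costs ln(256)
bpd = nll_uniform / (D * np.log(2))  # convert nats to bits, normalise by D
print(bpd)                           # 8.0
```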
The key advantage of BPD over raw NLL is resolution-independence. A 32x32 image has 3,072 dimensions; a 256x256 image has 196,608. Raw NLL scales with dimension count, making models trained on different resolutions incomparable. BPD divides out this factor, giving a per-dimension cost that can be compared directly.
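To see the normalisation at work, consider a hypothetical model that costs the same 3.0 bits per pixel at two resolutions: the raw NLL differs by a factor of 64, while the BPD is identical. A sketch with assumed numbers:

```python
import numpy as np

bits_per_pixel = 3.0  # hypothetical per-pixel cost of some model
for hw in (32, 256):
    D = 3 * hw * hw                            # 3,072 vs 196,608 dimensions
    nll_nats = bits_per_pixel * D * np.log(2)  # raw NLL scales with D
    bpd = nll_nats / (D * np.log(2))           # BPD divides the factor out
    print(f"{hw}x{hw}: NLL = {nll_nats:.0f} nats, BPD = {bpd:.1f}")
```

Both resolutions print BPD = 3.0, while the raw NLL values are not comparable.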
Definition (converting NLL from nats to bits per dimension):

BPD = NLL_nats / (D * ln 2)

where D is the total number of dimensions (e.g., D = H * W * C for images) and the ln 2 converts from nats (natural log) to bits (log base 2).

Equivalently, using log base 2 directly:

BPD = -log2 p(x) / D
For discrete data (e.g., quantised 8-bit images, the standard setup):

BPD = -log2 P(x) / D

where P(x) is the probability mass the model assigns to the discrete image. Models typically output continuous log-likelihoods, so you must add a dequantisation correction. With uniform dequantisation (adding u ~ U(0, 1) noise to each 8-bit pixel value, then scaling to [0, 1]):

NLL_discrete <= NLL_continuous + D * ln 256 (in nats)

More precisely, the convention is: compute NLL on dequantised data, then the D * ln 256 term accounts for the change of variables from [0, 255] integers to [0, 1] continuous values. Many implementations absorb this into the preprocessing.
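A quick way to check the sign convention is the uniform case: a density that is uniform on [0,1]^D has a continuous NLL of exactly 0 nats, so the correction term alone should recover the 8.0 BPD of a uniform discrete model. A minimal sanity-check sketch:

```python
import numpy as np

D = 3 * 32 * 32
nll_continuous = 0.0                             # uniform density on [0,1]^D: -log p = 0
nll_discrete = nll_continuous + D * np.log(256)  # change-of-variables correction
bpd = nll_discrete / (D * np.log(2))
print(bpd)  # 8.0 — matches the uniform baseline, so the sign is right
```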
For diffusion models, BPD is computed from the variational bound L_VLB (in nats, summed over all timesteps):

BPD = L_VLB / (D * ln 2)
```python
import torch

# ── BPD from NLL (nats) ─────────────────────────────────────────
# Most common case: model outputs NLL in nats, you want BPD.
nll_nats = model.nll(x)   # scalar — total NLL over all dims
D = x.shape[1:].numel()   # e.g., 3*32*32 = 3072 for CIFAR-10
bpd = nll_nats / (D * torch.log(torch.tensor(2.0)))  # scalar — bits per dimension

# ── BPD for diffusion models (from VLB) ─────────────────────────
# Diffusion models compute a variational lower bound (VLB) in nats.
vlb_nats = diffusion_model.compute_vlb(x)  # scalar — summed over all timesteps
bpd = vlb_nats / (D * torch.log(torch.tensor(2.0)))

# ── BPD with dequantisation correction ──────────────────────────
# For models trained on [0,1]-scaled 8-bit images with uniform dequantisation:
# the NLL is computed on continuous data in [0, 1], but we want BPD in
# the original [0, 255] discrete space.
nll_continuous = model.nll(x_dequantised)  # x in [0, 1] with U(0, 1/256) noise
log_256 = torch.log(torch.tensor(256.0))
bpd = (nll_continuous + D * log_256) / (D * torch.log(torch.tensor(2.0)))
# WARNING: the sign of the correction depends on your preprocessing convention.
# If x is in [0, 1]: ADD D*log(256). If x is in [0, 255]: no correction needed.
```

Manual Implementation
```python
import numpy as np

def bpd_from_nll_nats(nll_nats, n_dims):
    """
    Convert NLL (in nats) to bits-per-dimension.
    nll_nats: scalar or (B,) — negative log-likelihood in nats
    n_dims: int — total dimensions (e.g., 3*32*32 = 3072)
    Returns: scalar or (B,) — BPD
    """
    return nll_nats / (n_dims * np.log(2))

def bpd_from_log_probs(log_probs, n_dims):
    """
    BPD from log-probabilities (already in log base e).
    log_probs: (B,) — log p(x) per sample (negative values)
    n_dims: int
    Returns: scalar — mean BPD over batch
    """
    nll = -log_probs  # (B,) — positive values
    return (nll / (n_dims * np.log(2))).mean()  # scalar

def bpd_with_dequant_correction(nll_continuous, n_dims, n_bins=256):
    """
    BPD for models trained on dequantised [0,1] data from n_bins-level images.
    nll_continuous: scalar — NLL in nats on the continuous [0,1] data
    n_dims: int — total dimensions
    n_bins: int — quantisation levels (256 for 8-bit)
    """
    # Change of variables: p_discrete(x) = p_continuous(x/n_bins) / n_bins^D
    # => -log p_discrete = -log p_continuous + D*log(n_bins)
    nll_discrete = nll_continuous + n_dims * np.log(n_bins)
    return nll_discrete / (n_dims * np.log(2))
```

Popular Uses
- Diffusion models (DDPM, improved DDPM): BPD via the variational bound is the standard evaluation metric. Ho et al. (2020) reported 3.70 BPD on CIFAR-10; Nichol & Dhariwal (2021) improved to 2.94 BPD with learned variance
- Normalising flows (RealNVP, Glow, FFJORD): exact log-likelihood via change of variables makes BPD the natural and exact evaluation metric — no bounds needed
- VAEs (VAE, VQ-VAE): report BPD via the ELBO bound. Typically looser than flow-based models since the ELBO is only a lower bound on log-likelihood
- Autoregressive models (PixelCNN, PixelSNAIL): compute exact BPD by factoring the joint as a product of conditionals. PixelSNAIL achieved 2.85 BPD on CIFAR-10
- Lossless compression: BPD gives the theoretical compression rate achievable using the model as a prior in an arithmetic coder
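The compression reading above is concrete: multiplying BPD by the dimension count gives the theoretical compressed size an arithmetic coder could achieve with the model as its prior. A sketch with assumed numbers:

```python
# Theoretical lossless size implied by a model's BPD (assumed values).
bpd = 3.0                  # hypothetical model BPD on 8-bit images
D = 3 * 32 * 32            # CIFAR-10: 3,072 dimensions
compressed_bits = bpd * D  # 9,216 bits
raw_bits = 8 * D           # 24,576 bits for naive 8-bit encoding
print(compressed_bits / 8, raw_bits / 8)  # 1152.0 vs 3072.0 bytes
```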
Alternatives
| Alternative | When to use | Tradeoff |
|---|---|---|
| FID (Frechet Inception Distance) | Evaluating perceptual quality of generated samples | Measures sample quality, not likelihood. A model can have good FID but poor BPD and vice versa |
| Raw NLL (nats or bits) | When comparing models at the same resolution only | Not comparable across resolutions; BPD normalises this away |
| ELBO | Training VAEs and diffusion models | A lower bound on log-likelihood — gap between ELBO and true NLL can be large |
| Perplexity | Language models (discrete tokens) | Perplexity = 2^(bits per token) = e^(nats per token). Same information, different scale — convention in NLP |
| IS (Inception Score) | Quick sample quality check | Only measures quality and diversity of generated samples, ignores coverage of real data |
| Bits-per-character / bits-per-token | Text generation | Same concept as BPD but D = sequence length, not pixel dimensions |
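The perplexity row in the table is purely a change of base, which a short sketch makes explicit (the per-token loss here is an assumed value):

```python
import numpy as np

nats_per_token = 3.0                        # assumed cross-entropy in nats
bits_per_token = nats_per_token / np.log(2)
perplexity = np.exp(nats_per_token)         # e^(nats per token)
# The two views agree: 2^(bits per token) == e^(nats per token)
print(np.isclose(perplexity, 2.0 ** bits_per_token))  # True
```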
Historical Context
Bits-per-dimension comes directly from Shannon’s source coding theorem and the connection between probabilistic models and lossless compression. The metric became standard in generative modelling through the normalising-flows literature (Dinh et al., 2015, NICE; 2017, RealNVP), where exact log-likelihoods made per-dimension comparison natural and precise.
The diffusion modelling community adopted BPD as a primary metric starting with Ho et al. (2020, DDPM), who showed that diffusion models could achieve competitive BPD despite optimising a simplified surrogate loss. Nichol and Dhariwal (2021, “Improved Denoising Diffusion Probabilistic Models”) demonstrated that learning the variance schedule could close the gap between the simplified loss and the true variational bound, bringing diffusion BPD close to strong autoregressive models. The tension between BPD and FID remains an active discussion — optimising for one does not guarantee the other, reflecting the fundamental difference between likelihood (did the model assign high probability to real data?) and sample quality (do generated samples look good?).