Bits-per-Dimension
Normalised evaluation metric for generative models: the negative log-likelihood, converted to bits and divided by the number of dimensions, BPD = NLL_nats / (D * ln 2). Allows fair comparison across different image resolutions and between different model families (diffusion, VAEs, normalising flows). Lower is better — a BPD of 3.0 on 8-bit images means the model needs 3 bits per pixel on average.
Intuition
Imagine compressing an image pixel by pixel. Each pixel in an 8-bit image has 256 possible values, so a naive encoding needs 8 bits per pixel. If a generative model assigns higher probability to the true pixel values, it can compress more efficiently. BPD measures exactly this: how many bits per pixel (or per dimension) would an optimal arithmetic coder need, using the model’s predicted distribution as the code?
A BPD of 8.0 means the model is no better than uniform random — it assigns equal probability to all 256 values for each pixel. A BPD of 3.0 means the model has learned enough structure (edges, textures, semantics) to compress each pixel into about 3 bits on average. State-of-the-art models on CIFAR-10 achieve around 2.5 BPD, and natural images have an estimated entropy around 1-2 BPD depending on content.
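The uniform baseline can be checked directly: a model that assigns probability 1/256 to every value of every pixel has a total NLL of D * ln(256) nats, which converts to exactly 8 bits per dimension. A minimal sketch:

```python
import numpy as np

# Baseline: uniform model over 256 pixel values (the no-learning worst case).
D = 3 * 32 * 32                      # CIFAR-10 dimensions
nll_uniform = D * np.log(256.0)      # total NLL in nats: each pixel costs ln(256)
bpd = nll_uniform / (D * np.log(2))  # convert nats to bits, normalise by D
print(bpd)                           # 8.0
```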
The key advantage of BPD over raw NLL is resolution-independence. A 32x32 image has 3,072 dimensions; a 256x256 image has 196,608. Raw NLL scales with dimension count, making models trained on different resolutions incomparable. BPD divides out this factor, giving a per-dimension cost that can be compared directly.
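To see the normalisation at work, consider a hypothetical model that costs the same 3.0 bits per pixel at two resolutions: the raw NLL differs by a factor of 64, while the BPD is identical. A sketch with assumed numbers:

```python
import numpy as np

bits_per_pixel = 3.0  # hypothetical per-pixel cost of some model
for hw in (32, 256):
    D = 3 * hw * hw                            # 3,072 vs 196,608 dimensions
    nll_nats = bits_per_pixel * D * np.log(2)  # raw NLL scales with D
    bpd = nll_nats / (D * np.log(2))           # BPD divides the factor out
    print(f"{hw}x{hw}: NLL = {nll_nats:.0f} nats, BPD = {bpd:.1f}")
```

Both resolutions print BPD = 3.0, while the raw NLL values are not comparable.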
Definition (converting NLL from nats to bits per dimension):

BPD = NLL_nats / (D * ln 2)

where D is the total number of dimensions (e.g., D = H * W * C for images) and the ln 2 converts from nats (natural log) to bits (log base 2).

Equivalently, using log base 2 directly:

BPD = -log2 p(x) / D
For discrete data (e.g., quantised 8-bit images, the standard setup):

BPD = -log2 P(x) / D

where P(x) is the probability mass the model assigns to the discrete image. Models typically output continuous log-likelihoods, so you must add a dequantisation correction. With uniform dequantisation (adding u ~ U(0, 1) noise to each 8-bit pixel value, then scaling to [0, 1]):

NLL_discrete <= NLL_continuous + D * ln 256 (in nats)

More precisely, the convention is: compute NLL on dequantised data, then the D * ln 256 term accounts for the change of variables from [0, 255] integers to [0, 1] continuous values. Many implementations absorb this into the preprocessing.
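A quick way to check the sign convention is the uniform case: a density that is uniform on [0,1]^D has a continuous NLL of exactly 0 nats, so the correction term alone should recover the 8.0 BPD of a uniform discrete model. A minimal sanity-check sketch:

```python
import numpy as np

D = 3 * 32 * 32
nll_continuous = 0.0                             # uniform density on [0,1]^D: -log p = 0
nll_discrete = nll_continuous + D * np.log(256)  # change-of-variables correction
bpd = nll_discrete / (D * np.log(2))
print(bpd)  # 8.0 — matches the uniform baseline, so the sign is right
```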
For diffusion models, BPD is computed from the variational bound L_VLB (in nats, summed over all timesteps):

BPD = L_VLB / (D * ln 2)
```python
import torch

# ── BPD from NLL (nats) ─────────────────────────────────────────
# Most common case: model outputs NLL in nats, you want BPD.
nll_nats = model.nll(x)   # scalar — total NLL over all dims
D = x.shape[1:].numel()   # e.g., 3*32*32 = 3072 for CIFAR-10
bpd = nll_nats / (D * torch.log(torch.tensor(2.0)))  # scalar — bits per dimension

# ── BPD for diffusion models (from VLB) ─────────────────────────
# Diffusion models compute a variational lower bound (VLB) in nats.
vlb_nats = diffusion_model.compute_vlb(x)  # scalar — summed over all timesteps
bpd = vlb_nats / (D * torch.log(torch.tensor(2.0)))

# ── BPD with dequantisation correction ──────────────────────────
# For models trained on [0,1]-scaled 8-bit images with uniform dequantisation:
# the NLL is computed on continuous data in [0, 1], but we want BPD in
# the original [0, 255] discrete space.
nll_continuous = model.nll(x_dequantised)  # x in [0, 1] with U(0, 1/256) noise
log_256 = torch.log(torch.tensor(256.0))
bpd = (nll_continuous + D * log_256) / (D * torch.log(torch.tensor(2.0)))
# WARNING: the sign of the correction depends on your preprocessing convention.
# If x is in [0, 1]: ADD D*log(256). If x is in [0, 255]: no correction needed.
```

Manual Implementation
```python
import numpy as np

def bpd_from_nll_nats(nll_nats, n_dims):
    """
    Convert NLL (in nats) to bits-per-dimension.
    nll_nats: scalar or (B,) — negative log-likelihood in nats
    n_dims: int — total dimensions (e.g., 3*32*32 = 3072)
    Returns: scalar or (B,) — BPD
    """
    return nll_nats / (n_dims * np.log(2))

def bpd_from_log_probs(log_probs, n_dims):
    """
    BPD from log-probabilities (already in log base e).
    log_probs: (B,) — log p(x) per sample (negative values)
    n_dims: int
    Returns: scalar — mean BPD over batch
    """
    nll = -log_probs  # (B,) — positive values
    return (nll / (n_dims * np.log(2))).mean()  # scalar

def bpd_with_dequant_correction(nll_continuous, n_dims, n_bins=256):
    """
    BPD for models trained on dequantised [0,1] data from n_bins-level images.
    nll_continuous: scalar — NLL in nats on the continuous [0,1] data
    n_dims: int — total dimensions
    n_bins: int — quantisation levels (256 for 8-bit)
    """
    # Change of variables: p_discrete(x) = p_continuous(x/n_bins) / n_bins^D
    # => -log p_discrete = -log p_continuous + D*log(n_bins)
    nll_discrete = nll_continuous + n_dims * np.log(n_bins)
    return nll_discrete / (n_dims * np.log(2))
```

Popular Uses
- Diffusion models (DDPM, improved DDPM): BPD via the variational bound is the standard evaluation metric. Ho et al. (2020) reported 3.70 BPD on CIFAR-10; Nichol & Dhariwal (2021) improved to 2.94 BPD with learned variance
- Normalising flows (RealNVP, Glow, FFJORD): exact log-likelihood via change of variables makes BPD the natural and exact evaluation metric — no bounds needed
- VAEs (VAE, VQ-VAE): report BPD via the ELBO bound. Typically looser than flow-based models since the ELBO is only a lower bound on log-likelihood
- Autoregressive models (PixelCNN, PixelSNAIL): compute exact BPD by factoring the joint as a product of conditionals. PixelSNAIL achieved 2.85 BPD on CIFAR-10
- Lossless compression: BPD gives the theoretical compression rate achievable using the model as a prior in an arithmetic coder
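The compression reading above is concrete: multiplying BPD by the dimension count gives the theoretical compressed size an arithmetic coder could achieve with the model as its prior. A sketch with assumed numbers:

```python
# Theoretical lossless size implied by a model's BPD (assumed values).
bpd = 3.0                  # hypothetical model BPD on 8-bit images
D = 3 * 32 * 32            # CIFAR-10: 3,072 dimensions
compressed_bits = bpd * D  # 9,216 bits
raw_bits = 8 * D           # 24,576 bits for naive 8-bit encoding
print(compressed_bits / 8, raw_bits / 8)  # 1152.0 vs 3072.0 bytes
```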
Alternatives
| Alternative | When to use | Tradeoff |
|---|---|---|
| FID (Frechet Inception Distance) | Evaluating perceptual quality of generated samples | Measures sample quality, not likelihood. A model can have good FID but poor BPD and vice versa |
| Raw NLL (nats or bits) | When comparing models at the same resolution only | Not comparable across resolutions; BPD normalises this away |
| ELBO | Training VAEs and diffusion models | A lower bound on log-likelihood — gap between ELBO and true NLL can be large |
| Perplexity | Language models (discrete tokens) | Perplexity = 2^(bits per token) = e^(nats per token). Same information, different scale — convention in NLP |
| IS (Inception Score) | Quick sample quality check | Only measures quality and diversity of generated samples, ignores coverage of real data |
| Bits-per-character / bits-per-token | Text generation | Same concept as BPD but D = sequence length, not pixel dimensions |
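The perplexity row in the table is purely a change of base, which a short sketch makes explicit (the per-token loss here is an assumed value):

```python
import numpy as np

nats_per_token = 3.0                        # assumed cross-entropy in nats
bits_per_token = nats_per_token / np.log(2)
perplexity = np.exp(nats_per_token)         # e^(nats per token)
# The two views agree: 2^(bits per token) == e^(nats per token)
print(np.isclose(perplexity, 2.0 ** bits_per_token))  # True
```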
Historical Context
Bits-per-dimension comes directly from Shannon’s source coding theorem and the connection between probabilistic models and lossless compression. The metric became standard in generative modelling through the normalising-flows literature (Dinh et al., 2015, NICE; 2017, RealNVP), where exact log-likelihoods made per-dimension comparison natural and precise.
The diffusion modelling community adopted BPD as a primary metric starting with Ho et al. (2020, DDPM), who showed that diffusion models could achieve competitive BPD despite optimising a simplified surrogate loss. Nichol and Dhariwal (2021, “Improved Denoising Diffusion Probabilistic Models”) demonstrated that learning the variance schedule could close the gap between the simplified loss and the true variational bound, bringing diffusion BPD close to strong autoregressive models. The tension between BPD and FID remains an active discussion — optimising for one does not guarantee the other, reflecting the fundamental difference between likelihood (did the model assign high probability to real data?) and sample quality (do generated samples look good?).