Information Bottleneck Tradeoff
Compression versus prediction: a representation that aggressively discards information to be compact may also discard task-relevant signal. Too much compression and you lose predictive power; too little and you retain noise and nuisance variation. Every representation learning method navigates this tension.
Intuition
Imagine summarising a 300-page novel into a 1-page summary. You must decide what to keep and what to cut. A good summary retains the key plot points and character arcs (task-relevant information) while discarding scene descriptions and dialogue (nuisance detail). But how do you know what’s “task-relevant” without knowing the task in advance? A summary optimised for “what happens in the plot?” would look different from one optimised for “what’s the author’s writing style?”
The information bottleneck formalises this: given input X and target Y, find a representation Z that maximally preserves information about Y (predictive power) while minimally preserving information about X (compression). The tradeoff parameter β controls the balance — low β gives aggressive compression (small latent, potential information loss), high β gives faithful encoding (large latent, potential overfitting to noise).
This isn’t just theoretical. In VAEs, β literally controls this tradeoff via the KL weight. In VQ-VAE, the codebook size is the bottleneck. In dropout, the drop rate controls how much information flows through. Every architecture and training objective makes implicit or explicit choices about what to keep and what to discard.
The information bottleneck objective (Tishby, Pereira & Bialek, 1999):

minimise over the encoder p(z|x):  L = I(X; Z) − β I(Z; Y)

where I(X; Z) is the mutual information between input and representation (compression cost), and I(Z; Y) is the mutual information between representation and target (predictive value). β controls the tradeoff.
- β → 0: maximally compress — Z carries no information about Y (trivial, useless)
- β → ∞: no compression — Z retains everything from X (no regularisation)
- Optimal β: retain only the information in X that’s relevant to Y
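To make the objective concrete, here is a minimal numpy sketch that evaluates L = I(X; Z) − β I(Z; Y) for a discrete toy problem. The joint distribution, the two candidate encoders, and the β value are all illustrative assumptions, not from the source:

```python
# Sketch: evaluating the IB objective L = I(X;Z) - beta * I(Z;Y)
# on a discrete toy problem. Distributions and encoders are made up.
import numpy as np

def mutual_information(p_joint):
    """I(A;B) in nats, computed from a joint distribution p(a, b)."""
    p_a = p_joint.sum(axis=1, keepdims=True)
    p_b = p_joint.sum(axis=0, keepdims=True)
    mask = p_joint > 0
    return float((p_joint[mask] * np.log(p_joint[mask] / (p_a @ p_b)[mask])).sum())

# Joint p(x, y): 4 inputs, 2 labels; the parity of x predicts y.
p_xy = np.array([[0.22, 0.03],
                 [0.03, 0.22],
                 [0.22, 0.03],
                 [0.03, 0.22]])

def ib_objective(p_z_given_x, beta):
    p_x = p_xy.sum(axis=1)               # marginal p(x)
    p_xz = p_z_given_x * p_x[:, None]    # joint p(x, z)
    p_zy = p_z_given_x.T @ p_xy          # joint p(z, y)
    return mutual_information(p_xz) - beta * mutual_information(p_zy)

# Encoder 1: Z keeps only the parity of X (the task-relevant bit).
enc_parity = np.array([[1, 0], [0, 1], [1, 0], [0, 1]], float)
# Encoder 2: Z keeps which half X falls in (a task-irrelevant bit).
enc_half = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], float)

for name, enc in [("parity", enc_parity), ("half", enc_half)]:
    print(name, round(ib_objective(enc, beta=5.0), 3))
# Both encoders pay the same compression cost I(X;Z) = log 2, but the
# parity encoder scores lower (better) because it retains the one bit
# of X that actually predicts Y.
```

Both representations compress X to a single bit; the objective distinguishes them purely by how much of that bit is relevant to Y.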
Manifestation
- In VAEs: increasing β (stronger KL penalty) improves disentanglement but degrades reconstruction — the latent retains less information about the input (note that β-VAE’s β weights the compression term, so its direction is opposite to the IB β above)
- In VQ-VAE: too few codebook entries → blurry reconstructions (over-compressed); too many → codebook collapses to unused entries
- In dropout: high drop rate → underfitting (too much information destroyed); low drop rate → overfitting (too little regularisation)
- In quantisation: aggressive bit reduction saves compute but degrades model quality — the information-accuracy tradeoff curve is the practical manifestation
- Representation quality is task-dependent — features that are excellent for classification may be poor for generation, and vice versa
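The VAE manifestation can be sketched directly: the β-VAE loss is reconstruction plus β times the KL between the approximate posterior and the prior, so β scales the rate (compression) term. A pure-numpy illustration with a diagonal-Gaussian posterior; the tensors below are random stand-ins for encoder/decoder outputs, not a trained model:

```python
# Sketch: how beta weights the rate (KL) term in a beta-VAE loss.
# All tensors here are random placeholders for model outputs.
import numpy as np

def kl_diag_gaussian(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), per example."""
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var, axis=-1)

def beta_vae_loss(x, x_recon, mu, log_var, beta):
    recon = np.sum((x - x_recon) ** 2, axis=-1)  # distortion term
    kl = kl_diag_gaussian(mu, log_var)           # rate term
    return float(np.mean(recon + beta * kl))

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 16))
x_recon = x + 0.1 * rng.normal(size=(8, 16))     # pretend reconstruction
mu = rng.normal(scale=0.5, size=(8, 4))
log_var = rng.normal(scale=0.1, size=(8, 4))

for beta in (0.1, 1.0, 10.0):
    print(beta, beta_vae_loss(x, x_recon, mu, log_var, beta))
# Larger beta penalises the rate term harder, pushing the posterior
# toward the prior and discarding input information: stronger
# compression, worse reconstruction.
```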
Where It Appears
- VAE (variational-inference-vae/): the β in β-VAE directly controls the information bottleneck — low β (KL-AE) keeps more information, high β discards more; VQ-VAE uses a discrete bottleneck (codebook size) instead of a continuous one
- Contrastive learning (contrastive-self-supervising/): the projection head in SimCLR is a form of information bottleneck — it maps the high-dimensional backbone representation down to a space where contrastive learning operates, discarding task-irrelevant information
- Transformer (transformer/): the bottleneck between encoder and decoder in seq2seq models forces the representation to compress — attention mechanisms alleviate this by allowing selective information retrieval
- Diffusion (diffusion/): latent diffusion (Stable Diffusion) uses a VAE/KL-AE to compress images into a latent space before applying diffusion — the latent dimensionality is the bottleneck
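The projection-head case is the simplest to sketch: a small MLP that maps backbone features to a much lower-dimensional space cannot be inverted, so it necessarily discards information. The dimensions and initialisation below are illustrative assumptions, not SimCLR’s exact settings:

```python
# Sketch: a SimCLR-style projection head as an explicit bottleneck.
# Sizes (2048 -> 512 -> 128) and random init are illustrative only.
import numpy as np

rng = np.random.default_rng(0)

def projection_head(h, w1, w2):
    """Two-layer MLP: backbone features -> low-dim contrastive space."""
    z = np.maximum(h @ w1, 0.0)  # ReLU hidden layer
    z = z @ w2                   # low-dimensional output
    return z / np.linalg.norm(z, axis=-1, keepdims=True)  # unit sphere

h = rng.normal(size=(32, 2048))  # stand-in backbone representation
w1 = rng.normal(scale=0.02, size=(2048, 512))
w2 = rng.normal(scale=0.02, size=(512, 128))

z = projection_head(h, w1, w2)
print(h.shape, "->", z.shape)
# The 2048 -> 128 map is not invertible, so information is discarded.
# In SimCLR the head is used only for the contrastive loss and dropped
# at transfer time, keeping the richer backbone features for downstream
# tasks.
```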
Solutions at a Glance
| Solution | Mechanism | Where documented |
|---|---|---|
| β-VAE (β tuning) | Explicit tradeoff parameter in the ELBO — β controls compression vs reconstruction | variational-inference-vae/ |
| VQ-VAE (codebook size) | Discrete bottleneck with tunable capacity — codebook size controls information throughput | variational-inference-vae/ |
| Dropout (drop rate) | Randomly zeros activations — higher rate = stronger bottleneck | atomic-concepts/regularisation/dropout.md |
| Projection head (SimCLR) | Low-dimensional projection discards nuisance information, retaining invariances | contrastive-self-supervising/ |
| Latent dimensionality | Choosing the latent space size directly controls how much information is preserved | (architectural choice) |
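Of the mechanisms in the table, dropout is the easiest to demonstrate directly: the drop rate sets how much of the activation signal survives each forward pass. A minimal sketch of inverted dropout (the hypothetical `dropout` helper below is for illustration):

```python
# Sketch: inverted dropout rate as a bottleneck knob. The helper
# function here is illustrative, not a specific library's API.
import numpy as np

def dropout(h, rate, rng, train=True):
    """Inverted dropout: zero units with prob `rate`, rescale survivors."""
    if not train or rate == 0.0:
        return h
    keep = rng.random(h.shape) >= rate
    return h * keep / (1.0 - rate)

rng = np.random.default_rng(0)
h = rng.normal(size=(1000, 64))
for rate in (0.1, 0.5, 0.9):
    h_drop = dropout(h, rate, rng)
    surviving = np.mean(h_drop != 0)
    print(f"rate={rate}: fraction of units surviving ≈ {surviving:.2f}")
# rate=0.9 transmits only ~10% of units per forward pass: a much
# narrower (stochastic) bottleneck than rate=0.1.
```

The rescaling by 1/(1 − rate) keeps the expected activation magnitude constant, so only the information content, not the scale, changes with the rate.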
Historical Context
The information bottleneck was formalised by Tishby, Pereira & Bialek (1999) as an information-theoretic framework for optimal representation learning. Shwartz-Ziv & Tishby (2017) proposed that deep networks naturally undergo an “information bottleneck phase” during training — first fitting the data (increasing I(Z; Y)), then compressing (decreasing I(X; Z)). This claim is debated (Saxe et al., 2018 found the compression phase depends on the activation function), but the framework remains influential for thinking about what representations should look like. The practical impact is visible in every VAE (β parameter), every contrastive method (projection head), and every architecture with an explicit bottleneck layer.