Information Bottleneck Tradeoff
Compression versus prediction: a representation that aggressively discards information to be compact may also discard task-relevant signal. Too much compression and you lose predictive power; too little and you retain noise and nuisance variation. Every representation learning method navigates this tension.
Intuition
Imagine summarising a 300-page novel into a 1-page summary. You must decide what to keep and what to cut. A good summary retains the key plot points and character arcs (task-relevant information) while discarding scene descriptions and dialogue (nuisance detail). But how do you know what’s “task-relevant” without knowing the task in advance? A summary optimised for “what happens in the plot?” would look different from one optimised for “what’s the author’s writing style?”
The information bottleneck formalises this: given input X and target Y, find a representation Z that maximally preserves information about Y (predictive power) while minimally preserving information about X (compression). The tradeoff parameter β controls the balance — low β gives aggressive compression (small latent, potential information loss), high β gives faithful encoding (large latent, potential overfitting to noise).
This isn’t just theoretical. In VAEs, β literally controls this tradeoff via the KL weight. In VQ-VAE, the codebook size is the bottleneck. In dropout, the drop rate controls how much information flows through. Every architecture and training objective makes implicit or explicit choices about what to keep and what to discard.
The information bottleneck objective (Tishby, Pereira & Bialek, 1999):

minimise over the encoder p(z|x):  L = I(X; Z) − β I(Z; Y)

where I(X; Z) is the mutual information between input and representation (compression cost), and I(Z; Y) is the mutual information between representation and target (predictive value). β controls the tradeoff.
- β → 0: maximally compress — Z carries no information about Y (trivial, useless)
- β → ∞: no compression — Z retains everything from X (no regularisation)
- Optimal β: retain only the information in X that’s relevant to Y
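To make the objective concrete, here is a minimal numpy sketch that evaluates L = I(X; Z) − β I(Z; Y) for a discrete toy problem. The joint distribution, the two candidate encoders, and the β value are all illustrative assumptions, not from the source:

```python
# Sketch: evaluating the IB objective L = I(X;Z) - beta * I(Z;Y)
# on a discrete toy problem. Distributions and encoders are made up.
import numpy as np

def mutual_information(p_joint):
    """I(A;B) in nats, computed from a joint distribution p(a, b)."""
    p_a = p_joint.sum(axis=1, keepdims=True)
    p_b = p_joint.sum(axis=0, keepdims=True)
    mask = p_joint > 0
    return float((p_joint[mask] * np.log(p_joint[mask] / (p_a @ p_b)[mask])).sum())

# Joint p(x, y): 4 inputs, 2 labels; the parity of x predicts y.
p_xy = np.array([[0.22, 0.03],
                 [0.03, 0.22],
                 [0.22, 0.03],
                 [0.03, 0.22]])

def ib_objective(p_z_given_x, beta):
    p_x = p_xy.sum(axis=1)               # marginal p(x)
    p_xz = p_z_given_x * p_x[:, None]    # joint p(x, z)
    p_zy = p_z_given_x.T @ p_xy          # joint p(z, y)
    return mutual_information(p_xz) - beta * mutual_information(p_zy)

# Encoder 1: Z keeps only the parity of X (the task-relevant bit).
enc_parity = np.array([[1, 0], [0, 1], [1, 0], [0, 1]], float)
# Encoder 2: Z keeps which half X falls in (a task-irrelevant bit).
enc_half = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], float)

for name, enc in [("parity", enc_parity), ("half", enc_half)]:
    print(name, round(ib_objective(enc, beta=5.0), 3))
# Both encoders pay the same compression cost I(X;Z) = log 2, but the
# parity encoder scores lower (better) because it retains the one bit
# of X that actually predicts Y.
```

Both representations compress X to a single bit; the objective distinguishes them purely by how much of that bit is relevant to Y.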
Manifestation
- In VAEs: increasing β (stronger KL penalty) improves disentanglement but degrades reconstruction — the latent retains less information about the input (note that β-VAE’s β weights the compression term, so its direction is opposite to the IB β above)
- In VQ-VAE: too few codebook entries → blurry reconstructions (over-compressed); too many → codebook collapses to unused entries
- In dropout: high drop rate → underfitting (too much information destroyed); low drop rate → overfitting (too little regularisation)
- In quantisation: aggressive bit reduction saves compute but degrades model quality — the information-accuracy tradeoff curve is the practical manifestation
- Representation quality is task-dependent — features that are excellent for classification may be poor for generation, and vice versa
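The VAE manifestation can be sketched directly: the β-VAE loss is reconstruction plus β times the KL between the approximate posterior and the prior, so β scales the rate (compression) term. A pure-numpy illustration with a diagonal-Gaussian posterior; the tensors below are random stand-ins for encoder/decoder outputs, not a trained model:

```python
# Sketch: how beta weights the rate (KL) term in a beta-VAE loss.
# All tensors here are random placeholders for model outputs.
import numpy as np

def kl_diag_gaussian(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), per example."""
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var, axis=-1)

def beta_vae_loss(x, x_recon, mu, log_var, beta):
    recon = np.sum((x - x_recon) ** 2, axis=-1)  # distortion term
    kl = kl_diag_gaussian(mu, log_var)           # rate term
    return float(np.mean(recon + beta * kl))

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 16))
x_recon = x + 0.1 * rng.normal(size=(8, 16))     # pretend reconstruction
mu = rng.normal(scale=0.5, size=(8, 4))
log_var = rng.normal(scale=0.1, size=(8, 4))

for beta in (0.1, 1.0, 10.0):
    print(beta, beta_vae_loss(x, x_recon, mu, log_var, beta))
# Larger beta penalises the rate term harder, pushing the posterior
# toward the prior and discarding input information: stronger
# compression, worse reconstruction.
```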
Where It Appears
- VAE (variational-inference-vae/): the β in β-VAE directly controls the information bottleneck — low β (KL-AE) keeps more information, high β discards more; VQ-VAE uses a discrete bottleneck (codebook size) instead of a continuous one
- Contrastive learning (contrastive-self-supervising/): the projection head in SimCLR is a form of information bottleneck — it maps the high-dimensional backbone representation down to a space where contrastive learning operates, discarding task-irrelevant information
- Transformer (transformer/): the bottleneck between encoder and decoder in seq2seq models forces the representation to compress — attention mechanisms alleviate this by allowing selective information retrieval
- Diffusion (diffusion/): latent diffusion (Stable Diffusion) uses a VAE/KL-AE to compress images into a latent space before applying diffusion — the latent dimensionality is the bottleneck
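The projection-head case is the simplest to sketch: a small MLP that maps backbone features to a much lower-dimensional space cannot be inverted, so it necessarily discards information. The dimensions and initialisation below are illustrative assumptions, not SimCLR’s exact settings:

```python
# Sketch: a SimCLR-style projection head as an explicit bottleneck.
# Sizes (2048 -> 512 -> 128) and random init are illustrative only.
import numpy as np

rng = np.random.default_rng(0)

def projection_head(h, w1, w2):
    """Two-layer MLP: backbone features -> low-dim contrastive space."""
    z = np.maximum(h @ w1, 0.0)  # ReLU hidden layer
    z = z @ w2                   # low-dimensional output
    return z / np.linalg.norm(z, axis=-1, keepdims=True)  # unit sphere

h = rng.normal(size=(32, 2048))  # stand-in backbone representation
w1 = rng.normal(scale=0.02, size=(2048, 512))
w2 = rng.normal(scale=0.02, size=(512, 128))

z = projection_head(h, w1, w2)
print(h.shape, "->", z.shape)
# The 2048 -> 128 map is not invertible, so information is discarded.
# In SimCLR the head is used only for the contrastive loss and dropped
# at transfer time, keeping the richer backbone features for downstream
# tasks.
```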
Solutions at a Glance
| Solution | Mechanism | Where documented |
|---|---|---|
| β-VAE (β tuning) | Explicit tradeoff parameter in the ELBO — β controls compression vs reconstruction | variational-inference-vae/ |
| VQ-VAE (codebook size) | Discrete bottleneck with tunable capacity — codebook size controls information throughput | variational-inference-vae/ |
| Dropout (drop rate) | Randomly zeros activations — higher rate = stronger bottleneck | atomic-concepts/regularisation/dropout.md |
| Projection head (SimCLR) | Low-dimensional projection discards nuisance information, retaining invariances | contrastive-self-supervising/ |
| Latent dimensionality | Choosing the latent space size directly controls how much information is preserved | (architectural choice) |
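Of the mechanisms in the table, dropout is the easiest to demonstrate directly: the drop rate sets how much of the activation signal survives each forward pass. A minimal sketch of inverted dropout (the hypothetical `dropout` helper below is for illustration):

```python
# Sketch: inverted dropout rate as a bottleneck knob. The helper
# function here is illustrative, not a specific library's API.
import numpy as np

def dropout(h, rate, rng, train=True):
    """Inverted dropout: zero units with prob `rate`, rescale survivors."""
    if not train or rate == 0.0:
        return h
    keep = rng.random(h.shape) >= rate
    return h * keep / (1.0 - rate)

rng = np.random.default_rng(0)
h = rng.normal(size=(1000, 64))
for rate in (0.1, 0.5, 0.9):
    h_drop = dropout(h, rate, rng)
    surviving = np.mean(h_drop != 0)
    print(f"rate={rate}: fraction of units surviving ≈ {surviving:.2f}")
# rate=0.9 transmits only ~10% of units per forward pass: a much
# narrower (stochastic) bottleneck than rate=0.1.
```

The rescaling by 1/(1 − rate) keeps the expected activation magnitude constant, so only the information content, not the scale, changes with the rate.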
Historical Context
The information bottleneck was formalised by Tishby, Pereira & Bialek (1999) as an information-theoretic framework for optimal representation learning. Shwartz-Ziv & Tishby (2017) proposed that deep networks naturally undergo an “information bottleneck phase” during training — first fitting the data (increasing I(Z; Y)), then compressing (decreasing I(X; Z)). This claim is debated (Saxe et al., 2018 found the compression phase depends on the activation function), but the framework remains influential for thinking about what representations should look like. The practical impact is visible in every VAE (β parameter), every contrastive method (projection head), and every architecture with an explicit bottleneck layer.