KV Cache Memory
Autoregressive transformer inference stores key-value pairs for all past tokens at every layer. Memory scales as O(layers × heads × seq_len × d_k), which becomes the primary bottleneck for long sequences. A 70B model generating 100K tokens can require hundreds of gigabytes of KV cache alone.
Intuition
Every time a transformer generates a new token, it needs to attend to all previous tokens. To avoid recomputing the key and value projections for every past token at every step, we cache them — store them once, reuse them forever. This is the KV cache, and it’s essential for making autoregressive inference tractable.
The problem is that this cache grows linearly with sequence length and is allocated for every layer and every attention head. For a 70B model with 80 layers, 64 KV heads, and d_k = 128 in bf16: the KV cache per token is 80 × 64 × 128 × 2 (K and V) × 2 bytes ≈ 2.6 MB. At 100K tokens, that’s roughly 260 GB — just for the cache, not counting model weights. That is more memory than any single GPU provides, and it strains even multi-GPU nodes.
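The arithmetic above is easy to package as a quick estimator. This is a minimal sketch — the function name and defaults are ours, not from any library:

```python
def kv_cache_bytes(n_layers, n_kv_heads, d_k, seq_len, bytes_per_elem=2):
    """Total KV cache for one sequence: K and V, per layer, per KV head.

    bytes_per_elem=2 corresponds to bf16/fp16 storage.
    """
    per_token = n_layers * n_kv_heads * d_k * 2 * bytes_per_elem  # 2 = K and V
    return per_token * seq_len

# The 70B example from the text: 80 layers, 64 KV heads, d_k = 128, bf16.
print(kv_cache_bytes(80, 64, 128, 1))          # 2621440 bytes ≈ 2.6 MB per token
print(kv_cache_bytes(80, 64, 128, 100_000) / 1e9)  # ≈ 262 GB at 100K tokens
```

Halving `bytes_per_elem` (int8 quantisation) or cutting `n_kv_heads` (GQA/MQA) scales the total down proportionally, which is why those are the levers the solutions below pull.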
The KV cache is why you can’t just “make the context longer” for free. Every token in the context costs memory that scales with model depth and width. Batch size is limited by KV cache memory, not compute — you might have compute to process 64 sequences in parallel, but only enough memory for 8.
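The batch-size limit described above follows from a simple budget: whatever memory is left after the weights gets divided among per-sequence caches. A hedged sketch, with a hypothetical 8-GPU node's numbers chosen for illustration:

```python
def max_batch_size(total_mem_gb, weights_gb, kv_per_token_mb, seq_len):
    """How many sequences fit: leftover memory / per-sequence KV cache.

    Ignores activations and allocator overhead, so it is an upper bound.
    """
    free_bytes = (total_mem_gb - weights_gb) * 1e9
    per_seq_bytes = kv_per_token_mb * 1e6 * seq_len
    return int(free_bytes // per_seq_bytes)

# Hypothetical node: 640 GB aggregate, 140 GB of bf16 weights,
# 2.6 MB of KV cache per token, 8K-token sequences.
print(max_batch_size(640, 140, 2.6, 8192))  # 23 sequences, despite ample compute
```

The point of the sketch: the ceiling moves with `kv_per_token_mb` and `seq_len`, not with FLOPs — exactly the compute-rich, memory-poor regime the text describes.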
Manifestation
- GPU memory exhaustion (OOM) during long-sequence generation — the model fits in memory but the KV cache doesn’t
- Batch size is memory-limited, not compute-limited — increasing batch size for throughput is constrained by per-sequence KV cache
- Latency increases with sequence length even for single tokens, because attention must read the full cache
- Quantising the KV cache (e.g., to int8 or int4) gives dramatic memory savings with minimal quality loss — a sign that the cache is the bottleneck
- Generation speed drops noticeably past certain sequence lengths, as reading the growing cache at every step saturates GPU memory bandwidth
Where It Appears
- Transformer (transformer/): GQA (Grouped Query Attention) and MQA (Multi-Query Attention) reduce KV cache by sharing KV heads across query heads — LLaMA 2 70B and Mistral use GQA specifically for this reason
- Diffusion (diffusion/): cross-attention in latent diffusion models (Stable Diffusion) has a KV cache for the text conditioning — smaller than in autoregressive models but still significant at high resolution
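The KV-head sharing that GQA performs can be shown in a toy form. A minimal NumPy sketch, ours rather than any model's actual code — it omits batching, causal masking, and the output projection:

```python
import numpy as np

def gqa_attention(q, k, v, n_kv_heads):
    """Toy grouped-query attention.

    q: (H, T, d) with H query heads; k, v: (G, T, d) with G < H cached KV heads.
    Each group of H // G query heads reads the same cached K/V, so the
    cache is G/H the size of full multi-head attention.
    """
    H, T, d = q.shape
    group = H // n_kv_heads
    out = np.empty_like(q)
    for h in range(H):
        kv = h // group  # which shared KV head this query head maps to
        scores = q[h] @ k[kv].T / np.sqrt(d)
        w = np.exp(scores - scores.max(-1, keepdims=True))  # stable softmax
        w /= w.sum(-1, keepdims=True)
        out[h] = w @ v[kv]
    return out

# 64 query heads sharing 8 KV heads: an 8x smaller KV cache than full MHA.
q = np.random.randn(64, 16, 128)
k = np.random.randn(8, 16, 128)
v = np.random.randn(8, 16, 128)
print(gqa_attention(q, k, v, n_kv_heads=8).shape)  # (64, 16, 128)
```

Setting `n_kv_heads=1` recovers MQA; setting it equal to the number of query heads recovers standard multi-head attention.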
Solutions at a Glance
| Solution | Mechanism | Where documented |
|---|---|---|
| GQA (Grouped Query Attention) | Share KV heads across groups of query heads — reduces KV cache by 4-8x | transformer/ |
| MQA (Multi-Query Attention) | Single KV head shared across all query heads — maximum KV cache reduction | transformer/ |
| KV cache quantisation | Quantise cached keys/values to int8 or int4 with minimal quality loss | (inference optimisation) |
| Sliding window attention | Only cache the last W tokens, discarding older ones — trades long-range recall for bounded memory | (Mistral, Longformer) |
| Paged attention (vLLM) | Memory-efficient KV cache allocation using virtual memory pages — reduces fragmentation | (Kwon et al., 2023) |
| RoPE + position interpolation | Extend usable context length by interpolating rotary position embeddings — doesn’t shrink the per-token cache, but avoids retraining for longer contexts | transformer/ (RoPE variant) |
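Of the table's solutions, sliding window attention is the simplest to sketch: a ring buffer that keeps only the last W tokens' keys and values, so memory is bounded regardless of sequence length. The class below is an illustrative sketch (names and structure ours), tracking one layer and one head:

```python
import numpy as np

class SlidingWindowKVCache:
    """Ring buffer holding K/V for only the last `window` tokens.

    Memory stays at window × d_k no matter how long generation runs;
    the trade-off is that tokens older than the window are unrecoverable.
    """
    def __init__(self, window, d_k):
        self.window = window
        self.k = np.zeros((window, d_k))
        self.v = np.zeros((window, d_k))
        self.pos = 0  # total tokens seen so far

    def append(self, k_t, v_t):
        slot = self.pos % self.window  # overwrite the oldest entry
        self.k[slot], self.v[slot] = k_t, v_t
        self.pos += 1

    def contents(self):
        """Cached K/V in generation order, oldest retained token first."""
        n = min(self.pos, self.window)
        order = [(self.pos - n + i) % self.window for i in range(n)]
        return self.k[order], self.v[order]

cache = SlidingWindowKVCache(window=4, d_k=2)
for t in range(6):  # generate 6 tokens; only the last 4 survive
    cache.append(np.full(2, float(t)), np.full(2, float(t)))
k, v = cache.contents()
print(k[:, 0])  # [2. 3. 4. 5.]
```

Attention then runs against `contents()` instead of the full history — the mechanism behind Mistral's bounded-memory long contexts, here reduced to its data-structure core.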
Historical Context
The KV cache became a critical concern with the scaling of large language models beyond 1B parameters (2020-2022). In smaller models, the KV cache was a minor fraction of total memory. But as models grew to 70B+ parameters and context lengths extended from 2K to 100K+ tokens, the KV cache became the dominant memory consumer during inference. Shazeer (2019) proposed Multi-Query Attention specifically to address KV cache memory, though it took several years for the idea to be widely adopted. GQA (Ainslie et al., 2023) offered a practical middle ground that was quickly adopted by LLaMA 2, Mistral, and most subsequent large models. The development of vLLM’s paged attention (Kwon et al., 2023) was another landmark — it applied operating systems concepts (virtual memory, paging) to the KV cache allocation problem, dramatically improving serving throughput.