What Is the KV Cache?
During autoregressive generation, a transformer must compute attention over all previous tokens for every new token it generates. The KV Cache stores the Key and Value projections from those previous tokens so they don't need to be recomputed. This reduces the per-step computation from O(n²) to O(n) — a massive speedup for long sequences.
Key Cache
Stores the Key projections for each token at each layer. These are used to compute attention scores between the new token and all previous tokens.
Value Cache
Stores the Value projections for each token at each layer. Once attention weights are computed, these cached values are used to produce the output.
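Conceptually, the cache is just a per-layer store of Key and Value vectors that grows by one entry per generated token. A minimal sketch in pure Python (class and field names are illustrative; real implementations use preallocated GPU tensors):

```python
# Minimal sketch of a KV cache: one (keys, values) store per layer.
class KVCache:
    def __init__(self, num_layers):
        self.keys = [[] for _ in range(num_layers)]    # keys[layer][pos] -> key vector
        self.values = [[] for _ in range(num_layers)]  # values[layer][pos] -> value vector

    def append(self, layer, k, v):
        """Store the new token's K/V projections for one layer."""
        self.keys[layer].append(k)
        self.values[layer].append(v)

    def seq_len(self):
        return len(self.keys[0])

cache = KVCache(num_layers=2)
for pos in range(3):                  # pretend we generate 3 tokens
    for layer in range(2):
        cache.append(layer, k=[float(pos)] * 4, v=[float(pos)] * 4)

print(cache.seq_len())  # 3 cached positions per layer
```

Each generation step appends exactly one K and one V vector per layer; nothing already in the cache is ever recomputed.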
Why It Matters
Without KV caching, generating each new token would require reprocessing the entire sequence through every attention layer. For a 4096-token sequence, that means 4096× redundant computation per token.
Speed
Avoids recomputing attention for all previous tokens at every step. Generation goes from quadratic to linear.
Incremental
Each new token only needs to compute its own Q, K, V and attend to the cached K, V from previous positions.
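The incremental step above can be sketched directly: the new token's query is dotted against every cached key, and the resulting weights mix the cached values. This toy version uses plain Python lists and tiny 2-dimensional vectors; it is a sketch of the math, not an efficient implementation:

```python
import math

def attend(q, cached_keys, cached_values):
    """One decode step: the new token's query attends to all cached K/V.
    Cost is O(n) in the number of cached positions."""
    d = len(q)
    # Scaled dot-product scores against every cached key
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in cached_keys]
    # Numerically stable softmax over the scores
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    weights = [w / total for w in weights]
    # Weighted sum of cached value vectors -> attention output
    return [sum(w * v[i] for w, v in zip(weights, cached_values)) for i in range(d)]

K = [[1.0, 0.0], [0.0, 1.0]]   # cached keys from 2 previous tokens
V = [[1.0, 2.0], [3.0, 4.0]]   # cached values
out = attend([1.0, 0.0], K, V) # out is a convex combination of the rows of V
```

Note that only the single query for the new position is computed; the n cached keys and values are reused as-is.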
Trade-off
Trades GPU memory for compute time. The cache grows linearly with sequence length and model depth.
Memory Implications
The KV cache is the primary memory bottleneck during inference. For large models with long contexts, it can consume tens of gigabytes of GPU memory.
KV Cache Memory = 2 × num_layers × seq_len × num_kv_heads × d_head × dtype_size

The factor of 2 accounts for K and V; dtype_size is 2 bytes for FP16. For Llama 3 70B (80 layers, 8 KV heads, d_head = 128) at 8K context: ~2.5 GB per request.
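The formula is easy to check directly. A small helper (the default `dtype_size=2` assumes FP16/BF16):

```python
def kv_cache_bytes(num_layers, seq_len, num_kv_heads, d_head, dtype_size=2):
    """KV cache size in bytes; the factor of 2 covers both K and V."""
    return 2 * num_layers * seq_len * num_kv_heads * d_head * dtype_size

# Llama 3 70B: 80 layers, 8 KV heads, d_head 128, FP16, 8K context
gb = kv_cache_bytes(80, 8192, 8, 128) / 2**30
print(f"{gb:.1f} GB per request")  # 2.5 GB per request
```

Note the linear growth: doubling the context to 16K doubles the cache to ~5 GB for the same request.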
Optimization Techniques
Multi-Query & Grouped-Query Attention (MQA/GQA)
Instead of separate K/V heads per attention head, MQA shares a single K/V head across all query heads, while GQA uses a few shared groups. This reduces KV cache size by 4-32× with minimal quality loss. Llama 3 and Mistral use GQA. See the Attention Mechanism article for more details.
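The saving is just the ratio of query heads to shared KV heads, since only the KV heads occupy cache. A quick check, assuming Llama 3 70B's published configuration of 64 query heads grouped over 8 KV heads:

```python
def gqa_cache_reduction(num_query_heads, num_kv_heads):
    """KV cache shrinks by the ratio of query heads to shared KV heads."""
    return num_query_heads / num_kv_heads

print(gqa_cache_reduction(64, 8))  # GQA as in Llama 3 70B: 8.0x smaller than full MHA
print(gqa_cache_reduction(64, 1))  # MQA: one shared KV head -> 64.0x smaller
```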
Sliding Window Attention
Instead of caching all tokens, only keep the most recent W tokens in the cache. Used by Mistral models. Reduces memory from O(seq_len) to O(W), but limits the model's ability to attend to very early tokens.
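A rolling-buffer cache can be sketched with a bounded deque: once the window is full, appending a new position silently evicts the oldest one. A toy window of 4 is used here for readability (Mistral 7B's actual window is 4096):

```python
from collections import deque

W = 4                      # sliding window size (toy value; Mistral 7B uses 4096)
keys = deque(maxlen=W)     # rolling buffer: oldest entry evicted automatically
values = deque(maxlen=W)

for pos in range(10):      # generate 10 tokens
    keys.append(f"k{pos}")
    values.append(f"v{pos}")

print(list(keys))  # only the last W positions remain: ['k6', 'k7', 'k8', 'k9']
```

Memory is now bounded by W regardless of sequence length, which is exactly the O(seq_len) → O(W) reduction described above.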
Paged Attention (vLLM)
Inspired by virtual memory in operating systems. Instead of allocating contiguous memory for each sequence's KV cache, vLLM manages cache in fixed-size "pages" that can be allocated and freed dynamically. This eliminates memory fragmentation and enables efficient batching of requests with different sequence lengths.
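The page-table idea can be illustrated with a toy allocator: each sequence holds a list of block IDs drawn from a shared free pool, a new block is claimed only when the current one fills, and freed blocks return to the pool for reuse. This is a simplified sketch of the bookkeeping, not vLLM's actual implementation:

```python
class PagedKVAllocator:
    """Toy paged-KV bookkeeping: sequences get fixed-size blocks from a
    shared pool, so cache memory need not be contiguous per sequence."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free = list(range(num_blocks))   # shared pool of free block ids
        self.tables = {}                      # seq_id -> list of block ids (its "page table")
        self.lengths = {}                     # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        n = self.lengths.get(seq_id, 0)
        if n % self.block_size == 0:          # current block full (or none yet)
            self.tables.setdefault(seq_id, []).append(self.free.pop())
        self.lengths[seq_id] = n + 1

    def free_sequence(self, seq_id):
        """Return all of a finished sequence's blocks to the pool."""
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

alloc = PagedKVAllocator(num_blocks=8, block_size=16)
for _ in range(40):                    # 40 tokens -> ceil(40 / 16) = 3 blocks
    alloc.append_token("req-0")
print(len(alloc.tables["req-0"]))      # 3
alloc.free_sequence("req-0")
print(len(alloc.free))                 # 8: all blocks immediately reusable
```

Because blocks are small and interchangeable, requests of very different lengths can share one pool with at most one partially filled block of waste per sequence.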
Key Takeaways
1. KV cache stores Key and Value projections from previous tokens, avoiding redundant computation during generation
2. Without the KV cache, each new token requires recomputing K and V for the entire sequence; with it, only the new token's K/V is computed
3. KV cache memory grows linearly with sequence length × layers × KV heads — this is the main inference memory bottleneck
4. Techniques like GQA, sliding window attention, and paged attention address the memory cost while preserving generation speed