What Is the KV Cache?
During autoregressive generation, a transformer must compute attention over all previous tokens for every new token it generates. The KV Cache stores the Key and Value projections from those previous tokens so they don't need to be recomputed. This reduces the per-step computation from O(n²) to O(n) — a massive speedup for long sequences.
Key Cache
Stores the Key projections for each token at each layer. These are used to compute attention scores between the new token and all previous tokens.
Value Cache
Stores the Value projections for each token at each layer. Once attention weights are computed, these cached values are used to produce the output.
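Conceptually, the cache is just a per-layer store of Key and Value vectors that grows by one entry per generated token. A minimal sketch in pure Python (class and field names are illustrative; real implementations use preallocated GPU tensors):

```python
# Minimal sketch of a KV cache: one (keys, values) store per layer.
class KVCache:
    def __init__(self, num_layers):
        self.keys = [[] for _ in range(num_layers)]    # keys[layer][pos] -> key vector
        self.values = [[] for _ in range(num_layers)]  # values[layer][pos] -> value vector

    def append(self, layer, k, v):
        """Store the new token's K/V projections for one layer."""
        self.keys[layer].append(k)
        self.values[layer].append(v)

    def seq_len(self):
        return len(self.keys[0])

cache = KVCache(num_layers=2)
for pos in range(3):                  # pretend we generate 3 tokens
    for layer in range(2):
        cache.append(layer, k=[float(pos)] * 4, v=[float(pos)] * 4)

print(cache.seq_len())  # 3 cached positions per layer
```

Each generation step appends exactly one K and one V vector per layer; nothing already in the cache is ever recomputed.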
Why It Matters
Without KV caching, generating each new token would require reprocessing the entire sequence through every attention layer. For a 4096-token sequence, that means 4096× redundant computation per token.
Speed
Avoids recomputing attention for all previous tokens at every step. Generation goes from quadratic to linear.
Incremental
Each new token only needs to compute its own Q, K, V and attend to the cached K, V from previous positions.
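The incremental step above can be sketched directly: the new token's query is dotted against every cached key, and the resulting weights mix the cached values. This toy version uses plain Python lists and tiny 2-dimensional vectors; it is a sketch of the math, not an efficient implementation:

```python
import math

def attend(q, cached_keys, cached_values):
    """One decode step: the new token's query attends to all cached K/V.
    Cost is O(n) in the number of cached positions."""
    d = len(q)
    # Scaled dot-product scores against every cached key
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in cached_keys]
    # Numerically stable softmax over the scores
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    weights = [w / total for w in weights]
    # Weighted sum of cached value vectors -> attention output
    return [sum(w * v[i] for w, v in zip(weights, cached_values)) for i in range(d)]

K = [[1.0, 0.0], [0.0, 1.0]]   # cached keys from 2 previous tokens
V = [[1.0, 2.0], [3.0, 4.0]]   # cached values
out = attend([1.0, 0.0], K, V) # out is a convex combination of the rows of V
```

Note that only the single query for the new position is computed; the n cached keys and values are reused as-is.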
Trade-off
Trades GPU memory for compute time. The cache grows linearly with sequence length and model depth.
Memory Implications
The KV cache is the primary memory bottleneck during inference. For large models with long contexts, it can consume tens of gigabytes of GPU memory.
KV Cache Memory = 2 × num_layers × seq_len × num_kv_heads × d_head × dtype_size

The factor of 2 accounts for K and V; dtype_size is 2 bytes for FP16. For Llama 3 70B (80 layers, 8 KV heads, d_head = 128) at 8K context: ~2.5 GB per request.
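The formula is easy to check directly. A small helper (the default `dtype_size=2` assumes FP16/BF16):

```python
def kv_cache_bytes(num_layers, seq_len, num_kv_heads, d_head, dtype_size=2):
    """KV cache size in bytes; the factor of 2 covers both K and V."""
    return 2 * num_layers * seq_len * num_kv_heads * d_head * dtype_size

# Llama 3 70B: 80 layers, 8 KV heads, d_head 128, FP16, 8K context
gb = kv_cache_bytes(80, 8192, 8, 128) / 2**30
print(f"{gb:.1f} GB per request")  # 2.5 GB per request
```

Note the linear growth: doubling the context to 16K doubles the cache to ~5 GB for the same request.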
Optimization Techniques
Multi-Query & Grouped-Query Attention (MQA/GQA)
Instead of separate K/V heads per attention head, MQA shares a single K/V head across all query heads, while GQA uses a few shared groups. This reduces KV cache size by 4-32× with minimal quality loss. Llama 3 and Mistral use GQA. See the Attention Mechanism article for more details.
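The saving is just the ratio of query heads to shared KV heads, since only the KV heads occupy cache. A quick check, assuming Llama 3 70B's published configuration of 64 query heads grouped over 8 KV heads:

```python
def gqa_cache_reduction(num_query_heads, num_kv_heads):
    """KV cache shrinks by the ratio of query heads to shared KV heads."""
    return num_query_heads / num_kv_heads

print(gqa_cache_reduction(64, 8))  # GQA as in Llama 3 70B: 8.0x smaller than full MHA
print(gqa_cache_reduction(64, 1))  # MQA: one shared KV head -> 64.0x smaller
```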
Sliding Window Attention
Instead of caching all tokens, only keep the most recent W tokens in the cache. Used by Mistral models. Reduces memory from O(seq_len) to O(W), but limits the model's ability to attend to very early tokens.
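A rolling-buffer cache can be sketched with a bounded deque: once the window is full, appending a new position silently evicts the oldest one. A toy window of 4 is used here for readability (Mistral 7B's actual window is 4096):

```python
from collections import deque

W = 4                      # sliding window size (toy value; Mistral 7B uses 4096)
keys = deque(maxlen=W)     # rolling buffer: oldest entry evicted automatically
values = deque(maxlen=W)

for pos in range(10):      # generate 10 tokens
    keys.append(f"k{pos}")
    values.append(f"v{pos}")

print(list(keys))  # only the last W positions remain: ['k6', 'k7', 'k8', 'k9']
```

Memory is now bounded by W regardless of sequence length, which is exactly the O(seq_len) → O(W) reduction described above.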
Paged Attention (vLLM)
Inspired by virtual memory in operating systems. Instead of allocating contiguous memory for each sequence's KV cache, vLLM manages cache in fixed-size "pages" that can be allocated and freed dynamically. This eliminates memory fragmentation and enables efficient batching of requests with different sequence lengths.
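The page-table idea can be illustrated with a toy allocator: each sequence holds a list of block IDs drawn from a shared free pool, a new block is claimed only when the current one fills, and freed blocks return to the pool for reuse. This is a simplified sketch of the bookkeeping, not vLLM's actual implementation:

```python
class PagedKVAllocator:
    """Toy paged-KV bookkeeping: sequences get fixed-size blocks from a
    shared pool, so cache memory need not be contiguous per sequence."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free = list(range(num_blocks))   # shared pool of free block ids
        self.tables = {}                      # seq_id -> list of block ids (its "page table")
        self.lengths = {}                     # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        n = self.lengths.get(seq_id, 0)
        if n % self.block_size == 0:          # current block full (or none yet)
            self.tables.setdefault(seq_id, []).append(self.free.pop())
        self.lengths[seq_id] = n + 1

    def free_sequence(self, seq_id):
        """Return all of a finished sequence's blocks to the pool."""
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

alloc = PagedKVAllocator(num_blocks=8, block_size=16)
for _ in range(40):                    # 40 tokens -> ceil(40 / 16) = 3 blocks
    alloc.append_token("req-0")
print(len(alloc.tables["req-0"]))      # 3
alloc.free_sequence("req-0")
print(len(alloc.free))                 # 8: all blocks immediately reusable
```

Because blocks are small and interchangeable, requests of very different lengths can share one pool with at most one partially filled block of waste per sequence.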
Key Takeaways
1. KV cache stores Key and Value projections from previous tokens, avoiding redundant computation during generation
2. Without the KV cache, each new token requires recomputing K and V for the entire sequence; with it, only the new token's K/V is computed
3. KV cache memory grows linearly with sequence length × layers × KV heads — this is the main inference memory bottleneck
4. Techniques like GQA, sliding window attention, and paged attention address the memory cost while preserving generation speed