What is Attention?
In the context of neural networks, attention is a mechanism that lets a model focus on specific parts of its input when producing an output. It is the core mechanism that allows transformers to weigh the importance of different parts of the input when generating each output token, enabling the model to "focus" on the most relevant context.
"Attention is all you need."
– The title of the 2017 paper that introduced the Transformer architecture
Interactive Attention Map
[Interactive demo: hover over different words to see the attention weights, i.e. where the model is "looking" to understand that specific word.]
The Three Vectors: Query, Key, and Value
Query
"What am I looking for?" - Represents the current word seeking context.
Key
"What do I contain?" - A label for every word in the sequence to check against the query.
Value
"What information do I offer?" - The actual content that gets passed forward if the Query and Key match.
The model computes a score by taking the dot product of each Query with each Key (scaled by the vector dimension). A softmax turns these scores into weights that determine how much of each Value flows into the output.
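To make this concrete, here is a minimal NumPy sketch of single-head scaled dot-product attention; the names, shapes, and toy data are illustrative, not a production implementation.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Minimal single-head attention.

    Q, K, V: arrays of shape (seq_len, d_k).
    Returns (output, weights), where weights[i, j] is how much token i
    attends to token j.
    """
    d_k = Q.shape[-1]
    # Compare every query against every key (dot product), scaled to keep
    # the scores in a numerically friendly range.
    scores = Q @ K.T / np.sqrt(d_k)                      # (seq_len, seq_len)
    # Softmax turns raw scores into weights that sum to 1 for each query.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each output is a weighted mix of the value vectors.
    return weights @ V, weights

# Toy example: 4 tokens with 8-dimensional projections.
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
out, w = scaled_dot_product_attention(Q, K, V)
print(out.shape, w.shape)  # (4, 8) (4, 4)
```

In a full transformer, Q, K, and V come from learned linear projections of the token embeddings, and this computation is repeated across several heads in parallel.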
Why it Changed Everything
Parallel Processing
Unlike older models (RNNs), Transformers can process all words in a sentence at the same time, making training much faster.
Long-Range Dependencies
Attention can link two words even if they are thousands of tokens apart, as long as they are within the same context window.
Dynamic Context
The model doesn't just look at words; it learns which words are important *for each other* based on the specific sentence.
The Quadratic Problem
Standard attention computes scores between every pair of tokens, resulting in O(n²) complexity. Doubling the context length quadruples memory usage and compute. This is why extending context windows is so challenging.
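A quick back-of-the-envelope calculation (illustrative numbers only, assuming 2-byte activations) shows how the score matrix alone grows:

```python
# The attention score matrix has n * n entries per head, per layer.
bytes_per_entry = 2  # assuming fp16/bf16 activations

for n in (4_096, 8_192, 16_384):
    score_matrix_bytes = n * n * bytes_per_entry
    print(f"context {n:>6}: {score_matrix_bytes / 2**20:8.1f} MiB per head per layer")

# context   4096:     32.0 MiB per head per layer
# context   8192:    128.0 MiB per head per layer
# context  16384:    512.0 MiB per head per layer
```

Each doubling of the context length quadruples the size of this matrix, before even counting the KV cache or the other layers and heads.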
Attention Optimizations
Several techniques have been developed to make attention more efficient, enabling longer contexts and faster inference.
Flash Attention
Rewrites the attention algorithm to be IO-aware, computing attention in blocks that fit in GPU fast memory (SRAM) rather than constantly reading/writing to slow HBM. The math is identical; just smarter memory access patterns.
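The sketch below shows the core idea behind the blocking: process K/V in tiles with a running ("online") softmax, so the full n × n score matrix is never materialized. The block size and names are illustrative; this is a simplified stand-in for the actual fused GPU kernel, not the real implementation.

```python
import numpy as np

def blockwise_attention(Q, K, V, block=64):
    """Tile over K/V, keeping running softmax statistics per query row."""
    n, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros((n, d))
    row_max = np.full(n, -np.inf)   # running max of scores per query
    row_sum = np.zeros(n)           # running softmax denominator

    for start in range(0, n, block):
        Kb, Vb = K[start:start + block], V[start:start + block]
        scores = (Q @ Kb.T) * scale                 # (n, block) tile of scores
        tile_max = scores.max(axis=-1)
        new_max = np.maximum(row_max, tile_max)
        # Rescale previously accumulated output/denominator to the new max.
        correction = np.exp(row_max - new_max)
        p = np.exp(scores - new_max[:, None])
        out = out * correction[:, None] + p @ Vb
        row_sum = row_sum * correction + p.sum(axis=-1)
        row_max = new_max

    return out / row_sum[:, None]

# Check against naive full-matrix attention.
rng = np.random.default_rng(1)
Q, K, V = (rng.normal(size=(256, 32)) for _ in range(3))
scores = (Q @ K.T) / np.sqrt(Q.shape[-1])
w = np.exp(scores - scores.max(axis=-1, keepdims=True))
ref = (w / w.sum(axis=-1, keepdims=True)) @ V
print(np.allclose(blockwise_attention(Q, K, V), ref))  # True
```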
Multi-Query Attention (MQA)
Instead of separate Key and Value heads for each Query head, all Query heads share a single K and V. Reduces the KV cache size dramatically, speeding up inference at the cost of some quality.
Grouped-Query Attention (GQA)
A middle ground between standard Multi-Head Attention and MQA. Groups of Query heads share K/V heads. Maintains most of the quality while still reducing memory.
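A rough sketch of the head layout, using hypothetical projection matrices and toy shapes: with num_kv_heads equal to the number of query heads this is standard multi-head attention, with num_kv_heads = 1 it is MQA, and anything in between is GQA.

```python
import numpy as np

def grouped_query_attention(x, Wq, Wk, Wv, num_q_heads, num_kv_heads):
    """Several query heads share each K/V head (shapes/names illustrative)."""
    n, d_model = x.shape
    d_head = d_model // num_q_heads
    group = num_q_heads // num_kv_heads              # query heads per K/V head

    # Project and split into heads: fewer K/V heads than Q heads.
    Q = (x @ Wq).reshape(n, num_q_heads, d_head)
    K = (x @ Wk).reshape(n, num_kv_heads, d_head)
    V = (x @ Wv).reshape(n, num_kv_heads, d_head)

    outputs = []
    for h in range(num_q_heads):
        kv = h // group                              # which shared K/V head to use
        scores = Q[:, h] @ K[:, kv].T / np.sqrt(d_head)
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)
        outputs.append(w @ V[:, kv])
    return np.concatenate(outputs, axis=-1)          # (n, d_model)

# Toy shapes: 8 query heads sharing 2 K/V heads (a 4:1 grouping).
rng = np.random.default_rng(2)
n, d_model, q_heads, kv_heads = 16, 64, 8, 2
x = rng.normal(size=(n, d_model))
Wq = rng.normal(size=(d_model, d_model))
Wk = rng.normal(size=(d_model, (d_model // q_heads) * kv_heads))
Wv = rng.normal(size=(d_model, (d_model // q_heads) * kv_heads))
print(grouped_query_attention(x, Wq, Wk, Wv, q_heads, kv_heads).shape)  # (16, 64)
```

The practical win is the KV cache: per token, K and V each store num_kv_heads × d_head values instead of num_q_heads × d_head.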
Sliding Window Attention
Each token only attends to a fixed window of nearby tokens (e.g., 4096) rather than the full context. Information propagates through layers, so distant tokens can still influence each other indirectly.
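A small illustrative helper for the causal sliding-window mask (the name and window size are made up for the example); in a real model the False positions would be set to negative infinity before the softmax.

```python
import numpy as np

def sliding_window_mask(seq_len, window):
    """True where token i may attend to token j: itself plus the
    previous (window - 1) tokens."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (j > i - window)

print(sliding_window_mask(6, 3).astype(int))
# [[1 0 0 0 0 0]
#  [1 1 0 0 0 0]
#  [1 1 1 0 0 0]
#  [0 1 1 1 0 0]
#  [0 0 1 1 1 0]
#  [0 0 0 1 1 1]]
```

Stacking layers widens the effective receptive field: after L layers, information can have travelled up to roughly L × window tokens.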
Ring Attention
Distributes the sequence across multiple devices in a ring topology. Each device computes attention for its chunk while passing KV states around the ring, enabling context lengths in the millions.
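A toy simulation of just the communication pattern (no attention math, all names illustrative): each simulated device keeps its query chunk and ends up seeing every key/value chunk once the ring completes, combining partial results with the same online-softmax trick shown in the Flash Attention sketch above.

```python
# Each "device" starts with its own K/V chunk and passes it one hop
# around the ring each step, so after num_devices steps every device
# has attended its queries to every chunk.
num_devices = 4
kv_chunk = list(range(num_devices))   # device d starts with K/V chunk d

for step in range(num_devices):
    for device in range(num_devices):
        print(f"step {step}: device {device} attends its queries to K/V chunk {kv_chunk[device]}")
    # Rotate: device d receives the chunk previously held by device d - 1.
    kv_chunk = [kv_chunk[(d - 1) % num_devices] for d in range(num_devices)]
```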
Key Takeaways
1. Attention enables transformers to capture long-range dependencies
2. The quadratic complexity of attention limits context window size
3. Different attention heads learn to focus on different linguistic patterns
4. Attention visualization can help interpret model behavior