What is Attention?
In the context of neural networks, attention is a mechanism that lets a model focus on specific parts of its input when producing an output. It is the core mechanism that allows transformers to weigh the importance of different parts of the input when generating each output token, enabling the model to "focus" on the most relevant context.
"Attention is all you need."
– The title of the 2017 paper that introduced the Transformer architecture
Interactive Attention Map
[Interactive demo: hover over different words to see the attention weights, i.e. where the model is "looking" to understand that specific word.]
The Three Vectors: Query, Key, and Value
Query
"What am I looking for?" - Represents the current word seeking context.
Key
"What do I contain?" - A label for every word in the sequence to check against the query.
Value
"What information do I offer?" - The actual content that gets passed forward if the Query and Key match.
The model computes a score by taking the dot product of each Query with each Key (scaled by the vector dimension). A softmax turns these scores into weights that determine how much of each Value flows into the output.
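To make this concrete, here is a minimal NumPy sketch of single-head scaled dot-product attention; the names, shapes, and toy data are illustrative, not a production implementation.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Minimal single-head attention.

    Q, K, V: arrays of shape (seq_len, d_k).
    Returns (output, weights), where weights[i, j] is how much token i
    attends to token j.
    """
    d_k = Q.shape[-1]
    # Compare every query against every key (dot product), scaled to keep
    # the scores in a numerically friendly range.
    scores = Q @ K.T / np.sqrt(d_k)                      # (seq_len, seq_len)
    # Softmax turns raw scores into weights that sum to 1 for each query.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each output is a weighted mix of the value vectors.
    return weights @ V, weights

# Toy example: 4 tokens with 8-dimensional projections.
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
out, w = scaled_dot_product_attention(Q, K, V)
print(out.shape, w.shape)  # (4, 8) (4, 4)
```

In a full transformer, Q, K, and V come from learned linear projections of the token embeddings, and this computation is repeated across several heads in parallel.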
Why it Changed Everything
Parallel Processing
Unlike older models (RNNs), Transformers can process all words in a sentence at the same time, making training much faster.
Long-Range Dependencies
Attention can link two words even if they are thousands of tokens apart, as long as they are within the same context window.
Dynamic Context
The model doesn't just look at words; it learns which words are important *for each other* based on the specific sentence.
The Quadratic Problem
Standard attention computes scores between every pair of tokens, resulting in O(n²) complexity. Doubling the context length quadruples memory usage and compute. This is why extending context windows is so challenging.
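A quick back-of-the-envelope calculation (illustrative numbers only, assuming 2-byte activations) shows how the score matrix alone grows:

```python
# The attention score matrix has n * n entries per head, per layer.
bytes_per_entry = 2  # assuming fp16/bf16 activations

for n in (4_096, 8_192, 16_384):
    score_matrix_bytes = n * n * bytes_per_entry
    print(f"context {n:>6}: {score_matrix_bytes / 2**20:8.1f} MiB per head per layer")

# context   4096:     32.0 MiB per head per layer
# context   8192:    128.0 MiB per head per layer
# context  16384:    512.0 MiB per head per layer
```

Each doubling of the context length quadruples the size of this matrix, before even counting the KV cache or the other layers and heads.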
Attention Optimizations
Several techniques have been developed to make attention more efficient, enabling longer contexts and faster inference.
Flash Attention
Rewrites the attention algorithm to be IO-aware, computing attention in blocks that fit in GPU fast memory (SRAM) rather than constantly reading/writing to slow HBM. The math is identical; just smarter memory access patterns.
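The sketch below shows the core idea behind the blocking: process K/V in tiles with a running ("online") softmax, so the full n × n score matrix is never materialized. The block size and names are illustrative; this is a simplified stand-in for the actual fused GPU kernel, not the real implementation.

```python
import numpy as np

def blockwise_attention(Q, K, V, block=64):
    """Tile over K/V, keeping running softmax statistics per query row."""
    n, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros((n, d))
    row_max = np.full(n, -np.inf)   # running max of scores per query
    row_sum = np.zeros(n)           # running softmax denominator

    for start in range(0, n, block):
        Kb, Vb = K[start:start + block], V[start:start + block]
        scores = (Q @ Kb.T) * scale                 # (n, block) tile of scores
        tile_max = scores.max(axis=-1)
        new_max = np.maximum(row_max, tile_max)
        # Rescale previously accumulated output/denominator to the new max.
        correction = np.exp(row_max - new_max)
        p = np.exp(scores - new_max[:, None])
        out = out * correction[:, None] + p @ Vb
        row_sum = row_sum * correction + p.sum(axis=-1)
        row_max = new_max

    return out / row_sum[:, None]

# Check against naive full-matrix attention.
rng = np.random.default_rng(1)
Q, K, V = (rng.normal(size=(256, 32)) for _ in range(3))
scores = (Q @ K.T) / np.sqrt(Q.shape[-1])
w = np.exp(scores - scores.max(axis=-1, keepdims=True))
ref = (w / w.sum(axis=-1, keepdims=True)) @ V
print(np.allclose(blockwise_attention(Q, K, V), ref))  # True
```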
Multi-Query Attention (MQA)
Instead of separate Key and Value heads for each Query head, all Query heads share a single K and V. Reduces the KV cache size dramatically, speeding up inference at the cost of some quality.
Grouped-Query Attention (GQA)
A middle ground between standard Multi-Head Attention and MQA. Groups of Query heads share K/V heads. Maintains most of the quality while still reducing memory.
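A rough sketch of the head layout, using hypothetical projection matrices and toy shapes: with num_kv_heads equal to the number of query heads this is standard multi-head attention, with num_kv_heads = 1 it is MQA, and anything in between is GQA.

```python
import numpy as np

def grouped_query_attention(x, Wq, Wk, Wv, num_q_heads, num_kv_heads):
    """Several query heads share each K/V head (shapes/names illustrative)."""
    n, d_model = x.shape
    d_head = d_model // num_q_heads
    group = num_q_heads // num_kv_heads              # query heads per K/V head

    # Project and split into heads: fewer K/V heads than Q heads.
    Q = (x @ Wq).reshape(n, num_q_heads, d_head)
    K = (x @ Wk).reshape(n, num_kv_heads, d_head)
    V = (x @ Wv).reshape(n, num_kv_heads, d_head)

    outputs = []
    for h in range(num_q_heads):
        kv = h // group                              # which shared K/V head to use
        scores = Q[:, h] @ K[:, kv].T / np.sqrt(d_head)
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)
        outputs.append(w @ V[:, kv])
    return np.concatenate(outputs, axis=-1)          # (n, d_model)

# Toy shapes: 8 query heads sharing 2 K/V heads (a 4:1 grouping).
rng = np.random.default_rng(2)
n, d_model, q_heads, kv_heads = 16, 64, 8, 2
x = rng.normal(size=(n, d_model))
Wq = rng.normal(size=(d_model, d_model))
Wk = rng.normal(size=(d_model, (d_model // q_heads) * kv_heads))
Wv = rng.normal(size=(d_model, (d_model // q_heads) * kv_heads))
print(grouped_query_attention(x, Wq, Wk, Wv, q_heads, kv_heads).shape)  # (16, 64)
```

The practical win is the KV cache: per token, K and V each store num_kv_heads × d_head values instead of num_q_heads × d_head.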
Sliding Window Attention
Each token only attends to a fixed window of nearby tokens (e.g., 4096) rather than the full context. Information propagates through layers, so distant tokens can still influence each other indirectly.
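A small illustrative helper for the causal sliding-window mask (the name and window size are made up for the example); in a real model the False positions would be set to negative infinity before the softmax.

```python
import numpy as np

def sliding_window_mask(seq_len, window):
    """True where token i may attend to token j: itself plus the
    previous (window - 1) tokens."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (j > i - window)

print(sliding_window_mask(6, 3).astype(int))
# [[1 0 0 0 0 0]
#  [1 1 0 0 0 0]
#  [1 1 1 0 0 0]
#  [0 1 1 1 0 0]
#  [0 0 1 1 1 0]
#  [0 0 0 1 1 1]]
```

Stacking layers widens the effective receptive field: after L layers, information can have travelled up to roughly L × window tokens.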
Ring Attention
Distributes the sequence across multiple devices in a ring topology. Each device computes attention for its chunk while passing KV states around the ring, enabling context lengths in the millions.
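A toy simulation of just the communication pattern (no attention math, all names illustrative): each simulated device keeps its query chunk and ends up seeing every key/value chunk once the ring completes, combining partial results with the same online-softmax trick shown in the Flash Attention sketch above.

```python
# Each "device" starts with its own K/V chunk and passes it one hop
# around the ring each step, so after num_devices steps every device
# has attended its queries to every chunk.
num_devices = 4
kv_chunk = list(range(num_devices))   # device d starts with K/V chunk d

for step in range(num_devices):
    for device in range(num_devices):
        print(f"step {step}: device {device} attends its queries to K/V chunk {kv_chunk[device]}")
    # Rotate: device d receives the chunk previously held by device d - 1.
    kv_chunk = [kv_chunk[(d - 1) % num_devices] for d in range(num_devices)]
```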
Key Takeaways
1. Attention enables transformers to capture long-range dependencies
2. The quadratic complexity of attention limits context window size
3. Different attention heads learn to focus on different linguistic patterns
4. Attention visualization can help interpret model behavior