What is the Transformer?
The Transformer is a neural network architecture introduced in the landmark 2017 paper "Attention is All You Need" by Vaswani et al. It replaced recurrent and convolutional approaches with a purely attention-based mechanism, enabling massive parallelization during training and capturing long-range dependencies far more effectively. Nearly every modern large language model -- GPT, BERT, LLaMA, Claude -- is built on the Transformer.
“Attention is all you need.”
-- Vaswani et al., "Attention Is All You Need" (NeurIPS 2017)
LLM Visualization by Brendan Bycroft
The best interactive 3D visualization of transformer internals available. Explore how GPT-style models process tokens through embedding, attention, and feed-forward layers -- step by step, parameter by parameter. Highly recommended.
Transformer Layer Stack
A Transformer is built from a stack of identical layers. Click through each component to understand what it does and how data flows through the architecture.
Layer-by-Layer Explorer
Click each layer to see its role in the architecture
Input Embedding
Converts each input token (an integer ID) into a dense vector of dimension d_model (e.g. 768 or 4096). This learned lookup table is the model's vocabulary representation -- each token (typically a subword rather than a whole word) gets its own high-dimensional vector.
Each token ID maps to a learned dense vector
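The lookup itself is just row indexing into a learned matrix. A minimal sketch in numpy, with illustrative (made-up) sizes and a randomly initialized table standing in for learned weights:

```python
import numpy as np

# Hypothetical sizes for illustration: tiny vocabulary, d_model = 8.
vocab_size, d_model = 100, 8
rng = np.random.default_rng(0)

# The embedding table is a learned (vocab_size, d_model) matrix;
# here it is merely random-initialized, as it would be before training.
embedding_table = rng.standard_normal((vocab_size, d_model))

# A sequence of token IDs selects rows of the table.
token_ids = np.array([5, 42, 7])
embeddings = embedding_table[token_ids]  # shape: (3, 8)
```

Each row of `embeddings` is the vector for the corresponding token ID; gradients flow back into exactly those rows during training.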
Architecture Variants
The original Transformer used both an encoder and decoder. Modern models often use just one. Toggle between the three main variants to see which components each uses.
Encoder-Decoder Variants
Toggle to compare architectures
Example models:
The original architecture. The encoder processes the full input bidirectionally, then the decoder generates output tokens one at a time, attending to the encoder's representations via cross-attention. Used for translation, summarization, and sequence-to-sequence tasks.
Token Dataflow
Follow a single token as it flows through the entire Transformer pipeline, from raw text to output probabilities. Watch how the tensor shape changes at each step.
Step-by-Step Dataflow
Play or scrub through the processing pipeline
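The pipeline above can be traced purely in terms of tensor shapes. The sketch below uses illustrative sizes and a single placeholder linear map in place of a full attention + feed-forward layer; the point is only how the shapes evolve from token IDs to output probabilities:

```python
import numpy as np

# Illustrative sizes (not from any real model).
vocab_size, seq_len, d_model = 50, 4, 16
rng = np.random.default_rng(1)

token_ids = rng.integers(0, vocab_size, size=seq_len)  # (4,)
embed = rng.standard_normal((vocab_size, d_model))
x = embed[token_ids]                                    # (4, 16)

# Each Transformer layer maps (seq_len, d_model) -> (seq_len, d_model);
# a placeholder linear map stands in for attention + feed-forward here.
W_layer = rng.standard_normal((d_model, d_model))
x = x @ W_layer                                         # (4, 16)

# Final projection to vocabulary logits, then softmax to probabilities.
W_out = rng.standard_normal((d_model, vocab_size))
logits = x @ W_out                                      # (4, 50)
probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
probs /= probs.sum(axis=-1, keepdims=True)              # rows sum to 1
```

Note that the sequence dimension is preserved through every layer; only the final projection changes the last dimension, from d_model to vocab_size.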
Key Concepts
Residual Connections
Skip connections that add the input of each sublayer directly to its output. They mitigate the vanishing gradient problem and allow training of very deep networks; without them, transformers with 50+ layers would be extremely difficult to train.
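The pattern is a one-liner: the sublayer computes a correction, and the input is added back in. A sketch (the `tanh` here is just a stand-in for attention or feed-forward):

```python
import numpy as np

def sublayer(x):
    # Placeholder for attention or feed-forward: any map that
    # preserves the shape of x works in this pattern.
    return np.tanh(x)

x = np.array([0.5, -1.0, 2.0])
out = x + sublayer(x)  # the input "skips" around the sublayer
```

Because the identity path passes gradients through unchanged, each layer only needs to learn a residual adjustment rather than a full transformation.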
Layer Normalization
Normalizes activations across the feature dimension to stabilize training. In the original post-norm design it is applied after each residual addition; pre-norm (normalize before the sublayer) has become more common in modern architectures.
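A minimal layer-norm sketch: subtract the per-token mean and divide by the per-token standard deviation over the feature dimension (the learned scale and shift parameters, gamma and beta, are omitted for brevity):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize over the last (feature) axis, per token.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

x = np.array([[1.0, 2.0, 3.0, 4.0]])
y = layer_norm(x)  # each row now has mean ~0 and variance ~1
```

Unlike batch norm, the statistics here depend only on a single token's features, so the operation behaves identically at training and inference time and for any batch size.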
Positional Encoding
Since attention has no inherent notion of order, position must be explicitly injected. The original paper used fixed sinusoidal functions; modern models typically use learned position embeddings or relative position encodings like RoPE.
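The original sinusoidal scheme can be written directly from the paper's formulas, PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)):

```python
import numpy as np

def sinusoidal_encoding(seq_len, d_model):
    # Even feature indices get sines, odd indices get cosines,
    # at wavelengths that grow geometrically with the index.
    pos = np.arange(seq_len)[:, None]          # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]       # (1, d_model/2)
    angles = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_encoding(10, 16)  # one (d_model,) vector per position
```

These vectors are simply added to the token embeddings, giving attention a way to distinguish positions without any learned parameters.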
Why It Matters
The Transformer architecture is arguably the most impactful innovation in AI of the past decade. It unlocked the scaling laws that make modern LLMs possible.
1. The Transformer replaced RNNs and LSTMs by enabling full parallelization during training, reducing training time from weeks to days
2. Its attention mechanism captures long-range dependencies that sequential models struggled with, enabling understanding of entire documents
3. The architecture scales remarkably well -- from the ~100M parameter BERT to GPT-4 (reportedly over a trillion parameters), performance improves predictably with scale
4. Every major LLM today (GPT, Claude, Gemini, LLaMA, Mistral) is built on the Transformer, making it the foundation of modern AI