Transformer Architecture

Intermediate

The foundational architecture behind GPT, BERT, and nearly all modern large language models.

Last updated: Feb 9, 2026

What is the Transformer?

The Transformer is a neural network architecture introduced in the landmark 2017 paper "Attention Is All You Need" by Vaswani et al. It replaced recurrent and convolutional approaches with a purely attention-based mechanism, enabling massive parallelization during training and capturing long-range dependencies far more effectively. Nearly every modern large language model -- GPT, BERT, LLaMA, Claude -- is built on the Transformer.

“Attention is all you need.”

-- Vaswani et al., "Attention Is All You Need" (2017, Google Brain)

LLM Visualization by Brendan Bycroft

The best interactive 3D visualization of transformer internals available. Explore how GPT-style models process tokens through embedding, attention, and feed-forward layers -- step by step, parameter by parameter. Highly recommended.

3D Interactive · By Brendan Bycroft · bbycroft.net/llm

Transformer Layer Stack

A Transformer is built from a stack of identical layers. The components below explain what each one does and how data flows through the architecture.

Layer-by-Layer Explorer


The attention and feed-forward layers repeat N times

Input Embedding

Converts each input token (an integer ID) into a dense vector of dimension d_model (e.g. 768 or 4096). This learned lookup table is the model's vocabulary representation -- each token gets a unique high-dimensional vector.

Example: the tokens "The", "cat", "sat" -- each token ID maps to a learned dense vector.
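As a minimal sketch, the embedding step is just row indexing into a learned matrix. The vocabulary size, dimensions, and token IDs below are made up for illustration:

```python
import numpy as np

# Illustrative sizes; real models use e.g. vocab_size=50k, d_model=768+.
vocab_size, d_model = 1000, 8
rng = np.random.default_rng(0)

# The embedding table is a learned [vocab_size, d_model] matrix;
# here it is randomly initialized just to show the lookup.
embedding_table = rng.normal(size=(vocab_size, d_model))

# Hypothetical token IDs for "The cat sat" under a made-up tokenizer.
token_ids = np.array([12, 347, 91])

# Embedding is a row lookup: [seq_len] -> [seq_len, d_model].
embedded = embedding_table[token_ids]
print(embedded.shape)  # (3, 8)
```

During training, gradients flow back into the rows of the table, which is how these vectors become meaningful.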

Architecture Variants

The original Transformer used both an encoder and a decoder. Modern models often use just one. The three main variants are compared below, showing which components each uses.

Encoder-Decoder Variants


Encoder: Self-Attention → Add & Norm → FFN → Add & Norm
Decoder: Masked Self-Attention → Add & Norm → Cross-Attention → Add & Norm → FFN → Add & Norm

Example models:

T5, BART, mBART

The original architecture. The encoder processes the full input bidirectionally, then the decoder generates output tokens one at a time, attending to the encoder's representations via cross-attention. Used for translation, summarization, and sequence-to-sequence tasks.
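The decoder's "masked" self-attention can be sketched with a causal mask: scores for future positions are set to negative infinity before the softmax, so position i can only attend to positions up to i. A minimal NumPy illustration (not any model's actual implementation):

```python
import numpy as np

seq_len = 4
scores = np.zeros((seq_len, seq_len))  # placeholder attention scores

# Upper-triangular mask marks "future" positions (j > i).
mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
scores[mask] = -np.inf  # exp(-inf) = 0, so these get zero weight

# Row-wise softmax: each row sums to 1 over the visible positions.
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)

# Row 0 attends only to itself; row 3 attends to all four positions.
print(weights)
```

This masking is what lets the decoder be trained on all positions in parallel while still generating autoregressively at inference time.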

Token Dataflow

Follow a single token as it flows through the entire Transformer pipeline, from raw text to output probabilities, and note how the tensor shape changes at each step.

Step-by-Step Dataflow


1. Tokenize: raw text is split into token IDs using BPE or a similar scheme; each token maps to an integer. Shape: [batch, seq_len]
2. Embed: token IDs are looked up in the embedding table. Shape: [batch, seq_len, d_model]
3. Add Position: positional information is added to each embedding. Shape: [batch, seq_len, d_model]
4. Compute Attention: each head scores every pair of positions against each other. Shape: [batch, heads, seq_len, seq_len]
5. Attention Output: the heads' outputs are concatenated and projected back to the model dimension. Shape: [batch, seq_len, d_model]
6. Feed-Forward: a position-wise MLP expands each vector to the hidden dimension. Shape: [batch, seq_len, d_ff]
7. FFN Output: the MLP projects back down to the model dimension. Shape: [batch, seq_len, d_model]
8. Output Logits: a final linear layer produces a score for every vocabulary token. Shape: [batch, seq_len, vocab_size]
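The shape changes above can be traced with a toy NumPy sketch. All sizes are illustrative, and random matrices stand in for learned weights:

```python
import numpy as np

batch, seq_len, d_model, heads, d_ff, vocab = 2, 5, 16, 4, 64, 100
d_head = d_model // heads
rng = np.random.default_rng(0)

ids = rng.integers(0, vocab, size=(batch, seq_len))   # 1. Tokenize
x = rng.normal(size=(vocab, d_model))[ids]            # 2. Embed
x = x + rng.normal(size=(seq_len, d_model))           # 3. Add position

# 4. Attention scores: one [seq_len, seq_len] map per head.
q = k = v = x.reshape(batch, seq_len, heads, d_head).transpose(0, 2, 1, 3)
scores = q @ k.transpose(0, 1, 3, 2) / np.sqrt(d_head)
assert scores.shape == (batch, heads, seq_len, seq_len)

# 5. Attention output, heads merged back to [batch, seq_len, d_model].
w = np.exp(scores) / np.exp(scores).sum(-1, keepdims=True)
attn = (w @ v).transpose(0, 2, 1, 3).reshape(batch, seq_len, d_model)

# 6-7. Feed-forward: expand to d_ff, then project back to d_model.
h = np.maximum(attn @ rng.normal(size=(d_model, d_ff)), 0)
out = h @ rng.normal(size=(d_ff, d_model))

# 8. Logits over the vocabulary.
logits = out @ rng.normal(size=(d_model, vocab))
print(logits.shape)  # (2, 5, 100)
```

Residual connections, layer normalization, and the learned Q/K/V projections are omitted here to keep the shape bookkeeping readable.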

Key Concepts

Residual Connections

Skip connections that add the input of each sublayer directly to its output. They mitigate the vanishing gradient problem and allow training of very deep networks. Without them, transformers with 50+ layers would be extremely difficult to train.

Layer Normalization

Normalizes activations across the feature dimension to stabilize training. Applied after each residual addition. Pre-norm (normalize before sublayer) has become more common than post-norm in modern architectures.
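A minimal sketch of how residuals and layer normalization combine, contrasting post-norm with pre-norm. The sublayer here is a trivial stand-in for attention or the FFN:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize across the feature dimension (last axis).
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def sublayer(x):
    return 0.5 * x  # stand-in for attention or feed-forward

x = np.random.default_rng(0).normal(size=(3, 8))

# Post-norm (original paper): normalize after the residual addition.
post = layer_norm(x + sublayer(x))

# Pre-norm (common in modern models): normalize before the sublayer,
# leaving the residual path itself untouched.
pre = x + sublayer(layer_norm(x))

assert post.shape == pre.shape == (3, 8)
```

The pre-norm arrangement keeps an unnormalized identity path from input to output, which is part of why it tends to train more stably at depth.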

Positional Encoding

Since attention has no inherent notion of order, position must be explicitly injected. The original paper used fixed sinusoidal functions; modern models typically use learned position embeddings or relative position encodings like RoPE.
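The original sinusoidal scheme can be written directly from the paper's formula, PE[pos, 2i] = sin(pos / 10000^(2i/d_model)) and PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model)); a short sketch:

```python
import numpy as np

def sinusoidal_positions(seq_len, d_model):
    # One row per position, alternating sin/cos across feature pairs.
    pos = np.arange(seq_len)[:, None]            # [seq_len, 1]
    i = np.arange(d_model // 2)[None, :]         # [1, d_model/2]
    angle = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angle)   # even dimensions
    pe[:, 1::2] = np.cos(angle)   # odd dimensions
    return pe

pe = sinusoidal_positions(seq_len=50, d_model=16)
print(pe.shape)  # (50, 16)
```

Because the wavelengths form a geometric progression, nearby positions get similar encodings while distant ones diverge, and the encoding extends to any sequence length without learned parameters.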

Why It Matters

The Transformer architecture is arguably the most impactful innovation in AI of the past decade. It unlocked the scaling laws that make modern LLMs possible.

  1. The Transformer replaced RNNs and LSTMs by enabling full parallelization during training, reducing training time from weeks to days
  2. Its attention mechanism captures long-range dependencies that sequential models struggled with, enabling understanding of entire documents
  3. The architecture scales remarkably well -- from ~100M-parameter BERT to GPT-4, reported to be well over a trillion parameters, performance improves predictably with scale
  4. Every major LLM today (GPT, Claude, Gemini, LLaMA, Mistral) is built on the Transformer, making it the foundation of modern AI