Transformer Architecture

Intermediate

The foundational architecture behind GPT, BERT, and nearly all modern large language models.

Last updated: Feb 9, 2026

What is the Transformer?

The Transformer is a neural network architecture introduced in the landmark 2017 paper "Attention Is All You Need" by Vaswani et al. It replaced recurrent and convolutional approaches with a purely attention-based mechanism, enabling massive parallelization during training and capturing long-range dependencies far more effectively. Nearly every modern large language model -- GPT, BERT, LLaMA, Claude -- is built on the Transformer.

“Attention is all you need.”

-- Vaswani et al., "Attention Is All You Need" (2017, Google Brain)

LLM Visualization by Brendan Bycroft

The best interactive 3D visualization of transformer internals available. Explore how GPT-style models process tokens through embedding, attention, and feed-forward layers -- step by step, parameter by parameter. Highly recommended.

3D Interactive · By Brendan Bycroft · bbycroft.net/llm

Transformer Layer Stack

A Transformer is built from a stack of identical layers. The components below explain what each one does and how data flows through the architecture.

Layer-by-Layer Explorer


The attention and feed-forward layers repeat N times

Input Embedding

Converts each input token (an integer ID) into a dense vector of dimension d_model (e.g. 768 or 4096). This learned lookup table is the model's vocabulary representation -- each token gets a unique high-dimensional vector.

Example: the tokens "The", "cat", "sat" -- each token ID maps to a learned dense vector.
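As a minimal sketch, the embedding step is just row indexing into a learned matrix. The vocabulary size, dimensions, and token IDs below are made up for illustration:

```python
import numpy as np

# Illustrative sizes; real models use e.g. vocab_size=50k, d_model=768+.
vocab_size, d_model = 1000, 8
rng = np.random.default_rng(0)

# The embedding table is a learned [vocab_size, d_model] matrix;
# here it is randomly initialized just to show the lookup.
embedding_table = rng.normal(size=(vocab_size, d_model))

# Hypothetical token IDs for "The cat sat" under a made-up tokenizer.
token_ids = np.array([12, 347, 91])

# Embedding is a row lookup: [seq_len] -> [seq_len, d_model].
embedded = embedding_table[token_ids]
print(embedded.shape)  # (3, 8)
```

During training, gradients flow back into the rows of the table, which is how these vectors become meaningful.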

Architecture Variants

The original Transformer used both an encoder and a decoder. Modern models often use just one. The three main variants are compared below, showing which components each uses.

Encoder-Decoder Variants


Encoder: Self-Attention → Add & Norm → FFN → Add & Norm
Decoder: Masked Self-Attention → Add & Norm → Cross-Attention → Add & Norm → FFN → Add & Norm

Example models:

T5, BART, mBART

The original architecture. The encoder processes the full input bidirectionally, then the decoder generates output tokens one at a time, attending to the encoder's representations via cross-attention. Used for translation, summarization, and sequence-to-sequence tasks.
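The decoder's "masked" self-attention can be sketched with a causal mask: scores for future positions are set to negative infinity before the softmax, so position i can only attend to positions up to i. A minimal NumPy illustration (not any model's actual implementation):

```python
import numpy as np

seq_len = 4
scores = np.zeros((seq_len, seq_len))  # placeholder attention scores

# Upper-triangular mask marks "future" positions (j > i).
mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
scores[mask] = -np.inf  # exp(-inf) = 0, so these get zero weight

# Row-wise softmax: each row sums to 1 over the visible positions.
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)

# Row 0 attends only to itself; row 3 attends to all four positions.
print(weights)
```

This masking is what lets the decoder be trained on all positions in parallel while still generating autoregressively at inference time.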

Token Dataflow

Follow a single token as it flows through the entire Transformer pipeline, from raw text to output probabilities, and note how the tensor shape changes at each step.

Step-by-Step Dataflow


1. Tokenize: raw text is split into token IDs using BPE or a similar scheme; each token maps to an integer. Shape: [batch, seq_len]
2. Embed: token IDs are looked up in the embedding table. Shape: [batch, seq_len, d_model]
3. Add Position: positional information is added to each embedding. Shape: [batch, seq_len, d_model]
4. Compute Attention: each head scores every pair of positions against each other. Shape: [batch, heads, seq_len, seq_len]
5. Attention Output: the heads' outputs are concatenated and projected back to the model dimension. Shape: [batch, seq_len, d_model]
6. Feed-Forward: a position-wise MLP expands each vector to the hidden dimension. Shape: [batch, seq_len, d_ff]
7. FFN Output: the MLP projects back down to the model dimension. Shape: [batch, seq_len, d_model]
8. Output Logits: a final linear layer produces a score for every vocabulary token. Shape: [batch, seq_len, vocab_size]
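The shape changes above can be traced with a toy NumPy sketch. All sizes are illustrative, and random matrices stand in for learned weights:

```python
import numpy as np

batch, seq_len, d_model, heads, d_ff, vocab = 2, 5, 16, 4, 64, 100
d_head = d_model // heads
rng = np.random.default_rng(0)

ids = rng.integers(0, vocab, size=(batch, seq_len))   # 1. Tokenize
x = rng.normal(size=(vocab, d_model))[ids]            # 2. Embed
x = x + rng.normal(size=(seq_len, d_model))           # 3. Add position

# 4. Attention scores: one [seq_len, seq_len] map per head.
q = k = v = x.reshape(batch, seq_len, heads, d_head).transpose(0, 2, 1, 3)
scores = q @ k.transpose(0, 1, 3, 2) / np.sqrt(d_head)
assert scores.shape == (batch, heads, seq_len, seq_len)

# 5. Attention output, heads merged back to [batch, seq_len, d_model].
w = np.exp(scores) / np.exp(scores).sum(-1, keepdims=True)
attn = (w @ v).transpose(0, 2, 1, 3).reshape(batch, seq_len, d_model)

# 6-7. Feed-forward: expand to d_ff, then project back to d_model.
h = np.maximum(attn @ rng.normal(size=(d_model, d_ff)), 0)
out = h @ rng.normal(size=(d_ff, d_model))

# 8. Logits over the vocabulary.
logits = out @ rng.normal(size=(d_model, vocab))
print(logits.shape)  # (2, 5, 100)
```

Residual connections, layer normalization, and the learned Q/K/V projections are omitted here to keep the shape bookkeeping readable.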

Key Concepts

Residual Connections

Skip connections that add the input of each sublayer directly to its output. They mitigate the vanishing gradient problem and allow training of very deep networks. Without them, transformers with 50+ layers would be extremely difficult to train.

Layer Normalization

Normalizes activations across the feature dimension to stabilize training. Applied after each residual addition. Pre-norm (normalize before sublayer) has become more common than post-norm in modern architectures.
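A minimal sketch of how residuals and layer normalization combine, contrasting post-norm with pre-norm. The sublayer here is a trivial stand-in for attention or the FFN:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize across the feature dimension (last axis).
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def sublayer(x):
    return 0.5 * x  # stand-in for attention or feed-forward

x = np.random.default_rng(0).normal(size=(3, 8))

# Post-norm (original paper): normalize after the residual addition.
post = layer_norm(x + sublayer(x))

# Pre-norm (common in modern models): normalize before the sublayer,
# leaving the residual path itself untouched.
pre = x + sublayer(layer_norm(x))

assert post.shape == pre.shape == (3, 8)
```

The pre-norm arrangement keeps an unnormalized identity path from input to output, which is part of why it tends to train more stably at depth.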

Positional Encoding

Since attention has no inherent notion of order, position must be explicitly injected. The original paper used fixed sinusoidal functions; modern models typically use learned position embeddings or relative position encodings like RoPE.
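The original sinusoidal scheme can be written directly from the paper's formula, PE[pos, 2i] = sin(pos / 10000^(2i/d_model)) and PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model)); a short sketch:

```python
import numpy as np

def sinusoidal_positions(seq_len, d_model):
    # One row per position, alternating sin/cos across feature pairs.
    pos = np.arange(seq_len)[:, None]            # [seq_len, 1]
    i = np.arange(d_model // 2)[None, :]         # [1, d_model/2]
    angle = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angle)   # even dimensions
    pe[:, 1::2] = np.cos(angle)   # odd dimensions
    return pe

pe = sinusoidal_positions(seq_len=50, d_model=16)
print(pe.shape)  # (50, 16)
```

Because the wavelengths form a geometric progression, nearby positions get similar encodings while distant ones diverge, and the encoding extends to any sequence length without learned parameters.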

Why It Matters

The Transformer architecture is arguably the most impactful innovation in AI of the past decade. It unlocked the scaling laws that make modern LLMs possible.

  1. The Transformer replaced RNNs and LSTMs by enabling full parallelization during training, reducing training time from weeks to days
  2. Its attention mechanism captures long-range dependencies that sequential models struggled with, enabling understanding of entire documents
  3. The architecture scales remarkably well -- from ~100M-parameter BERT to GPT-4, reported to be well over a trillion parameters, performance improves predictably with scale
  4. Every major LLM today (GPT, Claude, Gemini, LLaMA, Mistral) is built on the Transformer, making it the foundation of modern AI