Tokenization

How LLMs break text into tokens—the fundamental units of language understanding.

What is Tokenization?

Tokenization is the process of converting raw text into a sequence of tokens—the basic units that LLMs process. Tokens can be words, subwords, or even individual characters, depending on the tokenizer.
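
As a concrete example, the sketch below uses OpenAI's tiktoken library (one tokenizer implementation among many, assumed here purely for illustration) to encode a sentence into token IDs and decode each ID back into its text piece:

```python
# Minimal sketch using the tiktoken library (pip install tiktoken); other models
# ship their own tokenizers, so treat this as one convenient example.
import tiktoken

enc = tiktoken.get_encoding("o200k_base")        # tokenizer used by GPT-4o-class models

text = "The quick brown fox jumps over the lazy dog."
token_ids = enc.encode(text)                     # text -> list of integer token IDs
pieces = [enc.decode([i]) for i in token_ids]    # decode each ID back to its text piece

print(token_ids)
print(pieces)    # note how spaces attach to the start of the following word
print(f"{len(text)} characters -> {len(token_ids)} tokens")
```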

Why Tokenization Matters

Understanding tokenization is crucial because it directly impacts context limits, costs, and model behavior. The same text can have very different token counts across different models.
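
To make this concrete, the following sketch (again assuming tiktoken is available) counts tokens for the same sentence under three different encodings; larger, newer vocabularies generally need fewer tokens, though the exact numbers depend on the text:

```python
import tiktoken

text = "Tokenization affects cost and context limits."

# Compare token counts for the same text under different tokenizer vocabularies.
for name in ["gpt2", "cl100k_base", "o200k_base"]:
    enc = tiktoken.get_encoding(name)
    print(f"{name:12s} {len(enc.encode(text))} tokens")
```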

How It Works

Most modern LLMs use subword tokenization methods such as Byte Pair Encoding (BPE), often implemented with toolkits like SentencePiece. These methods learn a vocabulary of frequently occurring character sequences from a training corpus, so common strings map to single tokens.

Byte Pair Encoding (BPE)

BPE starts from individual characters (or bytes) and iteratively merges the most frequent adjacent pair into a single new token. Common words end up as single tokens, while rare words are split into subwords.
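
As a rough illustration of how these merges are learned, here is a toy version of the classic character-level BPE training loop. The corpus, the merge count, and starting from characters rather than bytes are simplifications for illustration, not how production tokenizers are actually trained:

```python
from collections import Counter

def pair_counts(vocab):
    """Count adjacent symbol pairs across the corpus, weighted by word frequency."""
    counts = Counter()
    for symbols, freq in vocab:
        for a, b in zip(symbols, symbols[1:]):
            counts[(a, b)] += freq
    return counts

def apply_merge(vocab, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = []
    for symbols, freq in vocab:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged.append((out, freq))
    return merged

# Toy corpus: word frequencies; every word starts as a list of characters.
corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
vocab = [(list(word), freq) for word, freq in corpus.items()]

merges = []
for _ in range(10):                 # learn 10 merge rules
    counts = pair_counts(vocab)
    if not counts:
        break
    best = max(counts, key=counts.get)
    vocab = apply_merge(vocab, best)
    merges.append(best)

print(merges)                             # learned merge rules, in order
print([symbols for symbols, _ in vocab])  # words after merging
```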

Token Types

Whole Words

Common words like "the", "and", "is" are often single tokens.

Subwords

Less common words are split: "unhappiness" → "un" + "happiness".

Special Tokens

Markers like <|endoftext|> or [CLS] are reserved tokens used for model control, such as marking document boundaries.
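
These distinctions are easy to observe with a tokenizer library. The sketch below assumes tiktoken and its o200k_base encoding; exactly where a word like "unhappiness" splits depends on the vocabulary:

```python
import tiktoken

enc = tiktoken.get_encoding("o200k_base")

# A common word is typically a single token.
print(enc.encode("the"))                          # likely one ID

# A rarer word is split into subword pieces; the exact split depends on the vocabulary.
ids = enc.encode("unhappiness")
print([enc.decode([i]) for i in ids])

# Special tokens are reserved and must be explicitly allowed when encoding.
print(enc.encode("<|endoftext|>", allowed_special={"<|endoftext|>"}))
```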

Interactive Demo

[Interactive widget: type text, e.g. "The quick brown fox jumps over the lazy dog.", to see how it is tokenized with the o200k_base (GPT-4o / GPT-4.1) tokenizer, along with token and character counts, tokens per character, and a color-coded breakdown showing that common tokens stay as single pieces while rare words get split into subwords.]

Cost Implications

API pricing is typically charged per token, for both input and output. Token-efficient prompts cost less and leave more of the context window free.
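
A back-of-the-envelope estimate follows directly from the token count. The per-token price below is a made-up placeholder, since real rates vary by model and provider:

```python
import tiktoken

PRICE_PER_1M_INPUT_TOKENS = 2.50   # USD; hypothetical placeholder, not a real rate

def estimate_prompt_cost(prompt: str, encoding_name: str = "o200k_base") -> float:
    """Estimate input cost for a prompt from its token count."""
    enc = tiktoken.get_encoding(encoding_name)
    n_tokens = len(enc.encode(prompt))
    return n_tokens / 1_000_000 * PRICE_PER_1M_INPUT_TOKENS

print(f"${estimate_prompt_cost('Summarize the following report in three bullet points.'):.6f}")
```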

Key Takeaways

  • Tokens are the atomic units LLMs process—not characters or words
  • Different models have different tokenizers and vocabularies
  • Non-English text and code often use more tokens than English
  • Token count directly affects cost and context window usage