Tokenization

How LLMs break text into tokens—the fundamental units of language understanding.

What is Tokenization?

Tokenization is the process of converting raw text into a sequence of tokens—the basic units that LLMs process. Tokens can be words, subwords, or even individual characters, depending on the tokenizer.
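
As a concrete example, the sketch below uses OpenAI's tiktoken library (one tokenizer implementation among many, assumed here purely for illustration) to encode a sentence into token IDs and decode each ID back into its text piece:

```python
# Minimal sketch using the tiktoken library (pip install tiktoken); other models
# ship their own tokenizers, so treat this as one convenient example.
import tiktoken

enc = tiktoken.get_encoding("o200k_base")        # tokenizer used by GPT-4o-class models

text = "The quick brown fox jumps over the lazy dog."
token_ids = enc.encode(text)                     # text -> list of integer token IDs
pieces = [enc.decode([i]) for i in token_ids]    # decode each ID back to its text piece

print(token_ids)
print(pieces)    # note how spaces attach to the start of the following word
print(f"{len(text)} characters -> {len(token_ids)} tokens")
```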

Why Tokenization Matters

Understanding tokenization is crucial because it directly impacts context limits, costs, and model behavior. The same text can have very different token counts across different models.
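
To make this concrete, the following sketch (again assuming tiktoken is available) counts tokens for the same sentence under three different encodings; larger, newer vocabularies generally need fewer tokens, though the exact numbers depend on the text:

```python
import tiktoken

text = "Tokenization affects cost and context limits."

# Compare token counts for the same text under different tokenizer vocabularies.
for name in ["gpt2", "cl100k_base", "o200k_base"]:
    enc = tiktoken.get_encoding(name)
    print(f"{name:12s} {len(enc.encode(text))} tokens")
```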

How It Works

Most modern LLMs use subword tokenization methods such as Byte Pair Encoding (BPE), often implemented with toolkits like SentencePiece. These methods learn a vocabulary of frequently occurring character sequences from a training corpus, so common strings map to single tokens.

Byte Pair Encoding (BPE)

BPE starts from individual characters (or bytes) and iteratively merges the most frequent adjacent pair into a single new token. Common words end up as single tokens, while rare words are split into subwords.
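
As a rough illustration of how these merges are learned, here is a toy version of the classic character-level BPE training loop. The corpus, the merge count, and starting from characters rather than bytes are simplifications for illustration, not how production tokenizers are actually trained:

```python
from collections import Counter

def pair_counts(vocab):
    """Count adjacent symbol pairs across the corpus, weighted by word frequency."""
    counts = Counter()
    for symbols, freq in vocab:
        for a, b in zip(symbols, symbols[1:]):
            counts[(a, b)] += freq
    return counts

def apply_merge(vocab, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = []
    for symbols, freq in vocab:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged.append((out, freq))
    return merged

# Toy corpus: word frequencies; every word starts as a list of characters.
corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
vocab = [(list(word), freq) for word, freq in corpus.items()]

merges = []
for _ in range(10):                 # learn 10 merge rules
    counts = pair_counts(vocab)
    if not counts:
        break
    best = max(counts, key=counts.get)
    vocab = apply_merge(vocab, best)
    merges.append(best)

print(merges)                             # learned merge rules, in order
print([symbols for symbols, _ in vocab])  # words after merging
```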

Token Types

Whole Words

Common words like "the", "and", "is" are often single tokens.

Subwords

Less common words are split: "unhappiness" → "un" + "happiness".

Special Tokens

Markers like <|endoftext|> or [CLS] are reserved tokens used for model control, such as marking document boundaries.
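
These distinctions are easy to observe with a tokenizer library. The sketch below assumes tiktoken and its o200k_base encoding; exactly where a word like "unhappiness" splits depends on the vocabulary:

```python
import tiktoken

enc = tiktoken.get_encoding("o200k_base")

# A common word is typically a single token.
print(enc.encode("the"))                          # likely one ID

# A rarer word is split into subword pieces; the exact split depends on the vocabulary.
ids = enc.encode("unhappiness")
print([enc.decode([i]) for i in ids])

# Special tokens are reserved and must be explicitly allowed when encoding.
print(enc.encode("<|endoftext|>", allowed_special={"<|endoftext|>"}))
```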

Interactive Demo

[Interactive widget: type text, e.g. "The quick brown fox jumps over the lazy dog.", to see how it is tokenized with the o200k_base (GPT-4o / GPT-4.1) tokenizer, along with token and character counts, tokens per character, and a color-coded breakdown showing that common tokens stay as single pieces while rare words get split into subwords.]

Cost Implications

API pricing is typically charged per token, for both input and output. Token-efficient prompts cost less and leave more of the context window free.
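
A back-of-the-envelope estimate follows directly from the token count. The per-token price below is a made-up placeholder, since real rates vary by model and provider:

```python
import tiktoken

PRICE_PER_1M_INPUT_TOKENS = 2.50   # USD; hypothetical placeholder, not a real rate

def estimate_prompt_cost(prompt: str, encoding_name: str = "o200k_base") -> float:
    """Estimate input cost for a prompt from its token count."""
    enc = tiktoken.get_encoding(encoding_name)
    n_tokens = len(enc.encode(prompt))
    return n_tokens / 1_000_000 * PRICE_PER_1M_INPUT_TOKENS

print(f"${estimate_prompt_cost('Summarize the following report in three bullet points.'):.6f}")
```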

Key Takeaways

  • Tokens are the atomic units LLMs process—not characters or words
  • Different models have different tokenizers and vocabularies
  • Non-English text and code often use more tokens than English
  • Token count directly affects cost and context window usage