What is Tokenization?
Tokenization is the process of converting raw text into a sequence of tokens—the basic units that LLMs process. Tokens can be words, subwords, or even individual characters, depending on the tokenizer.
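As a concrete illustration, here is a minimal sketch using the tiktoken library (an assumption; other models ship their own tokenizers) that turns a sentence into token IDs and back:

```python
# Minimal sketch with tiktoken (assumed installed: pip install tiktoken).
import tiktoken

# Load one of tiktoken's built-in encodings; other models use other vocabularies.
enc = tiktoken.get_encoding("cl100k_base")

text = "Tokenization converts raw text into tokens."
token_ids = enc.encode(text)                    # list of integer token IDs
pieces = [enc.decode([t]) for t in token_ids]   # the text fragment each token represents

print(token_ids)                       # a list of integers
print(pieces)                          # e.g. ['Token', 'ization', ' converts', ...]
print(enc.decode(token_ids) == text)   # decoding round-trips to the original text -> True
```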
Why Tokenization Matters
Understanding tokenization is crucial because it directly impacts context limits, costs, and model behavior. The same text can have very different token counts across different models.
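One way to see this is to count tokens for the same string under two different vocabularies. The sketch below assumes tiktoken and two of its built-in encodings, "gpt2" and "cl100k_base"; the exact counts will differ, but the gap between encodings is the point:

```python
# Sketch: the same text produces different token counts under different vocabularies.
import tiktoken

text = "Tokenization überrascht: 日本語 and emoji 🙂 tokenize very differently."

for name in ["gpt2", "cl100k_base"]:
    enc = tiktoken.get_encoding(name)
    print(name, len(enc.encode(text)))   # same text, different token counts
```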
How It Works
Most modern LLMs use subword tokenization algorithms like BPE (Byte Pair Encoding) or SentencePiece. These algorithms learn common character sequences from training data.
Byte Pair Encoding (BPE)
BPE iteratively merges the most frequent character pairs into single tokens. Common words become single tokens, while rare words are split into subwords.
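The sketch below shows the core merge loop on a toy corpus. It is a simplification: real BPE tokenizers typically operate on bytes, weight words by corpus frequency, and train on far more data.

```python
# Toy sketch of BPE training: repeatedly merge the most frequent adjacent symbol pair.
from collections import Counter

def bpe_train(words, num_merges):
    corpus = [list(w) for w in words]   # each word starts as a list of characters
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair across the corpus.
        pairs = Counter()
        for symbols in corpus:
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        best = max(pairs, key=pairs.get)    # most frequent pair
        merges.append(best)
        merged = best[0] + best[1]
        # Replace every occurrence of the best pair with the merged symbol.
        new_corpus = []
        for symbols in corpus:
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_corpus.append(out)
        corpus = new_corpus
    return merges, corpus

merges, corpus = bpe_train(["low", "lower", "lowest", "low"], num_merges=3)
print(merges)   # learned merges, e.g. [('l', 'o'), ('lo', 'w'), ('low', 'e')]
print(corpus)   # words rebuilt from progressively larger subword symbols
```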
Token Types
Whole Words
Common words like "the", "and", "is" are often single tokens.
Subwords
Less common words are split: "unhappiness" → "un" + "happiness".
Special Tokens
Markers like <|endoftext|> or [CLS] for model control.
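The short sketch below probes all three token types with tiktoken's cl100k_base encoding (an assumption; exact splits and IDs differ between vocabularies):

```python
# Sketch: common words, rarer words, and a special token under one vocabulary.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for word in ["the", "and", "unhappiness", "antidisestablishmentarianism"]:
    ids = enc.encode(word)
    print(word, "->", [enc.decode([i]) for i in ids])   # exact splits vary by vocabulary

# Special tokens are refused in ordinary text by default and must be allowed explicitly.
ids = enc.encode("<|endoftext|>", allowed_special={"<|endoftext|>"})
print("<|endoftext|> ->", ids)   # a single reserved token ID
```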
Interactive Demo
Type text to see how it gets tokenized. For the sample sentence "The quick brown fox jumps over the lazy dog.", the demo reports the token count, the character count, and tokens per character, plus a token breakdown: common words appear as single tokens, while rare words get split into subwords.
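A rough equivalent of the demo's metrics can be computed in a few lines, again assuming tiktoken:

```python
# Sketch of the metrics the demo reports.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
text = "The quick brown fox jumps over the lazy dog."

tokens = enc.encode(text)
print("Tokens:", len(tokens))
print("Characters:", len(text))
print("Tokens per character:", round(len(tokens) / len(text), 2))
print("Breakdown:", [enc.decode([t]) for t in tokens])
```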
Cost Implications
API pricing is typically per token, charged on both input and output, so trimming unnecessary tokens from prompts directly reduces cost.
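As a back-of-the-envelope estimate (the price below is a hypothetical placeholder, not a real rate):

```python
# Sketch: estimating the cost of a prompt from its token count.
import tiktoken

PRICE_PER_1K_INPUT_TOKENS = 0.0005   # hypothetical USD figure; check your provider's rates

enc = tiktoken.get_encoding("cl100k_base")
prompt = "Summarize the following meeting notes in three bullet points: ..."

n_tokens = len(enc.encode(prompt))
estimated_cost = n_tokens / 1000 * PRICE_PER_1K_INPUT_TOKENS
print(f"{n_tokens} input tokens, estimated cost ${estimated_cost:.6f}")
```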
Key Takeaways
1. Tokens are the atomic units LLMs process, not characters or words
2. Different models have different tokenizers and vocabularies
3. Non-English text and code often use more tokens than English
4. Token count directly affects cost and context window usage