What is Speculative Decoding?
Speculative decoding is an inference optimization technique that accelerates text generation from large language models. Instead of generating tokens one at a time with the large model, a smaller "draft" model quickly proposes multiple candidate tokens, which the larger "target" model then verifies in a single forward pass.
The key insight is that verification is much cheaper than generation. The target model can check multiple tokens in parallel because it processes all positions simultaneously during a forward pass, while autoregressive generation requires one forward pass per token.
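A minimal sketch of this cost asymmetry is shown below. Here target_model is a stand-in that returns random logits (a real implementation would call the actual target network); the point is that one forward pass yields the target's next-token distribution at every draft position at once.

```python
import numpy as np

VOCAB = 1000
rng = np.random.default_rng(0)

def target_model(token_ids):
    """Stand-in for the target model: returns logits for every position.
    Shape: (len(token_ids), VOCAB). One call = one forward pass."""
    return rng.standard_normal((len(token_ids), VOCAB))

prompt = [1, 2, 3]
draft = [10, 11, 12, 13]           # K = 4 candidate tokens from a draft model

# Autoregressive generation: one forward pass per new token (K passes total).
seq = list(prompt)
for _ in range(len(draft)):
    logits = target_model(seq)      # a full pass just to read the last position
    seq.append(int(logits[-1].argmax()))

# Verification: a single forward pass over prompt + draft gives the target's
# next-token distribution at every draft position simultaneously.
logits = target_model(prompt + draft)
per_position_logits = logits[len(prompt) - 1 : -1]   # one row per draft token
print(per_position_logits.shape)                      # (4, 1000)
```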
The Inference Bottleneck
Standard autoregressive decoding is inherently slow because each token depends on all previous tokens, forcing sequential generation.
Sequential Dependency
Each new token requires a full forward pass through the model. For a 70B parameter model, generating 100 tokens means 100 separate forward passes.
Memory Bandwidth Bound
LLM inference is often limited by how fast we can load model weights from memory, not by computation. The GPU sits idle waiting for data.
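A rough back-of-the-envelope calculation illustrates the bound. The numbers below are illustrative assumptions (a 70B-parameter model in 16-bit precision on a GPU with roughly 2 TB/s of memory bandwidth), not measurements:

```python
# Every decoding step must stream the full set of weights from GPU memory.
params = 70e9                  # 70B-parameter target model (assumed)
bytes_per_param = 2            # fp16 / bf16 weights
hbm_bandwidth = 2e12           # ~2 TB/s HBM bandwidth (assumed)

weight_bytes = params * bytes_per_param              # 140 GB of weights
min_latency_per_token = weight_bytes / hbm_bandwidth
print(f"{min_latency_per_token * 1e3:.0f} ms per token just to load weights")  # ~70 ms
```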
How It Works
Speculative decoding follows a draft-then-verify pattern that exploits the parallel nature of transformer verification.
Draft Generation
A small, fast draft model (e.g., 7B parameters) generates K candidate tokens autoregressively. This is quick because the draft model is small.
Parallel Verification
The target model processes the prompt plus all K draft tokens in a single forward pass, computing probabilities for each position.
Token Acceptance
Each draft token is accepted or rejected by comparing draft and target probabilities. A rejection sampling scheme ensures the output distribution matches the target model exactly.
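Concretely, in the standard scheme from the original speculative decoding papers, a draft token x drawn from the draft distribution q is accepted with probability min(1, p(x) / q(x)), where p is the target distribution at that position; on rejection, the correction token is sampled from the residual distribution proportional to max(0, p - q). This is what guarantees the final outputs are distributed exactly as if the target model had been sampled directly.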
Continue or Resample
Accepted tokens are kept. At the first rejection, the target model samples a correction token. The process repeats from the new position.
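Putting the four steps together, here is a minimal, self-contained sketch of one speculative decoding iteration. Both models are stand-ins that return random distributions; a real implementation would call the actual draft and target networks, reuse KV caches, and batch the verification pass.

```python
import numpy as np

VOCAB = 50
K = 4                                    # number of draft tokens per iteration
rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Stand-ins: map a token sequence to next-token probability distributions.
def draft_probs(seq):
    return softmax(rng.standard_normal(VOCAB))

def target_probs_all(seq):
    """One 'forward pass': next-token distribution for every position in seq."""
    return np.array([softmax(rng.standard_normal(VOCAB)) for _ in seq])

def speculative_step(prompt):
    # 1. Draft: the small model proposes K tokens autoregressively.
    seq = list(prompt)
    drafted, q = [], []
    for _ in range(K):
        dist = draft_probs(seq)
        tok = rng.choice(VOCAB, p=dist)
        drafted.append(tok)
        q.append(dist)
        seq.append(tok)

    # 2. Verify: one target pass over prompt + draft gives p at every position.
    p_all = target_probs_all(seq)

    accepted = []
    for i, tok in enumerate(drafted):
        p = p_all[len(prompt) - 1 + i]           # target dist for this position
        # 3. Accept the draft token with probability min(1, p(tok) / q(tok)).
        if rng.random() < min(1.0, p[tok] / q[i][tok]):
            accepted.append(tok)
        else:
            # 4. On rejection, resample from the residual max(0, p - q), normalized,
            #    and discard the remaining draft tokens.
            residual = np.maximum(p - q[i], 0.0)
            residual /= residual.sum()
            accepted.append(rng.choice(VOCAB, p=residual))
            break
    else:
        # All K drafts accepted: take one bonus token from the target's last position.
        accepted.append(rng.choice(VOCAB, p=p_all[-1]))
    return accepted

print(speculative_step([1, 2, 3]))
```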
Visual Example
[Figure: step-by-step trace of how speculative decoding processes a simple continuation.]
Draft Model Requirements
The choice of draft model significantly impacts the speedup achieved. The ideal draft model balances speed with alignment to the target.
Much Smaller
The draft model should be 5-10x smaller than the target. A 7B draft for a 70B target, or a 1B draft for a 7B target.
Similar Distribution
Higher acceptance rates come from draft models trained on similar data or distilled from the target model.
Same Vocabulary
Draft and target must share the same tokenizer to ensure token-level compatibility during verification (see the quick check sketched below).
Fast Inference
The draft model must be fast enough that drafting K tokens takes less time than K target model forward passes.
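For the shared-vocabulary requirement, a quick sanity check is sketched below using Hugging Face transformers; the model names are placeholders, not recommendations, so substitute your actual checkpoints.

```python
from transformers import AutoTokenizer

# Placeholder model names -- replace with your actual draft and target checkpoints.
draft_tok = AutoTokenizer.from_pretrained("my-org/draft-1b")
target_tok = AutoTokenizer.from_pretrained("my-org/target-70b")

# Token IDs must line up exactly, or verification compares the wrong tokens.
assert draft_tok.get_vocab() == target_tok.get_vocab(), "tokenizers differ"
```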
What Affects Speedup?
Typical speedups range from 2-3x, but several factors influence the actual improvement.
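One common way to reason about the speedup, following the analysis in the original speculative decoding work and assuming each draft token is accepted independently with probability alpha, is to compute the expected number of tokens produced per target forward pass. The sketch below also folds in the relative cost of the draft model; the example numbers are illustrative assumptions.

```python
def expected_tokens_per_target_pass(alpha, k):
    """Expected tokens generated per iteration with per-token acceptance rate
    alpha (< 1) and K drafted tokens: (1 - alpha**(k + 1)) / (1 - alpha)."""
    return (1 - alpha ** (k + 1)) / (1 - alpha)

def estimated_speedup(alpha, k, draft_cost_ratio):
    """Rough speedup vs. plain decoding, where draft_cost_ratio is the cost of
    one draft pass relative to one target pass (assuming verification costs
    about the same as one ordinary target pass)."""
    tokens = expected_tokens_per_target_pass(alpha, k)
    return tokens / (1 + k * draft_cost_ratio)

# e.g. 80% acceptance, K = 4 draft tokens, draft pass costs 5% of a target pass
print(estimated_speedup(0.8, 4, 0.05))   # roughly 2.8x
```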
Variants and Extensions
Researchers have developed several variations to improve upon basic speculative decoding.
Self-Speculative Decoding
Uses early exit from the target model itself as the draft, eliminating the need for a separate draft model.
Medusa
Adds multiple prediction heads to the target model to generate draft tokens in parallel, avoiding sequential draft generation.
Lookahead Decoding
Generates multiple speculation branches in parallel using n-gram patterns from the context, with no draft model needed.
Staged Speculative Decoding
Uses a cascade of increasingly larger draft models for better acceptance rates on difficult tokens.
Limitations
While speculative decoding offers significant speedups, it has important constraints that limit when and how it can be applied effectively.
Draft Model Overhead
You need to run and maintain a separate draft model. This adds memory overhead (the draft model must fit in GPU memory alongside the target) and operational complexity.
Diminishing Returns with Batch Size
Speculative decoding shines for single-sequence inference. With larger batch sizes, the target model becomes compute-bound rather than memory-bound, reducing the benefit.
Variable Speedup
Speedup depends heavily on acceptance rate, which varies by task. Creative writing with high temperature may see little benefit, while structured code generation benefits greatly.
Implementation Complexity
Correct rejection sampling is tricky to implement. Naive implementations can produce outputs that differ from the target model's true distribution.
Not Always Faster
If the draft model is too slow, too inaccurate, or the target model is already fast enough, speculative decoding can actually be slower than standard decoding.
Interactive Simulation
[Interactive demo: the draft model proposes candidate tokens, the target model verifies them in a single pass, and accepted tokens are kept while rejections are replaced with target-model corrections. Higher draft-target alignment means more accepted tokens and better speedup.]
Key Takeaways
1. Speculative decoding uses a small draft model to propose tokens that are verified in parallel by the target model
2. It provides 2-3x speedups while producing identical outputs to standard decoding
3. The technique exploits the fact that transformer verification is parallel while generation is sequential
4. Effectiveness depends on draft-target alignment: similar models and predictable tasks yield higher acceptance rates