Speculative Decoding

A technique to speed up LLM inference by using a small draft model to propose tokens that are verified in parallel by the target model.

What is Speculative Decoding?

Speculative decoding is an inference optimization technique that accelerates text generation from large language models. Instead of generating tokens one at a time with the large model, a smaller "draft" model quickly proposes multiple candidate tokens, which the larger "target" model then verifies in a single forward pass.

The key insight is that verification is much cheaper than generation. The target model can check multiple tokens in parallel because it processes all positions simultaneously during a forward pass, while autoregressive generation requires one forward pass per token.

The Inference Bottleneck

Standard autoregressive decoding is inherently slow because each token depends on all previous tokens, forcing sequential generation.

Sequential Dependency

Each new token requires a full forward pass through the model. For a 70B parameter model, generating 100 tokens means 100 separate forward passes.

Memory Bandwidth Bound

LLM inference is often limited by how fast we can load model weights from memory, not by computation. The GPU sits idle waiting for data.
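A back-of-the-envelope calculation makes this concrete. The numbers below are illustrative order-of-magnitude assumptions (16-bit weights, roughly 2 TB/s of memory bandwidth on a modern accelerator, ignoring multi-GPU sharding), not measurements:

    # Rough, illustrative numbers only.
    params = 70e9                            # 70B parameters
    bytes_per_param = 2                      # fp16 / bf16 weights
    weight_bytes = params * bytes_per_param  # ~140 GB of weights
    memory_bandwidth = 2e12                  # ~2 TB/s HBM bandwidth (assumed)

    # Each decoded token must stream essentially all weights through the chip once,
    # so bandwidth alone puts a floor on per-token latency.
    floor_ms_per_token = weight_bytes / memory_bandwidth * 1e3
    print(f"~{floor_ms_per_token:.0f} ms per token just to read the weights")  # ~70 ms

Verifying several drafted tokens in one pass amortizes that weight traffic across multiple output tokens, which is exactly the saving speculative decoding targets.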

How It Works

Speculative decoding follows a draft-then-verify pattern that exploits the parallel nature of transformer verification.

1. Draft Generation

A small, fast draft model (e.g., 7B parameters) generates K candidate tokens autoregressively. This is quick because the draft model is small.

2. Parallel Verification

The target model processes the prompt plus all K draft tokens in a single forward pass, computing probabilities for each position.

3. Token Acceptance

Each draft token is accepted with probability min(1, p_target(token) / p_draft(token)), comparing the two models' probabilities at that position. On rejection, a correction token is sampled from the residual distribution, which guarantees the output distribution matches the target model exactly (see the sketch after step 4).

4. Continue or Resample

Accepted tokens are kept. At the first rejection, the target model samples a correction token and the process repeats from the new position. If all K drafts are accepted, the target's own prediction for the next position can be kept as a bonus token.
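To make the four steps concrete, here is a minimal, self-contained Python sketch. The names (VOCAB, draft_probs, target_probs, speculate_step) are illustrative, and the two "models" are toy distribution functions; a real implementation would replace them with actual LLM forward passes and would batch the K+1 target scores into a single pass.

    import numpy as np

    rng = np.random.default_rng(0)
    VOCAB = 50  # toy vocabulary size

    def draft_probs(context):
        # Stand-in for the small draft model: a fixed toy distribution per context.
        local = np.random.default_rng(abs(hash(tuple(context))) % (2**32))
        return local.dirichlet(np.ones(VOCAB))

    def target_probs(context):
        # Stand-in for the large target model. In a real system one forward pass
        # scores all K+1 positions at once; here we call it per position for clarity.
        local = np.random.default_rng((abs(hash(tuple(context))) + 1) % (2**32))
        return local.dirichlet(np.ones(VOCAB))

    def speculate_step(context, K=4):
        """One draft-then-verify cycle; returns the tokens produced by this cycle."""
        # Step 1: the draft model proposes K tokens autoregressively.
        ctx, drafted, q = list(context), [], []
        for _ in range(K):
            qi = draft_probs(ctx)
            tok = int(rng.choice(VOCAB, p=qi))
            drafted.append(tok)
            q.append(qi)
            ctx.append(tok)

        # Step 2: the target model scores the prompt plus every drafted prefix.
        p = [target_probs(list(context) + drafted[:i]) for i in range(K + 1)]

        # Step 3: accept each draft token with probability min(1, p_target / p_draft).
        out = []
        for i, tok in enumerate(drafted):
            if rng.random() < min(1.0, p[i][tok] / q[i][tok]):
                out.append(tok)
                continue
            # Step 4: on rejection, resample from the residual distribution
            # norm(max(0, p - q)), keeping the output distribution identical
            # to the target model's.
            residual = np.maximum(p[i] - q[i], 0.0)
            out.append(int(rng.choice(VOCAB, p=residual / residual.sum())))
            return out

        # All K drafts accepted: keep the target's next-position prediction as a bonus.
        out.append(int(rng.choice(VOCAB, p=p[K])))
        return out

    print(speculate_step(context=[1, 2, 3], K=4))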

Visual Example

Here's how speculative decoding processes a simple continuation:

Prompt:
"The quick brown fox"
Draft model proposes 4 tokens:
"jumps" → "over" → "the" → "lazy"
Target model verifies in one pass:
"jumps" - accepted
"over" - accepted
"the" - accepted
"lazy" → "sleeping" - rejected, target prefers "sleeping"
Final output:
"The quick brown fox jumps over the sleeping"
3 tokens accepted + 1 correction = 4 tokens from a single target forward pass, versus 4 passes with standard decoding

Draft Model Requirements

The choice of draft model significantly impacts the speedup achieved. The ideal draft model balances speed with alignment to the target.

Much Smaller

The draft model should be 5-10x smaller than the target. A 7B draft for a 70B target, or a 1B draft for a 7B target.

Similar Distribution

Higher acceptance rates come from draft models trained on similar data or distilled from the target model.

Same Vocabulary

Draft and target must share the same tokenizer to ensure token-level compatibility during verification.

Fast Inference

The draft model must be cheap relative to the target: each draft step should cost only a small fraction of a target forward pass, so the time spent drafting K tokens is small compared with the target passes it replaces.
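As a practical illustration of the vocabulary requirement, the sketch below checks that two Hugging Face checkpoints share a tokenizer before wiring them into a speculative decoder; the model names are placeholders, not real checkpoints.

    from transformers import AutoTokenizer

    # Placeholder identifiers; substitute your actual draft and target checkpoints.
    draft_tok = AutoTokenizer.from_pretrained("my-org/draft-1b")
    target_tok = AutoTokenizer.from_pretrained("my-org/target-70b")

    # Verification compares token ids position by position, so both models
    # must map text to the same ids.
    assert draft_tok.get_vocab() == target_tok.get_vocab(), "draft/target tokenizers differ"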

What Affects Speedup?

Typical speedups range from 2-3x, but several factors influence the actual improvement.

Acceptance Rate: higher acceptance means more tokens per verification pass
Draft Model Speed: a faster draft makes more speculation attempts possible
Target Model Size: larger targets benefit more because they are more memory-bound
Task Predictability: predictable text (code, structured output) yields higher acceptance
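These factors can be combined into a simplified expected-speedup model, along the lines of the analysis in the original speculative decoding papers, assuming each draft token is accepted independently with probability alpha. The function below is an illustrative sketch, not a measurement:

    def expected_speedup(alpha: float, K: int, c: float) -> float:
        """Rough expected speedup over standard autoregressive decoding.

        alpha: per-token acceptance rate (0 < alpha < 1)
        K:     number of tokens drafted per cycle
        c:     cost of one draft step relative to one target forward pass
        """
        # Expected accepted drafts plus one correction/bonus token per cycle.
        tokens_per_cycle = (1 - alpha ** (K + 1)) / (1 - alpha)
        cost_per_cycle = K * c + 1   # K draft steps plus one target verification pass
        return tokens_per_cycle / cost_per_cycle

    # Example: 80% acceptance, 4 drafted tokens, draft at 5% of target cost -> ~2.8x
    print(expected_speedup(alpha=0.8, K=4, c=0.05))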

Variants and Extensions

Researchers have developed several variations to improve upon basic speculative decoding.

Self-Speculative Decoding

Uses early exit from the target model itself as the draft, eliminating the need for a separate draft model.

Medusa

Adds multiple prediction heads to the target model to generate draft tokens in parallel, avoiding sequential draft generation.

Lookahead Decoding

Generates multiple parallel speculation branches using n-gram patterns from the context, no draft model needed.

Staged Speculative Decoding

Uses a cascade of increasingly larger draft models for better acceptance rates on difficult tokens.

Limitations

While speculative decoding offers significant speedups, it has important constraints that limit when and how it can be applied effectively.

Draft Model Overhead

You need to run and maintain a separate draft model. This adds memory overhead (the draft model must fit in GPU memory alongside the target) and operational complexity.

Diminishing Returns with Batch Size

Speculative decoding shines for single-sequence inference. With larger batch sizes, the target model becomes compute-bound rather than memory-bound, reducing the benefit.

Variable Speedup

Speedup depends heavily on acceptance rate, which varies by task. Creative writing with high temperature may see little benefit, while structured code generation benefits greatly.

Implementation Complexity

Correct rejection sampling is tricky to implement. Naive implementations can produce outputs that differ from the target model's true distribution.

Not Always Faster

If the draft model is too slow, too inaccurate, or the target model is already fast enough, speculative decoding can actually be slower than standard decoding.
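Plugging pessimistic numbers into the expected-speedup sketch from earlier makes this concrete: with only 30% acceptance and a draft model that costs half a target forward pass per token, expected_speedup(alpha=0.3, K=4, c=0.5) comes out to roughly 0.48, i.e., about half the throughput of standard decoding.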

Interactive Simulation

Watch speculative decoding in action


The draft model quickly proposes tokens (purple). The target model verifies them in a single pass (cyan). Accepted tokens are green, corrections are orange. Higher draft-target alignment means more accepted tokens and better speedup.

Key Takeaways

  • Speculative decoding uses a small draft model to propose tokens that are verified in parallel by the target model
  • It typically provides 2-3x speedups while producing outputs identical to standard decoding
  • The technique exploits the fact that transformer verification is parallel while generation is sequential
  • Effectiveness depends on draft-target alignment: similar models and predictable tasks yield higher acceptance rates