Speculative Decoding for Production LLMs
Speculative decoding uses a small 'draft' model to propose multiple tokens that a larger model verifies in parallel, cutting inference latency 2–3x. A practical guide to production deployment.
LLM inference latency comes from one brutal fact: generation is autoregressive. You predict token N, then N+1, then N+2, serially. You cannot parallelize across tokens within a single request.
Speculative decoding breaks this constraint. A small “draft” model proposes K future tokens; the main model verifies all K in a single forward pass. Most proposed tokens are accepted; a few are corrected. Net result: 2–3x lower latency for the same quality.
It’s one of the most important inference optimizations of the last two years. This post covers what it is, when it helps, and how to deploy it.
The Intuition
In normal LLM generation:
- Model sees prompt, generates token 1 (one forward pass)
- Model sees prompt + token 1, generates token 2 (one forward pass)
- …
- Each new token requires one full forward pass
For a 70B model, each forward pass takes roughly 40 ms, so generating 100 tokens takes about 4 seconds. GPU compute isn’t the limiting factor; memory bandwidth is, because all the weights have to cross the memory bus on every step.
Speculative decoding observes: the bottleneck isn’t compute, it’s the serial dependency. Because each pass is bandwidth-bound, scoring several tokens in one pass costs about the same as scoring one. If we had multiple token candidates ready, we could verify them in parallel.
Concretely:
- A small draft model (e.g., a 1B model) generates 5 candidate tokens fast
- The main model does a single forward pass with all 5 proposed tokens
- For each proposed position, compare the main model’s distribution with the draft’s
- Accept the longest prefix of proposed tokens that “agree”; at the first disagreement, discard the rest of the draft and use the main model’s own token for that position
The main model runs once for up to 5 tokens. If the draft’s predictions are mostly right (they often are, since most tokens are easy), you typically get 2–4 tokens per main-model forward pass instead of one.
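The steps above can be sketched with toy stand-ins for both models. This is a minimal illustration under greedy decoding, not how any particular engine implements it: `toy_target`, `toy_draft`, `speculative_step`, `generate`, and `greedy_baseline` are hypothetical names, and the verification loop only mimics what a real engine computes in one batched forward pass.

```python
from typing import Callable, List, Tuple

# A "model" here is any function mapping a token sequence to the next token
# under greedy decoding. Both models below are toys, not real LLMs.
NextToken = Callable[[List[int]], int]


def toy_target(seq: List[int]) -> int:
    """Stand-in for the main model: counts upward modulo 10."""
    return (seq[-1] + 1) % 10


def toy_draft(seq: List[int]) -> int:
    """Stand-in for the draft: agrees with the target except after token 6."""
    return 0 if seq[-1] == 6 else (seq[-1] + 1) % 10


def speculative_step(target: NextToken, draft: NextToken,
                     context: List[int], k: int) -> List[int]:
    """Return the tokens emitted by one draft-then-verify iteration."""
    # 1. The draft proposes k tokens autoregressively (cheap).
    proposed: List[int] = []
    for _ in range(k):
        proposed.append(draft(context + proposed))

    # 2. The target checks every position. A real engine gets all of these
    #    predictions from a single forward pass over the proposed tokens.
    emitted: List[int] = []
    for i in range(k):
        expected = target(context + proposed[:i])
        if proposed[i] != expected:
            # First disagreement: keep the target's token, drop the rest.
            emitted.append(expected)
            return emitted
        emitted.append(proposed[i])

    # 3. All k accepted: the same pass also yields one extra target token.
    emitted.append(target(context + proposed))
    return emitted


def generate(target: NextToken, draft: NextToken, prompt: List[int],
             n_tokens: int, k: int = 5) -> Tuple[List[int], int]:
    """Generate n_tokens speculatively; also return the number of target passes."""
    out = list(prompt)
    passes = 0
    while len(out) - len(prompt) < n_tokens:
        out.extend(speculative_step(target, draft, out, k))
        passes += 1
    return out[len(prompt):len(prompt) + n_tokens], passes


def greedy_baseline(target: NextToken, prompt: List[int], n_tokens: int) -> List[int]:
    """Plain autoregressive decoding: one target pass per token."""
    out = list(prompt)
    for _ in range(n_tokens):
        out.append(target(out))
    return out[len(prompt):]


if __name__ == "__main__":
    tokens, passes = generate(toy_target, toy_draft, [0], 20)
    print(tokens == greedy_baseline(toy_target, [0], 20), passes)
```

With these toy models, the speculative output matches plain greedy decoding token for token, while the target runs far fewer times than the number of tokens generated.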
The Math
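Let K = number of tokens proposed per iteration, α = average acceptance rate per position (treated as independent across positions).
Expected draft tokens accepted per iteration: sum over i from 1 to K of α^i = α(1 - α^K) / (1 - α). The main model also contributes one token of its own every iteration (the correction at the first rejection, or a bonus token when all K are accepted), so the expected number of tokens emitted per iteration is (1 - α^(K+1)) / (1 - α).
For typical α = 0.7, K = 5:
- Accepted draft tokens per iteration: ~1.9
- Tokens emitted per iteration: ~2.9
- Speedup vs baseline, ignoring draft cost: up to ~2.9x
For α = 0.8, K = 8:
- Accepted draft tokens per iteration: ~3.3
- Tokens emitted per iteration: ~4.3
- Speedup vs baseline, ignoring draft cost: up to ~4.3x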
In practice, speedup is between 1.5x and 3x on real workloads. The win depends heavily on the draft model quality and the specific content being generated.
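To see why real speedups land below the idealized figure, a small calculator helps. This is a sketch under stated assumptions: acceptance is independent per position with probability α, a verification pass costs the same as one normal decode step, and `draft_cost` (a hypothetical parameter) is the cost of one draft step relative to one main-model pass.

```python
def expected_accepted(alpha: float, k: int) -> float:
    """Expected number of draft tokens accepted: sum of alpha**i for i = 1..k."""
    return sum(alpha ** i for i in range(1, k + 1))


def expected_emitted(alpha: float, k: int) -> float:
    """Accepted draft tokens plus the one token the main model always contributes."""
    return expected_accepted(alpha, k) + 1.0


def estimated_speedup(alpha: float, k: int, draft_cost: float) -> float:
    """Tokens emitted per iteration divided by iteration cost.

    Iteration cost is measured in main-model passes: one verification pass
    plus k draft steps, each costing draft_cost of a main-model pass.
    """
    return expected_emitted(alpha, k) / (1.0 + k * draft_cost)


if __name__ == "__main__":
    # Example: alpha = 0.7, K = 5, draft assumed 20x cheaper than the main model.
    print(round(estimated_speedup(0.7, 5, 0.05), 2))
```

Under these assumptions, raising K has diminishing returns: each extra proposed token adds α^K to the numerator but a fixed `draft_cost` to the denominator.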
Why Quality Is Preserved
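The key insight, and the reason this is not “lossy generation”, is that the main model has the final say. Draft tokens are accepted only when they are consistent with the main model’s own distribution, and a rejected token is replaced by one drawn from the main model. With greedy decoding, the output matches what the main model would have produced alone (up to floating-point noise); with sampling, it follows exactly the same distribution.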
No quality loss. This isn’t a tradeoff. It’s a pure speedup.
The only thing that changes is latency. Throughput per GPU can go up or down depending on the batching interaction (more below).
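The quality claim can be checked numerically for a single position. The sketch below uses the standard acceptance rule from the speculative sampling literature (accept a draft token x with probability min(1, p(x)/q(x)); on rejection, resample from the normalized positive part of p - q). Whether a given engine implements exactly this rule is an assumption, and `output_distribution` is a hypothetical helper name.

```python
from typing import List


def output_distribution(p: List[float], q: List[float]) -> List[float]:
    """Exact distribution of the token emitted at one speculative position.

    p: main-model probabilities, q: draft probabilities (same vocabulary).
    """
    # The draft proposes x with probability q[x] and it is accepted with
    # probability min(1, p[x] / q[x]); the product simplifies to min(p[x], q[x]).
    accept = [min(pi, qi) for pi, qi in zip(p, q)]
    reject_mass = 1.0 - sum(accept)

    # On rejection, resample from the normalized residual max(0, p - q).
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    total = sum(residual)
    if total > 0:
        residual = [r / total for r in residual]

    return [a + reject_mass * r for a, r in zip(accept, residual)]


if __name__ == "__main__":
    p = [0.6, 0.3, 0.1]   # main model
    q = [0.2, 0.5, 0.3]   # draft model, deliberately mismatched
    print(output_distribution(p, q))
```

Even with a badly mismatched draft, the emitted distribution equals p. A poor draft costs speed (more rejections), not quality.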
Picking a Draft Model
The draft model needs two properties:
- Fast — much faster than the main model, or the speedup doesn’t materialize
- Agreeable — predicts the same tokens the main model would, as often as possible
Typical combinations that work well:
| Main model | Draft model | Acceptance rate |
|---|---|---|
| Llama-3-70B | Llama-3-8B | ~65–75% |
| Llama-3-70B | Llama-3.2-1B | ~60–70% |
| Llama-3.1-405B | Llama-3.1-70B | ~70–80% |
| Mistral Large 2 | Mistral-7B | ~60–70% |
The smaller the draft, the faster it generates proposals but the lower the acceptance rate. The sweet spot depends on your workload.
Medusa and EAGLE are alternative approaches — they add extra “heads” to the main model that propose tokens without needing a separate model. Tighter integration, but requires training. vLLM supports both.
Production Deployment
In vLLM
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-3.1-70B-Instruct \
--speculative-model meta-llama/Llama-3.2-1B-Instruct \
--num-speculative-tokens 5 \
--tensor-parallel-size 2
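vLLM manages the draft model automatically. It runs on the same GPU(s) by default. These flag names match older vLLM releases; newer releases take a single JSON --speculative-config argument instead, so check the docs for the version you run.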
In TensorRT-LLM
TensorRT-LLM supports speculative decoding via the Medusa and EAGLE modes. Engine build requires extra flags:
trtllm-build --speculative_decoding_mode medusa \
--max_draft_len 5
More complex than vLLM; more tunable.
In TGI
TGI supports a limited set of draft models. Check their docs for the current supported list.
The Batching Interaction
Here’s where it gets interesting.
In single-request mode: speculative decoding is a pure latency win. 2–3x faster, no downside.
In high-concurrency mode: speculative decoding can hurt throughput. Why?
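- Every iteration does more work: K draft steps plus verification of K positions per request, including positions that end up rejected
- At large batch sizes decoding is already compute-bound, so the spare compute that speculation relies on is gone and the extra work just delays other requests
Measured on our benchmarks (Llama-3.1-70B, 4x H100; TPOT = time per output token):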
| Concurrency | Baseline TPOT | Speculative TPOT | Latency speedup |
|---|---|---|---|
| 1 | 44ms | 18ms | 2.4x |
| 4 | 46ms | 22ms | 2.1x |
| 16 | 52ms | 38ms | 1.4x |
| 64 | 72ms | 75ms | 0.96x |
| 128 | 120ms | 145ms | 0.83x |
At low concurrency, huge wins. At high concurrency, breaks even or loses.
Practically: speculative decoding is a latency optimization, not a throughput one. Use it when latency matters more than tokens-per-dollar. Examples: interactive chat, coding assistants, voice applications.
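The speedup column and the break-even point can be derived directly from the TPOT measurements. A minimal sketch using the numbers from the table; the function names are hypothetical, and in practice the rows would come from your own benchmark runs.

```python
# (concurrency, baseline TPOT in ms, speculative TPOT in ms), copied from the table.
MEASUREMENTS = [
    (1, 44, 18),
    (4, 46, 22),
    (16, 52, 38),
    (64, 72, 75),
    (128, 120, 145),
]


def latency_speedups(rows):
    """Map concurrency -> baseline TPOT / speculative TPOT."""
    return {c: base / spec for c, base, spec in rows}


def max_beneficial_concurrency(rows):
    """Highest measured concurrency at which speculation still lowers TPOT."""
    winners = [c for c, base, spec in rows if spec < base]
    return max(winners) if winners else None


if __name__ == "__main__":
    for c, s in latency_speedups(MEASUREMENTS).items():
        print(f"concurrency {c:>3}: {s:.2f}x")
    print("speculation helps up to concurrency", max_beneficial_concurrency(MEASUREMENTS))
```

On these measurements the crossover sits somewhere between 16 and 64 concurrent requests, which is the range worth benchmarking more finely before picking a cutoff.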
Dynamic Speculative Decoding
Some systems (including recent vLLM versions) support dynamic speculation — turn it off when batches are full, on when they’re empty. Gives you best-of-both: low latency at low load, high throughput at peak.
This is increasingly the default configuration we deploy. The gateway-level signal is usually queue depth or GPU utilization.
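A gateway-level switch driven by queue depth might look like the sketch below. It is not a vLLM API: `SpeculationGate` and its thresholds are hypothetical, and the two-threshold (hysteresis) design is one assumption about how to keep the setting from flapping when load hovers near a single cutoff.

```python
from dataclasses import dataclass


@dataclass
class SpeculationGate:
    """Toggle speculation from queue depth, with hysteresis to avoid flapping.

    The default thresholds are illustrative placeholders, not tuned values.
    """
    disable_at: int = 32   # switch speculation off at or above this queue depth
    enable_at: int = 8     # switch it back on at or below this queue depth
    enabled: bool = True

    def update(self, queue_depth: int) -> bool:
        """Feed the latest queue depth; return whether speculation should be on."""
        if self.enabled and queue_depth >= self.disable_at:
            self.enabled = False
        elif not self.enabled and queue_depth <= self.enable_at:
            self.enabled = True
        return self.enabled


if __name__ == "__main__":
    gate = SpeculationGate()
    for depth in [4, 40, 20, 8]:
        print(depth, gate.update(depth))
```

Between the two thresholds the gate keeps its previous state, so a queue depth oscillating around 20 does not toggle speculation on every request.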
What Affects Acceptance Rate
Things that help acceptance:
- Well-matched draft model (same family, same training data)
- Predictable content (common English text, code, structured output)
- Low-temperature / deterministic sampling
Things that hurt acceptance:
- High-temperature sampling
- Very long-tail content (rare languages, novel tasks)
- Fine-tuning mismatch between main and draft (for example, a chat-tuned model paired with an instruct-tuned one)
- Heavy RAG-grounded generation where tokens depend on retrieved context the draft hasn’t seen
If your acceptance rate is below 50%, speculative decoding is probably a net negative. Our threshold for deploying: ≥60% acceptance sustained on eval workload.
Alternatives And Complements
Speculative decoding isn’t the only latency optimization:
- Prefix caching — caches KV for repeated system prompts (see PagedAttention)
- Chunked prefill — interleaves prefill with decode steps to smooth latency
- Disaggregated prefill/decode — separates prefill and decode onto different hardware
- Multi-query attention / grouped-query attention — reduces KV cache size, speeding decode
These compose. We regularly run FP8 + speculative decoding + prefix caching + continuous batching together. Each contributes independently.
When To Not Deploy It
- Your workload is pure throughput (batch processing, bulk classification). Turn it off.
- Your acceptance rate on your workload is below 50%. Turn it off.
- Your GPU memory is already tight and the draft model pushes you over.
- You need deterministic behavior across batch sizes (acceptance variance makes output timing variable).
- Your main model is already small (<7B). Speculative decoding has less to optimize.
Operational Notes
1. Draft model quality matters. A poorly-matched draft can actually slow you down. Benchmark.
2. The draft model needs its own GPU memory. Factor it into sizing. For Llama-3-70B + 1B draft, expect ~3 GB extra per replica.
3. Acceptance monitoring. Add acceptance rate as a metric. When it drops (e.g., after a model update), you’ll want to know immediately.
4. Structured output interaction. JSON mode and tool calling work but acceptance rates can drop for structured tokens. Test your specific setup.
5. Latency variance increases. Best case: 3x faster. Worst case: same as baseline. P50 gets better, but P99 - P50 widens. UX-sensitive apps may care.
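For the acceptance monitoring note, a rolling-window tracker is usually enough to catch regressions. This is a sketch: `AcceptanceMonitor` is a hypothetical class, the per-iteration proposed and accepted counts are assumed to come from your serving engine's metrics, and the default alert level reuses the 60% deployment threshold from earlier in the post.

```python
from collections import deque


class AcceptanceMonitor:
    """Rolling acceptance rate over the most recent speculative iterations."""

    def __init__(self, window: int = 1000, alert_below: float = 0.60):
        # Each sample is (tokens proposed, tokens accepted) for one iteration;
        # the deque drops the oldest sample once the window is full.
        self.samples = deque(maxlen=window)
        self.alert_below = alert_below

    def record(self, proposed: int, accepted: int) -> None:
        self.samples.append((proposed, accepted))

    def rate(self) -> float:
        """Accepted / proposed over the window (0.0 when nothing is recorded)."""
        proposed = sum(p for p, _ in self.samples)
        accepted = sum(a for _, a in self.samples)
        return accepted / proposed if proposed else 0.0

    def should_alert(self) -> bool:
        return bool(self.samples) and self.rate() < self.alert_below


if __name__ == "__main__":
    monitor = AcceptanceMonitor(window=3)
    for proposed, accepted in [(5, 5), (5, 4), (5, 3), (5, 0)]:
        monitor.record(proposed, accepted)
        print(round(monitor.rate(), 2), monitor.should_alert())
```

Note that this aggregate ratio is not the same quantity as the per-position α used in the math: positions after the first rejection count as not accepted here, so the aggregate ratio reads lower than α.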
Summary
Speculative decoding is the single most effective latency optimization for interactive LLM workloads on modern hardware. Turn it on for:
- Chat and assistant use cases
- Voice-driven applications
- Coding copilots
- Any single-user, latency-sensitive path
Turn it off for:
- Batch processing
- High-concurrency throughput-bound workloads
- Workloads where sustained acceptance is below the 60% deployment threshold
vLLM makes it trivial to enable. Test on your workload. Measure acceptance and latency. Keep it in your production bag of tricks.
Further Reading
- vLLM: The Open-Source Inference Engine
- PagedAttention Explained: How vLLM Achieves 24x Throughput
- Disaggregated Inference: Prefill, Decode, and the New Serving Topology