TURION .AI

Speculative Decoding for Production LLMs

Balys Kriksciunas · 7 min read
Speculative decoding uses a small 'draft' model to propose multiple tokens that a larger model verifies in parallel, cutting inference latency 2–3x. A practical guide to production deployment.

LLM inference latency comes from one brutal fact: generation is autoregressive. You predict token N, then N+1, then N+2, serially. You cannot parallelize across tokens within a single request.

Speculative decoding breaks this constraint. A small “draft” model proposes K future tokens; the main model verifies all K in a single forward pass. Most proposed tokens are accepted; a few are corrected. Net result: 2–3x lower latency for the same quality.

It’s one of the most important inference optimizations of the last two years. This post covers what it is, when it helps, and how to deploy it.


The Intuition

In normal LLM generation:

  1. Model sees prompt, generates token 1 (one forward pass)
  2. Model sees prompt + token 1, generates token 2 (one forward pass)
  3. Each new token requires one full forward pass

For a 70B model, each forward pass takes ~40ms, so generating 100 tokens takes about 4 seconds. GPU compute isn’t the bottleneck; memory bandwidth is, because the full set of weights crosses the memory bus on every step.

Speculative decoding observes: the bottleneck isn’t compute, it’s the serial dependency. If we had multiple token candidates ready, we could verify them in parallel.

Concretely:

  1. A small draft model (e.g., a 1B model) generates 5 candidate tokens fast
  2. The main model does a single forward pass with all 5 proposed tokens
  3. For each proposed position, compare the main model’s distribution with the draft’s
  4. Accept tokens up to the first position where the two disagree; at that position, take the main model’s token instead and discard the rest of the draft

The main model runs once for up to 5 tokens. If the draft’s predictions are mostly right (and they often are, since most tokens are easy), you get several tokens per main-model forward pass instead of one.
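
The four steps above can be sketched in a few lines of Python. This is a toy illustration under greedy decoding, not how any particular engine implements it: `speculative_step`, `generate`, `toy_target`, and `toy_draft` are hypothetical names, the "models" are plain functions over integer tokens, and the verification loop stands in for what a real engine does in a single batched forward pass.

```python
from typing import Callable, List, Tuple

# A "model" here is any function mapping a token sequence to its greedy next token.
NextToken = Callable[[List[int]], int]


def speculative_step(target: NextToken, draft: NextToken,
                     context: List[int], k: int) -> List[int]:
    """Run one draft-then-verify iteration and return the tokens it produced."""
    # Step 1: the draft proposes k tokens autoregressively (cheap).
    proposed: List[int] = []
    for _ in range(k):
        proposed.append(draft(context + proposed))

    # Steps 2-4: the target checks each position. A real engine scores all
    # positions in ONE forward pass; this loop only emulates that.
    produced: List[int] = []
    for i in range(k):
        expected = target(context + proposed[:i])
        if expected != proposed[i]:
            # First disagreement: keep the target's token, drop the rest.
            produced.append(expected)
            return produced
        produced.append(proposed[i])

    # Every proposal matched: the same pass also yields one extra token.
    produced.append(target(context + proposed))
    return produced


def generate(target: NextToken, draft: NextToken, prompt: List[int],
             n_tokens: int, k: int = 5) -> Tuple[List[int], int]:
    """Generate n_tokens and report how many target passes were needed."""
    out = list(prompt)
    passes = 0
    while len(out) - len(prompt) < n_tokens:
        out.extend(speculative_step(target, draft, out, k))
        passes += 1
    return out[len(prompt):len(prompt) + n_tokens], passes


def toy_target(tokens: List[int]) -> int:
    """Toy stand-in for the main model."""
    return (tokens[-1] * 3 + 1) % 7


def toy_draft(tokens: List[int]) -> int:
    """Toy draft: agrees with the target except when the context length is a multiple of 4."""
    guess = toy_target(tokens)
    return (guess + 1) % 7 if len(tokens) % 4 == 0 else guess


if __name__ == "__main__":
    tokens, passes = generate(toy_target, toy_draft, prompt=[2], n_tokens=20, k=5)
    print(tokens, passes)
```

With this toy draft, 20 tokens take 5 main-model passes instead of 20, and the output matches plain greedy decoding with `toy_target` token for token.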


The Math

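Let K = number of tokens proposed per iteration, α = average acceptance rate per position (treated as independent across positions).

Position i is accepted only if every earlier position was accepted, so the expected number of accepted draft tokens per iteration is α + α^2 + ... + α^K = α(1 - α^K) / (1 - α). Every iteration also yields one token from the main model itself (the correction at the first rejection, or a bonus token when all K are accepted), so the expected tokens per main-model pass is (1 - α^(K+1)) / (1 - α).

For typical α = 0.7, K = 5:

  • Accepted draft tokens per iteration: ~1.9
  • Tokens per main-model pass: ~2.9

For α = 0.8, K = 8:

  • Accepted draft tokens per iteration: ~3.3
  • Tokens per main-model pass: ~4.3

Tokens per main-model pass is an upper bound on speedup, because it ignores the time spent running the draft model and the extra verification work. In practice, speedup is between 1.5x and 3x on real workloads. The win depends heavily on the draft model quality and the specific content being generated.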
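
To play with these numbers, here is a small calculator. It is a back-of-the-envelope sketch, not a benchmark: it assumes a constant, independent per-position acceptance rate, assumes verifying K tokens costs the same as one ordinary main-model pass, and introduces a hypothetical `draft_cost_ratio` (the cost of one draft pass relative to one main-model pass) to account for drafting overhead.

```python
def expected_accepted(alpha: float, k: int) -> float:
    """Expected accepted draft tokens per iteration: alpha + alpha^2 + ... + alpha^k."""
    return sum(alpha ** i for i in range(1, k + 1))


def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """Accepted draft tokens plus the one token the main model always contributes."""
    return expected_accepted(alpha, k) + 1.0


def estimated_speedup(alpha: float, k: int, draft_cost_ratio: float) -> float:
    """Rough speedup estimate.

    One iteration costs one main-model pass plus k draft passes, where each
    draft pass costs draft_cost_ratio of a main-model pass (an assumption).
    """
    iteration_cost = 1.0 + k * draft_cost_ratio
    return expected_tokens_per_pass(alpha, k) / iteration_cost


if __name__ == "__main__":
    for alpha, k in [(0.7, 5), (0.8, 8)]:
        print(
            f"alpha={alpha}, K={k}: "
            f"accepted={expected_accepted(alpha, k):.2f}, "
            f"tokens/pass={expected_tokens_per_pass(alpha, k):.2f}, "
            f"speedup@5% draft cost={estimated_speedup(alpha, k, 0.05):.2f}x"
        )
```

The estimate makes one tradeoff visible: raising K adds accepted tokens with diminishing returns (each extra position is weighted by another factor of α) while the drafting cost grows linearly.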


Why Quality Is Preserved

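The key insight, and the reason this is not “lossy generation”, is that the main model has the final say. With greedy decoding, a draft token is kept only if it is the token the main model would have picked itself, so the output is identical to what the main model would have produced alone. With sampling, the accept/reject rule is constructed so that the output follows exactly the main model’s distribution.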

No quality loss. This isn’t a tradeoff. It’s a pure speedup.

The only thing that changes is latency. Throughput per GPU can go up or down depending on the batching interaction (more below).
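
For the sampling case, the standard accept/reject rule from the speculative sampling literature can be sketched as follows. This is a single-position illustration over a three-token vocabulary, with hypothetical names (`verify_token`, `sample_with_speculation`) and made-up distributions; a real engine applies the same rule to full vocabulary logits at each drafted position.

```python
import random
from typing import Dict

Dist = Dict[str, float]  # token -> probability


def verify_token(token: str, p_target: Dist, q_draft: Dist, rng: random.Random) -> str:
    """Accept the drafted token with probability min(1, p/q).

    On rejection, resample from the residual distribution max(p - q, 0),
    renormalized. Together these make the result distributed exactly as p_target.
    """
    p = p_target.get(token, 0.0)
    q = q_draft[token]  # the token was drawn from q_draft, so q > 0
    if rng.random() < min(1.0, p / q):
        return token
    residual = {t: max(p_target[t] - q_draft.get(t, 0.0), 0.0) for t in p_target}
    tokens = list(residual.keys())
    weights = list(residual.values())
    return rng.choices(tokens, weights=weights, k=1)[0]


def sample_with_speculation(p_target: Dist, q_draft: Dist, rng: random.Random) -> str:
    """Draw a token from the draft, then verify it against the target."""
    tokens = list(q_draft.keys())
    drafted = rng.choices(tokens, weights=[q_draft[t] for t in tokens], k=1)[0]
    return verify_token(drafted, p_target, q_draft, rng)


if __name__ == "__main__":
    p = {"a": 0.6, "b": 0.3, "c": 0.1}  # made-up target distribution
    q = {"a": 0.3, "b": 0.5, "c": 0.2}  # made-up draft distribution
    rng = random.Random(0)
    n = 100_000
    counts = {t: 0 for t in p}
    for _ in range(n):
        counts[sample_with_speculation(p, q, rng)] += 1
    print({t: round(c / n, 3) for t, c in counts.items()})
```

The empirical frequencies land on the target distribution `p`, not on the draft distribution `q`, even though every candidate token originated from the draft.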


Picking a Draft Model

The draft model needs two properties:

  1. Fast — much faster than the main model, or the speedup doesn’t materialize
  2. Agreeable — predicts the same tokens the main model would, as often as possible

Typical combinations that work well:

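| Main model     | Draft model   | Acceptance rate |
|----------------|---------------|-----------------|
| Llama-3-70B    | Llama-3-8B    | ~65–75%         |
| Llama-3-70B    | Llama-3.2-1B  | ~60–70%         |
| Llama-3.1-405B | Llama-3.1-70B | ~70–80%         |
| Mistral Large 2 | Mistral-7B   | ~60–70%         |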

The smaller the draft, the faster it generates proposals but the lower the acceptance rate. The sweet spot depends on your workload.

Medusa and EAGLE are alternative approaches — they add extra “heads” to the main model that propose tokens without needing a separate model. Tighter integration, but requires training. vLLM supports both.


Production Deployment

In vLLM

python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.1-70B-Instruct \
    --speculative-model meta-llama/Llama-3.2-1B-Instruct \
    --num-speculative-tokens 5 \
    --tensor-parallel-size 2

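vLLM manages the draft model automatically. It runs on the same GPU(s) by default. Flag names vary by vLLM version: newer releases fold these options into a single JSON-valued --speculative-config argument, so check the docs for the version you run.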

In TensorRT-LLM

TensorRT-LLM supports speculative decoding via the Medusa and EAGLE modes. Engine build requires extra flags:

trtllm-build --speculative_decoding_mode medusa \
             --max_draft_len 5

More complex than vLLM; more tunable.

In TGI

TGI supports a limited set of draft models. Check their docs for the current supported list.


The Batching Interaction

Here’s where it gets interesting.

In single-request mode: speculative decoding is a pure latency win. 2–3x faster, no downside.

In high-concurrency mode: speculative decoding can hurt throughput. Why?

  • You’re doing more compute per iteration (main + draft + verification)
  • If the GPU is already saturated with batched requests, the extra compute just delays things

Measured on our benchmarks (Llama-3.1-70B, 4x H100):

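| Concurrency | Baseline TPOT | Speculative TPOT | Latency speedup |
|-------------|---------------|------------------|-----------------|
| 1           | 44ms          | 18ms             | 2.4x            |
| 4           | 46ms          | 22ms             | 2.1x            |
| 16          | 52ms          | 38ms             | 1.4x            |
| 64          | 72ms          | 75ms             | 0.96x           |
| 128         | 120ms         | 145ms            | 0.83x           |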

At low concurrency, huge wins. At high concurrency, breaks even or loses.

Practically: speculative decoding is a latency optimization, not a throughput one. Use it when latency matters more than tokens-per-dollar. Examples: interactive chat, coding assistants, voice applications.
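
If you want to reproduce this kind of comparison on your own traffic, TPOT (time per output token) can be computed from the arrival times of streamed tokens. The sketch below is one reasonable way to do it, with hypothetical helper names (`tpot_ms`, `latency_speedup`). It averages over the whole response, which matters here because speculative decoding delivers tokens in bursts rather than at an even pace.

```python
from typing import List


def tpot_ms(token_times_s: List[float]) -> float:
    """Mean time per output token in milliseconds.

    token_times_s holds the arrival time (in seconds) of each streamed token.
    The interval before the first token is excluded: that is time-to-first-token,
    which is dominated by prefill rather than decode.
    """
    if len(token_times_s) < 2:
        raise ValueError("need at least two token timestamps")
    span_s = token_times_s[-1] - token_times_s[0]
    return 1000.0 * span_s / (len(token_times_s) - 1)


def latency_speedup(baseline_tpot_ms: float, speculative_tpot_ms: float) -> float:
    """Values above 1.0 mean speculative decoding is faster than baseline."""
    return baseline_tpot_ms / speculative_tpot_ms


if __name__ == "__main__":
    # Bursty arrivals: three tokens at once, then another three 54 ms later.
    arrivals = [0.000, 0.000, 0.000, 0.054, 0.054, 0.054]
    print(f"TPOT: {tpot_ms(arrivals):.1f} ms")
    print(f"speedup at concurrency 1: {latency_speedup(44, 18):.1f}x")
    print(f"speedup at concurrency 128: {latency_speedup(120, 145):.2f}x")
```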


Dynamic Speculative Decoding

Some systems (including recent vLLM versions) support dynamic speculation: speculation is switched off when batches are full and back on when they’re empty. That gives you the best of both: low latency at low load, high throughput at peak.

This is increasingly the default configuration we deploy. The gateway-level signal is usually queue depth or GPU utilization.
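
A gateway-side switch driven by queue depth might look like the sketch below. Everything in it is hypothetical: `SpeculationGate` is not part of any serving framework, and the two thresholds are placeholder values you would tune from your own concurrency benchmarks. Using separate on and off thresholds (hysteresis) keeps the gate from flapping when load hovers around the break-even point.

```python
from dataclasses import dataclass


@dataclass
class SpeculationGate:
    """Hypothetical switch: speculate only while the request queue is shallow."""

    disable_above: int = 32  # placeholder: queue depth at which speculation stops paying off
    enable_below: int = 16   # placeholder: queue depth at which it is safe to re-enable
    enabled: bool = True

    def update(self, queue_depth: int) -> bool:
        """Feed the latest queue depth; returns whether speculation should be on."""
        if self.enabled and queue_depth > self.disable_above:
            self.enabled = False
        elif not self.enabled and queue_depth < self.enable_below:
            self.enabled = True
        return self.enabled


if __name__ == "__main__":
    gate = SpeculationGate()
    for depth in [4, 20, 40, 20, 10]:
        print(depth, gate.update(depth))
```

Note that a depth of 20 leaves the gate in whatever state it was already in: on while load is rising, off while it is falling back from a peak.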


What Affects Acceptance Rate

Things that help acceptance:

  • Well-matched draft model (same family, same training data)
  • Predictable content (common English text, code, structured output)
  • Low-temperature / deterministic sampling

Things that hurt acceptance:

  • High-temperature sampling
  • Very long-tail content (rare languages, novel tasks)
  • Chat vs instruct mismatch between main and draft
  • Heavy RAG-grounded generation where tokens depend on retrieved context the draft hasn’t seen

If your acceptance rate is below 50%, speculative decoding is probably a net negative. Our threshold for deploying: ≥60% acceptance sustained on eval workload.
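
One way to enforce a threshold like this is a rolling acceptance tracker fed from whatever per-step counts your serving stack exposes. The sketch below is an assumption-laden illustration: `AcceptanceMonitor` is a hypothetical class, it measures the fraction of proposed tokens that were accepted over a sliding window of verification steps, and the 60% default simply mirrors the deployment threshold mentioned above.

```python
from collections import deque
from typing import Deque, Tuple


class AcceptanceMonitor:
    """Rolling fraction of drafted tokens accepted over the last `window` steps."""

    def __init__(self, window: int = 1000, min_rate: float = 0.60) -> None:
        # Each entry is (tokens proposed, tokens accepted) for one verification step.
        self.samples: Deque[Tuple[int, int]] = deque(maxlen=window)
        self.min_rate = min_rate

    def record(self, proposed: int, accepted: int) -> None:
        if not 0 <= accepted <= proposed:
            raise ValueError("accepted must be between 0 and proposed")
        self.samples.append((proposed, accepted))

    @property
    def rate(self) -> float:
        proposed = sum(p for p, _ in self.samples)
        accepted = sum(a for _, a in self.samples)
        return accepted / proposed if proposed else 0.0

    def healthy(self) -> bool:
        """True while the rolling acceptance rate is at or above the threshold."""
        return self.rate >= self.min_rate


if __name__ == "__main__":
    monitor = AcceptanceMonitor(window=3)
    for proposed, accepted in [(5, 4), (5, 3), (5, 2), (5, 0)]:
        monitor.record(proposed, accepted)
        print(f"rate={monitor.rate:.2f} healthy={monitor.healthy()}")
```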


Alternatives And Complements

Speculative decoding isn’t the only latency optimization:

  • Prefix caching — caches KV for repeated system prompts (see PagedAttention)
  • Chunked prefill — interleaves prefill with decode steps to smooth latency
  • Disaggregated prefill/decode — separates prefill and decode onto different hardware
  • Multi-query attention / grouped-query attention — reduces KV cache size, speeding decode

These compose. We regularly run FP8 + speculative decoding + prefix caching + continuous batching together. Each contributes independently.


When To Not Deploy It

  • Your workload is pure throughput (batch processing, bulk classification). Turn it off.
  • Your acceptance rate on your workload is below 50%. Turn it off.
  • Your GPU memory is already tight and the draft model pushes you over.
  • You need deterministic behavior across batch sizes (acceptance variance makes output timing variable).
  • Your main model is already small (<7B). Speculative decoding has less to optimize.

Operational Notes

1. Draft model quality matters. A poorly-matched draft can actually slow you down. Benchmark.

2. The draft model needs its own GPU memory. Factor it into sizing. For Llama-3-70B + 1B draft, expect ~3 GB extra per replica.

3. Acceptance monitoring. Add acceptance rate as a metric. When it drops (e.g., after a model update), you’ll want to know immediately.

4. Structured output interaction. JSON mode and tool calling work but acceptance rates can drop for structured tokens. Test your specific setup.

5. Latency variance increases. Best case: 3x faster. Worst case: same as baseline. P50 gets better, but P99 - P50 widens. UX-sensitive apps may care.
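
For note 2, a rough way to size the draft model's footprint is weights plus KV cache. The sketch below uses a hypothetical helper, `draft_memory_gb`, and the figures in the example are assumptions approximating a 1B-class draft in 16-bit precision (about 1.24B parameters, 16 layers, 8 KV heads of dimension 64, 16k cached tokens). Read the real values from your draft model's config, and expect the serving framework to add its own overhead.

```python
def draft_memory_gb(n_params: float, bytes_per_param: int, n_layers: int,
                    n_kv_heads: int, head_dim: int, kv_tokens: int,
                    kv_bytes: int = 2) -> float:
    """Approximate draft-model memory in GB (10^9 bytes): weights + KV cache."""
    weight_bytes = n_params * bytes_per_param
    # Each cached token stores one key and one value vector per layer per KV head.
    kv_bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * kv_bytes
    return (weight_bytes + kv_bytes_per_token * kv_tokens) / 1e9


if __name__ == "__main__":
    # Assumed configuration for a 1B-class draft model; not read from any real config.
    total = draft_memory_gb(
        n_params=1.24e9, bytes_per_param=2,
        n_layers=16, n_kv_heads=8, head_dim=64,
        kv_tokens=16_384,
    )
    print(f"~{total:.1f} GB")
```

Under these assumptions the estimate comes out at roughly 3 GB, in line with the figure in note 2; longer contexts or larger batches push the KV cache share up.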


Summary

Speculative decoding is the single most effective latency optimization for interactive LLM workloads on modern hardware. Turn it on for:

  • Chat and assistant use cases
  • Voice-driven applications
  • Coding copilots
  • Any single-user, latency-sensitive path

Turn it off for:

  • Batch processing
  • High-concurrency throughput-bound workloads
  • Workloads where acceptance is below 60%

vLLM makes it trivial to enable. Test on your workload. Measure acceptance and latency. Keep it in your production bag of tricks.


Exploring speculative decoding for your workload? Reach out — we’ll benchmark it with your actual traffic in a day.
