TURION .AI

KV Cache Optimization Techniques for LLM Serving

Balys Kriksciunas · 7 min read

KV cache dominates memory and cost in LLM serving. Paging, compressing, offloading, and sharing it lets you serve 2–4x more concurrent requests.

Ask an inference engineer what limits concurrency on their GPU and the answer is almost always “KV cache.” Model weights are fixed. Activations are transient. The KV cache — the stored attention keys and values for every token in every active request — grows linearly with concurrent requests and context length, and eats the remaining GPU memory.

If you can fit more KV cache, you can serve more concurrent requests. That’s the whole game. This post surveys the techniques that make that possible in 2025.


The Baseline: How Much KV Cache Do You Actually Need?

For a transformer model:

KV cache per token = 2 (K+V) × num_layers × num_kv_heads × head_dim × bytes_per_element

For Llama-3-70B:

  • 80 layers, 8 KV heads (GQA), 128 head_dim
  • FP16: 2 × 80 × 8 × 128 × 2 = 327,680 bytes ≈ 320 KB per token

For 8K context and 128 concurrent requests: 8,192 × 320 KB × 128 ≈ 320 GiB (about 344 GB). That’s more than any single GPU.
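
To rerun this arithmetic for another model, a minimal sketch follows. The function and parameter names are illustrative, and the total assumes the worst case in which every request fills its whole context window.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       bytes_per_element: int) -> int:
    """Bytes of KV cache stored for one token (keys plus values)."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_element


def kv_bytes_total(per_token: int, context_len: int, num_requests: int) -> int:
    """Worst case: every request fills its whole context window."""
    return per_token * context_len * num_requests


# Llama-3-70B figures from the text, FP16 (2 bytes per element).
per_token = kv_bytes_per_token(num_layers=80, num_kv_heads=8, head_dim=128,
                               bytes_per_element=2)
total = kv_bytes_total(per_token, context_len=8192, num_requests=128)

print(f"{per_token} bytes per token ({per_token / 1024:.0f} KiB)")
print(f"{total / 1024**3:.0f} GiB for 128 requests at 8K context")
```

Swapping in bytes_per_element=1 models an 8-bit KV cache and halves both figures.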

Strategies to fit more in the same memory:

  1. Reduce per-token size (quantization, GQA/MQA)
  2. Reduce fragmentation (PagedAttention)
  3. Reduce duplication (prefix sharing)
  4. Move some off-GPU (offloading)
  5. Evict or compress cold blocks

Let’s cover each.


1. Architectural Reductions: GQA and MQA

Multi-Head Attention (MHA) — every head has its own K and V. Max quality, max cache.

Grouped-Query Attention (GQA) — K and V are shared across groups of query heads. 4–8x smaller KV cache vs MHA, minor quality loss.

Multi-Query Attention (MQA) — single K and V shared across all query heads. Smallest cache, biggest quality hit.

Llama-3-70B uses GQA with 8 KV heads for 64 query heads — an 8x KV cache reduction over pure MHA. Mistral, Qwen, and most modern models also use GQA.

If you’re training a new model from scratch, use GQA. If you’re serving an existing one, its architecture is fixed — check the config to know what you’re dealing with.
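
A quick way to do that check is to compare the query-head and KV-head counts in the model config. The sketch below assumes the field names used in Hugging Face config.json files for Llama-style models (num_attention_heads, num_key_value_heads); other model families may name these differently, and treating a missing KV-head field as MHA is an assumption of this example.

```python
import json


def kv_head_summary(config: dict) -> dict:
    """Classify the attention layout and report the KV reduction versus MHA."""
    q_heads = config["num_attention_heads"]
    # Assumption: a config without num_key_value_heads is plain MHA.
    kv_heads = config.get("num_key_value_heads", q_heads)
    if kv_heads == q_heads:
        kind = "MHA"
    elif kv_heads == 1:
        kind = "MQA"
    else:
        kind = "GQA"
    return {"kind": kind, "kv_heads": kv_heads,
            "reduction_vs_mha": q_heads // kv_heads}


# Head counts quoted above for Llama-3-70B.
example = json.loads('{"num_attention_heads": 64, "num_key_value_heads": 8}')
summary = kv_head_summary(example)
print(summary)
```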


2. KV Cache Quantization

Storing KV cache in lower precision cuts its size proportionally.

FP8 KV cache:

  • 2x smaller than FP16
  • Negligible quality loss on most workloads
  • Supported in vLLM, TGI, TRT-LLM

Enable on vLLM:

vllm serve ... --kv-cache-dtype fp8

INT4 KV cache:

  • 4x smaller than FP16
  • Small but measurable quality loss
  • Supported by some servers (SGLang’s KV cache quantization, custom)

FP4 KV cache:

  • 4x smaller than FP16 (4 bits per value, the same footprint as INT4), with native hardware support on B200
  • Early 2025, quality on long contexts still being validated

For production on H100: FP8 KV cache is essentially free savings. Turn it on.
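
To make the size-versus-accuracy trade concrete, here is a minimal sketch of symmetric integer quantization over one group of values with a shared scale. It is an illustration only, not how any particular server implements its KV cache formats (FP8 in particular is a floating-point cast, not this scheme), and the sample values are made up.

```python
def quantize_group(values: list[float], bits: int = 4) -> tuple[list[int], float]:
    """Symmetric quantization of one group to signed integers with a shared scale."""
    qmax = 2 ** (bits - 1) - 1            # 7 for 4-bit codes
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, scale


def dequantize_group(codes: list[int], scale: float) -> list[float]:
    """Map integer codes back to approximate floating-point values."""
    return [c * scale for c in codes]


# Hypothetical slice of a key vector.
group = [0.12, -0.80, 0.33, 0.05, -0.41, 0.77, -0.02, 0.60]
codes, scale = quantize_group(group)
restored = dequantize_group(codes, scale)
max_error = max(abs(a - b) for a, b in zip(group, restored))

print(codes)
print(f"scale={scale:.4f} max_error={max_error:.4f}")
```

Each value now costs 4 bits plus a share of one scale per group, and the rounding error per value is bounded by half the scale. That bound is the "small but measurable" quality cost in miniature.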


3. PagedAttention: Eliminating Fragmentation

Covered in depth in PagedAttention Explained. The short version: traditional allocators reserve worst-case contiguous blocks per request, wasting 40–60% of KV memory. PagedAttention uses fixed-size blocks (typically 16 tokens each) that any request can use.

In practice: KV memory utilization goes from ~50% (traditional) to ~92% (PagedAttention). Nearly 2x the concurrent requests from the same hardware.

Every modern inference server supports this. If yours doesn’t, switch.
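
The bookkeeping behind paging can be sketched in a few lines. This is a toy allocator under simplifying assumptions (class and method names are hypothetical, and it only tracks block ids, not tensors): a request takes a new block only when its token count crosses a block boundary, so waste is limited to the unused tail of its last block.

```python
BLOCK_SIZE = 16  # tokens per block, as in the text


class PagedKVAllocator:
    """Toy block allocator: tracks which physical blocks each request owns."""

    def __init__(self, num_blocks: int) -> None:
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}   # request id -> list of physical block ids
        self.num_tokens = {}     # request id -> tokens stored so far

    def append_token(self, request_id: str) -> None:
        """Account for one more token; take a new block only on a block boundary."""
        count = self.num_tokens.get(request_id, 0)
        if count % BLOCK_SIZE == 0:
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            block = self.free_blocks.pop()
            self.block_tables.setdefault(request_id, []).append(block)
        self.num_tokens[request_id] = count + 1

    def release(self, request_id: str) -> None:
        """Return all of a finished request's blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(request_id, []))
        self.num_tokens.pop(request_id, None)


alloc = PagedKVAllocator(num_blocks=8)
for _ in range(20):
    alloc.append_token("req-a")   # 20 tokens need 2 blocks
for _ in range(5):
    alloc.append_token("req-b")   # 5 tokens need 1 block

print(alloc.block_tables, len(alloc.free_blocks))
```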


4. Prefix Caching: Sharing KV Across Requests

If many requests share a prefix (system prompt, retrieved documents, few-shot examples), the KV cache for that prefix can be computed once and pointed to by multiple requests.

Gains are workload-dependent but often large:

Workload                      Prefix share        Speedup
RAG with 2K system prompt     100% of requests    1.5–2x
Coding assistant with docs    ~90%                1.4x
Open chat                     Variable            1.1–1.3x
Agent with tool schemas       100%                1.6–2x

Enable on vLLM:

vllm serve ... --enable-prefix-caching

Combined with PagedAttention, this is essentially free for most production workloads.
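
One common way to find shareable blocks is to key each full block by a hash of its tokens chained with the hash of everything before it. The sketch below illustrates that idea under stated assumptions (a tiny block size, hypothetical class names, no eviction or reference counting); it is not a description of any specific server's internals.

```python
import hashlib

BLOCK_SIZE = 4  # tiny block size so the example is easy to follow


def block_keys(token_ids: list[int]) -> list[str]:
    """One key per full block; each key also depends on all earlier blocks."""
    keys, parent = [], ""
    full = len(token_ids) - len(token_ids) % BLOCK_SIZE
    for start in range(0, full, BLOCK_SIZE):
        block = token_ids[start:start + BLOCK_SIZE]
        parent = hashlib.sha256(f"{parent}:{block}".encode()).hexdigest()
        keys.append(parent)
    return keys


class PrefixCache:
    """Maps block keys to physical block ids so identical prefixes are stored once."""

    def __init__(self) -> None:
        self.blocks = {}  # key -> physical block id

    def lookup_or_insert(self, token_ids: list[int]) -> tuple[int, int]:
        """Return (reused_blocks, newly_computed_blocks) for this prompt."""
        reused = computed = 0
        for key in block_keys(token_ids):
            if key in self.blocks:
                reused += 1
            else:
                self.blocks[key] = len(self.blocks)
                computed += 1
        return reused, computed


system_prompt = list(range(8))            # two full blocks shared by every request
cache = PrefixCache()
first = cache.lookup_or_insert(system_prompt + [100, 101, 102, 103])
second = cache.lookup_or_insert(system_prompt + [200, 201, 202, 203])
print(first, second)
```

The second request reuses the two system-prompt blocks and only computes its own final block.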


5. SGLang’s RadixAttention

SGLang extends prefix caching with a radix tree over all seen prefixes. Unlike linear prefix caching, RadixAttention handles prefixes that branch: system prompt → few-shot example A → user query, and system prompt → few-shot example B → user query share cache up to the branch point.

For agent workloads with heavy branching (planning trees, tool-calling loops), this can add another 1.3–2x on top of linear prefix caching.

If your workload has heavy prefix branching, consider SGLang.
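
The branching behaviour can be illustrated with a plain token-level trie. This is a simplification, not SGLang's implementation: a real radix tree compresses runs of tokens into single edges and also handles eviction, and the names here are hypothetical.

```python
class TrieNode:
    def __init__(self) -> None:
        self.children = {}  # token id -> TrieNode


class PrefixTree:
    """Token-level trie over every prompt seen so far."""

    def __init__(self) -> None:
        self.root = TrieNode()

    def match_and_insert(self, token_ids: list[int]) -> int:
        """Return how many leading tokens were already cached, then cache the rest."""
        node, matched = self.root, 0
        for tok in token_ids:
            child = node.children.get(tok)
            if child is None:
                # First miss: every later token lands on fresh nodes as well.
                child = node.children[tok] = TrieNode()
            else:
                matched += 1
            node = child
        return matched


system = [1, 2, 3]
example_a, example_b = [10, 11], [20, 21]

tree = PrefixTree()
hit_1 = tree.match_and_insert(system + example_a + [30])
hit_2 = tree.match_and_insert(system + example_b + [31])   # shares the system prompt
hit_3 = tree.match_and_insert(system + example_a + [32])   # shares system + example A
print(hit_1, hit_2, hit_3)
```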


6. KV Cache Offloading

When GPU memory fills, older KV blocks can be paged to CPU memory or even NVMe, fetched back when needed.

Offload to CPU:

  • 10–50x slower than GPU memory, but far bigger (hundreds of GB)
  • vLLM supports CPU offload; SGLang and TGI partial support
  • Best for workloads with long idle periods between turns (chat apps)

Offload to disk (NVMe):

  • Another 10x slower than CPU memory
  • Used for very long context or multi-day conversations
  • Research-stage; emerging in production systems

Offloading trades latency for capacity. Works best when the “hot” working set fits in GPU while the full history can live elsewhere.
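
A two-tier store captures the idea: a small fast tier holds the hot blocks, and the least-recently-used block spills to a larger slow tier when the fast tier is full. The sketch below uses plain dictionaries as stand-ins for GPU and CPU memory; the class name and capacities are hypothetical.

```python
from collections import OrderedDict


class TieredKVStore:
    """Hot blocks in a small 'gpu' tier; cold blocks spill to a 'cpu' tier."""

    def __init__(self, gpu_capacity: int) -> None:
        self.gpu_capacity = gpu_capacity
        self.gpu = OrderedDict()   # block id -> data, least recently used first
        self.cpu = {}              # block id -> data

    def put(self, block_id: str, data: bytes) -> None:
        """Store a block in the fast tier, spilling the coldest blocks if needed."""
        self.cpu.pop(block_id, None)
        self.gpu[block_id] = data
        self.gpu.move_to_end(block_id)
        while len(self.gpu) > self.gpu_capacity:
            cold_id, cold_data = self.gpu.popitem(last=False)
            self.cpu[cold_id] = cold_data

    def get(self, block_id: str) -> bytes:
        """Fetch a block, promoting it back to the fast tier if it was offloaded."""
        if block_id in self.gpu:
            self.gpu.move_to_end(block_id)
            return self.gpu[block_id]
        data = self.cpu[block_id]   # KeyError if the block was never stored
        self.put(block_id, data)
        return data


store = TieredKVStore(gpu_capacity=2)
for name in ("a", "b", "c"):
    store.put(name, name.encode())

offloaded_before = sorted(store.cpu)   # "a" was the coldest block
fetched = store.get("a")               # promotes "a", spills "b"
print(offloaded_before, sorted(store.gpu), sorted(store.cpu))
```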


7. Eviction Policies

When memory fills and you don’t want to offload, you evict. Options:

  • LRU: evict the least-recently-used request
  • Preempt and recompute: drop a request’s KV cache; re-prefill when it resumes
  • Priority-based: keep premium users’ requests, evict free-tier first

vLLM’s --preemption-mode recompute is the default in recent versions. It drops blocks and recomputes when the request resumes. Simpler and usually faster than swap-to-CPU.
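
The policies above combine naturally. As a rough sketch, the function below picks preemption victims by lowest priority first and least recently used second, until enough blocks are freed. The request fields and tiering scheme are hypothetical, and this is not vLLM's scheduler logic.

```python
from dataclasses import dataclass


@dataclass
class ActiveRequest:
    request_id: str
    priority: int      # higher = more important (hypothetical tiering)
    last_used: float   # timestamp of the most recent decode step
    num_blocks: int    # KV blocks currently held


def pick_victims(requests: list[ActiveRequest], blocks_needed: int) -> list[str]:
    """Choose requests to preempt: lowest priority first, then least recently used."""
    victims, freed = [], 0
    for req in sorted(requests, key=lambda r: (r.priority, r.last_used)):
        if freed >= blocks_needed:
            break
        victims.append(req.request_id)
        freed += req.num_blocks
    return victims


active = [
    ActiveRequest("premium-old", priority=1, last_used=0.0, num_blocks=8),
    ActiveRequest("free-new", priority=0, last_used=5.0, num_blocks=4),
    ActiveRequest("free-old", priority=0, last_used=1.0, num_blocks=4),
]
victims = pick_victims(active, blocks_needed=6)
print(victims)
```

Under a recompute policy, the victims' blocks are simply dropped and each request is re-prefilled when it is scheduled again.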


8. Sliding-Window and Chunked Attention

Some models (Mistral, Gemma, Qwen) were trained with sliding-window attention in some or all layers. Those layers only attend to the last N tokens, so their KV cache is bounded regardless of context length.

At serving time, you can also chunk attention artificially, at the cost of some quality for very long contexts. Useful when you need to fit 128K+ context in limited memory.
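
A rolling buffer makes the bound concrete: once the window is full, each new token pushes out the oldest one. This is a minimal sketch with hypothetical names, using strings as stand-ins for the actual key and value tensors.

```python
from collections import deque


def cached_tokens(context_len: int, window: int) -> int:
    """Tokens a sliding-window layer keeps in cache for a given context length."""
    return min(context_len, window)


class RollingKVBuffer:
    """Keeps KV entries for only the most recent `window` positions."""

    def __init__(self, window: int) -> None:
        self.entries = deque(maxlen=window)

    def append(self, position: int, kv: str) -> None:
        # deque(maxlen=...) silently drops the oldest entry when full.
        self.entries.append((position, kv))

    def positions(self) -> list[int]:
        return [pos for pos, _ in self.entries]


buf = RollingKVBuffer(window=4)
for pos in range(10):
    buf.append(pos, kv=f"kv{pos}")

print(buf.positions())
print(cached_tokens(context_len=131072, window=4096))
```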


9. Disaggregated Prefill and Decode

Prefill (processing the prompt) and decode (generating tokens) have different resource profiles. Separating them onto different node pools lets each be optimized:

  • Prefill nodes: compute-bound, smaller KV needed
  • Decode nodes: bandwidth-bound, larger KV cache

KV cache is transferred from prefill to decode when prompt processing finishes. See Disaggregated Inference.
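
The cost of that handoff is easy to estimate from the per-token KV size. In the sketch below, the 50 GB/s link bandwidth is a placeholder assumption, not a measured figure, and the per-token size is the FP16 Llama-3-70B number from earlier in the post.

```python
def kv_transfer_seconds(prompt_tokens: int, kv_bytes_per_token: int,
                        link_gb_per_s: float) -> float:
    """Time to ship one prompt's KV cache from a prefill node to a decode node."""
    total_bytes = prompt_tokens * kv_bytes_per_token
    return total_bytes / (link_gb_per_s * 1e9)


# 4K-token prompt, 327,680 bytes per token (Llama-3-70B at FP16).
transfer_bytes = 4096 * 327_680
seconds = kv_transfer_seconds(4096, 327_680, link_gb_per_s=50.0)  # assumed bandwidth

print(f"{transfer_bytes / 1024**3:.2f} GiB, about {seconds * 1000:.1f} ms at 50 GB/s")
```

If the estimate is large relative to your time-to-first-token budget, the interconnect becomes part of the sizing exercise.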


10. Compression

Research techniques compress KV cache beyond quantization:

  • H2O: keeps “heavy hitter” tokens, evicts others
  • StreamingLLM: keeps only the start tokens and a recent window
  • KIVI: tuning-free 2-bit KV quantization, per-channel for keys and per-token for values
  • CacheGen: compresses at the block level with context-aware policies

These are mostly 2024–2025 research. Some are in vLLM or SGLang experimental branches. Worth tracking; not yet table stakes.
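
The StreamingLLM policy is the simplest of these to picture. The sketch below computes which token positions survive a "start tokens plus recent window" policy; the parameter names and values are illustrative and not taken from the paper's code.

```python
def streaming_keep_positions(seq_len: int, num_start: int, window: int) -> list[int]:
    """Positions retained by a 'start tokens plus recent window' policy."""
    start = range(min(num_start, seq_len))
    recent = range(max(seq_len - window, 0), seq_len)
    return sorted(set(start) | set(recent))


kept = streaming_keep_positions(seq_len=20, num_start=4, window=8)
print(kept)
print(f"cache holds {len(kept)} of 20 positions")
```

Cache size stays at roughly num_start + window entries no matter how long the sequence grows, at the cost of dropping everything in between.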


Tuning KV Cache For Your Workload

Three knobs in any good inference server:

--gpu-memory-utilization: the fraction of each GPU’s memory the server may use in total. Weights and activation workspace come out of that budget first, and the KV cache gets what remains. Push to 0.92–0.95 on dedicated nodes.

--max-num-seqs: max concurrent active requests. Higher = more concurrency, less KV per request, potential for eviction. Tune based on acceptable eviction rate.

--max-model-len: max context length you’ll allow. Smaller = more slots in the same KV budget. Set to actual production max, not theoretical.

Example for production Llama-3.1-70B on 2x H100 (TP=2):

vllm serve meta-llama/Llama-3.1-70B-Instruct \
    --tensor-parallel-size 2 \
    --quantization fp8 \
    --kv-cache-dtype fp8 \
    --gpu-memory-utilization 0.94 \
    --max-num-seqs 256 \
    --max-model-len 8192 \
    --enable-prefix-caching

That configuration gives us ~250 concurrent active requests under an 8K context limit, sustained on 2x H100s. Without FP8 KV cache, it would be ~130.
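
To see where numbers like these come from, it helps to estimate the KV budget in tokens. The sketch below is back-of-envelope only and every input is an assumption: 80 GB per H100, about 70 GB of FP8 weights, and a guessed 10 GB for activations and runtime overhead.

```python
def kv_token_budget(num_gpus: int, gpu_mem_gb: float, utilization: float,
                    weight_gb: float, overhead_gb: float,
                    kv_bytes_per_token: int) -> int:
    """Tokens of KV cache that fit after weights and overhead are subtracted."""
    usable_gb = num_gpus * gpu_mem_gb * utilization
    kv_bytes = (usable_gb - weight_gb - overhead_gb) * 1e9
    return int(kv_bytes // kv_bytes_per_token)


# Assumed inputs: 2x 80 GB GPUs at 0.94, ~70 GB FP8 weights, ~10 GB overhead.
common = dict(num_gpus=2, gpu_mem_gb=80, utilization=0.94,
              weight_gb=70, overhead_gb=10)
fp8_budget = kv_token_budget(**common, kv_bytes_per_token=163_840)   # FP8 KV
fp16_budget = kv_token_budget(**common, kv_bytes_per_token=327_680)  # FP16 KV

print(f"FP8 KV:  {fp8_budget:,} tokens")
print(f"FP16 KV: {fp16_budget:,} tokens")
```

The budget is a pool shared by all active requests, so achievable concurrency depends on how long typical requests are rather than on the context limit alone. Halving the bytes per token doubles the pool, which is why FP8 KV roughly doubles concurrency.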


Measuring Success

Three metrics to watch:

  1. KV cache utilization: how full your KV cache is on average. vLLM exports this. >85% means you’re pushing hard; >95% sustained means you’re eviction-thrashing.
  2. Preemption rate: how often active requests get evicted. >1% means your concurrency cap is too high or you need more memory.
  3. TTFT vs queue depth correlation: if TTFT spikes when queue depth rises, you’re capacity-bound on KV, not GPU compute.
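
The first two thresholds are simple enough to encode as an alerting rule. This is a minimal sketch using the cutoffs from the list above; the function name and message wording are hypothetical, and the inputs are assumed to be averages over a monitoring window.

```python
def diagnose(kv_utilization: float, preemption_rate: float) -> list[str]:
    """Apply the rule-of-thumb thresholds to windowed averages (both as fractions)."""
    findings = []
    if kv_utilization > 0.95:
        findings.append("KV cache above 95%: likely eviction thrashing if sustained")
    elif kv_utilization > 0.85:
        findings.append("KV cache above 85%: running hot")
    if preemption_rate > 0.01:
        findings.append("preemption above 1%: lower max-num-seqs or add memory")
    return findings


healthy = diagnose(kv_utilization=0.70, preemption_rate=0.001)
overloaded = diagnose(kv_utilization=0.97, preemption_rate=0.02)

for line in overloaded:
    print(line)
```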

The Path Forward

KV cache management is still evolving. 2025–2026 directions:

  • Cross-request KV sharing beyond prefix: mid-sequence sharing when multiple requests converge on similar generations.
  • Persistent KV across sessions: caching a user’s conversation KV in external storage.
  • Cluster-wide KV pools: shared blocks across inference nodes, enabled by fast interconnects.
  • Learned cache policies: RL-trained eviction.

For 2025: quantize (FP8), page (PagedAttention), share (prefix caching), and tune. That covers 95% of the gains.


Running into KV cache pressure in production? We can help — profiling and tuning in under a week.
