KV Cache Optimization for LLM Inference: A Deep Dive

When teams first move a large language model from a notebook demo to a production endpoint, they almost always discover the same surprise: the model weights are not what fill the GPU. The KV cache is. A 70B model in FP16 weighs roughly 140 GB, but at long context lengths and high concurrency, the key-value cache can rival or exceed that — and unlike weights, it grows with every token, every request, and every concurrent user. KV cache optimization is therefore the single highest-leverage lever you have over serving cost and throughput, because the cache is what actually caps how many requests you can batch together.

This post treats the KV cache not as an implementation detail but as the central resource you are managing in an inference server. We will derive the memory math, then work through the techniques that attack each term in that equation — and, just as importantly, where each one stops helping.

What this post covers: the memory formula, prefill vs decode dynamics, PagedAttention, prefix caching, KV quantization, GQA/MQA, eviction and CPU offload, and sliding-window attention — with the trade-offs that decide which to reach for.

Context: why the KV cache exists at all

The KV cache exists to avoid recomputing attention over tokens the model has already seen. During autoregressive decoding, each new token attends to every previous token’s keys and values; caching them turns an O(n²) recompute into an O(n) lookup per step. Without it, generating token 1,000 would re-process the first 999 every time.

A transformer’s attention layer projects each token into a query, a key, and a value vector. To predict the next token, the current query attends across all prior keys and values. Those prior keys and values do not change once computed — token 50’s key is identical whether you are decoding token 51 or token 5,000. So we store them. This is the cache: a per-layer, per-head store of key and value tensors for every token in the sequence so far. The trade is the classic one in systems engineering — we spend memory to save compute, and that memory cost is exactly what this article is about.

The reason this matters more in 2026 than it did two years ago is context length. Models routinely advertise 128K or longer context windows, agentic workloads replay enormous tool-call histories, and retrieval-augmented prompts stuff thousands of tokens into every request. Each of those tokens occupies cache in every layer. The result is that for long-context serving, the KV cache — not the weights and not the compute — is the binding constraint. Understanding it is foundational to everything downstream, including speculative decoding, which only pays off once your cache and batching are already efficient. For the canonical treatment of the attention mechanism that produces these tensors, the original transformer paper remains the reference (Vaswani et al., “Attention Is All You Need”, arXiv:1706.03762).

It helps to frame the whole problem economically. Inference cost is dominated by how many sequences you can keep on a GPU at once, because each device-hour is fixed regardless of utilization. If the cache lets you serve 8 concurrent requests, your per-request cost is roughly four times higher than if it let you serve 32. Every technique in this article is, at bottom, an attempt to raise that concurrency number without buying more hardware. That reframing — cache footprint as the denominator of your cost-per-request — is the lens I want you to carry through the rest of the post.

The memory math: sizing the KV cache

The KV cache size for a single sequence is approximately 2 × num_layers × num_kv_heads × head_dim × seq_len × bytes_per_element, and multiplying by batch size gives total footprint. The leading 2 accounts for storing both keys and values. Every other term is a knob, and every optimization in this article reduces one of them — so it is worth internalizing the formula before touching any technique.

Figure 1 — Each KV cache optimization attacks a specific term in the memory equation; knowing which term tells you when a technique helps.

Walking a worked example

Let us put illustrative numbers through the formula. These parameters are representative of a mid-to-large dense model; treat them as labeled placeholders, not a spec for any specific model:

num_layers = 80
num_kv_heads = 8 (already reduced by grouped-query attention; more on that below)
head_dim = 128
seq_len = 8,192 tokens
bytes_per_element = 2 (FP16)

Per-token, per-sequence the cache costs 2 × 80 × 8 × 128 × 2 bytes ≈ 327,680 bytes, about 320 KiB per token. For an 8,192-token sequence that is roughly 2.5 GiB. The headline takeaway is the linearity: at batch size 32, you are looking at roughly 80 GiB of cache — on the same device that already holds the model weights. This is why a server with plenty of compute headroom still hits an out-of-memory wall as concurrency climbs.

Run the same arithmetic against the wrong architecture and the numbers turn hostile fast. If this model used full multi-head attention with 64 KV heads instead of 8, every figure above multiplies by eight: 2.5 KiB per token becomes 20 KiB, the 8,192-token sequence becomes roughly 20 GiB, and a batch of 32 would demand 640 GiB of cache — far beyond any single accelerator. That single architectural term is the difference between a model you can serve on one device and one that forces multi-GPU sharding purely to hold the cache. It is worth keeping a back-of-envelope version of this formula in your head, because it lets you sanity-check a serving plan in seconds: layers times KV heads times head-dim, doubled, times two bytes, gives per-token cost; multiply by your target context and concurrency and compare to free HBM. If that product exceeds what is left after weights, you have a cache problem before you have written a line of serving code.

Why this caps batch size and throughput

The practical consequence is a budget equation. Take your total HBM, subtract model weights and activation working memory, and divide the remainder by per-sequence cache cost — that quotient is your maximum concurrent batch. Because decode throughput on modern accelerators scales with batch size (the GPU is memory-bandwidth bound during decode, not compute bound, so larger batches amortize each weight read across more sequences), anything that shrinks per-sequence cache directly raises the batch ceiling and therefore tokens-per-second. Cache optimization is throughput optimization. This relationship is documented in vLLM’s own analysis of memory-bound serving (vLLM documentation).

Prefill versus decode

The two phases of inference stress the cache differently. Prefill processes the entire prompt in one parallel pass, is compute-bound, and writes the initial cache in bulk. Decode then generates one token at a time, is memory-bandwidth-bound, and appends one cache entry per step per sequence. Prefill is a burst; decode is a long, steady drip that dominates total latency for any meaningful generation length. Most cache optimizations target decode, because that is where you spend the most wall-clock time and where the cache footprint grows unbounded.

This split has a direct consequence for how you reason about each technique. Anything that reduces prefill cost — prefix caching is the prime example — improves time-to-first-token and frees compute, but does nothing to slow the per-token cache growth during decode. Anything that reduces per-token footprint — quantization, fewer KV heads, sliding windows — directly raises the concurrency ceiling that decode throughput depends on. When you evaluate a technique, the first question to ask is which phase it touches, because that tells you which metric it will move: time-to-first-token, or sustained tokens-per-second. Confusing the two is the most common reason a promising optimization fails to show up where someone expected it.

The optimization toolkit: shrinking each term

The cache optimizations divide cleanly by which they attack: memory layout (PagedAttention), redundant work (prefix caching), precision (quantization), the head count (GQA/MQA), and the sequence length itself (eviction, offload, sliding windows). The best production stacks combine several, because they are largely orthogonal. Below we take them in roughly that order.

PagedAttention and block-based allocation

The naïve approach reserves one contiguous buffer per request sized for the maximum possible sequence length. This wastes enormous memory to internal fragmentation: a request that generates 200 tokens but was allocated for 8,192 leaves the rest reserved and idle. PagedAttention, introduced in vLLM, borrows the operating-system idea of paging. It splits the cache into fixed-size blocks (for example, 16 tokens each) and allocates blocks on demand, maintaining a block table that maps a sequence’s logical positions to physical blocks anywhere in memory.

Figure 2 — A block table indirection lets PagedAttention pack KV blocks densely, eliminating the reserved-but-unused waste of contiguous allocation.

The payoff is near-elimination of internal fragmentation, which in practice lets you fit substantially more concurrent sequences into the same HBM. The block-table indirection also enables copy-on-write sharing: two sequences with an identical prefix can point at the same physical blocks until they diverge, which is the foundation for prefix caching below. The cost is a layer of indirection and a custom attention kernel that can gather from non-contiguous blocks — complexity that frameworks now hide, but that you should understand when debugging memory behavior. The original design is described in the PagedAttention paper (Kwon et al., arXiv:2309.06180).

It is worth being precise about which kind of waste paging fixes. Contiguous allocation suffers from two problems at once. Internal fragmentation is the reserved-but-unused tail of an over-provisioned buffer — the 7,992 tokens of room you booked for a request that generated 200. External fragmentation is the opposite: enough free memory exists in total to admit a new request, but it is scattered in gaps too small to hold one contiguous buffer, so the request is rejected anyway. Paging dissolves both, because a block can live anywhere and a sequence simply collects whatever free blocks exist. This is exactly why the operating-systems analogy is more than a marketing flourish — virtual memory solved the identical problem for process address spaces decades ago, and the KV cache is, structurally, the same allocation problem wearing a different hat. The one parameter that does not disappear is block size: it sets the granularity of allocation, and because a sequence rounds up to a whole block, a too-large block reintroduces a smaller dose of the internal fragmentation you were trying to kill. Most deployments leave it at the framework default and only revisit it when profiling says to.

Prefix caching and prompt reuse

Many requests share a prefix: a long system prompt, a few-shot template, a document being queried repeatedly, or a conversation’s accumulated history. Prefix caching — sometimes called automatic prefix caching in vLLM — keeps the KV blocks for those shared tokens resident and reuses them across requests instead of recomputing them. When a new request arrives whose opening tokens hash-match cached blocks, the server skips prefill for the matched span entirely.

# Pseudocode: prefix-cache lookup at request admission
def admit_request(token_ids, block_size=16):
    cached_blocks, matched_len = [], 0
    # Walk the prompt block-by-block, matching the longest cached prefix.
    for i in range(0, len(token_ids), block_size):
        block = tuple(token_ids[i:i + block_size])
        block_hash = hash_prefix(cached_blocks, block)  # chain hash with prior blocks
        if block_hash in kv_block_pool:
            cached_blocks.append(kv_block_pool[block_hash])  # reuse, skip prefill
            matched_len += len(block)
        else:
            break  # first miss ends the matchable prefix
    return cached_blocks, matched_len  # prefill only token_ids[matched_len:]

The win is largest for workloads with heavy prefix overlap — multi-turn chat, agents that replay tool histories, or RAG over a stable corpus — where it can remove most prefill cost. The catch is that prefix caching helps prefill, not decode, so it does nothing for the per-token cache growth that dominates long generations. It also consumes memory to keep prefixes resident, so it competes with active sequences for the same blocks; eviction policy on the prefix pool matters.

There is also a design lesson hiding in why the hashing is chained. A block’s cache identity must depend not just on its own tokens but on every token before it, because attention keys and values are position- and context-dependent — the KV entries for the phrase “summarize the following” are different when they follow a 4,000-token document than when they open a request. Chaining the hash across prior blocks guarantees a reuse only happens when the entire preceding context matches, which is what makes the reuse correct rather than merely plausible. This is why prefix caching shines for stable system prompts and templated few-shot scaffolding, where the leading tokens are byte-identical request after request, and why it quietly stops helping the moment something varies near the front of the prompt — a per-request timestamp or user ID placed before the shared body will invalidate the whole match. Order your prompts so the stable, shareable content comes first and the variable content comes last; that single layout decision can be the difference between a 5% and a 90% hit rate.

KV cache quantization

The cleanest way to attack bytes_per_element is to store keys and values in lower precision than the compute path. KV cache quantization to FP8 halves cache footprint versus FP16 with typically negligible quality loss; INT8 is similar in size with slightly more care needed around outliers; INT4 quarters the original footprint but is where accuracy risk becomes real and per-channel or per-group scaling becomes essential.

Figure 3 — Lowering KV precision trades accuracy headroom for cache capacity; FP8 is usually free, INT4 demands careful scaling.

A subtlety worth knowing: keys and values often quantize differently. Keys feed the attention score computation and tend to be more sensitive to error than values, so some schemes keep keys at higher precision than values. The trade-off is not free — quantizing and dequantizing on the fly adds kernel work, and aggressive precision can degrade long-context reasoning before it shows up on short benchmarks. The right precision is workload-dependent, which is why you should measure quality on your own evaluation set rather than trust a generic claim. For the quality impact of low-bit caches the KIVI work is a useful reference (Liu et al., arXiv:2402.02750).

GQA, MQA, and reducing KV heads

Grouped-query attention (GQA) and multi-query attention (MQA) reduce num_kv_heads, the term that often dominates the formula. Multi-head attention gives every query head its own key and value heads. MQA collapses all key/value heads to one shared pair; GQA is the middle ground, letting groups of query heads share a smaller set of KV heads. Reducing KV heads from, say, 64 to 8 cuts the cache by 8× along that axis while keeping the full set of query heads for expressiveness.

Figure 4 — GQA and MQA shrink the KV-head term of the memory formula by sharing key/value heads across query heads.

This is an architectural decision baked in at training time, not a serving-side knob — you cannot retrofit GQA onto a model trained with full multi-head attention without retraining or conversion. But it is the reason modern models are so much cheaper to serve at long context, and it explains why the num_kv_heads in our worked example was already 8 rather than 64. When you choose a base model, its KV-head count is one of the most consequential serving-economics decisions you will inherit. GQA is described in the GQA paper (Ainslie et al., arXiv:2305.13245).

The reason GQA became near-universal rather than a niche trick is that it sits on a favorable point of the quality-versus-cost curve. MQA’s single shared KV head is the most aggressive reduction, but collapsing everything to one pair can cost measurable quality on harder tasks, and it can make distributed inference awkward because there is only one KV head to shard across devices. Full multi-head attention preserves quality but pays the full cache bill. GQA’s group count is the dial between them: a model with 64 query heads and 8 KV groups shares each KV pair across 8 query heads, recovering most of MQA’s memory savings while keeping enough KV diversity that quality barely moves. From a serving standpoint the practical lesson is simple — when two candidate models are otherwise comparable, the one with fewer KV heads is dramatically cheaper to run at long context, and that often outweighs small differences on a leaderboard. The cache term you inherit at model-selection time is harder to fix later than almost any serving-side setting.

Eviction, CPU offload, and sliding-window attention

When the cache will not fit, the last family of techniques attacks seq_len itself or moves the cache off the critical device. Eviction policies drop KV entries judged least useful — heavy-hitter approaches keep the tokens that historically received the most attention plus a recent window, on the bet that distant low-attention tokens contribute little. CPU offload spills cold cache blocks to host memory and pages them back over the PCIe or NVLink interconnect when needed, trading bandwidth and latency for capacity; it is most viable when offloaded blocks are accessed rarely, since the transfer can otherwise dominate decode time.

Sliding-window attention and attention sinks attack the length structurally. A sliding window bounds attention to the most recent w tokens, so the cache stops growing past w regardless of total sequence length — converting unbounded cache growth into a constant. The wrinkle, identified in the StreamingLLM work, is that naïvely dropping the very first tokens destroys quality; models dump attention onto those initial “sink” tokens, so keeping a few sink tokens plus the recent window preserves stability. The trade is real information loss — anything outside the window is simply gone — so sliding windows suit streaming and chat far better than tasks needing precise long-range recall. The attention-sink phenomenon is documented in StreamingLLM (Xiao et al., arXiv:2309.17453).

The right way to think about this last family is as a spectrum of how much you are willing to forget versus how much you are willing to slow down. Eviction forgets selectively and keeps everything fast, betting that an importance heuristic can identify the safe-to-drop tokens; it fails when the heuristic is wrong about which distant token mattered. Offload forgets nothing but pays in bandwidth and latency every time a cold block is needed; it fails when “cold” was a bad prediction and the block is needed constantly. Sliding windows forget by position, predictably and cheaply, accepting that anything old is unrecoverable. None of the three is strictly better — they trade the same two currencies, accuracy and time, in different proportions. The decision is therefore a workload question, not a technology one: how far back does your task genuinely need to see, and how much latency can you spend to keep distant context reachable? Answer those honestly and the choice usually makes itself.

Figure 5 — A workload-driven decision path: prefix overlap points to prefix caching, long context to quantization and windowing, capacity pressure to offload.

How the cache interacts with parallelism

The cache does not live on one device in large deployments, and how it is split shapes its cost. Under tensor parallelism, attention heads are partitioned across GPUs, so the KV cache for those heads is partitioned too — each device holds the keys and values for the heads it owns. This divides the cache footprint by the tensor-parallel degree, which is one of the quieter reasons large models are sharded even when the weights alone might fit. The interaction with GQA is worth flagging: a model with very few KV heads can have fewer KV heads than GPUs, forcing some heads to be replicated rather than cleanly divided, which slightly blunts the memory benefit of sharding. Pipeline parallelism splits the model by layers instead, so each stage holds the cache for only its layers; the cache divides by pipeline depth but every request now occupies cache on every stage for its whole lifetime, which constrains how the stages can be balanced.

A newer pattern, often called disaggregated or prefill-decode-separated serving, takes the prefill-versus-decode distinction to its logical conclusion by running the two phases on different pools of hardware. Prefill, being compute-bound and bursty, runs on one set of GPUs; decode, being memory-bandwidth-bound and long-lived, runs on another tuned for cache capacity and bandwidth. The freshly computed KV cache is transferred from the prefill pool to the decode pool over the interconnect. The appeal is that each phase gets hardware suited to its bottleneck, and you can scale them independently — but the cache transfer itself becomes a real cost, and it only pays off at scale where the phases would otherwise interfere with each other’s batching. It is a sign of how central the cache has become that an entire serving topology now exists mainly to manage where the cache is computed versus where it is consumed.

Trade-offs, gotchas, and what goes wrong

The most common mistake is stacking techniques without measuring their interaction. Quantization and aggressive eviction both erode accuracy independently; applied together on a long-context reasoning task, the combined degradation can be larger than either alone and invisible on short evals. Always validate on context lengths and task types that mirror production.

Prefix caching’s failure mode is silent memory pressure: a large resident prefix pool starves active sequences, dropping your effective batch size and throughput even as cache “hit rate” looks healthy. Watch the eviction interplay between the prefix pool and the active pool. CPU offload’s failure mode is bandwidth saturation — if your access pattern brings offloaded blocks back too often, decode latency spikes and you would have been better off simply admitting fewer requests. Sliding windows quietly break tasks that need the dropped tokens; a summarizer that worked in testing can hallucinate when given a document longer than the window. And PagedAttention’s block size is a real tuning parameter: too large reintroduces fragmentation, too small inflates block-table overhead. None of these are reasons to avoid the techniques — they are reasons to instrument them.

Two further traps deserve naming because they fool experienced teams. The first is benchmarking on short sequences. Almost every degradation introduced by quantization, eviction, and windowing is length-dependent: it is invisible at 512 tokens and severe at 32K, because the errors and the dropped context only accumulate over distance. If your eval suite is short, you will ship a regression you never saw. The second is optimizing for average throughput while ignoring the tail. Cache pressure does not produce graceful slowdowns; it produces cliffs. As concurrency rises and free blocks run out, the scheduler must preempt or recompute sequences, and tail latency jumps discontinuously even though the average looks fine. The metric that actually governs user experience under load is the high-percentile latency at your real concurrency, not the mean at a comfortable one. Instrument both, and treat any optimization that improves the average while worsening the tail with suspicion.

Practical recommendations

Approach the cache as a budget you measure, not a default you accept. A pragmatic order of operations:

Start by measuring per-sequence cache cost with the formula and your real model parameters; know your batch ceiling before optimizing.
Use a framework with PagedAttention (vLLM, SGLang, or TensorRT-LLM) — block allocation is table stakes in 2026, not an advanced feature.
Turn on prefix caching if your workload has meaningful prefix overlap (chat, agents, RAG over stable corpora). Measure hit rate and watch pool memory.
Quantize the cache to FP8 as a near-free first step; validate before going to INT8/INT4, and consider keeping keys at higher precision than values.
Prefer base models with GQA/MQA for long-context serving — it is the cheapest term to have already been reduced for you.
Reach for eviction, offload, or sliding windows last, only when capacity-bound, and only after confirming your task tolerates the information loss.
Benchmark on production-like context lengths and concurrency — see the vLLM, SGLang, and TensorRT-LLM comparison for how these choices play out across engines.

Frequently asked questions

What is the KV cache in LLM inference?

The KV cache stores the key and value tensors that each transformer attention layer computes for every token already processed. During autoregressive decoding, every new token attends to all prior tokens; caching their keys and values avoids recomputing them at each step, turning quadratic recompute into linear lookup. The trade is memory — the cache grows with sequence length, batch size, and model depth, and it often becomes the binding constraint on serving.

How do I calculate KV cache size?

Use 2 × num_layers × num_kv_heads × head_dim × seq_len × batch × bytes_per_element. The leading 2 covers keys and values. Plug in your model’s architecture for layers, KV heads, and head dimension, your serving context length and batch, and your storage precision in bytes. The result is total cache footprint in bytes — compare it against HBM left after weights to find your maximum concurrent batch.

Does KV cache quantization hurt accuracy?

It depends on precision and workload. FP8 is usually near-lossless for most tasks. INT8 needs mild care around outliers. INT4 can meaningfully degrade long-context reasoning unless you use per-channel or per-group scaling, and keys are often more sensitive than values. The honest answer is to evaluate on your own long-context tasks rather than trust a generic accuracy claim, because short benchmarks hide the regressions.

What is PagedAttention and why does it matter?

PagedAttention applies operating-system paging to the KV cache. Instead of one contiguous per-request buffer sized for the worst case, it allocates fixed-size blocks on demand and maps them through a block table. This nearly eliminates the internal fragmentation that wastes memory in naïve allocation, letting you batch more sequences, and it enables prefix sharing via copy-on-write. It is the foundation of modern high-throughput serving.

How does the KV cache limit batch size?

After subtracting model weights and activation memory from total HBM, the remainder divided by per-sequence cache cost is your maximum concurrent batch. Because decode is memory-bandwidth-bound, larger batches mean higher throughput — so a bigger cache per sequence directly lowers your batch ceiling and your tokens-per-second. Every cache optimization that shrinks per-sequence footprint raises that ceiling, which is why cache work is throughput work.

When should I use sliding-window attention?

Use it for streaming and long-running chat where recent context matters most and unbounded cache growth is the problem. A sliding window caps the cache at a fixed size regardless of total length. Keep a few attention-sink tokens at the start to preserve stability. Avoid it for tasks needing precise recall of distant tokens — summarization of long documents, multi-hop retrieval — since anything outside the window is permanently dropped.

Is the KV cache the same as model weights in memory?

No, and conflating them is a common sizing mistake. Weights are fixed — they load once and do not grow. The KV cache is dynamic: it scales with sequence length, batch size, and the number of concurrent requests, growing one entry per token per layer during decode. On a long-context, high-concurrency workload the cache can match or exceed the weights, which is why you must budget HBM for both separately and treat the cache as the variable cost that determines your concurrency ceiling.

Does prefix caching help every workload?

No. It only helps when requests share leading tokens — stable system prompts, few-shot templates, repeated documents, or multi-turn conversation history. Workloads where every prompt is unique from the first token see little benefit and still pay the memory cost of keeping prefixes resident. To benefit, place stable shared content at the front of the prompt and variable content like timestamps or user IDs at the end, because any variation near the start invalidates the cached match.

KV Cache Optimization for LLM Inference: A Deep Dive

KV Cache Optimization for LLM Inference: A Deep Dive

Context: why the KV cache exists at all

The memory math: sizing the KV cache

Walking a worked example

Why this caps batch size and throughput

Prefill versus decode

The optimization toolkit: shrinking each term

PagedAttention and block-based allocation

Prefix caching and prompt reuse

KV cache quantization

GQA, MQA, and reducing KV heads

Eviction, CPU offload, and sliding-window attention

How the cache interacts with parallelism

Trade-offs, gotchas, and what goes wrong

Practical recommendations

Frequently asked questions

What is the KV cache in LLM inference?

How do I calculate KV cache size?

Does KV cache quantization hurt accuracy?

What is PagedAttention and why does it matter?

How does the KV cache limit batch size?

When should I use sliding-window attention?

Is the KV cache the same as model weights in memory?

Does prefix caching help every workload?

Further reading

Related

Comments

Leave a Reply Cancel reply

Tag Cloud

Categories