LLM Prompt Caching: Architecture and Economics (2026)
Most production LLM applications send the same long prefix on every call. A 4,000-token system prompt, a tool schema, a retrieved document, a few-shot block — all of it recomputed from scratch, request after request, while only the final user question changes. LLM prompt caching is the pattern that stops paying for that repetition. It lets the model reuse the work it already did on a shared prefix, cutting both latency and cost on the part of the prompt that never moves.
In 2026 this is no longer a niche optimization. It is the default expectation for any agent, RAG pipeline, or chat product with non-trivial prompts. Both the major API providers and the open-source serving stacks now ship prefix reuse as a first-class feature. The trap is treating it as a free toggle. Get the prompt layout wrong, or misread the pricing, and caching can quietly cost you more than it saves.
What this post covers: the mechanics of the KV cache and why caching exists, the difference between provider-side caching and self-hosted KV reuse, cache-aware prompt design, a worked economic model with illustrative token math, the failure modes nobody warns you about, and a practical checklist.
Context: why prompt caching exists
To understand prompt caching you have to understand what an LLM actually does with your prompt. Inference splits into two phases with very different cost profiles. The first is prefill: the model reads every token of the input prompt at once and computes a key/value (KV) pair for each token at every attention layer. The second is decode: the model generates output tokens one at a time, each new token attending back over all the keys and values that came before it.
Prefill is compute-bound and parallel. Decode is memory-bandwidth-bound and sequential. The intermediate keys and values from prefill are collectively called the KV cache, and they are the single most expensive artifact in the whole pipeline to produce. For a long prompt, prefill can dominate the time-to-first-token and a large share of the total cost.
The cost scales with prompt length in a way that punishes long prefixes hardest. Prefill work grows with the number of input tokens, so an 8,000-token system prompt is far more expensive to process than the 50-token question that follows it. When that 8,000-token prefix is identical on every request, you are paying the most expensive part of the bill repeatedly for zero new information. Caching attacks exactly this asymmetry: it amortizes the heavy, repeated prefix work across many requests while leaving the cheap, unique tail untouched.
How attention reuse actually works
Here is the structural insight that makes caching possible. In a causal transformer, the KV pair computed for token i depends only on tokens 0 through i. It does not depend on anything that comes after it. So if two requests begin with the exact same sequence of tokens, the KV pairs for that shared prefix are byte-for-byte identical. There is no reason to compute them twice.
This is why caching keys on the prefix, not on the whole prompt. If request A is [system][docs][question_A] and request B is [system][docs][question_B], the [system][docs] portion produces identical KV state in both. Cache it once, and request B skips prefill for everything up to the point where it diverges. The shared work happens once; only the unique suffix pays full price.
The catch is that this only works while the prefix is a contiguous match from token zero. Change one token near the front and every KV pair after it is invalidated, because attention is causal and each downstream value depended on the thing you changed. That single constraint drives almost every design decision later in this post. For a deeper treatment of the underlying mechanics, see our guide to KV cache optimization for LLM inference.
Two workload shapes make this pay off the most, and they cover a large fraction of real applications. The first is repeated queries against a long document: a user asks many different questions about the same annual report, manual, or codebase. The document is prefilled once, and every later question reuses that KV state instead of re-reading the whole thing. The second is multi-turn conversation: each turn appends to a growing transcript, so turn N shares its entire history with turn N+1. Without caching, a long chat re-prefills the full conversation on every message, and cost grows quadratically with turn count. With caching, each turn only prefills its own new tokens. These two patterns — long-document Q&A and multi-turn chat — are the bread and butter of prompt caching, and they are exactly where agents and RAG systems live.
How prompt caching actually works — the core
There are two fundamentally different worlds of LLM prompt caching, and conflating them causes most of the confusion in the field. Provider-side caching is what you get from a hosted API: you send prompts, the provider transparently reuses prefix state behind a billing meter you do not control. Self-hosted KV reuse is what you build when you run the model yourself on vLLM or SGLang: you own the cache, the memory, and the eviction policy. The mechanics rhyme, but the economics and the failure modes differ sharply. Both are forms of the same underlying prompt caching architecture — store the KV state of a shared prefix, look it up on the next matching request, skip the recompute. 
The diagram above shows the universal control flow shared by both worlds. A request arrives. The system checks whether its prefix already exists in the KV cache. On a cache miss, it runs full prefill, stores the prefix KV blocks keyed by a hash, then decodes. On a cache hit, it loads the stored prefix KV and skips straight to decoding the new suffix. The stored blocks then sit available for the next request that shares the prefix.
Provider-side prompt caching
Hosted APIs from Anthropic, OpenAI, and Google all expose some form of prompt caching, and they share a common shape even though the surface details differ. The provider hashes a prefix of your prompt, stores the corresponding KV state in a fast tier, and looks it up on subsequent requests. When the prefix matches, you get a cache read instead of a full recompute.
Three properties define the provider-side model. First, a time-to-live (TTL). Cached prefixes expire after an idle window — often a few minutes by default, with longer-lived tiers available at a premium. If no request touches a cached prefix within the window, it evicts and the next call pays full prefill again. Second, explicit or implicit cache scope. Some providers cache automatically; others require you to mark a cache breakpoint in the prompt so they know where the stable prefix ends.
Third, and most important for your budget, cache-write versus cache-read pricing. Writing a prefix into the cache typically costs more than a normal input token, because the provider has to do the prefill and persist the KV state. Reading from the cache costs dramatically less than a normal input token — often a small fraction of the base rate. This asymmetry is the heart of prompt cache economics, and we model it explicitly below. The implementation details and pricing tiers differ by vendor, so always check the current docs; Anthropic’s prompt caching documentation is a representative starting point.
The vendors also differ on how much you control. Some treat caching as fully automatic: send the same prefix and you transparently get reads, with no markup or breakpoint to manage. Others make it explicit, requiring you to flag the boundary of the cacheable region so the provider knows where your stable content ends. Explicit control is more work but gives you a sharper lever over what gets cached and how the write premium is incurred. Either way, the billing response usually tells you, per request, how many tokens were cache writes versus cache reads versus uncached. That telemetry is the ground truth for your economics — instrument it from day one rather than reasoning about hit rate in the abstract.
Self-hosted KV reuse
When you run open-weight models yourself, prefix caching is a property of the serving engine, and you control it end to end. Two implementations dominate in 2026.
vLLM automatic prefix caching reuses KV blocks across requests without any code changes to your prompts. You enable it with enable_prefix_caching=True, and the engine handles the rest:
from vllm import LLM, SamplingParams
llm = LLM(
model="meta-llama/Llama-3.1-8B-Instruct",
enable_prefix_caching=True,
gpu_memory_utilization=0.90,
)
shared_prefix = SYSTEM_PROMPT + RETRIEVED_DOCS # stable across requests
for question in questions:
prompt = shared_prefix + question # only the tail changes
out = llm.generate(prompt, SamplingParams(max_tokens=256))
Under the hood, vLLM maps each block of tokens to a hash of its content and the content before it, then maintains a global hash table of physical KV blocks. Identical prefixes hash to the same blocks, so they are shared automatically — no tree structure required, and blocks can be allocated and freed independently. Crucially, the vLLM automatic prefix caching docs are explicit that this only speeds up prefill. If your workload spends most of its time decoding long outputs, caching the prefix barely moves the needle.
SGLang’s RadixAttention takes a more structured approach. It keeps the KV cache of all in-flight and recent requests in a radix tree — a prefix tree where edges are token sequences — and runs an LRU eviction policy over it. 
The tree in the figure above is the mental model. A shared system prompt sits near the root. It branches into separate document contexts, and each of those branches again into individual user queries. Every node represents KV state that is computed once and reused by all of its descendants. When two requests share a system prompt and a document but ask different questions, they share the entire path down to where they diverge. The RadixAttention design, introduced in the SGLang paper, reports up to several-fold throughput gains on workloads heavy in shared prefixes. The radix structure makes it especially strong for branching agent and few-shot patterns, where many requests share long common ancestors.
The practical difference: vLLM’s hash-table approach is simple and robust for flat prefix sharing; SGLang’s radix tree excels when your prompts form a branching hierarchy. Both reclaim the exact same fundamental win — never prefill the same prefix twice.
There is one more design axis worth naming: where the cache lives. Self-hosted prefix caching keeps KV blocks in GPU memory, which is fast but scarce. The working set you can keep warm is bounded by how much VRAM is left after your in-flight requests claim theirs. Some 2026 stacks extend this with a tiered cache that spills cold KV blocks to CPU memory or local NVMe, then pages them back on a hit. That trades a slower hit for a much larger effective cache, which helps workloads with many distinct-but-recurring prefixes. The right choice depends on your traffic shape: a handful of hot system prompts fits comfortably in VRAM, while a long tail of per-customer contexts may justify a tiered design.
When prompt caching does not help
It is worth being explicit about the workloads where caching earns nothing, because shipping it there adds complexity for no return. Three patterns stand out.
The first is all-unique prompts. Suppose every request is a fresh, one-shot prompt with no shared prefix. A batch classifier that embeds the full input inline, with no common system block, is the classic case. There is nothing to reuse, and the write premium is pure waste. The second is decode-dominated workloads. Caching only saves prefill time. If your prompts are short but your outputs are long — a creative-writing or long-form generation task — prefill is a rounding error and caching barely registers. The third is constantly-mutating prefixes. A prompt that legitimately changes its leading content every call, such as one that opens with live market data, will miss on every request no matter how the cache is configured. In all three cases, the honest answer is to leave caching off and spend the engineering effort elsewhere.
Cache-aware prompt design
Caching is only as good as your hit rate, and hit rate is a design choice, not luck. The prefix match must be contiguous from token zero. So the single most important rule is simple: put everything static at the front and everything variable at the back. 
The contrast in the figure is the whole discipline in one picture. The bad layout opens with a timestamp and a request ID — both of which change on every single call. That mutation at position zero invalidates the entire downstream cache, so nothing is ever reused. The good layout front-loads the stable material: system instructions, tool definitions, the reference document. Only after a clean cache breakpoint does the variable content appear — the user’s question, then the per-request metadata.
A few concrete rules follow from this:
- Never put a timestamp, UUID, or session ID at the top of the prompt. Move it to the very end, after the question, or pass it out of band if the model does not strictly need it inline.
- Keep the system prompt and tool schema byte-stable. Reordering JSON keys, re-serializing with different whitespace, or A/B-testing wording all break the prefix. Version your system prompt and change it deliberately, not incidentally.
- Order RAG context from most-stable to least-stable. A pinned policy document belongs above freshly retrieved chunks. If retrieval order is non-deterministic, sort it before assembling the prompt so identical retrievals produce identical token sequences.
- Place the cache breakpoint after the largest stable block. On providers that require an explicit breakpoint, mark it right where reusable content ends, so you cache the maximum and pay write cost only once.
Semantic-level reuse is a complementary technique that catches near-duplicate requests rather than shared prefixes; the two compose well. We cover that pattern in our piece on semantic caching for LLM applications.
Version the prefix deliberately
A subtle discipline separates teams that get sustained value from LLM prompt caching from those that watch their hit rate decay. The cached prefix is effectively a versioned artifact, and it should be treated like one. Every time you tweak the system prompt, add a tool, or change the document set, you mint a new prefix and pay a fresh wave of cache writes as traffic migrates to it. That is fine when the change is intentional. It is a quiet tax when it happens by accident — a config refactor that reorders fields, a library upgrade that changes JSON formatting, an A/B test that injects variant wording.
The fix is to make prefix changes explicit and rare. Keep the cacheable region behind a single, reviewed code path. Log a prefix version identifier so you can correlate a hit-rate dip with the deploy that caused it. When you must change the prefix, do it on a deliberate schedule and expect a transient cost bump as caches warm. Teams that treat the prefix as a stable contract, rather than something the application assembles freshly each time, keep their LLM cost optimization gains intact over months instead of watching them erode after every release.
Measure hit rate, do not assume it
Cache-aware design is a hypothesis until production traffic confirms it. The metric that matters is the share of prefix tokens served as reads, not the share of requests that were cached at all. A request can be a partial hit — its prefix matched up to a point, then diverged — so token-level accounting is more honest than a binary hit/miss count.
On hosted APIs, sum the cache-read tokens against the total cacheable tokens from the usage fields on each response. On vLLM or SGLang, the engine exposes cache-hit metrics you can scrape. Watch the rate over a full traffic cycle, including off-peak troughs where TTL expiry bites hardest. A hit rate that looks healthy at midday peak can collapse overnight, and your monthly bill reflects the average, not the best hour. Set an alert if the measured rate drops below the break-even threshold your cost model implies.
The economics — a worked cost model
This is where prompt caching earns its keep or quietly betrays you. Let me work a concrete example. Every number below is an illustrative assumption chosen to show the mechanics — these are not vendor-quoted prices, and you must substitute your provider’s current rates.
Assume an agent with a stable prefix of 8,000 tokens (system prompt, tools, a pinned document) and a variable tail of 500 tokens per request. We adopt these illustrative per-token rates, expressed relative to a base input-token price of 1.0×:
| Token type | Illustrative rate (relative to base input) |
|---|---|
| Normal input token | 1.0× |
| Cache-write token (first time) | 1.25× |
| Cache-read token (hit) | 0.1× |
The first request — a forced cache write. It pays the write premium on the 8,000-token prefix plus the normal rate on the 500-token tail:
write cost = 8,000 × 1.25 = 10,000 units
tail cost = 500 × 1.0 = 500 units
first call = 10,500 units
That first call is more expensive than no caching at all (which would be 8,500 units). This is the break-even trap: a single cached request loses money. Caching only pays off across repeated hits.
This is why the write premium matters so much for low-traffic prefixes. If a prefix is touched only once before its TTL expires, you have paid the premium and captured none of the savings — a strict loss. Prefixes that are unique to a single short-lived session can quietly do this at scale, eroding the savings your popular prefixes earn. The healthy pattern is a small number of hot prefixes — a shared system prompt, a handful of pinned documents — that are hit thousands of times each, dwarfing the write cost. Concentrating traffic onto a few stable prefixes is itself an optimization.
A subsequent request — a cache hit. It reads the 8,000-token prefix at the cheap rate and pays normal rate on its fresh tail:
read cost = 8,000 × 0.1 = 800 units
tail cost = 500 × 1.0 = 500 units
hit call = 1,300 units
Against the no-cache baseline of 8,500 units per call, that is roughly an 85% reduction on the input side. The break-even point is the number of hits needed to recover the write premium. The write cost us an extra 1,500 units versus a plain call (10,500 − 8,500 + the 500 tail accounting nets to a 2,000-unit premium over the cached steady state). Each hit then saves 7,200 units (8,500 − 1,300). So even a single hit after the write puts you well ahead. With illustrative numbers like these, two cache reads already pay back the write.
The sensitivity that matters is hit rate — the fraction of requests that find a live prefix. 
The curve in the figure is conceptual, not measured, but its shape is the point. At a 0% hit rate (every call a write), caching costs slightly more than baseline. The effective cost per request falls steeply as hit rate climbs, crossing below the no-cache line at a low threshold and approaching the cheap-read floor near 100%. The economic question is never “is caching cheaper?” in the abstract. It is “what hit rate will my actual traffic achieve?” A chat product with long-lived sessions might sit at 70–90%. A fleet of one-shot, all-unique prompts might sit near zero — and there, caching is a net loss. For self-hosted economics whe
