AI Inference Cost Optimization: GPU FinOps in 2026

The bill that kills a successful AI product is rarely training. It is the steady, compounding drip of serving the model after launch — and that is exactly where AI inference cost optimization earns its keep. A model that delights users in a demo can quietly burn six figures a month once traffic is real, because every token generated is a forward pass on a graphics processing unit (GPU) that you are renting by the second. The cruel part is that most teams overpay by a factor of three to ten not because they chose the wrong model, but because their serving stack leaves the GPU idle, holds memory it never uses, and runs at a precision the task does not need. This post treats inference economics as an architecture decision record: the context, the options, the decision, and the consequences you will live with.

What this covers: the real cost drivers of inference, the serving techniques that move the needle (continuous batching, paged KV-cache, quantization, speculative decoding, routing), a worked cost-per-million-tokens model, the trade-offs that bite, and a practical FinOps checklist.

Context and Background

For a decade the dominant cost question in machine learning was “how do we afford to train this?” In 2026 that question has flipped. Foundation models are pre-trained once and served billions of times, so for any product with real usage the lifetime cost is dominated by inference, not training. GPU FinOps — the discipline of measuring, attributing, and optimizing GPU spend the way cloud FinOps did for general compute — has moved from a nice-to-have to a board-level line item.

The state of the art has consolidated around a small set of open serving engines. vLLM popularized PagedAttention and continuous batching and is now the default for most self-hosted deployments. NVIDIA’s TensorRT-LLM squeezes maximum throughput from NVIDIA silicon with kernel-level fusion and FP8 paths. SGLang adds aggressive prefix caching and a structured runtime for agentic and tool-calling workloads. All three converge on the same physics: the GPU is fast at matrix math and slow at moving memory, so the game is keeping the compute units busy while the key-value cache (the memory of the conversation) stays small and well-packed.

It helps to make that physics concrete. A modern data-center GPU can perform several hundred floating-point operations for every byte it reads from high-bandwidth memory; this ratio is the arithmetic intensity the hardware needs in order to stay busy. Decoding one token for one sequence reads all of the model’s weights and the sequence’s KV-cache but performs only a thin slice of math against them, so the arithmetic intensity is far below what the silicon can sustain. The result is a memory-bandwidth-bound regime where the expensive tensor cores sit idle waiting on memory. Every optimization in this record is ultimately an attempt to climb that intensity curve, amortizing each costly memory read across more useful work, which is why they compound rather than compete.
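A back-of-the-envelope sketch makes the gap visible. The peak compute and bandwidth figures below are illustrative placeholders rather than vendor specifications, and the decode estimate assumes each weight contributes one multiply-add per sequence and ignores KV-cache reads.

```python
def ridge_point(peak_flops_per_s: float, mem_bytes_per_s: float) -> float:
    """FLOPs the chip can execute per byte read before memory stops being the limit."""
    return peak_flops_per_s / mem_bytes_per_s


def decode_intensity(batch_size: int, bytes_per_weight: float = 2.0) -> float:
    """Rough FLOPs per weight byte in one decode step.

    Assumption: every weight takes part in one multiply-add (2 FLOPs) for each
    sequence in the batch, and KV-cache traffic is ignored.
    """
    return 2.0 * batch_size / bytes_per_weight


# Illustrative hardware: 1e15 FLOP/s of dense compute, 3e12 bytes/s of memory bandwidth.
ridge = ridge_point(1e15, 3e12)
for batch in (1, 16, 64, 256):
    ai = decode_intensity(batch)
    print(f"batch={batch:>3}  intensity={ai:6.1f} FLOP/byte  ({ai / ridge:.1%} of ridge)")
```

Under these assumptions a single FP16 sequence does about one FLOP per byte read, and intensity grows linearly with batch size, which is the reason batching dominates the rest of this record.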

What changed recently is that these techniques stopped being research and became table stakes. The serving frameworks ship them on by default, the quantization formats are production-stable, and managed platforms expose them as flags. The differentiator now is operational: knowing which knob to turn for your traffic shape, and proving the saving in dollars per million tokens. That operational gap is what this decision record addresses. For teams routing across several providers, an LLM gateway architecture is the control plane where many of these decisions are enforced.

The Decision: Optimize the Serving Path, Not Just the Model

The decision is this: treat inference cost as a property of the serving path — batching, memory, precision, routing, and capacity — rather than a property of the model you picked. Choosing a smaller model helps, but a well-served large model often beats a poorly served small one on cost per useful answer. Optimize the path first; right-size the model second.

Figure 1: The inference cost optimization decision path — from raw request to a served token, each stage is a lever on cost.

Figure 1 frames the levers in the order they should be evaluated. A request arrives; a router decides whether a small model or cache can answer it; the chosen engine batches it with concurrent requests; the KV-cache subsystem allocates memory; the model runs at a chosen precision; and an autoscaler decides how much hardware backs the whole thing. Each box is a cost decision, and they multiply rather than add. Getting batching wrong wastes the GPU you already paid for; getting precision wrong doubles the hardware you need; getting routing wrong sends a trivial query to your most expensive tier.

Cost drivers: tokens, GPU-hours, and utilization

Three numbers determine your bill. The first is tokens: both the input (prompt) tokens you must process and the output tokens you must generate, with output generation being the expensive autoregressive part. The second is GPU-hours: how long you rent the silicon. The third, and the one teams ignore, is utilization: what fraction of those GPU-hours actually did useful work. A GPU at 20% utilization is not 20% of the cost; it is the full cost for one-fifth of the value. Cost per million tokens falls out of these three: rental dollars divided by millions of tokens served, where utilization sets how many tokens each rented hour actually produces and so separates a good stack from a wasteful one.

Prefill versus decode: two different workloads on one card

A subtlety hides inside “tokens.” A request has two phases with opposite cost characteristics. The prefill phase ingests the whole prompt in a single parallel forward pass — every prompt token is processed at once, so the matrix multiplies are large and dense and the GPU runs at high arithmetic intensity, often near compute-bound. The decode phase then emits output tokens one at a time, each a tiny forward pass that re-reads the full weights and KV-cache for a single new position — deeply memory-bound and inefficient per token. The consequence is economic: a 4,000-token prompt that yields a 100-token answer spends most of its FLOPs in a fast, efficient prefill, while a 100-token prompt that yields a 4,000-token answer spends most of its wall-clock time in slow, expensive decode. Cost-per-token is therefore not a single number; it depends on the input-to-output ratio of your traffic. Chunked prefill — splitting a long prompt into pieces and interleaving them with ongoing decodes — keeps the decode stream from stalling behind a giant prefill and is one reason modern engines sustain steadier throughput under mixed load.
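The asymmetry can be sketched numerically. The throughput rates below are hypothetical placeholders, and the FLOP share assumes each token costs roughly the same compute in either phase (attention's dependence on context length is ignored).

```python
def phase_shares(prompt_tokens: int, output_tokens: int,
                 prefill_tok_per_s: float = 10_000.0,
                 decode_tok_per_s: float = 50.0) -> tuple[float, float]:
    """Return (share of FLOPs spent in prefill, share of wall-clock spent in decode)."""
    # Assumption: equal FLOPs per token in both phases, so FLOP share follows token share.
    flop_share_prefill = prompt_tokens / (prompt_tokens + output_tokens)
    t_prefill = prompt_tokens / prefill_tok_per_s
    t_decode = output_tokens / decode_tok_per_s
    time_share_decode = t_decode / (t_prefill + t_decode)
    return flop_share_prefill, time_share_decode


for prompt, output in ((4000, 100), (100, 4000)):
    flops, decode_time = phase_shares(prompt, output)
    print(f"{prompt:>4} in / {output:>4} out: "
          f"{flops:.1%} of FLOPs in prefill, {decode_time:.1%} of time in decode")
```

With these placeholder rates the prompt-heavy request puts almost all of its FLOPs into prefill, while the generation-heavy request spends nearly all of its wall-clock time decoding; substitute measured rates from your own stack before drawing conclusions.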

Why the GPU sits idle

Naive serving processes one request at a time. The model generates a token, waits for the next request slot, generates another. Because generation is memory-bandwidth bound at small batch sizes, a single-stream server leaves most of the GPU’s compute idle between memory reads. The fix is to run many sequences through the same forward pass — batching — so each expensive weight load serves dozens of requests at once. This is the single highest-leverage change available, and it is why the rest of this record orbits around keeping batches full.

The original thesis

Here is the claim the survey posts miss: the cheapest token is the one you never generate, and the second cheapest is the one you generate in a batch you already paid for. Most cost programs chase quantization, which lowers the price of each token, while leaving batches half-empty and caches cold, which squanders the tokens each rented GPU-hour could have produced. Order of operations matters: fill the batch, warm the cache, and route away work before you reach for a smaller number format.

Deeper Analysis: The Techniques That Move the Needle

This section walks the major techniques in roughly descending order of typical impact, then builds a cost model so the savings are concrete. Every absolute number below is labelled illustrative unless it carries a citation; the point is the shape of the savings, which is stable across hardware generations.

Continuous (in-flight) batching

Static batching waits to collect N requests, runs them together, and returns when the slowest finishes — so a 500-token reply blocks behind a 50-token reply that finished long ago. Continuous batching, also called in-flight batching, operates at the iteration level: every forward pass re-forms the batch, evicting finished sequences and admitting new ones immediately. The GPU never waits for the slowest member of a fixed cohort. According to vLLM’s own analysis, continuous batching combined with PagedAttention and chunked prefill is what lets the engine serve roughly three to five times more traffic than a naive PyTorch loop on the same H100. In vLLM this is the default; the knobs that matter are --max-num-seqs (peak concurrency) and --max-num-batched-tokens (how much work per step).

It is worth understanding what the scheduler actually does each iteration, because that mechanism is the source of the gain. On every decode step the engine holds a set of running sequences and a waiting queue. It first decides how many new sequences it can admit without exhausting the KV-cache block pool, then forms a batch up to the --max-num-batched-tokens budget, blending the single-token decode steps of running sequences with chunks of prefill from newly admitted ones. After the forward pass, any sequence that emitted an end-of-sequence token or hit its length limit is retired, its KV blocks are returned to the pool, and the freed capacity lets a waiting request enter on the very next step rather than at the end of a fixed window. This iteration-level admission and eviction is precisely why a long generation no longer holds the batch hostage, and why effective occupancy stays high even when reply lengths vary wildly. The failure mode to watch is queueing: if the cache pool is too small or --max-num-seqs is set conservatively, requests pile up in the waiting queue and time-to-first-token climbs even though the GPU looks busy — a signal that you are memory-gated, not compute-gated, and should free cache (shorter max length, quantized KV, more memory headroom) rather than add raw FLOPs.
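A toy simulation shows why iteration-level admission matters. This is a deliberately simplified sketch, not vLLM's scheduler: it counts decode steps only, gives every sequence one token per step, and ignores prefill cost and KV-cache limits.

```python
from collections import deque


def static_batching_steps(lengths: list[int], batch_size: int) -> int:
    """Fixed cohorts: each batch runs until its longest member finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths: list[int], batch_size: int) -> int:
    """Re-form the batch every iteration: retire finished sequences, admit waiting ones."""
    waiting = deque(lengths)
    running: list[int] = []  # remaining tokens for each in-flight sequence
    steps = 0
    while waiting or running:
        # Admit waiting sequences into any free slots before the forward pass.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        # One forward pass emits one token for every running sequence.
        running = [remaining - 1 for remaining in running]
        steps += 1
        # Retire finished sequences so their slots are free on the very next step.
        running = [remaining for remaining in running if remaining > 0]
    return steps


reply_lengths = [500, 50, 50, 50, 500, 50, 50, 50]
print(static_batching_steps(reply_lengths, batch_size=4))      # 1000
print(continuous_batching_steps(reply_lengths, batch_size=4))  # 550
```

The same eight requests finish in 550 forward passes instead of 1,000 because the short replies stop holding slots hostage; a real scheduler additionally gates admission on free KV-cache blocks and the per-step token budget.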

Paged attention and KV-cache management

Every token a model has seen is stored as key and value tensors — the KV-cache — so it does not recompute the whole context each step. The cache grows with conversation length and dominates memory for long contexts. Classic serving pre-allocated a contiguous block per sequence sized to the maximum length, wasting memory on every request that did not run to the limit. PagedAttention treats KV memory like an operating system treats RAM: fixed-size blocks allocated on demand, no contiguous reservation, near-zero fragmentation. The practical effect is more sequences fit in the same memory, which means bigger batches, which means lower cost per token. The cache pool size is what gates concurrency — once it is full, new requests queue.

The math is worth doing once, because it tells you exactly how many concurrent users a card holds. The per-token KV-cache footprint is, illustratively, 2 (key and value) × number_of_layers × number_of_KV_heads × head_dimension × bytes_per_element. Take a model with 32 layers, 8 key-value heads (grouped-query attention), a head dimension of 128, served in FP16 (2 bytes): that is 2 × 32 × 8 × 128 × 2 = 131,072 bytes, roughly 128 KB per token. A single 8,000-token conversation therefore needs about 1 GB of KV-cache on its own. On an 80 GB card, after the model weights claim their share, you might have 40 GB left for cache — room for only about 40 such long sequences at full length. Now layer the classic waste on top: if you pre-reserved 8,000 tokens of contiguous space per request but the average conversation only ran 1,500 tokens, you burned more than 80% of that reservation. PagedAttention recovers it by allocating, say, 16-token blocks on demand, so a 1,500-token request holds roughly 94 blocks instead of 500 — and the reclaimed memory becomes additional batch slots. Quantizing the KV-cache to FP8 halves the 128 KB figure again, directly doubling the sequence count the same card sustains. This is the chain that converts a memory optimization into a throughput-and-cost optimization: smaller cache, more sequences, fuller batches, fewer GPU-hours per million tokens.
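The same arithmetic as a small calculator, using the model shape from the paragraph above. The 40 GB of free cache memory is interpreted here as 40 GiB, and the helper names are illustrative.

```python
import math


def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int, bytes_per_elem: int) -> int:
    """Key plus value tensors for one token across all layers."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem


def max_full_length_sequences(cache_bytes: int, tokens_per_seq: int, per_token: int) -> int:
    """How many sequences fit if each one runs to tokens_per_seq."""
    return cache_bytes // (tokens_per_seq * per_token)


def blocks_needed(tokens: int, block_size: int = 16) -> int:
    """Paged allocation: whole blocks, allocated only for tokens actually present."""
    return math.ceil(tokens / block_size)


fp16 = kv_bytes_per_token(layers=32, kv_heads=8, head_dim=128, bytes_per_elem=2)
fp8 = kv_bytes_per_token(layers=32, kv_heads=8, head_dim=128, bytes_per_elem=1)
cache = 40 * 2**30  # 40 GiB left for KV-cache after weights (illustrative)

print(fp16)                                          # 131072 bytes, i.e. 128 KiB per token
print(max_full_length_sequences(cache, 8000, fp16))  # 40
print(max_full_length_sequences(cache, 8000, fp8))   # 81
print(blocks_needed(1500), blocks_needed(8000))      # 94 500
```

Swapping in your own layer count, KV-head count, and context length gives a first estimate of the concurrency ceiling before any load test.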

Figure 2: Paged KV-cache allocation — sequences draw fixed-size blocks on demand from a shared pool, and prefix blocks are shared across requests with a common prefix.

Figure 2 shows the allocation flow. A shared block pool serves all in-flight sequences; each sequence holds a block table mapping logical positions to physical blocks. When two requests share a prefix — a common system prompt, a shared document — they point at the same physical blocks. That sharing is prefix caching, and for workloads with repeated system prompts or retrieval-augmented context it can eliminate a large fraction of prefill compute outright. SGLang and vLLM both expose automatic prefix caching; for an agent that resends a 2,000-token system prompt on every turn, the cache turns repeated prefill into a near-free lookup. Concretely, if that 2,000-token prefix is shared across a thousand requests, you pay to prefill it once and the other 999 requests inherit the computed key-value blocks by pointer — the prefill cost of the shared region drops by three orders of magnitude for that cohort. The mechanism that makes this safe is copy-on-write: shared blocks stay read-only and immutable, so divergent continuations each allocate their own fresh blocks past the shared point without corrupting a neighbor’s cache. The practical caveat is eviction policy — prefix blocks are only valuable while they remain resident, so under memory pressure a least-recently-used eviction can quietly cold-start your hot prefixes; monitor cache hit rate, not just cache size.
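The sharing mechanism can be sketched as a block pool keyed by a hash of each block's tokens chained to the hash of everything before it. This is a minimal illustration of the idea, not the data structure any particular engine uses; the class name, block size, and omission of eviction and of partial tail blocks are all simplifying assumptions.

```python
import hashlib

BLOCK = 16  # tokens per block (illustrative)


class PrefixBlockPool:
    """Toy shared pool: identical prefixes map to the same physical block ids."""

    def __init__(self) -> None:
        self.blocks: dict[bytes, int] = {}  # chained hash -> physical block id
        self.hits = 0
        self.misses = 0

    def allocate(self, token_ids: list[int]) -> list[int]:
        """Return the block table for a sequence, reusing blocks whose whole prefix matches."""
        table: list[int] = []
        parent = b""
        full = len(token_ids) - len(token_ids) % BLOCK  # only full blocks are shareable here
        for i in range(0, full, BLOCK):
            chunk = token_ids[i:i + BLOCK]
            # Chaining the parent hash means a block matches only if all earlier blocks match.
            key = hashlib.sha256(parent + repr(chunk).encode()).digest()
            if key in self.blocks:
                self.hits += 1        # prefill for this block can be skipped
            else:
                self.misses += 1      # block must be computed and stored
                self.blocks[key] = len(self.blocks)
            table.append(self.blocks[key])
            parent = key
        return table


pool = PrefixBlockPool()
system_prompt = list(range(64))                       # 4 shared blocks
req_a = system_prompt + [1000 + i for i in range(16)]
req_b = system_prompt + [2000 + i for i in range(16)]
print(pool.allocate(req_a))  # [0, 1, 2, 3, 4]
print(pool.allocate(req_b))  # [0, 1, 2, 3, 5]
print(pool.hits, pool.misses)  # 4 6
```

The second request reuses the four system-prompt blocks and allocates a fresh block only where it diverges; the hit and miss counters are the kind of signal to monitor alongside cache size.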

Quantization: FP8, INT8, and INT4

Quantization stores weights (and sometimes activations and the KV-cache) in fewer bits. FP8 (8-bit floating point) is now production-standard on H100 and H200 class hardware and typically preserves quality within noise while roughly halving memory versus 16-bit. INT8 and INT4 (8- and 4-bit integer) go further; INT4 weight quantization can let a 70B-parameter model fit on a single GPU with room for KV-cache, where 16-bit would demand two or more. Among 4-bit methods, AWQ (Activation-aware Weight Quantization) protects the roughly 1% of “salient” weights that disproportionately affect output and generally preserves quality better, while GPTQ calibrates faster but is slightly less accurate. The cost lever is twofold: smaller weights mean cheaper hardware, and smaller KV-cache (when you quantize the cache too) means bigger batches.
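The single-GPU claim follows from simple arithmetic on weight storage. The sketch below counts raw weight bytes only; real quantized checkpoints also carry scales and other metadata, and the 80 GB card size is an assumption borrowed from the KV-cache example.

```python
def weight_gb(params_billion: float, bits_per_weight: int) -> float:
    """Raw weight storage in GB (10^9 bytes), ignoring quantization metadata."""
    return params_billion * bits_per_weight / 8


GPU_GB = 80  # illustrative card size
for name, bits in (("FP16", 16), ("FP8/INT8", 8), ("INT4", 4)):
    size = weight_gb(70, bits)
    left = GPU_GB - size
    verdict = f"{left:.0f} GB left for KV-cache" if left > 0 else "does not fit on one card"
    print(f"70B at {name:<8} = {size:5.1f} GB -> {verdict}")
```

At 16 bits the weights alone exceed one card, at 8 bits they fit with little room for cache, and at 4 bits roughly half the card remains for KV-cache, which is where the batch-size win comes from.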

The quality-versus-cost trade-off deserves a clear-eyed framing rather than a blanket “use INT4.” Quantization error is not uniform across the work the model does. Tasks with wide error tolerance — short classification, extraction, casual chat — barely register the noise that 4-bit weights introduce, because the answer space is forgiving and a slightly perturbed logit still picks the right token. Tasks that chain many dependent steps — multi-hop reasoning, arithmetic, code that must compile, long-context retrieval where a single wrong token derails the rest — accumulate that error and degrade visibly. The right mental model is a precision ladder with two semi-independent rungs. On the compute rung, FP8 activations and weights are the safe default on modern silicon: roughly half the memory and meaningful throughput from native FP8 tensor cores, at a quality cost usually inside measurement noise. INT8 weights are typically safe too and run well where FP8 hardware is absent. INT4 is the aggressive rung — it unlocks the single-GPU fit and the bigger batch, but it is the rung where method choice (AWQ over GPTQ), per-channel scaling, and a calibration set drawn from your real prompts stop being optional. The KV-cache is a third, often-overlooked rung: FP8 KV-cache quantization frequently costs less quality than INT4 weights while delivering the batch-size win directly, so for memory-gated workloads it can be the better first move. The decision rule that survives contact with production is simple — never ship a quantization step on vibes; gate every rung behind an evaluation on your own task and watch the worst-case slices, not the average.
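One way to encode "watch the worst-case slices, not the average" is a rollout gate that compares per-slice scores. The function name, slice names, scores, and threshold below are all hypothetical; the point is only that the gate keys on the largest per-slice drop.

```python
def passes_quality_gate(baseline: dict[str, float],
                        candidate: dict[str, float],
                        max_drop: float = 0.02) -> bool:
    """Accept a quantized model only if no evaluation slice regresses by more than max_drop."""
    worst_drop = max(baseline[name] - candidate[name] for name in baseline)
    return worst_drop <= max_drop


# Hypothetical accuracy per task slice for a 16-bit baseline and a 4-bit candidate.
baseline = {"chat": 0.90, "extraction": 0.95, "math": 0.80}
int4 = {"chat": 0.92, "extraction": 0.96, "math": 0.74}

mean_drop = sum(baseline[k] - int4[k] for k in baseline) / len(baseline)
print(f"mean drop {mean_drop:.3f}")         # small: the average hides the regression
print(passes_quality_gate(baseline, int4))  # False: the math slice fell too far
```

In this made-up example the average drop is about one point and would pass a naive check, while the math slice loses six points and blocks the rollout.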

Speculative decoding

Speculative decoding uses a small, cheap draft model to propose several tokens ahead, then the large target model verifies them all in a single parallel forward pass; accepted tokens are kept, rejected ones force a fallback. Because the expensive target model now emits multiple tokens per step, output-heavy workloads like code and long-form generation can see a two-to-four-times throughput gain. The crucial caveat: this helps most at low batch sizes where the GPU is memory-bound and idle. At large batch sizes the GPU is already compute-saturated, and the draft model’s extra work becomes pure overhead — it can make you slower. Speculative decoding is a low-traffic and latency-sensitive tool, not a universal one.

The economics turn entirely on one measured quantity: the acceptance rate, the fraction of drafted tokens the target model ratifies. The intuition is a budget. Suppose the draft model proposes 4 tokens per step and the target verifies all 4 in one forward pass that costs roughly the same as generating a single token normally. If the acceptance rate is high — say the first 3 of 4 are accepted on average — you produced about 3 tokens for the price of one target step plus a cheap draft run, a near-3x speedup. If the acceptance rate is poor — only 1 of 4 accepted because the draft model poorly approximates the target’s distribution — you paid for 4 draft passes and a verification pass to advance a single token, and you are now slower than plain decoding. Because verification is exact (the algorithm guarantees the output distribution is identical to the target model’s), there is no quality cost, only a throughput bet — and the bet is won or lost on acceptance rate. That rate depends on how well the draft mirrors the target (a distilled sibling beats an unrelated tiny model), on the workload (predictable, templated text like code and structured output accepts more readily than open-ended creative prose), and on the proposal length (longer drafts raise the ceiling but lower the per-token acceptance probability). The operational discipline is therefore to instrument acceptance rate as a first-class metric, tune draft length to it, and disable speculation automatically once batch occupancy rises past the point where the GPU is compute-bound — at which point the entire premise (idle compute to spend on verification) no longer holds.
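The bet can be put into numbers with the standard expected-tokens estimate for speculative decoding. The sketch assumes each drafted token is accepted independently with probability `alpha`, that verifying the whole draft costs about one target step (the memory-bound regime described above), and that a draft pass costs a fixed fraction of a target pass; all three are simplifications.

```python
def expected_tokens_per_target_step(alpha: float, draft_len: int) -> float:
    """Expected tokens emitted per verification pass, including the target's own token."""
    if alpha >= 1.0:
        return float(draft_len + 1)
    return (1 - alpha ** (draft_len + 1)) / (1 - alpha)


def speculative_speedup(alpha: float, draft_len: int, draft_cost_ratio: float) -> float:
    """Throughput relative to plain decoding.

    draft_cost_ratio: cost of one draft-model pass as a fraction of one target pass.
    """
    cost_per_step = draft_len * draft_cost_ratio + 1.0
    return expected_tokens_per_target_step(alpha, draft_len) / cost_per_step


for alpha in (0.8, 0.5, 0.2):
    s = speculative_speedup(alpha, draft_len=4, draft_cost_ratio=0.1)
    print(f"acceptance={alpha:.1f}  speedup={s:.2f}x")
```

With a four-token draft and a draft model costing a tenth of the target, high acceptance yields well over a 2x gain while low acceptance drops below 1x, meaning speculation is slower than plain decoding. Once batches are large enough that verification is no longer nearly free, the cost term grows and the break-even acceptance rate rises with it.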

A worked cost-per-million-tokens model (illustrative)

The numbers below are illustrative, chosen to show how the levers compound rather than to quote a specific cloud’s price list. Assume one rented accelerator at a list rate of $3.00 per GPU-hour (an illustrative on-demand price for an H100-class card). Suppose the unoptimized stack sustains an effective 800 output tokens per second, while a tuned stack sustains 3,200 tokens per second thanks to fuller batches and FP8.

| Configuration | Tokens/sec (illustrative) | GPU-hours per 1M tokens | Cost per 1M tokens at $3/hr |
|---|---|---|---|
| Naive single-stream, FP16 | 800 | 0.347 | $1.04 |
| Continuous batching, FP16 | 2,000 | 0.139 | $0.42 |
| Continuous batching, FP8 | 3,200 | 0.087 | $0.26 |
| Add prefix cache (50% prompt hit) | 3,200 | 0.061 (effective) | $0.18 |
| Add spot GPU at ~30% discount | 3,200 | 0.061 | $0.13 |

Read the arithmetic: 1,000,000 tokens at 800 tokens per second is 1,250 seconds, or 0.347 GPU-hours, which at $3 per hour comes to $1.04. The same million tokens at 3,200 tokens per second is 312.5 seconds, or 0.087 GPU-hours, for $0.26. Layer on the prefix-cache saving (a 50% prompt hit rate, which the table treats as roughly 30% fewer effective GPU-hours) and a 30% spot discount, and the same workload lands near $0.13. That is an eight-times spread from identical hardware and the same model, driven entirely by the serving path. This mirrors the direction reported by practitioners who quote roughly 75% cost reductions from optimization, though the exact figure depends on traffic shape.
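The table can be re-derived from measured throughput with a few lines. In this sketch the 0.7 factor for prefix caching is read off the table's effective GPU-hours (0.087 falling to 0.061) rather than derived from the hit rate, and the spot discount is modelled as a flat 30% price cut; both are illustrative.

```python
def gpu_hours_per_million(tokens_per_sec: float) -> float:
    """GPU-hours needed to serve one million tokens at a sustained aggregate rate."""
    return 1_000_000 / tokens_per_sec / 3600


def cost_per_million(tokens_per_sec: float, usd_per_gpu_hour: float,
                     gpu_hour_factor: float = 1.0, price_factor: float = 1.0) -> float:
    """Dollars per million tokens, with optional multipliers for saved work and discounts."""
    hours = gpu_hours_per_million(tokens_per_sec) * gpu_hour_factor
    return hours * usd_per_gpu_hour * price_factor


RATE = 3.00  # illustrative $/GPU-hour
rows = [
    ("Naive single-stream, FP16", 800, 1.0, 1.0),
    ("Continuous batching, FP16", 2000, 1.0, 1.0),
    ("Continuous batching, FP8", 3200, 1.0, 1.0),
    ("Add prefix cache", 3200, 0.7, 1.0),
    ("Add spot GPU", 3200, 0.7, 0.7),
]
for name, tps, hour_factor, price_factor in rows:
    usd = cost_per_million(tps, RATE, hour_factor, price_factor)
    print(f"{name:<28} ${usd:.2f}")
```

Replacing the throughput column with numbers measured under realistic load turns the illustrative table into your own.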

A few caveats keep this model honest. The “tokens per second” in the table is effective aggregate output throughput across the whole batch, not the per-user generation speed — an individual user still sees their own inter-token latency, which batching can slightly worsen even as system throughput soars. The prefix-cache line assumes a 50% prompt-token hit rate and counts only prefill savings; its real value swings wildly with workload, approaching zero for all-unique prompts and far higher for agent loops that resend large fixed system prompts. The spot discount is modelled as a flat price cut, but in practice it carries an availability and interruption cost that does not appear in a single-card table — you must hold a reserved floor underneath it, which raises the blended rate above the headline spot price. Finally, the model is single-card and ignores tensor-parallel communication overhead, load-balancer inefficiency, and the long tail of partially-filled batches during traffic troughs. The honest reading is directional: the ordering of the levers and the rough magnitude of the spread are robust; the exact dollar figures are a function of your hardware, model size, and traffic shape, and should be re-derived from your own measured throughput.

Figure 3: Illustrative cost per million tokens falling as each lever is applied — the biggest single drop comes from batching, not quantization.

Figure 3 makes the order of operations visible: the largest single reduction comes from batching (idle GPU reclaimed), with quantization, caching, and spot pricing each shaving the remainder. Teams that lead with INT4 and ignore batch occupancy capture the smallest slice first and wonder why the bill barely moved. Profiling the serving path — for example with eBPF continuous profiling — is how you confirm where the GPU-seconds actually go before you optimize.

Disaggregated prefill and decode serving

A newer structural lever deserves its own treatment because it attacks the prefill-versus-decode mismatch head-on. In a single-pool server, the compute-heavy prefill passes and the memory-bound decode steps share the same GPUs and contend for the same scheduler budget — a burst of long prompts can stall the decode stream, and a flood of long generations can starve prefill. Disaggregated serving physically separates the two: a pool of prefill workers ingests prompts, computes the KV-cache, and hands the completed cache to a pool of decode workers that stream out tokens. Each pool can then be provisioned, batched, and even quantized for its own profile — prefill nodes optimized for dense throughput, decode nodes optimized for memory bandwidth and high concurrency — and each can scale independently with its own queue. The cost win is sharper utilization on both halves and the freedom to right-size the expensive resource (decode memory) without over-buying prefill compute. The trade-off is real: the KV-cache must be transferred between pools over a fast interconnect, which adds latency and engineering complexity, so disaggregation pays off mainly at scale and for workloads with pronounced prefill/decode asymmetry. For most teams it is a later-stage lever, applied after batching, caching, and precision are already exhausted, but it is the direction the highest-volume serving stacks are moving.
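The transfer cost is easy to estimate before committing to the architecture. The sketch below reuses the roughly 128 KB-per-token KV figure from the cache-sizing example and assumes a hypothetical 50 GB/s effective link; it ignores protocol overhead and any overlap of transfer with compute.

```python
def kv_transfer_ms(prompt_tokens: int, kv_bytes_per_token: int,
                   link_bytes_per_s: float) -> float:
    """Time to ship a finished prefill's KV-cache from a prefill worker to a decode worker."""
    return prompt_tokens * kv_bytes_per_token / link_bytes_per_s * 1000


KV_PER_TOKEN = 131_072  # bytes, from the FP16 example model
LINK = 50e9             # bytes/s, hypothetical effective interconnect bandwidth

for prompt in (500, 2000, 8000):
    print(f"{prompt:>5}-token prompt: {kv_transfer_ms(prompt, KV_PER_TOKEN, LINK):6.2f} ms")
```

If the resulting delay is small next to your time-to-first-token budget, disaggregation is at least plausible; if it is comparable, the interconnect becomes the first thing to fix.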

Routing, right-sizing, and small-model-first

The cheapest token is the one a small model answers. A router can classify each request and send simple ones (classification, extraction, short answers) to a 7B or 8B model and reserve the flagship for hard reasoning. Combined with caching of identical or semantically similar requests, routing removes load from the expensive tier entirely. This is where the gateway and the serving stack meet: the gateway decides which model and whether to cache, the engine decides how to serve it.

Routing has its own economics worth making explicit. The classifier that decides where to send a request is itself an inference call, so it must be cheap — a small encoder, a heuristic, or a tiny model — or it eats the savings it is meant to create. The payoff comes from the shape of real traffic, which is almost always heavy-tailed: a large fraction of requests are trivial (a greeting, a yes/no, a short extraction) and a small fraction are genuinely hard. If 70% of traffic can be answered acceptably by a model that costs a fifth as much to serve, the blended cost per request falls dramatically even though the hard 30% is unchanged. Layer a response cache in front — an exact-match cache for repeated identical prompts and a semantic cache that returns a stored answer when a new request is close enough in embedding space — and you shed not just the cheap tier but the inference call entirely for the hottest queries. The risk to manage is mis-routing: send a hard query to the small model and you pay twice when the user retries or escalates, so route conservatively, measure escalation rate, and treat the router’s confidence threshold as a tunable cost-versus-quality dial rather than a fixed setting.
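The blended-cost argument, including the mis-routing penalty, fits in one function. Costs are expressed relative to one flagship call, and the router cost and escalation rate below are hypothetical values to be replaced with measurements.

```python
def blended_cost(easy_fraction: float, small_cost: float, large_cost: float,
                 router_cost: float = 0.0, escalation_rate: float = 0.0) -> float:
    """Expected cost per request with a small-model-first router.

    escalation_rate: fraction of small-model answers that are retried on the large model,
    in which case the request pays for both calls.
    """
    easy = easy_fraction * (small_cost + escalation_rate * large_cost)
    hard = (1 - easy_fraction) * large_cost
    return router_cost + easy + hard


# 70% of traffic routed to a model costing one-fifth as much as the flagship.
print(round(blended_cost(0.7, small_cost=0.2, large_cost=1.0), 3))   # 0.44
# Same split with a router costing 1% of a flagship call and 5% escalations.
print(round(blended_cost(0.7, 0.2, 1.0, router_cost=0.01, escalation_rate=0.05), 3))  # 0.485
```

Raising the escalation rate in this model shows how quickly an over-eager router gives back its savings, which is why the confidence threshold is worth tuning against measured escalations.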

Trade-offs, Gotchas, and What Goes Wrong

None of these levers is free, and several interact badly when stacked carelessly.

Quantization is not lossless. FP8 is usually safe, but INT4 can degrade reasoning, math, and long-context fidelity in ways that do not show up on a quick eyeball test. Evaluate on your task before shipping a quantized model; a cheaper token that gives a wrong answer is the most expensive token of all because the user retries or escalates.

Speculative decoding can slow you down. As noted, at high batch occupancy the draft model is dead weight. Worse, a poorly matched draft model has a low acceptance rate, so you pay for the draft passes and still fall back to the target. Measure acceptance rate, not just peak throughput.

Batching trades latency for throughput. Bigger batches and longer scheduling windows raise time-to-first-token and inter-token latency for individual users. An interactive chat product and a bulk document-processing job want opposite settings. Tune --max-num-batched-tokens per workload, and separate latency-critical traffic onto its own pool.

Spot and preemptible GPUs vanish without warning. They are the cheapest capacity, often 60–70% off on-demand, but the cloud can reclaim them in seconds. Without checkpointing in-flight requests and draining gracefully, a preemption drops live conversations. Use spot for batch and overflow, keep a baseline of on-demand or reserved capacity for the floor, and make the autoscaler aware of both pools. The robust pattern is a preemption handler that, on the cloud’s interruption notice (typically a short warning window), stops admitting new sequences, lets in-flight generations finish or migrates their KV-cache to a surviving node, and re-queues anything that cannot drain in time — so a reclaim becomes a graceful handoff rather than a wall of dropped requests.
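The triage step of such a handler can be sketched as a pure function. The per-sequence decode rate and warning window are assumed inputs, the request ids are made up, and a real handler would also stop admissions and perform the actual migration or re-queue.

```python
def plan_drain(remaining_tokens: dict[str, int],
               tokens_per_sec_per_seq: float,
               warning_s: float) -> tuple[list[str], list[str]]:
    """Split in-flight requests into those that can finish before reclaim and those that cannot."""
    budget = int(tokens_per_sec_per_seq * warning_s)  # tokens one sequence can still emit
    finish = [rid for rid, left in remaining_tokens.items() if left <= budget]
    requeue = [rid for rid, left in remaining_tokens.items() if left > budget]
    return finish, requeue


in_flight = {"req-a": 100, "req-b": 5000, "req-c": 850}
finish, requeue = plan_drain(in_flight, tokens_per_sec_per_seq=30.0, warning_s=30.0)
print(finish)   # ['req-a', 'req-c']
print(requeue)  # ['req-b']
```

Requests in the second list are the ones to migrate or re-queue on a surviving node as soon as the interruption notice arrives.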

Multi-tenancy and bin-packing leak. Packing many small models or tenants onto one GPU raises utilization but invites noisy-neighbor latency spikes and memory contention; one tenant’s long context can starve another’s cache. Set per-tenant cache quotas and admission limits. Bin-packing is a genuine cost lever — a GPU running one lightly-used 7B model is mostly wasted, and co-locating several such tenants reclaims that idle capacity — but the packing must respect both static memory (each model’s weights) and dynamic memory (each tenant’s peak KV-cache), or a traffic spike on one tenant triggers out-of-memory evictions that cascade across neighbors. Treat the GPU like a multi-tenant database: enforce quotas, reserve headroom, and isolate the latency-sensitive tenants onto their own cards.
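A minimal admission check that respects both kinds of memory might look like the following. The tenant sizes and headroom are hypothetical, and a production packer would also account for activation memory and fragmentation.

```python
def fits_on_gpu(gpu_gb: float, tenants: list[tuple[float, float]],
                headroom_gb: float = 4.0) -> bool:
    """tenants: (weights_gb, peak_kv_cache_gb) per co-located model or tenant."""
    static = sum(weights for weights, _ in tenants)   # always resident
    dynamic = sum(peak_kv for _, peak_kv in tenants)  # worst case if every tenant spikes
    return static + dynamic + headroom_gb <= gpu_gb


small_model = (16.0, 6.0)  # hypothetical 16 GB of weights, 6 GB peak KV quota
print(fits_on_gpu(80, [small_model] * 3))  # True:  48 + 18 + 4 = 70 GB
print(fits_on_gpu(80, [small_model] * 4))  # False: 64 + 24 + 4 = 92 GB
```

Sizing against each tenant's peak KV quota rather than its average is what keeps one tenant's spike from evicting its neighbors.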

Autoscaling on the wrong signal thrashes. Scaling on CPU or even raw GPU utilization misleads, because a memory-bound decode phase can look “busy” while underloaded. Scale on queue depth, KV-cache occupancy, and time-to-first-token, not just GPU percent. The same right-sizing discipline that governs Kubernetes GPU cost optimization applies here. The further trap is cold-start latency: a new GPU replica must pull and load tens of gigabytes of weights before it serves a single token, often a minute or more, so reactive scaling always lags the spike that triggered it. Mitigate with a small warm pool, predictive scaling on leading indicators (queue growth rate, not just queue depth), and scale-down hysteresis so a brief traffic dip does not evict a replica you will need again ninety seconds later.
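A scaling decision built on those signals, with scale-down hysteresis, can be sketched as below. Every threshold is a hypothetical starting point, and `calm_ticks` is assumed to be maintained by the caller as the number of consecutive idle evaluation intervals.

```python
from dataclasses import dataclass


@dataclass
class Signals:
    queue_depth: int           # requests waiting for admission
    queue_growth_per_s: float  # leading indicator: is the queue getting longer?
    kv_occupancy: float        # fraction of KV-cache blocks in use, 0..1
    ttft_p95_s: float          # 95th-percentile time-to-first-token


def scale_decision(current: int, s: Signals, calm_ticks: int, *,
                   min_replicas: int = 1, max_replicas: int = 8,
                   calm_ticks_required: int = 6) -> int:
    """Return the desired replica count for the next interval."""
    overloaded = (
        (s.queue_growth_per_s > 0 and s.queue_depth > 10)
        or s.kv_occupancy > 0.90
        or s.ttft_p95_s > 2.0
    )
    if overloaded:
        return min(current + 1, max_replicas)
    idle = s.queue_depth == 0 and s.kv_occupancy < 0.40
    # Hysteresis: shrink only after a sustained calm period, not on a brief dip.
    if idle and calm_ticks >= calm_ticks_required:
        return max(current - 1, min_replicas)
    return current


busy = Signals(queue_depth=50, queue_growth_per_s=3.0, kv_occupancy=0.5, ttft_p95_s=0.8)
quiet = Signals(queue_depth=0, queue_growth_per_s=0.0, kv_occupancy=0.2, ttft_p95_s=0.3)
print(scale_decision(2, busy, calm_ticks=0))   # 3
print(scale_decision(2, quiet, calm_ticks=2))  # 2 (dip too brief to scale down)
print(scale_decision(2, quiet, calm_ticks=6))  # 1
```

Note that raw GPU utilization does not appear among the inputs; the decision runs entirely on queue, cache, and latency signals.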

Practical Recommendations

Start by measuring, not optimizing. Instrument tokens in and out, GPU-hours, and effective utilization, then compute cost per million tokens as your north-star metric. You cannot manage what you do not attribute, so tag spend by model, tenant, and route.

Then apply the levers in impact order. Turn on continuous batching and confirm batches are actually full under load — half-empty batches are the most common silent waste. Enable PagedAttention and prefix caching, especially if you resend system prompts or retrieval context. Move to FP8 once you have validated quality on your own evaluations. Add a small-model-first router and a response cache to shed load from the expensive tier. Only then weigh INT4 and speculative decoding, each behind its own A/B test. Finally, layer spot and preemptible capacity for overflow while keeping a reserved floor, and scale on queue depth and cache occupancy rather than raw GPU percent.

Checklist:

  • [ ] Cost per million tokens tracked and attributed by model, tenant, and route.
  • [ ] Continuous batching on; batch occupancy verified > 70% under peak load.
  • [ ] PagedAttention and automatic prefix caching enabled.
  • [ ] FP8 validated on task evals before rollout; INT4 only behind quality A/B.
  • [ ] KV-cache quantization considered for memory-gated workloads before INT4 weights.
  • [ ] Small-model-first router plus semantic response cache in front of the flagship.
  • [ ] Speculative decoding gated to low-batch, latency-sensitive paths with acceptance-rate monitoring.
  • [ ] Spot and preemptible GPUs for overflow, reserved floor for baseline, graceful drain on preemption.
  • [ ] Autoscaling driven by queue depth, KV-cache occupancy, and time-to-first-token, with a warm pool for cold-start.

Figure 4: The GPU FinOps loop — measure cost per token, attribute it, apply the highest-impact lever, then autoscale and re-measure.

Figure 4 closes the loop. GPU FinOps is not a one-time tuning pass; traffic shape drifts, models change, and prices move, so the measure-attribute-optimize-autoscale cycle runs continuously. The teams that keep inference cheap are the ones that keep the loop turning.

Frequently Asked Questions

What is the single biggest lever for AI inference cost optimization?

Continuous batching, by a wide margin, for most workloads. A naive single-stream server leaves the GPU mostly idle because generation is memory-bound at low batch sizes. Filling the batch so one weight load serves many sequences can reclaim three to five times the throughput on the same hardware. Quantization and caching matter, but they shave a bill that batching already shrank. Confirm your batches are actually full under peak load before reaching for anything else.

Does quantization always reduce cost without hurting quality?

No. FP8 is usually safe on modern GPUs and roughly halves memory with negligible quality loss, but INT4 can degrade reasoning, math, and long-context accuracy in ways that simple spot checks miss. Among 4-bit methods, AWQ generally preserves quality better than GPTQ by protecting salient weights. For memory-gated workloads, FP8 KV-cache quantization often delivers the batch-size win at lower quality cost than INT4 weights. Always evaluate a quantized model on your own task before shipping, because a cheaper-per-token answer that is wrong triggers retries and escalations that cost far more.

When does speculative decoding actually help?

Speculative decoding helps most at low batch sizes and on output-heavy, latency-sensitive workloads like code generation, where it can deliver two-to-four times throughput. At high batch occupancy the GPU is already compute-saturated, so the draft model becomes overhead and can make you slower. It also depends on a high draft acceptance rate; a mismatched draft model wastes passes. Verification is exact, so there is no quality cost — only a throughput bet won or lost on acceptance rate. Gate it to the paths where it wins and monitor acceptance rate, not just peak tokens per second.

How do I calculate cost per million tokens?

Divide one million by the effective tokens served per second to get the seconds of GPU time per million tokens, divide by 3,600 to convert to GPU-hours, then multiply by the hourly rental rate. For example, 3,200 tokens per second is about 0.087 GPU-hours per million tokens; at an illustrative $3 per GPU-hour that is roughly $0.26. The effective tokens-per-second figure, which folds in batch occupancy and precision, is what separates a good stack from a wasteful one, so measure it under realistic load rather than from a single-request benchmark.

What is the difference between prefill and decode, and why does it matter for cost?

Prefill processes the whole prompt in one parallel, compute-bound pass; decode emits output tokens one at a time in a memory-bound loop. A prompt-heavy request spends its cost in efficient prefill, while a generation-heavy request spends its wall-clock time in expensive decode. Because the two phases have opposite cost profiles, your cost per token depends on the input-to-output ratio of your traffic, and techniques like chunked prefill and disaggregated serving exist specifically to keep one phase from starving the other.

Are spot and preemptible GPUs safe for production inference?

For batch, overflow, and fault-tolerant workloads, yes, and they are often 60–70% cheaper than on-demand. For latency-critical live traffic they are risky because the cloud can reclaim them in seconds, dropping in-flight conversations. The safe pattern is a reserved or on-demand floor for baseline traffic, spot capacity for the elastic top, graceful draining and request checkpointing on preemption, and an autoscaler that understands both pools.

Which serving engine should I use for cost optimization?

vLLM is the sensible default for self-hosting — it ships continuous batching, PagedAttention, and prefix caching on by default. TensorRT-LLM extracts maximum throughput from NVIDIA hardware with FP8 and kernel fusion when you can invest in the build. SGLang shines for agentic and structured-output workloads with aggressive prefix caching. The right choice depends on your hardware, latency targets, and how much engineering you can spend on the serving layer.

By Riju