vLLM Cost Economics: 2026 Deep Dive on $/Million Tokens
vLLM cost economics in 2026 boil down to four levers and one arithmetic identity. If you understand how GPU choice sets the hourly cost floor, how PagedAttention and prefix caching free up KV-cache memory, how continuous batching, chunked prefill, and speculative decoding keep the GPU pipeline full and multiply tokens per second, and how concurrency planning trades p99 latency for throughput, you can move the $/Mtok number by an order of magnitude for the same model. Everything else (Helm charts, autoscalers, dashboards) is plumbing around those four levers.
The methodology in this rewrite is deliberately boring: pick a GPU, measure sustained output tokens per second on a realistic workload at your target latency SLO, divide instance $/hour by tokens/hour, multiply by one million. That is the only honest way to quote a $/Mtok number for vLLM. Anything else (vendor marketing slides, single-request benchmarks, prefill-only TPS) is theatre. This guide walks each lever, shows the worked maths on Llama 3.3 70B for H100 and B200 with clearly labelled illustrative pricing, and ends with a decision tree for when vLLM is not the right answer.
Voice and units note: prices are illustrative US dollars from publicly listed on-demand sources at the time of writing in June 2026; treat all numbers as starting points for your own measurement, not as quotes. Spellings follow Indian/British English conventions.
Why a 2026 cost rewrite
The original version of this post was written in 2024 against vLLM 0.4.x and an inference market that was H100-or-bust. Two years on, every assumption in that draft has shifted.
First, hardware: H200 (141 GB HBM3e) is the new mid-range workhorse for 70B-class models, B200 (192 GB HBM3e, fifth-generation NVLink) and B100 are shipping in volume on Lambda Cloud, CoreWeave, and AWS p5e/p6 families, and AMD MI300X (192 GB HBM3) has matured into a credible vLLM target on ROCm. L4 and L40S — Ada-generation cards with bottom-of-stack pricing — are now the cost-optimal choice for 7B–13B models served at moderate QPS, something almost nobody talked about in 2024.
Second, vLLM itself: chunked prefill is now default-on for long-context workloads, prefix caching (automatic and explicit) is production-stable, speculative decoding with draft models and Medusa heads ships in core, FP8 KV cache and AWQ/GPTQ INT4 weights are first-class, and the V1 engine refactor (merged through 2025) shaved measurable scheduler overhead. The OpenAI-compatible API server has multi-LoRA support, structured outputs, and tool calling baked in.
Third, the market: tokens are a commodity. Hosted providers like Together, Fireworks, Anyscale, DeepInfra, Groq, and Cerebras quote $/Mtok publicly, which means anyone serving a model in-house is implicitly benchmarking against those prices. If your in-house Llama 3.3 70B costs $1.50/Mtok blended and Fireworks lists $0.90, your finance team will notice. The four levers in this post are how you close that gap.
Fourth, the workload shape: agents, RAG, and tool-calling pipelines mean prompts are now dominated by long retrieved context and shared system messages. That changes everything about KV-cache economics — prefix caching can drop the effective input cost by 5–10× when sessions reuse system prompts, which is most agent workloads.
The four cost levers
Before diving into each lever, here is the request lifecycle you are optimising. Every request passes through four stages (tokenise, prefill, decode, detokenise), and the scheduler interleaves those stages across many in-flight sequences. PagedAttention is what lets the scheduler keep many sequences resident without fragmenting HBM.

The four levers, in order of dollar impact:
- GPU choice. A 2–3× swing in $/Mtok on the same model just by picking the right card for the model size and workload shape. Getting this wrong dwarfs everything else.
- KV-cache efficiency. PagedAttention, prefix sharing, and FP8/INT8 KV quantisation can double or triple effective batch size, which directly multiplies tokens/sec on a fixed instance.
- Batching and scheduling. Continuous batching, chunked prefill, and speculative decoding are the difference between 60% and 95% GPU utilisation, and between 1× and 2–3× output tokens/sec.
- Concurrency planning. Picking the right max-concurrent-requests is how you trade tail latency for throughput. Get it wrong in either direction and you either burn money on idle GPUs or ship a p99 that breaks your SLO.
Subsequent sections walk each lever with the maths and the gotchas.
Lever 1 — GPU economics (A100, H100, H200, L4, L40S, B200, MI300X)
The headline question is which card minimises $/Mtok for your model and SLO. The honest answer is: it depends on model size, context length, batch size, and whether you are throughput-bound or latency-bound. Below is a working framework with illustrative on-demand pricing from cloud providers and neoclouds as of mid-2026. Reserved or committed pricing typically drops 30–60%, and spot can be 50–80% lower.
A100 80GB (Ampere, 2020). Still widely available. Indicative on-demand $1.50–$2.20/hr on RunPod and Lambda for single-GPU instances; AWS p4d.24xlarge (8× A100 40GB) and p4de.24xlarge (8× A100 80GB) sit at $32–$40/hr list. Best for 7B–13B FP16 or 70B AWQ/GPTQ INT4. No FP8, no Transformer Engine, no fourth-gen NVLink — the floor for the cards you can usefully run vLLM on in 2026, but the price/perf has been eclipsed for 70B+ workloads.
H100 80GB (Hopper, 2022). The default heavy-lifting card. Indicative on-demand $2.49–$3.50/hr for SXM on neoclouds (Lambda, RunPod, CoreWeave); AWS p5.48xlarge (8× H100 80GB SXM) lists around $98/hr on-demand. Supports FP8 via Transformer Engine, which roughly doubles throughput vs FP16 on dense models. Sweet spot for 70B at FP8 with single-node tensor parallel TP=2 or TP=4.
H200 141GB (Hopper refresh, 2024). Same compute as H100 but roughly 1.4× the HBM bandwidth (4.8 TB/s vs 3.35 TB/s) and 1.76× the memory. Indicative $3.49–$4.99/hr on neoclouds. The bigger memory matters: 70B FP16 weights (~140 GB) only just fit in a single H200 and leave almost nothing for KV, but at FP8 (~70 GB) you can park very long contexts (32k–128k) without TP. Often the most cost-effective per-token choice for 70B at long context.
L4 (Ada, 24 GB). Tiny, cheap (indicative $0.43–$0.80/hr), no NVLink. Sweet spot for 7B–8B models at FP16, or 13B at AWQ INT4, serving moderate QPS. Surprisingly competitive on $/Mtok for small models when batch size is the binding constraint, not memory.
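L40S (Ada, 48 GB). PCIe card, indicative $0.99–$1.69/hr. Good for 13B FP16, 34B at FP8 or INT4, or 70B AWQ INT4 on a single card (a tight fit with little KV headroom). Ada Tensor Cores do support FP8, but GDDR6 bandwidth (864 GB/s) and the lack of NVLink mean it falls behind H100 on dense throughput per dollar at 70B.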
B200 192GB (Blackwell, 2024–25). Indicative $4.99–$6.50/hr on early neocloud SKUs; AWS p6-b200 instances listed at premium. FP4 and FP8 via second-gen Transformer Engine, fifth-gen NVLink, 8 TB/s HBM. For 70B at FP8 or 405B/671B at FP4 with TP=4–8, B200 typically delivers 1.8–2.5× tokens/sec vs H100 at 1.6–2× the price, which is a meaningful $/Mtok win — provided you have the load to keep it saturated.
AMD MI300X 192GB. ROCm-native vLLM is now a stable target (vLLM 0.6+ on ROCm 6.2+). Indicative $2.49–$3.99/hr on TensorWave and RunPod. The 192 GB HBM is the headline — a single MI300X holds 70B at FP16 with room for huge KV cache. Per-token cost can be excellent for memory-bound long-context workloads, but kernel maturity still trails CUDA for the bleeding-edge features (FP8, speculative decoding regressions, occasional GEMM perf cliffs). Worth measuring, especially if H100 supply at your provider is tight.
Rule of thumb that holds in 2026. Pick the smallest card that fits your model with healthy KV headroom (30–50% of HBM left for KV after weights). Going bigger than necessary buys you nothing if you cannot saturate it, and going smaller forces tensor parallelism, which adds NVLink or PCIe communication overhead and complicates scheduling.
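The headroom rule can be checked with a few lines of arithmetic. The sketch below is a rough sizing aid, not a vLLM API: `check_fit` and its parameters are hypothetical names, and it assumes weights take parameters × bytes-per-parameter while ignoring activations, CUDA graphs, and quantisation scales, so real headroom will be somewhat lower.

```python
from dataclasses import dataclass

# Bytes per parameter for the weight formats discussed in this post.
BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "int4": 0.5}


@dataclass
class Fit:
    weights_gb: float
    kv_headroom_gb: float
    headroom_fraction: float
    ok: bool


def check_fit(params_billion: float, weight_format: str, hbm_gb: float,
              num_gpus: int = 1, min_headroom: float = 0.30) -> Fit:
    """Estimate whether weights leave at least `min_headroom` of HBM for KV."""
    weights_gb = params_billion * BYTES_PER_PARAM[weight_format]
    total_hbm = hbm_gb * num_gpus
    headroom = total_hbm - weights_gb
    fraction = headroom / total_hbm
    return Fit(weights_gb, headroom, fraction, fraction >= min_headroom)


if __name__ == "__main__":
    # 70B at FP8 on the cards from this section.
    for name, hbm, gpus in [("1x H100", 80, 1), ("2x H100", 80, 2), ("1x H200", 141, 1)]:
        fit = check_fit(70, "fp8", hbm, gpus)
        print(f"{name}: weights {fit.weights_gb:.0f} GB, "
              f"headroom {fit.headroom_fraction:.0%}, ok={fit.ok}")
```

On these assumptions a single H100 fails the 30% test for 70B at FP8, while a pair of H100s or one H200 passes, which matches the placements suggested in the card-by-card notes.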
Lever 2 — KV cache, PagedAttention, and prefix sharing
PagedAttention is the single most important idea in vLLM. The pre-PagedAttention world allocated KV-cache contiguously per sequence at maximum context length, which meant most of the allocated KV memory was wasted (the vLLM authors measured roughly 60–80%), batch size was tiny, and tokens/sec on a single GPU was a fraction of what the math units could deliver.
PagedAttention borrows operating-system paging: KV-cache is split into fixed-size blocks (typically 16 tokens per block), each sequence holds a logical-to-physical block table, and blocks are allocated on demand from a shared pool. The structural win is double — fragmentation drops to near-zero, and the block table lets multiple sequences share blocks for identical prefixes.
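To make the paging idea concrete, here is a toy allocator. It is a minimal sketch of on-demand block allocation only, not vLLM's block manager: the class name `PagedKVAllocator` and its methods are invented for illustration, and it omits prefix sharing, copy-on-write, and preemption.

```python
class PagedKVAllocator:
    """Toy block allocator: fixed-size blocks handed out on demand."""

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables: dict[str, list[int]] = {}  # seq id -> physical block ids
        self.seq_lens: dict[str, int] = {}

    def append_tokens(self, seq_id: str, num_tokens: int) -> None:
        """Grow a sequence, allocating new blocks only when needed."""
        table = self.block_tables.get(seq_id, [])
        new_len = self.seq_lens.get(seq_id, 0) + num_tokens
        blocks_needed = -(-new_len // self.block_size)  # ceiling division
        extra = blocks_needed - len(table)
        if extra > len(self.free_blocks):
            raise MemoryError("KV pool exhausted; a real engine would preempt")
        table.extend(self.free_blocks.pop() for _ in range(extra))
        self.block_tables[seq_id] = table
        self.seq_lens[seq_id] = new_len

    def free(self, seq_id: str) -> None:
        """Return a finished sequence's blocks to the shared pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)

    def wasted_slots(self) -> int:
        """Allocated-but-unused token slots (under one block per sequence)."""
        return sum(len(table) * self.block_size - self.seq_lens[seq]
                   for seq, table in self.block_tables.items())
```

The point of the sketch is the bound in `wasted_slots`: each sequence wastes less than one block, instead of reserving a full maximum-context slab up front.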

Effective batch size goes up. With 70B FP16 on a single H200 (141 GB), weights take ~140 GB, leaving very little for KV. Drop weights to FP8 (~70 GB) and you have 60–70 GB for KV. With PagedAttention’s near-zero fragmentation and FP8 KV quantisation, you can serve dozens of long-context sequences concurrently where the pre-PagedAttention world would manage a handful.
FP8 and INT8 KV cache. vLLM supports FP8 KV cache (E4M3 on Hopper/Blackwell) and INT8 KV with calibration. FP8 KV is a ~2× memory win over FP16 KV with usually undetectable accuracy degradation on standard benchmarks; INT8 with proper scales is similar, although some models show small regressions on math/code tasks. The cost lever is clear: halving the bytes per cached token gives roughly 1.5–2× usable batch size, which means 1.3–1.8× sustained tokens/sec on memory-bound workloads. Always benchmark accuracy on your eval set before flipping the switch.
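The 2× claim follows directly from the per-token KV footprint. The sketch below works it through under one assumption that is not stated in this post: a Llama-3-70B-style shape of 80 layers, 8 KV heads (grouped-query attention), and head dimension 128. The function names are illustrative, and block rounding is ignored.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       bytes_per_elem: float) -> float:
    """Bytes of KV cache one token occupies: a K and a V vector in every layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


def max_cached_tokens(kv_budget_gb: float, bytes_per_token: float) -> int:
    """How many tokens fit in the KV pool, ignoring block rounding."""
    return int(kv_budget_gb * 1e9 // bytes_per_token)


# Assumed model shape (Llama-3-70B-style); check your model's config.
LAYERS, KV_HEADS, HEAD_DIM = 80, 8, 128

if __name__ == "__main__":
    for label, nbytes in [("fp16", 2), ("fp8", 1)]:
        per_tok = kv_bytes_per_token(LAYERS, KV_HEADS, HEAD_DIM, nbytes)
        tokens = max_cached_tokens(60, per_tok)  # 60 GB KV budget, as in the H200 example
        print(f"{label}: {per_tok / 1024:.0f} KiB/token, ~{tokens:,} tokens in 60 GB")
```

Under that assumed shape, FP16 KV costs 320 KiB per token and FP8 costs 160 KiB, so the same 60 GB pool holds twice as many resident tokens.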
Prefix caching — the single biggest agent-workload win. Automatic prefix caching (enabled with --enable-prefix-caching, default-on for most engine v1 paths in late-2025 builds) hashes prefix blocks and reuses them across requests. For an agent or RAG workload where every request starts with a 2,000-token system prompt and a 4,000-token retrieved context that does not change per session turn, prefix caching turns the prefill of those 6,000 shared tokens into a near-free cache lookup after the first request.
Real-world impact. A representative agent pipeline with a 3,000-token system prompt and 50-turn sessions can see prefill compute drop by 80–95% on subsequent turns with prefix caching enabled. That is the single largest cost lever in the entire post for agent workloads, and it is mostly free — you turn it on, you measure, you ship.
Caveats. Prefix-cache blocks consume the same KV memory budget as live sequences, so very heavy KV pressure can evict caches mid-session. Set --num-gpu-blocks-override carefully and monitor the KV-cache usage gauge (vllm:gpu_cache_usage_perc, or vllm:kv_cache_usage_perc in newer releases). The hash key is the literal token-id prefix, so any change in tokenisation (model upgrade, tokeniser bump) invalidates the entire cache.
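The invalidation caveat is easier to see with a toy version of block hashing. This is a sketch of the general idea (each full block is hashed together with everything before it), not vLLM's actual hashing code; `block_hashes` and `cached_prefix_tokens` are hypothetical names and the cache is just a Python set.

```python
import hashlib

BLOCK_SIZE = 16  # tokens per KV block, the typical size quoted earlier


def block_hashes(token_ids: list[int], block_size: int = BLOCK_SIZE) -> list[str]:
    """Hash each full block chained with its parent, so a hash covers the whole prefix."""
    hashes, parent = [], ""
    full = len(token_ids) - len(token_ids) % block_size
    for start in range(0, full, block_size):
        block = token_ids[start:start + block_size]
        payload = parent + ":" + ",".join(map(str, block))
        parent = hashlib.sha256(payload.encode()).hexdigest()
        hashes.append(parent)
    return hashes


def cached_prefix_tokens(token_ids: list[int], cache: set[str]) -> int:
    """Count leading tokens already cached, then record this request's blocks."""
    hashes = block_hashes(token_ids)
    hits = 0
    for h in hashes:
        if h not in cache:
            break
        hits += 1
    cache.update(hashes)
    return hits * BLOCK_SIZE


if __name__ == "__main__":
    cache: set[str] = set()
    system_prompt = list(range(64))  # stand-in token ids for a shared system prompt
    print(cached_prefix_tokens(system_prompt + [1000, 1001], cache))  # cold request
    print(cached_prefix_tokens(system_prompt + [2000, 2001], cache))  # warm request
    print(cached_prefix_tokens([999] + system_prompt[1:], cache))     # first token changed
```

Because every hash depends on its parent, changing a single early token changes every later block hash, which is why a tokeniser bump empties the cache in practice.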
Lever 3 — Continuous batching, chunked prefill, and speculative decoding
The 2022-era inference-server pattern was static batching: collect N requests, run them lockstep until all finish, then take the next batch. The tail-latency cost was crippling, because every request waited for the slowest in the batch. Continuous batching, popularised by Orca and made production-grade in vLLM, runs the scheduler every decode step. A request that finishes frees its slot immediately; a new arrival can join in the next step rather than waiting for the batch to drain.
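A small simulation shows why the scheduling change matters. It is a deliberately simplified model with stated assumptions: all requests arrive at once, prefill is ignored, every in-flight request emits one token per step, and the function names are made up for this sketch.

```python
def static_batch_steps(output_lens: list[int], batch_size: int) -> int:
    """Decode steps when each batch runs until its longest request finishes."""
    return sum(max(output_lens[i:i + batch_size])
               for i in range(0, len(output_lens), batch_size))


def continuous_batch_steps(output_lens: list[int], batch_size: int) -> int:
    """Decode steps when a finished request's slot is refilled on the next step."""
    waiting = list(output_lens)
    running: list[int] = []  # tokens still to generate per in-flight request
    steps = 0
    while waiting or running:
        while waiting and len(running) < batch_size:
            running.append(waiting.pop(0))
        steps += 1
        # Requests with one token left finish on this step and free their slot.
        running = [r - 1 for r in running if r > 1]
    return steps


if __name__ == "__main__":
    lens = [100, 10, 10, 10, 100, 10, 10, 10]  # two long answers among short ones
    print("static:    ", static_batch_steps(lens, batch_size=4))
    print("continuous:", continuous_batch_steps(lens, batch_size=4))
```

With this mix the static scheduler needs 200 steps and the continuous one 110, because short requests stop waiting behind the long one in their batch.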

Chunked prefill. A long-context request with a 32k-token prompt would, under naive scheduling, monopolise the GPU for a full prefill pass and stall all in-flight decodes. Chunked prefill breaks the prefill into fixed-size chunks (commonly 512 or 2048 tokens) and interleaves them with ongoing decodes in the same forward pass. Effect: prefill no longer head-of-line-blocks decode, p99 inter-token latency stabilises, throughput rises 10–25% on mixed long/short workloads. Flag: --enable-chunked-prefill (default-on in recent versions).
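The interleaving can be sketched as a per-step token budget. This is a simplified illustration rather than vLLM's scheduler: it assumes decodes are scheduled first at one token each, that the leftover budget goes to the next chunk of a single long prompt, and that the budget plays the role of max_num_batched_tokens. `schedule_steps` is a hypothetical helper.

```python
def schedule_steps(prompt_len: int, num_decoding: int,
                   token_budget: int) -> list[tuple[int, int]]:
    """Return (decode_tokens, prefill_tokens) per step until the prompt is prefilled."""
    if token_budget <= num_decoding:
        raise ValueError("budget must leave room for prefill")
    steps, remaining = [], prompt_len
    while remaining > 0:
        # Decodes take their share first; the rest of the budget is one prefill chunk.
        chunk = min(remaining, token_budget - num_decoding)
        steps.append((num_decoding, chunk))
        remaining -= chunk
    return steps


if __name__ == "__main__":
    plan = schedule_steps(prompt_len=32_768, num_decoding=48, token_budget=2_048)
    print(len(plan), "steps; first", plan[0], "last", plan[-1])
```

In this example the 32k prompt is spread over 17 steps and all 48 in-flight sequences keep emitting a token on every one of them, instead of stalling for a single monolithic prefill.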
Speculative decoding. Draft a few tokens with a cheap small model (or Medusa heads, EAGLE, or n-gram look-ups), verify them in parallel with the target model, accept the longest correct prefix. When the draft model is well-aligned to the target, acceptance rates of 60–80% deliver 1.5–2.5× tokens/sec on memory-bound decode. vLLM supports speculative decoding with draft-model, Medusa, EAGLE-2, and ngram speculators in the V1 engine. The catch: at very high batch sizes the GPU is already compute-saturated and the verification cost eats the gain — speculative decoding shines at low-to-moderate batch (1–32) and degrades or even hurts at very high batch (>64).
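The accept-the-longest-correct-prefix step looks like this for greedy decoding. It is a minimal sketch under the assumption that both models pick tokens greedily; sampling-based speculative decoding uses a rejection-sampling rule instead, and `verify_draft` is an illustrative name rather than a vLLM function.

```python
def verify_draft(draft: list[int], target: list[int]) -> list[int]:
    """Greedy verification: keep the longest agreeing prefix, then one target token.

    `target` holds the target model's own greedy choice at each draft position
    plus one extra position, so len(target) == len(draft) + 1.
    """
    if len(target) != len(draft) + 1:
        raise ValueError("target must have one more token than draft")
    accepted = []
    for d, t in zip(draft, target):
        if d != t:
            break
        accepted.append(d)
    # The target's token at the first mismatch (or the bonus position) is always kept.
    accepted.append(target[len(accepted)])
    return accepted


if __name__ == "__main__":
    print(verify_draft([5, 6, 7, 8], [5, 6, 9, 1, 2]))  # two drafts accepted, then a correction
    print(verify_draft([5, 6], [5, 6, 7]))              # all drafts accepted, plus a bonus token
```

Every verification pass yields at least one token, so the output matches plain greedy decoding; the speed-up comes only from how often the draft prefix is accepted.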
Quantised weights for throughput. AWQ and GPTQ INT4 (served through Marlin kernels), GGUF, and FP8 W8A8 reduce weight bandwidth and unlock memory for KV. On a 70B model, AWQ INT4 weights drop from ~140 GB at FP16 to ~40 GB, leaving roughly 100 GB of an H200's 141 GB for KV cache, which is the dominant cost lever for high-concurrency serving. Accuracy on standard benchmarks is typically within 1–2% of FP16 for well-calibrated AWQ/GPTQ; verify on your domain eval before committing.
Realistic stack for 70B today. FP8 weights + FP8 KV cache + chunked prefill + automatic prefix caching + speculative decoding with a 1B–7B draft model, on H100 or H200 with TP=2 or TP=4. Each lever stacks roughly multiplicatively (with diminishing returns at the high end), and a well-tuned stack delivers 3–5× the tokens/sec of vanilla FP16 vLLM 0.4 from 2024.
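As a starting point, most of that stack maps onto `vllm serve` flags. The helper below only assembles a command line; it is a sketch, and the concurrency values passed in the example (128 sequences, 8,192 batched tokens) are placeholders rather than recommendations. Flag availability and defaults vary by vLLM version, and speculative decoding is left out because its configuration syntax has changed between releases.

```python
import shlex


def vllm_serve_command(model: str, tensor_parallel: int, max_num_seqs: int,
                       max_num_batched_tokens: int,
                       gpu_memory_utilization: float = 0.9) -> list[str]:
    """Assemble a `vllm serve` command line for an FP8 + prefix-caching stack."""
    return [
        "vllm", "serve", model,
        "--tensor-parallel-size", str(tensor_parallel),
        "--quantization", "fp8",        # FP8 weights
        "--kv-cache-dtype", "fp8",      # FP8 KV cache
        "--enable-prefix-caching",
        "--enable-chunked-prefill",
        "--max-num-seqs", str(max_num_seqs),
        "--max-num-batched-tokens", str(max_num_batched_tokens),
        "--gpu-memory-utilization", str(gpu_memory_utilization),
    ]


if __name__ == "__main__":
    cmd = vllm_serve_command("meta-llama/Llama-3.3-70B-Instruct",
                             tensor_parallel=2, max_num_seqs=128,
                             max_num_batched_tokens=8192)
    print(shlex.join(cmd))
```

Keeping the command in code makes it easy to sweep max_num_seqs and max_num_batched_tokens from a benchmark script instead of editing shell strings by hand.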
Lever 4 — Concurrency, queueing, and tail latency
The hardest lever to get right is concurrency. vLLM exposes max_num_seqs (max in-flight sequences) and max_num_batched_tokens (max tokens scheduled per step). Push them too high and queue depth blows up, TTFT explodes, and the scheduler thrashes. Push them too low and you leave throughput on the table and pay for idle GPU.
The shape of the curve. Throughput rises with concurrency until it plateaus when the GPU is compute- or memory-bound. Latency stays flat until the queue starts backing up (around 60–80% of throughput-saturation), then rises sharply. The dollar-optimal operating point sits at the knee — typically 70–85% utilisation — where you have just enough headroom to absorb bursts without inflating tails.
Practical recipe. Run a sweep at your real input/output distribution: vary concurrency, plot tokens/sec and p50/p95/p99 inter-token latency and TTFT. Pick the concurrency that maximises tokens/sec subject to your SLO. Most teams discover their first-pass guess was off by 2–4×.
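Once the sweep is done, picking the operating point is a constrained maximum. The sketch below assumes you have already collected one row per concurrency level; `SweepPoint` and `pick_operating_point` are hypothetical names, and the numbers in the example are made up purely to exercise the function.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SweepPoint:
    concurrency: int
    output_tok_per_s: float
    p99_ttft_ms: float
    p99_itl_ms: float


def pick_operating_point(points: list[SweepPoint], max_ttft_ms: float,
                         max_itl_ms: float) -> Optional[SweepPoint]:
    """Highest-throughput sweep point that still meets both latency SLOs."""
    within_slo = [p for p in points
                  if p.p99_ttft_ms <= max_ttft_ms and p.p99_itl_ms <= max_itl_ms]
    return max(within_slo, key=lambda p: p.output_tok_per_s, default=None)


if __name__ == "__main__":
    # Made-up sweep results, for illustration only.
    sweep = [
        SweepPoint(16, 1500, 180, 25),
        SweepPoint(32, 2700, 240, 34),
        SweepPoint(64, 4100, 410, 52),
        SweepPoint(128, 4800, 590, 78),
        SweepPoint(256, 5000, 1900, 140),
    ]
    best = pick_operating_point(sweep, max_ttft_ms=600, max_itl_ms=80)
    print(best)
```

In the made-up data the highest-throughput row (concurrency 256) breaks both SLOs, so the function settles on 128; a `None` result means no tested concurrency meets the SLO and the hardware or model configuration has to change.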
Queueing reality. Even at the knee, requests will queue. Set sensible client timeouts, return 429 on overload (vLLM emits queue depth and you can shed at the gateway), and run a small idle reserve. A horizontal autoscaler on tokens/sec or queue depth is more useful than CPU- or GPU-util-based scaling for inference fleets.
Multi-tenant noise. If you share a vLLM instance across teams, one team’s 100k-token prompt can absorb the entire prefill budget and inflate everyone’s TTFT. Either shard per-tenant or use the scheduler’s priority and preemption hooks to bound any one request’s GPU share.
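One way to bound a tenant's share before requests ever reach vLLM is a token bucket measured in prompt tokens. This is a gateway-side sketch, not a vLLM feature: the class name, the rate, and the burst size are all hypothetical, and the caller passes the clock in explicitly so the behaviour is easy to test.

```python
class TenantTokenBucket:
    """Per-tenant budget of prompt tokens, refilled at a steady rate."""

    def __init__(self, rate_tokens_per_s: float, burst_tokens: float):
        self.rate = rate_tokens_per_s
        self.burst = burst_tokens
        self.level: dict[str, float] = {}
        self.last_seen: dict[str, float] = {}

    def admit(self, tenant: str, prompt_tokens: int, now: float) -> bool:
        """Return True and charge the tenant if the prompt fits its budget."""
        elapsed = now - self.last_seen.get(tenant, now)
        level = min(self.burst,
                    self.level.get(tenant, self.burst) + elapsed * self.rate)
        self.last_seen[tenant] = now
        if prompt_tokens > level:
            self.level[tenant] = level
            return False  # the gateway would answer with HTTP 429
        self.level[tenant] = level - prompt_tokens
        return True


if __name__ == "__main__":
    bucket = TenantTokenBucket(rate_tokens_per_s=1_000, burst_tokens=50_000)
    print(bucket.admit("team-a", 40_000, now=0.0))   # fits the initial burst
    print(bucket.admit("team-a", 100_000, now=0.0))  # one huge prompt is rejected
    print(bucket.admit("team-b", 40_000, now=0.0))   # other tenants are unaffected
```

Charging by prompt tokens rather than by request count is the design choice that matters here, since a single 100k-token prompt costs far more prefill than a hundred short ones.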
Pricing maths: from tokens/sec to $/Mtok
The arithmetic is one division. The honesty is in what you measure.

The identity:
$/Mtok = (GPU $/hr × hours) ÷ (output tokens produced in that time) × 1e6
Worked structure (illustrative). Two H100 SXMs at $2.99/hr each ($5.98/hr for the pair) serving Llama 3.3 70B FP8 with TP=2 at a sustained 4,800 output tok/sec, typical for a well-tuned vLLM stack on the right workload mix, produce 4,800 × 3,600 = 17.28 million output tokens per hour. $5.98 ÷ 17.28 = $0.346 per Mtok of output tokens. That is the output-only number.
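The identity and the worked numbers above reduce to a few lines of code. The function name and parameters below are illustrative; the inputs are the same illustrative H100 figures, not measured results.

```python
def dollars_per_mtok(gpu_hourly_usd: float, num_gpus: int,
                     output_tok_per_s: float) -> float:
    """Instance $/hour divided by output tokens/hour, scaled to one million tokens."""
    tokens_per_hour = output_tok_per_s * 3600
    return gpu_hourly_usd * num_gpus / tokens_per_hour * 1e6


if __name__ == "__main__":
    # 2x H100 at $2.99/hr each, 4,800 sustained output tok/s (illustrative).
    print(f"${dollars_per_mtok(2.99, 2, 4800):.3f} per Mtok of output")  # $0.346
```

Feed it the sustained throughput measured at your SLO, not a peak figure, or the result will understate the real cost.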
Input tokens are cheaper to produce because prefill is highly parallel and compute-bound on modern GPUs — a single forward pass over a 2,000-token prompt costs a fraction of a 2,000-token decode. The common rule of thumb is input $/Mtok lands at 20–40% of output $/Mtok depending on prefix caching hit rate and chunked-prefill efficiency.
Blended pricing. If your workload runs at 3:1 input:output (typical chat) and input is 30% of output cost, blended $/Mtok = (3 × 0.30 + 1 × 1.00) ÷ 4 × output_$/Mtok = ~0.475 × output_$/Mtok. For RAG and agents the ratio often shifts to 10:1 or 20:1 input-heavy, and prefix caching becomes the dominant determinant of effective cost.
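The blended formula generalises to any input:output ratio. A minimal sketch, with a hypothetical function name and the same illustrative 30% input-cost fraction used above:

```python
def blended_per_mtok(output_usd_per_mtok: float, input_ratio: float,
                     input_cost_fraction: float) -> float:
    """Weighted average $/Mtok for `input_ratio` input tokens per output token."""
    input_usd = output_usd_per_mtok * input_cost_fraction
    return (input_ratio * input_usd + output_usd_per_mtok) / (input_ratio + 1)


if __name__ == "__main__":
    for ratio in (3, 10, 20):
        factor = blended_per_mtok(1.0, ratio, 0.30)
        print(f"{ratio}:1 input:output -> {factor:.3f} x output $/Mtok")
```

At 3:1 the factor is 0.475, as in the text; at 10:1 and 20:1 it drops towards the input-cost fraction itself, which is why the prefix-cache hit rate dominates the bill for RAG and agent traffic.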
Overheads to add. Idle time between bursts (10–30% on under-saturated fleets), data egress (cheap but non-zero), control-plane and monitoring (gateway, Prometheus, logs), and a margin for headroom and failures. A safe overhead multiplier for a production fleet is 1.4–1.8× the raw measured $/Mtok.
Reserved and spot. Three-year reserved instances on AWS or GCP cut on-demand by 50–65%, neocloud commits often more. Spot instances can run 60–80% cheaper but bring preemption — only feasible for batch inference workloads or with rapid checkpoint-restart and request-replay.
Worked example: Llama 3.3 70B on H100 vs B200 (illustrative)
Two scenarios on the same model, same workload mix (1,500 input / 500 output tokens, moderate prefix-cache hit rate ~40%). All numbers are clearly illustrative and meant to show methodology, not to quote a price.
Scenario A: H100 80GB SXM, TP=2.
Instance: 2× H100 SXM at $2.99/hr each = $5.98/hr (illustrative neocloud on-demand).
Configuration: vLLM V1 engine, FP8 weights, FP8 KV cache, chunked prefill, automatic prefix caching enabled, no speculative decoding.
Sustained throughput at SLO (p99 TTFT ≤ 600 ms, p99 ITL ≤ 80 ms): ~4,800 output tok/s, ~14,400 input tok/s effective after prefix-cache hits.
Output $/Mtok = $5.98 / (4,800 × 3,600) × 1e6 = $0.35
Input $/Mtok (illustrative ratio of 0.28× output) = $0.10
Blended (3:1 input:output) = $0.16 per Mtok
Scenario B: B200 192GB, TP=1.
Instance: 1× B200 at $5.99/hr (illustrative neocloud on-demand).
Configuration: vLLM V1 engine, FP8 weights, FP8 KV cache, chunked prefill, prefix caching, speculative decoding with a 7B draft model.
Sustained throughput at same SLO: ~7,200 output tok/s, ~22,000 input tok/s effective.
Output $/Mtok = $5.99 / (7,200 × 3,600) × 1e6 = $0.23
Input $/Mtok = $0.065
Blended (3:1) = $0.106 per Mtok
Reading the numbers. B200 wins on blended $/Mtok in this illustrative workload by roughly a third, driven by single-card placement (no TP overhead), higher memory bandwidth, and Blackwell-generation FP8 throughput. The H100 pair is competitive and may win on availability, tooling maturity, and reserved pricing in many regions. The honest answer is: measure on your own workload before you commit, because input/output mix, context lengths, and prefix-cache hit rates swing these numbers by 2–3× either way.
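Both scenarios can be recomputed side by side from their inputs. The sketch below uses only the illustrative prices, throughputs, 0.28 input-cost ratio, and 3:1 mix stated in the two scenarios; the `Scenario` class is a hypothetical helper.

```python
from dataclasses import dataclass

INPUT_COST_FRACTION = 0.28  # illustrative input/output cost ratio from the scenarios
INPUT_PER_OUTPUT = 3        # 1,500 input tokens per 500 output tokens


@dataclass
class Scenario:
    name: str
    hourly_usd: float        # whole instance, all GPUs
    output_tok_per_s: float

    def output_per_mtok(self) -> float:
        return self.hourly_usd / (self.output_tok_per_s * 3600) * 1e6

    def blended_per_mtok(self) -> float:
        out = self.output_per_mtok()
        return (INPUT_PER_OUTPUT * out * INPUT_COST_FRACTION + out) / (INPUT_PER_OUTPUT + 1)


h100 = Scenario("2x H100 SXM", 5.98, 4800)
b200 = Scenario("1x B200", 5.99, 7200)

if __name__ == "__main__":
    for s in (h100, b200):
        print(f"{s.name}: output ${s.output_per_mtok():.3f}, "
              f"blended ${s.blended_per_mtok():.3f}")
    saving = 1 - b200.blended_per_mtok() / h100.blended_per_mtok()
    print(f"B200 blended saving: {saving:.0%}")
```

Swapping in your own hourly price and measured throughput is the whole exercise; the structure of the comparison does not change.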
Comparison to hosted. Public hosted prices for Llama 3.3 70B in mid-2026 sit around $0.40–$0.90 per Mtok blended for general-tier APIs and $0.20–$0.50 on dedicated capacity. In-house vLLM on B200 in this illustration lands below that band on raw GPU cost. Once the 1.4–1.8× overhead multiplier and the cost of staffing a GPU fleet are added, the gap narrows, and that is the point at which finance teams start asking why you are not using a hosted provider. The answer must include latency control, data residency, customisation, or volume that the hosted unit economics do not yet match.
For a head-to-head against other engines on the same workload, see our SGLang vs vLLM vs TensorRT-LLM benchmark 2026.
When vLLM is the wrong answer (SGLang, TensorRT-LLM, TGI, llama.cpp, Triton)
vLLM is the right answer for most general-purpose LLM serving in 2026, but not all of it.

SGLang. RadixAttention generalises prefix caching to arbitrary tree-structured reuse — multi-turn agents with branching tool calls, parallel sampling for self-consistency, and structured-output workloads see meaningful wins. If your workload is structured-prefix-heavy, benchmark SGLang. For uniform chat and completion traffic, the win is smaller and vLLM’s OpenAI-API maturity often dominates.
TensorRT-LLM. NVIDIA’s compiled-engine path. Lowest TTFT and highest sustained throughput on a single model with a fixed shape, especially on Hopper and Blackwell. The cost is operational: engine compilation per model and per shape, fewer features (LoRA adapters, dynamic quantisation) than vLLM, harder OSS workflow. Choose when latency is the binding constraint and you can amortise engine builds.
TGI (Hugging Face). The original continuous-batching server. Now feature-overlapping with vLLM and trailing on novel features. Reasonable if you are already in the HF stack; not a clear win otherwise.
llama.cpp. GGUF quantised inference, optimised for CPU and small-GPU deployment. Cheapest at very low QPS, on edge devices, or for single-user local apps. Not the right tool for fleet serving.
Triton Inference Server. Not a competitor — Triton is an orchestration layer that hosts vLLM or TensorRT-LLM as backends and adds model ensembling, multi-framework support, and Kubernetes integrations. Use Triton when you need to serve LLMs alongside vision, embedding, or classical ML pipelines from a single endpoint.
Hosted API providers. Together, Fireworks, Anyscale, DeepInfra, Bedrock, Vertex. The $/Mtok numbers are competitive, and at low volume the hidden costs of running your own fleet (on-call, MLOps headcount, GPU procurement) easily exceed the markup. Cross-over to in-house typically happens somewhere between $5k and $50k/month of API spend depending on team capacity.
For latency-critical tool-calling workflows where determinism matters more than $/Mtok, see LLM tool calling determinism patterns 2026.
Trade-offs, gotchas, and what goes wrong at scale
KV-cache pressure cascades. Heavy concurrency, long contexts, and prefix caching all compete for the same KV block pool. When pressure spikes, vLLM preempts sequences (swap to CPU or recompute), which adds latency cliffs. Watch the preemption counter (vllm:num_preemptions_total) and the KV-cache usage gauge; size --gpu-memory-utilization (default 0.9) carefully, because going too high replaces preemptions with OOMs.
Throughput vs TTFT trade-off. Maximum throughput often means a saturated scheduler with deep queues, which destroys TTFT. Pick your SLO first, then tune to maximise throughput subject to the SLO, not the other way round.
Quantisation accuracy regressions. AWQ and GPTQ INT4 weights typically lose 1–2% on aggregate benchmarks but can lose much more on niche domains (math, code, multilingual edge cases). FP8 KV cache is usually safe; INT8 KV cache needs domain-specific calibration. Always evaluate on your own task before shipping.
Speculative decoding regressions at high batch. The verification step is not free. At very high batch sizes, speculative decoding can reduce tokens/sec rather than increase it. Gate the feature by current batch size in your scheduler if you serve mixed traffic.
Multi-tenant isolation. vLLM has no native fair-share scheduling across tenants. A single large prompt can starve others. Mitigations: per-tenant instances, gateway-level rate limiting and admission control, or priority weights at the scheduler.
Prefix cache invalidation by tokeniser change. Model upgrades that ship a new tokeniser invalidate the prefix cache and you will see a cold-cache p99 spike. Plan a warming pass after deploys.
Driver and CUDA pinning. NCCL and the CUDA toolkit are tightly coupled to your driver version. A vLLM version bump may force a driver upgrade across the fleet, so coordinate carefully.
LoRA adapter overhead. Multi-LoRA serving (--enable-lora) is supported, but each active adapter consumes scheduler bookkeeping and a small per-request latency tax. At a handful of adapters the cost is negligible; at dozens it becomes measurable.
Long-context memory blow-up. A single 128k-token request consumes a lot of KV. One such request can absorb the prefill budget for seconds; chunked prefill helps but does not eliminate the problem. Consider per-tenant context limits at the gateway.
GPU procurement reality. H100/H200 supply remains intermittent in many regions. B200 supply is tight at launch pricing in 2026. Have a fallback SKU and validate vLLM behaviour on it ahead of time, not during an outage.
Practical recommendations
- Pick the smallest GPU that fits your model with healthy KV headroom. Oversizing is the most common and most expensive mistake in vLLM deployments.
- Turn on FP8 weights and FP8 KV cache on Hopper/Blackwell, after measuring accuracy on your eval suite. Reserve INT4 weights for memory-bound workloads.
- Enable chunked prefill and automatic prefix caching by default. Both are nearly free wins for any workload with shared prompts or long contexts.
- Measure your actual workload shape (input length distribution, output length distribution, prefix-cache hit rate, arrival pattern). Synthetic benchmarks lie.
- Tune max_num_seqs and max_num_batched_tokens against your SLO, not against vendor marketing. Plot the throughput-vs-latency curve and pick the knee.
- Speculative decoding for low-to-moderate batch, off for high batch. Gate it dynamically if your traffic varies.
- Reserved or committed pricing once measurements stabilise. Three-year reserves on AWS/GCP or annual neocloud commits cut on-demand by 50–65%.
- Monitor four metrics: sustained output tokens/sec, p99 TTFT, p99 inter-token latency, KV cache usage. Everything else is secondary.
- Build a $/Mtok dashboard that divides observed instance cost by observed output tokens per hour, daily. This is the only honest unit-economics view.
- Plan for failure modes: preemption cascades, tokeniser invalidation, driver pins, supply-chain blips. None are theoretical at scale.
- Benchmark against hosted providers quarterly. If your in-house $/Mtok is more than 1.5× hosted, the case for in-house had better be on something other than cost.
FAQ
Q: What is the realistic $/Mtok for Llama 3.3 70B on vLLM in 2026?
A: Illustrative on-demand single-tenant serving lands at $0.10–$0.50 blended on H100 or B200 with a well-tuned stack and moderate prefix-cache hit rate. Reserved or committed pricing pushes that to $0.05–$0.25. Measure on your own workload before quoting.
Q: Should I use vLLM V0 or V1 engine in 2026?
A: V1. It is now the default and stable, and has measurably lower scheduler overhead and better long-context handling than V0. Only stay on V0 for niche features that have not migrated.
Q: Is FP8 KV cache safe for production?
A: On Hopper and Blackwell, with FP8 E4M3 and proper scales, yes for most workloads. Always run your domain eval suite before flipping the switch. INT8 KV needs per-model calibration and should be evaluated explicitly on your own tasks before you enable it.
Q: How big a draft model should I use for speculative decoding?
A: For 70B targets, a 1B–8B same-family draft (e.g. Llama 3.2 1B or 3B for Llama 3.3 70B) typically gives 60–80% acceptance and 1.5–2× tokens/sec on low-to-moderate batch. Verify on your prompts — domain mismatch tanks acceptance rate fast.
Q: When does in-house vLLM beat a hosted provider on cost?
A: Roughly when steady-state usage exceeds ~$10k–$50k/month of equivalent hosted spend, you have the on-call and ops capacity to run a GPU fleet, and you can commit to reserved pricing. Below that, hosted almost always wins on total cost including human time.
Q: How do I size max_num_seqs and max_num_batched_tokens?
A: Run a concurrency sweep against your real input/output distribution. Plot tokens/sec and p99 TTFT and p99 ITL versus concurrency. Pick the highest concurrency that stays within your latency SLO. Re-tune after model changes, hardware swaps, or workload shifts.
Further reading
- vLLM project blog and documentation — https://blog.vllm.ai and https://docs.vllm.ai
- PagedAttention paper, “Efficient Memory Management for Large Language Model Serving with PagedAttention” — https://arxiv.org/abs/2309.06180
- Orca continuous batching paper, “Orca: A Distributed Serving System for Transformer-Based Generative Models” (OSDI 2022)
- SGLang and RadixAttention, “Efficiently Programming Large Language Models using SGLang” — https://arxiv.org/abs/2312.07104
- NVIDIA H100, H200, B200 product pages — https://www.nvidia.com/en-in/data-center/h100/ and B200 datasheet
- AWS EC2 P5/P5e/P6 instance pricing — https://aws.amazon.com/ec2/instance-types/p5/
- Lambda Cloud GPU pricing — https://lambdalabs.com/service/gpu-cloud
- RunPod GPU pricing — https://www.runpod.io/pricing
- CoreWeave pricing and instance catalogue — https://www.coreweave.com/pricing
- AMD MI300X and ROCm vLLM guide — https://rocm.docs.amd.com/projects/ai-developer-hub/en/latest/notebooks/inference/vllm_inference.html
- NVIDIA TensorRT-LLM — https://github.com/NVIDIA/TensorRT-LLM
- Hugging Face Text Generation Inference (TGI) — https://github.com/huggingface/text-generation-inference
- llama.cpp — https://github.com/ggml-org/llama.cpp
- NVIDIA Triton Inference Server — https://github.com/triton-inference-server/server
