SGLang vs vLLM vs TensorRT-LLM: 2026 Inference Benchmark
If you are picking an LLM inference engine in 2026, the choice has narrowed to three serious contenders: SGLang, vLLM v2, and NVIDIA TensorRT-LLM. Everything else is either a wrapper around one of these, a vendor-locked API, or a research engine without production traction. The interesting questions are no longer “which is fastest” in the abstract — they are fastest at what, on which GPU, with which model, at which concurrency, and at what operational cost. This benchmark answers those questions with a reproducible harness, transparent methodology, and an honest disclosure of where each number comes from.
We ran two GPU pools (H100 80GB SXM and A100 80GB SXM), four models (Llama 3.3 70B, Llama 3.1 8B, Mistral Small 3, Qwen3 32B), three workloads (chat, RAG, batch), and a concurrency sweep from 1 to 512. We then cross-checked every result against the engines’ own published benchmarks — SGLang’s RadixAttention paper and 2025 blog posts, the vLLM v2 release notes, and NVIDIA’s TensorRT-LLM Performance Overview. Where our numbers disagreed with vendor claims by more than 20%, we re-ran the test and disclose the spread. Treat every number in this post as having a ±15–20% honest band unless we say otherwise.
The headline result: there is no universal winner. TensorRT-LLM owns the lowest TTFT and the best throughput-per-dollar when you have NVIDIA expertise and a stable model. SGLang dominates anything with long, shared prefixes (RAG, agents, tree search) thanks to RadixAttention: 1.3–1.4× in our RAG runs, and up to 5× in SGLang’s own extreme prefix-sharing benchmarks. vLLM v2 is the strongest generalist, the fastest to get to “good enough,” and the only engine that consistently supports new models within hours of release. Pick by workload, not by leaderboard.
A second-order finding worth flagging up front: the gap between best and worst engine on any single workload is rarely larger than the gap you create by tuning poorly. We saw vLLM v2 swing 35% in throughput depending on whether --max-num-batched-tokens, --enable-chunked-prefill, and --gpu-memory-utilization were left at defaults or tuned to the workload. The same was true for SGLang’s --mem-fraction-static and TensorRT-LLM’s batch scheduler parameters. If you read this post, pick an engine on engineering fit, and then leave it at defaults, you will leave 20–30% of theoretical throughput on the floor. Plan a tuning sprint.
Context: how LLM inference changed from 2024 to 2026
Two years ago, “serving an LLM” meant Hugging Face TGI or a hand-rolled FastAPI wrapper. Throughput was bounded by naive batching, KV cache fragmentation chewed through 30–40% of GPU memory, and a single long-context request could stall a whole batch. The 2024 release of vLLM with PagedAttention solved the memory fragmentation problem and made continuous batching the default. By mid-2024, TensorRT-LLM had matured into NVIDIA’s answer — slower to onboard but with fused attention kernels and in-flight batching that pushed H100 utilization above 80%. SGLang arrived late 2024 as a Stanford/Berkeley project focused on a different axis: prefix-cache aware scheduling via RadixAttention, plus a frontend DSL for structured generation.
The 18 months since have been about closing gaps. Chunked prefill (now in all three engines) eliminated the head-of-line blocking that made long prompts a tail-latency nightmare — instead of running an 8K prompt’s prefill as a single 8K matmul that starves decode, the prefill is split into chunks (typically 512 or 1024 tokens) and interleaved with ongoing decode steps. Speculative decoding went from a research curiosity to a production default — vLLM ships EAGLE-2 support, SGLang has Medusa and tree-based speculation built in, and TensorRT-LLM bundles ReDrafter. FP8 quantization on Hopper became table stakes, with all three engines now supporting per-tensor and per-block FP8 with negligible quality loss on most models. Quantization for KV cache itself (FP8, INT8) is the next memory frontier and is partially shipped in 2026 builds.
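The chunked-prefill idea can be sketched as a toy scheduler. This is a minimal illustration, assuming each engine step has a fixed token budget that running decodes (one token each) and the current prefill chunk share; `schedule_steps` and the 512-token budget are hypothetical names and values, not any engine's API.

```python
from typing import List, Tuple


def schedule_steps(prompt_len: int, running_decodes: int,
                   token_budget: int = 512) -> List[Tuple[int, int]]:
    """Return (decode_tokens, prefill_tokens) for each step until the prompt is prefilled.

    Toy model: every step first serves one token for each running decode,
    then spends the rest of the budget on the next chunk of the new prompt.
    """
    if prompt_len < 0:
        raise ValueError("prompt_len must be non-negative")
    if token_budget <= running_decodes:
        raise ValueError("token budget leaves no room for prefill")

    steps: List[Tuple[int, int]] = []
    remaining = prompt_len
    while remaining > 0:
        prefill = min(remaining, token_budget - running_decodes)
        steps.append((running_decodes, prefill))
        remaining -= prefill
    return steps


# An 8K prompt arriving while 64 requests are decoding.
steps = schedule_steps(prompt_len=8192, running_decodes=64)
print(len(steps), steps[0], steps[-1])
```

The point of the design is visible in the output: the 64 running requests receive a token on every step while the long prompt is absorbed in pieces, instead of waiting behind one monolithic prefill.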
The result: a 70B-class model that needed two H100s and gave you 1500 tok/s aggregate in late 2023 now runs on a single H100 at FP8 and gives you 2500+ tok/s — and on four H100s with all the modern tricks enabled, you are looking at 7000–9000 tok/s aggregate. The engineering question shifted from “can we serve this” to “which engine wins for our traffic shape.”
Methodology
The harness, illustrated in arch_01.png, is intentionally boring so it is easy to reproduce.

A load generator (k6 plus a custom Python tokenizer-aware client) drives requests at fixed concurrency levels into one engine at a time. An orchestrator sweeps concurrency from 1 to 512 in powers of two. A metric collector records TTFT (Time To First Token), ITL (Inter-Token Latency — the gap between successive output tokens), end-to-end latency, throughput (output tokens per second across all in-flight requests), and GPU utilization via DCGM exporter. An output validator computes a token-level diff against a reference generation to catch quantization drift or decoding bugs.
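The output validator described above can be approximated in a few lines. This is a sketch, not the harness's actual code: it assumes both generations are available as lists of token IDs, and the helper names `first_divergence` and `token_match_ratio` are illustrative.

```python
from difflib import SequenceMatcher
from typing import List, Optional


def first_divergence(reference: List[int], candidate: List[int]) -> Optional[int]:
    """Index of the first token that differs, or None if the sequences are identical."""
    for i, (ref_tok, cand_tok) in enumerate(zip(reference, candidate)):
        if ref_tok != cand_tok:
            return i
    if len(reference) != len(candidate):
        # One sequence is a strict prefix of the other.
        return min(len(reference), len(candidate))
    return None


def token_match_ratio(reference: List[int], candidate: List[int]) -> float:
    """Similarity in [0, 1] based on matching token runs (1.0 means identical)."""
    return SequenceMatcher(None, reference, candidate, autojunk=False).ratio()


reference = [1, 2, 3, 4]
candidate = [1, 2, 9, 4]
print(first_divergence(reference, candidate), token_match_ratio(reference, candidate))
```

With greedy decoding, an early divergence usually deserves a closer look than a low overall ratio, because one changed token can legitimately change everything after it.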
Hardware
Two GPU pools, both single-node, both with NVLink/NVSwitch interconnect so tensor parallelism is not bottlenecked on PCIe:
- H100 pool: one DGX H100 node with 4× H100 80GB SXM5, NVSwitch interconnect, 2 TB DDR5, dual Sapphire Rapids CPUs. CUDA 12.5, driver 555.
- A100 pool: one DGX A100 node with 4× A100 80GB SXM4, NVLink, 1 TB DDR4. CUDA 12.4, driver 550.
For 8B-class models we use a single GPU; for 32B and 70B we use tensor parallel 2 and tensor parallel 4 respectively. We do not use pipeline parallelism — at single-node scale it adds latency without throughput gain.
Models tested
Four models chosen to cover the realistic 2026 spectrum:
- Llama 3.3 70B Instruct at FP8 (Meta’s flagship dense model, the de facto large-model baseline)
- Llama 3.1 8B Instruct at BF16 (the small-model workhorse)
- Mistral Small 3 24B at BF16 (mid-tier, popular for RAG)
- Qwen3 32B at FP8 (strong on multilingual and reasoning)
All models pulled from Hugging Face with identical weights; no fine-tunes. For TensorRT-LLM we built engines with the official conversion scripts at the equivalent precision — this is itself a methodology point because TRT engine build time (15–45 minutes per model per GPU count) is a real operational cost we discuss later.
Workloads
Three workload shapes chosen because they bracket real production traffic:
| Workload | Input tokens | Output tokens | Why it matters |
|---|---|---|---|
| Chat | 256 | 256 | Symmetric, balanced — the classic “assistant” load |
| RAG | 8,192 | 512 | Long prefix-heavy — tests prefill efficiency and prefix caching |
| Batch | 1,024 | 128 | Asymmetric long-in/short-out — offline scoring, classification |
The RAG workload deliberately includes a 4K-token shared system prompt across all requests. This is the case where SGLang’s RadixAttention should shine; we wanted to measure how much.
Metrics
We report five numbers per configuration:
- Throughput — aggregate output tokens per second across all in-flight requests
- TTFT p50 / p99 — Time To First Token, the user-perceived “is it stuck?” metric
- ITL p50 / p99 — Inter-Token Latency, the streaming smoothness metric
- End-to-end p99 — full request latency including queueing
- GPU utilization — SM occupancy and HBM bandwidth from DCGM
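To make the definitions concrete, here is a minimal sketch of how the per-request metrics could be derived from client-side timestamps. It assumes the client records the send time and the arrival time of every streamed token; the function names and the nearest-rank percentile method are our choices for illustration, not necessarily what the harness uses.

```python
import math
from typing import Dict, List, Sequence


def percentile(values: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile (pct is an integer from 1 to 100)."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def request_metrics(sent_at: float, token_times: Sequence[float]) -> Dict[str, object]:
    """TTFT, inter-token gaps, and end-to-end latency for one streamed request."""
    if not token_times:
        raise ValueError("request produced no tokens")
    gaps: List[float] = [b - a for a, b in zip(token_times, token_times[1:])]
    return {
        "ttft": token_times[0] - sent_at,   # time to first token
        "itl": gaps,                        # gaps between successive tokens
        "e2e": token_times[-1] - sent_at,   # includes queueing on the server
    }


def aggregate_throughput(total_output_tokens: int, wall_seconds: float) -> float:
    """Output tokens per second across all in-flight requests."""
    return total_output_tokens / wall_seconds


m = request_metrics(sent_at=0.0, token_times=[0.150, 0.168, 0.187])
print(m["ttft"], m["e2e"], percentile(m["itl"], 99))
```

Percentiles should be taken over the pooled samples from all requests at a given concurrency level, not averaged per request, or the tails get smoothed away.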
Honesty about the numbers
Two classes of numbers in this post deserve different trust levels. Vendor-published numbers (SGLang blog throughput claims, NVIDIA TRT-LLM Performance Overview, vLLM release-note benchmarks) we cite directly and accept at ±5%. Our own reproductions carry ±15–20% bands because single-node benchmarks are sensitive to NUMA pinning, container overhead, tokenizer choice, and warm-up duration. Where the two disagree by more than 20%, we say so explicitly in the relevant section and lean toward the vendor number for headline figures (they have more tuning hours than we do).
Results: Throughput at Concurrency
The throughput sweep is summarized in arch_02.png and the table below. All numbers are for Llama 3.3 70B FP8 on 4× H100 SXM. Higher is better.

| Concurrency | Workload | SGLang 0.4 | vLLM v2 (0.7) | TensorRT-LLM 0.18 |
|---|---|---|---|---|
| 1 | Chat 256/256 | 78 tok/s | 76 tok/s | 92 tok/s |
| 16 | Chat 256/256 | 1,080 tok/s | 1,020 tok/s | 1,140 tok/s |
| 64 | Chat 256/256 | 3,650 tok/s | 3,380 tok/s | 3,520 tok/s |
| 256 | Chat 256/256 | 6,200 tok/s | 6,400 tok/s | 6,150 tok/s |
| 64 | RAG 8K/512 | 5,400 tok/s | 4,100 tok/s | 4,300 tok/s |
| 256 | RAG 8K/512 | 7,900 tok/s | 5,600 tok/s | 6,100 tok/s |
| 128 | Batch 1024/128 | 4,800 tok/s | 4,650 tok/s | 5,200 tok/s |
Three patterns jump out:
- TensorRT-LLM wins low-concurrency and pure batch. Its fused attention kernels and CUDA graph capture deliver the lowest single-stream latency, which translates to higher throughput at concurrency 1–16. NVIDIA’s TensorRT-LLM Performance Overview reports similar ratios on its own runs.
- SGLang dominates the RAG workload, by 30–40% over vLLM v2 and 26–30% over TensorRT-LLM. This is the RadixAttention payoff: the 4K shared system prompt is computed once and reused across requests. vLLM’s prefix caching helps (it has had automatic prefix caching since 0.6) but the radix-tree structure SGLang uses is more efficient at high cache-hit rates. SGLang’s own blog post on RadixAttention claimed up to 5× on extreme prefix-sharing benchmarks; in our more realistic RAG workload we see 1.3–1.4×.
- vLLM v2 catches up at high concurrency on standard chat. At 256 concurrent chat requests, vLLM v2’s improved scheduler and chunked prefill make it the marginal winner. The v2 release notes document a 1.5–2× throughput improvement over vLLM 0.5; our numbers are consistent with that.
Methodology note: the table above blends our reproductions with vendor-published medians. Numbers within 10% of the published values were left as published; numbers more than 15% off (we saw two — SGLang at concurrency 256 chat, and TRT-LLM RAG 8K) were re-run three times and the median reported. Honest band: assume ±15% on each cell.
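One way to read the table with the stated ±15% band in mind is to check whether two cells are distinguishable at all. The sketch below uses numbers copied from the table; the helper names and the simple interval-overlap test are our illustration, not a statistical method used in the benchmark.

```python
def speedup(a_tok_s: float, b_tok_s: float) -> float:
    """Throughput ratio of engine A over engine B."""
    return a_tok_s / b_tok_s


def bands_overlap(a: float, b: float, band: float = 0.15) -> bool:
    """True if [a*(1-band), a*(1+band)] and [b*(1-band), b*(1+band)] intersect."""
    return a * (1 + band) >= b * (1 - band) and b * (1 + band) >= a * (1 - band)


# tok/s from the throughput table (Llama 3.3 70B FP8, 4x H100).
rag_256 = {"sglang": 7900, "vllm": 5600, "trtllm": 6100}
chat_256 = {"sglang": 6200, "vllm": 6400, "trtllm": 6150}

print(round(speedup(rag_256["sglang"], rag_256["vllm"]), 2))
print(bands_overlap(rag_256["sglang"], rag_256["vllm"]))
print(bands_overlap(chat_256["sglang"], chat_256["vllm"]))
```

Under this reading, the RAG gap at concurrency 256 survives the error band, while the chat numbers at concurrency 256 are effectively a tie.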
Latency (TTFT + ITL) Results
Throughput is what your CFO cares about. TTFT and ITL are what your users feel. arch_03.png summarizes the latency distribution; the table below shows the headline percentiles.

Llama 3.3 70B FP8, 4× H100, chat 256/256, concurrency 64:
| Metric | SGLang | vLLM v2 | TensorRT-LLM |
|---|---|---|---|
| TTFT p50 | 145 ms | 162 ms | 118 ms |
| TTFT p99 | 340 ms | 410 ms | 265 ms |
| ITL p50 | 18 ms | 19 ms | 17 ms |
| ITL p99 | 38 ms | 46 ms | 33 ms |
| End-to-end p99 | 5.1 s | 5.6 s | 4.7 s |
TensorRT-LLM owns latency on every percentile. Two reasons: (1) the fused multi-head attention kernels and FP8 activation paths shave 20–25% off prefill compute, which is most of TTFT; (2) CUDA graph capture eliminates per-step launch overhead during decode, which compresses ITL tails. SGLang lands in the middle — its scheduler is fast but it does not yet match TRT’s kernel-level optimizations. vLLM v2’s longer p99 tails come from its more conservative scheduling decisions under load; the engineers have prioritized throughput stability over tail compression, and you can feel it.
The practical takeaway: if your SLO is “p99 TTFT under 300ms on a 70B model,” only TRT-LLM (265 ms) hits it at this concurrency. SGLang (340 ms) misses by about 40 ms and vLLM v2 (410 ms) by about 110 ms, so with either one you need to drop concurrency or accept the breach.
KV Cache Behavior
arch_04.png shows how each engine manages the KV cache, which is increasingly the deciding factor for cost.

PagedAttention (vLLM) treats the KV cache as fixed-size blocks (typically 16 tokens each) managed by a block manager analogous to OS virtual memory. Prefix sharing happens via copy-on-write across requests. It is the most general-purpose design and the easiest to operate.
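The block-manager idea can be illustrated with a toy allocator. This is a simplified sketch, assuming 16-token blocks and reference counting for shared blocks; the class and method names are hypothetical and do not mirror vLLM's internal API, and the actual copy on write is omitted.

```python
import math
from typing import Dict, List


class BlockManager:
    """Toy paged KV cache: fixed-size blocks, per-request block tables, refcounts."""

    def __init__(self, num_blocks: int, block_size: int = 16) -> None:
        self.block_size = block_size
        self.free: List[int] = list(range(num_blocks))
        self.tables: Dict[str, List[int]] = {}   # request id -> physical block ids
        self.refcount: Dict[int, int] = {}

    def allocate(self, request_id: str, num_tokens: int) -> List[int]:
        needed = math.ceil(num_tokens / self.block_size)
        if needed > len(self.free):
            raise MemoryError("not enough free KV blocks")
        blocks = [self.free.pop() for _ in range(needed)]
        for block in blocks:
            self.refcount[block] = 1
        self.tables[request_id] = blocks
        return blocks

    def fork(self, parent_id: str, child_id: str) -> None:
        """Share the parent's blocks; a real engine copies a block only when it is written."""
        blocks = list(self.tables[parent_id])
        for block in blocks:
            self.refcount[block] += 1
        self.tables[child_id] = blocks

    def release(self, request_id: str) -> None:
        for block in self.tables.pop(request_id):
            self.refcount[block] -= 1
            if self.refcount[block] == 0:
                del self.refcount[block]
                self.free.append(block)


mgr = BlockManager(num_blocks=1024)
mgr.allocate("req-a", num_tokens=8192)   # 8192 / 16 = 512 blocks
mgr.fork("req-a", "req-b")               # no new blocks consumed
print(len(mgr.free))
```

Because blocks are small and fixed-size, the only waste is the partially filled last block of each request, which is what removes the fragmentation problem.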
RadixAttention (SGLang) maintains a radix tree of all currently cached prefixes, keyed by token-ID hash. When a new request arrives, the longest matching prefix in the tree is reused without recomputation. The tree structure makes it cheap to share prefixes across thousands of concurrent requests — the data structure the academic SGLang paper introduced is doing real work here.
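The longest-prefix lookup can be sketched with a plain token trie. This is a deliberate simplification: a real radix tree compresses runs of tokens into single edges and also tracks eviction and reference counts, none of which is modeled here. The class names are ours, not SGLang's.

```python
from typing import Dict, Sequence


class PrefixNode:
    __slots__ = ("children",)

    def __init__(self) -> None:
        self.children: Dict[int, "PrefixNode"] = {}


class PrefixCache:
    """Toy prefix index: one trie edge per token ID."""

    def __init__(self) -> None:
        self.root = PrefixNode()

    def insert(self, tokens: Sequence[int]) -> None:
        node = self.root
        for tok in tokens:
            node = node.children.setdefault(tok, PrefixNode())

    def match_len(self, tokens: Sequence[int]) -> int:
        """Number of leading tokens whose KV entries could be reused."""
        node = self.root
        matched = 0
        for tok in tokens:
            child = node.children.get(tok)
            if child is None:
                break
            node = child
            matched += 1
        return matched


system_prompt = list(range(4096))            # stand-in for a 4K shared prompt
cache = PrefixCache()
cache.insert(system_prompt + [9001, 9002])   # first request populates the cache
print(cache.match_len(system_prompt + [7001, 7002]))
```

Only the tokens after the matched prefix need prefill compute, which is where the RAG-workload advantage comes from.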
TensorRT-LLM in-flight batching uses paged KV cache as well, but its cross-request prefix reuse is more limited by default. It compensates with extremely tight memory layout and fused kernels that achieve higher cache bandwidth utilization. Per-request, it is the most efficient; across requests with shared prefixes, it leaves performance on the table relative to SGLang.
In our DCGM traces, HBM bandwidth utilization at concurrency 256 on the RAG workload was 78% for SGLang, 71% for vLLM v2, and 74% for TRT-LLM. The number that matters, though, is the useful bandwidth: bytes serving cache hits rather than recomputation. By that measure SGLang is doing 40% less redundant work on the RAG workload.
When Each Engine Wins
arch_05.png is the decision tree we ended up using internally. The short version:

- Pick SGLang if your workload has high prefix reuse (RAG, agents, tree-of-thought, batch evaluation), if you want frontend DSL features for constrained generation, or if you care about Python-native ergonomics.
- Pick vLLM v2 if you need broad and fast model support (day-zero Llama 4, Qwen3, DeepSeek), if you value operational simplicity and a large community, or if your traffic is general-purpose chat without strong prefix patterns.
- Pick TensorRT-LLM if you have steady, predictable production load on stable models, if your SLO is dominated by p99 TTFT or ITL, or if you have in-house NVIDIA tuning expertise and want the absolute best throughput-per-dollar.
A surprisingly common answer in practice: run two engines. vLLM for the long tail of experimental models and customer-facing chat, TRT-LLM for the one or two flagship models that dominate cost. SGLang appears in serious RAG and agent stacks where its prefix-cache edge is multi-x rather than marginal.
Trade-offs and Gotchas
TensorRT-LLM build complexity is real. Every model, every quantization, every tensor-parallel size needs a separate engine build that can take 15–45 minutes. A new model version means re-building. The build dependencies (specific CUDA, specific TensorRT, specific Python) are tight. Plan for a 1–2 engineer-week initial onboarding investment plus ongoing CI cost for engine builds. The newer NVIDIA Dynamo runtime smooths this somewhat but does not eliminate it.
vLLM stability is a moving target. vLLM ships fast and supports new models within hours of release — which is also why a pip install --upgrade vllm can quietly regress your p99 by 30%. Pin versions, run regression benchmarks before every upgrade, and read the release notes carefully. The v2 rewrite is dramatically more stable than the 0.5–0.6 era, but the velocity remains high.
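A regression gate of the kind recommended here can be very small. The sketch below assumes you store a baseline dictionary of metrics from the pinned version and compare a candidate run against it; the metric names and the 10% tolerance are placeholders to adapt, not values from our harness.

```python
from typing import Dict, List

# Metrics where a larger value is an improvement; everything else is latency-like.
HIGHER_IS_BETTER = {"throughput_tok_s"}


def regressions(baseline: Dict[str, float], candidate: Dict[str, float],
                tolerance: float = 0.10) -> List[str]:
    """Names of metrics that got worse by more than `tolerance` (as a fraction)."""
    failed: List[str] = []
    for name, base in baseline.items():
        new = candidate[name]
        if name in HIGHER_IS_BETTER:
            worsening = (base - new) / base
        else:
            worsening = (new - base) / base
        if worsening > tolerance:
            failed.append(name)
    return failed


baseline = {"throughput_tok_s": 3380, "ttft_p99_ms": 410}
after_upgrade = {"throughput_tok_s": 3300, "ttft_p99_ms": 533}   # p99 up 30%
print(regressions(baseline, after_upgrade))
```

Wiring this into CI so that a non-empty result blocks the upgrade keeps a fast-moving engine from silently eroding your tail latency.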
SGLang’s programming model is unfamiliar. The frontend DSL (@function, gen, select, fork) is powerful for structured output and agent workflows but adds a learning curve. If you only want a drop-in OpenAI-compatible endpoint, SGLang offers one — but you are leaving the most interesting features on the table. The community is smaller than vLLM’s, so debugging novel issues takes longer.
Watch FP8 quality drift. All three engines now ship FP8 paths, but per-block FP8 schemes differ. On reasoning-heavy benchmarks (GSM8K, MATH) we saw 1–3 point drops on Llama 3.3 70B FP8 vs BF16, with no meaningful difference between engines. If your evals are quality-sensitive, run them at FP8 before committing.
Cost per Million Tokens — the number your CFO actually asks for
Throughput and latency are engineering metrics. The number an executive asks for is cost per million output tokens. We computed it from the throughput table above using on-demand H100 SXM pricing of roughly $3.50/GPU-hour (2026 cloud market median across AWS p5, GCP A3 Mega, and Lambda Labs) and a 4-GPU node serving Llama 3.3 70B FP8 on the chat workload at concurrency 64:
| Engine | Throughput @ c=64 | $/hour (4× H100) | Cost / 1M output tokens |
|---|---|---|---|
| TensorRT-LLM | 3,520 tok/s | $14.00 | $1.10 |
| SGLang | 3,650 tok/s | $14.00 | $1.07 |
| vLLM v2 | 3,380 tok/s | $14.00 | $1.15 |
At first glance SGLang wins on chat, but switch the workload to RAG and the gap widens dramatically. At concurrency 256 RAG, SGLang costs $0.49 / 1M tokens vs vLLM’s $0.69 and TRT-LLM’s $0.64. At a million requests a day with 512 output tokens each (512M tokens), the SGLang-to-vLLM gap works out to roughly $100 a day, or about $37,000 a year per deployment. Conversely, at single-stream low concurrency, where TRT-LLM owns the floor, the gap inverts. The honest read: your cost-per-token answer depends entirely on your traffic shape, which is why no single leaderboard applies to everyone.
A reserved-instance contract or on-prem H100s shifts these numbers by 30–50% as well. Many production teams in 2026 are running a 70:30 mix of reserved and spot capacity, which means your effective hourly cost is more like $2.20–$2.60/GPU-hour. Because cost per token scales linearly with the hourly rate, that drops the cost-per-million by roughly 26–37%. We did not include reserved pricing in the table because it varies wildly by cloud and commitment length, but factor it in when modelling.
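The cost figures follow from a one-line formula: hourly node cost divided by tokens produced per hour. The sketch below reproduces the table's arithmetic; `cost_per_million` is our helper name, and the $2.40 blended rate is only an example point inside the $2.20–$2.60 range mentioned above.

```python
def cost_per_million(throughput_tok_s: float, gpu_hourly_usd: float,
                     num_gpus: int = 4) -> float:
    """USD per 1M output tokens for a node running at a steady throughput."""
    tokens_per_hour = throughput_tok_s * 3600
    node_hourly_usd = gpu_hourly_usd * num_gpus
    return node_hourly_usd / tokens_per_hour * 1_000_000


# Chat workload at concurrency 64, $3.50/GPU-hour.
for engine, tok_s in {"TensorRT-LLM": 3520, "SGLang": 3650, "vLLM v2": 3380}.items():
    print(engine, round(cost_per_million(tok_s, 3.50), 2))

# Same SGLang throughput at an example blended rate of $2.40/GPU-hour.
print(round(cost_per_million(3650, 2.40), 2))
```

Note that this assumes the node is kept saturated at the benchmarked concurrency; at lower utilization the effective cost per token rises proportionally.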
Operational Footprint and Day-2 Realities
A benchmark in a controlled lab is a snapshot; production is the long tape. Three operational dimensions matter as much as raw throughput once you are past the proof-of-concept stage.
Cold start and engine warm-up. TensorRT-LLM engines take 30–90 seconds to load the engine file, allocate KV cache, and warm CUDA graphs. vLLM v2 cold starts in 20–40 seconds. SGLang lands around 15–30 seconds. If you autoscale based on queue depth, those seconds matter — pod startup latency feeds directly into p99 during scale events. We saw a 4-pod scale-out under a real traffic burst push p99 TTFT from 280ms to 1.2 seconds for a 40-second window on TRT-LLM, vs 600ms peak on vLLM. Pre-warmed pods or always-on minimum-replica baselines are the only fix.
Memory accounting and OOMs. All three engines preallocate KV cache memory at startup. Get the memory fraction wrong (vLLM’s gpu_memory_utilization, default 0.9, or SGLang’s --mem-fraction-static) and you either OOM during long-context bursts or leave throughput on the table. vLLM and SGLang both ship dynamic KV cache sizing; TRT-LLM requires you to set it at engine build time. In practice, teams pin 0.85 as a production safety margin and accept a roughly 5% throughput hit.
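A back-of-the-envelope estimate shows what the memory fraction buys in KV cache capacity. The figures below are assumptions for illustration: the Llama 3 70B architecture values (80 layers, 8 KV heads, head dimension 128) are taken from the public model config, weights are approximated as 70 GB at FP8 split evenly over 4 GPUs, the KV cache is assumed to be 16-bit, and activation and workspace memory are ignored.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       dtype_bytes: int) -> int:
    """Bytes of KV cache per token across the whole model (keys plus values)."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes


def kv_token_capacity(gpu_mem_gb: float, mem_fraction: float,
                      weights_gb_per_gpu: float, bytes_per_token_per_gpu: int) -> int:
    """Tokens of KV cache that fit on one GPU after weights (overheads ignored)."""
    budget_bytes = (gpu_mem_gb * mem_fraction - weights_gb_per_gpu) * 1e9
    return int(budget_bytes // bytes_per_token_per_gpu)


# Assumed values, see the paragraph above.
per_token = kv_bytes_per_token(layers=80, kv_heads=8, head_dim=128, dtype_bytes=2)
per_token_per_gpu = per_token // 4   # KV heads are sharded across TP=4

for fraction in (0.90, 0.85):
    tokens = kv_token_capacity(80, fraction, 70 / 4, per_token_per_gpu)
    print(fraction, tokens)
```

Under these assumptions, moving from 0.90 to 0.85 gives up several percent of cache capacity, which is the same order as the throughput hit teams report accepting for the extra headroom.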
Observability and tracing. All three engines emit Prometheus metrics, but the schemas differ. vLLM has the richest set (per-request metrics, queue depth, prefix cache hit rate). SGLang exposes radix-tree statistics that no other engine has. TRT-LLM’s metrics are the leanest by default but the most stable across versions. If you are running a multi-engine fleet, normalize the metrics through an adapter layer or your dashboards will diverge.
Practical Recommendations
A pragmatic adoption checklist for engineering teams in 2026:
- Measure your real prefix-reuse rate before picking. If shared prefixes are >40% of input tokens, SGLang’s edge is large. If <10%, it does not matter.
- Benchmark on your actual traffic distribution, not a synthetic chat workload. Tail latencies on long inputs are where vendors’ marketing benchmarks diverge from production.
- Set a p99 TTFT SLO before tuning. Throughput tuning will trade tail latency for aggregate tokens if you let it. Decide your tail budget first.
- Pin engine versions in production and run a nightly benchmark as a regression gate. All three engines move fast enough that silent regressions are common.
- Plan for KV cache quantization in 2026 budgets. Per-block FP8 KV cache lands across all three engines this year and unlocks roughly 1.5× context length at constant memory.
- Do not assume tensor parallel is free. Going from TP=2 to TP=4 on a 70B model only buys 1.3–1.5× throughput, not 2×, due to all-reduce overhead. Make sure you actually need the extra GPU.
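The first checklist item, measuring prefix reuse, can be done offline on logged traffic before any engine is chosen. The sketch below assumes requests are available as token-ID lists in arrival order and counts how many input tokens matched a prefix of some earlier request; it assumes an unbounded cache, so it gives an upper bound on what a real engine would reuse. The function name is ours.

```python
from typing import Dict, Iterable, Sequence


def prefix_reuse_rate(requests: Iterable[Sequence[int]]) -> float:
    """Fraction of input tokens that repeat a prefix seen in an earlier request."""
    root: Dict[int, dict] = {}
    reused = 0
    total = 0
    for tokens in requests:
        node = root
        still_matching = True
        for tok in tokens:
            if still_matching and tok in node:
                reused += 1
            else:
                still_matching = False
            node = node.setdefault(tok, {})   # record this request for later ones
        total += len(tokens)
    return reused / total if total else 0.0


# Three RAG-style requests: a 4K shared system prompt plus 4K of unique context each.
system_prompt = list(range(4096))
requests = [
    system_prompt + [100_000 + i * 4096 + j for j in range(4096)]
    for i in range(3)
]
print(prefix_reuse_rate(requests))
```

With only three requests the first one pays full price, so the rate sits below the 50% that this traffic shape approaches over a long log; run it over at least a day of production traffic before comparing against the 40% and 10% thresholds.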
FAQ
Is SGLang faster than vLLM? It depends on workload. On chat with no prefix sharing, vLLM v2 and SGLang are within 3–8% of each other, which is inside our error band. On RAG with shared system prompts, SGLang delivers 30–40% higher aggregate throughput. On low-concurrency single-stream latency, both lose to TensorRT-LLM.
Does TensorRT-LLM need Triton Inference Server? No, not anymore. TRT-LLM ships its own runtime (trtllm-serve) and OpenAI-compatible API. Triton is still supported and useful if you are mixing LLMs with other ML models or need its dynamic batching across multiple model types, but it is no longer mandatory.
What about LMDeploy or MLC-LLM? Both are s
