vLLM vs TensorRT-LLM vs SGLang: 2026 Inference Benchmark (Updated)

NOTE TO PUBLISHER: This file contains ONLY the freshness-update content blocks. Do NOT replace the existing post body — splice each labeled BLOCK into the live post at the position called out in its header.


BLOCK A — Last Updated badge (insert at top under H1)

Last Updated: 2026-05-28 — refreshed for current vLLM 0.9.x, TensorRT-LLM 2026 release, SGLang RadixAttention v2, and Blackwell-class GPUs.


BLOCK B — “What changed in 2026” intro (insert after lede, before first H2)

What changed since the last revision

Three things shifted the vLLM vs TensorRT-LLM vs SGLang benchmark picture in 2026, and any comparison written before mid-2025 should be read with that in mind. vLLM 0.9.x landed its V1 engine refactor, made automatic prefix caching the default, and tightened scheduling so that multi-turn chat and RAG workloads pay far less for repeated prompts. TensorRT-LLM’s 2026 release replaced the older builder-and-engine flow with a PyTorch-native frontend and added an auto-tuner that picks kernels and parallelism strategies per model and per GPU, which sharply cuts the time to a first good configuration, the step that used to make TensorRT-LLM painful to operate.

SGLang shipped RadixAttention v2 plus a faster structured-output (JSON / regex / grammar) path, which is the workload where it now most clearly beats general-purpose engines. The hardware floor also moved: Blackwell-class GPUs (B100, B200, GB200 NVL72) replaced Hopper as the reference target, and FP4/FP8 tensor cores plus second-gen Transformer Engine change what “fast” means for both throughput and latency. Numbers from H100-era posts no longer transfer one-to-one to a 2026 deployment.


BLOCK C — Refreshed performance table (insert under existing Benchmarks H2)

2026 refreshed benchmark snapshot

The table below summarizes publicly reported ranges from vendor blogs, MLPerf Inference v5.x submissions, and community benchmarks as of May 2026. Treat them as illustrative, not authoritative: exact numbers depend heavily on prompt length distribution, sampling parameters, draft-model setup, and driver/CUDA version. Bracketed numbers refer to the numbered notes at the end of the post.

| Workload | vLLM 0.9.x | TensorRT-LLM 2026 | SGLang (RadixAttention v2) |
| --- | --- | --- | --- |
| Llama 3.x 70B, batch 32, BF16, 1xB200, throughput (tok/s) [1] | ~3,500 – 4,800 | ~4,200 – 5,800 | ~3,400 – 4,600 |
| Mixtral 8x22B, batch 16, INT4 (W4A16), 2xB200 TP=2, throughput (tok/s) [1] | ~5,800 – 7,200 | ~6,500 – 8,400 | ~5,600 – 7,000 |
| Llama 3.x 8B, batch 1, BF16, 1xB100, latency p50 / p99 (ms per 128-token response) [2] | ~95 / ~140 | ~80 / ~120 | ~90 / ~135 |
| Llama 3.x 8B, structured JSON output (constrained decode), throughput relative to vLLM [3] | 1.0x (baseline) | ~1.0 – 1.2x | ~1.6 – 2.1x |

Bottom line for 2026: TensorRT-LLM still tends to post the highest raw throughput and the tightest p50 latency on NVIDIA hardware when you can afford the build/tune step; vLLM remains the best default for “drop it in and ship” Llama-family serving; SGLang is the right answer when your workload is dominated by long shared prefixes, agentic loops, or constrained JSON / regex output.
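To turn a table row into a rough capacity estimate, the steady-state range can be derated by the 20–40% production haircut described in note 1. The sketch below does only that arithmetic. The function names and the 20,000 tok/s target are hypothetical, chosen for illustration, and the input range is the vLLM 0.9.x figure from the Llama 3.x 70B row.

```python
import math

def derated_range(low, high, haircut_min_pct=20, haircut_max_pct=40):
    """Apply a production haircut (percent) to a steady-state tok/s range.

    Returns (pessimistic, optimistic): the low end takes the largest
    haircut, the high end takes the smallest.
    """
    pessimistic = low * (100 - haircut_max_pct) / 100
    optimistic = high * (100 - haircut_min_pct) / 100
    return pessimistic, optimistic

def replicas_needed(target_tok_s, per_replica_tok_s):
    """Number of replicas needed to reach a target aggregate throughput."""
    return math.ceil(target_tok_s / per_replica_tok_s)

# vLLM 0.9.x, Llama 3.x 70B, 1xB200 row: ~3,500 - 4,800 tok/s
pessimistic, optimistic = derated_range(3500, 4800)

# Hypothetical target of 20,000 tok/s aggregate throughput
target = 20_000
print(f"plan for {replicas_needed(target, optimistic)} to "
      f"{replicas_needed(target, pessimistic)} single-GPU replicas")
```

Planning against the pessimistic end is the safer default, since mixed prompt lengths push real traffic toward the larger haircut.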


BLOCK D — Updated FAQ entries (replace 3 stale FAQ Q&A)

Which is fastest in 2026?

There is no single winner — the answer depends on the workload. On dense Llama-family models with steady, high-concurrency traffic on NVIDIA hardware, TensorRT-LLM 2026 generally posts the highest raw throughput and the tightest p50 latency, especially with FP8 or FP4 weights. vLLM 0.9.x is typically within 10–25% of TensorRT-LLM on the same hardware, with far less operational overhead. SGLang wins when prefixes are heavily shared (agent loops, multi-turn RAG) or when output is constrained by JSON / regex / grammar. Pick by traffic shape, not by leaderboard.
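The 10–25% figure can be sanity-checked against the throughput ranges in the benchmark snapshot. The sketch below compares range midpoints, which is a simplifying assumption: the published ranges say nothing about where inside the range a given deployment lands. The helper names are illustrative.

```python
def midpoint(low, high):
    return (low + high) / 2

def gap_pct(engine_range, reference_range):
    """Percent by which the engine's midpoint trails the reference midpoint."""
    engine = midpoint(*engine_range)
    reference = midpoint(*reference_range)
    return (reference - engine) / reference * 100

# Throughput ranges (tok/s) copied from the benchmark snapshot
rows = {
    "Llama 3.x 70B, 1xB200": {"vllm": (3500, 4800), "trtllm": (4200, 5800)},
    "Mixtral 8x22B, 2xB200": {"vllm": (5800, 7200), "trtllm": (6500, 8400)},
}

for name, row in rows.items():
    gap = gap_pct(row["vllm"], row["trtllm"])
    print(f"{name}: vLLM midpoint trails TensorRT-LLM by {gap:.1f}%")
```

Both rows land inside the 10–25% band on midpoints, which is consistent with the claim but does not prove it for any specific traffic shape.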

Does Blackwell change the answer?

Yes, in two ways. First, second-generation Transformer Engine plus native FP4 support on B100/B200/GB200 widens the gap between engines that have shipped tuned FP4/FP8 kernels (TensorRT-LLM, increasingly vLLM and SGLang) and those that have not. Older engine versions running on Blackwell can leave 30–50% performance on the table. Second, the GB200 NVL72 rack-scale fabric makes very-large-model serving (Mixtral-class MoE, 400B+ dense) a single-rack problem, which favors engines with mature expert parallelism and pipeline parallelism: currently TensorRT-LLM, with vLLM closing fast.
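Part of why FP8 and FP4 matter is plain memory arithmetic: weight footprint scales linearly with bits per parameter. The sketch below computes the weight-only footprint for a nominal 70B-parameter model. It deliberately ignores KV cache, activations, and the per-block scale factors that quantized formats carry, so real footprints are somewhat larger; the function name is illustrative.

```python
BITS_PER_PARAM = {"BF16": 16, "FP8": 8, "FP4": 4}

def weight_gb(num_params, fmt):
    """Weight-only footprint in GB (10^9 bytes).

    Ignores KV cache, activations, and quantization scale overhead.
    """
    return num_params * BITS_PER_PARAM[fmt] / 8 / 1e9

# Nominal 70B-parameter dense model
for fmt in BITS_PER_PARAM:
    print(f"70B @ {fmt}: {weight_gb(70e9, fmt):.0f} GB of weights")
```

Halving the bits per weight halves the bytes that must be read per decoded token, which is why tuned low-precision kernels move both throughput and latency.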

When should I use SGLang vs vLLM?

Use SGLang when (a) you run agent / tool-use loops with long shared system prompts, (b) you emit structured JSON or regex-constrained output at scale, or (c) you need RadixAttention’s prefix tree to cache across thousands of related sessions cheaply. Use vLLM when you want a single engine that handles a broad mix of chat, RAG, embeddings-adjacent workloads, and arbitrary Hugging Face checkpoints with minimal config. In practice many teams run both: vLLM for general chat traffic and SGLang for the agent / structured-output tier.
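The prefix-sharing idea behind case (c) can be illustrated with a toy token-level prefix tree. This is a minimal sketch of the concept only, not SGLang's implementation: a real radix tree compresses runs of tokens into single edges, stores KV-cache blocks at the nodes, and evicts under memory pressure. The class name and the example sessions are hypothetical.

```python
class ToyPrefixCache:
    """Toy prefix tree over token sequences (one dict node per token)."""

    def __init__(self):
        self.root = {}

    def insert(self, tokens):
        """Store a sequence and return how many leading tokens were
        already cached, i.e. the prefill work that could be skipped."""
        node = self.root
        reused = 0
        for tok in tokens:
            if tok in node:
                reused += 1
            else:
                # First miss: every later token lands in a fresh branch,
                # so no further hits are counted for this sequence.
                node[tok] = {}
            node = node[tok]
        return reused

# Two agent sessions sharing one long system prompt
system_prompt = ["You", "are", "a", "careful", "tool", "agent", "."]
session_a = system_prompt + ["List", "the", "files"]
session_b = system_prompt + ["Read", "the", "config"]

cache = ToyPrefixCache()
print(cache.insert(session_a), "tokens reused for session A")
print(cache.insert(session_b), "tokens reused for session B")
```

The first session pays for the full prompt; the second reuses the whole shared system prompt and only computes its own suffix. With thousands of related sessions, that reuse is where the savings come from.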


If you are tuning any of these engines end-to-end, three sibling deep-dives on this site pair well with this benchmark. For prompt-length and cost-modeling work — which directly drives every throughput number above — start with the breakdown of how modern LLM tokenizers (BPE, SentencePiece, tiktoken) actually segment text. To understand why prefix caching and paged attention dominate the 2026 benchmark gaps, follow it with the explainer on KV-cache optimization techniques for LLM inference. And if you are deciding whether to even serve a dense model versus a sparse one on the same hardware budget, the companion piece on Mixture-of-Experts (MoE) LLM architecture trade-offs covers the routing, expert-parallelism, and memory math that decide it.


  1. Throughput numbers assume warm cache, 1k-token prompts, 256-token completions, and tuned --max-num-batched-tokens / equivalent. Real production traffic with mixed prompt lengths typically lands 20–40% below the steady-state batch peak. 

  2. Latency assumes a single concurrent request with no admission delay. p99 widens sharply once concurrency exceeds the engine’s preferred batch shape — measure under your own QPS, not at batch=1. 

  3. Structured-output speedup is the dimension where SGLang’s RadixAttention + grammar-fused decode currently has the clearest edge; the gap shrinks on free-form generation. 
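Note 2's advice to measure under your own QPS comes down to recording per-request latencies at realistic concurrency and reading off the percentiles. A minimal sketch using the nearest-rank method follows. The latency samples are synthetic, made up purely to show the calculation, and are not measurements of any engine.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]

# Synthetic per-request latencies in ms (illustration only)
latencies_ms = [90, 92, 95, 97, 99, 101, 104, 110, 125, 180]

print("p50:", percentile(latencies_ms, 50), "ms")
print("p99:", percentile(latencies_ms, 99), "ms")
```

With only a handful of samples, p99 is just the single worst request, so collect at least several hundred requests per concurrency level before comparing engines on tail latency.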
