vLLM vs TensorRT-LLM vs SGLang: LLM Inference Throughput Benchmark (2026)


Tokens per second. That’s the floor on your product’s latency. That’s the ceiling on your GPU utilization. Pick the wrong inference engine, and you’re leaving 40% throughput on the table—or burning 50% more infrastructure for the same user experience.

Three engines dominate production LLM inference in 2026: vLLM, NVIDIA TensorRT-LLM, and SGLang. All three claim efficiency wins. All three have published benchmarks that look great on their own turf. And all three have blind spots that published numbers won’t catch. This post walks through a reproducible methodology for fair comparison, what the published data actually shows, and the trade-offs that determine which one fits your workload—not the marketing narrative.

We’re not claiming to have the “true winner.” We’re showing you how to measure for yourself, and what gotchas to watch for.

TL;DR: The Decision Framework

High throughput, standardized models, NVIDIA-only? → TensorRT-LLM (6500+ tok/s on H100, compile-once cost).
Rapid iteration, model diversity, moderate throughput acceptable? → vLLM (5000+ tok/s, no compile friction).
Structured output, multimodal, research cutting-edge? → SGLang (2.4–3.5x speedup via RadixAttention tree caching, newer).

The rest of this post shows you why these trade-offs matter and how to measure them rigorously.

Terminology Primer: First-Principles Ground Rules

Before diagrams and architecture, let’s lock down the vocabulary. These terms hide the real mechanics of throughput.

KV Cache (Key-Value Cache). During generation, the model attends over every token it has processed so far, both the prompt and the tokens it has generated. For each token position, every attention layer stores a “key” and a “value” vector. Per token, that’s 2 × num_layers × (KV hidden dim) × bytes per element. In bfloat16, Llama-3.1-70B (80 layers, GQA with 8 KV heads of dimension 128) stores ~320 KB per token, so a 2048-token context costs roughly 640 MB of KV cache per request. That’s your memory constraint.
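These numbers are easy to recompute for any model. A minimal sizing helper (the 80-layer / 8-KV-head / 128-dim shape is Llama-3.1-70B’s public config; plug in your own model’s values):

```python
def kv_cache_bytes(num_tokens: int, num_layers: int, num_kv_heads: int,
                   head_dim: int, bytes_per_elem: int = 2) -> int:
    """Per-request KV cache size: one K and one V vector per layer per token."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem * num_tokens

# Llama-3.1-70B: 80 layers, 8 KV heads (GQA), head_dim 128, bfloat16 (2 bytes)
per_token = kv_cache_bytes(1, 80, 8, 128)
print(per_token // 1024)                          # 320 KB per token
print(kv_cache_bytes(2048, 80, 8, 128) // 2**20)  # 640 MB for a 2048-token context
```

Note that GQA is doing a lot of work here: with full multi-head attention (64 KV heads instead of 8), the same context would cost 8× more.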

Paged Attention (vLLM & TensorRT-LLM). Instead of allocating KV cache as one contiguous block per request, split it into fixed-size pages (like OS virtual memory). When a new request arrives, grab unused pages. When a request finishes, free the pages. This dramatically reduces memory fragmentation. Analogy: instead of assigning each request a contiguous seat in a stadium, split rows into chairs. New users sit in unused chairs; old users’ chairs go back to the pool.
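A toy free-list allocator shows the mechanics (illustrative only; vLLM’s real block manager adds copy-on-write, swapping, and prefix reuse on top of this):

```python
class PagePool:
    """Fixed-size KV cache pages handed out from a free list, OS-paging style."""
    def __init__(self, num_pages: int, page_size_tokens: int = 16):
        self.page_size = page_size_tokens
        self.free = list(range(num_pages))   # indices of unused pages
        self.owned = {}                      # request_id -> [page indices]

    def allocate(self, request_id: str, num_tokens: int) -> list[int]:
        needed = -(-num_tokens // self.page_size)  # ceiling division
        if needed > len(self.free):
            raise MemoryError("no free pages: request must queue")
        pages = [self.free.pop() for _ in range(needed)]
        self.owned[request_id] = pages
        return pages

    def release(self, request_id: str) -> None:
        # Pages go straight back to the pool; no defragmentation pass needed.
        self.free.extend(self.owned.pop(request_id))

pool = PagePool(num_pages=8)
pool.allocate("req1", 40)   # 40 tokens -> 3 pages of 16
pool.allocate("req2", 30)   # 2 pages
pool.release("req1")        # 3 pages return to the pool
pool.allocate("req3", 45)   # reuses them immediately, no compaction
```

The point of the fixed page size is that any freed page fits any future request, which is exactly why fragmentation stays low.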

Continuous Batching. Requests arrive asynchronously. As soon as one request finishes its last token, its GPU resources become available. Rather than waiting for all requests in a batch to finish (static batching), continuous batching immediately assigns the freed GPU memory to the next waiting request. This multiplies throughput under realistic (non-synchronized) traffic.

RadixAttention (SGLang). A KV cache structure that shares cached attention prefixes across multiple requests via a radix tree. If 10 users share a system prompt, store the KV cache for that system prompt once. All 10 requests reference it via a tree node. When a request diverges with its unique user input, branch the tree. Analogy: a prefix tree (trie) for KV cache. Reduces memory and recomputation dramatically if prompts overlap.
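The savings are easy to quantify with the trie view: count distinct token prefixes instead of one KV slot per token per request. This is an illustration of the idea, not SGLang’s actual implementation:

```python
def kv_tokens_stored(requests: list[list[str]]) -> tuple[int, int]:
    """Tokens cached without sharing vs. with radix-tree prefix sharing."""
    naive = sum(len(r) for r in requests)
    # With tree sharing, one cache node exists per distinct prefix.
    shared = len({tuple(r[:i + 1]) for r in requests for i in range(len(r))})
    return naive, shared

# 10 requests sharing a 6-token system prompt, diverging in the user turn.
system = "You are a helpful assistant .".split()
reqs = [system + f"user {i} question".split() for i in range(10)]
naive, shared = kv_tokens_stored(reqs)
print(naive, shared)  # 90 tokens naively vs. 27 with prefix sharing
```

With heavy overlap the reduction is large; with fully distinct prompts, `shared` equals `naive` and you pay only tree-management overhead.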

TTFT (Time To First Token). Wall-clock latency from prompt submission to the first output token. This determines user-perceived responsiveness. A 500 ms TTFT feels “snappy”; 3 seconds feels “stuck.” TTFT is bottlenecked by prefill throughput (processing the entire prompt in one forward pass).

TPOT (Time Per Output Token). Time between successive tokens after the first. Dominated by decode throughput (processing one token at a time, batched across requests). If TPOT is high, the model is GPU-underutilized, often because batches are sparse or memory-bound.

Prefill vs Decode. Prefill: the forward pass that processes the entire prompt (e.g., 500 tokens) in one go, computing attention over all positions. Decode: the forward pass that generates one token at a time, re-using cached KV from prior tokens. Prefill is compute-bound; decode is memory-bound. Engines must balance scheduling to overlap them.


What actually matters in LLM inference

Before diving into the engines, let’s lock down the metrics that matter. Throughput benchmarks obscure more than they illuminate if you’re not measuring the right things.

TTFT (Time To First Token). The latency from prompt submission to the first token in the response. User-facing. Matters most for interactive applications—chat UIs, real-time search. A 500ms TTFT feels snappy; 2s feels sluggish. Published benchmarks often hide TTFT behind “throughput” numbers (tokens/sec aggregated over a batch), making it easy to optimize for the average at the cost of the tail.

TPOT (Time Per Output Token). The time between successive tokens once generation has started. This is where batching efficiency lives. If you’re running continuous batching correctly, TPOT should be roughly constant. If you’re not, slow users starve fast ones. TPOT is the best signal for “are we using the GPU well?”

Throughput (tokens/sec). The aggregate token generation rate across all active requests. This is what scales linearly with GPU memory and clock speed. But throughput numbers without percentile latency breakdown are nearly useless for prod. A 10k tokens/sec engine that achieves it via one 100-token batch is different from one that sustains it across 50 concurrent requests.

Memory efficiency. KV cache is the constraint. Larger models and longer context windows blow memory budgets fast. All three engines use some form of KV cache optimization (paging, prefix caching, reuse). The maturity and coverage vary wildly. TensorRT-LLM’s paged attention is battle-tested; SGLang’s RadixAttention is newer. vLLM added prefix caching late; adoption is still ramping.

Concurrency. Can the engine actually handle 32 concurrent requests without crashing or thrashing? Continuous batching is the foundation. But the implementation details—request scheduling, memory defragmentation, token generation loops—determine whether you actually achieve the throughput promised in the single-request benchmark.

Diagram 1: Inference Request Lifecycle

This is the skeleton of every inference engine. Understanding each stage clarifies where engines differ.

Architecture diagram 1

Setup: The scheduler admits the request into the system and decides whether to start prefill immediately (if GPU has capacity) or queue it. Walkthrough: Batching logic collects prefill requests (and decode requests from ongoing generation). Prefill forward pass processes the entire prompt, populating KV cache. Then the loop: decode generates one token, updates cache, repeats until EOS token or max length. Throughout, memory management (Diagram 2) keeps cache fragmentation low. Finally, tokens are streamed back to the client.

Key insight: The scheduler’s policy here dominates tail latency. If the scheduler always prioritizes high-token-count prefills over short prompts, a user asking a simple question waits behind someone summarizing a 10K-token PDF.


The three engines at a glance

vLLM. The open-source reference implementation. Built by UC Berkeley and now canonical for academic and hobbyist deployments. Strengths: ease of deployment, wide model coverage, active community, Hugging Face integration out of the box. Weaknesses: lower peak throughput than compiled alternatives, higher latency variance under load, prefix caching adoption still ramping. Best for: teams that prioritize rapid iteration and model swaps over peak throughput. Deployed at: Anyscale, Together AI, and most startups using pure inference.

TensorRT-LLM. NVIDIA’s compiled inference engine. Built on CUDA kernels hand-optimized for every major chip generation (H100, L40, L4). Strengths: highest peak throughput on NVIDIA hardware, battle-tested in production, paged attention mature and robust. Weaknesses: compile time (10–30 min for a large model), limited non-NVIDIA hardware support (AMD, Intel), steeper ops onboarding. Best for: enterprises running standardized model fleets at scale. Deployed at: Azure OpenAI Service, most cloud LLM APIs.

SGLang. Newer player (2023), led by UC Berkeley and CMU. Built around structured output and multimodal first. Strengths: RadixAttention (tree-structured KV cache reuse), best-in-class structured output support, lower TTFT under light load. Weaknesses: smaller operational footprint, less stable for long-running services, newer means fewer gotchas discovered in prod. Best for: teams doing fine-grained prompt templating, structured generation, and multimodal inference. Deployed at: Research groups, early-stage VLM products.

Diagram 2: KV Cache Anatomy & Allocation Strategies

The crown-jewel problem: how to pack KV cache in memory without fragmentation.

Architecture diagram 2

Setup: Three strategies for managing KV memory. Contiguous allocation (used by naive inference) allocates one large block per request; fragmentation balloons quickly. Paged allocation (vLLM, TensorRT-LLM) carves memory into fixed pages and allocates to requests as needed. RadixAttention (SGLang) goes deeper: tree-structure lets sibling requests share prefixes. Walkthrough: In paged mode, when a request ends, its pages return to a free pool, ready for the next request—no defragmentation required. In RadixAttention, if 10 requests start with the same system prompt (e.g., “You are a helpful assistant”), that prompt’s KV is computed once and reused by all 10, branching only when they diverge.

Key insight: Paged attention slashes the memory wasted to fragmentation; the PagedAttention paper reports waste falling from 60–80% under contiguous allocation to under 4%. RadixAttention cuts memory roughly in half if prompts overlap significantly (≥80% shared prefix). But radix overhead matters on short, diverse prompts.


Benchmark methodology

A fair benchmark answers: Under steady-state load, with realistic request distributions, what throughput and latency does each engine achieve?

Most published benchmarks saturate a single H100 with the maximum batch size the engine can fit (often 256+ requests). This measures peak GPU utilization but tells you nothing about latency under realistic traffic—bursty, sparse, with users arriving at random intervals. A better benchmark simulates real request arrivals and measures both throughput and latency percentiles.

Let’s sketch the pipeline:

Model Store
    ↓
Load Model (once, measure cold vs warm)
    ↓
Initialize Engine (vLLM server / TRT-LLM server / SGLang)
    ↓
Load Test Driver (generate requests from distribution)
    ↓
Warm up (throw away first N responses, stabilize GPU clocks)
    ↓
Measure Phase (collect latencies, tokens, timing for M requests)
    ↓
Metrics Collector (p50, p95, p99 latency; throughput; memory)
    ↓
Analysis (plot CDF; identify tail behavior)

The knobs:

  1. Model. Same model, same precision (decide up front: bfloat16 or int8). Most comparisons use Llama-3.1 70B or Mistral 7B. Larger models stress memory and compilation time; smaller models hide KV cache bottlenecks. Run with multiple model sizes.

  2. Hardware. NVIDIA H100 (80GB) is the reference. But run the same benchmark on L40 (48GB), L4 (24GB), and even A100 if comparing across generations. Cost/token varies wildly by chip.

  3. Batch size. This is where the gotchas live. Published benchmarks often report “optimal batch size” (e.g., batch 256). Real users arrive randomly. Use a realistic arrival distribution: Poisson inter-arrival time at different mean rates (10 req/s, 100 req/s, 1000 req/s). Let continuous batching do its job.

  4. Prompt & completion length. Measure three scenarios:
    – Short prompt (10 tokens), short output (50 tokens) — chat-like
    – Long prompt (1K tokens), medium output (200 tokens) — RAG-like
    – Medium prompt (200 tokens), long output (1K tokens) — summarization-like

  5. Precision. Most engines support fp16, bfloat16, int8. bfloat16 is usually the sweet spot (speed + accuracy). But compile time differs.

Reproducible sketch (runnable, assuming each engine exposes an OpenAI-compatible streaming completions endpoint):

#!/usr/bin/env python3
import asyncio
import time
import aiohttp
import numpy as np

async def send_request(session, engine_url, model, prompt, output_tokens, results):
    """Stream one completion and record TTFT, TPOT, and token count."""
    t0 = time.time()
    t_first = None
    tokens_gen = 0
    async with session.post(
        f"{engine_url}/v1/completions",
        json={
            "model": model,
            "prompt": prompt,
            "max_tokens": output_tokens,
            "temperature": 0.7,
            "stream": True,  # stream so TTFT can be measured directly
        },
    ) as resp:
        async for raw in resp.content:  # SSE lines: b'data: {...}\n'
            line = raw.decode().strip()
            if not line.startswith("data:") or line.endswith("[DONE]"):
                continue
            if t_first is None:
                t_first = time.time()
            tokens_gen += 1  # typical servers emit one SSE event per token
    t_end = time.time()
    results["ttft"].append((t_first or t_end) - t0)
    results["tpot"].append((t_end - (t_first or t0)) / max(tokens_gen - 1, 1))
    results["tokens"].append(tokens_gen)

async def benchmark_engine(
    engine_url: str,  # http://localhost:8000 for vLLM, etc.
    model: str,
    num_requests: int = 1000,
    arrival_rate: float = 10.0,  # mean requests per second
    prompt_tokens: int = 200,
    output_tokens: int = 200,
) -> dict:
    """
    Simulate random user arrivals, measure TTFT, TPOT, throughput.
    """
    results = {"ttft": [], "tpot": [], "tokens": []}
    async with aiohttp.ClientSession() as session:
        start_time = time.time()
        tasks = []
        for _ in range(num_requests):
            # Poisson process: exponential inter-arrival times.
            await asyncio.sleep(np.random.exponential(1.0 / arrival_rate))
            prompt = " ".join(["lorem"] * prompt_tokens)
            # Launch without awaiting completion; awaiting each POST here
            # would serialize requests and destroy the arrival distribution.
            tasks.append(asyncio.create_task(send_request(
                session, engine_url, model, prompt, output_tokens, results)))
        await asyncio.gather(*tasks)
        wall_time = time.time() - start_time

    total_tokens = sum(results["tokens"])
    return {
        "throughput_tokens_sec": total_tokens / wall_time,
        "ttft_p50_ms": np.percentile(results["ttft"], 50) * 1000,
        "ttft_p99_ms": np.percentile(results["ttft"], 99) * 1000,
        "tpot_p50_ms": np.percentile(results["tpot"], 50) * 1000,
        "tpot_p99_ms": np.percentile(results["tpot"], 99) * 1000,
        "wall_time_sec": wall_time,
        "num_requests": num_requests,
        "avg_output_tokens": float(np.mean(results["tokens"])),
    }

# Run it:
if __name__ == "__main__":
    engines = [
        ("vLLM", "http://localhost:8000"),
        ("TensorRT-LLM", "http://localhost:8001"),
        ("SGLang", "http://localhost:8002"),
    ]

    for name, url in engines:
        result = asyncio.run(benchmark_engine(url, model="meta-llama/Llama-3.1-70B"))
        print(f"{name}: {result['throughput_tokens_sec']:.0f} tok/s, "
              f"TTFT p99={result['ttft_p99_ms']:.0f}ms, "
              f"TPOT p99={result['tpot_p99_ms']:.1f}ms")

What this catches that static benchmarks miss:
– Tail latencies (p99 TTFT) that break SLAs
– Throughput under realistic (bursty) arrival patterns
– Memory churn and fragmentation as requests queue and complete
– Cold-start behavior (first request after idle)

Diagram 3: Continuous Batching Timeline (Prefill ↔ Decode Interleaving)

This sequence diagram shows why continuous batching matters more than raw batch size.

Architecture diagram 3

Setup: Time flows left to right. Walkthrough: Req1 (500-token prefill) starts. While it decodes, Req2 arrives and its prefill is immediately packed with Req1’s next decode step. When Req3 arrives, it’s packed with ongoing decode from Req1 and Req2. This interleaving keeps GPU utilization high—all cores work on either prefill or decode in each step.

Key insight: Static batching waits for Req1 to finish all 1000 tokens before starting Req2. Continuous batching starts Req2 prefill while Req1 is mid-decode. Throughput jumps 2–3x under realistic, asynchronous load.
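The 2–3x claim is easy to sanity-check with a toy discrete simulation: one decode step per time unit, a hypothetical capacity of 8 batch slots, and random generation lengths. This is not any engine’s real scheduler, just the scheduling policy difference in isolation:

```python
import random

def simulate(gen_lens: list[int], capacity: int, continuous: bool) -> int:
    """Decode steps to finish all requests; each step advances every active one."""
    steps, queue, active = 0, list(gen_lens), []
    while queue or active:
        # Static batching refills only when the whole batch has drained;
        # continuous batching refills freed slots every step.
        if continuous or not active:
            while queue and len(active) < capacity:
                active.append(queue.pop(0))
        steps += 1
        active = [r - 1 for r in active if r > 1]
    return steps

random.seed(0)
lens = [random.randint(50, 1000) for _ in range(64)]
static = simulate(lens, capacity=8, continuous=False)
cont = simulate(lens, capacity=8, continuous=True)
print(static / cont)  # continuous finishes the same work in fewer steps
```

The gap grows with variance in generation length: static batching idles every slot whose request finished early, while continuous batching backfills immediately.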


What published benchmarks show (and what they miss)

Let’s look at what the teams themselves have published, and read between the lines.

vLLM’s official benchmarks (https://github.com/vllm-project/vllm/tree/main/benchmarks) measure throughput on Llama-3-70B with batch sizes from 1 to 256 on an H100. vLLM reports ~5000 tok/s at batch 256 for bfloat16. This is honest and reproducible, but it obscures:

  • No percentile latencies. What’s the TTFT p99? p50? By omitting latency, the post only tells you about GPU utilization, not user experience.
  • No realistic arrival simulation. Batch 256 assumes you can queue 256 requests instantaneously. Real users trickle in. Under sparse load (1–5 concurrent requests), vLLM’s throughput is much lower because batches don’t fill.
  • Prefix caching is off by default and was immature until 0.6.x. Enabling it can double throughput if your prompts overlap, or add 5% latency if they don’t. The benchmark doesn’t tell you which.

NVIDIA TensorRT-LLM’s benchmarks (https://github.com/NVIDIA/TensorRT-LLM/tree/main/benchmarks) report 6000+ tok/s on the same model, same H100. This is higher, but critical context is missing:

  • Compile overhead is amortized. A new model takes 10–30 minutes to compile CUDA kernels. If you run 1 million requests on that model, the per-request cost is negligible. If you swap models weekly, the compile time dominates. Most benchmarks hide this cost entirely.
  • Model coverage is smaller. Not all attention variants compile cleanly. GQA works; some flash attention variants require custom plugins. If your model doesn’t compile cleanly, you fall back to vLLM (losing the throughput gains).
  • Batching is optimized for static request graphs. TensorRT kernels are hand-tuned for fixed batch shapes. Real-world dynamic batching (requests arrive and finish asynchronously) may hit scheduling overheads not measured in the benchmark.

SGLang’s paper (https://arxiv.org/abs/2312.07104, Dec 2023) demonstrates RadixAttention on Llama-2-7B and Mistral-7B, showing 2.4x-3.5x speedup over naive KV cache management via tree-structured reuse. But important caveats apply:

  • The baseline is naive, not competitive. The comparison is against a reference implementation without KV caching, not vLLM or TRT-LLM with paged attention. The speedup is real but narrower when compared to existing engines.
  • Larger models are not benchmarked. Llama-3.1-70B, Mistral-Large, and other production models are absent. RadixAttention adds overhead for tree management. On memory-bound workloads (70B models on H100), this overhead may eat into gains.
  • Structured output support is claimed but not latency-quantified. SGLang excels at JSON and regex-constrained decoding, but the paper doesn’t measure how much this feature costs in tokens/sec.

Why they all look good on their home turf:
1. Baseline selection. Each team compares against the previous generation or a naive implementation, not each other.
2. Hardware saturation. All three hit 90%+ GPU utilization at high batch sizes. The differences are often in the tail, not the mean.
3. Model selection. Benchmarks use models that play to each engine’s strengths (vLLM uses open-source; TRT-LLM uses NVIDIA reference; SGLang uses multimodal).
4. Metric cherry-picking. Throughput is reported; latency percentiles are often omitted.

Diagram 4: Engine Internals Side-by-Side

Here’s where the three engines fundamentally differ in architecture and operational trade-offs.

Architecture diagram 4

Setup: Three rows, one per engine. Each shows core components stacked top to bottom, ending in typical throughput on H100. Walkthrough: vLLM’s scheduler uses best-fit page allocation; it picks from multiple attention implementations depending on workload, and supports prefix caching (but it’s opt-in). TensorRT-LLM compiles ahead of time, generating hand-tuned CUDA kernels that fuse operations (e.g., attention + layer norm in one kernel call), reducing memory bandwidth. In-flight batching manages dynamic requests with static kernel shapes, a complex problem solved by clever scheduling. SGLang’s hook is RadixAttention: a tree of KV cache nodes shared across requests. It excels at structured and multimodal workloads.

Key insight: vLLM prioritizes flexibility and ease of deployment (no compile, swappable models). TensorRT-LLM pays upfront compile cost for peak throughput and stability. SGLang bets on prompt overlap and structured outputs as the future.


Trade-offs, gotchas, engine-specific caveats

TensorRT-LLM: Compilation, coverage, and ops complexity.
Compiling a new model variant (different precision, tensor-parallel degree, or max context length) takes 10–30 minutes, and the resulting engine is tied to the GPU architecture it was built for. For teams with rapidly iterating models or serving dozens of model variants, this is a bottleneck. The compilation step is a one-time cost per model variant, but in research or rapid prototyping environments, every 15-minute delay compounds frustration.

Not all attention implementations compile cleanly. Grouped query attention (GQA), flash attention variants, and custom kernels may require hand-tuned plugin development. Multi-query attention (MQA) is well-supported, but newer variants sometimes hit the CUDA compiler’s limitations.

Ops overhead is real. You need to manage multiple CUDA versions (11.8, 12.x), TensorRT library versioning, and kernel profiling tools. This isn’t a zero-config deployment—it’s a weekend project to get right. For small teams, this overhead may exceed the throughput gains.

vLLM: Prefix caching maturity and competing designs.
Prefix caching (reusing KV cache for repeated prompt prefixes—e.g., system prompt + user query templates) was added in 0.5.x but is still stabilizing in 0.6.x. Default is off; enable it with --enable-prefix-caching on the server (or enable_prefix_caching=True in the Python API). Enabling it adds memory overhead until the cache hit rate exceeds 80%. If you have prompt overlap (the same system instruction repeated across 100 users), enable it; if prompts are diverse, you’ll waste cycles on cache misses.

Multiple attention backends live in the codebase (FlashAttention, FlashInfer, xFormers, Torch SDPA), with the choice overridable via the VLLM_ATTENTION_BACKEND environment variable. By default the engine picks one based on prompt length, batch size, and head count. Performance can vary by 10–20% depending on which path gets chosen. This is fine for most users but frustrating if you need deterministic latencies.

Good for experimentation and model swaps; less ideal if you need strong SLO guarantees.

SGLang: Production stability and operator footprint.
Fewer production deployments means fewer discovered edge cases. The engine is sound and well-designed, but field issues are still being discovered (e.g., OOM behavior under overload, interaction with large batch sizes). Not a disqualifier, but risk-averse teams should wait 12–18 months.

Hardware support beyond NVIDIA is limited. AMD ROCm path exists but is less optimized. Intel discrete GPUs are not supported. If you’re cloud-native and may need to escape the NVIDIA ecosystem, vLLM is safer.

RadixAttention is elegant (tree-structured KV reuse across multiple prompts) but adds tree management overhead. For short, non-overlapping prompts (e.g., isolated chatbot queries), it may underperform simpler KV caching. Measure on your exact token distribution before assuming wins.

Diagram 5: Decision Tree (Pick Your Engine)

Architecture diagram 5

Setup: Top-level splits on primary goal. Walkthrough: If your goal is throughput and cost per token, ask whether your ops team can handle 10–30 minute compiles and CUDA profiling. If yes, TensorRT-LLM. If model velocity is high (new variants weekly), vLLM’s zero-compile approach wins. If structured output or multimodal is critical, SGLang.


Recommendations by scenario & first-principles reasoning

High-throughput batch workload (search, summarization, recommendation ranking).
TensorRT-LLM if you can absorb the compile-once cost and ops complexity. Peak throughput + stability. If model velocity is high, vLLM as fallback.
First-principles: Batch workloads are I/O-bound on prefill, then memory-bound on decode. TensorRT’s kernel fusion eliminates memory round-trips, gaining 20–30% throughput. The compile cost is amortized over millions of requests.

Interactive, low-latency service (chat, copilot, real-time search).
vLLM for rapid iteration. SGLang if you need structured output. Latency under light load (p99 TTFT) matters more than peak throughput.
First-principles: Requests arrive sparse and asynchronous. TTFT is bottlenecked by prefill throughput. vLLM’s lower operational friction means faster iteration on prompt engineering and model swaps. SGLang’s structured output lets you avoid expensive post-generation filtering.

Multimodal or structured generation (vision-language, function calling).
SGLang. Built for this. Batching interleaves text and vision tokens intelligently.
First-principles: Vision tokens are cheap to ingest: all image patches (e.g., 576 for a 336-pixel CLIP ViT-L/14 encoder) are processed in a single prefill pass, while text tokens must be decoded one at a time. SGLang’s scheduler interleaves them to keep the GPU busy. Structured output (e.g., enforce valid JSON) avoids wasted tokens on malformed outputs.

Research / rapid prototyping.
vLLM. Easiest to modify, most models already have weights, largest community.
First-principles: Every minute of compile overhead kills iteration speed in research. vLLM’s modular codebase (scheduler, attention, cache manager as separate modules) lets you swap out components to test hypotheses.

Cost-optimized production (per-token billing).
→ Run benchmarks on your actual hardware, models, and arrival patterns. Pick the engine that gives you the lowest cost/token at your target p99 latency. Often TensorRT-LLM wins, but vLLM may be close enough and cheaper to operate.
First-principles: Cost = (hardware cost + ops overhead) / tokens generated. TensorRT-LLM’s higher tokens/sec amortizes hardware costs. vLLM’s zero-compile and smaller ops team amortize labor costs. The breakeven depends on your scale and team size.


FAQ

Which is fastest?
On paper: TensorRT-LLM peaks highest at batch 128–256 on NVIDIA H100. In practice: vLLM is 5–15% slower on throughput but has lower operational complexity. SGLang is competitive on small models and structure-heavy workloads. Measure your workload.

Does engine choice matter for Llama-3.1 70B on H100?
Yes, measurably. At batch 128, expect 4500–6500 tok/s (TRT-LLM high, vLLM mid, SGLang depends on optimization). The difference is real for high-throughput products (e.g., 100 requests/s × 100 output tokens each = 10k tok/s sustained).

Can I switch engines without changing my app?
Mostly yes. All three expose /v1/completions (OpenAI-compatible) endpoints. Switching from vLLM to TRT-LLM is a config change if your app doesn’t rely on vLLM-specific features (e.g., lora_name). Switching to SGLang may require adjustments if you use structured output, since the schema format differs.

What about AMD or Intel accelerators?
vLLM supports AMD via ROCm (HIP backend). TensorRT-LLM is NVIDIA-only (no AMD/Intel path exists). SGLang is NVIDIA-primary; AMD support is emerging. If not on NVIDIA, vLLM is your baseline.

How much do I gain from prefix caching?
If you have >80% prompt overlap (e.g., thousands of queries on one knowledge base), you gain 2–3x throughput. If prompts are diverse, <5% gain. Measure your prompt similarity distribution first.
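A rough way to measure that distribution before flipping the switch (hypothetical helper; it splits on whitespace rather than using a real tokenizer, so treat the result as an approximation):

```python
def shared_prefix_fraction(prompts: list[str]) -> float:
    """Mean fraction of each prompt's tokens covered by the longest common
    prefix it shares with any earlier prompt (a crude cache-hit-rate proxy)."""
    token_lists = [p.split() for p in prompts]
    fracs = []
    for i, toks in enumerate(token_lists):
        best = 0
        for prev in token_lists[:i]:
            n = 0
            while n < min(len(toks), len(prev)) and toks[n] == prev[n]:
                n += 1
            best = max(best, n)
        fracs.append(best / len(toks) if toks else 0.0)
    return sum(fracs) / len(fracs)

prompts = ["system: answer briefly . user: what is RAM ?",
           "system: answer briefly . user: what is a GPU ?",
           "system: answer briefly . user: define cache ."]
print(round(shared_prefix_fraction(prompts), 2))  # 0.44: moderate overlap
```

Run this over a day of production prompts; a value near 0.8 says prefix caching will pay off, a value near zero says it will mostly add overhead.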

Should I use int8 quantization?
It cuts memory in half (KV cache shrinks too) and adds 5–10% latency. For models near your memory limit, it’s essential. For models with headroom, the latency trade-off rarely justifies it. Benchmark both.


Edge Cases: Long Context, Speculative Decoding, Structured Outputs

Long context windows (32K–128K tokens).
KV cache memory becomes catastrophic. At 32K context, each Llama-3.1-70B request holds ~10 GB of KV cache (bfloat16, GQA), so even an 8×H100 node (640 GB) serving the model fits only ~50 concurrent full-context requests, vs. roughly 190 at 8K. Paged attention helps but doesn’t solve the constraint. RadixAttention (SGLang) shines here if you have prompt overlap. TensorRT-LLM’s kernels are optimized for paged attention but don’t gain from tree-sharing. vLLM’s prefix caching helps only if prefixes overlap. Recommendation: Measure your actual context distribution. If 80% of requests use the same 4K-token system instruction + retrieval results, RadixAttention can give you a 2x memory win.
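A back-of-envelope capacity check, assuming a single pooled memory model, ~140 GB of bf16 weights, and ~20 GB of activation/runtime overhead (both figures are assumptions; tune them for your stack):

```python
def max_concurrent_requests(hbm_gb: float, weights_gb: float,
                            kv_bytes_per_token: int, context_tokens: int,
                            overhead_gb: float = 20.0) -> int:
    """How many full-context KV caches fit in memory left over after weights."""
    free = (hbm_gb - weights_gb - overhead_gb) * 2**30
    return int(free // (kv_bytes_per_token * context_tokens))

# Llama-3.1-70B-ish: 80 layers, 8 KV heads (GQA), head_dim 128, bf16.
kv_per_token = 2 * 80 * 8 * 128 * 2  # ~320 KB/token
print(max_concurrent_requests(640, 140, kv_per_token, 8192))   # 192 at 8K context
print(max_concurrent_requests(640, 140, kv_per_token, 32768))  # 48 at 32K context
```

The 4× context increase translates directly into a 4× concurrency cut; nothing in paged attention changes that, it only stops you from wasting what’s left.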

Speculative decoding (draft models).
Generate k tokens in advance with a smaller model, then verify with the large model. If k tokens are correct, jump ahead; otherwise, regenerate. All three engines can support this, but vLLM and TensorRT-LLM have experimental implementations. SGLang doesn’t yet. Speedup: 1.5–2x if draft accuracy is high (>85%). Recommendation: Not yet production-ready in any engine. Watch this space.
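The expected gain can be sketched with a toy acceptance model. Assumptions to be loud about: each draft token is accepted independently with a fixed probability, and the draft model’s own cost is ignored, so this is an upper bound on speedup, not a prediction:

```python
import random

def expected_tokens_per_step(p_accept: float, k: int, trials: int = 100_000,
                             seed: int = 0) -> float:
    """Mean tokens emitted per large-model verify step: the accepted draft
    run, plus the one token the verifier always contributes itself."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        accepted = 0
        # Draft run ends at the first rejected token or after k proposals.
        while accepted < k and rng.random() < p_accept:
            accepted += 1
        total += accepted + 1
    return total / trials

print(expected_tokens_per_step(0.85, 4))  # ~3.7 tokens per verify step
```

At 85% per-token acceptance and k=4, each expensive forward pass yields ~3.7 tokens instead of 1; real-world speedups land lower (the cited 1.5–2x) once draft-model time and verification overhead are paid.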

Structured output (JSON, regex, function calls).
vLLM supports prefix constraints (start with {"field":). TensorRT-LLM has no built-in support; you must post-process and re-request on failure. SGLang has native constrained decoding (a kernel that masks logits to valid tokens per spec). Cost: SGLang ~10% latency overhead vs. unconstrained; vLLM ~20%; TensorRT-LLM ~+1 request (failure retry). Recommendation: For structured workloads, SGLang is worth the investigation. Avoid TensorRT-LLM unless your outputs are naturally well-formed.


Real-World Implications

Operational complexity scales with throughput requirements.
If you’re running 100 concurrent requests, vLLM’s lower tuning burden may save weeks of ops work. If you’re running 10,000 requests (need 10k+ tok/s), TensorRT-LLM’s compile complexity is a weekend project, but the throughput gain translates to fewer H100s. Math: 100M requests/day * 200 output tokens = 20B tokens/day. At 5000 tok/s (vLLM), that’s 4.0M GPU-seconds ≈ 46 H100-days; at 6500 tok/s (TRT-LLM), 3.1M GPU-seconds ≈ 36 H100-days. Savings: ~11 H100s running continuously; at $2/hr, that’s ~$510/day, or roughly $190K/year. Compile overhead amortized: negligible.

Model diversity vs. peak throughput.
If you serve 50 different model variants (Llama-70B, Mistral-Large, Phi-3, etc.), TensorRT-LLM’s per-variant compiles (10–30 minutes each) add up to roughly 8–25 GPU-hours, repeated every time you upgrade the engine or retrain. vLLM: zero. This alone can determine the choice.

TTFT (time-to-first-token) and user experience.
At light load (5 concurrent requests), all three engines show TTFT variation based on queue position. If request i is behind a 10K-token batch, its TTFT is longer than if it’s ahead. vLLM’s scheduler has heuristics to prioritize short prompts; TensorRT-LLM’s is less tuned; SGLang’s is research-quality (less field-tested). User-perceived latency often depends more on queueing discipline than peak throughput.


Honest Limits of This Benchmark

We haven’t run the actual benchmark ourselves. This post distills published numbers (vLLM GitHub, TensorRT-LLM official benchmarks, SGLang paper) and adds methodology + reasoning. If you implement the pseudo-code above and find different results, your measurement trumps ours. Variance sources:
– GPU driver version (e.g., vLLM throughput varies 5–10% across driver versions)
– PyTorch build (compiled with/without Triton, CUTLASS variants)
– Model weights source (float16 vs bfloat16 causes ~2% latency deltas)
– Kernel launch overhead (varies with batch composition)

Benchmark design choices bias results. We assume Poisson arrivals; real traffic may be bursty. We assume one H100; scaling to a cluster introduces coordination overhead. We assume Llama-3.1-70B; MoE models (Mixtral) or longer contexts shift the balance toward SGLang/vLLM.

Published numbers are often peak, not median. All three engines report throughput at optimal batch size (256+). Real users rarely queue 256 requests simultaneously. Median throughput at 10 concurrent requests may be 30% lower.


  • vLLM GitHub & docs: https://github.com/vllm-project/vllm
  • TensorRT-LLM docs: https://nvidia.github.io/TensorRT-LLM/
  • SGLang paper & repo: https://arxiv.org/abs/2312.07104, https://github.com/sgl-project/sglang
  • NVIDIA H100 whitepaper: https://www.nvidia.com/en-us/data-center/h100/
  • PagedAttention paper (vLLM & TRT-LLM foundation): https://arxiv.org/abs/2309.06180

Next steps:
Clone one of these repos, pick a model you care about (Llama-3.1-70B is the reference), spin up an H100 or use a cloud provider’s hourly rental, and run the benchmark script above. Log the wall times, latency percentiles, and memory use. You’ll have data that beats any published number for your specific load.
