vLLM vs SGLang vs TensorRT-LLM: H100 Benchmark (2026)
A vLLM vs SGLang vs TensorRT-LLM benchmark has become the most-argued slide in any 2026 AI infrastructure review. The reason is straightforward: by the middle of this year, model quality across the open-weights frontier has compressed so tightly that the next ten percent of user-perceived performance — and a much larger fraction of unit economics — is no longer being won in pre-training. It is being won in the serving layer. Whether your team is shipping a customer chat product, an internal RAG system, or an agent fleet that emits hundreds of structured tool calls per second, the choice of inference engine on an NVIDIA H100 cluster is now the line item that decides whether the GPU bill is a footnote or a board topic. This post is a methodology-first benchmark of the three engines that own that conversation in 2026, run against Llama-3.3 70B and Mixtral 8x22B on H100, with the numbers carefully labelled as illustrative of what the engine maintainers and the wider community have publicly reported. The goal is not to crown a winner. It is to give a senior practitioner a defensible, reproducible way to make the call for their own workload.
The three engines are not interchangeable. vLLM is the de facto open-source baseline, the engine that gets to a new model first and runs almost anywhere. SGLang is the structured-programs engine that won the prefix-caching argument with RadixAttention. TensorRT-LLM is NVIDIA’s compiled, fused-kernel pipeline that, when you are willing to pay the operational tax, still extracts the most peak tokens per second from a properly tuned H100. The interesting question in 2026 is not which is “fastest” — they trade leadership across regimes — but which is fastest for your arrival pattern, your prompt distribution, and your tolerance for engineering overhead.
What This Benchmark Measures and Why It Matters
Answer-first summary: This benchmark compares vLLM, SGLang, and TensorRT-LLM on a single 8x H100 SXM node serving Llama-3.3 70B and Mixtral 8x22B under three workload archetypes — interactive chat, RAG with long shared prefixes, and high-rate agent traces. It reports sustained throughput (output tokens per second), Time to First Token (TTFT), Inter-Token Latency (ITL), and tail latency at the 95th and 99th percentiles. The numbers presented are illustrative, drawn from community-reported benchmarks and the engine maintainers’ own published figures, and the methodology is the part you should reproduce.
For most teams, the metric that actually pays the bill is sustained output tokens per second per GPU at the latency budget the product requires. Peak throughput numbers from a single-prompt microbenchmark are nearly useless: a serving system has to handle a stream of requests of varying lengths arriving on a non-uniform schedule, and the way an engine schedules, batches, and shares its KV cache under that stream is what determines real-world cost. The corollary is that TTFT and ITL matter at least as much as throughput. A 20 percent throughput advantage is uninteresting if the engine you picked is missing your TTFT SLO at p95, because the way that failure shows up in production is a chat product where users abandon the session before the first token lands.
It is also worth being honest about what an H100 benchmark in May 2026 is and is not. H100 SXM with HBM3 remains the workhorse for inference at this scale; H200 deployments are growing and B100/B200 (Blackwell) deployments are appearing at the largest providers, but the H100 is what most enterprises actually have, and what most published comparisons still target. The findings here are H100-specific. Blackwell changes the kernel landscape — FP4 throughput in particular — and the relative ordering of the three engines is being re-litigated on those parts as the engines’ Blackwell support matures. We will keep the focus on H100 and revisit Blackwell as a separate post when the maintainers’ numbers stabilise.
For an architectural deep dive into a complementary lever — generating multiple tokens per forward pass — see our speculative decoding LLM inference architecture analysis. If your interest is smaller models on smaller hardware, the edge LLM runtime benchmark across llama.cpp, MLC, and ONNX covers that regime.
Methodology
Answer-first summary: The benchmark harness drives a configurable load generator at three engines running identical model weights on identical H100 hardware, with version, kernel, and configuration pins documented. Three workload archetypes are exercised: interactive chat (short prompts, short outputs, no shared prefix), RAG (long shared system prompt + retrieved context, medium output), and agent traces (medium prompts, short structured outputs at high request rate). Metrics are emitted per request to JSONL and aggregated; DCGM telemetry is captured for GPU-side validation. Every number in this post should be read with the workload, the version pin, and the saturation level in mind.

Model selection
Three models are used. Llama-3.3 70B in FP16 and FP8 quantisation is the primary subject because it is the most-served dense model class in 2026 production deployments and because all three engines have mature support for it. Mixtral 8x22B is included as the Mixture-of-Experts case; expert routing and the resulting load imbalance change the picture for every engine, and the engine that wins on dense Llama may not win on MoE Mixtral. Qwen2.5 72B is included as a sanity check that observed differences are not a single-model artifact.
FP8 is the headline quantisation for H100 in 2026 because the H100 Tensor Cores execute it natively and because all three engines have shipped solid FP8 paths. FP16 is reported alongside for teams that have not finished the FP8 migration. INT4 weight-only quantisation is excluded because it favours small-batch latency regimes that this benchmark does not exercise; it deserves its own post.
Request distribution
The harness drives a Poisson arrival process at configurable mean request rate. Prompt and output length distributions are drawn from three sources: the public ShareGPT and LMSYS-Chat datasets for interactive chat shapes, a synthetic mix for RAG (a 1.2k-token retrieved-context prefix shared across a batch of related queries, followed by a 200- to 400-token completion), and an agent-trace shape that emits a tool-call structure (medium-length input, 50- to 120-token JSON output, high request rate to stress the scheduler). Real production traffic is heavier-tailed than any of these; the synthetic mix is a floor on the difficulty an engine will see in production, not a ceiling.
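The arrival process described above can be sketched in a few lines. This is a minimal illustration rather than the harness itself; the function name, the seed handling, and the example rate are assumptions made for the sketch.

```python
import random

def poisson_arrival_times(rate_per_s: float, duration_s: float, seed: int = 0) -> list:
    """Return request arrival timestamps (in seconds) for a Poisson process."""
    rng = random.Random(seed)  # seeded so a run can be replayed exactly
    t = 0.0
    arrivals = []
    while True:
        # Inter-arrival gaps of a Poisson process are exponentially distributed
        # with mean 1 / rate_per_s.
        t += rng.expovariate(rate_per_s)
        if t >= duration_s:
            return arrivals
        arrivals.append(t)

# Hypothetical example: a mean of 20 requests per second for one minute.
arrivals = poisson_arrival_times(rate_per_s=20.0, duration_s=60.0)
print(len(arrivals), "requests scheduled")
```

Seeding the generator matters more than it looks: replaying the identical arrival schedule against each engine removes one source of run-to-run noise from the comparison.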
Each engine is exercised at a sweep of request rates from idle through saturation. The interesting region is the one labelled “knee” — the point where queueing latency starts to dominate and tail latency blows up. Engines differ less in their peak throughput than in how gracefully they degrade past the knee.
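One simple way to locate the knee in a rate sweep is to flag the first request rate at which tail latency exceeds some multiple of its lightly loaded value. The sketch below assumes that definition; the 2x blow-up factor and the sample sweep are hypothetical, not measurements.

```python
from typing import List, Optional, Tuple

def find_knee(sweep: List[Tuple[float, float]], blowup: float = 2.0) -> Optional[float]:
    """sweep holds (request_rate, p99_latency_s) pairs.

    Returns the lowest rate whose p99 exceeds `blowup` times the p99 observed
    at the lowest rate in the sweep, or None if latency never blows up.
    """
    ordered = sorted(sweep)
    baseline_p99 = ordered[0][1]
    for rate, p99 in ordered:
        if p99 > blowup * baseline_p99:
            return rate
    return None

# Hypothetical sweep: latency is flat, then climbs once queueing dominates.
sweep = [(1, 0.20), (2, 0.21), (4, 0.23), (8, 0.30), (16, 0.55), (32, 2.40)]
print(find_knee(sweep))
```

A threshold rule like this is crude, but it is reproducible, which is what matters when the same sweep is re-run against three engines.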
Metrics
Five numbers matter, and they are easy to confuse.
Throughput is the steady-state rate of output tokens emitted across all concurrent requests, in tokens per second. It is the right metric for cost and capacity planning. The engine that wins throughput at saturation is the cheapest engine to run for batch and bulk workloads.
Time to First Token (TTFT) is the wall-clock time between request submission and the first token of the response leaving the server. It is dominated by prefill — the cost of computing the KV cache for the input prompt — and by how the engine schedules that prefill against ongoing decodes. TTFT is the user-perceived latency in chat, and it is where prefix caching has the largest leverage.
Inter-Token Latency (ITL) is the per-token decode latency after the first token. ITL is what determines streaming smoothness; high or jittery ITL produces visibly stuttering output even when TTFT is good. ITL is largely a function of how the engine batches the decode step.
End-to-end latency combines TTFT and total decode time. It is the right metric for short, non-streamed outputs (think classification, structured extraction, agent tool calls).
Tail latency — p95 and p99 — is the metric production SREs care about. p50 looks fine in every engine; the differences emerge in the tail under load.
For every metric, the engine version, the model checkpoint, the quantisation, the workload shape, and the saturation level must be reported together. A throughput number without those five anchors is folklore.
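The per-request definitions above reduce to a little arithmetic over token timestamps. The sketch below assumes the client records a submission time and one wall-clock timestamp per streamed token; the function names and the nearest-rank percentile are illustrative choices, not the harness's actual code.

```python
import math
from typing import Dict, List

def request_metrics(submitted: float, token_times: List[float]) -> Dict[str, float]:
    """Derive TTFT, mean ITL, and end-to-end latency for one request."""
    ttft = token_times[0] - submitted
    # ITL is the gap between consecutive tokens after the first one.
    gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    itl_mean = sum(gaps) / len(gaps) if gaps else 0.0
    return {"ttft": ttft, "itl_mean": itl_mean, "e2e": token_times[-1] - submitted}

def percentile(values: List[float], q: float) -> float:
    """Nearest-rank percentile for q in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]

def throughput(total_output_tokens: int, window_s: float) -> float:
    """Steady-state output tokens per second over a measurement window."""
    return total_output_tokens / window_s
```

Tail latency is then `percentile(ttfts, 95)` or `percentile(ttfts, 99)` over the per-request values, computed separately for each workload shape and saturation level.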
Hardware and software pins
The benchmark targets a single node with 8x NVIDIA H100 SXM 80GB connected by NVLink 4.0, driven by an EPYC host with enough PCIe lanes and CPU capacity that the host is not the bottleneck. CUDA, the driver, NCCL, and Triton versions are pinned per engine. The engine versions are recent stable as of May 2026: vLLM in the v0.7+ line, SGLang on its current main branch tag, and TensorRT-LLM at the version compatible with the matching Triton Inference Server release. Tensor parallelism is set to 8 for dense 70B models; Mixtral uses tensor parallelism plus expert parallelism, following each engine’s recommended configuration. KV-cache memory budget is sized identically across engines so that the comparison is not gamed by one engine being given more headroom.
The full pin set lives in a reproducibility.yaml manifest alongside the harness. The intent is that anyone with an 8x H100 node can re-run the harness and recover the qualitative picture, even if the exact absolute numbers shift with the next engine release.
DCGM telemetry — streaming multiprocessor utilisation, HBM bandwidth, power, and temperature — is captured per second. The DCGM data is the auditor of the request-level metrics: a throughput number that comes with 95 percent SM utilisation is credible; one that comes with 60 percent SM utilisation says the bottleneck was somewhere else (CPU, network, or the engine’s scheduler), and the comparison is suspect. This is one of the lessons of Stanford CRFM’s HELM serving benchmarks: per-request metrics without GPU-side validation are easy to misread.
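The audit step can be as simple as averaging the per-second SM utilisation samples for a run and refusing to trust throughput numbers that were not GPU-bound. In the sketch below the 0.90 cut-off is an assumption chosen to sit between the 95 percent and 60 percent cases described above, and the samples are assumed to have been exported from DCGM already as fractions in [0, 1].

```python
from typing import List

def classify_run(sm_util_samples: List[float], gpu_bound_at: float = 0.90) -> str:
    """Label a benchmark run from per-second SM utilisation samples in [0, 1]."""
    mean_util = sum(sm_util_samples) / len(sm_util_samples)
    if mean_util >= gpu_bound_at:
        return "gpu-bound"
    # Low utilisation means the limit was the CPU, the network, or the scheduler.
    return "suspect: bottleneck elsewhere"

print(classify_run([0.95, 0.96, 0.94]))
print(classify_run([0.61, 0.58, 0.60]))
```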
Engine Architectures Side by Side
Answer-first summary: All three engines converge on iteration-level (continuous) batching, in which the scheduler revisits the active batch every decode step rather than draining a fixed batch to completion. They differ in three load-bearing places: how the KV cache is organised, how prefix sharing is exploited, and how the compute kernels are structured and dispatched. vLLM popularised PagedAttention. SGLang built RadixAttention to make prefix reuse automatic and structural. TensorRT-LLM trades flexibility for compiled, fused kernels that run inside Triton with in-flight batching.
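A toy model makes the benefit of iteration-level batching concrete. The sketch below counts decode steps only and ignores prefill cost, memory limits, and every real scheduler detail; it is an illustration of the idea, not any engine's scheduler.

```python
from collections import deque
from typing import List

def static_batching_steps(output_lens: List[int], batch_size: int) -> int:
    """Each fixed batch runs until its longest sequence finishes."""
    steps = 0
    for i in range(0, len(output_lens), batch_size):
        steps += max(output_lens[i:i + batch_size])
    return steps

def continuous_batching_steps(output_lens: List[int], batch_size: int) -> int:
    """Free slots are refilled from the queue before every decode step."""
    queue = deque(output_lens)
    active: List[int] = []  # tokens still to emit for each running request
    steps = 0
    while queue or active:
        while queue and len(active) < batch_size:
            active.append(queue.popleft())
        # One decode iteration emits one token per active request;
        # requests that reach zero remaining tokens leave the batch.
        active = [remaining - 1 for remaining in active if remaining - 1 > 0]
        steps += 1
    return steps

lens = [2, 8, 2, 8]  # hypothetical output lengths, in tokens
print(static_batching_steps(lens, batch_size=2))      # 16
print(continuous_batching_steps(lens, batch_size=2))  # 12
```

With short and long outputs mixed, the static batch idles a slot while the long sequence finishes; the continuous scheduler refills that slot immediately, which is where the step savings come from.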

vLLM: PagedAttention and continuous batching
vLLM brought two ideas into the mainstream that the rest of the field has since absorbed. Continuous batching, an iteration-level scheduling technique that vLLM popularised rather than originated, revisits the active batch every decode iteration, swapping completed requests out and new prefills in without waiting for the longest sequence to finish. PagedAttention, which vLLM did introduce, organises the KV cache into fixed-size pages addressed by a per-request block table, which eliminates external fragmentation, confines internal fragmentation to each request’s last page, and lets the engine pack the cache far tighter than a contiguous allocator could. Together they raised the realistic throughput of open-source inference servers by several multiples when they landed, and they are now the de facto baseline against which every other engine is measured.
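The block-table idea behind PagedAttention can be sketched as a small allocator. This is a conceptual model only: the class name, the page size of 16 tokens, and the method names are assumptions for illustration and do not mirror vLLM's internal classes.

```python
from typing import Dict, List, Tuple

class PagedKVCache:
    """Toy page allocator: each request maps logical pages to physical pages."""

    def __init__(self, num_pages: int, page_size: int = 16) -> None:
        self.page_size = page_size
        self.free_pages: List[int] = list(range(num_pages))
        self.block_tables: Dict[str, List[int]] = {}
        self.lengths: Dict[str, int] = {}

    def append_token(self, request_id: str) -> Tuple[int, int]:
        """Reserve a KV slot for one more token; return (physical_page, offset)."""
        table = self.block_tables.setdefault(request_id, [])
        length = self.lengths.get(request_id, 0)
        if length % self.page_size == 0:
            # The last page is full (or the request has no page yet).
            if not self.free_pages:
                raise MemoryError("KV cache exhausted")
            table.append(self.free_pages.pop())
        self.lengths[request_id] = length + 1
        return table[-1], length % self.page_size

    def release(self, request_id: str) -> None:
        """Return all of a finished request's pages to the free list."""
        self.free_pages.extend(self.block_tables.pop(request_id, []))
        self.lengths.pop(request_id, None)
```

Because pages are claimed one at a time as a sequence grows, the only wasted space is the unused tail of each request's last page, and any free page can serve any request.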
In 2026, vLLM’s strengths are still legibility, broad model support, and community velocity. New models tend to land in vLLM first; new techniques (speculative decoding, chunked prefill, prefix caching, FP8 KV cache, Marlin-style INT4 kernels) tend to be available within weeks of upstream papers. The v0.6 and v0.7 releases sharpened scheduling and brought prefix caching into the core, narrowing the gap to SGLang on RAG-shaped workloads. For a team that wants one engine that runs almost any open-weights model reasonably well with minimal operational overhead, vLLM remains the safest default.
The gap vLLM still carries on H100 is the kernel pipeline. It can call into FlashAttention, FlashInfer, and Triton kernels, but it does not own the compiled, fused-graph pipeline that TensorRT-LLM does, and that costs it raw decode tokens-per-second at the very top of the throughput curve. Whether that gap matters depends entirely on whether your workload is throughput-bound or latency-bound.
SGLang: RadixAttention and structured programs
SGLang’s contribution is the structural treatment of prefix sharing. RadixAttention maintains a radix tree over active and recently completed KV caches, so any new request whose prompt shares a prefix with something already cached reuses that prefix automatically. The win is largest exactly where modern serving traffic has its largest source of redundancy: long system prompts, few-shot exemplar lists, retrieved-context prefixes for RAG, and agent conversations where the early turns are revisited many times.
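The lookup at the heart of prefix reuse can be illustrated with a plain token trie. This is a simplification made for the sketch: a real radix tree compresses runs of tokens into single edges, and SGLang's implementation also tracks the KV pages and eviction state attached to each node, none of which is modelled here.

```python
from typing import Dict, List

class PrefixCache:
    """Toy token trie: reports how much of a new prompt is already cached."""

    def __init__(self) -> None:
        self.root: Dict[int, dict] = {}

    def insert(self, tokens: List[int]) -> None:
        """Record a prompt whose KV cache has been computed."""
        node = self.root
        for tok in tokens:
            node = node.setdefault(tok, {})

    def match_len(self, tokens: List[int]) -> int:
        """Length of the longest cached prefix of `tokens`."""
        node, matched = self.root, 0
        for tok in tokens:
            if tok not in node:
                break
            node = node[tok]
            matched += 1
        return matched

cache = PrefixCache()
cache.insert([1, 2, 3, 4, 5])            # e.g. system prompt plus first query
print(cache.match_len([1, 2, 3, 9, 9]))  # 3 tokens of prefill can be skipped
```

Only the tokens past the matched prefix need prefill, which is why the benefit scales with how much of the traffic shares its opening tokens.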
In addition to the cache, SGLang exposes a Python DSL — gen, select, structured regex-constrained sampling — that lets application code describe its decoding plan to the engine. The engine can then schedule around that plan, batching parallel branches together and reusing partial completions. The pragmatic result is that SGLang’s TTFT under RAG and agent workloads tends to be the lowest of the three at moderate request rates, sometimes by a substantial margin, because so much of the prefill work is being skipped entirely. The maintainers’ own published numbers and the original RadixAttention writeup at LMSYS document the mechanism and show the cache-hit-driven speedups.
The trade-off is that SGLang has less mileage than vLLM across the long tail of less-common models, and its DSL — while elegant for teams that adopt it — is one more thing for an SRE to learn. On dense, prefix-light interactive chat, the cache advantage shrinks; SGLang remains competitive but the lead is smaller.
TensorRT-LLM: compiled engines, fused kernels, in-flight batching
TensorRT-LLM is NVIDIA’s serving stack, and it makes the bet that the path to peak H100 utilisation runs through ahead-of-time compilation, kernel fusion, and tight integration with the Triton Inference Server. Models are compiled into a TensorRT engine — a graph with fused Multi-Head Attention kernels, fused MLP blocks, FP8 or INT4 KV-cache support, CUDA Graphs for the decode loop, and NCCL-backed AllReduce paths tuned for NVLink. The serving runtime around the compiled engine implements in-flight batching, NVIDIA’s name for the same iteration-level batching pattern vLLM popularised.
The performance ceiling of TensorRT-LLM on H100 is, in published benchmarks from the NVIDIA TensorRT-LLM repository and from independent practitioners, the highest of the three for dense models at full saturation. The fused MHA kernel, the FP8 dataflow, and the CUDA Graphs for decode together extract a fraction of H100 throughput that the more flexible engines have not yet matched. For workloads that pin the GPU at saturation — bulk batch, regulated agent fleets, very-high-rate completion APIs — TensorRT-LLM is often the cheapest engine to operate, if you are willing to pay the operational cost.
The operational cost is the catch. TensorRT-LLM requires you to compile per model, per quantisation, per tensor-parallel degree, often per max-batch and max-sequence configuration. Engine builds take minutes to hours; iterating on a new model is slower than vLLM. Triton Inference Server is a separate moving part with its own configuration model. The runtime is NVIDIA-only by design. For teams whose differentiator is fast iteration across many models or whose fleet is heterogeneous, TensorRT-LLM’s productivity tax is a real consideration.
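The build matrix grows multiplicatively, which is easy to underestimate. The sketch below enumerates it for a hypothetical fleet; the two tensor-parallel layouts and the 30 minutes per build are assumptions for illustration, not measured build times.

```python
from itertools import product

models = ["llama-3.3-70b", "mixtral-8x22b", "qwen2.5-72b"]
quantisations = ["fp16", "fp8"]
tp_degrees = [4, 8]       # hypothetical: two tensor-parallel layouts
minutes_per_build = 30    # hypothetical average build time

# One compiled engine per (model, quantisation, TP degree) combination.
builds = list(product(models, quantisations, tp_degrees))
total_hours = len(builds) * minutes_per_build / 60
print(len(builds), "engine builds,", total_hours, "hours of build time")
```

Adding a max-batch or max-sequence dimension multiplies the count again, which is why the build pipeline usually ends up in CI rather than on a laptop.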
A useful mental model is that the three engines occupy different points on a flexibility-versus-peak-performance frontier. vLLM optimises for breadth and iteration speed. SGLang optimises for the structural reality of modern prompts. TensorRT-LLM optimises for last-percentage-point H100 efficiency at the cost of operational complexity. The 2026 leaderboard depends on which axis your workload pays for.
Results: Throughput, TTFT, ITL
Answer-first summary: Across community-reported 2026 results and the maintainers’ own published numbers, the qualitative picture is consistent. At saturation on dense Llama-3.3 70B FP8 on 8x H100, TensorRT-LLM typically leads peak throughput by 10–20 percent over vLLM, with SGLang close behind. SGLang wins TTFT on RAG-shaped workloads with shared prefixes, often by a wide margin. ITL is similar across the three at moderate load; under saturation, the engines diverge in how cleanly they queue. The numbers below are illustrative — not first-party measurements — and intended to communicate the shape of the trade-off.

The throughput curve has the same shape for all three engines. Tokens-per-second grows roughly linearly with offered request rate until the GPU saturates, after which the curve flattens and queueing latency climbs. The interesting region is the knee. TensorRT-LLM’s curve tends to climb fastest and plateau highest, reflecting the kernel-level efficiency of the compiled engine. SGLang’s curve tracks closely, and on workloads with substantial prefix sharing it can match or beat TensorRT-LLM because so much of the prefill is skipped. vLLM’s curve plateaus slightly lower in this regime — the gap is real but smaller than the loudest social-media benchmarks make it sound, and it has narrowed considerably with the v0.7 series.
The gap is largest in the tail of the saturation regime. TensorRT-LLM’s compiled decode loop tends to keep p99 ITL more stable as the engine pushes into 95 percent SM utilisation, because the kernel pipeline is more deterministic. vLLM and SGLang’s more dynamic schedulers spend more cycles on book-keeping at the limit, which shows up as wider ITL tails. For a chat product whose SLO is p95 ITL, this matters; for a batch summariser whose SLO is throughput, it does not.

The TTFT picture is where the engines differ most visibly. SGLang’s RadixAttention pays off the moment the workload has shared prefixes. The illustrative percentiles in Figure 4 show what RAG-shaped traffic looks like: SGLang’s p50 TTFT is the lowest, and crucially, the p95 and p99 stay tight because so many requests are hitting cached prefixes and skipping prefill entirely. TensorRT-LLM’s TTFT is competitive at p50 and p75 thanks to fast prefill kernels, but its tail can widen under load because it does not exploit cross-request prefix reuse the way SGLang does. vLLM with prefix caching enabled has closed much of the gap in the median, but tail TTFT under saturation remains the place where SGLang’s lead is most defensible.
There is a subtle methodological point worth flagging. TTFT comparisons are very sensitive to whether prefix caching is on, what the cache eviction policy is, and how much of the cache the harness was able to warm before measurement started. A benchmark that does not document those three knobs is not comparable. The illustrative numbers in this section assume warm caches and engine defaults; cold-start TTFT on first-touch RAG queries narrows SGLang’s lead.
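A back-of-envelope calculation shows why cache warmth moves TTFT so much. The sketch below uses the 1.2k-token shared prefix from the RAG workload; the 150-token unique query and the 90 percent hit rate are hypothetical, and the result is the share of prefill tokens skipped, which is only a rough proxy for TTFT.

```python
def prefill_fraction_skipped(prefix_tokens: int, unique_tokens: int, hit_rate: float) -> float:
    """Expected share of prompt tokens whose prefill is served from cache."""
    return hit_rate * prefix_tokens / (prefix_tokens + unique_tokens)

warm = prefill_fraction_skipped(prefix_tokens=1200, unique_tokens=150, hit_rate=0.9)
cold = prefill_fraction_skipped(prefix_tokens=1200, unique_tokens=150, hit_rate=0.0)
print(f"warm cache skips {warm:.0%} of prefill tokens, cold cache skips {cold:.0%}")
```

The same workload therefore looks very different depending on how long the harness warmed the cache before measuring, which is the reason the warm-up procedure belongs in the benchmark report.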
The Mixtral 8x22B story is more nuanced. MoE routing introduces load imbalance across experts, and how each engine handles that imbalance varies. SGLang and TensorRT-LLM both have specific paths for MoE that perform well; vLLM’s MoE support has matured but still has rougher edges at high tensor-parallel degree. The honest summary is that for MoE serving in 2026, the maintainers’ release notes and the model-specific benchmark posts (such as those for Mixtral and DeepSeek-V3-class models) are more reliable than aggregate comparisons.
Qwen2.5 72B’s relative ordering across the three engines tracks Llama-3.3 70B closely, which is the sanity check we wanted: the differences are engine-driven, not model-driven.
When Each Engine Wins
Answer-first summary: vLLM wins when the team values fast model onboarding, broad model coverage, and operational simplicity, especially in heterogeneous environments. SGLang wins when the workload has heavy shared prefixes — RAG with stable system prompts, agent traces revisiting conversational state, few-shot heavy prompting — or when TTFT under load is the binding SLO. TensorRT-LLM wins when the workload is throughput-bound at saturation on an NVIDIA-only fleet, when the team owns the operational cost of compiled engines, and when squeezing the last 10–20 percent of H100 utilisation translates directly into cost or capacity.

The decision tree in Figure 5 is the version of this comparison most teams actually need. Walk it with your workload’s dominant characteristics in mind. The first question is whether the prompts share substantial prefixes. RAG with a stable system prompt, document QA over a hot set of documents, customer-support agents revisiting conversation state, and few-shot heavy prompting all push hard in this direction. If the answer is yes, SGLang’s RadixAttention is doing structural work the other engines cannot replicate, and the decision is largely made.
If prefixes are not shared, the next question is whether the workload is latency-critical. Interactive chat with a strict TTFT SLO, code completion in an IDE, and conversational voice all live here. For latency-critical, NVIDIA-only fleets where the team is willing to own compiled engines and Triton operations, TensorRT-LLM is hard to beat on tail latency. For teams that do not want that operational burden, SGLang is the next best option, with vLLM’s recent v0.7 series closing more of the gap than people give it credit for.
If the workload is neither prefix-heavy nor latency-critical, the question becomes operational. Heterogeneous fleets, frequent model swaps, evaluation-heavy environments, and small teams running many models all favour vLLM by a wide margin. The cost of compiling a TensorRT engine for every new model is real, and it eats the productivity advantage that LLM-driven product teams care about. vLLM as a baseline with quarterly re-benchmarking is a perfectly defensible choice for the majority of mid-sized deployments.
Finally, for throughput-maximising offline workloads — summarisation pipelines, embedding-adjacent generation, content moderation classifiers, evaluation harnesses — TensorRT-LLM on a dedicated fleet tends to be the cheapest answer, because saturation throughput is what determines the bill and nothing else matters.
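The walk through the decision tree can be written down as a small function. This is one reading of the prose above, with hypothetical parameter names; real workloads rarely answer each question with a clean yes or no.

```python
def pick_engine(
    shared_prefixes: bool,
    latency_critical: bool,
    offline_throughput_bound: bool,
    can_own_compiled_engines: bool,
) -> str:
    """Suggest a starting engine from the workload's dominant traits.

    `can_own_compiled_engines` stands for an NVIDIA-only fleet plus a team
    willing to run engine builds and Triton operations.
    """
    if shared_prefixes:
        return "SGLang"
    if latency_critical:
        return "TensorRT-LLM" if can_own_compiled_engines else "SGLang"
    if offline_throughput_bound and can_own_compiled_engines:
        return "TensorRT-LLM"
    return "vLLM"

# Hypothetical example: a RAG endpoint with a stable system prompt.
print(pick_engine(True, True, False, False))
```

The output is a starting point for a benchmark on your own traffic, not a substitute for one.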
A few cross-cutting nuances are worth flagging. Multi-LoRA serving — running many fine-tuned adapters off a shared base model — is a feature where vLLM has historically led in production maturity; SGLang and TensorRT-LLM both support it now, but the operational ergonomics differ. Speculative decoding is supported in all three; the implementations and the achievable speedups vary, and our speculative decoding architecture analysis goes into the mechanism in depth. Structured output — JSON-constrained generation, grammar-guided sampling — is where SGLang’s DSL is the most native, vLLM uses outlines or xgrammar integrations, and TensorRT-LLM has shipped its own constrained decoding paths. If your agent fleet emits structured tool calls, evaluate this dimension carefully; the per-request overhead is non-trivial.
For observability of those agent traces across whichever engine you pick, the OpenTelemetry GenAI semantic conventions covered in our LLM agent observability post are the substrate that makes the per-request metrics in this benchmark go from “interesting” to “actionable.”
Trade-offs, Gotchas, and What Goes Wrong
Answer-first summary: The biggest gotcha is not the engine — it is the benchmark. Throughput numbers without arrival-process detail, TTFT numbers without prefix-cache disclosure, and ITL numbers without saturation level are misleading by default. Beyond that, the operational realities cluster into three: long-context behaviour degrades non-linearly, KV-cache memory budgets fragment in unexpected ways, and engine version churn moves the answer by enough each quarter that a one-time benchmark is a wasting asset.
Long-context regressions are the surprise that catches the most teams. An engine that wins decisively at 2k context can lose at 32k because attention compute during prefill grows quadratically with context length and exposes any weakness in the engine’s long-sequence kernels, the KV-cache memory budget tightens, and the scheduler’s choices about how to interleave prefill chunks change. All three engines have shipped chunked prefill in their recent releases, which mitigates this, but the right move is to benchmark your context distribution, not the public average.
KV-cache management is the second silent failure mode. The headline KV-cache figure (number of pages, total HBM allocated) is easy to compare. The behaviour under contention is not. vLLM’s eviction is paged and predictable. SGLang’s radix tree introduces a different access pattern that can fragment under pathological workloads. TensorRT-LLM’s KV management is tied to engine build parameters and can be inflexible if you under-sized the build. None of these are bugs; they are design choices with consequences. A two-week canary in production usually surfaces them.
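It helps to know how much HBM a single sequence's KV cache occupies before comparing eviction behaviour. The sketch below assumes the architecture values commonly cited for the Llama 3 70B family (80 layers, 8 KV heads under grouped-query attention, head dimension 128); treat them as assumptions and check the model's config file before relying on the output.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int, bytes_per_elem: int) -> int:
    """Bytes of KV cache per token: one key and one value vector per layer."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem

LLAMA_70B = dict(layers=80, kv_heads=8, head_dim=128)  # assumed config values
GIB = 1024 ** 3

for context_len in (2048, 32768):
    fp16_gib = context_len * kv_bytes_per_token(**LLAMA_70B, bytes_per_elem=2) / GIB
    fp8_gib = context_len * kv_bytes_per_token(**LLAMA_70B, bytes_per_elem=1) / GIB
    print(f"{context_len:>6} tokens: {fp16_gib:.3f} GiB (FP16 KV), {fp8_gib:.3f} GiB (FP8 KV)")
```

Under these assumptions one 32k-token sequence holds about 10 GiB of FP16 KV cache across the node, sixteen times the 2k case, which is why concurrency collapses at long context well before compute does.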
Engine version churn is the third. The trio above are all under heavy active development. A benchmark that says “vLLM lags TensorRT-LLM by 22 percent” in February may be wrong by July because vLLM landed a kernel update, or SGLang merged a scheduling change that flipped the ordering. The discipline that actually helps is to keep the benchmark harness, not the benchmark result. Re-run it quarterly. Track the trend, not the snapshot.
Two other things go wrong often enough to be worth naming. Network and tokeniser overhead can dominate at very high request rate; if you are serving an agent fleet emitting ten-token JSON outputs, you may be CPU-bound on tokenisation before any engine difference matters. And quantisation is not free — FP8 KV cache, in particular, can shift quality enough that an A/B against FP16 is mandatory before declaring the win.
Practical Recommendations
Answer-first summary: Start with vLLM as the default. Switch to SGLang if your traffic has heavy shared prefixes or if TTFT is the binding constraint. Switch to TensorRT-LLM only when you can justify the operational tax with measurable cost or capacity wins at saturation. Re-benchmark every quarter; the answer moves.
The following sequence has held up across the deployments we have seen.
For platform teams
– Stand up vLLM first. It will be running production traffic in days rather than weeks, and you will learn your workload’s actual shape from a real engine before you optimise.
– Build a reproducible benchmark harness that captures the metrics in the methodology section above. Pin every version. Commit the manifest. The harness is the long-lived artifact, not the numbers.
– Keep a swap-able engine adapter at the serving layer so switching engines is a config flip, not a rewrite.
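The adapter idea can be kept very small. In the sketch below every name (`EngineAdapter`, `EchoAdapter`, `ADAPTERS`, `make_engine`) is hypothetical, and the only backend is a stand-in that echoes words; real adapters would wrap each engine's own client behind the same method.

```python
from typing import Callable, Dict, Iterator, Protocol

class EngineAdapter(Protocol):
    """The one interface application code is allowed to depend on."""

    def generate(self, prompt: str, max_tokens: int) -> Iterator[str]: ...

class EchoAdapter:
    """Stand-in backend for tests: streams back the prompt's own words."""

    def generate(self, prompt: str, max_tokens: int) -> Iterator[str]:
        for word in prompt.split()[:max_tokens]:
            yield word

# Registry keyed by a config value; adapters for real engines would register here.
ADAPTERS: Dict[str, Callable[[], EngineAdapter]] = {"echo": EchoAdapter}

def make_engine(name: str) -> EngineAdapter:
    """Build the backend named in configuration."""
    return ADAPTERS[name]()

engine = make_engine("echo")
print(list(engine.generate("switching engines is a config flip", max_tokens=3)))
```

Keeping the interface to a single streaming method is a deliberate choice: the narrower the contract, the less each engine swap costs.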
For product engineers
– Define the latency SLO in TTFT and ITL terms before picking the engine. “Fast enough” is not a spec.
– Instrument per-request: TTFT, ITL, total latency, prompt length, output length, cache-hit (if exposed). The aggregate metrics are post-hoc; the per-request log is what lets you debug.
– Treat structured-output overhead as a first-class metric. Constrained decoding is not free; measure it.
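A per-request log line covering the fields above might look like the following. The record layout and field names are assumptions for illustration; the point is that each request produces one self-describing JSON line that can be aggregated later.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional

@dataclass
class RequestRecord:
    request_id: str
    ttft_s: float
    itl_mean_s: float
    e2e_s: float
    prompt_tokens: int
    output_tokens: int
    cache_hit: Optional[bool] = None    # None when the engine does not expose it
    constrained_decoding: bool = False  # lets structured-output cost be split out

def to_jsonl_line(record: RequestRecord) -> str:
    """Serialise one request as a single JSON line."""
    return json.dumps(asdict(record), sort_keys=True)

# Hypothetical values for one agent tool call.
line = to_jsonl_line(RequestRecord("req-1", 0.18, 0.021, 1.9, 812, 84, cache_hit=True))
print(line)
```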
For SREs
– Track DCGM telemetry alongside request metrics. The ratio of throughput to SM utilisation is the leading indicator of which engine knob to turn next.
– Set capacity at 70 percent of saturation, not 90. Tail latency below the knee is what your users feel; the headroom you leave between your operating point and the knee is what absorbs traffic spikes.
– Re-benchmark quarterly. File a calendar reminder now.
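The capacity rule translates directly into a node count. In the sketch below the peak demand and the per-node saturation throughput are hypothetical numbers; substitute the saturation figure from your own harness run.

```python
import math

def nodes_needed(
    peak_tokens_per_s: float,
    node_saturation_tokens_per_s: float,
    target_utilisation: float = 0.70,
) -> int:
    """Nodes required so peak demand lands at the target share of saturation."""
    usable_per_node = node_saturation_tokens_per_s * target_utilisation
    return math.ceil(peak_tokens_per_s / usable_per_node)

# Hypothetical: 50,000 output tokens/s at peak, 12,000 tokens/s per node at saturation.
print(nodes_needed(50_000, 12_000))                           # 6 nodes at 70 percent
print(nodes_needed(50_000, 12_000, target_utilisation=0.90))  # 5 nodes at 90 percent
```

In this example the extra node is the price of staying below the knee, and it is usually cheaper than the tail-latency incidents that come with running at 90 percent.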
FAQ
Which is faster: vLLM, SGLang, or TensorRT-LLM?
None of them is uniformly faster. On dense Llama-3.3 70B FP8 on 8x H100 at saturation, community-reported and maintainer-published benchmarks generally put TensorRT-LLM ahead of vLLM on peak throughput by 10–20 percent, with SGLang close behind TensorRT-LLM. SGLang typically wins TTFT on workloads with shared prefixes (RAG, agent traces, few-shot prompting) by a margin that often exceeds the throughput gap. vLLM trails on peak numbers but leads on operational simplicity and model coverage. The right answer depends entirely on the workload shape and the latency SLO.
Is TensorRT-LLM always the fastest engine on H100?
At full saturation on dense models with stable prompts, the kernel-level efficiency of TensorRT-LLM tends to put it at or near the top of the throughput curve. It is not fastest when the workload has heavy shared prefixes — SGLang’s RadixAttention can skip enough prefill work that it beats TensorRT-LLM on effective throughput and TTFT in that regime. It is also not the fastest path to getting a new model into production, where vLLM still leads by a wide margin.
What is the difference between continuous batching and in-flight batching?
They are the same idea under different names. Both revisit the active batch every decode iteration, swapping completed requests out and new prefills in without waiting for the longest sequence to finish. vLLM popularised the technique under the name “continuous batching”; NVIDIA’s TensorRT-LLM ships it as “in-flight batching.” SGLang implements the same pattern. The differences across engines are not in the existence of iteration-level batching but in how it is implemented and how the KV cache is managed underneath.
Does prefix caching make SGLang always faster?
Prefix caching helps proportionally to how much of your traffic shares prefixes. For RAG with a stable system prompt and recurring retrieved-context patterns, the speedup is substantial — sometimes more than 2x on TTFT. For interactive chat where every conversation is fresh, the gain is smaller. vLLM has shipped prefix caching in recent releases, narrowing the gap on workloads SGLang used to dominate; the difference now is more about the structural integration of the cache than the existence of one.
Can I run all three engines on the same H100 cluster?
Yes, and many teams do. The common pattern is to run vLLM as the default engine for the long tail of models, SGLang for the workloads where prefix sharing dominates (RAG endpoints, agent runtimes), and TensorRT-LLM on a dedicated pool for the throughput-maximising endpoints. The complexity tax is operational: three engines means three sets of capacity plans, monitoring, and on-call runbooks, and it pays off only if the workload split is large enough to justify the overhead. Most mid-sized deployments rationalise to one or two engines after a year.
Further Reading
Internal:
- AI & Machine Learning pillar — the broader 2026 AI infrastructure picture.
- Speculative Decoding LLM Inference Architecture (2026) — a complementary throughput lever covered in depth.
- Edge LLM Runtime Benchmark: llama.cpp vs MLC vs ONNX (2026) — the same methodology applied to smaller models on smaller hardware.
- LLM Agent Observability with OpenTelemetry GenAI Conventions (2026) — the per-request telemetry substrate that makes the metrics in this post production-useful.
External:
- vLLM project repository and v0.7+ release notes — https://github.com/vllm-project/vllm
- SGLang RadixAttention writeup at LMSYS — https://lmsys.org/blog/2024-01-17-sglang/
- NVIDIA TensorRT-LLM repository and performance docs — https://github.com/NVIDIA/TensorRT-LLM
- NVIDIA H100 Tensor Core GPU product page — https://www.nvidia.com/en-us/data-center/h100/
- Stanford CRFM HELM serving benchmarks — https://crfm.stanford.edu/helm/
