Speculative Decoding for LLM Inference: Architecture (2026)
Production LLM serving in 2026 has a stubborn arithmetic problem at its core. A 70-billion-parameter model running on a single H100 is bottlenecked on memory bandwidth, not compute: the GPU spends most of its decode step shuttling weights from HBM3 to the streaming multiprocessors and then sitting half-idle while it waits. Per-token latency for a single user at batch size one is dominated by that one round-trip, and no amount of FlashAttention or quantization changes the fundamental shape of the curve. Speculative decoding has, by mid-2026, become the dominant escape hatch from this trap. vLLM 0.7+, SGLang’s spec executor, and TensorRT-LLM’s Medusa and EAGLE plugins all ship it as a first-class scheduler mode, and the headline numbers (2x to 4x lower time-per-output-token at batch sizes where conventional optimizations have run out of room) are now reproducible across H100, MI300X, and B200 deployments.
The technique itself is older than its current popularity suggests: Leviathan, Kalman, and Matias at Google Research and Chen, Borgeaud, and the DeepMind group both published the foundational papers in late 2022 and early 2023. What changed in 2024 and 2025 was implementation. EAGLE-2 made the draft model good enough that acceptance rates above 0.8 became routine; Medusa removed the separate draft model entirely; and the major serving engines finally figured out how to compose speculative decoding with continuous batching and PagedAttention without losing the throughput properties that made those systems fast in the first place. This post takes the pattern apart.
What Speculative Decoding Actually Does
Answer-first summary: Speculative decoding accelerates autoregressive LLM generation by using a cheap “draft” model to propose K tokens ahead, then running the expensive “target” model once to verify all K in parallel. Tokens that the target would have produced anyway are accepted for free; mismatches are corrected via a rejection-sampling rule that provably preserves the target’s output distribution. The win comes from converting K sequential target forwards into a single batched one, exchanging extra draft work for fewer target round-trips.
The core insight is that LLM decoding is sequential not because the math demands it but because each token is conditioned on its predecessor. If you could guess the next K tokens well enough, you could verify them all in a single forward pass through the target — and a single forward pass costs about the same as the one you would have done anyway, because the target is memory-bound and underutilized at small effective batch sizes. The expensive thing about a 70B-class model is loading the weights, not multiplying them by a tensor that happens to have 1 row versus 5 rows.
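The memory-bound argument can be checked with a back-of-envelope roofline. The sketch below is illustrative only: the hardware figures are approximate public peak numbers, the 8-bit weight format is an assumption made so that a 70B model fits on one GPU, and KV cache and activation traffic are ignored.

```python
# Back-of-envelope roofline for one decode step. Every constant is an assumption.
PARAMS = 70e9            # 70B-parameter target
BYTES_PER_PARAM = 1.0    # assumes 8-bit weights so the model fits on one GPU
HBM_BANDWIDTH = 3.35e12  # bytes/s, approximate H100 SXM peak
PEAK_FLOPS = 989e12      # FLOP/s, approximate dense 16-bit tensor-core peak


def weight_load_ms() -> float:
    """Time to stream every weight from HBM once, in milliseconds."""
    return PARAMS * BYTES_PER_PARAM / HBM_BANDWIDTH * 1e3


def matmul_ms(rows: int) -> float:
    """Ideal compute time for `rows` token positions (about 2 FLOPs per weight per row)."""
    return 2 * PARAMS * rows / PEAK_FLOPS * 1e3


def step_ms(rows: int) -> float:
    """Roofline estimate: the step takes as long as the slower resource."""
    return max(weight_load_ms(), matmul_ms(rows))


if __name__ == "__main__":
    for rows in (1, 5):
        print(f"rows={rows}: load={weight_load_ms():.1f} ms, "
              f"compute={matmul_ms(rows):.2f} ms, step={step_ms(rows):.1f} ms")
```

Under these assumptions the weight load takes roughly 21 ms while the arithmetic for five rows takes well under 1 ms, which is why verifying several positions costs about the same as decoding one.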
So you pay the loading cost once, verify K candidate tokens in parallel, and accept whichever prefix matches what the target would have sampled. The Leviathan and Chen formulations both prove that with the right acceptance rule, the resulting sequence is distributionally identical to standard sampling from the target — speculation is exact, not approximate. That distinction is what makes the technique safe to deploy without a model-quality regression test on every release.
The cost you trade is draft work plus verification overhead. If acceptance rates are high enough, the average number of accepted tokens per target forward exceeds one and you win. If they are low, you have done extra work for nothing and you lose. The whole engineering discipline of speculative decoding in 2026 is about pushing acceptance rates up and verification overhead down, and about knowing when to turn the whole machinery off.
The Draft/Target Architecture
Answer-first summary: The classical draft/target architecture pairs a small autoregressive draft model that proposes K tokens with a large target model that verifies them in a single batched forward pass. A per-token acceptance test compares the draft’s probability q(x) to the target’s p(x) and accepts each token with probability min(1, p(x)/q(x)); on rejection, the algorithm samples a replacement from the residual distribution, max(0, p - q) renormalized over the vocabulary, and stops the chain for that step. The emitted tokens are distributed exactly as if they had been sampled one at a time from the target.

The diagram above shows the round-trip in detail. A scheduler hands the draft model the prefix’s KV cache state at decode step n. The draft runs K autoregressive forward passes (cheap, because the draft is typically 10x to 100x smaller than the target) and emits a chain of K candidate tokens along with their per-position probability distributions q(x). Those K tokens are then concatenated onto the prefix and fed to the target in a single forward pass; thanks to causal masking, the target produces a probability distribution p(x) for every position in the chain in one shot. The sampler walks left to right, accepts or rejects each candidate, and emits the accepted prefix plus one final token from the target: a replacement drawn from the residual distribution if a draft token was rejected, or a “bonus” token drawn from the target’s distribution at position K+1 if all K were accepted. The KV cache advances by (m + 1) where m is the number of accepted draft tokens, and the loop restarts.
That final token is not a quirk: it falls out of the proof. If the very first draft token is rejected, the target still supplies its replacement, so an iteration never yields fewer tokens than standard decoding would. If the chain is rejected after m < K acceptances, the token at position m+1 comes from the residual distribution; if all K are accepted, the target’s distribution at position K+1 supplies a free bonus token. In practice, each iteration emits between 1 and K+1 tokens, with the steady-state mean tied directly to the acceptance rate alpha.
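The round-trip can be sketched with toy greedy models. This is a minimal illustration, not any engine’s implementation: `toy_target` and `toy_draft` are made-up stand-ins, the target’s single batched forward pass is emulated by a Python loop, and only the greedy (token-equality) variant of the acceptance test is shown.

```python
from typing import Callable, List, Sequence

NextToken = Callable[[Sequence[int]], int]  # greedy "model": context -> next token id


def speculate_step(prefix: List[int], draft: NextToken, target: NextToken, k: int) -> List[int]:
    """Run one greedy speculation iteration and return the tokens to commit."""
    # 1. The draft proposes k tokens autoregressively.
    proposed: List[int] = []
    for _ in range(k):
        proposed.append(draft(prefix + proposed))

    # 2. The target scores every position. A real engine does this in ONE
    #    batched forward pass; this loop only emulates the per-position outputs.
    target_choice = [target(prefix + proposed[:i]) for i in range(k + 1)]

    # 3. Keep the longest matching prefix, then take one token from the target:
    #    a correction after a mismatch, or a bonus token if all k matched.
    committed: List[int] = []
    for i in range(k):
        if proposed[i] != target_choice[i]:
            break
        committed.append(proposed[i])
    committed.append(target_choice[len(committed)])
    return committed


def generate(prefix: List[int], draft: NextToken, target: NextToken,
             k: int, n_tokens: int) -> List[int]:
    """Generate n_tokens by repeating speculation iterations."""
    out = list(prefix)
    while len(out) - len(prefix) < n_tokens:
        out.extend(speculate_step(out, draft, target, k))
    return out[: len(prefix) + n_tokens]


# Hypothetical toy models: the draft agrees with the target except at some positions.
def toy_target(ctx: Sequence[int]) -> int:
    return (ctx[-1] * 7 + 3) % 11


def toy_draft(ctx: Sequence[int]) -> int:
    return toy_target(ctx) if len(ctx) % 4 else 0
```

Because every committed token is either a draft token the target agreed with or a token the target chose itself, `generate` returns the same sequence as plain greedy decoding with `toy_target`, only in fewer target iterations.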
Draft model choice
The draft model is the single most consequential design decision. There are three families in production use in 2026.
Same-family small siblings. Llama 3.1 8B drafting for Llama 3.1 70B is the canonical example; same tokenizer, similar pretraining distribution, similar style preferences. Acceptance rates of 0.65 to 0.8 are typical on chat and code workloads. The downside is that an 8B draft costs roughly 8 to 10 percent of the target’s per-token cost — non-trivial — and it has its own memory footprint to schedule around.
Distilled or pruned drafts. Several vendors ship purpose-distilled drafts at 0.5B to 1.5B parameters that are explicitly trained to mimic the target’s next-token distribution. The acceptance rate is comparable to a same-family sibling but the draft cost drops by another 3x to 5x. This is the configuration TensorRT-LLM’s reference recipes default to.
Feature-conditioned heads. EAGLE and EAGLE-2 take a different path: instead of a standalone draft model, they train a small module that takes the target’s own hidden states (last-layer features) and an embedding of the previous token, and predicts the next token. Because the head sees features that already encode the target’s full computation, acceptance rates climb to 0.85 to 0.92 on the same workloads. We will treat EAGLE separately in the next section, but it is worth flagging here as the modern default when you can afford the training step.
The wrong choice is a draft from a different model family with a different tokenizer. Tokenizer mismatches force you to round-trip through detokenized text, which kills latency and breaks the math.
Target model verification
The target’s job during verification is to run one forward pass on the prefix plus K appended draft tokens and return the full per-position probability distribution. Two things to know.
First, the verification pass is not more expensive than a normal decode step at the same effective batch size, if you do it right. The KV cache for the prefix is reused; only the K new positions need fresh attention and MLP computation. On an H100 running a 70B model, the verification pass for K=5 measured around 1.08x the wall-clock of a K=1 decode step in published vLLM benchmarks. An 8 percent overhead for 5x potential parallelism is the deal that makes the whole technique pay.
Second, continuous batching changes the picture. If you are already running 32 concurrent requests through continuous batching, the target is no longer memory-bound at the iteration level — it has plenty of work to do per weight load — and the marginal cost of K extra positions starts looking like real compute, not free verification. This is the single biggest reason speculation can hurt at high batch sizes, and we return to it in the latency-vs-throughput section.
Rejection sampling and exactness
The acceptance rule is what makes speculative decoding exact rather than approximate. For each draft token t with draft probability q(t) and target probability p(t):
- Draw u uniformly from [0, 1].
- Accept t if u <= min(1, p(t) / q(t)).
- On rejection, sample a replacement from the residual distribution normalized from max(0, p(x) - q(x)) over the vocabulary, and stop the chain.
Leviathan et al. and Chen et al. independently prove that the resulting marginal distribution over output tokens is identical to sampling from p directly. The proof is short and worth reading once if you are responsible for production correctness, because the rule is what lets you ship speculation without an offline eval regression and without giving the security and safety teams a new question to ask about output equivalence. Greedy decoding falls out as the limit case: if both draft and target are greedy, acceptance reduces to a token-equality check, which is what many production setups actually run because it is faster and the sampling-temperature path is rarely exercised in latency-sensitive workloads.
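The acceptance rule fits in a few lines of plain Python. The sketch below is an illustration under stated assumptions, not a production sampler: `verify_token` and `emitted_frequencies` are names invented here, the distributions are small hand-written lists, and the comparison uses a strict `<` because `random.random()` draws from [0, 1).

```python
import random
from typing import List, Sequence, Tuple


def residual(p: Sequence[float], q: Sequence[float]) -> List[float]:
    """Normalized max(0, p - q): the distribution to sample from after a rejection."""
    raw = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    total = sum(raw)
    # total == 0 only when p == q, where a rejection cannot occur; fall back to p.
    return [r / total for r in raw] if total > 0 else list(p)


def verify_token(token: int, p: Sequence[float], q: Sequence[float],
                 rng: random.Random) -> Tuple[int, bool]:
    """Apply the acceptance rule to one draft token. Returns (emitted_token, accepted)."""
    if rng.random() < min(1.0, p[token] / q[token]):
        return token, True
    replacement = rng.choices(range(len(p)), weights=residual(p, q))[0]
    return replacement, False


def emitted_frequencies(p: Sequence[float], q: Sequence[float],
                        n: int, seed: int = 0) -> List[float]:
    """Draw a draft token from q, verify it against p, and tally what is emitted."""
    rng = random.Random(seed)
    counts = [0] * len(p)
    for _ in range(n):
        draft_token = rng.choices(range(len(q)), weights=q)[0]
        token, _ = verify_token(draft_token, p, q, rng)
        counts[token] += 1
    return [c / n for c in counts]


if __name__ == "__main__":
    p = [0.5, 0.3, 0.15, 0.05]      # target distribution (made up)
    q = [0.25, 0.25, 0.25, 0.25]    # draft distribution (made up)
    print(emitted_frequencies(p, q, 20000))  # should land close to p
```

Even though every candidate comes from the uniform draft, the emitted frequencies track the target distribution, which is the exactness property in miniature.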
The acceptance rate alpha — the expected fraction of draft tokens accepted per step — is the single number that determines whether speculation pays. We will derive the speedup formula in section 5.
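For a single position, the probability that a draft token survives the acceptance test has a simple closed form: summing q(x) * min(1, p(x)/q(x)) over the vocabulary gives the sum of min(p(x), q(x)). The helper below is a small sketch of that identity with made-up distributions; the function name is an assumption for illustration.

```python
from typing import Sequence


def acceptance_probability(p: Sequence[float], q: Sequence[float]) -> float:
    """Probability that a token drawn from q passes the min(1, p/q) test."""
    return sum(min(pi, qi) for pi, qi in zip(p, q))


if __name__ == "__main__":
    target = [0.5, 0.3, 0.15, 0.05]
    uniform_draft = [0.25, 0.25, 0.25, 0.25]
    close_draft = [0.45, 0.35, 0.15, 0.05]
    print(acceptance_probability(target, uniform_draft))
    print(acceptance_probability(target, close_draft))
```

The closer the draft distribution tracks the target, the larger the overlap and the higher the acceptance rate, which is the quantity every draft-design choice in this post is trying to raise.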
Newer Variants: EAGLE-2, Medusa Heads, Self-Speculative
Answer-first summary: The post-2023 wave replaced the standalone draft model with cheaper, tighter-coupled alternatives. EAGLE and EAGLE-2 attach a small autoregressive head to the target’s last-layer features and use a dynamic draft tree to propose multiple candidate continuations per step. Medusa adds N independent heads to the target, each predicting tokens at increasing offsets, and verifies via tree attention. Self-speculative decoding skips the auxiliary entirely by running the target with some of its own layers skipped. All three converge on the same insight: the draft does not need to be a separate model if you can borrow signal from the target itself.

The vanilla pattern from section 3 has two structural costs that the newer variants attack directly. First, the draft model is a separate set of weights to load, schedule, and version — operationally annoying and memory-hungry. Second, the K-token chain is linear: one rejection kills the rest of the chain, so the marginal value of larger K diminishes quickly even when acceptance is decent.
EAGLE (Li et al., 2024) keeps the draft autoregressive but moves it inside the target’s representational space. The head, typically a single transformer layer with shared embeddings, takes the target’s last-layer hidden state at position t together with the embedding of the token sampled at that position, and outputs a distribution over the token after it. Because the head sees features that already encode the target’s full attention and MLP computation, it is dramatically better calibrated than a separate small LLM. Reported acceptance rates climb from the 0.6-0.75 range typical of vanilla drafts to 0.85-0.92 on the same chat and code benchmarks.
EAGLE-2 (arXiv:2406.16858) adds a second crucial idea: a dynamic draft tree. Instead of proposing a linear chain of K tokens, the head explores top-k branches at each step, pruned by the head’s own confidence, producing a tree with a few dozen candidate continuations. The target verifies the whole tree in a single forward pass using a tree attention mask — a lower-triangular block structure that lets every branch attend to its own ancestors but not to siblings. The verifier picks the longest prefix that survives the acceptance rule along any branch. Published EAGLE-2 numbers on Llama-2-70B-Chat at temperature 0 show 3x to 4x wall-clock speedup at batch size 1, with the gap to vanilla draft-target widening as the workload’s predictability rises.
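The tree attention mask is easy to build once the draft tree is flattened into a list of parent pointers. The sketch below assumes that representation (it is a common one, but not taken from any specific engine) and covers only the draft-to-draft block of the mask; every draft node also attends to the whole committed prefix.

```python
from typing import List


def tree_attention_mask(parents: List[int]) -> List[List[bool]]:
    """mask[i][j] is True when draft node i may attend to draft node j.

    parents[i] is the index of node i's parent, or -1 if the node hangs
    directly off the committed prefix. Parents must precede their children.
    """
    n = len(parents)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        j = i
        while j != -1:           # walk up to the root, marking self and ancestors
            mask[i][j] = True
            j = parents[j]
    return mask


if __name__ == "__main__":
    # Nodes 0 and 1 are two candidates for the first position;
    # nodes 2 and 3 continue node 0, node 4 continues node 1.
    for row in tree_attention_mask([-1, -1, 0, 0, 1]):
        print("".join("x" if allowed else "." for allowed in row))
```

Because parents come before children in the flattened order, the mask stays lower-triangular, and siblings never see each other.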
Medusa (arXiv:2401.10774) takes the simpler-from-an-ops-standpoint route. There is no separate draft model and no draft KV cache. Instead, the target is fine-tuned with N additional output heads — Medusa heads 1 through N — that each predict tokens at offsets +1, +2, …, +N from the current position, conditioned on the target’s last hidden state. Verification uses the same tree-attention trick: the heads collectively propose a candidate tree, the target’s normal next-token head verifies, and the longest accepted prefix is committed. Medusa-2 (Cai et al., 2024) extends this to jointly fine-tune the heads with the backbone for better acceptance. Reported speedups are in the 2x to 2.8x range — below EAGLE-2 in raw numbers — but Medusa wins on operational simplicity because there is exactly one model artifact to deploy and one set of weights to manage.
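One simple way to turn per-head predictions into a candidate tree is to take the top few tokens from each head and combine them. The sketch below assumes that Cartesian-product construction and uses invented token ids; real deployments typically prune the tree rather than enumerate every combination.

```python
from itertools import product
from typing import List, Sequence, Tuple


def candidate_paths(head_topk: Sequence[Sequence[int]]) -> List[Tuple[int, ...]]:
    """All full-length continuations formed by picking one token per head."""
    return list(product(*head_topk))


def tree_size(head_topk: Sequence[Sequence[int]]) -> int:
    """Number of tree nodes the target must verify: distinct prefixes at every depth."""
    size, width = 0, 1
    for options in head_topk:
        width *= len(options)
        size += width
    return size


if __name__ == "__main__":
    # Hypothetical top-k token ids from three heads (offsets +1, +2, +3).
    heads = [[11, 12], [21, 22, 23], [31]]
    print(len(candidate_paths(heads)), "paths,", tree_size(heads), "tree nodes")
```

The node count grows multiplicatively with the per-head top-k, which is why the tree has to stay small enough that verification remains close to the cost of a single decode step.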
Self-speculative decoding (Zhang et al., LayerSkip, and several follow-ups) skips the auxiliary entirely. The “draft” is the target model run with some intermediate layers bypassed — typically the middle third — and the “verification” is the full target. Acceptance rates are modest (often 0.5-0.65) but the operational simplicity is unmatched: no extra weights at all. The 2026 case for self-speculative is mostly on memory-constrained edge deployments where the draft has to share weights with the target, which we cover in the edge LLM runtime benchmark.
A pragmatic 2026 ordering for new deployments: try EAGLE-2 first if you can afford the head training step (a few GPU-days on the target’s pretraining-style data); fall back to Medusa if you want zero-extra-model operational simplicity; use vanilla draft-target only when you have an existing small-sibling model in hand and no time to train; use self-speculative when memory is the binding constraint and you can tolerate weaker acceptance.
Latency vs Throughput Math
Answer-first summary: The effective speedup of speculative decoding follows a closed-form expression in the acceptance rate alpha, the chain length K, and the draft-to-target cost ratio c. Speedup peaks at moderate K (typically 4-7) and falls off as alpha drops or batch sizes rise. The math says speculation wins decisively at batch size 1 with alpha > 0.7, thins to a margin that real-world overheads can erase once alpha falls toward 0.4 to 0.5, and loses at high batch sizes regardless of alpha because the target stops being memory-bound.

The classical analysis from Leviathan et al. gives the expected number of tokens produced per iteration as E[m] = (1 - alpha^(K+1)) / (1 - alpha), where alpha is the per-token acceptance rate and K is the chain length; the count includes the one token the target always contributes. The expected cost per iteration, measured in target forwards, is approximately 1 unit (one verification pass) plus K*c units of draft cost, where c is the ratio of one draft forward to one target forward. Putting them together, the effective speedup over standard decoding is:
speedup = E[m] / (1 + K*c)
That formula explains most of what practitioners observe. At alpha = 0.7, K = 5, c = 0.1: E[m] = (1 - 0.7^6) / 0.3 ≈ 2.94, divided by 1.5, gives 1.96x, squarely in the published range for Llama-3-70B with an 8B draft on H100. At alpha = 0.9 with EAGLE’s tiny head (c ≈ 0.02), the same formula gives E[m] ≈ 4.69 divided by 1.1, or 4.26x, an idealized ceiling that sits just above the 3x to 4x published for EAGLE-2. At alpha = 0.4 the formula drops to E[m] ≈ 1.66, divided by 1.5, gives 1.11x: a margin thin enough that verification and scheduling overhead, which the formula ignores, can erase it.
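The formula is simple enough to evaluate directly. The sketch below is a plain transcription of E[m] and the speedup expression, with function names chosen here for illustration; it models the idealized case only and ignores verification and scheduling overhead.

```python
def expected_tokens(alpha: float, k: int) -> float:
    """E[m] = (1 - alpha^(K+1)) / (1 - alpha): tokens produced per iteration."""
    if alpha >= 1.0:
        return float(k + 1)  # limit of the geometric sum when every token is accepted
    return (1 - alpha ** (k + 1)) / (1 - alpha)


def speedup(alpha: float, k: int, c: float) -> float:
    """Idealized speedup: tokens per iteration over cost per iteration (1 + K*c)."""
    return expected_tokens(alpha, k) / (1 + k * c)


if __name__ == "__main__":
    for alpha, k, c in [(0.7, 5, 0.1), (0.9, 5, 0.02), (0.4, 5, 0.1)]:
        print(f"alpha={alpha}, K={k}, c={c}: "
              f"E[m]={expected_tokens(alpha, k):.2f}, speedup={speedup(alpha, k, c):.2f}x")
```

Running it for your own measured alpha and c is a quick sanity check before committing to a configuration.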
The chart in Figure 3 shows the curve. Two features are worth flagging. First, the curve is sharply convex in alpha: every 0.1 you push acceptance up is worth more than the previous 0.1. This is why the engineering effort spent on draft-target alignment, EAGLE head training, and Medusa fine-tuning pays so disproportionately. Second, K has a sweet spot. For any given alpha and c, there is an optimal K that maximizes speedup: too short and you do not amortize the target’s verification pass, too long and you pay draft cost for tokens that an early rejection throws away. Production schedulers in vLLM and SGLang tune K per-request based on recent acceptance history.
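The sweet spot in K can be found by brute force over the same idealized formula. This is a sketch, with `best_k` and its `max_k` cap introduced here as assumptions; because the formula charges nothing extra for verifying longer chains, it overstates the best K when c is very small.

```python
def speedup(alpha: float, k: int, c: float) -> float:
    """Idealized speedup E[m] / (1 + K*c) for acceptance rate alpha in [0, 1)."""
    expected = (1 - alpha ** (k + 1)) / (1 - alpha)
    return expected / (1 + k * c)


def best_k(alpha: float, c: float, max_k: int = 16) -> int:
    """Chain length in 1..max_k that maximizes the idealized speedup."""
    return max(range(1, max_k + 1), key=lambda k: speedup(alpha, k, c))


if __name__ == "__main__":
    for alpha in (0.4, 0.6, 0.7, 0.8):
        k = best_k(alpha, c=0.1)
        print(f"alpha={alpha}: best K={k}, speedup={speedup(alpha, k, 0.1):.2f}x")
```

With c = 0.1 the optimum moves from K = 2 at alpha = 0.4 to K = 4 at alpha = 0.7, which illustrates why a fixed K tuned on one workload is rarely right for another.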
The harder part of the math, and the one most blog posts skip, is what happens at batch size > 1. Continuous batching’s whole pitch is that as you stack more concurrent requests, the per-request decode cost falls because the target’s weight load is amortized over more in-flight tokens. At batch size 32 on an H100 running a 70B model, you are no longer memory-bound (you are nearer to compute-bound) and the marginal cost of the K verification positions actually goes up. The break-even shifts: speculation that delivered 2x at batch size 1 might deliver 1.05x at batch size 8 and net-negative at batch size 32. SGLang’s published 2025 benchmarks make this concrete: their fixed-K = 5 spec-dec configuration shows a 2.1x time-per-output-token improvement at batch 1, falling to 1.3x at batch 4, 1.0x at batch 16, and going slightly negative at batch 32 on Llama-3-70B.
This is not a flaw of speculation, it is a property of the underlying GPU utilization curve. The right operational mindset is that speculative decoding is a latency tool for the low-batch regime — interactive single-user chat, latency-sensitive agent calls, real-time code completion — and not a throughput tool for high-QPS background batch workloads. The serving engines that get this right enable speculation conditionally, switching it off when the in-flight batch crosses a threshold. For a deeper treatment of how vLLM, SGLang, and TensorRT-LLM compare at the throughput level, the H100 inference benchmark covers the numbers.
Integration in vLLM, SGLang, TensorRT-LLM
Answer-first summary: The three major serving engines all support speculative decoding by 2026, but they differ in how they integrate it with the rest of the scheduler. vLLM treats spec-dec as a pluggable executor that sits between the scheduler and the model runner, with PagedAttention extended to handle separate draft KV caches. SGLang ships a tighter integration with RadixAttention-aware drafting. TensorRT-LLM compiles spec-dec as a fused graph — Medusa and EAGLE heads become part of the engine plan — sacrificing flexibility for throughput. All three respect the underlying constraint that spec-dec wants low batch sizes to shine.

The diagram above shows the canonical pipeline. A scheduler — vLLM’s LLMEngine, SGLang’s Scheduler, or TRT-LLM’s executor — receives a request and decides per-iteration whether to invoke the speculative executor or fall through to standard decoding. The speculative executor invokes a draft engine (a small model, an EAGLE head, or a set of Medusa heads), gets back proposed tokens in a chain or tree shape, runs the target’s verification forward, and applies the acceptance sampler. Accepted tokens commit to the shared KV cache; the loop repeats until the request is complete.
vLLM introduced first-class spec-dec support in 0.7 and expanded it through 0.8 and 0.9. The design treats speculation as an executor that wraps the existing model runner. Two KV caches coexist: the target’s PagedAttention pool and a parallel draft pool. The scheduler is responsible for keeping the two in sync when requests are evicted or preempted. The interesting wrinkle is that vLLM exposes per-request enable/disable, so a server can run mixed workloads — interactive chat with spec-dec on, batch summarization with it off — without restarting. The reference recipe pairs Llama-3.1-70B-Instruct with Llama-3.1-8B-Instruct as a draft, K=5, and a max in-flight batch threshold of 8 above which spec-dec auto-disables.
SGLang integrated speculative decoding alongside its RadixAttention prefix-sharing system. The two interact in a non-obvious way: RadixAttention can serve a common prefix to many requests from a shared KV cache, which compresses the effective sequence length the target has to attend to. SGLang’s spec executor is aware of the radix tree and can re-use draft KV state across requests that share prefixes — a meaningful win in agentic workloads where many calls share a long system prompt. SGLang has been one of the more aggressive shops in publishing tree-attention spec-dec results, and their EAGLE-2 integration is the reference implementation many teams cite.
TensorRT-LLM takes the opposite philosophical approach. Where vLLM and SGLang treat spec-dec as a runtime mode, TRT-LLM compiles it into the engine plan. A Medusa-fused engine for Llama-3-70B contains the backbone plus the Medusa heads as a single fused TRT graph; an EAGLE engine fuses the EAGLE head similarly. The upside is throughput — fewer Python-side dispatch overheads, more aggressive kernel fusion. The downside is flexibility: changing K, switching from greedy to sampling, or switching draft models requires rebuilding the engine, which takes minutes to hours depending on the model. TRT-LLM is the right choice when the workload is stable and the absolute throughput floor matters; vLLM and SGLang are the right choices when iteration speed or per-request configurability matters.
Three integration details that bite teams in practice. First, tokenizer alignment between draft and target is non-negotiable — most serving engines will not even let you load mismatched tokenizers, but if you find a way around the guardrail, you will get silent correctness failures. Second, KV cache memory budgeting needs to account for the draft — a 1B draft on H100 alongside a 70B target eats around 4-6 GB of additional KV space at typical context lengths, which has to come out of the target’s pool. Third, logging and observability are still rough: the per-iteration acceptance rate, draft cost, and verification cost are the metrics teams need on dashboards, and as of mid-2026 the OpenTelemetry GenAI conventions are only just catching up to spec-dec. The LLM agent observability piece covers the conventions that the major engines are aligning on.
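The KV budgeting point is worth doing as arithmetic before deployment. The sketch below uses the standard per-token KV size (keys plus values, per layer, per KV head); the draft shape and the traffic figures are hypothetical values picked for illustration, not the configuration of any particular model.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       bytes_per_elem: int = 2) -> int:
    """KV cache bytes for one token: keys and values for every layer and KV head."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem


def kv_pool_bytes(tokens_in_flight: int, layers: int, kv_heads: int,
                  head_dim: int, bytes_per_elem: int = 2) -> int:
    """Total KV cache bytes needed to hold `tokens_in_flight` tokens."""
    return tokens_in_flight * kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_elem)


if __name__ == "__main__":
    # Hypothetical ~1B draft: 16 layers, 8 KV heads, head_dim 64, 16-bit cache.
    draft_shape = dict(layers=16, kv_heads=8, head_dim=64)
    tokens = 32 * 4096  # assumed: 32 concurrent requests at 4096 tokens each
    gib = kv_pool_bytes(tokens, **draft_shape) / 2**30
    print(f"draft KV pool: {gib:.1f} GiB")
```

For this assumed shape and load the draft’s KV pool comes to 4 GiB, memory that has to be carved out of what the target’s pool would otherwise use.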
When Speculation Wins vs Hurts
Answer-first summary: Speculative decoding wins when batch sizes are low, target models are large, draft acceptance is high, and traffic is bursty enough that idle GPU cycles are available for verification overhead. It hurts when batch sizes are high (continuous batching already amortizes target cost), when target models are small (the memory-bandwidth bottleneck is weaker), when acceptance is low (overhead dominates the savings), or when traffic is so steady that the scheduler has no slack for extra draft work.

The decision tree in Figure 5 collapses the field experience of a dozen production teams into a flowchart. The first branch is batch size at SLO. If your steady-state average is 32+ concurrent requests, continuous batching is already doing the amortization that spec-dec also tries to do, and you will not see net benefit. If you are between 5 and 16, the answer depends on traffic shape — bursty workloads with low-batch valleys are exactly where conditional spec-dec wins because it kicks in only when the GPU is otherwise idle. At batch size 1 to 4, almost any decent spec-dec configuration on a large target model is a win.
Target model size matters because the memory-bandwidth-versus-compute trade-off does. A 7B model at batch 1 on H100 is still memory-bound, but its decode step is already short, finishing in under 10 ms, so the fixed costs of drafting, verification, and sampling take up a large share of each step. Speculation’s overhead easily eats the savings. A 70B or 405B model at batch 1 is profoundly memory-bound, and the savings are correspondingly large. The break-even target size is roughly 30B parameters on H100-class hardware; below that, spec-dec rarely pays.
Workload predictability shows up through alpha. Chat assistants on common queries, code completion in popular languages, and RAG-grounded question answering all produce token streams that a draft model can predict at alpha > 0.7. Creative writing, long-form reasoning chains, and adversarial or out-of-distribution prompts produce alpha in the 0.4 to 0.6 range where the math gets tight. Teams running mixed workloads — agent calls with structured outputs alongside open-ended generation — increasingly use per-request adaptive K, sampling acceptance history to decide whether to enable speculation for the next iteration.
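A per-request adaptive-K policy can be sketched as a decayed acceptance estimate feeding the idealized speedup formula. Everything below is a hypothetical policy for illustration (class name, decay, prior, and the rule that K = 0 means plain decoding); it is not how any named engine implements it.

```python
def expected_tokens(alpha: float, k: int) -> float:
    """Expected tokens per iteration for acceptance rate alpha and chain length k."""
    if alpha >= 1.0:
        return float(k + 1)
    return (1 - alpha ** (k + 1)) / (1 - alpha)


class AdaptiveK:
    """Tracks recent acceptance for one request and picks the next chain length."""

    def __init__(self, c: float = 0.1, max_k: int = 8, decay: float = 0.9,
                 prior_alpha: float = 0.7, prior_weight: float = 10.0) -> None:
        self.c, self.max_k, self.decay = c, max_k, decay
        # Start from a prior so the first iterations are not decided by noise.
        self._accepted = prior_alpha * prior_weight
        self._tested = prior_weight

    @property
    def alpha(self) -> float:
        return self._accepted / self._tested

    def observe(self, accepted: int, proposed: int) -> None:
        """Record one iteration: `accepted` of `proposed` draft tokens survived."""
        if proposed <= 0:
            return
        # The chain stops at the first rejection, so later tokens were never tested.
        tested = min(accepted + 1, proposed)
        self._accepted = self.decay * self._accepted + accepted
        self._tested = self.decay * self._tested + tested

    def next_k(self) -> int:
        """Best K under the idealized formula; 0 means skip speculation."""
        best_k, best_speedup = 0, 1.0
        for k in range(1, self.max_k + 1):
            s = expected_tokens(self.alpha, k) / (1 + k * self.c)
            if s > best_speedup:
                best_k, best_speedup = k, s
        return best_k
```

A scheduler would call `observe` after each verification pass and `next_k` before the next draft. Counting only the tokens that were actually tested keeps the estimate from being dragged down by draft tokens that were discarded after an earlier rejection.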
The most overlooked dimension is traffic burstiness. A workload that averages 8 concurrent requests but spikes between 0 and 32 has substantial low-batch valleys where speculation pays. A workload pinned at exactly 16 has none. The decision tree’s recommendation to “enable spec-dec only when batch < threshold” is the operational pattern that captures the upside without paying the cost during peak.
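The “enable only below a threshold” pattern is a few lines of state. The sketch below adds a small hysteresis band so the gate does not flap when the batch hovers at the threshold; the class name and both thresholds are assumptions for illustration and should come from offline benchmarking.

```python
class SpecDecGate:
    """Turns speculation off above one batch size and back on below a lower one."""

    def __init__(self, disable_above: int = 8, enable_below: int = 6) -> None:
        if enable_below > disable_above:
            raise ValueError("enable_below must not exceed disable_above")
        self.disable_above = disable_above
        self.enable_below = enable_below
        self.enabled = True

    def update(self, in_flight: int) -> bool:
        """Feed the current in-flight batch size; returns whether to speculate."""
        if self.enabled and in_flight > self.disable_above:
            self.enabled = False
        elif not self.enabled and in_flight < self.enable_below:
            self.enabled = True
        return self.enabled


if __name__ == "__main__":
    gate = SpecDecGate()
    for batch in [1, 4, 8, 9, 7, 6, 5, 3, 20]:
        print(batch, gate.update(batch))
```

A bursty workload spends much of its time below the lower threshold, so the gate keeps speculation on through the valleys and off through the peaks.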
Trade-offs, Gotchas, and What Goes Wrong
Answer-first summary: Speculative decoding adds three classes of operational risk: configuration mistakes that quietly disable the speedup, correctness pitfalls when the acceptance rule is implemented incorrectly, and observability gaps that make regressions hard to catch. None are inherent to the technique, but all are common enough to budget for.
The most common failure mode is silent overhead. A team enables spec-dec, sees a small but real latency improvement, and assumes the system is working. Six weeks later, the workload has drifted, the production acceptance rate has fallen from 0.75 to 0.4, and the system is now 5 percent slower than vanilla decoding — but nobody noticed because the dashboards measure end-to-end latency without the comparison case. The mitigation is to track acceptance rate as a first-class metric and alert when it crosses a configurable floor. Most teams settle on alpha = 0.55 as the alarm threshold.
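Tracking acceptance as a first-class metric can start as a rolling window with a floor. The sketch below is a minimal in-process version with invented names; the 0.55 floor comes from the text, while the window size and minimum sample count are assumptions.

```python
from collections import deque
from typing import Deque, Optional, Tuple


class AcceptanceMonitor:
    """Rolling acceptance rate over the most recent speculation iterations."""

    def __init__(self, floor: float = 0.55, window: int = 1000,
                 min_tested: int = 200) -> None:
        self.floor = floor
        self.min_tested = min_tested
        self._events: Deque[Tuple[int, int]] = deque(maxlen=window)

    def record(self, accepted: int, tested: int) -> None:
        """Log one iteration: draft tokens accepted out of draft tokens tested."""
        self._events.append((accepted, tested))

    def rate(self) -> Optional[float]:
        """Acceptance rate over the window, or None until enough tokens are seen."""
        tested = sum(t for _, t in self._events)
        if tested < self.min_tested:
            return None
        return sum(a for a, _ in self._events) / tested

    def alarm(self) -> bool:
        """True when the windowed rate has dropped below the floor."""
        current = self.rate()
        return current is not None and current < self.floor
```

In production the same counters would be exported to the metrics system rather than held in process, but the alerting logic is the same: compare a windowed rate to a floor and page when it crosses.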
The second failure mode is incorrect acceptance rules. Hand-rolled spec-dec implementations frequently get the residual-sampling step wrong, either by sampling from the raw target distribution (which biases output toward easier tokens) or by skipping the normalization step in max(0, p - q). Both are statistically detectable with a chi-squared test against a known-good reference, but the surface symptom, a slightly different output distribution that still looks plausible, is invisible without that test. The fix is to use a reference implementation (vLLM’s, SGLang’s, or the original DeepMind reference) and resist the temptation to hand-optimize the sampler.
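The chi-squared check can be demonstrated on a toy vocabulary. The sketch below simulates a correct sampler and one with the raw-target bug, then compares emitted counts against the target distribution; the distributions, sample size, and function names are all made up for illustration, and a real test would compare against a trusted reference implementation on real logits.

```python
import random
from typing import List, Sequence


def sample_output(p: Sequence[float], q: Sequence[float],
                  rng: random.Random, buggy: bool = False) -> int:
    """One draft-then-verify step; `buggy` resamples from raw p after a rejection."""
    n = len(p)
    token = rng.choices(range(n), weights=q)[0]
    if rng.random() < min(1.0, p[token] / q[token]):
        return token
    if buggy:
        weights: List[float] = list(p)  # BUG: should be the residual distribution
    else:
        # choices() normalizes the weights, so max(0, p - q) can be passed as-is.
        weights = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    return rng.choices(range(n), weights=weights)[0]


def chi_squared(counts: Sequence[int], expected_probs: Sequence[float]) -> float:
    """Pearson chi-squared statistic of observed counts against expected probabilities."""
    n = sum(counts)
    return sum((c - n * e) ** 2 / (n * e) for c, e in zip(counts, expected_probs))


def statistic(p: Sequence[float], q: Sequence[float], n: int,
              buggy: bool, seed: int = 0) -> float:
    """Simulate n steps and return the chi-squared statistic against p."""
    rng = random.Random(seed)
    counts = [0] * len(p)
    for _ in range(n):
        counts[sample_output(p, q, rng, buggy)] += 1
    return chi_squared(counts, p)


if __name__ == "__main__":
    p = [0.5, 0.3, 0.15, 0.05]
    q = [0.25, 0.25, 0.25, 0.25]
    # With 4 categories (3 degrees of freedom), about 16.27 is the 0.001 critical value.
    print("correct:", statistic(p, q, 20000, buggy=False))
    print("buggy:  ", statistic(p, q, 20000, buggy=True))
```

The buggy sampler’s statistic lands far above the critical value while individual outputs still look plausible, which is exactly why the check belongs in CI rather than in a manual review.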
The third is memory pressure. A 1B draft alongside a 70B target eats 6 to 10 GB of additional GPU memory across weights and KV cache. Teams running multi-tenant inference often discover this only when concurrency degrades because the KV cache has shrunk. Plan for the draft as part of the deployment footprint, not as an afterthought.
Two structural gotchas round out the list. Tokenizer drift between training-time draft-target alignment and serving-time tokenizer versions can collapse acceptance overnight after a routine upgrade; pin both. Temperature and top-p sampling at the target change the effective acceptance rate, because acceptance depends on how closely q tracks p. Published results generally show the highest acceptance under greedy or low-temperature decoding, where the test reduces to matching the target’s top token, and lower acceptance as temperature rises and both distributions spread over more of the vocabulary. Re-tune K when sampling parameters change materially.
Practical Recommendations
Answer-first summary: Treat speculative decoding as a low-batch latency tool, not a high-throughput tool. Default to EAGLE-2 or Medusa where you have the training budget; fall back to a same-family small draft otherwise. Instrument acceptance rate from day one, and disable speculation conditionally when batch size crosses a workload-specific threshold.
The configurations that have held up across the production deployments worth learning from in mid-2026 share a few traits. They use a large enough target (30B+) that memory bandwidth is the real bottleneck. They invest in a high-acceptance draft path: EAGLE-2 in most new builds, Medusa in operations-sensitive ones, vanilla draft-target only as a transitional choice. They tune K per workload, typically landing between 4 and 7. They set an acceptance-rate alarm at alpha = 0.55, treat the metric as an SLO, and have a runbook for what to do when it fires. They disable spec-dec when in-flight batch exceeds a threshold derived from offline benchmarking, typically 8 to 16 on H100-class hardware. And they keep an honest comparison case, a fraction of traffic served with speculation forced off, so they can detect silent regressions instead of inferring them from anecdotes.
The pillar overview for this space lives at the ai-ml hub and connects to the broader inference-stack and observability content for context.
FAQ
What is speculative decoding in LLM inference?
Speculative decoding is a technique for accelerating autoregressive LLM generation by using a small “draft” model to propose K tokens ahead, then verifying all K in a single batched forward pass of the large “target” model. A rejection-sampling acceptance rule ensures the output distribution is mathematically identical to standard sampling from the target, so the technique is exact rather than approximate. Speedups of 2x to 4x on time-per-output-token are routine at batch size 1 on 70B-class models in 2026 production deployments.
How does EAGLE-2 differ from vanilla speculative decoding?
EAGLE-2 replaces the standalone draft model with a small autoregressive head attached to the target’s last-layer hidden states, and it explores a dynamic tree of candidate continuations rather than a linear chain. Both changes lift acceptance rates: feature-conditioned drafts are better calibrated than separate small LLMs (alpha 0.85-0.92 vs 0.6-0.75), and the tree structure means an early rejection on one branch does not kill the entire chain. Published EAGLE-2 results show 3x to 4x speedup versus 1.5x to 2.5x for vanilla draft-target on the same Llama-2-70B-Chat workloads.
What are Medusa heads, and when should I use them?
Medusa heads are additional output heads attached to a target LLM that each predict tokens at offsets +1, +2, …, +N from the current position. They replace the separate draft model entirely — there is one set of weights, fine-tuned end-to-end, and verification uses tree attention to evaluate the candidate tree in a single target forward. Speedups are modestly lower than EAGLE-2 (2x to 2.8x) but the operational story is simpler because there is no second model to deploy and version. Use Medusa when your inference platform team values “one artifact, one deployment” over the last 30 percent of raw speedup.
Does speculative decoding work with vLLM?
Yes. vLLM has supported speculative decoding as a first-class scheduler mode since 0.7, with EAGLE and Medusa integrations following through 0.8 and 0.9. The reference configuration pairs a large target (Llama-3.1-70B-Instruct) with a same-family small draft (Llama-3.1-8B-Instruct), K=5, with auto-disable above a configurable in-flight batch threshold. SGLang and TensorRT-LLM both ship comparable support; TRT-LLM compiles the configuration into a fused engine plan for higher peak throughput at the cost of flexibility.
When does speculative decoding hurt instead of help?
Three regimes where speculation typically hurts: (1) high in-flight batch sizes (32+), because continuous batching already amortizes the target’s memory-bandwidth cost and the marginal verification work becomes pure overhead; (2) small target models (below roughly 30B parameters), because they are not memory-bound enough at batch 1 to benefit from the trade; (3) low-acceptance workloads (alpha < 0.5), typically creative or out-of-distribution generation, where the draft cost exceeds the savings. The decision tree in Figure 5 walks through the branches.
Further Reading
Internal:
- vLLM, SGLang, TensorRT-LLM Inference Benchmark on H100 (2026) — throughput-side numbers and how the three engines compare at the scheduler level.
- Edge LLM Runtime Benchmark: llama.cpp, MLC, ONNX (2026) — where self-speculative and memory-constrained variants matter most.
- LLM Agent Observability with OpenTelemetry GenAI Conventions (2026) — how to put acceptance-rate metrics on a dashboard that survives an audit.
- AI/ML pillar hub — the broader inference-stack and serving-architecture content.
External:
- Leviathan, Kalman, Matias. “Fast Inference from Transformers via Speculative Decoding.” arXiv:2211.17192.
- Chen, Borgeaud, et al. “Accelerating Large Language Model Decoding with Speculative Sampling.” DeepMind, arXiv:2302.01318.
- Li, Wei, et al. “EAGLE-2: Faster Inference of Language Models with Dynamic Draft Trees.” arXiv:2406.16858.
- Cai, Li, et al. “Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads.” arXiv:2401.10774.
- vLLM project documentation on speculative decoding (docs.vllm.ai).
