FP8 vs INT8 vs INT4 LLM Quantization Benchmark (2026)

FP8 vs INT8 vs INT4 LLM Quantization Benchmark (2026)

FP8 vs INT8 vs INT4: An LLM Quantization Benchmark for 2026

An honest LLM quantization benchmark is harder to find than it should be. Vendors quote the throughput number that flatters their kernel. Model cards quote the accuracy number that flatters their recipe. Practitioners are left guessing whether FP8, INT8, or INT4 is the right place to spend their precision budget. The three formats are not ranked points on a single quality dial. Each trades accuracy, memory, and hardware support differently, and the best choice depends on the GPU you have and the workload you serve.

This post does not pretend to hand you fresh lab numbers. Running a clean cross-format sweep across many models and hardware classes is a multi-week effort, and stale measurements mislead more than they help. Instead it gives you something more durable: a transparent methodology you can run yourself, plus directional figures drawn from published results and typical observed behavior. Treat every number here as a representative range, not a measurement we took on your hardware. The goal is to help you reason about the trade-offs and design your own test.

What this covers: why precision matters, a methodology you can reproduce, a directional comparison of FP8 vs INT8 vs INT4 across accuracy and speed, decision guidance, the failure modes that bite in production, and an FAQ.

Background: Why Quantization Precision Matters

A large language model spends most of its inference budget moving numbers, not computing them. Weights have to be read from GPU memory for every token. Activations flow through every layer. The KV cache grows with context length. Quantization shrinks the bits used to represent those numbers, and that single change touches three things at once: memory footprint, throughput, and accuracy.

Memory is the first win. Dropping weights from 16-bit to 8-bit roughly halves their footprint. Going to 4-bit roughly quarters it. That is the difference between fitting a 70B model on one GPU or needing two. Throughput follows, because inference is usually memory-bandwidth bound. Fewer bytes per weight means more tokens per second, even when the math runs at higher precision internally.

The arithmetic is worth seeing once. A 70B model at FP16 needs roughly 140GB just for weights, which forces multi-GPU serving on most cards. At INT8 that drops to about 70GB, and at INT4 to about 35GB plus quantization metadata. The INT4 number is what lets a 70B model fit on a single 48GB or 80GB GPU. That is not a marginal optimization. It changes the deployment topology, removes inter-GPU communication, and often cuts cost per token more than any kernel tweak.

Accuracy is where it gets subtle. Compressing a number means losing information, and some numbers matter more than others. The art of quantization is deciding which ranges to preserve.

A key distinction shapes everything below: weight-only versus weight-plus-activation quantization. Weight-only schemes (often written W4A16 or W8A16) compress the stored weights but run the math in higher precision. They are simpler and safer for accuracy. Weight-plus-activation schemes (W8A8) compress both, unlocking faster tensor-core paths but exposing the model to outlier activations that can wreck quality if handled carelessly.

The research lineage matters here. GPTQ introduced an accurate post-training method for aggressive weight quantization, and AWQ showed that protecting a small fraction of salient weight channels recovers most of the lost accuracy at 4-bit. On the systems side, NVIDIA’s Transformer Engine and FP8 documentation describes how Hopper and Blackwell tensor cores execute FP8 natively, and the vLLM quantization docs document which recipes their serving stack actually accelerates.

The formats themselves are not interchangeable bit-counts. FP8 is a floating-point type, usually the E4M3 variant, with four exponent bits and three mantissa bits. Those exponent bits give it a wide dynamic range, which is why it tolerates the spiky values that appear in activations. INT8 and INT4 are integer types with uniform spacing across a fixed range. They are precise where values cluster but unforgiving when a single outlier stretches the range. That structural difference, floating-point range versus integer uniformity, explains most of what you see in the results.

There is also a fourth quantity many teams forget: the KV cache. For long-context serving, the cache of past keys and values can dwarf the weights in memory. Quantizing the cache to 8-bit or lower is a separate lever from weight quantization, and it interacts with the precision you pick for the weights. We keep the focus on weights and activations here, but flag it because memory math that ignores the KV cache will mislead you on long prompts. Figure 1 maps the formats and the paths each one takes from baseline to deployment.

Overview of FP8 INT8 and INT4 quantization formats and their typical weight only and weight plus activation paths
Figure 1: How FP8, INT8, and INT4 branch from an FP16 baseline, and the calibration and hardware path each one typically takes.

Methodology: How You’d Run This Benchmark Honestly

Here is the test design we would run, and the one you should run before trusting any vendor chart. We describe it in full so the directional figures later are interpretable, not magical.

Models. Pick at least three sizes and two architectures. A dense 8B model exposes per-layer sensitivity. A dense 70B model shows where memory savings actually unlock single-GPU serving. A mixture-of-experts model behaves differently again, because only some experts fire per token. Quantization sensitivity is not uniform across families, so never generalize from one model.

Datasets and metrics. Use two layers of evaluation. Perplexity on a held-out corpus such as WikiText gives a cheap, sensitive signal of degradation. But perplexity alone lies — a model can hold perplexity and still lose reasoning ability. So add task accuracy on benchmarks like MMLU for knowledge, GSM8K for arithmetic reasoning, and a code or instruction-following set if those are your workload. Report retention as a percentage of the FP16 baseline, not absolute scores.

Hardware classes. Format support is not universal. FP8 needs Hopper-class or newer tensor cores to run natively. INT8 and INT4 run on a far wider range of GPUs, including older datacenter cards and many consumer parts. Always pin the GPU, driver, and kernel versions, because a throughput number without hardware context is noise.

Recipes. Test the formats the way they are actually deployed. FP8 as W8A8 for activation-heavy serving. INT8 as either weight-only (W8A16) or full W8A8 with an activation-smoothing step. INT4 almost always as weight-only (W4A16), since 4-bit activations rarely survive. Use the standard tools so results transfer: llm-compressor and AutoAWQ for producing checkpoints, and an engine like vLLM, SGLang, or TensorRT-LLM for serving.

Calibration. This is where most benchmarks quietly cheat. Post-training quantization needs a calibration set — typically 128 to 512 samples — to estimate the scaling factors. If that set does not match your real traffic, your reported accuracy will not match production. Calibrate on domain-matched data, and document the sample count.

Controlling for the engine. A subtle source of confusion is that the serving engine changes the numbers as much as the format does. The same INT4 checkpoint can run at different speeds under vLLM, SGLang, and TensorRT-LLM, because each has different kernel coverage and scheduling. So pin the engine and version alongside the hardware. If you compare FP8 on one engine against INT4 on another, you are measuring the engines, not the formats. Hold everything constant except the precision under test.

Warm-up and batch size. Throughput is meaningless without a stated batch size and a warm cache. Single-request latency and high-concurrency throughput tell different stories. A format that wins at batch size one can lose at batch size sixty-four, because the bottleneck shifts from memory bandwidth to compute. Report at least two operating points: a latency-sensitive low-concurrency point and a throughput-oriented high-concurrency point. Otherwise the headline number flatters whatever regime the benchmark happened to run.

What to control for. Fix the prompt length, the output length, the sampling parameters, and the tensor-parallel degree. Each of these moves the result. A benchmark that varies two things at once cannot attribute the difference to precision. The discipline is dull but it is the entire value of the exercise.

Metrics defined. Track five numbers. Perplexity and task accuracy for quality. Tokens per second and time-to-first-token for speed. Peak GPU memory, counting both weights and KV cache, for footprint. Normalize speed and memory against the FP16 baseline on the same hardware so the comparison is fair.

Figure 2 shows the full pipeline. One caveat we will repeat: the figures in the next section are representative ranges synthesized from published results and typical behavior, not measurements we freshly collected. They are directional. Run the pipeline above to get numbers for your stack.

Benchmark methodology pipeline from model selection through calibration quantization accuracy validation and throughput measurement
Figure 2: The end-to-end methodology — model selection, calibration, quantization, accuracy validation, throughput measurement, and normalization against an FP16 baseline.

Results: Accuracy, Throughput, and Memory

The table below summarizes how the three formats typically compare. Every figure is directional and hedged. Real results shift with model family, recipe, kernel, and hardware. Read these as the shape of the trade-off, not a leaderboard.

Dimension FP8 W8A8 INT8 W8A8 Or W8A16 INT4 W4A16
Memory footprint vs FP16 About 50 percent of weights About 50 percent of weights About 25 percent of weights
Relative throughput High, often 1.5x to 2x on supported GPUs Moderate to high, about 1.3x to 1.8x High for memory bound decode, about 1.4x to 1.8x
Accuracy retention Very high, typically within about 0.5 percent High, typically within about 1 percent Good but variable, often 1 to 3 percent loss
Hardware support Narrow, needs Hopper or newer tensor cores Broad, most datacenter and many consumer GPUs Broad for weight only kernels
Calibration sensitivity Low to moderate Moderate, activation smoothing helps Higher, group size and salient channels matter
Best use case High throughput serving on modern GPUs Safe general purpose speedup, wide hardware Fitting big models on limited memory

A few patterns hold across most published work and field experience.

FP8 is the accuracy-preserving speed play, where the hardware exists. Its floating-point format keeps a wide dynamic range, which handles activation outliers far better than 8-bit integers do natively. That is why FP8 retains accuracy so well even as a full W8A8 scheme. The catch is hardware: without native FP8 tensor cores, you lose the speed advantage, and the format becomes pointless. On a current-generation serving GPU, FP8 is frequently the default recommendation for throughput-sensitive deployments.

INT8 is the safe, universal middle. It runs nearly everywhere, accuracy retention is strong, and the tooling is mature. The wrinkle is activations. Naive INT8 activation quantization is fragile because LLM activations contain large outliers. Techniques like SmoothQuant shift difficulty from activations into weights, which integer formats handle better. With that step, INT8 W8A8 is reliable. Without it, expect surprises. Weight-only INT8 sidesteps the issue entirely at the cost of some speed.

INT4 is the memory play with the most variance. Four bits per weight is where a 70B model fits on a single 48GB or 80GB card. Modern weight-only methods like AWQ and GPTQ recover most of the lost accuracy, and for many generation tasks the gap is small. But the loss is real and uneven. Reasoning-heavy and long-context tasks tend to suffer more than casual generation. Smaller group sizes help accuracy but cost a little speed and memory. INT4 rewards careful evaluation more than any other format.

A word on group size, since it is the main INT4 tuning knob. Weight-only INT4 does not use one scale for a whole tensor. It splits weights into groups — commonly 128 or 64 elements — and gives each group its own scale. Smaller groups track local weight statistics more tightly, which raises accuracy, but they add metadata and slightly raise the effective bit-rate. A group size of 128 is the common default. Dropping to 64 is the usual first move when accuracy lands just short. This single parameter often explains why two INT4 checkpoints of the same model behave differently.

Two more results-level patterns are worth naming. First, bigger models tend to quantize more gracefully. A 70B model usually loses less relative accuracy at INT4 than an 8B model does, because the redundancy in a larger network cushions the loss. So conclusions from a small model can be pessimistic for a large one, and the reverse holds too. Second, instruction-tuned and reasoning-tuned models can be more sensitive than their base versions, because the fine-tuning concentrates capability in specific directions that aggressive quantization may blur. Evaluate the exact checkpoint you intend to ship.

Figure 3 plots accuracy retention as precision drops. The curve is gentle from FP16 through INT8 and steepens into aggressive INT4, which is exactly the shape practitioners report.

Chart of representative accuracy retention from FP16 through FP8 INT8 and INT4 precisions
Figure 3: Representative accuracy retention as precision decreases. The drop stays small through INT8 and widens at aggressive INT4. Directional, not measured.

Figure 4 contrasts memory footprint against relative throughput. The bars fall as bits shrink. The throughput line is more nuanced, since speed depends on whether your workload is memory-bound decode or compute-bound prefill, and on kernel maturity for the format.

Chart comparing relative memory footprint and throughput across FP16 FP8 INT8 and INT4
Figure 4: Representative memory footprint versus throughput. Memory falls steadily with bit width; throughput depends on hardware and workload shape. Directional ranges.

One honest caveat on throughput. Lower precision does not always mean faster. For short prompts that are compute-bound during prefill, an INT4 weight-only kernel can be slower than FP8, because it must dequantize weights before the matmul. The speedup shows up most clearly in memory-bound decode of long outputs. Always measure both phases.

It is worth being concrete about where the savings come from. During decode, the model generates one token at a time, and the dominant cost is streaming the weights from memory. Fewer bytes per weight directly buys you tokens per second. That is why memory-bound decode is where INT4 shines. During prefill, the model processes the whole prompt in parallel, the matmuls are large, and the bottleneck moves to compute. There, a native FP8 tensor-core path that runs the math in low precision beats a weight-only INT4 path that has to expand weights first. The same model, same hardware, can show INT4 winning one phase and FP8 winning the other.

There is a second-order effect on batching. Smaller weights free up memory for a larger KV cache, which lets you serve more concurrent requests. More concurrency raises aggregate throughput even if per-request latency is flat. So INT4’s memory savings can translate into throughput indirectly, by raising the ceiling on batch size, not only by speeding the matmul. This is the kind of effect a single-stream benchmark misses entirely, and it is often the reason teams quantize at all.

When to Choose Each Precision

Match the format to your binding constraint, not to a benchmark headline.

Choose FP8 when you serve on modern hardware and accuracy is precious. If your fleet is Hopper-class or newer and you want maximum tokens per second with minimal quality risk, FP8 is usually the strongest default. It is the easiest format to deploy without a long accuracy-recovery loop. Its dynamic range absorbs the outliers that break integer activation quantization. The single hard requirement is native FP8 support. Without it, skip FP8 entirely.

Choose INT8 when hardware variety or operational safety matters most. INT8 runs across a wide span of GPUs, including older datacenter cards and many consumer parts. If you deploy to mixed hardware, sell software that customers run on unknown GPUs, or simply want the lowest-drama speedup, INT8 is the conservative pick. Prefer weight-only W8A16 if you want to avoid activation-quantization fragility, or W8A8 with SmoothQuant when you need the extra throughput and can validate it.

Choose INT4 when memory is the wall you keep hitting. If a model will not fit, or fits only with KV-cache compromises that hurt context length, INT4 weight-only is the lever. It is the difference between one GPU and two, or between a model running on a single card and not running at all. Pay for that memory with a careful accuracy evaluation on your real tasks, especially reasoning and long context. Use AWQ or GPTQ, and tune group size.

A practical default ladder: start at FP8 if your hardware supports it, fall back to INT8 for broad compatibility, and reach for INT4 only when memory forces it. Then validate, because the only number that matters is the one from your own evaluation.

It also helps to think in terms of the asset, not the format. A user-facing chat assistant tolerates a tiny accuracy dip in exchange for lower cost, so aggressive quantization is fine. A model grading exams, writing code that ships, or making a financial decision sits at the other end. There, an INT4 quality cliff on a reasoning task is a real liability, and FP8 or INT8 buys insurance worth paying for. Let the cost of a wrong answer set your precision floor.

Mixed precision is a real option too. You do not have to pick one format for the entire model. Some teams keep sensitive layers — often the first and last, or attention projections — at higher precision while quantizing the bulk to INT4. Tooling support for this varies, but it is the natural escape hatch when a uniform recipe leaves accuracy just short of acceptable. Treat it as a tuning knob, not a default, because it complicates the serving path.

Trade-offs, Gotchas, and What Goes Wrong

Quantization fails in a handful of predictable ways. Knowing them shortens debugging.

Outlier activations are the classic killer. A few activation channels carry values far larger than the rest. Quantize them naively to INT8 and those outliers dominate the scale, crushing precision for everything else. SmoothQuant-style techniques migrate that difficulty into the weights, and FP8’s wider range sidesteps it. If a W8A8 model suddenly degrades, suspect outliers first.

Calibration mismatch is the silent one. A model calibrated on generic web text can look fine on standard benchmarks and then drift on your domain traffic. The fix is unglamorous: calibrate on data that resembles production, and check enough samples to stabilize the scales.

Kernel and hardware availability bite at deploy time. A format can be theoretically supported but unaccelerated by your serving engine, or unsupported by your GPU generation. Verify that your specific engine and hardware actually run the recipe fast before committing. FP8 on a pre-Hopper card is the textbook trap.

INT4 has quality cliffs. The average looks fine, then a reasoning or long-context task falls off sharply. Smaller group sizes and salient-channel protection help, but always evaluate the hard tasks, not just perplexity.

Perplexity hides regressions. A model can hold its perplexity within a hair of baseline and still lose several points on a reasoning benchmark. Perplexity is a fluent-text metric, not a capability metric. If your acceptance gate is perplexity alone, you will ship quietly degraded models. Always pair it with task accuracy on workloads that resemble production. Figure 5 maps these failure modes to their fixes.

Decision diagram of common quantization failure modes including outlier activations calibration mismatch INT4 quality cliffs and kernel availability with their fixes
Figure 5: Common quantization failure modes and the corrective action for each — from outlier activations to INT4 quality cliffs.

FAQ

Is FP8 always more accurate than INT8?
Usually, but not by a huge margin. FP8’s floating-point range handles activation outliers more gracefully than integer formats, so full W8A8 FP8 tends to retain accuracy better than naive INT8 W8A8. With SmoothQuant, well-tuned INT8 closes much of the gap. The bigger differentiator is hardware support.

Does INT4 always run faster than INT8?
No. INT4 saves more memory and speeds up memory-bound decode, but weight-only INT4 must dequantize before the matmul. For compute-bound prefill or short prompts, it can be slower than FP8 or INT8. Measure both prefill and decode on your workload.

Can I quantize activations to 4-bit?
Rarely with good results. Activations carry large outliers that 4-bit integers cannot represent without severe loss, so INT4 is almost always weight-only. Keeping activations at 16-bit or FP8 is the standard, reliable approach for most deployments.

Which tools should I use to produce quantized checkpoints?
For serving with vLLM or SGLang, llm-compressor and AutoAWQ are common choices, with TensorRT-LLM on the NVIDIA path. Match the recipe to what your serving engine actually accelerates, or the speedup will not appear.

How many calibration samples do I need?
Typically 128 to 512 for post-training methods. More samples help stability up to a point. What matters more than count is that the samples resemble your production traffic, not generic text.

Will a quantized model stay reproducible across upgrades?
Not automatically. Kernel implementations and default scales change between engine versions, so a checkpoint can behave differently after an upgrade. Pin your toolchain, treat the quantized model as a build artifact tied to a software stack, and re-validate after any quantizer or engine update.

Further Reading


Riju writes about applied AI infrastructure, edge inference, and the systems behind digital twins at iotdigitaltwinplm.com. More on the about page.

Comments

No comments yet. Why don’t you start the discussion?

Leave a Reply

Your email address will not be published. Required fields are marked *