On-Device SLM Inference: A 2026 Edge GPU Benchmark

On-Device SLM Inference Benchmark: A 2026 Edge GPU Methodology

Most “edge AI benchmark” content you find online is unreproducible: a single throughput number, no hardware notes, no power data, and no way to tell whether it would hold on your board. This post is the opposite. It is a rigorous, reproducible on-device SLM inference benchmark methodology for 2026 edge GPUs — the kind you can re-run on a Jetson on your own bench and trust. We deliberately do not hand you a leaderboard of numbers we claim to have measured for you. Instead we define exactly what to measure, how to measure it without fooling yourself, and how to read the trends — because the moment you change a quantization level, a runtime flag, or the ambient temperature in your enclosure, the absolute numbers move. Small language models from the Phi, Gemma, and Qwen families now run usefully on hardware that fits in a robot arm or a gateway box, and the deployment decision hinges on latency, memory, and cost, not on a press-release token rate.

What this covers: why run SLMs on-device, what counts as an SLM, the full benchmark methodology, an illustrative results table, the trade-offs that bite, and a decision checklist.

Why Benchmark On-Device SLM Inference at All

On-device SLM inference means running a small language model entirely on local edge hardware — a Jetson, an industrial gateway, a robot controller — with no round-trip to a cloud API. You benchmark it because four constraints decide whether a feature ships: latency, privacy, cost, and offline availability. A cloud call cannot guarantee any of them at the edge, so you must measure the local alternative on the exact board you will deploy.

The industrial and embedded case is concrete. A machine operator asking a natural-language question about a fault code cannot wait 800 milliseconds for a cloud round-trip across a flaky plant network, and the plant may forbid sending machine telemetry off-site at all. A mobile robot navigating a warehouse loses connectivity in dead zones; its reasoning has to keep working. For privacy and bandwidth reasons, a smart camera summarising events should not stream raw video to a data centre. In every case the SLM has to live on the device, and the engineering question is no longer “can it run?” but “does it run well enough inside our latency and power envelope?”

Cost is the quieter driver. At any meaningful request volume, per-token cloud inference accrues forever, while an edge device is a one-time capital cost that amortises. We unpack the cloud side of that equation in our deep dive on vLLM cost economics; this post is its on-device counterpart. The catch is that edge cost is dominated not by tokens but by whether the model fits and stays cool — a model that thermally throttles or swaps to disk has effectively infinite cost because it never meets the SLA. That is precisely why a benchmark that ignores power and thermals is worthless for edge planning.

There is also a regulatory and data-governance angle that pushes industrial teams toward on-device inference independent of latency or cost. Plants in regulated sectors — pharma, defence, automotive — often operate under data-residency rules or air-gapped network policies that prohibit any telemetry leaving the facility. For these deployments the cloud is not a slower option; it is simply not an option, and the only question the benchmark answers is which local model meets the requirement. The benchmark methodology therefore has to be defensible to an auditor, not just to an engineer: reproducible conditions, recorded hardware state, and honest variance reporting are what let you stand behind a deployment decision months later when someone asks why you chose a particular model and quant.

What Counts as a Small Language Model in 2026

A small language model (SLM) in 2026 is, loosely, a transformer in the ~0.5B to ~9B parameter class that can run on a single edge accelerator without model parallelism. The boundary is fuzzy and intentionally so — what matters is whether the quantized weights plus the KV cache fit in your device’s memory with headroom, not a precise parameter count.

Three families dominate edge deployments, and it is worth describing them generally rather than asserting exact figures that shift between releases. The Phi family from Microsoft (Phi-3 and Phi-3.5 “mini” and small variants, and their successors) is known for punching above its weight on reasoning relative to size, thanks to heavy data-curation. The Gemma family from Google (Gemma 2 and Gemma 3, roughly 2B through 9B-class variants) offers a strong open-weights lineup with permissive-ish licensing and good tooling support. The Qwen family from Alibaba (Qwen2.5 and later, spanning a wide range from sub-1B up through 7B-class on the small end) is notable for breadth of sizes and strong multilingual coverage. Treat any specific parameter count, context length, or benchmark score as something to verify against the current model card on Hugging Face, because these families iterate fast and the numbers in a six-month-old blog post are routinely stale.

The practical reason these three keep appearing in edge work is that each ships in multiple sizes, all three have first-class GGUF conversions and quantizations available, and all three are well-exercised by the open runtimes. That last point matters more than raw quality: a slightly weaker model with a battle-tested kernel path will often beat a slightly stronger model whose edge runtime support is immature.

The Benchmark Methodology

The methodology has one job: produce numbers that another engineer can reproduce on the same board and that move predictably when you change one variable. Everything below is in service of that.

Figure 1: The benchmark loop — define the workload and budget, fix every condition, warm up, measure, repeat, and publish with hardware notes.

Hardware: Pin the Exact Board and Power Mode

“Jetson” is not a benchmark target; a specific board in a specific power mode is. The relevant 2026 edge GPU range spans roughly from the Jetson Orin Nano class (modest GPU, ~8GB shared LPDDR5, single-digit-to-low-double-digit watt envelope) up to the Jetson AGX Orin class (far larger GPU, up to 64GB shared memory, tens of watts). These are unified-memory systems: the GPU and CPU share one LPDDR5 pool, so “VRAM” is really a slice of system memory, and memory bandwidth, not capacity, is usually what bounds decode speed.

Pin three things and record them in your results: the exact module and carrier, the power mode (nvpmodel) and clock state (jetson_clocks on or off), and the JetPack / CUDA / TensorRT versions. A board left in a low-power nvpmodel profile can post less than half the throughput of the same board at max clocks, and many “Jetson is slow” complaints trace back to an un-pinned power mode.
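One way to make that record automatic is to capture it at the start of every run. The sketch below is a minimal example, assuming `nvpmodel` is on the PATH and `/etc/nv_tegra_release` exists, as the scaffold later in this post also assumes. The function names and the dictionary layout are illustrative, not part of any NVIDIA tooling, and CUDA / TensorRT versions still need to be added by hand.

```python
import shutil
import subprocess
from pathlib import Path


def capture(cmd):
    """Return a command's stdout as text, or None if it is missing or fails."""
    if shutil.which(cmd[0]) is None:
        return None
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=10)
    except (OSError, subprocess.SubprocessError):
        return None
    return result.stdout.strip() if result.returncode == 0 else None


def read_text(path):
    """Return a file's contents, or None if it cannot be read."""
    try:
        return Path(path).read_text().strip()
    except OSError:
        return None


def hardware_manifest(clocks_pinned, notes=""):
    """Collect the board state that should travel with every result."""
    return {
        "power_mode": capture(["nvpmodel", "-q"]),         # assumed to be on PATH
        "l4t_release": read_text("/etc/nv_tegra_release"),  # assumed stock location
        "clocks_pinned": clocks_pinned,  # whether jetson_clocks was run
        "notes": notes,                  # module, carrier, cooling, CUDA/TensorRT
    }
```

Calling `hardware_manifest(clocks_pinned=True, notes="hypothetical module and carrier")` and saving the result next to the timing logs keeps the hardware state attached to the numbers it produced. On a machine without the Jetson tools the two probed fields simply come back as `None`.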

Runtimes: Compare Like With Like

Four runtimes cover the 2026 edge landscape, and they make different trade-offs:

  • llama.cpp (GGUF) — the lingua franca of edge inference. Broad model and quant support, CUDA backend on Jetson, easy to script. Excellent baseline and the most reproducible.
  • TensorRT-LLM — NVIDIA’s compiled-engine path. Usually the best first-token latency and throughput on Jetson when a supported model and engine build exist, at the cost of a heavier build step and less flexibility.
  • ONNX Runtime — a portable middle ground with a CUDA execution provider; useful when you already live in an ONNX toolchain.
  • Ollama — a convenience wrapper around llama.cpp. Great for app integration, but for benchmarking you want to know its underlying flags because defaults (context size, batch, offload) silently change results.

The cardinal rule: never compare a TensorRT number against a llama.cpp number and call it a model comparison. You are comparing runtimes. Hold the runtime fixed when comparing models, and hold the model fixed when comparing runtimes.

Figure 2: The on-device inference stack — quantized weights sit atop a runtime, which drives CUDA kernels and tensor cores on the shared-memory Jetson hardware.

Quantization: Q4, Q8, FP16

Quantization shrinks weights from 16-bit floats to fewer bits, trading a little quality for a lot of memory and bandwidth. Benchmark at least three points: FP16 (the quality baseline, largest footprint), Q8 (roughly half the memory, quality typically very close to baseline), and Q4 (roughly a quarter, the usual edge sweet spot, with a small but real quality cost). Sub-4-bit quants exist and save more memory but increasingly degrade output; benchmark them only if your task is quality-tolerant.
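The "half" and "quarter" ratios follow directly from bits per weight. As a rough sketch, assuming nominal widths of 16, 8, and 4 bits and a hypothetical 4B-parameter model, the weight footprint can be estimated as below. Real quantized files are somewhat larger because block formats store scale factors alongside the weights, so treat the result as a lower bound.

```python
# Nominal bits per weight; actual quantized files carry extra scale data.
NOMINAL_BITS = {"FP16": 16, "Q8": 8, "Q4": 4}


def weight_footprint_gb(params_billion, quant):
    """Lower-bound weight size in GB (1 GB = 1e9 bytes) for a nominal quant."""
    bits = NOMINAL_BITS[quant]
    return params_billion * bits / 8


if __name__ == "__main__":
    for quant in ("FP16", "Q8", "Q4"):
        # 4.0 is a hypothetical parameter count in billions, not a real model.
        print(quant, weight_footprint_gb(4.0, quant), "GB")
```

This covers weights only. KV cache and runtime overhead come on top, which is why measured peak memory sits above these figures.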

Be specific about which quant scheme you test, because “Q4” is not one thing. The GGUF ecosystem alone ships several Q4 variants — Q4_0, Q4_K_S, Q4_K_M and others — that differ in how they group and scale weights, and the “K” mixed-precision schemes generally preserve quality better at a similar size than the older linear ones. Record the exact quant filename in your results, because two posts both claiming “Q4” can differ measurably. TensorRT-LLM and ONNX Runtime use their own quantization paths (INT4/INT8 with calibration), so a TensorRT INT4 result is not interchangeable with a GGUF Q4_K_M result even on the same model. This is another reason to hold the runtime fixed: quantization quality and quantization scheme are entangled with the runtime that produced them.

Figure 3: The quantization trade-off — each step down saves memory and bandwidth but spends output quality, with Q4 the common edge sweet spot.

Metrics: Define Them Precisely

Vague metrics are how benchmarks lie. Define them once and stick to it:

  • TTFT (time to first token) — wall-clock from request submission to the first generated token. Dominated by the prefill phase (processing the whole prompt). Grows with prompt length.
  • Inter-token latency (ITL) — average wall-clock gap between successive generated tokens during decode. This is what makes streaming feel fast or slow.
  • Throughput — generated tokens per second during decode, typically the reciprocal of ITL. Report decode throughput separately from any blended number.
  • Peak VRAM — maximum memory occupied (weights + KV cache + runtime overhead). On unified-memory Jetson, watch total system memory, not a separate GPU pool.
  • Energy per token — average power during generation divided by throughput, in joules per token. The only honest cross-board efficiency metric.
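The definitions above reduce to a few lines of arithmetic once you have a wall-clock timestamp for the request and for each generated token. The following is a minimal sketch under that assumption; how you collect the timestamps depends on your runtime, and the function names are illustrative.

```python
def latency_metrics(submit_time, token_times):
    """Derive TTFT, mean ITL, and decode throughput from timestamps in seconds.

    submit_time: when the request was submitted.
    token_times: arrival time of each generated token, in order.
    """
    if len(token_times) < 2:
        raise ValueError("need at least two generated tokens")
    ttft = token_times[0] - submit_time
    gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    itl = sum(gaps) / len(gaps)
    return {
        "ttft_s": ttft,
        "itl_s": itl,
        "decode_tok_per_s": 1.0 / itl,  # reciprocal of ITL at batch=1
    }


def energy_per_token_j(mean_watts, decode_tok_per_s):
    """Joules per token: watts are joules per second, divided by tokens per second."""
    return mean_watts / decode_tok_per_s
```

Keeping TTFT out of the throughput figure is deliberate: decode throughput is computed only from the gaps between generated tokens, so prompt length cannot leak into it.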

Figure 4: Where each metric lives — TTFT covers tokenize plus prefill up to the first token; ITL and throughput govern the decode loop.

Controlled Conditions: Warmup, Batch=1, Fixed Prompts

Edge inference is almost always batch=1 (one user, one request), so benchmark at batch=1 unless you have a genuine batching use case. Use a fixed prompt set with controlled prompt lengths — say short (~64 tokens), medium (~512), and long (~2048) — because TTFT scales with prompt length and a single prompt hides that. Always warm up: the first run pays one-time costs (kernel autotuning, memory allocation, file cache population), so discard the first 1–3 runs and average the rest. Report the mean and the spread; a tight board posts low variance, a throttling board posts a widening tail.
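Discarding warmup and reporting spread is easy to get wrong by hand, so it is worth scripting. A minimal sketch, assuming you already have one throughput (or TTFT) value per run in run order; the function name and the default of one warmup run are illustrative choices:

```python
import statistics


def summarise_runs(values, warmup=1):
    """Drop the first `warmup` runs, then report mean and spread of the rest."""
    measured = list(values)[warmup:]
    if len(measured) < 2:
        raise ValueError("need at least two measured runs after warmup")
    return {
        "n": len(measured),
        "mean": statistics.mean(measured),
        "stdev": statistics.stdev(measured),  # sample standard deviation
        "min": min(measured),
        "max": max(measured),
    }
```

Reporting min and max alongside the standard deviation makes a widening tail visible even when the mean looks healthy.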

Measuring Power on Jetson with tegrastats

Jetson exposes onboard power rails through tegrastats, which is the practical way to get energy-per-token without external instrumentation. Sample it in the background while a fixed generation runs, then integrate. The snippet below is illustrative scaffolding, not measured output — run it on your own board:

# --- Reproducible Jetson SLM micro-benchmark (illustrative scaffold) ---
# 1. Pin the board to max performance and record state
sudo nvpmodel -m 0          # max-power mode (check modes with: nvpmodel -q)
sudo jetson_clocks          # lock clocks high
nvpmodel -q ; cat /etc/nv_tegra_release   # record for your results notes

# 2. Start power logging in the background (samples every 100 ms)
tegrastats --interval 100 --logfile power.log &
TEGRA_PID=$!

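# 3. Run a FIXED prompt, batch=1, with timing. Example: llama.cpp
#    -n caps generated tokens; -p is the fixed prompt; -ngl 99 offloads all
#    layers to the GPU (some builds otherwise default to CPU); repeat for warmup.
for run in 1 2 3 4 5; do
  echo "=== run ${run} ===" >> timing.log   # marker so runs can be separated later
  /usr/bin/time -v ./llama-cli \
    -m ./models/model-Q4_K_M.gguf \
    -ngl 99 \
    -p "Summarise the following machine fault log in two sentences: ..." \
    -n 128 --no-display-prompt 2>> timing.log
done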

# 4. Stop power logging and compute averages offline
kill "$TEGRA_PID"
# Parse timing.log for tokens/sec and TTFT; parse power.log for mean watts
# over the samples taken while tokens were being generated (not model load);
# energy_per_token (J) = mean_watts / tokens_per_second.
# Discard run 1 as warmup; report mean and std-dev of runs 2..5.
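The offline parsing step can be sketched in a few lines. This example assumes that each tegrastats sample reports a rail as `NAME <instant>mW/<average>mW` and that the board-level input rail is called `VDD_IN`. Both are assumptions: rail names and the exact line format differ between Jetson modules and JetPack releases, so inspect your own power.log and adjust the `rail` argument and the pattern before trusting the result.

```python
import re


def mean_rail_watts(log_text, rail="VDD_IN"):
    """Average the instantaneous reading of one power rail, in watts.

    Assumes samples look like '<rail> 5071mW/5071mW' (instant/average).
    The rail name is an assumption; check the names in your own log.
    """
    pattern = re.compile(rf"\b{re.escape(rail)} (\d+)mW/(\d+)mW")
    samples = [int(match.group(1)) for match in pattern.finditer(log_text)]
    if not samples:
        raise ValueError(f"rail {rail!r} not found; check the rail names in your log")
    return sum(samples) / len(samples) / 1000.0
```

Passing the contents of power.log gives a mean over the whole logging window. For energy per token, slice the log to the generation window first so that model-load time does not dilute the average.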

For Ollama users, the equivalent timing comes from the --verbose flag on ollama run, which prints prompt-eval and eval token rates per request; setting OLLAMA_DEBUG=1 adds more detailed logging. Pin the context size explicitly, because Ollama’s default context can differ from your llama.cpp run and invalidate the comparison.
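If you drive Ollama programmatically instead of through the CLI, the same rates can be derived from the timing fields in a non-streaming generate response. The sketch below assumes the response carries `prompt_eval_count`, `prompt_eval_duration`, `eval_count`, and `eval_duration` with durations in nanoseconds, and that `num_ctx` and `num_predict` are accepted under `options`; verify these names against the current Ollama API documentation. The helper names are illustrative.

```python
NS_PER_S = 1e9  # assumed: Ollama reports durations in nanoseconds


def generate_request(model, prompt, num_ctx, num_predict):
    """Build a request body that pins context size and output length explicitly."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_ctx": num_ctx, "num_predict": num_predict},
    }


def ollama_rates(response):
    """Compute prefill and decode token rates from a generate response dict."""
    prefill_s = response["prompt_eval_duration"] / NS_PER_S
    decode_s = response["eval_duration"] / NS_PER_S
    return {
        "prefill_tok_per_s": response["prompt_eval_count"] / prefill_s,
        "decode_tok_per_s": response["eval_count"] / decode_s,
    }
```

Setting `num_ctx` in every request, instead of relying on the default, is what keeps an Ollama run comparable to a llama.cpp run with the same context size.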

Results: An Illustrative Table and How to Read It

Read this first. The table below contains illustrative, typical-order-of-magnitude figures, NOT numbers we measured for you. They exist to show the shape of results and the column structure you should fill in. Absolute values depend heavily on your exact board, power mode, runtime version, prompt length, and thermal conditions. You must reproduce these on your own hardware. Do not cite these as measured benchmarks.

| Model (class) | Params | Quant | Tokens/sec (decode) | First-token latency (TTFT) | Peak VRAM | Power | $/1M tokens |
|---|---|---|---|---|---|---|---|
| Qwen small | ~0.5B | Q4 | 80–140 | 40–90 ms | ~0.6–1 GB | 7–12 W | very low* |
| Phi mini | ~3–4B | Q4 | 18–35 | 120–300 ms | ~2.5–3.5 GB | 10–18 W | low* |
| Phi mini | ~3–4B | Q8 | 10–22 | 150–350 ms | ~4–5 GB | 12–20 W | low* |
| Gemma small | ~2B | Q4 | 30–55 | 90–200 ms | ~1.5–2.5 GB | 9–16 W | low* |
| Gemma mid | ~9B | Q4 | 6–14 | 250–600 ms | ~5.5–7 GB | 15–25 W | moderate* |
| Qwen mid | ~7B | Q4 | 7–16 | 220–550 ms | ~4.5–6 GB | 14–24 W | moderate* |

* Edge $/1M tokens is an amortised capital figure, not a per-call price. Compute it as (device cost + cost of energy over the service life) ÷ (tokens generated over the service life), then multiply by 1,000,000. Ranges assume a mid-tier Jetson Orin board; on a smaller Orin Nano the larger models may not fit at all, and on AGX-class boards throughput rises. Numbers above are illustrative ranges, not measurements.
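The amortised figure in the footnote can be worked through as follows. This is a simplified sketch with stated assumptions: the board draws `mean_watts` for its whole service life (a pessimistic simplification that ignores lower idle power), and it generates tokens for only a `utilisation` fraction of that time. All parameter names are illustrative.

```python
def edge_cost_per_million_tokens(device_cost, mean_watts, price_per_kwh,
                                 lifetime_hours, utilisation, tok_per_s):
    """Amortised cost per 1M tokens: (device + energy) / lifetime tokens * 1e6.

    Assumes constant power draw over the whole service life and token
    generation during a `utilisation` fraction (0..1) of it.
    """
    if not 0 < utilisation <= 1:
        raise ValueError("utilisation must be in (0, 1]")
    energy_kwh = mean_watts / 1000.0 * lifetime_hours
    energy_cost = energy_kwh * price_per_kwh
    lifetime_tokens = tok_per_s * 3600.0 * lifetime_hours * utilisation
    return (device_cost + energy_cost) / lifetime_tokens * 1e6
```

Because the device cost is fixed, doubling `utilisation` roughly halves the result, which is the amortisation effect the footnote describes. Feed it your own measured throughput and power, not the illustrative ranges in the table.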

Three trends are robust enough to state regardless of the exact numbers. First, quantization buys speed mostly through bandwidth. Decode on these boards is memory-bandwidth bound, because every token requires streaming the weights, so each halving of the bytes per weight (FP16 to Q8, Q8 to Q4) brings a roughly corresponding gain in throughput, which is why Q4 is the edge default. Second, parameter count dominates decode latency. A 7B model is fundamentally slower to decode than a 2B model at the same quant because there are more weights to stream per token; no runtime flag changes that arithmetic. Third, prompt length punishes TTFT, not throughput. A long prompt makes prefill longer (hence higher TTFT) but, once decode begins, inter-token latency is largely independent of how long the prompt was. If your app feels slow on long inputs, the fix is prefill optimisation (or a shorter prompt), not a faster decode kernel.
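The bandwidth argument gives a quick upper bound you can compute before running anything. The sketch below assumes decode reads every weight once per token and ignores KV-cache traffic and compute, so real throughput will land below it. The bandwidth value is whatever your board's datasheet states; the 100 GB/s in the example is a hypothetical placeholder, not a figure for any specific Jetson.

```python
def decode_ceiling_tok_per_s(params_billion, bits_per_weight, bandwidth_gb_s):
    """Upper bound on batch=1 decode rate: memory bandwidth / bytes of weights."""
    weight_gb = params_billion * bits_per_weight / 8
    return bandwidth_gb_s / weight_gb


if __name__ == "__main__":
    # Hypothetical 8B-parameter model on a hypothetical 100 GB/s memory system.
    for label, bits in (("FP16", 16), ("Q8", 8), ("Q4", 4)):
        print(label, decode_ceiling_tok_per_s(8.0, bits, 100.0), "tok/s ceiling")
```

If a measured number sits far below this ceiling, suspect an un-pinned power mode, CPU fallback, or throttling before blaming the model.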

Notice what the table makes obvious: the sub-1B Qwen is the only model whose range sits at 80 tokens/sec or above, while the 7B to 9B-class models can drop into single digits. For a streaming chat UX, where ~10–15 tokens/sec already outpaces most people’s reading speed, the mid-class models are usable; for a real-time control assistant needing sub-100ms responsiveness, you are pushed toward the small end or toward TensorRT-LLM tuning. This is the same kind of latency-versus-capacity trade-off we explore for streaming analytics in our guide to time-series forecasting at the edge in production.

One subtlety the table cannot show but every real deployment hits: the KV cache grows with context, and on a shared-memory board it competes directly with the weights for the same LPDDR5 pool. A 7B model at Q4 might leave comfortable headroom at a 512-token context and then run out of memory at 8K tokens, because the KV cache for a long context can rival the size of the weights themselves. This is why the peak-VRAM column must be reported at your maximum intended context, not at a convenient short prompt — a benchmark that quotes peak memory for a 64-token prompt is hiding the failure mode that will actually take down production. When you fill in the table for your own board, run the long-context row first; if it OOMs, you have learned the most important thing before wasting time on the easy cases.
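To see how quickly the cache grows, it helps to compute it. The standard estimate is two tensors (key and value) per layer, each of size KV heads × head dimension × context length. The sketch below assumes a 16-bit cache and uses hypothetical architecture numbers; take the real layer count, KV head count, and head dimension from the model's configuration file.

```python
def kv_cache_gb(n_layers, n_kv_heads, head_dim, context_tokens, bytes_per_element=2):
    """KV cache size in GB (1e9 bytes) for one sequence at a given context length."""
    # Factor 2: one key tensor and one value tensor per layer.
    total_bytes = (2 * n_layers * n_kv_heads * head_dim
                   * context_tokens * bytes_per_element)
    return total_bytes / 1e9


def fits_in_memory(weight_gb, kv_gb, other_gb, budget_gb):
    """True if weights, KV cache, and everything else fit in the shared pool."""
    return weight_gb + kv_gb + other_gb <= budget_gb


if __name__ == "__main__":
    # Hypothetical model: 32 layers, 8 KV heads, head dimension 128.
    for ctx in (512, 2048, 8192):
        print(ctx, "tokens:", round(kv_cache_gb(32, 8, 128, ctx), 3), "GB")
```

The cache scales linearly with context, and a model without grouped-query attention (more KV heads) multiplies it again, which is how a long context can approach the size of Q4 weights.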

It is also worth stating what these numbers are not good for. They cannot rank model intelligence, and they cannot tell you whether an answer is correct. A benchmark of this kind measures the cost of generating tokens, full stop. Pairing it with a task-level quality evaluation — accuracy on your real prompts, scored by a rubric or a held-out set — is non-negotiable, because the entire decision is a quality-per-watt-per-millisecond trade-off and you have only measured two of those three axes here.

Trade-offs, Gotchas, and What Goes Wrong

The honest part. Several effects routinely invalidate edge benchmarks.

Quality drop at aggressive quantization is real and task-dependent. Q4 is usually fine for summarisation and classification but can degrade multi-step reasoning, code generation, and instruction-following more than a perplexity number suggests. Always evaluate output quality on your actual task, not just speed — a fast wrong answer is the worst outcome. Sub-4-bit quants amplify this sharply.

Thermal throttling silently corrupts results. A passively cooled board in a sealed enclosure heats up over a sustained run; clocks drop, throughput falls, and your last runs look slower than your first — the opposite of warmup. If your variance widens over time, you are watching thermals, not noise. Benchmark in the enclosure and ambient temperature you will actually deploy in, and log temperature alongside power.
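A simple way to tell thermals from noise is to compare early and late runs in a sustained session. The following sketch assumes per-run decode throughput in run order with warmup already removed. The one-third split and the 5% threshold are arbitrary illustrative choices, not established cut-offs.

```python
def throughput_drift(tok_per_s_by_run):
    """Relative change between the mean of the first and last third of runs."""
    n = len(tok_per_s_by_run)
    if n < 6:
        raise ValueError("need at least six runs to estimate drift")
    k = n // 3
    early = sum(tok_per_s_by_run[:k]) / k
    late = sum(tok_per_s_by_run[-k:]) / k
    return (late - early) / early


def looks_throttled(tok_per_s_by_run, threshold=-0.05):
    """Flag a sustained slowdown; the threshold is an illustrative default."""
    return throughput_drift(tok_per_s_by_run) <= threshold
```

A flagged session is a prompt to check the temperature log, not proof of throttling; a background process or a memory leak can produce the same shape.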

Memory limits cause hard failures, not graceful slowdowns. On a unified-memory Jetson, weights plus KV cache plus the OS plus your application all share one LPDDR5 pool. A model that “fits” with a short context can OOM the moment a long prompt grows the KV cache. Always benchmark at your maximum intended context length, not a convenient short one.

Tokenizer differences make token-count comparisons apples-to-oranges. Phi, Gemma, and Qwen use different tokenizers, so the same English sentence becomes a different number of tokens in each. Tokens/sec across families is therefore not directly comparable as useful work per second; normalise to characters or words per second if you need a fair cross-family efficiency view.
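The normalisation is a one-line conversion once you know how many tokens each model's tokenizer produces for the same reference text. A minimal sketch, with token counts supplied by the caller (obtain them from each model's own tokenizer; the numbers in the example are hypothetical):

```python
def chars_per_second(tok_per_s, reference_text, n_tokens):
    """Convert tokens/sec into characters/sec for a shared reference text."""
    if n_tokens <= 0:
        raise ValueError("n_tokens must be positive")
    return tok_per_s * len(reference_text) / n_tokens


if __name__ == "__main__":
    text = "x" * 400  # stand-in for a representative sample of your real output
    # Hypothetical: model A emits 30 tok/s and needs 100 tokens for the text,
    # model B emits 28 tok/s but needs only 80 tokens for the same text.
    print("A:", chars_per_second(30.0, text, 100), "chars/s")
    print("B:", chars_per_second(28.0, text, 80), "chars/s")
```

In this made-up case model B has the lower token rate but delivers more text per second, which is exactly the distortion a raw tokens/sec comparison hides.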

Practical Recommendations

Picking a model and quant is a constraint-satisfaction problem: start from your latency and quality budget and work backward to the largest model that fits.

Figure 5: A selection decision tree — branch on your latency budget, then quality criticality, then memory headroom, and finally tune the runtime.

  • Start with the smallest model that clears your quality bar, not the largest that fits. Over-provisioning model size is the most common edge mistake; it burns latency and power you cannot spare.
  • Default to Q4 and only step up to Q8/FP16 if your task-level quality eval shows Q4 failing. Verify with your prompts, not a generic benchmark.
  • Benchmark TensorRT-LLM before you give up on a model for latency — a model that misses your TTFT budget under llama.cpp may clear it once compiled into a TensorRT engine.
  • Pin power mode and clocks (nvpmodel, jetson_clocks) and document them; an un-pinned board makes every other number meaningless.
  • Test at max context and in-enclosure thermals, not on a cool open bench at a short prompt.
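The first recommendation, smallest model that clears the bar, can be written as a filter followed by a minimum. This is a sketch only: the `Candidate` fields, the 0..1 quality score, and the budgets are hypothetical stand-ins for whatever your own benchmark and task evaluation produce.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    params_billion: float
    quality: float      # task-level score on your own prompts, 0..1
    ttft_ms: float      # measured at your typical prompt length
    peak_mem_gb: float  # measured at your maximum intended context


def pick_model(candidates, min_quality, max_ttft_ms, mem_budget_gb):
    """Return the smallest candidate meeting every constraint, or None."""
    viable = [
        c for c in candidates
        if c.quality >= min_quality
        and c.ttft_ms <= max_ttft_ms
        and c.peak_mem_gb <= mem_budget_gb
    ]
    return min(viable, key=lambda c: c.params_billion, default=None)
```

Returning `None` when nothing qualifies is intentional: it signals that the budget, the quant, or the runtime has to change, not that the least-bad model should ship.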

Quick checklist before you trust a result: (1) power mode and JetPack/CUDA versions recorded; (2) warmup runs discarded; (3) batch=1 and fixed prompt set; (4) TTFT, ITL, peak VRAM, and energy/token all captured; (5) quality evaluated on the real task; (6) variance reported, not just a single best run.

FAQ

What is an SLM, and how is it different from an LLM?
A small language model (SLM) is a transformer roughly in the 0.5B–9B parameter range, small enough to run on a single edge accelerator without model parallelism. The distinction from a large language model is practical, not formal: an SLM fits and runs usefully on a Jetson-class device after quantization, whereas a frontier LLM needs data-centre GPUs. SLMs trade some capability for the ability to run locally with low latency, full privacy, and no per-token cloud cost.

Which is best for Jetson: Phi, Gemma, or Qwen?
There is no universal winner; it depends on your task, size budget, and tokenizer fit. Phi models are strong reasoners for their size, Gemma offers a well-tooled open lineup, and Qwen spans the widest size range with strong multilingual support. Benchmark all three at the size and quant you can afford on your board, evaluate quality on your actual prompts, and pick the one that clears your quality bar at the lowest latency. Verify current specs on each model card.

How do I measure edge LLM latency correctly in 2026?
Separate the two phases. Time-to-first-token (TTFT) measures prefill and grows with prompt length; inter-token latency and throughput measure the decode loop and depend on model size and quantization. Benchmark at batch=1 with a fixed prompt set across short, medium, and long lengths, discard warmup runs, and report mean and variance. A single blended “tokens/sec” number hides the prompt-length effect and is not reproducible.

Does quantization hurt model quality on edge?
Yes, but how much depends on the level and the task. Q8 is typically near-indistinguishable from FP16. Q4 — the common edge default — costs a small, real amount of quality that summarisation and classification tolerate well but multi-step reasoning and code generation may not. Sub-4-bit quantization degrades output noticeably. Always evaluate quality on your specific task rather than relying on a perplexity figure, because perplexity understates reasoning degradation.

How do I estimate on-device inference cost?
Edge cost is amortised capital, not a per-call price. Compute it as (device hardware cost + energy consumed over the device’s service life) divided by total tokens generated over that life. Because the device is paid for once, cost per token falls as utilisation rises — the opposite of cloud pricing. The hidden cost is failure to meet the SLA: a model that thermally throttles or OOMs has effectively infinite cost. Compare against the per-token cloud model in our vLLM economics deep dive.

Can I just use Ollama for benchmarking?
Ollama is excellent for app integration and quick checks, but for a trustworthy benchmark you must pin its hidden parameters, because context size, batch, and GPU offload defaults can silently change results. Use ollama run --verbose to read prompt-eval and eval token rates (with OLLAMA_DEBUG=1 for more detailed logging), and set the context length explicitly so it matches your llama.cpp or TensorRT-LLM runs. Otherwise you are comparing different configurations, not different models.

Further Reading and References

  • NVIDIA Jetson Developer Documentation — authoritative reference for Jetson modules, power modes (nvpmodel), jetson_clocks, and tegrastats power monitoring.
  • llama.cpp project (ggml-org) — the reference open-source runtime for GGUF inference and the most reproducible edge baseline, including its CUDA backend.
  • Model cards on Hugging Face for the Phi, Gemma, and Qwen families — the only authoritative, up-to-date source for parameter counts, context lengths, licenses, and quantized GGUF releases. Always verify specifics there before quoting them.

Internal deep-dives: Time-series forecasting at the edge in production · vLLM cost economics deep dive
