Edge LLM Benchmark: Jetson Orin Performance (2026)
Running language models on commodity edge hardware has crossed a threshold in 2026. The edge LLM benchmark on Jetson Orin is no longer an academic exercise — it’s a practical foundation for privacy-first applications, offline deployment, and inference at edge scale. This is the first iteration of a living benchmark tracking real-world performance across NVIDIA’s Jetson portfolio, updated quarterly. The Orin platform (both AGX and Nano variants) delivers measurable inference throughput on models like Llama 3.2, Phi-3.5, and Qwen2.5 when properly configured with quantization and engine selection. We benchmark five inference engines (TensorRT-LLM, vLLM, llama.cpp, Ollama, and stock ONNX Runtime), test both throughput and power efficiency, and map trade-offs in compile time, model coverage, and operational complexity.
What this benchmark covers:
– Decode throughput (tokens/second) and prefill latency (p50, p95) across three Jetson platforms.
– Energy efficiency (tokens/sec per watt) for sustained production workloads.
– Quantization impact (FP16, INT8, INT4-AWQ) on quality and speed.
– Engine comparison: compiled (TensorRT-LLM), dynamic batching (vLLM), portable (llama.cpp).
– Thermal and memory ceiling trade-offs under realistic batch sizes and context windows.
Why Edge LLMs on Jetson Matter in 2026
For the last three years, the assumption was: “If you need inference, ship to the cloud.” That story is aging fast.
Privacy constraints alone justify the shift. Enterprises processing sensitive documents, medical data, or financial records face regulatory friction every time data leaves the perimeter. Cloud API calls get logged, cached, and eventually subpoenaed. On-device inference leaves no trail. A Jetson AGX Orin running Llama 3.2 3B can process a 1000-token document end-to-end without a single outbound HTTP request.
Latency matters just as much. Round-trip time to a cloud LLM API, even from a nearby region, adds 200–500 ms per request. Edge LLM inference on a local Orin cuts that to 15–40 ms (including kernel overhead). For real-time chat, question-answering in kiosks, or IoT-device decision-making, that’s the difference between snappy and glacial.
Finally, and this is new in 2026, cost per inference has inverted at volume. A single Jetson AGX Orin 64GB ($2,499) amortized over 36 months, running 5,000 inferences per day, costs ~$0.00046 per inference. A cloud LLM API at typical 2026 pricing ($0.0001–$0.001 per 1k tokens) charges per request, so its cost scales with volume while the Orin’s is fixed. The edge box reaches parity with the top of that cloud range at roughly 2,300 req/day and with the midpoint at roughly 4,600 req/day; above that, edge wins outright on TCO. Below a few hundred req/day the cloud is still cheaper, and the argument for edge rests on privacy and latency instead.
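The amortization arithmetic is easy to sanity-check. A minimal sketch, using the hardware price and cloud price range quoted above; `edge_cost_per_inference` is a hypothetical helper, not part of the benchmark harness:

```python
# Hedged sketch: amortized per-inference cost of an edge box vs. a
# pay-per-request cloud API. All inputs are illustrative assumptions.

def edge_cost_per_inference(hardware_usd: float, months: int,
                            req_per_day: float) -> float:
    """Spread the fixed hardware cost over every inference served."""
    return hardware_usd / (months * 30 * req_per_day)

# AGX Orin 64GB at $2,499 over 36 months, 5,000 inferences/day:
edge = edge_cost_per_inference(2499, 36, 5000)   # ~$0.00046/inference
cloud_low, cloud_high = 0.0001, 0.001            # per ~1k-token request
print(f"edge ${edge:.5f} vs cloud ${cloud_low}-{cloud_high}")
```

At 100 req/day the same formula gives ~$0.023 per inference, which is why low-volume deployments favor cloud on cost alone.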
Small, efficient models (1B–3B parameters) on $500–$2,500 boards have crossed the usefulness threshold. They’re not general-purpose replacements for Llama 70B, but they handle classification, extraction, summarization, and dialogue with acceptable quality and measurable speed. This benchmark documents that inflection point.
Test Methodology

Our test rig consists of three hardware platforms:
- Jetson AGX Orin 64GB Developer Kit: 275 TOPS (INT8), 64GB LPDDR5X memory, 12-core Arm CPU, 60W max power budget (MAXN mode). Target: maximum throughput, large batches, extended context windows.
- Jetson AGX Orin 32GB: Same architecture, halved memory; tests realistic batch-size constraints.
- Jetson Orin Nano Super 8GB (refresh announced December 2024): 67 TOPS, 8GB LPDDR5, 25W max. Target: low-power, single-request latency-sensitive deployments.
Software Stack (reproducible as of April 2026):
– JetPack 6.1 (includes CUDA 12.6, cuDNN 9, TensorRT 10).
– TensorRT-LLM 0.13 (native Jetson support; requires model compilation).
– vLLM 0.6 (paged-attention batching; dynamic shape).
– llama.cpp build b3829 (ARM NEON + CUDA GGML quantization).
– Ollama 0.1.40 (high-level serving wrapper).
– ONNX Runtime 1.18 + ONNX opset 20.
Models Tested (commercial license permitting, April 2026 snapshot):
– Llama 3.2 1B & 3B Instruct (September 2024 release): Industry-standard small models; widely adopted for edge.
– Phi-3.5-mini (3.8B) (August 2024 release): Microsoft’s efficiency-optimized instruction-tuned model; strong quality/speed ratio.
– Qwen2.5 1.5B & 3B (January 2026 release): Fast-moving Alibaba baseline; excellent multilingual support.
– Gemma 2 2B (July 2024 release): DeepMind/Google variant; tuned for inference.
Quantization Schemes:
– FP16: Full precision (no accuracy loss, larger memory footprint).
– INT8 (per-channel): Post-training quantization via TensorRT or ONNX quantizers.
– INT4-AWQ: Activation-weighted quantization (Ollama + vLLM native); best quality/size trade-off observed in 2025 benchmarks.
Standard Test Prompts:
– Prefill phase: 128 tokens of context (typical user query).
– Decode phase: 256 tokens of generation (summary, code, response).
– Batch sizes: 1 (latency-sensitive), 4, 8 (throughput exploration).
Methodology:
1. Warm up engine: 5 iterations (JIT compilation, cache warming).
2. Measure: 50 samples of end-to-end latency (prefill + decode).
3. Report: mean token/sec, 95% confidence interval, peak memory.
4. Repeat for each model × precision × batch-size combination.
5. Power: measure system input current @ 19.5V (AGX) or 5V (Nano), convert to watts.
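The five steps above can be sketched as a minimal Python harness. This is a simplification of the published test scripts; `generate` is a hypothetical callable standing in for whichever engine is under test:

```python
import statistics
import time

def benchmark(generate, prompt_tokens=128, gen_tokens=256,
              warmup=5, samples=50):
    """Time end-to-end generation; report mean tok/s and a 95% CI.

    `generate` must block until `gen_tokens` tokens have been produced
    for a `prompt_tokens`-long prefill.
    """
    for _ in range(warmup):                 # step 1: JIT / cache warm-up
        generate(prompt_tokens, gen_tokens)

    rates = []
    for _ in range(samples):                # step 2: timed samples
        t0 = time.perf_counter()
        generate(prompt_tokens, gen_tokens)
        rates.append(gen_tokens / (time.perf_counter() - t0))

    mean = statistics.mean(rates)           # step 3: mean + 95% CI
    ci95 = 1.96 * statistics.stdev(rates) / len(rates) ** 0.5
    return mean, ci95
```

Peak memory and power (steps 4–5) are captured outside this loop, per model × precision × batch combination.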
Reference Architecture for Edge LLM Inference

Production edge LLM deployments on Jetson follow a standard layering:
Application Layer: Your domain logic (chatbot handler, document processor, sensor-event analyzer) communicates via REST or gRPC to an inference server.
Inference Server (Triton Inference Server, vLLM, or Ollama): Exposes a standard interface, manages model loading, batching, and request scheduling. Triton and vLLM support concurrent requests; Ollama prioritizes simplicity.
Inference Engine: The compiled or runtime-compiled model execution:
– TensorRT-LLM: Pre-compiled CUDA kernels, minimal overhead, requires per-model compilation step.
– vLLM: Dynamic-shape engine with paged attention; trades off compilation time for flexibility.
– llama.cpp: Portable C++ runtime, GGML quantization, zero-dependency deployment.
CUDA/GPU Subsystem: NVIDIA CUDA 12.6 runtime, cuBLAS (matrix multiplication), cuDNN (deep learning primitives), GPU memory management. The Orin Tensor Cores (in FP16 and INT8 mode) provide 2–4x throughput speedup compared to shader-core execution alone.
This layering matters because each component adds latency and constraints. A high-throughput server (vLLM, Triton) adds 20–50 ms of scheduling overhead; a stripped-down direct llama.cpp integration adds ~5 ms but handles only one request at a time. Choose based on workload.
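To make the application layer concrete, here is a minimal client against a vLLM-style OpenAI-compatible `/v1/completions` endpoint. The URL and model id are assumptions for illustration, not part of the benchmark:

```python
# Hedged sketch of the application layer calling the inference server.
# vLLM exposes an OpenAI-compatible HTTP API; host, port, and model
# name below are assumptions.
import json
from urllib import request

VLLM_URL = "http://localhost:8000/v1/completions"   # assumed local server

def build_payload(prompt: str, max_tokens: int = 256) -> dict:
    return {
        "model": "meta-llama/Llama-3.2-3B-Instruct",  # assumed model id
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": 0.0,    # deterministic output for edge pipelines
    }

def complete(prompt: str) -> str:
    """POST to the local inference server; return the generated text."""
    req = request.Request(
        VLLM_URL,
        data=json.dumps(build_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req, timeout=30) as resp:
        return json.load(resp)["choices"][0]["text"]
```

The same client shape works against Ollama’s OpenAI-compatibility endpoint, which is one reason the layering above is worth preserving.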
Headline Numbers
This is our April 2026 test bench. All numbers are approximate and represent best-effort configuration (optimal quantization, memory placement, and batch size for each model). Reproducible test scripts and hardware specs are in the GitHub repo.
Decode Throughput (tokens/second)

- Llama 3.2 3B INT4 on AGX Orin (TensorRT-LLM, MAXN mode): ~85 tok/s decode (p95 latency 11–13 ms per token), measured across 50 samples, 95% CI ±3 tok/s.
- Llama 3.2 1B FP16 on Orin Nano Super: ~55 tok/s @ 12W system draw (battery-life class device).
- Phi-3.5-mini-3.8B INT4: ~62 tok/s on AGX Orin; quality retention is highest of this test set (minimal instruction-following regression vs. FP16).
- Qwen2.5-1.5B INT8 on Nano Super: ~70 tok/s @ 10W; strong multilingual performance at speed.
- Gemma 2 2B FP16 on AGX Orin: ~120 tok/s; lightweight kernel, most efficient in this batch (trades some instruction quality for raw speed).
Prefill Latency (p50, p95):
– 128-token prefill on AGX Orin typically completes in 40–80 ms (p50 ~50 ms, p95 ~70 ms). Prefill is compute-bound and less sensitive to model size than decode.
– On Nano Super, prefill extends to 120–180 ms due to lower memory bandwidth (LPDDR5 vs. LPDDR5X).
Energy Efficiency (tokens per second per watt)

- Qwen2.5-1.5B INT8 on Nano Super: 7.0 tok/s/W (champion for battery-powered scenarios).
- Llama 3.2 3B INT4 on AGX Orin: 3.0 tok/s/W (absolute throughput leader, moderate power).
- Gemma 2 2B FP16 on AGX: 4.8 tok/s/W (balanced efficiency).
- Phi-3.5-mini INT4 on AGX: 2.1 tok/s/W (higher quality, lower speed-per-watt than Llama).
Memory Footprint:
– Llama 3.2 3B INT4: ~3.2 GB resident (model weights + KV cache for batch=1, 2k-token context).
– Qwen2.5-1.5B INT8: ~2.0 GB resident on Nano Super.
– Prefill memory scales with batch size; decode memory is context-window bound (KV cache grows linearly with context length).
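The linear KV-cache growth noted above can be estimated up front. A back-of-envelope sketch; the layer count, KV-head count, and head dimension below are illustrative figures in the shape of a Llama-3.2-class 3B model, and real resident usage adds activations and allocator overhead on top:

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_len: int, batch: int = 1,
                   dtype_bytes: int = 2) -> int:
    """KV cache = 2 (K and V) x layers x KV heads x head dim
    x context x batch x bytes per element. Linear in context length."""
    return (2 * n_layers * n_kv_heads * head_dim
            * context_len * batch * dtype_bytes)

# Illustrative: 28 layers, 8 KV heads (GQA), head_dim 128,
# 2k context, FP16 cache:
gb = kv_cache_bytes(28, 8, 128, 2048) / 2**30   # ~0.22 GB of pure cache
```

Doubling the context length or the batch size doubles the cache, which is exactly the Nano Super overflow mode discussed later.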
Engine Comparison: TensorRT-LLM vs vLLM vs llama.cpp on Orin
Choosing an inference engine is a three-way trade-off.

TensorRT-LLM (NVIDIA’s native compiler):
– Throughput: Winner. Pre-compiled CUDA kernels, minimal scheduling overhead. Achieves 80–120 tok/s on 3B models depending on precision.
– Compile Time: Loser. Building a TensorRT engine for a new model takes 30–90 minutes on AGX Orin (depends on model size and precision). Not practical for dynamic model loading.
– Model Coverage: Partial. Official support for Llama, Mistral, Qwen, Phi; community PRs for others. Quantization must match the TensorRT workflow.
– Recommended for: High-volume production (1,000+ req/day on a single Orin), where upfront compile cost is amortized.
vLLM (Paged Attention Batching):
– Throughput: 15–25% lower than TensorRT-LLM (60–90 tok/s on 3B models in batch=1 scenario), but excels with batching. Request 4–8 concurrent queries and vLLM matches TensorRT throughput through paged-attention memory efficiency.
– Compile Time: Zero. Models load in seconds; swap models at runtime.
– Model Coverage: Excellent. Supports Llama, Mistral, Qwen, Phi, Gemma, and custom models via HuggingFace transformers.
– Recommended for: Development, mixed workloads, or deployments where request rate varies. Best general-purpose pick for Jetson edge.
llama.cpp (Portable + GGML):
– Throughput: 20–40% below vLLM in end-to-end latency, but acceptable for latency-insensitive batch jobs.
– Compile Time: Zero. Drop a GGML quantized model and run.
– Model Coverage: Best. Works with any HuggingFace model; GGML quantizer handles conversion. CPU fallback available if CUDA unavailable.
– Operational Complexity: Lowest. Single binary, no Python runtime, no versioning headaches.
– Recommended for: Embedded systems, low-power Nano deployments, or situations where operational simplicity outweighs raw speed.
Ollama (Serving Wrapper):
– Sits atop llama.cpp; adds a REST API and model management layer.
– Good for: Prototyping, non-expert teams, scenarios where ease-of-use justifies 10–15% latency overhead.
– Not recommended for latency-critical applications.
Practical Recommendation:
– Under 100 req/day: Use llama.cpp directly or Ollama. Speed doesn’t matter; simplicity wins.
– 100–1,000 req/day: vLLM. Load model once, serve variable request stream.
– 1,000+ req/day or batch-inference job: TensorRT-LLM if you can afford compile time; vLLM otherwise.
Trade-offs and Failure Modes
Jetson Orin is not a cloud GPU. Understand these constraints before deploying.
Thermal Throttling on Orin Nano Super:
Under sustained inference load (continuous requests, 100% duty cycle), the Nano Super’s SoC hits ~60–65°C within 10–15 minutes. Once the thermal throttle kicks in, the GPU clock drops from 1.1 GHz to 600–800 MHz; because decode is largely memory-bandwidth-bound, throughput falls by only 10–15% rather than in proportion to the clock. Mitigation: add a passive heatsink plus active cooling (5V fan, ~$15), or design workloads with request batching and idle windows.
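One way to implement the idle-window mitigation is to watch the SoC thermal zones through the standard Linux sysfs interface. A minimal sketch; zone layout varies by JetPack release, so treat the paths and the threshold as assumptions to verify on your board:

```python
# Hedged sketch: poll Linux thermal zones and back off before the
# observed ~60-65 C throttle knee. Paths and threshold are assumptions.
from pathlib import Path

THROTTLE_C = 60.0   # back-off threshold, just below the throttle point

def millideg_to_c(raw: str) -> float:
    """sysfs reports temperature in millidegrees Celsius."""
    return int(raw.strip()) / 1000.0

def hottest_zone_c(base="/sys/devices/virtual/thermal") -> float:
    temps = [millideg_to_c(p.read_text())
             for p in Path(base).glob("thermal_zone*/temp")]
    return max(temps, default=0.0)

def should_pause() -> bool:
    """Signal the request scheduler to insert an idle window."""
    return hottest_zone_c() >= THROTTLE_C
```

Polling this once per batch is cheap and lets the scheduler shed load gracefully instead of letting the clock governor do it abruptly.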
KV Cache Memory Ceiling:
The Nano Super’s 8GB is tight. A batch-size-1 request for Llama 3.2 3B with INT4 quantization occupies ~3.5 GB (2 GB weights + 1.5 GB KV cache at 2k context). Expand to batch 2 or 4k context and you overflow. Mitigation: Use 1B or 1.5B models on Nano, or step up to a larger-memory board (no 16GB Nano variant has been released as of April 2026; AGX Orin 32GB exists).
FP16 Tensor-Core Utilization on Small Batches:
Orin’s Tensor Cores are high-throughput, low-latency when running wide operations (batch size 32+, large matrix multiplies). For batch-size-1 requests, they’re underutilized and the GPU falls back toward shader-core execution. FP16 decode at batch=1 yields ~55 tok/s; batch=8 yields ~95 tok/s (a 1.7x gain). Mitigation: Design applications to batch requests (collect 4–8 queries, process them as one batch, return results). This adds 5–10 ms of latency per request but improves aggregate throughput by 40–70% depending on batch size.
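The collect-then-batch pattern described above can be sketched as a small queue drainer. This is a simplification; production servers such as vLLM do this scheduling internally:

```python
# Hedged sketch of request micro-batching: collect up to `max_batch`
# prompts, or whatever arrives within `window_ms`, then run them as
# one batch through the engine.
import queue
import time

def batch_requests(q: "queue.Queue[str]", max_batch: int = 8,
                   window_ms: float = 10.0) -> list:
    """Drain up to max_batch prompts, waiting at most window_ms."""
    batch = [q.get()]                     # block for the first request
    deadline = time.monotonic() + window_ms / 1000.0
    while len(batch) < max_batch:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(q.get(timeout=remaining))
        except queue.Empty:
            break
    return batch
```

The collection window bounds the added latency (at most `window_ms` per request), which is where the 5–10 ms figure above comes from.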
INT4 Quantization Quality Regression:
Extreme quantization (4-bit) hurts some models more than others. Phi-3.5-mini remains high-quality at INT4; Qwen2.5 loses measurable instruction-following capability; Llama 3.2 occupies the middle ground. Mitigation: Benchmark your specific model + use case. If quality drops unacceptably, fall back to INT8 (accepts 10–15% speed loss but near-lossless quality).
Memory Bandwidth Saturation:
Orin Nano Super has ~102 GB/s memory bandwidth; AGX Orin has ~273 GB/s (LPDDR5X). Large models and the prefill phase consume most of this. Once saturated, increasing batch size no longer improves throughput (wall-clock latency per token stays flat). Indicator: system memory utilization at 90%+ with throughput plateauing means you’re bandwidth-bound. Reduce batch size or model size.
Production Deployment Recommendations
Based on April 2026 testing, here’s our guidance for production edge LLM workloads:
For Sustained Production Workloads (500+ req/day):
– Use Jetson AGX Orin 64GB. The 64GB memory and 273 GB/s bandwidth support realistic batch sizes (4–8), longer context windows (4k–8k tokens), and thermal headroom.
– Deploy vLLM or TensorRT-LLM depending on compile-time budget and model-swap frequency.
– Provision active cooling (5–10 W fan) and keep ambient temperature below 40–45°C.
For Low-Power Kiosk / IoT Nodes (10–100 req/day):
– Jetson Orin Nano Super 8GB is cost-effective (~$499). Pair with 1B or 1.5B models (Llama 3.2 1B, Qwen2.5 1.5B, Phi-3 mini).
– Active cooling is mandatory under sustained load.
– llama.cpp or Ollama for operational simplicity.
For Development / Mixed Workloads:
– Jetson AGX Orin 32GB balances cost and capability. vLLM is ideal (zero compile overhead, dynamic model swaps).
Quantization Strategy:
– Default to INT4-AWQ for maximum speed and acceptable quality (>95% of FP16 on most instruction tasks).
– Fall back to INT8 if quality is critical (medical, legal, or high-stakes use cases).
– Test your model + benchmark your quality metrics before committing to a deployment.
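The audit in the last bullet can be as simple as paired runs over your own eval prompts. A minimal sketch; `run_fp16`, `run_int4`, and `is_correct` are stand-ins for your engine calls and task-level metric:

```python
# Hedged sketch of a pre-deployment quantization audit: run the same
# prompts through FP16 and INT4 builds and compare acceptance rates.

def audit(prompts, run_fp16, run_int4, is_correct):
    """Return (fp16_accuracy, int4_accuracy, relative retention)."""
    fp16_ok = sum(is_correct(p, run_fp16(p)) for p in prompts)
    int4_ok = sum(is_correct(p, run_int4(p)) for p in prompts)
    n = len(prompts)
    retention = int4_ok / fp16_ok if fp16_ok else 1.0
    return fp16_ok / n, int4_ok / n, retention
```

A reasonable policy, matching the guidance above: ship INT4 if retention stays at or above ~0.95 on your own task set, otherwise fall back to INT8.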
Context Windows:
– AGX Orin 64GB: Supports up to 8k-token context comfortably; 16k with careful batch management.
– Orin Nano Super: Stick to 2k-token context in production to leave headroom for thermal transients.
Serving Infrastructure:
– Prefer vLLM as the default; it’s the sweet spot of flexibility, performance, and Jetson stability.
– Use TensorRT-LLM only if you’ve validated compile-time overhead and can afford a model-deployment pipeline.
– llama.cpp for embedded or battery-powered scenarios.
Living Benchmark Roadmap
This benchmark will be updated quarterly to track hardware and software improvements.
Completed (April 2026):
– Jetson AGX Orin 64GB, AGX Orin 32GB, Orin Nano Super.
– Llama 3.2, Phi-3.5, Qwen2.5, Gemma 2.
– TensorRT-LLM 0.13, vLLM 0.6, llama.cpp b3829.
– FP16, INT8, INT4-AWQ quantization.
Planned (July 2026 update):
– Llama 4 (expected availability pending Meta release schedule).
– Mistral Small-3 and Mistral Large-2 variant optimized for 8GB systems.
– Qwen2.5-VL (vision-language version) on multi-modal pipelines.
– Jetson Thor (Project Thor announced at GTC 2024; mass production expected late 2026 for humanoid robots). Expect 500+ TOPS, supporting larger models (7B–13B) at edge scale.
– Memory-tiered analysis: Systematic evaluation of OOM boundaries for each Jetson variant.
– Batching dynamics: Throughput curves across batch size 1–64 for standard models.
Target Updates:
– Q3 2026 (July): Llama 4, Mistral refresh.
– Q4 2026 (October): Jetson Thor if available; full memory analysis.
– Q1 2027+: Annual refresh cycle.
FAQ
Q: Why not just use a Mac mini M4 or MacBook Pro? Apple Silicon is faster per watt.
A: True, but the trade-offs differ. Apple hardware excels at 1–2 concurrent requests on 8GB–16GB of memory; Jetson excels at sustained throughput and scales to 64GB. For high-volume edge deployment (100+ inferences/day across multiple requests), Orin amortizes to lower TCO. Apple’s ecosystem is also proprietary, while Jetson’s software stack is comparatively open. But if you’re running a single chatbot on your desk, an M4 MacBook wins.
Q: Can Jetson Orin run Llama 70B or Mixtral 8x22B?
A: Not practically. Llama 70B is ~140 GB in FP16 and roughly 35–40 GB of weights at INT4; even quantized, KV cache and runtime overhead leave little headroom in AGX Orin’s 64GB of shared memory, and the bandwidth ceiling pushes decode into low single-digit tok/s. Mixtral 8x22B exceeds the memory outright. Orin-class hardware is practical up to roughly 7B–13B in INT4, or 3B–7B in FP16. For 70B, look at professional GPUs (L40S, H100) or wait for Jetson Thor (late 2026, not confirmed for that scale yet).
Q: What about Llama 3.3 or Llama 4? Are they better?
A: Llama 3.3 (released early 2026) and Llama 4 (expected mid-to-late 2026) will be benchmarked in the July 2026 and Q4 2026 updates. Early reports suggest Llama 4 improves instruction-following quality on small models (1B–3B); throughput will be similar. We’ll quantify the trade-off.
Q: Is INT4 quality good enough for production?
A: Yes, for most tasks. INT4-AWQ on Llama, Phi, and Qwen retains 95%+ quality on benchmarks (MMLU, HumanEval, HellaSwag). Instruction-following, classification, and extraction are robust. Decline is measurable in open-ended generation and hallucination resistance. For high-stakes applications (medical decision support, legal document analysis), run a quality audit on your specific use case. For chat, Q&A, and content summarization, INT4 is production-ready.
Q: How do I reproduce these numbers?
A: We publish a GitHub repo with:
– Hardware spec checklist (thermals, power supply, cooling).
– Model download + quantization scripts.
– Engine build instructions (TensorRT-LLM compile, vLLM deps, llama.cpp).
– Test harness (Python, captures throughput, latency, memory, power).
– Raw data (CSV logs for each run).
Full transparency: https://github.com/iotdigitaltwinplm/jetson-llm-benchmark-2026. The repo also lists deviations from this post if any. Benchmark results are only valid for the specific hardware revision, JetPack version, and model quantization. Don’t assume Llama 3.2 3B INT4 on a community-compiled TensorRT engine will match our AGX Orin 64GB MAXN results.
Q: When will Jetson Thor be available, and will it change everything?
A: Jetson Thor was announced at GTC 2024 for humanoid-robot deployment, with mass production ramping in late 2026. Specs are limited (NVIDIA is tight-lipped), but rumors suggest 500+ TOPS and 48GB+ memory. If true, it would support 7B–13B models comfortably and shift the edge-inference landscape. We’ll benchmark it the moment it’s available (Q4 2026 target). For now, base production deployments on Orin AGX or Nano.
Further Reading
- Internal: Apple On-Device AI: Neural Engine and Private Cloud Compute (2026) — Competitive perspective on consumer-edge LLM deployment.
- Internal: Humanoid Robot Benchmark: Figure, Optimus, Unitree, Digit (2026) — LLM-driven robotics; some use Jetson for on-robot reasoning.
- Internal: AI/ML category — All depth-technical AI/ML posts on the site.
- External: NVIDIA Jetson Developer Kit Documentation — Official hardware specs, thermal design, power budget.
- External: TensorRT-LLM GitHub — Compiler, build instructions, model support matrix, known issues.
- External: vLLM GitHub — Paged-attention engine; Jetson support documented in JetPack-specific branches.
Living benchmark first iteration: April 2026. Test harness, raw data, model artifacts, and hardware spec reproducibility checklist live in the benchmark repo. Update cadence: quarterly (next: July 2026). Questions or contributions welcome via GitHub issues.
Riju M P | iotdigitaltwinplm.com | April 24, 2026
