NVIDIA Jetson + K3s Edge AI Cluster Tutorial: Build, Schedule, and Run Inference

Running deep-learning models on a single Jetson dev kit is easy. Running them across a fleet of devices, with rolling updates, health checks, and GPU scheduling, is where most teams stall. This tutorial closes that gap. We build a real Jetson K3s edge AI cluster from bare boards, wire up GPU scheduling so pods can actually reach the iGPU, deploy a vision inference container, and call it over the network.

The goal is not a toy demo. By the end you will have a multi-node Kubernetes cluster running on arm64 Jetson Orin hardware, a device plugin advertising GPUs to the scheduler, and a model serving live detections behind a Service. Everything here is copy-pasteable and ordered the way you would actually run it on a bench.

What this covers: Jetson and K3s background, flashing and node provisioning, K3s server and agent install, GPU scheduling with the NVIDIA device plugin, deploying and testing a vision model, the gotchas that bite arm64 edge clusters, and production recommendations.

Context and Background

Kubernetes was designed for datacenters, not for fanless boxes bolted to a factory wall. K3s changes that. It is a CNCF-certified Kubernetes distribution packaged as a single binary under 100 MB, with an embedded SQLite or etcd datastore and sensible edge defaults. It strips out legacy in-tree cloud providers and alpha features you will never use on a Jetson, which is exactly what you want when memory is scarce and the device may lose power without warning.

Why does the distribution matter so much at the edge? A full upstream Kubernetes install pulls in a long list of components that assume a cloud control plane, fast disks, and machines that never lose power abruptly. On a 4 GB Orin Nano those assumptions are expensive. K3s collapses the API server, scheduler, controller-manager, and kubelet concerns into one process, runs containerd embedded rather than as a separate daemon, and boots in seconds. That smaller surface also means fewer moving parts to debug when a board reboots after a brownout, which on factory floors is routine rather than exceptional.

The NVIDIA Jetson Orin family is the other half of the story. Orin Nano, Orin NX, and AGX Orin span roughly 20 to 275 TOPS of AI performance in a power envelope from 7 W to 60 W. They share an Ampere-class integrated GPU, an arm64 (aarch64) CPU, and a unified memory architecture where CPU and GPU share the same physical RAM. That last detail matters: there is no separate VRAM to copy tensors into, which both simplifies and complicates how you reason about memory.

The unified-memory model is genuinely different from a discrete-GPU server. On an x86 box with an A100 you copy tensors across PCIe into dedicated VRAM, and that copy is a real cost you optimize away. On Orin the GPU reads the same DRAM the CPU writes, so the copy disappears, but the budget does not. A model, its activations, the OS, K3s, and every other pod all draw from the same pool. Size your inference pod’s memory request against the whole board, not against an imaginary separate GPU. This is also why you cannot meaningfully overcommit GPU memory on Jetson the way you might on a server with 80 GB of VRAM.
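The budgeting rule above can be sketched as a small check you run on a board before settling on a pod memory request. This is a minimal sketch, not a sizing tool: the function names are made up for illustration, the request and margin figures in the usage comment are placeholders, and MemAvailable is only the kernel's estimate of what can be claimed without swapping.

```shell
# Print MemAvailable in MiB, reading /proc/meminfo-formatted text on stdin.
mem_available_mib() {
  awk '$1 == "MemAvailable:" { print int($2 / 1024) }'
}

# Succeed only if the pod's memory request plus a safety margin (both in MiB)
# fits inside what the board currently has available (also in MiB).
fits_on_board() {
  request_mib=$1
  margin_mib=$2
  available_mib=$3
  [ $((request_mib + margin_mib)) -le "$available_mib" ]
}

# Usage on the Jetson itself (3072 and 1024 are illustrative numbers):
#   fits_on_board 3072 1024 "$(mem_available_mib < /proc/meminfo)" && echo "fits"
```

Run it with K3s and your other pods already up, since they draw from the same pool the model will use.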

Choosing which Orin to buy follows from the workload, not the other way around. An Orin Nano with 8 GB is a fine home for a single quantized detector at modest resolution. An Orin NX with 16 GB comfortably runs a larger model or two light ones. An AGX Orin with up to 64 GB is the board you reach for when you need a big model, multiple streams, or the headroom to also act as the control plane. Buy for the model you intend to serve plus a generous margin, because unified memory is shared with everything else on the board and the OS plus Kubernetes already take a bite before your model loads.

All of them run JetPack, NVIDIA’s board support package built on top of Linux for Tegra (L4T). JetPack bundles the kernel, CUDA, cuDNN, TensorRT, and the NVIDIA container runtime. You do not install CUDA separately the way you would on an x86 server; it comes with the flashed image, version-locked to that JetPack release. Mixing CUDA versions across nodes is one of the first traps, and we will come back to it.

The edge-AI use case driving all of this is latency and bandwidth. A camera generating 30 frames per second cannot ship raw video to a cloud GPU and wait for a round trip. Running inference on-device removes the network round trip, so response time is bounded by the model itself (it can reach single-digit milliseconds for a small, well-optimized model), and sensitive footage stays local. If you have read our ROS 2 Jazzy on Jetson Orin warehouse robotics tutorial, the hardware here will feel familiar; this guide focuses on the orchestration layer instead of robotics. For the canonical install reference, keep the official K3s documentation open in a tab as you work.

So why orchestrate edge inference with Kubernetes at all, rather than just running a container per board by hand? Because a fleet is a moving target. Boards fail, models get retrained, sites get added, and somebody has to make sure the right version runs in the right place. Doing that with SSH and shell scripts works for three boards and collapses at thirty. Kubernetes gives you a declarative target state, automatic restarts, rolling updates, and health-based routing for free. K3s makes that affordable on hardware that could never run a full control plane. The combination is what turns a pile of dev kits into a managed inference platform, and it is the reason this pairing has become a default pattern for serious edge deployments.

Architecture and Cluster Setup

A Jetson K3s edge AI cluster is one K3s server (control plane) plus one or more Jetson agents that run inference pods. You flash each board with a matching JetPack release, set the NVIDIA container runtime as the default so GPU access works inside containers, install the K3s server on one node, then join the agents with a shared token. Verification is a single kubectl get nodes showing every board Ready.

Read that paragraph as the whole tutorial in miniature: flash, configure runtime, install server, join agents, verify. Each step below expands one clause of it, and every command is meant to be run in order on real boards. Nothing here assumes a cloud, an x86 builder, or a discrete GPU.

The topology above is deliberately simple: an AGX Orin acts as the control plane because it has the most RAM and thermal headroom, and three smaller Orin boards do the inference work. A camera feeds frames into the cluster, an inference Service routes them to a ready pod, and detections come back out. You can collapse this to a single node for testing, but the multi-node shape is what teaches the scheduling lessons that matter.

There is a deliberate division of labor here. The control plane runs the API server, scheduler, and datastore, none of which need a GPU, so putting it on the AGX is about RAM and reliability rather than raw inference throughput. The smaller Nano and NX boards are pure capacity: they carry GPU pods and nothing else. Keeping inference off the control plane is a habit worth forming early, because a model that pins the GPU and starves the API server can make the whole cluster appear to hang. The steps that follow walk top to bottom: flash, set the runtime, install the server, join agents, verify.

Step 1: Flash JetPack and prepare each board

Flash every board with the same JetPack release. Use NVIDIA SDK Manager from an Ubuntu host, or the SD-card image for Orin Nano dev kits. Once flashed, do the first-boot setup, then confirm the L4T version on each device so they match exactly:

# Confirm L4T / JetPack version on every node before anything else
cat /etc/nv_tegra_release
# Example: # R36 (release), REVISION: 4.0 ...  -> JetPack 6.x

# Confirm CUDA is present and which arch the kernel reports
# (if nvcc is not on PATH, try /usr/local/cuda/bin/nvcc)
nvcc --version
uname -m            # must print: aarch64

If two boards report different L4T revisions, stop and reflash the odd one out. A cluster with mismatched CUDA runtimes will schedule pods that then fail at model load with cryptic driver errors. Treat the JetPack version as a cluster-wide invariant. The reason is subtle: Kubernetes happily places a pod on any node that satisfies its resource requests, and it has no idea that one board’s CUDA is a minor version behind. The pod schedules, the image pulls, and only at model load does TensorRT throw a driver-mismatch error. That failure looks like an application bug when it is really an infrastructure drift problem, and chasing it through application logs wastes hours.
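Treating the JetPack version as an invariant is easier if a script checks it. The sketch below parses the first line of /etc/nv_tegra_release into a version string and compares it across boards. The function names are illustrative, and check_fleet assumes you have passwordless SSH to each hostname, which the tutorial itself does not set up.

```shell
# Extract "RELEASE.REVISION" (for example 36.4.0) from the first line of
# /etc/nv_tegra_release, supplied on stdin.
l4t_version() {
  sed -n 's/^# R\([0-9][0-9]*\) (release), REVISION: \([0-9.][0-9.]*\).*/\1.\2/p' | head -n 1
}

# Read "host version" pairs on stdin. Print every host whose version differs
# from the first pair and return non-zero if any drift was found.
find_l4t_drift() {
  awk 'BEGIN { bad = 0 }
       NR == 1 { want = $2 }
       $2 != want { print $1 " has " $2 ", expected " want; bad = 1 }
       END { exit bad }'
}

# Assumption: SSH access to every board by hostname.
# Usage: check_fleet orin-cp orin-nx-1 orin-nano-2
check_fleet() {
  for host in "$@"; do
    printf '%s %s\n' "$host" "$(ssh "$host" cat /etc/nv_tegra_release | l4t_version)"
  done | find_l4t_drift
}
```

The first host on the command line is treated as the reference, so list the control plane first.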

Next, set a stable hostname and a static IP or DHCP reservation per board. K3s nodes are identified by hostname, so duplicate or shifting hostnames cause nodes to flap. Give them obvious names like orin-cp, orin-nx-1, orin-nano-2. A flapping node is not just cosmetic: when a node briefly disappears and rejoins under a new identity, the scheduler reshuffles pods, GPU pods restart, and your inference latency spikes while models reload. Pinning identity up front removes a whole class of intermittent failures that are miserable to debug after the fact.

Step 2: Make the NVIDIA runtime the default

This is the step people skip, and then GPU pods silently run on CPU. Containerd (which K3s ships internally) must use the NVIDIA runtime by default so containers inherit GPU device nodes and CUDA libraries. First confirm the NVIDIA container runtime packages are installed, then set nvidia as Docker’s default runtime as a host-level sanity check:

# Verify the NVIDIA container runtime is installed (ships with JetPack)
dpkg -l | grep nvidia-container

# Set nvidia as the default runtime for Docker (this affects Docker only, not K3s)
sudo tee /etc/docker/daemon.json >/dev/null <<'EOF'
{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
EOF
sudo systemctl restart docker

K3s does not use Docker by default; it uses its own embedded containerd. So after installing K3s we will also drop a containerd config template so the embedded runtime knows about nvidia. Hold that thought until Step 4, where we add the template once the agents have joined.

It is worth being precise about why this trips people up. There are two container runtimes in play that are easy to conflate. Docker, if installed, is what you use to build and test images by hand on the board. The daemon.json edit above only affects Docker. K3s, however, never touches Docker at runtime; it talks to its own bundled containerd. So setting Docker’s default runtime to nvidia proves the host can run GPU containers, but it does nothing for the pods K3s actually launches. That second configuration, the containerd template, is the one that makes Kubernetes pods GPU-capable, and forgetting it is the single most common reason a “correctly configured” cluster still runs models on the CPU.

Step 3: Install the K3s server

Pick the AGX Orin (or your beefiest board) as the control plane. Install the server with the embedded containerd and grab the node token, which the agents need to join:

# On the control-plane node (orin-cp)
curl -sfL https://get.k3s.io | sh -s - server \
  --write-kubeconfig-mode 644 \
  --disable traefik \
  --node-name orin-cp

# Grab the join token and the server IP
sudo cat /var/lib/rancher/k3s/server/node-token
hostname -I | awk '{print $1}'

We disable Traefik because most edge clusters expose services through NodePort or a lightweight ingress you control, and Traefik adds memory you may not want on a small board. The --write-kubeconfig-mode 644 flag makes the generated kubeconfig readable without sudo, which is convenient on a lab board but worth tightening before production. On a real deployment you would also pass --tls-san with the server’s stable IP or DNS name so the API certificate is valid when you connect from a laptop. Keep these defaults simple now and harden once the cluster works end to end. Verify the server came up:

# -L adds the node's kubernetes.io/arch label as an ARCH column
sudo k3s kubectl get nodes -L kubernetes.io/arch
# orin-cp should show STATUS Ready and ARCH arm64
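Once the server is verified, the hardening mentioned above (a tighter kubeconfig mode and a TLS SAN) can be captured in a K3s config file instead of a longer flag list. The sketch below writes such a file; K3s reads /etc/rancher/k3s/config.yaml with keys that mirror the CLI flags. The function name is illustrative and the 0600 mode is one reasonable choice, not something the tutorial prescribes.

```shell
# Write a K3s server config equivalent to the install flags used above, with
# the kubeconfig mode tightened and a TLS SAN added.
#   $1 = output path, $2 = stable server IP or DNS name
write_k3s_server_config() {
  out=$1
  san=$2
  cat > "$out" <<EOF
write-kubeconfig-mode: "0600"
tls-san:
  - "$san"
disable:
  - traefik
node-name: orin-cp
EOF
}

# Usage (placing the file before running the installer is the simplest path):
#   write_k3s_server_config /tmp/config.yaml 10.0.0.10
#   sudo install -D -m 600 /tmp/config.yaml /etc/rancher/k3s/config.yaml
```

With 0600 you go back to needing sudo (or a copied kubeconfig) to run kubectl on the board, which is the intended trade-off.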

Step 4: Join the Jetson agents

On each worker board, install the K3s agent pointing at the server URL and token. Replace the placeholders with the values from Step 3:

# On each agent node (orin-nx-1, orin-nano-2, ...)
export K3S_URL="https://10.0.0.10:6443"
export K3S_TOKEN="K10abc...your-node-token"

curl -sfL https://get.k3s.io | K3S_URL="$K3S_URL" K3S_TOKEN="$K3S_TOKEN" \
  sh -s - agent --node-name "$(hostname)"

Joining is intentionally lightweight. The agent only needs the server URL and the shared token; it pulls everything else from the server on first contact and registers itself as a node. There is no manual certificate dance and no separate kubelet install. If a join fails, the usual culprits are a firewall blocking port 6443, a clock skew large enough to invalidate TLS, or a copy-paste error in the token. Check journalctl -u k3s-agent on the failing board and the cause is almost always one of those three. Once the node appears in kubectl get nodes, it is a full member of the cluster.
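The three usual culprits can be checked before running the installer. The following is a rough preflight sketch: it validates the URL shape, refuses an empty token, probes the port, and prints the local UTC time so you can compare it against the server by eye. It assumes nc (netcat) is installed on the board, and the function names are illustrative.

```shell
# Split a K3s server URL such as https://10.0.0.10:6443 into "host port".
# Prints nothing if the URL does not match that shape.
k3s_url_host_port() {
  printf '%s\n' "$1" | sed -n 's#^https://\([^:/]*\):\([0-9][0-9]*\)/*$#\1 \2#p'
}

# Preflight for an agent join: $1 = K3S_URL, $2 = K3S_TOKEN.
join_preflight() {
  url=$1
  token=$2
  # Word-splitting is intentional here: it yields host and port as $1 and $2.
  set -- $(k3s_url_host_port "$url")
  [ $# -eq 2 ] || { echo "K3S_URL must look like https://HOST:PORT" >&2; return 1; }
  [ -n "$token" ] || { echo "K3S_TOKEN is empty" >&2; return 1; }
  nc -z -w 3 "$1" "$2" || { echo "cannot reach $1:$2 (firewall?)" >&2; return 1; }
  echo "server reachable at $1:$2; local UTC time is $(date -u +%FT%TZ)"
}

# Usage: join_preflight "$K3S_URL" "$K3S_TOKEN"
```

It does not validate the token contents, so a truncated paste still has to be caught from the k3s-agent journal.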

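Now tell the embedded containerd on every node to default to the NVIDIA runtime. K3s reads a templated config if you provide one. The section names below follow the containerd 1.x CRI schema; K3s releases that bundle containerd 2.x expect a different template file name and plugin path, so check the K3s documentation for the release you installed: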

# On every node: make K3s containerd use the nvidia runtime by default
sudo mkdir -p /var/lib/rancher/k3s/agent/etc/containerd
sudo tee /var/lib/rancher/k3s/agent/etc/containerd/config.toml.tmpl >/dev/null <<'EOF'
{{ template "base" . }}

[plugins."io.containerd.grpc.v1.cri".containerd]
  default_runtime_name = "nvidia"

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
  runtime_type = "io.containerd.runc.v2"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
  BinaryName = "/usr/bin/nvidia-container-runtime"
EOF

sudo systemctl restart k3s-agent   # or k3s on the server node
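After the restart it is worth confirming that the template was actually picked up, since a template that is silently ignored produces exactly the CPU-fallback symptom described earlier. K3s renders its effective containerd config to config.toml in the same directory as the template; the helper below (an illustrative name) just reads the default runtime out of a file in that format.

```shell
# Print the default runtime name set in a containerd config file ($1).
# Prints nothing if the file does not set default_runtime_name.
containerd_default_runtime() {
  sed -n 's/^[[:space:]]*default_runtime_name[[:space:]]*=[[:space:]]*"\([^"]*\)".*/\1/p' "$1" | tail -n 1
}

# Usage on a node, after restarting k3s or k3s-agent:
#   sudo sh -c '. ./this-file.sh; containerd_default_runtime /var/lib/rancher/k3s/agent/etc/containerd/config.toml'
# The intent is for this to report nvidia once the template has taken effect.
```

If it prints nothing or runc, look at the K3s service journal for template rendering errors before moving on.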

Back on the control plane, confirm the whole cluster is healthy. Every board should be Ready and report arm64:

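# -L adds the kubernetes.io/arch label as an ARCH column (VERSION column omitted below)
sudo k3s kubectl get nodes -L kubernetes.io/arch
# NAME         STATUS   ROLES                  AGE   ARCH
# orin-cp      Ready    control-plane,master   5m    arm64
# orin-nx-1    Ready    <none>                  2m    arm64
# orin-nano-2  Ready    <none>                  2m    arm64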

Copy /etc/rancher/k3s/k3s.yaml to your laptop, swap 127.0.0.1 for the server IP, and you can drive the cluster with a normal kubectl from your desk. At this point you have a functioning Kubernetes cluster on Jetson hardware. It just cannot reach the GPU yet. That is the next section.
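The kubeconfig copy-and-swap described above is a one-line text substitution. A minimal sketch follows, assuming the control plane is reachable over SSH as orin-cp at 10.0.0.10 (the example address used in Step 4) and that ~/.kube/orin.yaml is where you want the file; both are placeholders.

```shell
# Rewrite the loopback API address in a K3s kubeconfig (stdin) so it points at
# the control-plane address given as $1. The result goes to stdout.
retarget_kubeconfig() {
  sed "s#https://127\.0\.0\.1:6443#https://$1:6443#"
}

# Usage from a laptop:
#   ssh orin-cp cat /etc/rancher/k3s/k3s.yaml | retarget_kubeconfig 10.0.0.10 > ~/.kube/orin.yaml
#   chmod 600 ~/.kube/orin.yaml
#   KUBECONFIG=~/.kube/orin.yaml kubectl get nodes
```

Keeping the file separate from your default kubeconfig avoids clobbering contexts for other clusters.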

Take a moment to appreciate what you have and what you do not. The cluster will schedule pods, route Services, restart crashed containers, and survive a board reboot. All of that is generic Kubernetes behavior and it works identically to a cloud cluster. What is missing is GPU awareness: as far as the scheduler is concerned, these are CPU-only machines. A pod that requests nvidia.com/gpu right now would sit Pending forever, because no node advertises that resource. Closing that gap is purely additive. We deploy a DaemonSet that inspects each board, finds the iGPU, and reports it to the kubelet. Nothing about the cluster you just built changes.

GPU Scheduling and Deploying a Model

Kubernetes does not know your Jetson has a GPU until something advertises it. The NVIDIA device plugin runs as a DaemonSet on every node, detects the iGPU, and reports an nvidia.com/gpu resource to the kubelet. The scheduler then treats GPUs like CPU or memory: a pod that requests one is only placed on a node that has one free. A runtimeClassName makes sure the pod actually launches under the NVIDIA runtime.

This is the conceptual heart of the tutorial, so it is worth slowing down. Kubernetes has no built-in notion of a GPU. CPU and memory are first-class resources baked into the kubelet, but anything else, including GPUs, is an extended resource that some component must teach the cluster about. The device-plugin framework is that teaching mechanism. NVIDIA ships a plugin that knows how to find GPUs and report them, and once it runs your manifests can request nvidia.com/gpu exactly like they request CPU. The scheduler does the bin-packing, refusing to place two GPU pods on a one-GPU board. Everything that follows is just making the three required pieces, the runtime, the plugin, and the pod spec, agree.

The flow above is worth internalizing. Your pod requests nvidia.com/gpu: 1, the device plugin has already advertised that resource, the scheduler binds the pod to a node with capacity, and containerd starts it under the NVIDIA runtime so CUDA libraries and device nodes are mounted. Skip any link in that chain and the pod either stays Pending or runs blind on the CPU.

Each link fails in a distinct, recognizable way, and learning the signatures saves real time. If the device plugin is not running, no node advertises the GPU and the pod stays Pending with an “Insufficient nvidia.com/gpu” event. If the plugin runs but the runtime default is wrong, the pod schedules and starts but the container has no CUDA libraries, so the model falls back to CPU and inference is an order of magnitude slower with no error at all. If the runtimeClassName is missing and nvidia is not the node’s default runtime, the GPU device nodes are not mounted even when the resource was granted. Reading the failure mode tells you exactly which link to inspect, which beats guessing.
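The first of those signatures is easy to detect mechanically. The sketch below reads the output of kubectl describe pod and reports whether the scheduler rejected the pod for lack of a GPU. It only covers the Pending case; the silent CPU fallback leaves no event and has to be caught by measuring inference latency. The function name and messages are illustrative.

```shell
# Read `kubectl describe pod <name>` output on stdin and print which link in
# the chain to inspect first.
triage_gpu_pod() {
  if grep -q 'Insufficient nvidia.com/gpu'; then
    echo "device plugin: no node advertises a free nvidia.com/gpu"
  else
    echo "scheduling ok: check runtime default and runtimeClassName next"
  fi
}

# Usage: kubectl describe pod <pod-name> | triage_gpu_pod
```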

Step 5: Install the device plugin and a RuntimeClass

First define a RuntimeClass named nvidia, then deploy the device plugin DaemonSet so that a plugin pod runs on every node. On Jetson, the unified-memory iGPU is exposed as a single nvidia.com/gpu unit per board. The RuntimeClass is a small but essential object: it gives pods a name they can reference to opt into the NVIDIA runtime, decoupling the pod spec from the node-level containerd configuration. Without it, the only route to the GPU is the node-wide default from Step 4, which puts every container on the board under the NVIDIA runtime; with it, a pod states the dependency explicitly through runtimeClassName: nvidia, and you can later drop the node-wide default without touching your manifests.

# runtimeclass.yaml
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia

# Apply the RuntimeClass, then the NVIDIA k8s device plugin
kubectl apply -f runtimeclass.yaml
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.16.2/deployments/static/nvidia-device-plugin.yml

# Wait for the DaemonSet, then confirm GPUs are advertised
kubectl -n kube-system rollout status ds/nvidia-device-plugin-daemonset
kubectl get nodes -o "custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"
# Each Jetson node should now show GPU: 1

If a node shows <none> instead of 1, the runtime default did not take. Recheck the containerd template from Step 4 and that the plugin pod on that node is Running, not CrashLoopBackOff. The NVIDIA k8s-device-plugin README documents the exact tags compatible with each JetPack release; pin to one, do not use latest.
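On a larger fleet it helps to filter that table rather than read it. A small sketch, assuming the two-column NAME/GPU output of the custom-columns query shown above; the function name is illustrative.

```shell
# Read the NAME/GPU table from the custom-columns query on stdin and print
# every node that is not advertising exactly one GPU.
nodes_missing_gpu() {
  awk 'NR > 1 && $2 != "1" { print $1 }'
}

# Usage:
#   kubectl get nodes -o "custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu" \
#     | nodes_missing_gpu
```

For each node it prints, check the plugin pod scheduled there with kubectl -n kube-system get pods -o wide and read its logs.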

A word on what the device plugin actually does, because treating it as a black box leads to bad debugging. The plugin is a small process that implements Kubernetes’ device-plugin gRPC interface. On start it inventories the GPUs it can see through the NVIDIA libraries, registers with the kubelet over a Unix socket, and then streams a list of healthy device IDs. The kubelet rolls those up into the node’s allocatable resources, which is what kubectl get nodes shows. When you request a GPU, the kubelet asks the plugin to “allocate” it, and the plugin returns the environment variables and device mounts the container needs. On Jetson the integrated GPU is reported as one device, so the count is always one per board. If that count is zero, the plugin either could not see the GPU (wrong runtime) or crashed before registering (wrong tag for your JetPack).

Step 6: Deploy a vision inference container

Now the payload. We deploy a lightweight object-detection server that loads a model and exposes an HTTP endpoint. Note the three things that make it GPU-aware: runtimeClassName: nvidia, the resource request and limit for nvidia.com/gpu, and an arm64-built image. A Triton Inference Server image works the same way if you prefer a heavier serving stack.

A few design choices in the manifest deserve explanation. We set replicas: 2 so two boards each run a copy and the Service load-balances across them; one board can die without an outage. The readiness probe is not decoration. Model load on Orin can take fifteen to thirty seconds while TensorRT builds or deserializes the engine, and without a probe the Service would route traffic to a pod that is not ready, producing connection errors that look like network faults. The initialDelaySeconds: 20 gives the model time to load before the first health check. The GPU is declared only under limits: for extended resources Kubernetes sets the request equal to the limit, the two must match if you specify both, and the quantity must be a whole number, so you ask for a whole GPU or none.
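To make those choices concrete, here is a minimal sketch of a Deployment and Service with the properties just described. It is not the tutorial's original manifest: the image name, the detector label, port 8080, and the /healthz probe path are all hypothetical placeholders that you would replace with your own container's values.

```shell
# Write a Deployment + Service sketch to the path given as $1.
# HYPOTHETICAL values: image name, app label, port 8080, /healthz probe path.
write_detector_manifest() {
  cat > "$1" <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: detector
spec:
  replicas: 2                      # one copy on each of two boards
  selector:
    matchLabels:
      app: detector
  template:
    metadata:
      labels:
        app: detector
    spec:
      runtimeClassName: nvidia     # launch under the NVIDIA runtime
      containers:
        - name: detector
          image: registry.example.com/detector:arm64   # placeholder image
          ports:
            - containerPort: 8080
          resources:
            limits:
              nvidia.com/gpu: 1    # whole GPU; request defaults to the limit
          readinessProbe:
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 20   # give the model time to load
            periodSeconds: 5
---
apiVersion: v1
kind: Service
metadata:
  name: detector
spec:
  selector:
    app: detector
  ports:
    - port: 80
      targetPort: 8080
EOF
}

# Usage: write_detector_manifest detector.yaml && kubectl apply -f detector.yaml
```

Because each Jetson advertises a single nvidia.com/gpu, two replicas that each claim one GPU cannot share a board, so the scheduler spreads them without any explicit anti-affinity rule.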

The image itself carries weight here. It must be an arm64 image built against the same JetPack and CUDA your boards run, and it should ideally ship a prebuilt TensorRT engine rather than building one at startup. Engine builds are slow and memo

Figure 3: Edge inference request sequence. A camera frame reaches the inference service, the pod schedules work on the Jetson GPU through the device plugin, and the prediction returns to the caller within the latency budget.

Figure 4: GitOps fleet deployment topology. A Git repository is the source of truth; a controller reconciles each remote K3s site, pulls images from a local registry, and supports drift detection and rollback across the Jetson fleet.

NVIDIA Jetson + K3s: Edge AI Cluster Tutorial (2026)
