Edge MLOps Pipelines for Industrial IoT: 2026 Production Architecture

An edge MLOps pipeline for industrial IoT is the discipline of training models in the cloud, packaging each one into hardware-specific artefacts, distributing them to thousands of constrained gateways, and monitoring their behaviour against ground truth that is expensive and slow to collect. By 2026 this discipline has matured from “MLflow plus a shell script” into a layered reference architecture with named tools at every layer (registry, packager, signer, OTA controller, runtime, telemetry agent) that survives flaky WAN links, regulated change control, and the brutal reality that most production lines never produce labelled data on their own. This guide is the reference architecture I hand to platform teams who are building that pipeline from scratch or replatforming an early proof-of-concept into something that can run on 5,000 plant-floor devices without paging the on-call engineer every Friday night.

The reader I have in mind is a platform engineer or ML platform lead who already understands what a feature store is, has shipped a model on Kubernetes, and is now staring at a fleet of Jetson Orin modules, Siemens IPCs, or off-brand ARM gateways and wondering how to turn that into an operable system. If you are earlier in the journey, our complete technical guide to the Internet of Things covers the protocol and topology layer underneath; for the cloud-side tool-orchestration patterns that pair with what follows, see LLM tool-calling determinism patterns 2026.

Why edge MLOps is harder than cloud MLOps

Cloud MLOps has converged. The reference stack — Git for source, MLflow or Weights & Biases for experiment tracking, a model registry, an inference service behind a load balancer, Prometheus scraping latency, a feedback loop into a feature store — is well understood and shipped by every hyperscaler. Edge MLOps is harder for six concrete reasons, and any architecture that ignores them ends up rebuilt within a year.

Heterogeneous hardware. A single deployment commonly spans NVIDIA Jetson Orin Nano and AGX, Intel x86 IPCs with iGPU or Movidius VPUs, Qualcomm QCS-class SoCs, and ARM Cortex-A class boards. Each demands a different packaged artefact: TensorRT engines for Jetson, OpenVINO IR for Intel, ONNX Runtime with the appropriate execution provider for the rest, and GGUF when the workload is an on-device small language model. The cloud MLOps assumption that one container image runs everywhere is simply false.

Intermittent connectivity. Plant networks are firewalled, often air-gapped or semi-air-gapped via an industrial DMZ, and frequently lose WAN connectivity for hours. The pipeline must tolerate a gateway being offline for a week, then catching up. Any architecture that requires a synchronous call to a cloud control plane to make an inference decision is dead on arrival.

Label scarcity. A vibration-monitoring model on a pump rarely sees a labelled failure. A vision model on a packaging line sees thousands of frames per second and a defect maybe once a shift, labelled days later by a quality engineer if at all. The pipeline cannot assume ground truth arrives in minutes; it has to detect drift and trigger retraining using proxies — population statistics, shadow disagreement, anomaly scores — and queue suspect windows for sparing human review.

Strict change control. A model that controls a robotic cell, or even just suggests a setpoint to a human operator, falls under the same change-management gate as PLC firmware. ISA-95, IEC 62443, and customer-specific MoC (management of change) workflows mean every deployment must be auditable, signed, reversible, and tied to a ticket. “Push to main triggers rollout” does not survive a single audit.

Adversarial supply chain. Models are now a supply-chain attack surface. A poisoned ONNX file with a fused-op exploit, or a swapped TensorRT engine that shaves a class boundary to allow defective product through, is a real concern. SLSA provenance, sigstore-style signing, and reproducible packaging are baseline expectations in 2026, not theatre.

Operational ownership is split. The data scientist who trained the model, the platform team that runs the registry, the OT engineer who owns the gateway, and the plant manager who owns uptime are four different humans with four different on-call rotations. The pipeline has to expose each layer with the contract that group needs and hide the rest.

These six pressures shape every choice in the reference architecture that follows. Note that all of them are organisational and physical before they are technical: the tooling exists; the discipline is in wiring it together so it survives a real plant.

Reference architecture: training, packaging, distribution, runtime, monitoring

The mature 2026 reference architecture has five planes, plus a feedback loop that closes the circuit. The training plane in the cloud mirrors a standard MLOps stack. The registry and packaging plane turns each registered model into signed, hardware-specific artefacts. The distribution plane sits between cloud and edge and owns rollout policy. The edge runtime plane lives on the device. The telemetry and monitoring plane carries observations back. Diagram arch_01 shows the end-to-end flow.

Diagram arch_01: end-to-end edge MLOps pipeline for industrial IoT, showing cloud training, model registry, packager, signer, CDN, control plane, edge fleet, and feedback loop.

Plane 1 — Training. Distributed training on Vertex AI, SageMaker, or Azure ML, with experiments tracked in MLflow, Weights & Biases, or ClearML. NVIDIA TAO Toolkit covers the common vision-and-speech transfer-learning paths and produces artefacts that drop cleanly into Triton. PyTorch 2.x with torch.compile and torch.export is the default for greenfield work; Hugging Face Transformers via ExecuTorch is the typical path for on-device language and audio.

Plane 2 — Registry and packaging. The MLflow Model Registry, SageMaker Model Registry, or Vertex AI Model Registry holds the canonical model. A packaging job, usually a Kubeflow or Argo Workflows DAG, fans out per target: ONNX export, TensorRT engine build for each Jetson variant and CUDA version, OpenVINO IR conversion via the OpenVINO model conversion API (the successor to the legacy Model Optimizer), TF-Lite conversion for low-end ARM, ExecuTorch .pte for PyTorch-Edge targets, and GGUF for on-device LLM workloads. Each packaged artefact is signed with cosign and tagged with SLSA build provenance.

Plane 3 — Distribution. The signed artefacts land in an OCI-compatible registry. An OTA controller — NVIDIA Fleet Command for Jetson-heavy fleets, AWS IoT Greengrass v2 with component deployments, Azure IoT Edge module deployments, Mender for mixed Linux fleets, or balenaCloud for container-native deployments — pulls them and orchestrates a cohorted rollout.

Plane 4 — Edge runtime. On the device, an inference server (Triton, ONNX Runtime, OpenVINO Model Server, or a custom binary linking the appropriate runtime library) loads the artefact, a feature pipeline extracts inputs from OPC UA, MQTT, or vision streams, and a sidecar OTel collector emits telemetry. Diagram arch_02 details that layout.

Plane 5 — Telemetry and monitoring. A site-level OTel collector aggregates telemetry from all devices on a plant, buffers it to a local WAL during WAN outages, and forwards to cloud Prometheus-compatible storage (VictoriaMetrics or Grafana Mimir), Loki for logs, and Tempo for traces. Drift, anomaly, and shadow-disagreement signals feed a manual review queue, which closes the loop back to Plane 1.

Feedback loop. Labelled review-queue output, sampled raw inputs, and operator overrides are exported, deduplicated, joined to ground truth where available, and added to the training set for the next iteration. The cadence is typically weekly for vision and audio, monthly for tabular telemetry, and event-driven for safety-critical drift signals.

The point of separating the planes is that each is independently replaceable. Swap Vertex AI for SageMaker; swap Triton for ONNX Runtime; swap Greengrass for Mender. The contracts at the seams — signed OCI artefact, OTel telemetry schema, rollout state machine — are what stay stable.

Diagram arch_02: edge runtime layer, showing container orchestration, model server, feature pipeline, and telemetry sidecar on a single industrial gateway.

Model registry, packaging (ONNX, TensorRT, OpenVINO, GGUF) and signing

The registry is the boundary between research and production. A model in the registry has a name, a version, a stage (Staging, Production, Archived), a lineage pointer back to the training run, and the input/output schema as a serialised proto or signature.json. MLflow Model Registry, SageMaker Model Registry, and Vertex AI Model Registry all expose roughly the same surface; the choice is usually determined by the hyperscaler the rest of the platform lives on.

What edge MLOps adds on top is the packaged-artefact tree. A single registered model produces many packaged artefacts, one per target tuple of (hardware, runtime, precision, batch shape). For a single computer-vision model that needs to run on Jetson AGX Orin, Jetson Orin Nano, Intel Tiger Lake iGPU, and a generic CPU fallback, the packager produces at least:

  • An ONNX export (opset 17+) as the canonical interchange artefact.
  • A TensorRT engine per Jetson variant per CUDA/TensorRT version. TensorRT engines are not portable across GPU architectures, CUDA versions, or even minor TensorRT releases, so the matrix can balloon quickly. TensorRT 10/11.x is the practical 2026 baseline.
  • An OpenVINO IR (.xml + .bin) built with OpenVINO 2024.x or later for Intel targets, optionally with INT8 calibration via NNCF.
  • An ONNX Runtime artefact (the same .onnx) with an execution-provider plan that picks CUDA, DirectML, OpenVINO EP, or CPU as appropriate. ONNX Runtime 1.18+ is the practical 2026 baseline.
  • For PyTorch-native deployments, an ExecuTorch .pte produced by torch.export then to_edge and to_executorch.
  • For on-device language workloads, a GGUF quantised to Q4_K_M or Q5_K_M depending on the device class.

Each artefact is built reproducibly inside a hermetic container with pinned toolchain versions and a recorded SBOM. The build container is itself an immutable image with its own provenance — yes, you sign the thing that signs the thing.
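To make the size of the packaged-artefact tree concrete, here is a minimal sketch that enumerates the (hardware, runtime, precision, batch shape) tuples for the four-target example above. The fleet table, the batch shapes, the model name, and the tag format are all hypothetical choices of mine for illustration, not a convention of any registry.

```python
from dataclasses import dataclass
from itertools import product
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Target:
    hardware: str
    runtime: str
    precision: str
    batch_shape: Tuple[int, ...]


# Hypothetical fleet description: runtime and precisions built per hardware class.
FLEET: Dict[str, Tuple[str, List[str]]] = {
    "jetson-agx-orin": ("tensorrt", ["fp16", "int8"]),
    "jetson-orin-nano": ("tensorrt", ["fp16", "int8"]),
    "intel-tigerlake-igpu": ("openvino", ["fp16", "int8"]),
    "generic-cpu": ("onnxruntime", ["fp32"]),
}
# Hypothetical batch shapes the packager builds for.
BATCH_SHAPES = [(1, 3, 224, 224), (4, 3, 224, 224)]


def build_matrix(fleet, batch_shapes) -> List[Target]:
    """Expand the fleet description into one Target per packaged artefact."""
    targets = []
    for hardware, (runtime, precisions) in fleet.items():
        for precision, shape in product(precisions, batch_shapes):
            targets.append(Target(hardware, runtime, precision, shape))
    return targets


def artefact_tag(model: str, version: str, target: Target) -> str:
    """Illustrative tag format so each derivative is addressable in the OCI store."""
    shape = "x".join(str(d) for d in target.batch_shape)
    return (f"{model}:{version}-{target.hardware}-{target.runtime}"
            f"-{target.precision}-b{shape}")


if __name__ == "__main__":
    matrix = build_matrix(FLEET, BATCH_SHAPES)
    print(len(matrix), "artefacts for one registered model version")
    for target in matrix:
        print(artefact_tag("defect-detector", "7", target))
```

Even this small example yields 14 artefacts per model version before any CUDA or TensorRT version axis is added, which is why the matrix deserves its own CI budget.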

Signing. cosign is the de-facto standard for signing OCI-stored model artefacts. The signing flow is keyless OIDC via Sigstore for staging and a hardware-backed KMS key (AWS KMS, GCP Cloud KMS, or an on-prem HSM) for production. SLSA Level 3 build provenance is emitted by the packaging DAG, attesting which Git commit, which container, and which input dataset produced this artefact. On the device, the update agent verifies signature and provenance before swapping the model. Reject-on-fail is non-negotiable; a model that does not verify must never be loaded, even if the previous model is unavailable.
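The reject-on-fail rule can be sketched as a load gate. This is not cosign: the signature and provenance check is injected as a callable (`signature_ok`, a placeholder of mine) so the sketch stays runnable without any signing tooling, and the only concrete check is a SHA-256 digest comparison. All names here are hypothetical.

```python
import hashlib
import hmac
from typing import Callable, TypeVar

T = TypeVar("T")


class VerificationError(Exception):
    """Raised when an artefact fails verification and must not be loaded."""


def verify_artefact(blob: bytes, expected_sha256: str,
                    signature_ok: Callable[[bytes], bool]) -> None:
    """Raise unless the digest matches and the injected signature check passes."""
    actual = hashlib.sha256(blob).hexdigest()
    if not hmac.compare_digest(actual, expected_sha256):
        raise VerificationError("digest mismatch")
    if not signature_ok(blob):
        raise VerificationError("signature or provenance check failed")


def load_if_verified(blob: bytes, expected_sha256: str,
                     signature_ok: Callable[[bytes], bool],
                     loader: Callable[[bytes], T]) -> T:
    """Only hand the bytes to the runtime loader after verification succeeds.

    There is deliberately no fallback path that loads an unverified artefact:
    if verification fails the exception propagates and nothing new is loaded.
    """
    verify_artefact(blob, expected_sha256, signature_ok)
    return loader(blob)
```

The design choice worth copying is that the loader is unreachable except through the verifier, so "load anyway because nothing else is available" cannot be added by accident.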

Schema and contract. Alongside the artefact, the registry stores the input and output schema, the preprocessing pipeline as a serialised transformer (sklearn pipeline, a Triton ensemble config, or a small ONNX preprocessing model), and a model card with intended use, evaluation metrics, and known failure modes. The runtime refuses to load a model whose schema does not match the feature pipeline’s output contract — this is what stops “we changed feature ordering in the cloud and silently broke 800 devices” incidents.
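A minimal sketch of the schema check the runtime might perform before loading, assuming the contract is an ordered list of (name, dtype, shape) fields. The `Field` type and the example feature names are hypothetical; a real registry would more likely store this as a proto or signature.json, as described above.

```python
from typing import List, NamedTuple, Sequence, Tuple


class Field(NamedTuple):
    name: str
    dtype: str
    shape: Tuple[int, ...]


def contract_violations(model_inputs: Sequence[Field],
                        pipeline_outputs: Sequence[Field]) -> List[str]:
    """Compare schemas position by position; an empty list means they match.

    Order matters: a reordered feature vector has the right shape and dtype
    overall but feeds the wrong values into each model input.
    """
    problems = []
    if len(model_inputs) != len(pipeline_outputs):
        problems.append(
            f"field count: model expects {len(model_inputs)}, "
            f"pipeline emits {len(pipeline_outputs)}")
    for position, (want, got) in enumerate(zip(model_inputs, pipeline_outputs)):
        if want != got:
            problems.append(
                f"position {position}: model expects {want}, pipeline emits {got}")
    return problems


if __name__ == "__main__":
    model = [Field("vibration_rms", "float32", (1,)),
             Field("bearing_temp_c", "float32", (1,))]
    pipeline = list(reversed(model))  # feature ordering changed upstream
    for line in contract_violations(model, pipeline):
        print("refusing to load:", line)
```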

Quantisation and precision. Edge artefacts are almost always quantised. INT8 post-training quantisation via TensorRT calibration, OpenVINO NNCF, or onnxruntime.quantization is the workhorse. FP16 is the common middle ground on Jetson. INT4 quantisation for small language models via GGUF or AWQ is increasingly common but demands calibration-set hygiene because the accuracy hit is workload-dependent. Always store the calibration dataset reference next to the quantised artefact; otherwise you cannot rebuild it deterministically a year later when the toolchain version drifts.

OTA distribution: fleet update patterns and rollback

OTA (over-the-air) distribution is the most operationally sensitive part of the pipeline. A bad rollout can brick a fleet; a slow rollout can delay a safety fix. The 2026 reference pattern is cohorted progressive delivery with atomic A/B partition rollback, illustrated in diagram arch_03.

Diagram arch_03: OTA distribution, showing canary, early-adopter, and stable cohorts with health gates and atomic rollback to a pinned previous version.

Cohorts. The controller splits the fleet into cohorts. A typical split is 1% canary, 9% early adopter, 90% stable. Cohorts are deterministic — a device’s cohort is a hash of its serial plus the rollout ID, so a misbehaving cohort can be reproduced. Cohorts are also constrained by blast-radius rules: no rollout touches more than one device per redundant pair, no rollout touches more than one cell of a production line simultaneously, and safety-critical zones lag the rest by at least one stable interval.
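Deterministic cohort assignment can be sketched in a few lines. I assume SHA-256 as the hash and a 100-bucket split matching the 1% / 9% / 90% example; the function name, the serial format, and the rollout ID are hypothetical, and the blast-radius rules are not modelled here.

```python
import hashlib


def cohort_for(serial: str, rollout_id: str,
               canary_pct: int = 1, early_pct: int = 9) -> str:
    """Map a device to a cohort from a hash of its serial plus the rollout ID.

    The same (serial, rollout_id) pair always lands in the same cohort, so a
    misbehaving cohort can be reconstructed after the fact. Changing the
    rollout ID reshuffles devices, so the same unit is not always the canary.
    """
    digest = hashlib.sha256(f"{rollout_id}:{serial}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    if bucket < canary_pct:
        return "canary"
    if bucket < canary_pct + early_pct:
        return "early_adopter"
    return "stable"


if __name__ == "__main__":
    counts = {"canary": 0, "early_adopter": 0, "stable": 0}
    for i in range(5000):
        counts[cohort_for(f"SN-{i:05d}", "rollout-42")] += 1
    print(counts)
```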

Health gates. Between cohorts, the controller waits for a soak window — typically minutes for benign workloads, hours for safety-relevant ones — and evaluates an SLO gate. The gate combines latency p99, error rate, watchdog reboots, and any model-specific proxy for accuracy (shadow disagreement, anomaly rate, operator override rate). If the gate fails the rollout is paused and on-call is paged. If the gate passes the next cohort proceeds. NVIDIA Fleet Command, Greengrass, Azure IoT Edge, Mender, and balenaCloud all expose primitives for cohorting and health gating; the differences are in how much policy you express declaratively versus in your own controller.
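As an illustration of the gate logic, the sketch below compares a cohort's soak-window health against a policy and decides whether the next cohort may proceed. The field names and every threshold value are placeholders I chose for the example; real limits depend on the workload and its SLO.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class CohortHealth:
    latency_p99_ms: float
    error_rate: float
    watchdog_reboots: int
    shadow_disagreement: float  # model-specific accuracy proxy


@dataclass(frozen=True)
class GatePolicy:
    # Placeholder limits for illustration only.
    max_latency_p99_ms: float = 50.0
    max_error_rate: float = 0.001
    max_watchdog_reboots: int = 0
    max_shadow_disagreement: float = 0.02


def gate_violations(health: CohortHealth, policy: GatePolicy) -> List[str]:
    """Return one message per breached limit; an empty list means the gate passes."""
    checks = [
        ("latency_p99_ms", health.latency_p99_ms, policy.max_latency_p99_ms),
        ("error_rate", health.error_rate, policy.max_error_rate),
        ("watchdog_reboots", health.watchdog_reboots, policy.max_watchdog_reboots),
        ("shadow_disagreement", health.shadow_disagreement,
         policy.max_shadow_disagreement),
    ]
    return [f"{name}={value} exceeds {limit}"
            for name, value, limit in checks if value > limit]


def next_action(violations: List[str]) -> str:
    """Pause and page on any violation, otherwise release the next cohort."""
    return "pause_and_page" if violations else "proceed"
```

Returning the list of violations rather than a bare boolean gives the audit event and the page something specific to say.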

Delta updates. Models are large and links are thin. Delta-encoded updates — using bsdiff, zstd-dictionary, or OCI-zstd layer compression — typically reduce update payload substantially versus full artefacts, and proportional savings are larger when the architecture is stable across versions and only weights change. Mender and balenaCloud support delta natively; Greengrass and Azure IoT Edge usually need an extra component to implement deltas cleanly.

Atomic A/B with rollback. The device holds two slots: A holds the running model; B holds the new candidate. The update agent writes B, verifies signature, performs a local smoke test (a fixed input vector that should produce a known output within tolerance), and only then atomically swaps the active pointer. If the new model crashes the runtime watchdog within a configurable window the agent swaps back to A automatically. The previous model is pinned for at least one full update cycle so rollback never has to re-download. This pattern was popularised by Android A/B updates and Mender’s dual-rootfs; it now applies cleanly to model artefacts on top of read-only model volumes.
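The slot logic can be sketched in memory. Assumptions to note: models are plain Python callables, the signature result is passed in as a boolean, and the candidate is checked before it is written to the standby slot, whereas on a device the checks would run against the bytes already written to slot B. The class and method names are hypothetical.

```python
import math
from typing import Callable, Dict, Optional, Sequence

Model = Callable[[Sequence[float]], Sequence[float]]


def smoke_test(model: Model, fixed_input: Sequence[float],
               expected: Sequence[float], tol: float = 1e-3) -> bool:
    """Run the fixed input vector and compare with the known output within tolerance."""
    try:
        output = model(fixed_input)
    except Exception:
        return False
    return len(output) == len(expected) and all(
        math.isclose(o, e, abs_tol=tol) for o, e in zip(output, expected))


class ABSlots:
    """Two model slots plus an active pointer; swap and rollback only flip the pointer."""

    def __init__(self, running: Model) -> None:
        self.slots: Dict[str, Optional[Model]] = {"A": running, "B": None}
        self.active = "A"

    @property
    def standby(self) -> str:
        return "B" if self.active == "A" else "A"

    def install(self, candidate: Model, signature_ok: bool,
                fixed_input: Sequence[float], expected: Sequence[float]) -> bool:
        """Stage the candidate and swap, or leave the running model untouched."""
        if not signature_ok or not smoke_test(candidate, fixed_input, expected):
            return False
        target = self.standby
        self.slots[target] = candidate
        self.active = target  # the previous model stays pinned in the other slot
        return True

    def rollback(self) -> bool:
        """Called by the watchdog: flip back to the pinned previous model if one exists."""
        if self.slots[self.standby] is None:
            return False
        self.active = self.standby
        return True

    def infer(self, x: Sequence[float]) -> Sequence[float]:
        return self.slots[self.active](x)
```

Because rollback is only a pointer flip to a model that is already on disk, it needs no network access, which is the property that matters during an incident.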

Bandwidth shaping. A 5,000-device fleet pulling 200 MB simultaneously will saturate any plant uplink. The controller staggers pulls across a configurable window, prefers off-shift hours, and respects per-site bandwidth budgets. Site-local mirroring — a single gateway per plant caches artefacts and the rest pull from it — is the standard pattern for sites with poor WAN. Greengrass stream manager, balena’s local mode, and Mender’s local proxy all implement this.
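A back-of-the-envelope sketch of why staggering and site-local mirroring matter. The site size (50 devices), the 100 Mbit/s uplink, and the 25% bandwidth budget are hypothetical numbers I picked for the arithmetic; only the 200 MB artefact size comes from the text. Megabytes and megabits are treated as decimal units.

```python
import hashlib


def transfer_seconds(pulls: int, artefact_mb: float,
                     uplink_mbps: float, budget_fraction: float) -> float:
    """Minimum time to move `pulls` copies over the WAN within the site's budget."""
    megabits = pulls * artefact_mb * 8
    return megabits / (uplink_mbps * budget_fraction)


def pull_offset_seconds(serial: str, rollout_id: str, window_s: int) -> int:
    """Deterministic per-device start offset that spreads pulls across the window."""
    digest = hashlib.sha256(f"{rollout_id}:{serial}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % window_s


if __name__ == "__main__":
    # Every device pulls over the WAN: 50 * 200 MB * 8 / (100 * 0.25) seconds.
    direct = transfer_seconds(50, 200, 100, 0.25)
    # One mirror gateway pulls once; the other devices fetch over the plant LAN.
    mirrored = transfer_seconds(1, 200, 100, 0.25)
    print(f"direct: {direct:.0f} s, mirrored: {mirrored:.0f} s")
    print("offset for SN-00017:", pull_offset_seconds("SN-00017", "rollout-42", 3600))
```

Under these assumptions the direct case needs 3,200 seconds of budgeted WAN time per site and the mirrored case 64 seconds, so the stagger window only has to be sized generously when there is no mirror.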

Change-management integration. Every rollout is associated with a change ticket. The controller refuses to start a rollout outside an approved change window for safety-classified models, and emits an audit event for every cohort transition. This is rarely fun to build but it is what makes the system survive contact with the customer’s auditor.

Edge runtimes (Triton, TF-Lite, ExecuTorch, ONNX Runtime, NVIDIA Jetson stack)

The runtime choice is decided by hardware target, model class, and team familiarity, in roughly that order. The 2026 shortlist is small and converged.

NVIDIA Triton Inference Server remains the heavyweight for Jetson Orin and any x86-with-NVIDIA-GPU edge deployment. It supports TensorRT, ONNX Runtime, PyTorch, TensorFlow, and Python backends in a single process, ensembles for multi-stage pipelines, dynamic batching, and a uniform gRPC/HTTP interface. The cost is RAM and disk footprint — Triton is not what you want on a 2 GB Cortex-A board.

ONNX Runtime is the most portable choice and has become the default fallback. With execution providers for CUDA, TensorRT, OpenVINO, DirectML, CoreML, NNAPI, QNN (Qualcomm), and CPU, a single ONNX file can run almost everywhere. ONNX Runtime 1.18+ in 2026 supports graph optimisations, INT8 inference, and a small enough core that it fits comfortably on ARM gateways. The trade-off versus Triton is no built-in serving — you wrap it in your own gRPC or REST shim — and somewhat less aggressive per-target optimisation than a hand-built TensorRT engine.

TensorFlow Lite is still the right choice for microcontroller-class targets and Android-derived industrial HMIs. TF-Lite Micro extends down to Cortex-M class. For new server-side or gateway-class work the centre of gravity has moved to ONNX Runtime and ExecuTorch, but TF-Lite remains entrenched in Coral Edge TPU and many Android-based panel deployments.

ExecuTorch (PyTorch Edge) is the 2026 choice for teams who do not want to leave the PyTorch ecosystem. torch.export produces a stable graph; to_edge and to_executorch lower it to a portable .pte that runs on a small C++ runtime with backends for XNNPACK, CoreML, MPS, Qualcomm QNN, and ARM Ethos. The story is less mature than ONNX Runtime for industrial gateway shipping but the trajectory is clear — for PyTorch-native teams ExecuTorch removes the ONNX round-trip and its accompanying op-coverage drama.

OpenVINO is the right choice on Intel iGPU and CPU targets, and its NNCF quantisation toolchain is one of the better INT8 stories in the industry. OpenVINO 2024.x ships an LLM-oriented optimisation path and good vision-transformer support.

NVIDIA Jetson stack. On Jetson devices the practical stack is JetPack (CUDA, cuDNN, TensorRT), Triton or a TensorRT-native binary, DeepStream for video pipelines, and Holoscan for higher-rate sensor fusion. JetPack 6.x in 2026 lines up with Ubuntu 22.04 base images and Triton 24.x.

Runtime sizing. A practical exercise before locking in: measure cold-start time, RSS, and tail latency for each candidate runtime on the actual target hardware with the actual model. Synthetic benchmarks lie; INT8 TensorRT on the right batch shape can be dramatically faster than ONNX Runtime on the same device, or it can be a wash if the model is memory-bound. Do not pick the runtime before you have measured it on the box.

Model server vs library. A separate model server (Triton, OpenVINO Model Server) gives you a clean upgrade path — you can upgrade the model without restarting the inference application — at the cost of one extra process and IPC. Linking ONNX Runtime or ExecuTorch directly into the app gives lower latency and simpler ops at the cost of coupling model upgrades to app upgrades. For fleets above a few hundred devices the server pattern usually wins; for tightly resource-constrained single-purpose devices the library pattern usually wins.

Drift detection without ground truth

This is the section every team underbuilds first time around. Drift detection on the edge is hard because labels are scarce and slow. The 2026 reference pattern is a layered detector stack that produces signals of increasing confidence and feeds a sparse human review queue, shown in diagram arch_04.

Diagram arch_04: drift detection without labels, showing shadow model disagreement, population drift statistics, anomaly scoring, and a manual review queue feeding a retraining DAG.

Layer 1 — Population drift. Compute summary statistics (mean, variance, quantiles) and embedding-space distributions on inputs and on predictions, and compare to a fixed reference window from the training distribution. Population Stability Index (PSI), Kullback–Leibler divergence, and Maximum Mean Discrepancy (MMD) are the standard metrics; Evidently, NannyML, Fiddler, and Arize provide off-the-shelf implementations. Population drift is cheap to compute, gives a continuous signal, and catches most slow drifts caused by seasonal change, equipment ageing, and gradual recipe change.
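For reference, PSI over a binned feature is

    PSI = sum_i (a_i - e_i) * ln(a_i / e_i)

where e_i is the fraction of the reference window in bin i and a_i is the fraction of the live window. Below is a minimal standard-library sketch, assuming equal-width bins over the reference range and a small floor on empty bins; libraries such as Evidently make their own binning choices, so treat this as an illustration of the metric and not as their implementation.

```python
import math
import random
from typing import List, Sequence


def psi(reference: Sequence[float], live: Sequence[float],
        bins: int = 10, eps: float = 1e-6) -> float:
    """Population Stability Index of `live` against `reference` for one feature."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0  # guard against a constant reference

    def fractions(values: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1  # clip out-of-range values
        return [max(c / len(values), eps) for c in counts]

    ref, cur = fractions(reference), fractions(live)
    return sum((a - e) * math.log(a / e) for e, a in zip(ref, cur))


if __name__ == "__main__":
    rng = random.Random(0)
    reference = [rng.gauss(0.0, 1.0) for _ in range(5000)]
    shifted = [x + 1.0 for x in reference]  # simulate a one-sigma sensor offset
    print("same window:", psi(reference, reference))
    print("shifted window:", round(psi(reference, shifted), 3))
```

A commonly quoted rule of thumb treats PSI below 0.1 as stable and above 0.25 as a significant shift, but those cut-offs are conventions, and each feature deserves a threshold checked against its own history.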

Layer 2 — Shadow disagreement. Run a second model — either the previous production version or a heavier reference model — in shadow on a sampled subset of inputs and measure prediction disagreement. Disagreement that spikes when neither model has been changed is strong evidence of input distribution drift. On constrained hardware run the shadow on a sample (1–10%) or on a site-local gateway rather than every device.
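A small sketch of the two pieces involved: deciding which requests go to the shadow model, and computing the disagreement rate over a window. Hash-based sampling on a request ID is my assumption here (it keeps the decision reproducible); the function names are hypothetical.

```python
import hashlib
from typing import Sequence


def in_shadow_sample(request_id: str, sample_pct: float) -> bool:
    """Deterministically select roughly `sample_pct` percent of requests for shadowing."""
    digest = hashlib.sha256(request_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return bucket < sample_pct * 100


def disagreement_rate(primary: Sequence, shadow: Sequence) -> float:
    """Fraction of sampled inputs where the two models predict different labels."""
    if len(primary) != len(shadow):
        raise ValueError("prediction lists must be aligned")
    if not primary:
        return 0.0
    return sum(p != s for p, s in zip(primary, shadow)) / len(primary)


if __name__ == "__main__":
    primary = ["ok", "ok", "defect", "ok"]
    shadow = ["ok", "defect", "defect", "ok"]
    print(disagreement_rate(primary, shadow))
```

The rate is only meaningful relative to its own baseline: two healthy models always disagree a little, so the signal is the spike, not the absolute value.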

Layer 3 — Anomaly scoring. An autoencoder, isolation forest, or one-class SVM trained on the training-set inputs scores each live input for “how typical is this”. Anomaly spikes correlate with drift, novel failure modes, sensor faults, and data pipeline bugs. The reference architecture treats anomaly score as a routing signal — a sample with high anomaly score is more likely to be added to the review queue.

Layer 4 — Operator override and downstream KPI. Where the model output feeds a human-in-the-loop or a downstream metric (yield, takt time, defect rate), the operator’s override rate and the downstream KPI are the strongest proxy for accuracy. A change in override rate is often the first lagging confirmation that a drift detected by Layer 1 or 2 is real.

The review queue. Suspect windows from any layer feed a manual review queue exposed in a labelling tool (Label Studio, V7, CVAT, or an internal tool). A subject-matter expert spends a bounded amount of time per week labelling the queue. The output is the new labelled batch that closes the loop.

Anti-patterns. Two failure modes to avoid. First, alert-flood: any of the four layers will produce false positives, and alerting on every signal trains the team to ignore the system. The reference pattern uses each layer as a routing signal into the review queue rather than as a paging signal; paging is reserved for confirmed downstream-KPI regression or safety-classified models. Second, silent retraining: do not let the pipeline retrain and redeploy automatically on a drift signal without human approval. The signal is hard-earned but the failure modes of an unsupervised retrain on biased recent data are worse than the drift itself.
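One way to express "route, do not page" in code is sketched below. It assumes each layer's signal has already been normalised to the range 0 to 1 and combines them with a simple maximum; that combination rule, the field names, and the weekly budget are hypothetical choices for illustration.

```python
import heapq
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class SuspectWindow:
    window_id: str
    population_drift: float     # Layer 1, normalised to [0, 1]
    shadow_disagreement: float  # Layer 2, normalised to [0, 1]
    anomaly_score: float        # Layer 3, normalised to [0, 1]


def routing_score(window: SuspectWindow) -> float:
    """Hypothetical combination: the strongest single signal drives priority."""
    return max(window.population_drift, window.shadow_disagreement,
               window.anomaly_score)


def select_for_review(windows: Sequence[SuspectWindow],
                      weekly_budget: int) -> List[SuspectWindow]:
    """Fill the expert's bounded weekly queue with the highest-priority windows."""
    return heapq.nlargest(weekly_budget, windows, key=routing_score)


def should_page(kpi_regression_confirmed: bool, safety_classified: bool,
                drift_flagged: bool) -> bool:
    """Page only on confirmed KPI regression, or on drift for safety-classified models."""
    return kpi_regression_confirmed or (safety_classified and drift_flagged)
```

Note that nothing in this sketch triggers a retrain; the output is a queue for a human, in line with the second anti-pattern.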

Observability: metrics, traces, model health, anomaly logs

The observability layer is the nervous system. The 2026 reference stack is OpenTelemetry on the device, a site-level OTel collector with a local WAL for outage tolerance, and Prometheus-compatible storage plus Loki and Tempo in the cloud. Diagram arch_05 lays it out.

Diagram arch_05: edge-to-cloud observability stack with OTel SDK, site proxy with local WAL buffer, VictoriaMetrics, Loki, Tempo, and Grafana dashboards.

On-device metrics. Standard system metrics (CPU, GPU utilisation, memory, temperature, disk wear) come from the node exporter or a JetPack-specific exporter. Application metrics — inference latency p50/p95/p99, queue depth, batch size, model version, hardware accelerator utilisation, classifier output histogram — come from the inference app via OTel SDK. Per-model versioned metrics are non-negotiable; tagging by model_version is what lets you compare canary and stable cleanly.

Traces. A trace covering ingest, feature extraction, inference, and downstream emit is invaluable when chasing why a particular plant has bad latency. Sample at 1–5% on stable models; sample at 100% on canaries and on traces that span an error event.
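The sampling rule can be written as a single decision function. Since an error is only known once a trace has finished, this sketch assumes the decision is made at the end of the trace (tail-based), for example in the site-level collector; the function name and the 5% default are hypothetical, and this is not the OTel SDK's sampler interface.

```python
import hashlib


def keep_trace(trace_id: str, is_canary: bool, has_error: bool,
               stable_rate: float = 0.05) -> bool:
    """Keep every canary and error trace; keep a fixed fraction of the rest."""
    if is_canary or has_error:
        return True
    digest = hashlib.sha256(trace_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return bucket < stable_rate * 10_000
```

Hashing the trace ID rather than rolling a random number means every component that sees the same trace reaches the same decision.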

Logs. Structured logs from the inference app, the runtime, the update agent, and the OTel collector itself, shipped via Promtail or Fluent Bit to Loki. Logs include model version, deployment ID, and device serial as fixed labels.

Outage tolerance. The site-level collector has a persistent WAL sized for the expected WAN-outage budget (typically 24–72 hours of telemetry). When WAN returns, the collector drains the WAL with backpressure and dedup. Devices also buffer locally, with a smaller budget, for the case where a device loses its link to the site collector. Without this, every plant network blip produces a fake “fleet went dark” incident.
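The buffer-and-drain behaviour can be sketched as follows. This version is in memory only, so it is not a WAL; a real collector would persist to disk. It shows the three behaviours described above under assumptions of my own: a bounded capacity that drops the oldest events first, dedup by event ID, and a batched drain that stops as soon as a send fails.

```python
from collections import OrderedDict
from typing import Any, Callable


class SiteBuffer:
    """Bounded, deduplicating FIFO buffer for telemetry events (in-memory sketch)."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self._events: "OrderedDict[str, Any]" = OrderedDict()

    def __len__(self) -> int:
        return len(self._events)

    def append(self, event_id: str, payload: Any) -> None:
        if event_id in self._events:
            return  # duplicate delivery from a device retry
        if len(self._events) >= self.capacity:
            self._events.popitem(last=False)  # outage exceeded budget: drop oldest
        self._events[event_id] = payload

    def drain(self, send: Callable[[str, Any], bool], batch_size: int) -> int:
        """Send up to `batch_size` events in order; keep everything after a failure."""
        sent = 0
        while self._events and sent < batch_size:
            event_id, payload = next(iter(self._events.items()))
            if not send(event_id, payload):
                break  # WAN still down or backend pushing back
            del self._events[event_id]
            sent += 1
        return sent
```

Calling `drain` with a small `batch_size` on a timer is a crude form of backpressure: the backlog empties at a bounded rate instead of flooding the uplink the moment it returns.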

Cloud aggregation. VictoriaMetrics or Grafana Mimir for metrics, Loki for logs, Tempo or Jaeger for traces, Grafana for dashboards, Alertmanager and PagerDuty/Opsgenie for routing. The choice of storage backend is largely about cardinality budget — per-device labels can explode cardinality if you tag everything; the discipline is to aggregate at the site level for high-cardinality dashboards and reserve per-device drill-down for incident response.

Model health dashboards. The dashboards platform teams actually use are not “latency by model”. They are “rollout state across the fleet”, “drift signals per model per site”, “models with stale telemetry”, and “models past their retrain SLO”. An early version of the platform has only the latency view; the mature version has the fleet-level views.

Alerting discipline. Page on SLO breach, not on raw metric thresholds. Pages should be tied to runbooks; runbooks should be tied to the controller actions an on-call engineer can take (pause rollout, force rollback, snapshot device, drain to safe-mode). An alert without a runbook will be acknowledged and forgotten.

Trade-offs, gotchas, and what goes wrong

Silent drift. The most common failure is silent accuracy decay. The model still answers in the right shape, latency is fine, no alarms fire, but the defect-escape rate climbs slowly over months. Mitigation is the layered drift detector and a hard SLO on downstream KPI with a human review cadence. Accept that some drift will be detected only via Layer 4 — the goal is bounded delay.

Label scarcity. As covered above, ground truth is the bottleneck. Mitigations include sparse expert review queues, programmatic labelling (Snorkel-style weak supervision), simulation-based label generation from a digital twin, and operator-override mining. None solve the problem; together they keep the retrain loop fed.

Network constraints. Plant networks are firewalled by design. The pipeline must work behind a single outbound HTTPS endpoint per site, must tolerate proxy interception, and must respect quiet windows. Anything that requires inbound connections to devices is a non-starter.

Secret rotation. Devices hold API keys, signing certificates, and TLS client certs. Rotating those across a fleet without a rollback path is a known way to brick devices. The reference pattern is short-lived OIDC-style federated credentials issued by a control plane that the device can re-fetch, plus a long-lived bootstrap credential held in a TPM or secure element. Plan rotation explicitly — do not discover at year three that nobody remembers how.

Supply-chain attacks on models. Treat model files as executable supply chain. Sign them, attest provenance, scan for known malicious-op patterns, refuse unverified artefacts, and keep an SBOM for every layer. A poisoned model that biases one class is a real attack and one the industry has only recently started defending against.

TensorRT engine portability. TensorRT engines do not survive driver upgrades, CUDA upgrades, or TensorRT minor upgrades, and they certainly do not survive GPU architecture changes. Treat the engine matrix as a real ongoing cost; budget for rebuild-on-driver-update and CI it.

Quantisation regressions. INT8 calibration is sensitive to the calibration dataset. A model that was 99.4% on FP16 and 99.1% on INT8 in the lab can drop to 96% on INT8 in the field if the calibration set was unrepresentative. Re-validate quantised artefacts against a held-out field-representative set before promotion.
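The promotion check reduces to a small piece of arithmetic. The sketch below uses the accuracy figures from the paragraph above; the 0.5-point maximum drop is a placeholder threshold of mine, not a recommendation, and the function names are hypothetical.

```python
def accuracy_drop_pts(baseline_pct: float, quantised_pct: float) -> float:
    """Accuracy lost to quantisation, in percentage points."""
    return baseline_pct - quantised_pct


def may_promote(baseline_pct: float, quantised_pct: float,
                max_drop_pts: float = 0.5) -> bool:
    """Allow promotion only if the quantised artefact stays within the allowed drop."""
    return accuracy_drop_pts(baseline_pct, quantised_pct) <= max_drop_pts


if __name__ == "__main__":
    # Lab calibration set: 99.4% FP16 versus 99.1% INT8.
    print("lab:", may_promote(99.4, 99.1))
    # Held-out field-representative set: INT8 falls to 96%.
    print("field:", may_promote(99.4, 96.0))
```

The point of the two calls is that the same artefact passes on lab data and fails on field data, so the gate is only as good as the evaluation set it is fed.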

Time sync. Edge ML telemetry without good NTP discipline is impossible to debug. PTP is usually overkill for correlating telemetry; chronyd with a site-local stratum-2 source is the practical answer. Tag every event with both device time and ingest time.

Storage wear. Industrial gateways with consumer SSDs wear out under busy logging. Size telemetry retention and log levels to the disk’s TBW budget. Use ring buffers and journald rate-limiting.
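Sizing against the TBW budget is simple arithmetic, sketched below. The drive rating (150 TBW), the five-year service life, and the write-amplification factor of 2 are hypothetical inputs for the example; take the real values from the drive's datasheet and from measurement.

```python
def daily_write_budget_gb(tbw_tb: float, service_years: float,
                          write_amplification: float = 1.0) -> float:
    """Host writes per day (GB) that keep the drive inside its TBW rating."""
    total_gb = tbw_tb * 1000  # decimal terabytes to gigabytes
    return total_gb / (service_years * 365) / write_amplification


def fits_budget(logs_gb_per_day: float, telemetry_gb_per_day: float,
                budget_gb_per_day: float) -> bool:
    """Check that combined logging and telemetry writes stay within the daily budget."""
    return logs_gb_per_day + telemetry_gb_per_day <= budget_gb_per_day


if __name__ == "__main__":
    budget = daily_write_budget_gb(tbw_tb=150, service_years=5,
                                   write_amplification=2.0)
    print(f"daily write budget: {budget:.1f} GB")
    print("debug logging at 60 GB/day fits:", fits_budget(60, 5, budget))
```

With these inputs the budget is roughly 41 GB of host writes per day, which a gateway left at debug log level can exceed without anyone noticing until the disk fails.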

Cost. Edge MLOps does not look expensive per device. It looks expensive in aggregate when the registry has 40 model variants per logical model, the OCI store grows by GB per release, the cloud telemetry store ingests millions of series per day, and the labelling vendor charges per task. Build cost dashboards alongside health dashboards.

Practical recommendations

For a team starting this build today, the short list:

  • Pick the OTA controller first based on hardware mix, not the runtime. Greengrass, Azure IoT Edge, Fleet Command, Mender, and balenaCloud all work; choose by fleet composition and existing cloud commitment.
  • Standardise on ONNX as the registry artefact and let the packager produce hardware-specific derivatives. This keeps the registry small and reproducible.
  • Sign everything from day one with cosign. Retrofitting signing later is painful.
  • Build the review queue and drift detector before you need them. By the time you need them, you have already shipped a regression you cannot see.
  • Treat the site-level OTel collector as a first-class component, not a deployment afterthought. Outage tolerance lives there.
  • Cohort and gate every rollout, even the small ones. Pure-percentage rollouts without health gates are how fleets get bricked.
  • Pin the previous model on every device. Rollback that requires a re-download is rollback you cannot perform during the incident.
  • Write the runbook for “model is hallucinating on plant X” before you launch. The runbook is the artefact that converts an architecture into an operation.
  • Measure on real hardware. Synthetic benchmarks for runtime selection waste weeks.
  • Budget for retrain cadence as a line item. A model with no retrain cadence is a model going stale.

FAQ

How is edge MLOps different from cloud MLOps?
Cloud MLOps assumes homogeneous hardware, abundant compute, persistent connectivity, and a relatively easy path from inference to ground truth. Edge MLOps assumes heterogeneous accelerators, constrained compute, intermittent links, and ground truth that is sparse and lagging. The reference stack adds a packager for hardware-specific artefacts, an OTA controller with cohorting and rollback, a runtime with on-device telemetry, and a drift detector that does not rely on labels.

Which model registry should I use?
Pick the one that aligns with your cloud and the rest of your platform: MLflow Model Registry if you are self-hosting or on a mixed cloud; SageMaker Model Registry on AWS; Vertex AI Model Registry on GCP; Azure ML on Azure. The differences in registry surface are small. The bigger choices are upstream (experiment tracking) and downstream (packager and OTA controller).

Triton vs ONNX Runtime vs ExecuTorch on the edge?
Triton when you are NVIDIA-heavy and want a uniform server abstraction across model families. ONNX Runtime when you want the broadest hardware coverage with one artefact. ExecuTorch when your team lives in PyTorch and your targets are well covered by its backends. The good news is you can mix — Triton with an ONNX Runtime backend is a common compromise.

How do you detect model drift without ground-truth labels?
With a layered detector stack: population drift on inputs and predictions, shadow disagreement against a second model, anomaly scoring on inputs, and operator-override and downstream-KPI tracking. None of these are conclusive on their own; together they route suspect windows to a sparse human review queue that produces labelled batches for retraining.

What is the right OTA pattern for an industrial fleet?
Cohorted progressive delivery with deterministic device cohorts, between-cohort SLO gates, atomic A/B partition swap on the device, automatic watchdog rollback within a configurable window, and a pinned previous version that never needs re-download. The controller integrates with change management for safety-classified models.

How do I sign and verify edge models?
Sign with cosign using keyless OIDC for staging and a hardware-backed KMS for production. Emit SLSA Level 3 build provenance from the packaging pipeline. Verify signature and provenance on the device before loading, and refuse the load on failure even if the previous model is unavailable. Maintain an SBOM for every packaged artefact.

Further reading

  • OpenTelemetry specification — https://opentelemetry.io/docs/specs/otel/
  • Sigstore / cosign documentation — https://docs.sigstore.dev/
  • SLSA supply chain framework — https://slsa.dev/
  • NVIDIA Triton Inference Server — https://github.com/triton-inference-server/server
  • ONNX Runtime — https://onnxruntime.ai/
  • ExecuTorch (PyTorch Edge) — https://pytorch.org/executorch/
  • OpenVINO Toolkit — https://docs.openvino.ai/
  • MLflow Model Registry — https://mlflow.org/docs/latest/model-registry.html
  • AWS IoT Greengrass v2 — https://docs.aws.amazon.com/greengrass/v2/
  • Azure IoT Edge — https://learn.microsoft.com/azure/iot-edge/
  • NVIDIA Fleet Command — https://www.nvidia.com/en-in/data-center/products/fleet-command/
  • Mender OTA — https://docs.mender.io/
  • balenaCloud — https://docs.balena.io/
  • Evidently AI drift detection — https://docs.evidentlyai.com/
  • ISA-95 / IEC 62443 management of change references — https://www.isa.org/
  • What is IoT — complete technical guide
  • LLM tool-calling determinism patterns 2026
