Zenoh for Industrial IoT: Reference Architecture (2026)

Most plants in 2026 still pipe sensor data through MQTT brokers that were sized for 2018 workloads. Then a robotics team plugs in ROS 2, a vision team brings DDS, and an OT team mandates Sparkplug — and the broker grinds. Eclipse Zenoh, the protocol-agnostic data fabric maintained by ZettaScale, is what teams reach for when those three traffic patterns must coexist without rebuilding the network. This post lays out a Zenoh industrial IoT architecture that fits a multi-cell brownfield plant: routers, peers, storages, the ROS 2 bridge, MQTT and DDS interop, and a keyspace you can actually defend in a design review. You will leave with a reference topology, capacity heuristics, and the gotchas that bite teams in month three.

What this post covers: the four-layer reference, peer-vs-router topology choices, ROS 2 and MQTT bridging, ISA-95 keyspace design, scaling and failure modes, and a practical adoption checklist.

Why Zenoh now: context for 2026

Answer-first summary: Zenoh matters in 2026 because industrial sites have to multiplex MQTT, DDS, and ROS 2 traffic over the same edge fabric without sacrificing latency. Zenoh is a pub/sub, query, and storage protocol designed for that mix, and it now ships official bridges to all three ecosystems.

Eclipse Zenoh grew out of work at ADLINK Technology’s advanced technology office and is now an Eclipse Foundation project led by ZettaScale. It is intentionally narrow in scope: one wire protocol that supports publish/subscribe, distributed queries, and pluggable storage backends. The reference implementation is in Rust, with bindings for C, C++, Python, Java, and Kotlin, plus zenoh-pico, a C implementation for constrained devices.

Two structural shifts in the IIoT market explain why teams adopt it in 2026.

First, the protocol monoculture is over. A modern plant has OPC UA at the controller level, Sparkplug B in the SCADA layer, ROS 2 on mobile robots, and Kafka or Pulsar at the IT boundary. No single broker spans those cleanly. Zenoh accepts that reality and offers bridges rather than asking you to rip out incumbents.

Second, edge compute is real. NVIDIA Jetson Orin, AMD Kria, and Intel NUC-class nodes sit on the line now. They want peer-to-peer publish/subscribe with sub-millisecond local hops, not a round trip to a central broker. Zenoh’s peer mode handles this; its router mode handles the rest.

If you want background on the wider messaging space first, our comparison of DDS, MQTT, and OPC UA industrial messaging protocols frames where Zenoh fits. The current post assumes you have already decided Zenoh is worth piloting.

The four-layer Zenoh reference architecture

Answer-first summary: A clean Zenoh industrial IoT architecture has four layers: edge devices that publish raw data, routers that federate cells, a storage and bridging plane that persists and translates, and consumers that read for twins, analytics, and MES. Each layer scales independently and uses Zenoh’s single wire protocol.

The architecture below is what we deploy in greenfield cells and recommend for brownfield migrations. It is intentionally boring — boring scales.

Layer 1 — Edge devices

This layer is whatever produces or consumes data at the physical floor: PLCs, drives, vision cameras, ROS 2 nodes on mobile robots, OPC UA companion-spec servers, and Modbus-to-edge converters. Devices speak to Zenoh either natively (via a zenoh-pico or zenoh-c client) or through a bridge process colocated on a gateway. Native is preferred for new builds; bridging is preferred when you cannot touch firmware.

Layer 2 — Zenoh routers

Routers form the backbone. Each production cell or plant area gets a router (or a small router cluster for HA), and routers federate via the Zenoh routing protocol. Routers do not store payloads by default — they route, deduplicate, and apply access control. A router on a modest x86 box can sustain workloads well into the hundreds of thousands of messages per second, but verify against your own payload sizes and QoS profile before sizing.

Layer 3 — Storage and bridges

Storages are pluggable Zenoh components that subscribe to a keyspace selector and persist matching samples. Backends maintained by the Zenoh project include RocksDB, InfluxDB, S3, and the local filesystem; targets such as TimescaleDB or Kafka generally need a custom backend or a separate connector process, so confirm availability before you design around them. Bridges live here too: zenoh-bridge-mqtt, zenoh-bridge-dds, and zenoh-bridge-ros2dds. This is also where you stand up a Kafka bridge if your analytics platform expects a Kafka topic interface.

Layer 4 — Consumers

Digital twins, MES, historians, and ML pipelines all subscribe here. They never talk directly to edge devices. This decoupling is the architectural payoff: you can replace the historian without retraining every PLC.

Peers, routers, and when to use which

Answer-first summary: Zenoh supports three operational modes — peer, client, and router. Use peer mode inside a single cell for sub-millisecond fan-out, client mode for resource-constrained devices, and router mode at cell boundaries and for any WAN hop. Most plants run a hybrid topology where peer meshes nest under a brokered router backbone.

Peer mode

In peer mode, every Zenoh node discovers every other node on the local network via multicast (or a configured peer list) and forms a partial mesh. There is no broker in the data path. A vision PC publishing to a robot arm in the same cell goes peer-to-peer with no intermediate hop. This is what makes Zenoh attractive for low-latency loops where adding an MQTT broker would double the round-trip time.

Client mode

Client mode is for nodes that cannot or should not participate in routing — typically resource-constrained devices using zenoh-pico, or workloads where you want strict access control. Clients connect to a designated router and rely on it for discovery and forwarding.

Router mode

Routers exist for three reasons: bridging multicast domains, terminating WAN links, and applying centralized policy (authentication, ACLs, downsampling). A router cluster gives you HA — when one router pod drops, peers reconnect to a sibling without losing subscriptions.

Hybrid is the default

In production we almost always run peer meshes inside cells, with one or two routers per cell as uplinks to a plant-wide brokered cluster. This keeps tight robotics control loops fully local, while still publishing the data to the rest of the plant and to the data center. The architecture in Figure 2 reflects this. For background on why brokered backbones matter at scale, see our Apache Pulsar geo-replication telemetry analysis — the same WAN replication arguments apply.
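As a rough illustration of the hybrid pattern, the Python sketch below builds one configuration dictionary per node role. The mode, connect.endpoints, listen.endpoints, and scouting.multicast.enabled fields follow the layout of Zenoh's JSON5 configuration file as we understand it; the node_config helper, the role names, and the router address are hypothetical. Check the field names against the default configuration shipped with your Zenoh version before relying on them.

```python
import json

# Hypothetical address of the cell router; 7447 is Zenoh's usual default port.
CELL_ROUTER = "tcp/10.0.1.10:7447"


def node_config(role: str, multicast_allowed: bool = True) -> dict:
    """Return a Zenoh-style config dict for one node in a cell.

    role: "robot-loop"  latency-critical node that stays peer-to-peer,
          "constrained" small device that is a client of the cell router,
          "uplink"      the cell router itself.
    """
    if role == "uplink":
        return {"mode": "router", "listen": {"endpoints": ["tcp/0.0.0.0:7447"]}}
    if role == "constrained":
        return {"mode": "client", "connect": {"endpoints": [CELL_ROUTER]}}
    if role == "robot-loop":
        return {
            "mode": "peer",
            # Multicast scouting finds other peers in the cell when the network allows it.
            "scouting": {"multicast": {"enabled": multicast_allowed}},
            # Peers also connect to the router so the rest of the plant sees their data.
            "connect": {"endpoints": [CELL_ROUTER]},
        }
    raise ValueError(f"unknown role: {role}")


if __name__ == "__main__":
    for role in ("robot-loop", "constrained", "uplink"):
        print(role, json.dumps(node_config(role)))
```

Keeping the role-to-mode decision in one function also gives you a single place to record, per cell, whether it is peer-meshed or router-fronted.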

Bridging ROS 2, MQTT, and DDS

Answer-first summary: The ROS 2 bridge, the MQTT bridge, and the DDS bridge let an existing plant expose all three ecosystems through a single Zenoh keyspace. The ros2dds bridge is the workhorse: it discovers ROS 2 topics and republishes them as Zenoh resources, enabling cross-cell robotics coordination without retuning DDS.

This is where Zenoh earns its “protocol-agnostic edge data” label. A bridge is a small process (a few tens of megabytes of RAM, a single CPU core) that maps between a native protocol and Zenoh.

zenoh-bridge-ros2dds

ROS 2 ships with DDS as its default middleware. That works inside a single robot, but DDS discovery floods the multicast domain as you add fleet members, and DDS does not cross WAN links well. The zenoh-bridge-ros2dds project runs alongside each ROS 2 robot and republishes its DDS topics into Zenoh. Two outcomes follow. First, multi-cell ROS 2 stops requiring shared DDS domain IDs. Second, you can subscribe to robot data from a non-ROS consumer (a twin, a dashboard) without writing ROS client code. Our ROS 2 Jazzy on Jetson Orin warehouse robotics tutorial walks through a concrete deployment using exactly this pattern.

zenoh-bridge-mqtt

The MQTT bridge maps MQTT topics to Zenoh resource keys and vice versa. It supports MQTT 3.1.1 and 5, and Sparkplug B payloads pass through as opaque bytes. Use this to retire an old MQTT broker without rewriting clients, or to expose Sparkplug payloads to the rest of your Zenoh fabric.

zenoh-bridge-dds

A general DDS bridge that does not require ROS 2. Useful when defense-style or RTI Connext systems must federate with the plant fabric. The DDS bridge also helps when a vendor ships a black-box DDS interface — say a vision system or a motion controller — and you need its data on the broader plant fabric without touching vendor code.

Bridge sizing and placement

Each bridge process is small but matters for latency. Place bridges on the same host as the protocol they translate, not at a central hub. A zenoh-bridge-mqtt running next to its MQTT broker adds microseconds; the same bridge two hops away on a shared appliance adds milliseconds and jitter. Pin bridges to specific cores on busy gateways.

Avoiding bridge sprawl

Bridges are cheap to run but expensive to operate when they proliferate. The rule we apply: one bridge per protocol per cell, not per node. If you find yourself running ten MQTT bridges, you have an MQTT problem, not a Zenoh problem — collapse the brokers first.

Designing the keyspace: ISA-95 to Zenoh resources

Answer-first summary: Zenoh resource keys are slash-delimited UTF-8 strings, so they map cleanly onto ISA-95 hierarchies. A disciplined keyspace — enterprise, site, area, line, cell, asset, signal — pays dividends for ACLs, storage selectors, and dashboard queries. Drift here is the single most common cause of failed Zenoh rollouts.

Naming convention

Use lowercase, kebab-case segments. Keep them stable over years — clients embed these strings everywhere. A pattern that has held up well in production:

{enterprise}/{site}/{area}/{line}/{cell}/{asset}/{signal}

Examples:

acme/plant-pune/area-bodyshop/line-01/cell-weld-a/robot-fanuc-01/joint-positions
acme/plant-pune/area-bodyshop/line-01/cell-weld-a/plc-siemens-01/metrics/cycle-time
acme/plant-pune/area-paintshop/line-03/cell-oven-a/oven-temp-controller/setpoint
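One way to keep keys on the convention is to build them through a single helper instead of hand-typing strings. The sketch below is illustrative only: build_key, LEVELS, and the kebab-case regular expression are our own hypothetical names, and it assumes the signal level may carry sub-segments such as metrics/cycle-time.

```python
import re

LEVELS = ("enterprise", "site", "area", "line", "cell", "asset", "signal")
SEGMENT = re.compile(r"[a-z0-9]+(-[a-z0-9]+)*")  # lowercase kebab-case


def build_key(**parts: str) -> str:
    """Join the seven hierarchy levels into one slash-delimited key.

    `signal` may contain its own slashes (for example "metrics/cycle-time");
    every resulting segment must still be lowercase kebab-case.
    """
    missing = [lvl for lvl in LEVELS if lvl not in parts]
    if missing:
        raise ValueError(f"missing levels: {missing}")
    segments = []
    for lvl in LEVELS:
        value = parts[lvl]
        segments.extend(value.split("/") if lvl == "signal" else [value])
    for seg in segments:
        if not SEGMENT.fullmatch(seg):
            raise ValueError(f"segment {seg!r} is not lowercase kebab-case")
    return "/".join(segments)


if __name__ == "__main__":
    print(build_key(
        enterprise="acme", site="plant-pune", area="area-bodyshop",
        line="line-01", cell="cell-weld-a", asset="robot-fanuc-01",
        signal="joint-positions",
    ))
```

A key that skips a level, or that sneaks in an uppercase segment, fails at build time rather than in a consumer months later.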

Selectors and wildcards

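Zenoh key expressions support * for exactly one segment and ** for zero or more segments. Both must stand alone as a full segment; to match part of a segment, use the $* form, which is slower to evaluate and best kept rare. So a storage that mirrors all robot joint positions across a site uses:

acme/plant-pune/**/robot-$*/joint-positions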

This is also the unit of authorization. Granting an analytics team subscribe rights on acme/plant-pune/**/metrics/** is one ACL line.
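Selectors are easy to get subtly wrong, so it helps to check them offline against known keys before an ACL or storage selector ships. The sketch below is a simplified matcher written for illustration, not Zenoh's own implementation. It treats * as exactly one segment, ** as zero or more segments, and $* as a wildcard inside a segment, which is our reading of Zenoh's key-expression rules; the function names are hypothetical.

```python
import re


def _chunk_matches(pattern: str, chunk: str) -> bool:
    """Match one selector segment against one key segment."""
    if pattern == "*":
        return True
    # "$*" matches any run of characters inside a single segment.
    regex = ".*".join(re.escape(part) for part in pattern.split("$*"))
    return re.fullmatch(regex, chunk) is not None


def _match(patterns: list, chunks: list) -> bool:
    if not patterns:
        return not chunks
    head, rest = patterns[0], patterns[1:]
    if head == "**":
        # "**" may absorb zero or more whole segments.
        return any(_match(rest, chunks[i:]) for i in range(len(chunks) + 1))
    if not chunks:
        return False
    return _chunk_matches(head, chunks[0]) and _match(rest, chunks[1:])


def key_matches(selector: str, key: str) -> bool:
    """Return True if a concrete key falls under the selector."""
    return _match(selector.split("/"), key.split("/"))


if __name__ == "__main__":
    key = ("acme/plant-pune/area-bodyshop/line-01/cell-weld-a/"
           "plc-siemens-01/metrics/cycle-time")
    print(key_matches("acme/plant-pune/**/metrics/**", key))
```

Running a table of selectors against a table of representative keys in CI is a cheap way to catch an ACL that grants more, or less, than intended.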

Anti-patterns

Three keyspace mistakes show up repeatedly. One: embedding mutable IDs (serial numbers, IP addresses) in keys — they will change. Two: mixing event payloads and metric payloads under the same parent — separate them as metrics/ and events/ subkeys. Three: using the keyspace as a database, with deeply nested keys per row. Use a storage backend for that and keep the keyspace shallow.

Version the keyspace

Treat the keyspace like a public API, because that is exactly what it is. Maintain a versioned schema document — one YAML file per top-level segment is plenty — and require a pull request to add or rename anything. Breaking changes go through a deprecation window where both old and new keys publish simultaneously for at least one production cycle. Skip this discipline and you will spend the next year chasing silent consumer breakage every time a team renames a signal.
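The deprecation window can be enforced in publisher code rather than left to memory. The sketch below assumes a hypothetical rename registry that maps each retired key to its replacement and a sunset date; the key names, the date, and the keys_to_publish helper are invented for illustration.

```python
from datetime import date

# Hypothetical rename registry: old key -> (new key, last day the old key is published).
RENAMES = {
    "acme/plant-pune/area-bodyshop/line-01/cell-weld-a/plc-siemens-01/cycle-time": (
        "acme/plant-pune/area-bodyshop/line-01/cell-weld-a/plc-siemens-01/metrics/cycle-time",
        date(2026, 9, 30),
    ),
}


def keys_to_publish(new_key: str, today: date, renames: dict = RENAMES) -> list:
    """Return every key a publisher should write to for `new_key` on `today`."""
    keys = [new_key]
    for old_key, (target, sunset) in renames.items():
        if target == new_key and today <= sunset:
            keys.append(old_key)  # still inside the deprecation window
    return keys


if __name__ == "__main__":
    new = ("acme/plant-pune/area-bodyshop/line-01/cell-weld-a/"
           "plc-siemens-01/metrics/cycle-time")
    print(keys_to_publish(new, date(2026, 7, 1)))   # old and new key
    print(keys_to_publish(new, date(2026, 10, 1)))  # new key only
```

Because the registry lives in version control alongside the schema document, the pull request that renames a signal also sets the date on which the old key stops publishing.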

Scaling, failure modes, and capacity planning

Answer-first summary: Scale Zenoh by adding router pods to the cluster, not by upsizing a single router. Capacity planning starts from message rate and average payload size, with storage throughput as the second constraint. Failures are mostly graceful — peers reconnect, routers re-mesh — but bridge processes are single points of translation and need their own HA pair.

Sizing the routers

The honest answer is: benchmark your own payload mix. Public benchmarks from ZettaScale demonstrate that a Zenoh router can sustain very high message rates on modern x86 hardware, but headline numbers usually reflect small payloads with no storage attached. As a starting assumption, plan on tens to low hundreds of thousands of messages per second per router pod with realistic 256-byte to 4-kilobyte industrial payloads, and validate that figure before you commit to hardware.

A capacity planning worksheet we use in design reviews has four inputs: peak message rate per device, average payload size, subscriber fan-out factor, and storage write multiplier. Multiply them together, scale by the number of devices the router serves, add 40 percent headroom, and that is the per-router target. If the result exceeds what one pod sustains in your own benchmark, split by keyspace prefix across two pods rather than upgrading hardware. Horizontal beats vertical for routers.
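The worksheet arithmetic is small enough to keep as a script next to the design document. The sketch below is one interpretation of it: the device count is an input we add so the per-device rate scales to a whole router, and the class, function, and field names are hypothetical. The per-pod capacity must come from your own benchmark.

```python
import math
from dataclasses import dataclass


@dataclass
class CellLoad:
    devices: int                     # devices served by this router (added assumption)
    peak_msgs_per_device: float      # messages per second, per device, at peak
    avg_payload_bytes: float
    fanout: float                    # average subscribers per published sample
    storage_write_multiplier: float  # extra deliveries caused by storages, e.g. 1.5


def router_target(load: CellLoad, headroom: float = 0.40) -> tuple:
    """Return (messages/s, bytes/s) the router should be sized for."""
    msgs = (load.devices * load.peak_msgs_per_device
            * load.fanout * load.storage_write_multiplier)
    msgs *= 1.0 + headroom
    return msgs, msgs * load.avg_payload_bytes


def pods_needed(target_msgs: float, benchmarked_msgs_per_pod: float) -> int:
    """Pods to split across, by keyspace prefix, given your own benchmark figure."""
    return max(1, math.ceil(target_msgs / benchmarked_msgs_per_pod))


if __name__ == "__main__":
    load = CellLoad(devices=40, peak_msgs_per_device=50, avg_payload_bytes=512,
                    fanout=3, storage_write_multiplier=1.5)
    msgs, bandwidth = router_target(load)
    print(f"{msgs:.0f} msg/s, {bandwidth / 1e6:.2f} MB/s")
    print(pods_needed(msgs, benchmarked_msgs_per_pod=10_000), "pod(s)")
```

With these example inputs the target works out to roughly 12,600 messages per second and about 6.5 MB/s; against a hypothetical benchmark of 10,000 messages per second per pod, that means splitting across two pods.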

CPU is rarely the first bottleneck on modest workloads. Network buffers and storage backend throughput usually surface first. Confirm that the router host has tuned TCP buffers — the Linux defaults are conservative — and that the storage backend can keep up with the worst-case sample rate, not the average. Storage backpressure that propagates to the router is the most common production incident pattern we see.

Router clustering and HA

A router cluster is a set of peer-connected routers. Clients and peers can connect to any pod; the cluster reconciles subscriptions. When a pod restarts, subscribers reconnect within seconds. Run at least three pods per cluster so that losing one, planned or not, still leaves a redundant pair.

Storage replication

Storages subscribe to a key selector and persist matching samples. Run two storage replicas per critical selector and place them on separate hosts. Storage backends have their own HA story — InfluxDB Enterprise, TimescaleDB multi-node, Kafka brokers — and Zenoh does not override that.

WAN replication

For multi-site replication, route through a designated WAN-facing router that terminates a QUIC link (which carries TLS 1.3 by design) or a TLS-over-TCP link to the regional data center. Compress where you can; QUIC connection migration helps over flaky cellular. The diagram in Figure 5 shows the canonical pattern.

Failure modes that bite

Bridge process crash — the bridge is a single translator. Run two and front them with a health-checked supervisor.
Multicast disabled — many corporate networks block multicast. Peer mode then needs an explicit peer list, which is operationally painful. Use router mode instead.
Clock skew — Zenoh samples carry timestamps. NTP skew across cells will reorder samples at the storage layer. Run PTP where determinism matters.
Key collisions — if two teams independently invent overlapping keyspaces, samples will silently merge. A registry of top-level keys is non-optional.
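The key-collision case in particular can be caught mechanically if the registry is machine-readable. The sketch below assumes a hypothetical registry that maps each claimed key prefix to an owning team, and flags prefixes where one team's claim sits inside another's; the function name and the example teams are invented.

```python
from itertools import combinations


def overlapping_claims(registry: dict) -> list:
    """Return pairs of key prefixes that overlap but belong to different teams.

    registry maps a key prefix (no wildcards) to the owning team.
    Two prefixes overlap when one is an ancestor of the other.
    """
    conflicts = []
    for (a, team_a), (b, team_b) in combinations(sorted(registry.items()), 2):
        if team_a == team_b:
            continue
        seg_a, seg_b = a.split("/"), b.split("/")
        shorter = min(len(seg_a), len(seg_b))
        # Compare whole segments so "area-body" does not collide with "area-bodyshop".
        if seg_a[:shorter] == seg_b[:shorter]:
            conflicts.append((a, b))
    return conflicts


if __name__ == "__main__":
    registry = {
        "acme/plant-pune/area-bodyshop": "robotics",
        "acme/plant-pune/area-bodyshop/line-01": "controls",
        "acme/plant-pune/area-paintshop": "controls",
    }
    for pair in overlapping_claims(registry):
        print("overlap:", pair)
```

Run as a check on every pull request against the registry, this turns silent sample merging into a failed build.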

Trade-offs and what goes wrong

Answer-first summary: Zenoh is not a free lunch. It is younger than MQTT and OPC UA, the tool ecosystem is thinner, and bridge debugging is harder than single-protocol debugging. The wins — protocol-agnostic federation, peer mode, edge-friendly footprint — are real, but the operational maturity gap is real too.

The Zenoh vs MQTT industrial debate

MQTT has well over a decade of plant-floor mileage, and Sparkplug B has layered SCADA conventions on top of it since 2016. Operators know it. SCADA vendors integrate it natively. Zenoh does not yet have that gravitational pull. The honest position: if your only requirement is plant-floor telemetry into a SCADA system, MQTT is still the safer choice. Zenoh wins when you have mixed traffic (robotics, vision, telemetry, queries) and need a single fabric for it. For a deeper protocol-vs-protocol view, see our Sparkplug B vs OPC UA PubSub comparison.

Tooling gaps

Grafana plugins, dashboard widgets, and managed service offerings exist but are sparser than MQTT’s. Expect to build some operator tooling — at minimum, a keyspace browser and a sample inspector — internally for the first six months.

Skill gaps

Most automation engineers know MQTT. Few know Zenoh. Plan for training. The good news: the mental model is small. The bad news: debugging a federated routing problem at 2 a.m. requires fluency, not familiarity.

Vendor lock-in framing

Zenoh is dual-licensed under the Eclipse Public License 2.0 and the Apache License 2.0, and the spec is open. Implementations exist in Rust (canonical) and C (zenoh-pico), with bindings for C++, Python, and other languages. Lock-in risk is low in the protocol sense. But ZettaScale is the dominant maintainer; if you need a paid support contract, your options are limited compared to MQTT.

Practical recommendations for 2026 adoption

Answer-first summary: Pilot Zenoh in one cell with mixed traffic, not in the corporate IT backbone. Get the keyspace right before scaling. Run bridges in HA pairs from day one. Decide explicitly whether each cell is peer-meshed or router-fronted, and document that decision. Most failures are governance failures, not technology failures.

A working adoption sequence:

Pick one cell with at least two protocols already running (e.g., ROS 2 and MQTT) and pilot Zenoh there for 60 to 90 days.
Design the keyspace first, in a markdown doc, reviewed by OT and IT, before any code ships.
Stand up one router pod per cell, then add a second once the keyspace stabilizes.
Bridge incumbents before retiring them. Run MQTT and Zenoh side by side for a quarter.
Add storages only after the keyspace is stable. Storages backfill keys silently — moving them later is expensive.
Instrument everything — Prometheus exporters exist; emit per-router subscription counts, sample rates, and dropped-sample counters.
Train the on-call rotation before flipping production traffic. At least two engineers on the rotation should be fluent in Zenoh, not just familiar with it.
Negotiate support with ZettaScale or a partner if Zenoh becomes load-bearing.

A short checklist before any production cutover:

Keyspace registry committed to version control.
Router cluster with three pods.
Bridge HA pairs for every active bridge.
TLS on every WAN router link.
ACLs scoped per top-level key.
Storage backend backed up and tested for restore.
PTP or strict NTP across cells.
Runbook for “bridge process is flapping.”

FAQ

Is Zenoh production-ready for industrial IoT in 2026?

Yes for mixed-protocol edge deployments, with caveats. Eclipse Zenoh reached its 1.0 release in 2024 after several years of 0.x releases, and it is in production at multiple robotics, automotive, and defense organizations. The Rust reference implementation is well exercised. The remaining gaps are operational tooling and managed-service availability, not protocol stability. Treat it as you would any infrastructure that is mature but not ubiquitous: pilot it, instrument it heavily, and have a vendor relationship.

How does Zenoh compare to MQTT for industrial workloads?

MQTT is simpler and has more SCADA integrations. Zenoh adds three things MQTT does not natively offer: queryable storage, true peer-to-peer mode without a broker, and a unified keyspace across MQTT, DDS, and ROS 2 via bridges. If your workload is pure telemetry into a single SCADA, stay with MQTT. If you mix robotics, vision, and telemetry on one fabric, Zenoh is the better fit because the alternative is running three brokers and stitching them together.

Can I run Zenoh and ROS 2 together without rewriting nodes?

Yes. The zenoh-bridge-ros2dds runs alongside an existing ROS 2 stack and republishes DDS topics as Zenoh resources transparently. Your ROS 2 nodes do not change. The bridge is the standard mechanism NVIDIA and ZettaScale promote for cross-cell and WAN ROS 2 communication, because DDS discovery does not scale across subnets cleanly. Expect a small per-topic latency overhead from the bridge — usually well under a millisecond on local hardware.

What does a minimum Zenoh deployment look like?

The smallest defensible deployment is one router pod plus one storage backend, with devices connecting in client mode. That fits on a single industrial PC and supports a small cell of 20 to 50 devices. The next tier (two router pods, two storage replicas, one MQTT bridge HA pair) gives meaningful HA and handles a multi-cell line. Greenfield plants typically start at that second tier and add routers per cell as the rollout expands.

How do I secure a Zenoh deployment?

Zenoh supports TLS for transport, TLS client certificates for authentication, and an ACL model keyed on the resource hierarchy. The right pattern is: TLS on every link that crosses a trust boundary, mutual TLS between routers and the data center, ACLs scoped per top-level keyspace segment, and a secret-management system for client certificates. Audit logs from routers should ship to your SIEM. Treat the keyspace registry as security configuration, not documentation.

Does Zenoh replace Kafka in an IIoT stack?

No, and you should not try. Kafka is durable, partitioned, replayable storage with a strong batch-analytics ecosystem. Zenoh is a low-latency edge fabric with optional storage. The right pattern is Zenoh at the edge, a Kafka bridge in the storage layer, and Kafka as the IT-side durable queue. Most plants run both, and the bridge is the integration point. Use Zenoh storages for short-horizon edge data and Kafka for long-horizon, IT-consumed streams.

References

Eclipse Zenoh — official documentation — protocol overview, deployment modes, and bridge inventory.
ZettaScale Technology — Zenoh product page — vendor-maintained background and commercial support details.
zenoh-plugin-ros2dds — GitHub repository — source, configuration, and benchmarks for the ROS 2 bridge.
ROS Discourse — Zenoh-based middleware discussions — community posts on Zenoh adoption in ROS 2 fleets.
NVIDIA Isaac ROS documentation — reference for robotics edge deployments where Zenoh is a recurring middleware choice.
Eclipse Foundation — Zenoh project page — governance, licensing, and release history.

Written by Riju. More IIoT, digital twin, and robotics deep-dives at /about.