Apache PLC4X Tutorial: Stream PLC Tags to MQTT and Kafka

Every factory hides the same tax. A Siemens line speaks S7, a legacy press speaks Modbus, a packaging cell speaks EtherNet/IP, and a new robot cell speaks OPC UA. To get one tag out of each, you pay a vendor for a proprietary OPC server, a per-tag license, and a Windows box that nobody wants to patch. This apache plc4x tutorial shows the alternative: a single open-source library that reads any of those PLCs through one API, then bridges the data to MQTT and Kafka for the rest of your stack.

The payoff is leverage. Instead of five connectors with five license models, you write one ingestion service and change a connection string per device. The protocol differences become configuration, not code rewrites.

What this covers: the OT protocol fragmentation problem, how PLC4X’s driver model works, runnable Java to read Modbus and S7 tags, a Kafka Connect source pipeline, an MQTT and Sparkplug B publish path, edge buffering, and the gotchas that sink real deployments.

Context and Background

Operational technology never agreed on a wire protocol. Modbus, born in 1979, still moves more industrial bytes than anything else because it is dead simple and royalty-free. Siemens controllers speak S7comm, a proprietary protocol reverse-engineered by the community rather than documented openly. Rockwell and Allen-Bradley gear speaks EtherNet/IP with the CIP object model. Beckhoff uses ADS. Newer assets layer OPC UA on top. The result is a shop floor where five machines need five different client stacks to read a single temperature.

OPC UA was supposed to end this, and partly it did. A modern PLC with an OPC UA server gives you a clean, secured, information-modeled endpoint. But OPC UA does not retrofit a 2008 Modbus press or a Step 7 S7-300 that will run untouched for another decade. For that installed base, you either buy a vendor gateway per protocol or you write protocol clients yourself. Both are expensive, and both lock you in.

The economics of the proprietary path are worse than they first appear. A typical commercial OPC server licenses by tag count or by connection, so the bill grows as your data ambitions grow, punishing exactly the teams trying to instrument more of the floor. Worse, each vendor’s server runs on its own host, demands its own patching, and becomes a single point of failure that the IT team did not choose and cannot easily monitor. Multiply that across four or five protocols and you have a sprawl of fragile Windows boxes whose only job is to translate one wire format into another. That sprawl is the hidden cost the open-source approach is built to eliminate.

Apache PLC4X is the third option. It is an Apache top-level project that provides a unified, driver-based API for industrial protocols. You program against one PlcConnection interface, and a driver implementation handles the wire format for Modbus, S7, EtherNet/IP, ADS, BACnet, OPC UA, and more. It ships language bindings for Java, Go, Python, and C, so the same mental model works whether you are writing a JVM microservice or an embedded edge agent.

The key distinction from an OPC UA gateway is scope. A gateway translates one protocol into another at a network appliance you buy and rack. PLC4X is a library you embed in your own application, giving you the read/write/subscribe primitives and leaving payload shaping, transport, and buffering to you. That is more work upfront but far more flexible, and it fits naturally into a streaming architecture where you want PLC data landing in MQTT and Kafka rather than in another vendor’s historian. For where this sits in a broader event backbone, see our Sparkplug B and unified namespace guide.

It helps to understand how PLC4X stays correct across so many protocols. The drivers are not hand-written byte-by-byte for each language. They are generated from a shared protocol description, so the Java, Go, Python, and C drivers behave identically because they descend from the same specification. That is why a tag you read correctly in a Java prototype decodes the same way in a Python edge agent. It also means new protocols arrive as new descriptions rather than as four parallel rewrites, which keeps the project moving and the behavior consistent.

Architecture and Setup

PLC4X sits between your PLCs and your messaging layer as an embeddable protocol-abstraction library. You declare a connection string per device, build a read or subscribe request naming the tags you want, and the matching driver speaks the native wire protocol. The library returns typed values you then publish to MQTT, Kafka, or both. The PLC never knows it is feeding a streaming pipeline.

Figure 1: The PLC4X gateway reads heterogeneous PLCs through one driver API and fans the data out to MQTT and Kafka consumers.

Figure 1 shows the shape of the whole system. On the left, four controllers each speak their native protocol: Modbus, S7, OPC UA, and EtherNet/IP. They all terminate at a single PLC4X gateway process. On the right, the gateway publishes to an MQTT broker feeding SCADA and edge apps, and to a Kafka cluster feeding stream processing, a data lake, and a time-series database. The gateway is the only component that knows protocol-specific detail; everything downstream sees uniform messages.

This separation is the architectural win. Adding a sixth PLC that speaks BACnet means adding a connection string, not rewriting the publish path. Swapping MQTT for Kafka on a given tag group is a routing decision in the gateway, invisible to the PLC. Keeping protocol concerns on one side of a hard boundary and transport concerns on the other is what makes the system maintainable as the floor grows.

The boundary also localizes failure. When a PLC drops off the network, the fault is contained inside the gateway’s connection manager, and the rest of the pipeline keeps serving the tags that are still healthy. Contrast this with point-to-point integrations, where one flaky device can stall an entire scrape. Because PLC4X manages each connection independently, you can retry, back off, and alert per device rather than per pipeline. That granularity is what lets a single gateway supervise dozens of controllers without a single bad actor taking the whole ingestion path down with it.

Think of the gateway as having three internal layers of its own. The connection layer owns sockets, retries, and driver lifecycle. The acquisition layer owns the poll schedule and the read requests. The publish layer owns serialization, transport, and buffering. Keeping these three concerns distinct inside the process mirrors the system-level separation and makes the code far easier to reason about when something misbehaves at three in the morning.

The Driver Model and Connection Strings

PLC4X resolves a driver from the scheme of your connection string. The string format is <protocol>:<transport>://<host>[:port][?params]. The protocol prefix selects the driver; the transport selects how bytes move (usually TCP); the query parameters tune driver behavior. Here are the two we will use in this tutorial:

modbus-tcp://192.168.0.10:502?unit-identifier=1
s7://192.168.0.20?remote-rack=0&remote-slot=1&controller-type=S7_1200

The Modbus string targets a device at port 502 with Modbus unit ID 1. The S7 string targets an S7-1200 by rack and slot, the two coordinates Siemens uses to address a CPU on its backplane. You almost never need to write protocol bytes yourself; you write these strings and let the driver translate. Getting the rack and slot right is the single most common S7 setup mistake, so confirm them in TIA Portal before you debug anything else.

To pull PLC4X into a Maven project, add the core API and the driver pool. The drivers themselves are discovered on the classpath at runtime, so you depend on the specific driver artifacts you need:

<dependencies>
  <dependency>
    <groupId>org.apache.plc4x</groupId>
    <artifactId>plc4j-api</artifactId>
    <version>0.12.0</version>
  </dependency>
  <dependency>
    <groupId>org.apache.plc4x</groupId>
    <artifactId>plc4j-driver-modbus</artifactId>
    <version>0.12.0</version>
    <scope>runtime</scope>
  </dependency>
  <dependency>
    <groupId>org.apache.plc4x</groupId>
    <artifactId>plc4j-driver-s7</artifactId>
    <version>0.12.0</version>
    <scope>runtime</scope>
  </dependency>
</dependencies>

The plc4j-api artifact gives you the PlcConnection, request builders, and response types. Each driver is a separate runtime dependency so your jar carries only the protocols you actually use. That keeps the footprint small on an edge device, which matters when you are running on a DIN-rail computer with limited flash.

The Read API

Reading is the workhorse operation. You open a connection through the PlcDriverManager, build a read request naming each tag with a driver-specific address, execute it, and pull typed values from the response. Here is a complete, runnable read against a Modbus device:

import org.apache.plc4x.java.DefaultPlcDriverManager;
import org.apache.plc4x.java.api.PlcConnection;
import org.apache.plc4x.java.api.messages.PlcReadRequest;
import org.apache.plc4x.java.api.messages.PlcReadResponse;

public class ModbusReader {
    public static void main(String[] args) throws Exception {
        String url = "modbus-tcp://192.168.0.10:502?unit-identifier=1";
        try (PlcConnection conn = new DefaultPlcDriverManager()
                .getConnectionManager().getConnection(url)) {

            if (!conn.getMetadata().isReadSupported()) {
                throw new IllegalStateException("Driver cannot read");
            }

            PlcReadRequest.Builder builder = conn.readRequestBuilder();
            builder.addTagAddress("motor-temp", "holding-register:100:REAL");
            builder.addTagAddress("run-state", "coil:1:BOOL");
            PlcReadRequest request = builder.build();

            PlcReadResponse response = request.execute().get();
            float temp = response.getFloat("motor-temp");
            boolean running = response.getBoolean("run-state");
            System.out.printf("temp=%.1f running=%b%n", temp, running);
        }
    }
}

Read the tag addresses carefully, because that string is where the protocol detail lives. holding-register:100:REAL says: read a Modbus holding register starting at address 100 and decode it as a 32-bit IEEE float, which spans two consecutive registers. coil:1:BOOL reads a single coil as a boolean. The logical names motor-temp and run-state are yours; they become the keys you publish downstream, decoupling your message schema from raw register numbers.

The S7 version is identical except for the connection string and the address syntax. S7 tags reference data blocks and offsets, like %DB1:4:REAL for a float at byte 4 of data block 1, or %I0.0:BOOL for the first input bit. Because the API is the same, a service that already reads Modbus needs only a new connection and new address strings to also read an S7 PLC. That uniformity is the entire reason to use PLC4X over hand-rolled clients.

A few API details matter in production. The request.execute() call returns a CompletableFuture, so you can fire many reads concurrently and compose them without blocking a thread per PLC. The response carries a per-tag response code, not just a value, so you should check response.getResponseCode("motor-temp") and handle a partial failure where one tag in a batch resolves and another does not. Treating every read as all-or-nothing is a common beginner mistake that hides real signal. A robust loop logs the bad tag, publishes the good ones, and keeps going.

Connections are also a resource you must manage. Opening a fresh PlcConnection per read works in a demo but exhausts the PLC’s limited connection slots under load, since many controllers cap concurrent clients at a handful. In a long-running service, open the connection once, keep it alive, and reuse it across poll cycles, reconnecting only on failure. The try-with-resources block above is perfect for a one-shot script and wrong for a daemon. Caching connections per device and guarding them behind a small connection manager is the pattern that survives a real shift.

Subscriptions Versus Polling

PLC4X exposes three verbs: read, write, and subscribe. Not every driver supports all three. Modbus has no native server-push, so its driver emulates subscription by polling under the hood; you can request a subscription, but the wire traffic is still periodic reads. S7 and OPC UA support genuine event-driven subscriptions where the PLC pushes on change. Always check conn.getMetadata().isSubscribeSupported() before assuming you will get push semantics, and design your polling interval as if you are polling, because for Modbus you are.

Building the PLC-to-MQTT-and-Kafka Pipeline

There are two clean ways to get PLC4X data onto a broker. The first is the official Kafka Connect source connector, which is configuration-only and ideal when Kafka is your backbone. The second is an embedded service that reads with the PLC4X API and publishes to MQTT directly, ideal when MQTT and a unified namespace are your backbone. Most real plants run both, with Kafka for analytics and MQTT for the operational namespace.

Figure 2: A PLC4X source connector lands tags in a Kafka topic, then a Sparkplug encoder bridges to MQTT while sink connectors fan out to storage.

Figure 2 traces the Kafka-first path. The PLC4X source connector polls tags and hands records to a Kafka Connect worker, which applies a converter and any single-message transforms before writing to the plc-data topic. From there a Sparkplug encoder bridges selected tags to an MQTT sink for the operational namespace, the Schema Registry governs record structure, and sink connectors fan the same topic out to TimescaleDB and an S3 data lake. One ingestion path, many consumers.

The Kafka Connect Source Connector

PLC4X ships a Kafka Connect source connector under the “PLC4X Connectors” umbrella. You run it inside a standard Kafka Connect worker and configure it entirely in JSON. The connector polls the named tags on a schedule and emits one Kafka record per scrape. Here is a working configuration that reads two tags from an S7 PLC into a topic:

{
  "name": "plc4x-s7-source",
  "config": {
    "connector.class": "org.apache.plc4x.kafka.Plc4xSourceConnector",
    "tasks.max": "1",
    "default-topic": "plc-data",
    "sources": "press01",
    "sources.press01.connectionString": "s7://192.168.0.20?remote-rack=0&remote-slot=1",
    "sources.press01.pollReturnInterval": "1000",
    "sources.press01.bufferSize": "1000",
    "sources.press01.jobReferences": "tempJob",
    "jobs": "tempJob",
    "jobs.tempJob.interval": "500",
    "jobs.tempJob.fields": "motorTemp,runState",
    "jobs.tempJob.fields.motorTemp": "%DB1:4:REAL",
    "jobs.tempJob.fields.runState": "%I0.0:BOOL",
    "key.converter": "org.apache.kafka.connect.storage.StringConverter",
    "value.converter": "org.apache.kafka.connect.json.JsonConverter"
  }
}

The mental model is sources and jobs. A source is a PLC connection. A job is a named scrape with its own interval and field list. Here the tempJob runs every 500 milliseconds, reads motorTemp from %DB1:4:REAL and runState from %I0.0:BOOL, and tags each record onto the press01 source. The connector batches up to bufferSize records and flushes them on pollReturnInterval. You POST this JSON to the Connect REST API and the worker starts polling; no code, no redeploy.

The separation of job interval from poll-return interval is deliberate and worth dwelling on. The job interval governs how often the PLC is actually scraped, which is a load decision against the controller. The poll-return interval governs how often the connector hands accumulated records back to Kafka, which is a throughput decision against the broker. Decoupling them lets you scrape a fast signal every 500 milliseconds while still flushing to Kafka in efficient batches, rather than forcing one slow knob to compromise both. Tune the job interval to the physics of the signal and the poll-return interval to your Kafka write efficiency.

Reusing one Connect worker for many PLCs is the deployment most teams land on. You add a new source block per controller and a new job block per scrape, all inside the same connector configuration, and Kafka Connect distributes the work across tasks. This gives you centralized monitoring, restart, and offset management for the entire fleet through the standard Connect tooling you already operate. When a connector fails, you see it in the same dashboard as the rest of your data integration, which is a real operational advantage over a bespoke gateway nobody else knows how to debug.

Topic mapping is worth thinking about early. The simple approach above dumps everything into one plc-data topic, which is fine for a handful of tags. At scale, route by area or asset so consumers can subscribe narrowly and you can set retention per topic. A single-message transform can rewrite the topic from a record field, so you can fan one source into area1.press01 and area1.press02 without separate connectors.

Schema governance is the part teams underestimate. Once dozens of consumers depend on the shape of a record, a careless field rename breaks dashboards and stream jobs silently downstream. Putting the value converter behind a Schema Registry with backward-compatibility enforcement turns those breaking changes into rejected deploys rather than 2 a.m. pages. It costs a little ceremony upfront and saves you the class of outage where nobody can explain why a chart went blank. For PLC data feeding analytics that other teams build on, that contract is worth the extra moving part.

Shaping the Payload and the MQTT Path

JSON out of the box is convenient but verbose and schema-loose. For an operational namespace, Sparkplug B is the better target. Sparkplug is an MQTT payload and topic specification, defined by the Eclipse Sparkplug specification, that adds birth and death certificates, sequence numbers, and a compact Protobuf payload. It turns a bag of raw tags into a self-describing, stateful namespace that SCADA tools understand natively.

The bridge pattern is straightforward. A small consumer reads the plc-data Kafka topic, maps each tag into a Sparkplug metric, and publishes under the spBv1.0/<group>/DDATA/<node>/<device> topic structure to your MQTT broker. The same metrics can flow to an EMQX or HiveMQ cluster for the namespace while the raw Kafka topic continues to feed analytics. For running that broker tier in production, see our EMQX MQTT cluster on Kubernetes tutorial, and for modeling the namespace itself, the OPC UA PubSub over MQTT 5 tutorial covers the information-model side.

Figure 3: The request and response sequence from application code through the PLC4X driver to the PLC and onward to the MQTT broker.

Figure 3 shows one poll cycle end to end. The application builds a read request and the PLC4X driver serializes it into a Modbus or S7 frame on the wire. The PLC answers with raw tag values, the driver decodes them into a PlcReadResponse, and the app shapes the payload into a Sparkplug or JSON message. It then publishes to the MQTT broker, waits for the QoS 1 PUBACK, and loops into the next poll. The acknowledgement is what gives you at-least-once delivery rather than fire-and-forget.

Edge buffering belongs in this path, not as an afterthought. When the broker or the WAN link is down, the gateway must keep reading the PLC and persist messages locally, then replay them on reconnect. A store-and-forward buffer on local disk, drained in order once the link returns, is the difference between a five-minute network blip and a five-minute hole in your historian. Size the buffer for your worst realistic outage, not the average one.

Sparkplug B earns its complexity precisely here. Its birth and death certificates let the broker and every subscriber know whether a node is alive, so a consumer can distinguish “the value has not changed” from “the gateway is dead.” Its sequence numbers let a subscriber detect a missed message and request a rebirth, which matters when you replay a buffer out of a long outage. Plain JSON over MQTT gives you none of that state awareness; you publish a value and hope someone is listening. For an operational namespace that drives screens operators trust, the stateful semantics are not optional polish, they are the contract.

Trade-offs, Gotchas, and What Goes Wrong

Polling versus subscription is the first decision that bites. Modbus only polls, so a 100-millisecond interval across hundreds of tags can saturate a slow serial-to-TCP converter or trip a PLC’s connection limit. Subscription-capable protocols like S7 and OPC UA push on change and scale far better for sparse, bursty signals. Match your strategy to the protocol, and never assume a tight poll loop is free.

Byte order and data types cause more silent corruption than any other class of bug. A Modbus 32-bit float spans two registers, and vendors disagree on whether the high word comes first. If your float reads as a wild number, you almost certainly have a word-swap or endianness mismatch, not a wiring fault. Always verify a known value against the PLC’s own display before trusting a decoded tag, and pin the data type explicitly in the address string.

S7 optimized data block access is a classic trap. Modern TIA Portal projects mark data blocks “optimized” by default, which hides the byte offsets PLC4X needs to address a tag. You must uncheck optimized access on any DB you intend to read by absolute offset, or the %DBx:y address simply will not resolve. This single setting derails more S7 integrations than the network ever does.

Figure 4: Edge deployment with local disk buffering and store-and-forward across the OT-to-IT boundary.

Figure 4 shows where the gateway should physically live. PLC4X runs on an edge gateway on the OT VLAN, close to the controllers, with a local disk buffer and a store-and-forward path. It crosses to IT through a firewalled DMZ to the central MQTT broker and Kafka cluster. Keeping the gateway on the OT side and pushing outbound through a controlled boundary respects network segmentation, the bedrock of OT security. PLCs should never be directly reachable from the IT network; the gateway is the only thing that talks to them.

Throughput and back-pressure round out the list. If your publish path stalls and your read loop does not, you build an unbounded in-memory queue that eventually kills the process. Bound every buffer, apply back-pressure to the poll loop when the broker is slow, and prefer dropping the

Apache PLC4X Tutorial: Stream PLC Tags to MQTT and Kafka

Apache PLC4X Tutorial: Stream PLC Tags to MQTT and Kafka

Context and Background

Architecture and Setup

The Driver Model and Connection Strings

The Read API

Subscriptions Versus Polling

Building the PLC-to-MQTT-and-Kafka Pipeline

The Kafka Connect Source Connector

Shaping the Payload and the MQTT Path

Trade-offs, Gotchas, and What Goes Wrong

Related

Comments

Leave a Reply Cancel reply

Tag Cloud

Categories