Production MQTT Cluster with EMQX on Kubernetes: Complete Tutorial 2026
MQTT is the dominant publish-subscribe protocol in industrial IoT, but running a production cluster requires handling node discovery, persistence, TLS termination, and failover. EMQX 5, built on Erlang/OTP and the new Mria distributed database, has become the de facto standard for enterprise deployments, trusted by companies operating 10M+ device clusters globally. This tutorial walks you through building a battle-tested EMQX cluster on Kubernetes using the official Helm chart, complete with StatefulSet persistence, cert-manager integration for mutual TLS, Prometheus scraping, and load-testing scripts to validate throughput under realistic workloads. You will learn to deploy a production-grade cluster in under 30 minutes, then validate it with chaos tests. By the end, you’ll understand EMQX’s core-replicant architecture, why it scales horizontally, how to right-size storage and CPU, and how to troubleshoot clustering issues when things go wrong.
Why EMQX on Kubernetes in 2026
A production MQTT cluster must deliver sub-second latency, handle thousands of concurrent connections, and survive node failures without losing messages. Kubernetes gives you declarative infrastructure, automatic scheduling, and native health checks—but MQTT brokers are stateful, which means you cannot treat them like stateless microservices. EMQX solves this with Erlang’s distribution protocol, which allows nodes to form a mesh automatically, and Mria’s write-once-read-everywhere architecture for replicated state. EMQX 5 (first released in 2022 and steadily refined since) introduced a cleaner separation between core nodes (which hold the full state) and replicant nodes (which are read-only replicas), making horizontal scaling both possible and predictable. Unlike Kafka or Redis, EMQX is purpose-built for MQTT: it speaks the OASIS MQTT 5.0 standard, understands QoS semantics, and provides topic-level access control. Running it on Kubernetes means you inherit cluster autoscaling, observability hooks (Prometheus), and declarative GitOps workflows—critical for industrial settings where every minute of downtime costs money.
EMQX 5 Architecture: Core, Replicant, and the Mria Consensus Layer
EMQX 5 fundamentally differs from earlier versions. Instead of a symmetric cluster where all nodes are peers, EMQX 5 introduces a structured topology: one or more core nodes maintain the full database of subscriptions, retained messages, and authentication rules; replicant nodes cache a read-only copy and forward writes to the core. This design lets replicants scale freely without burdening the consensus protocol. The separation improves throughput by 3-5x compared to EMQX 4’s symmetric approach, because replicants do not participate in consensus voting, which was the bottleneck in earlier versions.

At the heart is Mria, a distributed database layered on Erlang’s Mnesia, with RocksDB (the embedded key-value store open-sourced by Facebook) as its disk backend and Raft consensus for core nodes. When you publish a message to a replicant, it sends the write to a core node, which coordinates with other cores via Raft, then broadcasts the result back. Subscription changes (e.g., a client connects and subscribes to sensor/+/temperature) follow the same path: one core becomes the authoritative source, other cores replicate asynchronously, and replicants cache the latest view. QoS levels (0=fire-and-forget, 1=at-least-once, 2=exactly-once) are enforced at the core level, ensuring correctness even if a replicant crashes mid-flight. Unlike eventual consistency systems (e.g., DynamoDB), Mria offers strong consistency within a core cluster and monotonic read consistency for replicants, which is critical for financial or safety-critical IoT applications.
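The sensor/+/temperature filter above uses MQTT wildcard levels: + matches exactly one topic level, # matches any remaining levels. A minimal sketch of the matching rules (ignoring the special case of $-prefixed system topics):

```python
def topic_matches(filter_str: str, topic: str) -> bool:
    """Check whether an MQTT topic filter matches a concrete topic.

    '+' matches exactly one level; '#' matches the rest of the topic
    (and must be the last level of the filter per the MQTT spec).
    """
    f_levels = filter_str.split("/")
    t_levels = topic.split("/")
    for i, f in enumerate(f_levels):
        if f == "#":
            return True                      # matches everything from here down
        if i >= len(t_levels):
            return False                     # topic is shorter than the filter
        if f != "+" and f != t_levels[i]:
            return False                     # literal level mismatch
    return len(f_levels) == len(t_levels)    # no unmatched trailing topic levels

print(topic_matches("sensor/+/temperature", "sensor/building-42/temperature"))  # True
print(topic_matches("sensor/+/temperature", "sensor/a/b/temperature"))          # False
print(topic_matches("sensor/#", "sensor/a/b/temperature"))                      # True
```

Note that per the spec, sensor/# also matches the parent topic sensor itself, which this sketch handles by returning True as soon as # is reached.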
The Erlang/OTP runtime underneath provides distribution via the EPMD (Erlang Port Mapper Daemon) service: each EMQX node registers its network address on startup, and peers discover each other by hostname. In Kubernetes, this is why you must use a Headless Service—to ensure each pod gets a stable network identity (e.g., emqx-0.emqx-headless.default.svc.cluster.local). The Helm chart handles this automatically, but understanding the mechanism helps when debugging. EPMD uses TCP port 4369, and EMQX pins its Erlang distribution to port 4370 (plus a per-node offset) with cluster RPC on 5370, so your Kubernetes NetworkPolicy must allow these ports between pods or clustering will silently fail.
Core Node Behavior and Consensus
A core node participates in Raft voting. If you have 3 cores, Raft requires 2 to agree on any state change (quorum = floor(n/2) + 1). This means a 3-core cluster can tolerate one core failure; a 5-core cluster tolerates two. Do not use an even number of cores—a 4-core cluster has a quorum of 3, so it tolerates only one failure (no better than 3 cores) while adding replication overhead, and losing two nodes leaves no majority, turning the cluster read-only. In Kubernetes, you specify the number of cores in the Helm values: typically 3 for small production clusters, 5 for large ones handling millions of connections. Upgrading from 3 to 5 cores requires a rolling join operation and takes 5-10 minutes; plan this as a scheduled maintenance window.
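The majority arithmetic is worth internalizing, because it explains why odd core counts are the rule. A quick sketch:

```python
def quorum(cores: int) -> int:
    # Raft majority: floor(n/2) + 1 nodes must agree on every state change
    return cores // 2 + 1

def tolerated_failures(cores: int) -> int:
    # The cluster stays writable as long as a quorum of cores survives
    return cores - quorum(cores)

for n in (3, 4, 5):
    print(f"{n} cores: quorum={quorum(n)}, tolerates {tolerated_failures(n)} failure(s)")
# 3 and 4 cores both tolerate exactly one failure, which is why the
# fourth node buys nothing; 5 cores are needed to tolerate two.
```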
Each core writes changes to its local RocksDB, and the Raft leader replicates to followers. Replicants pull the latest snapshot asynchronously, so they may lag by a few milliseconds. For most use cases (home automation, sensor networks, even manufacturing IoT), this is acceptable. If you need strict consistency (e.g., financial trading), MQTT’s QoS 2 still guarantees message ordering within a connection, so replicant stale reads do not break the protocol. The Mria write latency (Raft commit time) averages 5-15ms in a healthy 3-node cluster; above 100ms usually signals network congestion or disk I/O bottlenecks.
Replicant Nodes and Load Distribution
Replicant nodes are read-only copies. All MQTT subscriptions on a replicant are registered on a core (via a special internal message), and all publishes are immediately forwarded to cores. Replicants cache the subscription trie in memory to optimize routing, so they consume RAM proportional to the number of active topics and subscriptions. This is why you can scale replicants horizontally: add a new replicant Pod, it connects to a core, downloads the snapshot, and starts serving clients. No re-balancing needed. The snapshot download is the bottleneck—with a 2GB snapshot over 10Gbps network, expect 2-5 seconds; over a slow WAN, potentially minutes.
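The snapshot-download figures above follow from simple bandwidth arithmetic. A rough estimator (the 70% link-efficiency factor is an assumption covering TCP and protocol overhead, not an EMQX-documented constant):

```python
def snapshot_transfer_seconds(snapshot_gb: float, link_gbps: float,
                              efficiency: float = 0.7) -> float:
    # gigabytes -> gigabits, divided by usable link capacity
    return (snapshot_gb * 8) / (link_gbps * efficiency)

print(round(snapshot_transfer_seconds(2, 10), 1))   # a 2GB snapshot on 10 Gbps: ~2.3 s
print(round(snapshot_transfer_seconds(2, 0.1), 1))  # same snapshot over 100 Mbps: ~228.6 s
```

This is why replicants on a fast datacenter network join in seconds, while replicants behind a slow WAN link can take minutes to become ready.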
In a Kubernetes cluster with 3 core nodes and 5 replicant nodes, you scale replicant count with kubectl scale statefulset emqx-replicant --replicas=10. Cores remain constant (or grow very slowly, perhaps 3 → 5 when you exceed 2 million concurrent connections). This two-tier scaling is why EMQX beats Kafka for IoT: Kafka forces you to rebalance partitions across all brokers when you add capacity, which is painful. EMQX replicants are truly stateless.
Replicants have a subtle behavior: they forward all writes to cores, but they handle reads locally. If a replicant’s cache is out of sync with the core (e.g., during a snapshot download), clients connected to that replicant may see stale subscription state. For example, a client subscribes to sensor/+/data on replicant A, but the subscription hasn’t been replicated to replicant B’s cache yet. If that client disconnects and reconnects to replicant B, it will not see the subscription and must re-subscribe. This is acceptable in most IoT scenarios because subscriptions are issued once at startup, not continuously. If you need strict consistency (every replicant sees every subscription instantly), use only core nodes (set replicas=0 for replicants), but this limits scalability.
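Because a reconnecting client may land on a replicant whose cache has not yet caught up, robust clients re-issue their subscriptions on every connect rather than assuming the broker remembers them. A minimal sketch of that client-side pattern (the transport is a stub; a real client would wrap a library such as paho-mqtt and call this from its on-connect callback):

```python
class ResilientSubscriber:
    """Remember desired subscriptions locally and replay them after
    every (re)connect, so stale replicant state never strands the client."""

    def __init__(self):
        self.subscriptions = {}  # topic filter -> QoS

    def subscribe(self, topic_filter: str, qos: int = 1):
        self.subscriptions[topic_filter] = qos  # record desired state

    def on_connect(self, send_subscribe):
        # Invoked after every CONNACK, including reconnects to a new node:
        # re-send every SUBSCRIBE so the serving node registers it on a core.
        for topic_filter, qos in self.subscriptions.items():
            send_subscribe(topic_filter, qos)

sub = ResilientSubscriber()
sub.subscribe("sensor/+/data")
sent = []
sub.on_connect(lambda t, q: sent.append((t, q)))
print(sent)  # [('sensor/+/data', 1)]
```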
Deploying EMQX on Kubernetes with Helm
The official EMQX Helm chart (emqx/emqx from https://charts.emqx.io) abstracts the clustering complexity. A production deployment requires three decisions: StatefulSet replicas (number of core nodes), persistent storage (RocksDB snapshots), and network policy (TLS, RBAC).

Helm Chart Basics and Values
Start with the official Helm repository:
helm repo add emqx https://charts.emqx.io
helm repo update
Create a values.yaml to override defaults. The Helm chart (maintained by the EMQX team alongside the main EMQX repository) is battle-tested in production and handles all the Kubernetes-specific complexity (StatefulSet configuration, readiness probes, service discovery):
replicaCount: 3  # Core nodes; do NOT use an even number
image:
  repository: emqx/emqx
  tag: "5.4.0"  # Pin to a specific release, not 'latest'
persistence:
  enabled: true
  size: 50Gi
  storageClassName: "standard-rwo"  # Or gp3 (AWS), premium-ssd (Azure), pd-ssd (GCP)
emqx:
  cluster:
    distribution:
      proto: tcp
      port: 4369  # EPMD port; do not change
  listeners:
    # MQTT TCP listener (unencrypted, use only within VPC)
    mqtt:
      enabled: true
      port: 1883
    # MQTT over TLS (use for external clients)
    mqtts:
      enabled: true
      port: 8883
    # MQTT over WebSocket (for web dashboards)
    ws:
      enabled: true
      port: 8083
    wss:
      enabled: true
      port: 8084
# Persistent volume for RocksDB
volumeClaimTemplate:
  resources:
    requests:
      storage: 50Gi
  storageClassName: "standard-rwo"
# Resource requests and limits (adjust per workload)
resources:
  limits:
    cpu: 2
    memory: 2Gi
  requests:
    cpu: 1000m  # 1 full CPU
    memory: 1Gi
# Readiness and liveness probes (Helm defaults usually OK)
livenessProbe:
  initialDelaySeconds: 30
  periodSeconds: 10
Deploy:
helm install emqx emqx/emqx -f values.yaml -n emqx-system --create-namespace
This creates a StatefulSet with 3 pods named emqx-0, emqx-1, emqx-2, each with a 50Gi PVC. EMQX automatically forms a cluster via the Headless Service; no manual join commands needed. Verify with:
kubectl get pods -n emqx-system
# Expected: emqx-0, emqx-1, emqx-2 all Running
kubectl logs -f emqx-0 -n emqx-system
# Look for "Cluster with node emqx@emqx-0 started successfully"
Wait 2-3 minutes for all pods to reach Running and Ready (check kubectl get pods). Once ready, test connectivity:
# Port-forward to access EMQX locally
kubectl port-forward -n emqx-system svc/emqx 1883:1883 &
# In another terminal, subscribe to a test topic
mosquitto_sub -h localhost -p 1883 -t "test/topic"
# In a third terminal, publish a message
mosquitto_pub -h localhost -p 1883 -t "test/topic" -m "Hello, EMQX!"
# You should see "Hello, EMQX!" in the subscriber terminal
Congratulations! Your EMQX cluster is live. To verify clustering:
# Connect to a pod and check cluster status
kubectl exec -it emqx-0 -n emqx-system -- bash
$ emqx ctl status
Node 'emqx@emqx-0' is started
$ emqx ctl cluster status
Cluster status: [{running_nodes,['emqx@emqx-0','emqx@emqx-1','emqx@emqx-2']}]
If all 3 nodes appear in running_nodes, clustering is successful. If you see only 1 or 2 nodes, check the logs of the missing nodes for DNS or EPMD errors.
StatefulSet and Headless Service: Why They Matter
The Helm chart creates a Headless Service (clusterIP = None) named emqx-headless. Each pod registers a DNS name: emqx-0.emqx-headless.emqx-system.svc.cluster.local. This is crucial. When EMQX starts, it runs inet:gethostbyname(hostname) to resolve its own IP, and the Erlang distribution layer uses these stable addresses to form the cluster. If you used a regular Service with a cluster IP, all pods would resolve to the same IP, and Erlang’s node discovery would fail. The DNS name must be stable and resolvable from every pod; if your Kubernetes DNS (CoreDNS by default) is misconfigured, clustering will fail silently.
The StatefulSet guarantees ordered, stable identity—critical for Raft. Pod 0 always becomes the Raft leader initially, making it easier to reason about state. When Pod 1 comes up, it connects to Pod 0, downloads the Mria snapshot, and joins as a follower. If you delete Pod 1, Kubernetes restarts it with the same name and identity, and it re-joins without data loss (its RocksDB volume is re-attached). This is why StatefulSets are essential for MQTT brokers: a Deployment (which assigns random names to pods) would break clustering on every restart.
Key StatefulSet mechanics:
– Ordinal ordering: Pods are created/deleted in order (0, 1, 2). Pod 0 must be healthy before Pod 1 starts, so the cluster leader is always available first.
– Stable storage: Each pod’s PVC persists even if the pod is deleted. When Pod 1 is restarted, it mounts the same 50Gi PVC and RocksDB data remains.
– DNS stability: Pod names are deterministic; emqx-0 always resolves to the IP of the first pod in the ordered set.
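The stable identities follow a fixed naming scheme: <statefulset>-<ordinal>.<headless-service>.<namespace>.svc.cluster.local. A small helper makes the pattern concrete:

```python
def pod_dns_names(statefulset: str, service: str, namespace: str, replicas: int):
    """Stable DNS names Kubernetes assigns to StatefulSet pods behind a
    headless service: <pod>.<svc>.<namespace>.svc.cluster.local"""
    return [
        f"{statefulset}-{i}.{service}.{namespace}.svc.cluster.local"
        for i in range(replicas)
    ]

for name in pod_dns_names("emqx", "emqx-headless", "emqx-system", 3):
    print(name)
# emqx-0.emqx-headless.emqx-system.svc.cluster.local, and so on
```

These are exactly the hostnames the Erlang distribution layer uses to form the mesh, which is why a misbehaving CoreDNS breaks clustering before anything else.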
Persistence: RocksDB Snapshots and WAL
EMQX persists state to RocksDB, which sits on the PVC. RocksDB uses a write-ahead log (WAL) and periodic snapshots. If a pod crashes, Kubernetes restarts it on the same or a different node, re-attaches the PVC, and EMQX replays the WAL. This is safer than pure in-memory storage (which loses all state on crash), but slower than a dedicated database like PostgreSQL (RocksDB is embedded, so no network round-trips, but disk I/O is the bottleneck).
For typical IoT workloads (1K-10K subscriptions per node), a 50Gi PVC is overkill. Sizing: each subscription in Mria takes ~200 bytes, so 10K subscriptions = 2MB. But RocksDB includes compaction overhead, write-ahead logs, and snapshot copies. The Mria database also stores authentication rules, plugin state, and retained messages, which can grow unexpectedly. Use:
- Small cluster (< 100K subscriptions): 20Gi per core
- Medium (100K – 1M): 50Gi per core
- Large (> 1M): 100Gi+ per core, plus enable RocksDB compression in EMQX config
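The sizing tiers above can be sanity-checked with back-of-the-envelope arithmetic. In this sketch the 5x overhead multiplier for WAL, compaction space, and snapshot copies is an assumed working figure, not an EMQX-documented constant:

```python
def subscription_storage_bytes(subscriptions: int, bytes_per_sub: int = 200,
                               overhead_factor: float = 5.0) -> int:
    """Raw Mria subscription footprint (~200 bytes each) times an assumed
    multiplier for RocksDB WAL, compaction, and snapshot copies."""
    return int(subscriptions * bytes_per_sub * overhead_factor)

for subs in (10_000, 100_000, 1_000_000):
    gib = subscription_storage_bytes(subs) / 2**30
    print(f"{subs:>9,} subscriptions -> ~{gib:.3f} GiB with overhead")
# Even 1M subscriptions stay under 1 GiB: the PVC sizes recommended
# above are dominated by retained messages, auth rules, and headroom,
# not by the subscription table itself.
```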
Enable compression by adding to emqx.emqx.db:
emqx:
  emqx:
    db:
      backend: rocksdb
      dir: /var/lib/emqx/data/mnesia
      compression: true
Compression reduces disk usage by 50-70% but adds CPU overhead (5-15% CPU increase). For I/O-bound clusters (e.g., high publish rate), enable compression. For CPU-bound clusters (e.g., high subscription churn), disable it. Also configure RocksDB level-based compaction to smooth writes:
emqx:
  emqx:
    db:
      compaction: level
      # Disable auto-compaction to batch it during off-peak
      auto_compaction: false
Monitor PVC usage (kubectl top covers only nodes and pods, so use the kubelet’s kubelet_volume_stats_used_bytes metric in Prometheus) and set alerts for > 70% full. When approaching capacity, scale cores (which distributes subscriptions across more RocksDB instances) rather than expanding PVCs (which are harder to shrink later). Keep snapshots at 30-50% of PVC size for operational headroom.
MQTT TLS and Cert-Manager Integration
Production deployments must encrypt all traffic. EMQX supports TLS on ports 8883 (MQTT over TLS) and 8084 (WebSocket over TLS). You have two options: self-signed certificates (development only) or certificates from a CA (production). TLS also enables mutual TLS (mTLS) authentication, where clients prove their identity with a certificate instead of a username/password—critical for device-to-device communication where leaked credentials are catastrophic.
The easiest path is cert-manager on Kubernetes, which automatically provisions and renews certificates. Install cert-manager first:
helm repo add jetstack https://charts.jetstack.io
helm repo update
helm install cert-manager jetstack/cert-manager \
--namespace cert-manager --create-namespace \
--set installCRDs=true
Create a self-signed ClusterIssuer for development (or use Let’s Encrypt for production):
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: selfsigned-issuer
spec:
  selfSigned: {}
---
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: emqx-cert
  namespace: emqx-system
spec:
  secretName: emqx-tls
  issuerRef:
    name: selfsigned-issuer
    kind: ClusterIssuer
  commonName: emqx.emqx-system.svc.cluster.local
  dnsNames:
    - emqx.emqx-system.svc.cluster.local
    - emqx-0.emqx-headless.emqx-system.svc.cluster.local
    - emqx-1.emqx-headless.emqx-system.svc.cluster.local
    - emqx-2.emqx-headless.emqx-system.svc.cluster.local
    - "*.emqx-headless.emqx-system.svc.cluster.local"  # Wildcard for future pods
  duration: 2160h    # 90 days
  renewBefore: 720h  # 30 days before expiry
This creates a TLS certificate valid for the EMQX pods’ hostnames. Cert-manager stores the cert and key in the secret emqx-tls in the emqx-system namespace. The renewBefore setting triggers renewal 30 days before expiry, so a valid certificate is always present in the secret before the old one lapses.
Update the Helm values to use the cert:
emqx:
  listeners:
    mqtts:
      enabled: true
      port: 8883
      ssl:
        enabled: true
        certfile: "/etc/emqx/certs/tls.crt"
        keyfile: "/etc/emqx/certs/tls.key"
        # Optional: require client certs for mTLS
        # cacertfile: "/etc/emqx/certs/ca.crt"
        # verify: verify_peer
        # fail_if_no_peer_cert: true
securityContext:
  runAsNonRoot: false
# Mount the cert secret
extraVolumes:
  - name: tls-secret
    secret:
      secretName: emqx-tls
extraVolumeMounts:
  - name: tls-secret
    mountPath: /etc/emqx/certs/
    readOnly: true
Now clients can connect with mosquitto_sub -h emqx.example.com -p 8883 --cafile ca.crt. For production, obtain a certificate from Let’s Encrypt via cert-manager’s DNS-01 challenge (preferred because it doesn’t require an HTTP ingress, which an MQTT-only broker typically lacks). For mTLS, generate client certificates and distribute them securely to edge devices—a common pattern in industrial settings using PKI infrastructure or Kubernetes secrets.
MQTT 5.0 Features: Topic Aliases, Response Topics, and Shared Subscriptions
MQTT 5.0 (standardized by OASIS in 2019) introduced features that EMQX natively supports. Three are crucial for scalability and efficient resource usage:
Topic Aliases reduce message size and bandwidth. Instead of sending the full topic string in every publish, a client and broker agree on a 2-byte integer alias (1-65535; the value 0 is reserved by the spec). For a sensor sending sensor/building-42/floor-3/zone-a/temperature every second, this saves ~50 bytes per message (the topic string itself). At 10K sensors publishing at 1Hz each, that’s 500KB/s less bandwidth. EMQX stores aliases in a hash table, so lookup is O(1). Enable in EMQX via the MQTT listener config:
emqx:
  listeners:
    mqtt:
      mqtt:
        max_topic_alias: 65535
Set max_topic_alias to 65535 for IoT deployments; values below 100 limit efficiency. Clients (e.g., mosquitto) negotiate the alias count in CONNECT, so ensure both broker and client support it.
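The bandwidth figure quoted above is easy to reproduce: once an alias is established, each publish saves roughly the length of the UTF-8 topic string. A quick estimate (this simplification ignores the small fixed cost of the alias property itself):

```python
def alias_savings_bytes_per_sec(sensors: int, publish_hz: float, topic: str) -> float:
    # A topic alias replaces the UTF-8 topic string in each PUBLISH with a
    # 2-byte integer property, saving roughly len(topic) bytes per message.
    saved_per_msg = len(topic.encode("utf-8"))
    return sensors * publish_hz * saved_per_msg

topic = "sensor/building-42/floor-3/zone-a/temperature"
print(len(topic))                                        # 45-byte topic string
print(alias_savings_bytes_per_sec(10_000, 1.0, topic))   # ~450 KB/s saved
```

At 10K sensors publishing once per second, that is roughly the ~500KB/s figure cited above, and the savings grow linearly with fleet size and publish rate.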
Response Topics and Correlation Data let publishers request replies without out-of-band coordination. A sensor publishes to sensor/reading with response_topic=control/sensor-42/response and a correlation ID in the MQTT v5 properties. The broker delivers this metadata to subscribers, allowing them to reply asynchronously to control/sensor-42/response. This is critical for command-and-control patterns in industrial IoT: no need for fixed reply queues, topic naming conventions, or coordinator state. The response topic is simply a client-generated string, so millions of sensors can each have unique response addresses without server state.
Shared Subscriptions allow multiple subscribers to compete for messages on a single queue. Instead of all subscribers receiving a copy of each message (fan-out), the broker load-balances among them (round-robin, random, sticky, or hash-based, depending on the configured strategy). Subscribe to $share/group-1/sensor/+/data, and EMQX distributes each incoming message to one member of group-1. If a client in the group disconnects, EMQX re-balances future messages to the remaining members. This is essential for horizontal scaling: add a new worker, and it automatically joins the competition without coordinator overhead (compare to Kafka consumer groups, which require explicit coordination via Zookeeper/KRaft).
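The dispatch behavior can be illustrated with a toy model. This is a simulation of the round-robin strategy only, not EMQX's internal implementation:

```python
class SharedGroup:
    """Toy model of $share/<group>/... dispatch: each message goes to
    exactly one live member, round-robin; a disconnect rebalances
    future messages among the remaining members."""

    def __init__(self, members):
        self.members = list(members)
        self._next = 0  # rotating dispatch counter

    def dispatch(self, message):
        member = self.members[self._next % len(self.members)]
        self._next += 1
        return member

    def disconnect(self, member):
        # Future dispatches simply skip the departed member
        self.members.remove(member)

group = SharedGroup(["worker-a", "worker-b", "worker-c"])
print([group.dispatch(m) for m in range(4)])  # a, b, c, then wraps to a
group.disconnect("worker-b")
print([group.dispatch(m) for m in range(2)])  # only the remaining members
```

Contrast with plain (non-shared) subscriptions, where every matching subscriber would receive every message.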
Replicant nodes efficiently handle MQTT 5 features because the subscription routing (which includes shared group assignment) is centralized on cores, and replicants just cache the trie. This is why replicants scale: they do not coordinate among themselves. When you add a replicant, it downloads the snapshot of shared subscription group state and immediately begins routing. Latency for add/remove is O(snapshot download), not O(nodes), so you can scale replicants 10x faster than Kafka consumers.
Monitoring: Prometheus Scraping and Grafana Dashboards
Production clusters need observability. EMQX exposes Prometheus metrics on /api/v5/prometheus/stats (authenticated endpoint) and /metrics (plain HTTP, default off). Enable the metrics listener in Helm values:
emqx:
  listeners:
    metrics:
      enabled: true
      port: 18083
Create a ServiceMonitor for Prometheus Operator (if running):
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: emqx-monitor
  namespace: emqx-system
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: emqx
  endpoints:
    - port: metrics
      interval: 30s
      path: /metrics
Key metrics to watch:
– emqx_connections_count: Total active MQTT connections (should grow/shrink smoothly); a delta < -5000 in 30s suggests a replicant crash
– emqx_messages_publish_dropped_total: Lost publishes (high count = topic alias mismatch or QoS 0 overflow); should be near-zero in healthy clusters
– emqx_messages_qos2_received_total / emqx_messages_qos2_sent_total: Exactly-once delivery rate; track rate of increase to verify message flow
– emqx_mria_lag: Replicant sync lag behind core (should be < 100ms); above 5s indicates snapshot download bottleneck or slow RocksDB disk
– emqx_cluster_nodes_running: Number of healthy cluster nodes (should match expected count); falling below quorum (e.g., < 2 of 3 cores) switches the cluster to read-only
– emqx_recv_pkt_connect, emqx_send_pkt_connack: Client connection rate; anomalies here signal auth failures or network instability
Build a Grafana dashboard with these time-series panels, and alert if emqx_cluster_nodes_running < floor(core_count/2) + 1 (quorum threshold). Set a second alert for emqx_mria_lag > 1s to catch replication issues early.
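The two alert conditions translate directly into code. A sketch of the evaluation logic (function and argument names are illustrative; the thresholds mirror the ones given above):

```python
import math

def cluster_alerts(running_nodes: int, core_count: int, mria_lag_seconds: float):
    """Evaluate the quorum and replication-lag alert conditions against
    sampled metric values."""
    alerts = []
    quorum = math.floor(core_count / 2) + 1
    if running_nodes < quorum:
        alerts.append("quorum-lost: cluster may be read-only")
    if mria_lag_seconds > 1.0:
        alerts.append("replication-lag: replicants falling behind cores")
    return alerts

print(cluster_alerts(running_nodes=3, core_count=3, mria_lag_seconds=0.02))  # []
print(cluster_alerts(running_nodes=1, core_count=3, mria_lag_seconds=2.5))   # both alerts fire
```

Note that losing one of three cores (running_nodes=2) raises no quorum alert, matching the fault tolerance discussed earlier; the alert fires only once a majority is gone.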
Horizontal Scaling and Load Testing
Scaling EMQX on Kubernetes is simple but requires care. You have two paths:
Scaling Replicant Nodes (recommended for most workloads):
kubectl scale statefulset emqx-replicant --replicas=10 -n emqx-system
Each new replicant fetches the snapshot from the nearest core and serves clients. No subscription re-balancing needed. Clients connect to the Service (not individual pods), so Kubernetes round-robins them automatically. Expect each replicant to consume ~500MB-1GB RAM for subscription caches with 1M active subscriptions; monitor memory usage during scaling.
Scaling Core Nodes (only when cores are saturated):
kubectl scale statefulset emqx-core --replicas=5 -n emqx-system
This triggers a Raft reconfiguration, which is slower (5-15 minutes, not seconds). Do this sparingly—cores should remain stable, and you should grow replicants first. During reconfiguration the cluster remains writable as long as quorum holds, but latency spikes. Always scale cores during a maintenance window, not during peak traffic.
