Apache Pulsar Geo-Replication Architecture for Multi-Region (2026)

Apache Pulsar geo-replication is the most operationally honest answer to active-active messaging across regions in 2026, and it earns that title precisely because the storage layer (BookKeeper) is decoupled from the broker that performs replication. That decoupling means you can scale a replicator without rebalancing partitions, fence a region without losing entries, and bolt on replicated subscriptions without an external Connect cluster. Kafka MirrorMaker 2 still works, Confluent Cluster Linking still ships, but Pulsar 4.0 (released October 2024 and the long-term-support line through 2026) makes a different set of trade-offs that suit IoT, digital-twin, and order-management workloads where producers can write to any region.

Architecture at a glance

Apache Pulsar Geo-Replication Architecture for Multi-Region (2026) — architecture diagram — Architecture diagram — Apache Pulsar Geo-Replication Architecture for Multi-Region (2026)

This post walks the Pulsar 4.0 architecture refresher, native cross-cluster replication mechanics, topology patterns (mesh, hub-spoke, active-passive), replicated subscriptions and their failure modes, the BookKeeper E:Qw:Qa math that decides whether your p99 stays under 50 ms, real pulsar-admin configuration, and an honest comparison with Kafka MirrorMaker 2 and Confluent Cluster Linking. Five original diagrams ground the discussion in concrete topologies.

Context: why messaging crosses regions in 2026

Cross-region messaging today is driven by three forces: regulatory data residency (EU customer events stay in EU bookies), latency-sensitive consumers (a US dashboard cannot read from Frankfurt at 110 ms RTT), and disaster recovery against a full-region outage. The incumbents are Apache Kafka with MirrorMaker 2, Confluent Cluster Linking, AWS MSK Replicator, and Apache Pulsar’s native geo-replication.

Pulsar’s design intent, traceable to its origin at Yahoo, was a single logical service spanning datacenters. Kafka’s design intent was a single tightly-coupled cluster, with replication added later as an external job. That history shows up in 2026 operations: Pulsar runs the replicator inside the broker JVM as a managed consumer-producer pair, while MirrorMaker 2 is a separate Kafka Connect cluster you size, monitor, patch, and pay for. The architectural difference is not academic; it changes the operational headcount per region.

Pulsar 4.0 brought two relevant upgrades. First, partial support for Oxia as a metadata store, removing the ZooKeeper bottleneck for namespace and topic metadata. Second, hardened replicated subscriptions with global cursor markers that survive broker restarts in either cluster. Both are documented in the Apache Pulsar 4.0 geo-replication guide.

The 2026 baseline for any serious multi-region messaging conversation includes a fourth force most teams underweight: blast radius. A single-cluster outage in 2024 took down a major payments processor for nine hours; their post-mortem cited the lack of independent regional clusters as the proximate cause. Geo-replication is not just about latency and compliance, it is about ensuring a region failure remains a region failure rather than a global one. Pulsar’s stateless brokers and per-cluster BookKeeper isolation make that blast-radius story easier to defend in an architectural review than Kafka’s intertwined broker-storage design, where a runaway compaction or a leader election storm can spread across the cluster.

Pulsar architecture refresher: the three-layer separation

Apache Pulsar separates compute from storage from metadata. Brokers are stateless Java processes that own no data; BookKeeper bookies store the actual message log entries; a metadata store (ZooKeeper, or Oxia / etcd in newer deployments) holds namespace policies, topic ownership, and replicator state. This three-layer split is what makes geo-replication tractable.

When a producer sends a message to persistent://retail/orders/ns-east/checkout-events, the owning broker writes the entry to a BookKeeper ledger striped across Qw bookies and acknowledges after Qa bookies confirm. The broker itself can be killed and restarted on another node in seconds without data loss, because the ledger lives in BookKeeper. That separation is the precondition for sane cross-region replication: the replicator thread is just another consumer-producer pair living inside the broker JVM.

In Pulsar 4.0, the metadata layer has two options. Production deployments still overwhelmingly run ZooKeeper 3.9, but Oxia is graduating from experimental status and is now used by StreamNative Cloud for multi-region metadata at lower latency than cross-region ZK quorums. Oxia uses a single-leader-per-shard design with linearizable reads, which avoids the cross-region write penalty of a ZK ensemble that spans datacenters. For 2026 greenfield, Oxia is worth evaluating; for brownfield, ZooKeeper is fine as long as the ensemble lives in one region.

Stateless brokers

A broker owns a topic partition for some lease period. If the broker dies, another broker in the same cluster takes over by reading the ledger metadata. There is no data resync because the broker held nothing locally. This matters for geo-replication because the replicator threads, which are scheduled on the owning broker, can fail over without losing their cursor; the cursor is itself stored as a BookKeeper ledger.

BookKeeper bookies

A bookie stores ledger entries. Each ledger is a write-once log scoped to one writer (the broker). The triple E:Qw:Qa controls durability: E is the ensemble size (bookies in the stripe), Qw is the write quorum (parallel writes per entry), Qa is the ack quorum (responses needed for client ack). A common default is 5:3:2 per the Apache BookKeeper protocol docs.

Metadata store

Topic ownership, namespace policies, schema versions, and replicator subscription cursors live in ZooKeeper or Oxia. Configuration store is the cross-region equivalent that holds cluster definitions and tenant policies; in classic deployments it is a separate ZK ensemble shared across clusters, in Oxia deployments it is a global shard.

A common 2026 mistake is conflating the local metadata store with the configuration store. The local store is per-cluster and holds high-write topic ownership. The configuration store is shared and holds low-write cluster definitions. Sharing one ZK ensemble for both across regions is the most expensive mistake operators make; it puts your topic-ownership writes on the WAN. Pulsar 4.0 explicitly recommends one ZK ensemble per region for local metadata, plus a global configuration store ZK whose write rate stays under 10 transactions per second in typical deployments.

Native geo-replication: the mechanic

Pulsar geo-replication is broker-to-broker asynchronous replication driven by a replicator thread that acts as an internal consumer in the source cluster and an internal producer in each destination cluster. Enable replication on a namespace, and every persistent topic under it gets a replicator per destination cluster. Messages produced in region A flow out of the broker, into the replicator, across the WAN, and into the broker in region B which persists to its own bookies.

The replicator is not magic; it is a managed Producer and Reader running inside the broker process. That design has two consequences. First, the replicator inherits authentication, TLS, and rate-limiting from the normal broker stack. Second, replication health is observable through standard Pulsar metrics like pulsar_replication_rate_in, pulsar_replication_backlog, and pulsar_replication_connected_count.

A subtle but critical property: replicated messages carry a replicatedFrom property naming the source cluster. The destination broker uses this to short-circuit any onward replication, preventing infinite loops in a mesh topology. The mechanism is documented in the Pulsar concepts and architecture overview and is what makes mesh topologies safe.

Configuring replication: the actual commands

Cluster registration is the prerequisite. Each region’s cluster must be registered in the configuration store, including the broker service URL that the other regions will dial.

# In region-A admin, register region-B as a known cluster
bin/pulsar-admin clusters create eu-west-1 \
  --url http://broker.eu-west-1.internal:8080 \
  --broker-url pulsar://broker.eu-west-1.internal:6650 \
  --url-secure https://broker.eu-west-1.internal:8443 \
  --broker-url-secure pulsar+ssl://broker.eu-west-1.internal:6651

# Create a tenant allowed to span both clusters
bin/pulsar-admin tenants create retail \
  --allowed-clusters us-east-1,eu-west-1 \
  --admin-roles ops-platform

# Create a namespace and enable replication
bin/pulsar-admin namespaces create retail/orders \
  --clusters us-east-1,eu-west-1

bin/pulsar-admin namespaces set-clusters retail/orders \
  --clusters us-east-1,eu-west-1

From that moment, every new topic created under retail/orders is automatically replicated to both clusters. Existing topics inherit the namespace policy on the next ownership refresh.

Per-topic replication and selective routing

Pulsar 4.0 supports per-topic replication clusters when you need to override the namespace default, useful for compliance-sensitive topics that must not leave EU.

bin/pulsar-admin topics set-replication-clusters \
  persistent://retail/orders/eu-pii-events \
  --clusters eu-west-1

Throttling is per-cluster and per-namespace, set in megabytes per second on the dispatch rate:

bin/pulsar-admin namespaces set-replicator-dispatch-rate retail/orders \
  --msg-dispatch-rate 50000 \
  --byte-dispatch-rate 52428800 \
  --dispatch-rate-period 1

That caps the outbound replication rate to 50 MB/s, preventing a backlog flush from saturating the WAN.

Verifying replication is actually flowing

A subtle gotcha: the pulsar-admin clusters create and namespace policy commands succeed even when the destination broker URL is unreachable. The error surfaces only when a topic produces traffic, at which point the replicator logs a connection failure and backs off. Validate replication explicitly after enabling it:

# Produce a test message in region A
bin/pulsar-client produce \
  persistent://retail/orders/replication-canary \
  --messages "ping-from-us-east-$(date +%s)" \
  --num-produce 1

# In region B, consume the same topic
bin/pulsar-client consume \
  persistent://retail/orders/replication-canary \
  --subscription-name canary-sub \
  --num-messages 1 \
  --subscription-position Earliest

If the consume command times out, check pulsar-admin topics stats persistent://retail/orders/replication-canary for the replication block, which lists each destination cluster with connected, replicationBacklog, and replicationDelayInSeconds fields. A connected: false with non-zero backlog is the canonical “broker URL wrong” signal.

Topology patterns: when to choose which

The four canonical Pulsar geo-replication topologies are full mesh, hub-and-spoke, active-passive failover, and aggregated read-only. Each maps to a different cost-latency-complexity trade-off; the right choice depends on how many regions you need writes in and how aggressive your RPO target is.

Full mesh

Every cluster replicates to every other cluster. With N clusters you get N*(N-1) directed links. Three regions means six links, four regions means twelve. The benefit is low-latency reads everywhere because every region has a local copy. The cost is bandwidth (each message crosses the WAN N-1 times) and replicator threads (each broker maintains a producer to every peer cluster). Mesh is the right call for active-active multi-region IoT ingestion where every region writes and every region reads.

Hub-and-spoke

One central cluster acts as a hub. Spokes replicate to the hub, and the hub replicates back out. Link count is 2*(N-1), which scales linearly. The downside is that the hub is a hot spot and a single point of failure for cross-region delivery. Use it when you have 4+ regions and bandwidth cost matters more than mesh latency.

Active-passive

One region writes; another region is a warm standby. Replication is one-way, and failover is operator-driven via pulsar-admin clusters update-peer-clusters or DNS cutover. This is the right pattern for DR-only deployments where you do not want the consistency complexity of active-active.

Aggregated read-only

A regional cluster receives messages from many edge clusters but produces nothing back. Useful when an edge IoT site pushes telemetry to a central analytical cluster but the central cluster does not need to push anything back to the edge. This is closely related to patterns described in our Sparkplug B 3.0 edge network architecture for industrial telemetry aggregation.

Topology selection in practice

Pick by counting your active write regions. One region with a DR target equals active-passive. Two regions writing equals mesh (trivially N*(N-1)=2 links). Three regions writing equals mesh if WAN bandwidth permits; six links is still tractable. Four or more writing regions equals hub-and-spoke unless every region must read with sub-100 ms latency, in which case mesh and pay the bandwidth bill. The transition point in real billing is around 4 regions at 100 MB/s sustained replication, where the (N-1) bandwidth multiplier in mesh starts to materially exceed the cost of an extra hub cluster.

Replicated subscriptions: global cursors and exactly-once dreams

Replicated subscriptions extend ordinary subscription cursor state across clusters. When a consumer in region A acks a message at position (ledger=42, entry=1500), the cursor advance is replicated as a marker into the message stream, so a consumer that fails over to region B picks up at the next unacked message rather than re-reading the entire backlog.

The mechanism uses periodic snapshot markers inserted into the topic by the source broker. Each marker records the cursor positions in all clusters at a point in time. On failover, the destination broker locates the most recent marker where all participants have a known cursor position, and resumes from there.

Enable replicated subscriptions per-consumer:

Consumer<byte[]> consumer = client.newConsumer()
    .topic("persistent://retail/orders/checkout-events")
    .subscriptionName("billing-service")
    .subscriptionType(SubscriptionType.Failover)
    .replicateSubscriptionState(true)
    .subscribe();

When replicated subscriptions break

Three failure scenarios in 2026 production: First, if the markers are inserted at a rate too slow relative to ack rate, failover replays a large window. The default snapshot interval is 1 second per replicatedSubscriptionsSnapshotFrequencyMillis in broker.conf; tune down for high-throughput topics. Second, if one cluster is partitioned from the others, marker generation stalls because the marker requires acknowledgment from all replication clusters. Third, exactly-once semantics across regions are not guaranteed; you get at-least-once with a small replay window on failover, which most applications can tolerate via idempotent downstream sinks.

The fourth scenario, less discussed, is interaction with key-shared subscriptions. Replicated subscriptions assume a single cursor per subscription, but key-shared subscriptions distribute keys across many consumers. Pulsar 4.0 supports replicated key-shared subscriptions only experimentally; the snapshot mechanism does not yet capture per-key cursor state cleanly. If you run key-shared subscriptions, plan failover with the expectation that some keys will replay further than others, and design the consumer to be idempotent at the key level. A common pattern is to write to a destination store keyed by message ID, so duplicate processing during failover replay is a no-op rather than a double-write.

Consistency model: asynchronous by default

Pulsar geo-replication is asynchronous. A producer in region A receives PRODUCER_ACK as soon as the local BookKeeper ack quorum is satisfied; replication to other regions happens after that. This is the right default for multi-region: synchronous replication across a 60-180 ms WAN would be catastrophic for producer p99.

The consistency guarantees:

Within a region: linearizable per-topic-partition (single owning broker, ordered ledger writes).
Across regions: eventually consistent. Lag is observable as pulsar_replication_backlog.
Across topics in the same region: per-topic ordering only, no global order.
Replicated subscriptions: at-least-once with bounded replay window on failover.

For workloads needing stricter cross-region guarantees, Pulsar supports synchronous mode by stretching the BookKeeper ensemble across regions, where each addEntry waits for a Qa ack quorum that spans regions. This is rarely the right answer because p99 collapses to the WAN RTT, but it exists. The same trade-off shapes architectural decisions discussed in our Iceberg vs Delta vs Hudi lakehouse ADR when the analytical layer needs cross-region table commits.

Practitioners sometimes ask for a CAP-theorem framing. Pulsar geo-replication picks AP over CP by default: in a network partition, both regions continue accepting writes locally and reconcile asynchronously when the partition heals. The reconciliation is not a merge; it is an interleave, because each region’s messages have independent message IDs in independent ledgers. Consumers see all messages from all regions after the partition heals, with per-topic per-region ordering preserved but no global order. For workloads where order is the contract, design topics to be region-affinitized (one writer region per topic) so that ordering remains well-defined.

BookKeeper layout: the E:Qw:Qa math that matters

The BookKeeper triple (E, Qw, Qa) decides durability, write parallelism, and ack latency. Get it wrong and your p99 hits the WAN. The interaction is: an entry is striped across Qw of the E bookies; the broker waits for Qa acks before client ack. If those Qa bookies span regions, your write latency becomes the WAN RTT.

The 2026 recommendation for geo-replicated Pulsar:

Keep E, Qw, Qa all satisfied within one region’s bookie pool. A typical setting is E=5, Qw=3, Qa=2 per region.
Do not stretch BookKeeper ensembles across regions unless you have a hard synchronous requirement. The intra-region p99 for Qa=2 on SSDs is 3-8 ms; the cross-region version is 60-180 ms.
Use the namespace persistence policy to set this explicitly:

bin/pulsar-admin namespaces set-persistence retail/orders \
  --ensemble-size 5 \
  --write-quorum 3 \
  --ack-quorum 2 \
  --ml-mark-delete-max-rate 1.0

For regions with limited bookie capacity, E=3, Qw=3, Qa=2 works but loses the ability to survive a bookie failure without re-replication.

A worked example. Suppose region us-east-1 has 8 bookies on NVMe SSDs and you want to survive any single bookie failure without a re-replication storm. With E=5, Qw=3, Qa=2, each ledger is striped across 5 of the 8 bookies, each entry is written in parallel to 3 of those 5, and the client ack returns when 2 acks land. Killing any one bookie loses one stripe member; the auto-recovery daemon rebuilds that stripe onto a free bookie in the background. Killing two bookies simultaneously may cost you Qa for some ledgers; killing three simultaneously means data loss for some ledgers. The math generalizes: Qa failed bookies in a single stripe loses entries.

Region-affinity placement

Pulsar 4.0 supports BookKeeper region-aware placement policies. The bookkeeperClientRegionawarePolicyEnabled=true setting in broker.conf, combined with bookie rack-config (where “rack” is the region), ensures each ensemble stripes across multiple racks for fault tolerance. In geo-replicated deployments, set the rack to the region name to keep ensembles regional.

# broker.conf
bookkeeperClientRegionawarePolicyEnabled=true
bookkeeperClientMinNumRacksPerWriteQuorum=1
bookkeeperClientEnforceMinNumRacksPerWriteQuorum=true

Auto-recovery and re-replication

When a bookie fails permanently, the BookKeeper auto-recovery daemon detects under-replicated ledgers from the metadata store and re-replicates entries to healthy bookies. In a geo-replicated deployment, re-replication traffic is intra-region by design, never cross-WAN. Monitor the bookkeeper_server_BOOKIE_QUARANTINE_TIME_seconds and bookkeeper_server_REPLICATION_WORKER_NUM_LEDGERS_UNDER_REPLICATED metrics. A persistent backlog of under-replicated ledgers signals either insufficient spare bookie capacity or a slow auto-recovery worker that needs tuning via replicationParallelism in bk_server.conf.

Trade-offs and failure modes

No replication system is free, and Pulsar geo-replication has real failure modes worth naming before you deploy it.

Split brain. If the two regions partition and continue accepting writes on the same topic, you get divergent message orders. Pulsar mitigates this with topic ownership: a topic is owned by exactly one broker in one cluster at a time, and replication is direction-aware. But in active-active where producers in both regions write to the same topic, you get interleaved messages with no global order. Design downstream consumers to be order-independent, or use distinct topic-per-region patterns.

Asymmetric replication backlog. A slow consumer or a slow WAN link causes pulsar_replication_backlog to grow unbounded by default. Set backlogQuotaDefaultLimitGB and backlogQuotaDefaultRetentionPolicy=producer_request_hold to apply backpressure to producers, or consumer_backlog_eviction to drop oldest messages. The default is producer_exception, which simply errors producers when the quota is hit.

Replicator deadlock. Historically a real bug, mostly fixed in Pulsar 3.x via PIP-321 reorganization of cluster responsibilities. If you see pulsar_replication_connected_count stuck at zero with a growing backlog, restart the broker owning the topic; the issue is typically a leaked replicator connection.

Replica explosion. In a mesh of N clusters, a single 1 MB message uses (N-1) MB of WAN bandwidth. For large payloads, use PIP-37 chunking or move binary payloads to object storage and replicate references.

Metadata store as the SPOF. A cross-region ZooKeeper ensemble has terrible write latency. Run one ZK ensemble per cluster, and rely on the configuration store ZK for cross-cluster metadata only. Better still, evaluate Oxia for 2026 deployments.

Replicated subscriptions stall on partition. As noted earlier, the marker mechanism requires all participants to acknowledge. If one region is unreachable, marker generation halts, and on a future failover you may replay further back than expected. Monitor pulsar_replication_subscription_replication_marker_failed_total and alert on growth.

Operational cost of mesh. With 4 regions you have 12 replicator links, 12 sets of metrics, 12 backlog graphs. Beyond 4 regions, the cognitive load argues for hub-and-spoke, even with the hub being a SPOF.

Schema evolution drift. Pulsar’s schema registry is per-cluster by default. Without explicit cross-cluster schema synchronization, a producer in region A using schema v3 may emit messages that fail to deserialize in region B still on schema v2. The 2026 mitigation is to enable isAllowAutoUpdateSchema=true and rely on backward-compatible evolution rules, plus an out-of-band schema-registry sync job. PIP-119 work is ongoing to make this cluster-aware natively; until merged, treat schema drift as an operational concern, not a Pulsar feature.

TLS certificate rotation across clusters. Geo-replication links use mTLS. Rotating the broker certificate in region A without coordinating the truststore in region B silently breaks replication. Build cert rotation into your standard runbook with a 7-day overlap window where both old and new certs are trusted, and validate pulsar_replication_connected_count after every rotation.

Tiered storage offload interactions. When BookKeeper ledgers are offloaded to S3 or GCS via tiered storage, the replicator reads from the local broker which transparently reads from the offload tier. Latency spikes during catch-up reads can starve foreground producer throughput. Set managedLedgerOffloadAutoTriggerSizeThresholdBytes conservatively in geo-replicated namespaces, or pre-warm the offload reader cache before catch-up periods.

Clock skew on replicated subscriptions. Snapshot markers carry timestamps from the source broker. If clock skew between brokers exceeds the marker frequency, snapshot ordering becomes ambiguous and replicated subscriptions may resume from an older-than-expected position. Run chrony or equivalent NTP synchronization on all brokers and bookies, and alert when skew exceeds 100 ms between any two brokers in the geo-replicated set.

Function and IO replication. Pulsar Functions and IO connectors that live inside the broker do not automatically replicate. A function deployed in region A is not present in region B. For DR symmetry, deploy functions explicitly in each region using identical function configs, and source the function code from a shared artifact store. The same applies to schema artifacts and connector JARs.

Real-world performance numbers

Synthetic benchmarks rarely match production, but a few 2025-2026 figures help calibrate expectations. StreamNative published a 2025 benchmark showing a 3-region mesh sustaining 1.5 GB/s aggregate produce rate with end-to-end p99 replication latency of 340 ms across us-east-1 to eu-west-1 to ap-south-1, using c6gn.4xlarge brokers and i3en.2xlarge bookies. Yahoo’s internal Pulsar deployment, the original geo-replicated workload, sustains north of 6 GB/s across 8 datacenters with per-region replication backlog held below 5 seconds 99.9 percent of the time. Verisign’s public-facing engineering posts describe DNS-event geo-replication using a 4-region active-active Pulsar with sub-second cross-region delivery.

These numbers are not promises; they reflect well-tuned deployments where the WAN is high-quality dedicated capacity, brokers are not co-tenant with noisy neighbors, and the producer batches messages efficiently. Real production lives in a wider range: 100 ms to 5 seconds of cross-region p99

Apache Pulsar Geo-Replication Architecture for Multi-Region (2026)

Apache Pulsar Geo-Replication Architecture for Multi-Region (2026)

Architecture at a glance

Context: why messaging crosses regions in 2026

Pulsar architecture refresher: the three-layer separation

Stateless brokers

BookKeeper bookies

Metadata store

Native geo-replication: the mechanic

Configuring replication: the actual commands

Per-topic replication and selective routing

Verifying replication is actually flowing

Topology patterns: when to choose which

Full mesh

Hub-and-spoke

Active-passive

Aggregated read-only

Topology selection in practice

Replicated subscriptions: global cursors and exactly-once dreams

When replicated subscriptions break

Consistency model: asynchronous by default

BookKeeper layout: the E:Qw:Qa math that matters

Region-affinity placement

Auto-recovery and re-replication

Trade-offs and failure modes

Real-world performance numbers

Related

Comments

Leave a Reply Cancel reply

Tag Cloud

Categories