PostgreSQL vs Distributed SQL (CockroachDB, YugabyteDB): A Multi-Tenant SaaS ADR

Most teams that migrate off PostgreSQL onto a distributed SQL database do it a full order of magnitude too early, and they pay for the mistake in latency, cost, and operational scar tissue for years afterward. The choice at the heart of PostgreSQL versus distributed SQL is not “which database is more scalable.” It is “which failure mode can my team survive”: the ceiling of a single write node, or the tax of running consensus on every commit. Those are different bets, and the honest answer for the median multi-tenant SaaS in 2026 is still a single PostgreSQL primary with replicas, until a specific, nameable requirement forces the other way.

This post is written as a real Architecture Decision Record. It states the context, the decision drivers, the options with their mechanisms, and the consequences you actually sign up for. It does not pretend distributed SQL is a strictly better Postgres, and it does not pretend Postgres scales forever.

What this covers: the Postgres scaling path (vertical, replicas, partitioning, Citus, PgBouncer), distributed SQL internals (Raft, range sharding, distributed transactions, hybrid logical clocks, follower reads), the true latency and cost of consensus, multi-region behavior, and a decision framework you can apply this quarter.

Context and Background

The typical trigger is a growth chart. A multi-tenant SaaS crosses a few thousand tenants, the largest tenant’s tables cross a few hundred million rows, and someone in an architecture review says the word “shard.” From there the conversation drifts, almost inevitably, toward CockroachDB or YugabyteDB, because “distributed SQL” sounds like the version of your database that never needs this meeting again. This is scaling anxiety, and it is a genuinely expensive emotion when it drives infrastructure decisions.

The anxiety is not baseless. A single PostgreSQL primary accepts all writes on one machine. There is exactly one node that can commit a transaction, and vertical scaling has a ceiling set by the largest instance your cloud sells and the point where a single WAL stream, checkpoint, and autovacuum cycle stop keeping up. Read replicas scale reads, not writes. So the fear — “we will hit a write wall and have no move left” — is structurally real.

What the anxiety gets wrong is the timeline and the magnitude. Modern PostgreSQL on a large cloud instance comfortably handles tens of thousands of transactions per second for well-indexed OLTP, and the practical write ceiling for most SaaS workloads sits far above where teams start panicking. The scaling path also has more rungs than people remember: partitioning, logical sharding by tenant, Citus, and aggressive connection pooling can each buy up to an order of magnitude before you exhaust the model. Before adopting a new consistency model and a new operational discipline, it is worth understanding how multi-region active-active architecture actually behaves, because that is the requirement most often used to justify distributed SQL, and it is also the one that hurts the most in practice.

Distributed SQL — the “NewSQL” lineage descended from Google Spanner — genuinely solves horizontal write scaling and multi-region strong consistency. It does so by paying for every write with a consensus round, which is a real, permanent, per-transaction cost. The PostgreSQL project’s own documentation on high availability and replication lays out the single-primary model plainly; the trade you are evaluating is whether to leave that model. This ADR exists to make that trade explicit rather than emotional.

The Decision: Options and Architecture

Decision: For a single-region multi-tenant SaaS below roughly the low tens of thousands of write transactions per second, choose PostgreSQL with read replicas and tenant-aware partitioning; adopt Citus only when a single primary genuinely saturates. Choose distributed SQL (CockroachDB or YugabyteDB) only when you have a hard requirement for multi-region low-latency writes, survive-a-region availability, or write throughput beyond a single node — and accept its latency and cost as the price.

Figure 1: The two candidate architectures. Option A funnels writes through one primary (fanning out to replicas and optional Citus shards); Option B accepts writes on any node and replicates each range via Raft.

The left side of Figure 1 is Option A: application servers connect through PgBouncer to a single primary that streams WAL to read replicas, with Citus as an optional sharding layer for the largest tables. Every write lands on one node; reads scale horizontally across replicas. The right side is Option B: a symmetric cluster where any node can accept a query, route it to the range leader that owns the affected keys, and commit only after a Raft quorum agrees. The architectural difference is where write authority lives — concentrated in one place, or spread across ranges — and everything downstream follows from that.

Option A — PostgreSQL with replicas, partitioning, and Citus

Option A is the incumbent, and its strength is that it is boring in the best sense. One primary owns all writes. You scale vertically first: bigger instance, more RAM for cache, faster NVMe, more IOPS. You scale reads by adding streaming replicas and routing read-only traffic to them, accepting a few milliseconds to low-hundreds-of-milliseconds of replication lag depending on load. You control the connection explosion — Postgres backends are processes, and each idle connection costs memory and scheduler attention — by fronting the database with PgBouncer in transaction-pooling mode, which lets thousands of client connections share a few hundred server connections.

When one table gets too large, you use native declarative partitioning to split it by range or hash, so the planner prunes to the relevant partition and autovacuum works on smaller chunks. When one primary genuinely cannot keep up with writes, you reach for Citus, which shards tables across worker nodes and distributes queries from a coordinator. Citus keeps the Postgres wire protocol and most of the SQL surface, so it is an extension of the model rather than a rewrite. For a multi-tenant SaaS the natural distribution key is tenant_id, which co-locates each tenant’s rows on one shard and makes the common single-tenant query a single-node query.
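The co-location argument can be made concrete with a small routing sketch. This is an illustration only: `SHARD_COUNT`, `shard_for_tenant`, and `shards_touched` are hypothetical names, and the SHA-256 hash stands in for whatever hash function the sharding layer really uses (it is not Citus's implementation).

```python
import hashlib

SHARD_COUNT = 32  # hypothetical shard count, not a recommendation


def shard_for_tenant(tenant_id: str, shard_count: int = SHARD_COUNT) -> int:
    """Map a tenant to a shard with a stable hash (illustrative, not Citus's hash)."""
    digest = hashlib.sha256(tenant_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


def shards_touched(tenant_ids: list[str], shard_count: int = SHARD_COUNT) -> set[int]:
    """Shards a query must visit, given the tenants its filter names."""
    return {shard_for_tenant(t, shard_count) for t in tenant_ids}


# Every row of one tenant hashes to the same shard: a single-node query.
print(shards_touched(["tenant-42"]))
# A cross-tenant query usually fans out to many shards.
print(len(shards_touched([f"tenant-{i}" for i in range(100)])))
```

The design point is that the shard is a pure function of tenant_id, so any query carrying the tenant filter can be routed without consulting other nodes.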

Option B — Distributed SQL (CockroachDB, YugabyteDB)

Option B replaces the single-primary model with a cluster of peers. Data is split into ranges (CockroachDB) or tablets (YugabyteDB) — contiguous key spans that are individually replicated, typically three ways. Each range elects a Raft leader that serializes writes for that range; a write commits when a quorum (two of three replicas) has durably appended the entry to its Raft log. There is no single write node: different ranges have leaders on different machines, so write capacity scales by adding nodes and letting ranges rebalance. Both databases speak the PostgreSQL wire protocol, so most drivers and many queries port with modest changes.

CockroachDB layers a serializable transaction engine over this range storage and uses hybrid logical clocks (HLC) to order transactions without a special clock like Spanner’s TrueTime. An HLC combines a physical timestamp with a logical counter, so every event gets a globally comparable timestamp even when machine clocks drift slightly. The database also assumes a bounded maximum clock offset between nodes: when a read finds a value whose timestamp falls inside that uncertainty window, it cannot tell which came first, so it retries the read at a higher timestamp to preserve consistency. YugabyteDB splits the stack explicitly: YSQL reuses actual PostgreSQL query-layer code on top of DocDB, a distributed document store that shards into tablets and replicates each via Raft. Because YSQL runs real Postgres query-processing code, a larger fraction of Postgres SQL, functions, and behavior carries over than in a from-scratch reimplementation; that reuse is Yugabyte’s central compatibility argument. Both databases give you horizontal write scaling and the ability to place replicas across regions (the two capabilities Option A structurally lacks), and both pay for that with a consensus round on the write path that a single Postgres primary simply does not have.
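To make the HLC idea less abstract, here is a minimal sketch of the textbook hybrid-logical-clock update rules. It is not CockroachDB's code: the class and method names are hypothetical, and the uncertainty-window logic is left out entirely.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True, order=True)
class Timestamp:
    wall: int     # physical component, e.g. nanoseconds since epoch
    logical: int  # counter that breaks ties within one wall value


class HybridLogicalClock:
    def __init__(self, physical_now: Callable[[], int] = time.time_ns):
        self._physical_now = physical_now
        self._last = Timestamp(0, 0)

    def now(self) -> Timestamp:
        """Timestamp a local event or an outgoing message."""
        pt = self._physical_now()
        if pt > self._last.wall:
            self._last = Timestamp(pt, 0)
        else:  # physical clock has not advanced: bump the counter
            self._last = Timestamp(self._last.wall, self._last.logical + 1)
        return self._last

    def update(self, remote: Timestamp) -> Timestamp:
        """Merge a timestamp received from another node."""
        pt = self._physical_now()
        wall = max(pt, self._last.wall, remote.wall)
        if wall == self._last.wall and wall == remote.wall:
            logical = max(self._last.logical, remote.logical) + 1
        elif wall == self._last.wall:
            logical = self._last.logical + 1
        elif wall == remote.wall:
            logical = remote.logical + 1
        else:  # local physical time is strictly ahead of both
            logical = 0
        self._last = Timestamp(wall, logical)
        return self._last
```

The property that matters is causality: a node whose physical clock lags still produces a timestamp larger than any message it has already received, because the merge takes the maximum wall value and bumps the counter.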

Decision matrix

Driver	Option A: Postgres + replicas/Citus	Option B: Distributed SQL
Write scale ceiling	One primary; Citus extends by sharding	Horizontal; add nodes to add write capacity
Single write latency	Local fsync; sub-millisecond to low ms	Consensus round; higher, region-dependent
Multi-region writes	Hard; single primary is one region	Native; replicas placed across regions
Consistency	Strong on primary; replica lag on reads	Serializable (CRDB) / strong on DocDB
Operational burden	Well-known; failover needs tooling	Self-healing but new mental model
Cost	Efficient; one big node + replicas	Higher; 3x+ replication and more nodes
Ecosystem maturity	Deepest in the industry	Postgres-wire compatible, not identical

The matrix is the ADR in miniature. Option A wins on latency, cost, and ecosystem for the common case. Option B wins on write scale-out, multi-region writes, and automatic survival of node and region loss. The decision is which column’s wins match your actual, current requirements — not which sounds more future-proof.

Deeper Analysis: How They Scale and Where They Hurt

The interesting engineering is in the mechanisms, because the mechanisms determine the bills and the pages. Start with the write path, because that is where distributed SQL charges its rent.

Figure 2: A distributed SQL commit. The gateway routes the write to the range leader, which replicates the log entry to followers and commits only after a quorum acknowledges.

Figure 2 shows the sequence. A client sends a write to any node (the gateway). The gateway routes it to the Raft leader for the affected range. The leader appends the entry to its log and ships it to followers. Only after a quorum — two of three replicas — durably acknowledges does the leader mark the entry committed and return success. In a single region with nodes in the same or nearby availability zones, that quorum round is on the order of a millisecond or two of added latency versus a local Postgres fsync. That is the cheap case, and it is genuinely cheap.
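The quorum rule in that sequence fits in a few lines. This is a minimal sketch of the majority calculation only, with hypothetical function names; real Raft adds a further condition (a leader only commits entries from its own term this way), which is omitted here.

```python
def quorum_size(replicas: int) -> int:
    """Smallest majority of a replica group."""
    return replicas // 2 + 1


def commit_index(match_index: list[int]) -> int:
    """Highest log index durably stored on a majority of replicas.

    match_index holds, per replica (leader included), the last log index
    that replica has acknowledged as durably appended.
    """
    ranked = sorted(match_index, reverse=True)
    return ranked[quorum_size(len(match_index)) - 1]


# Leader and one follower hold entry 7; the other follower lags at 4.
print(commit_index([7, 7, 4]))  # 7: two of three have it, so it is committed
# Only the leader holds entry 7; it must wait for one more acknowledgement.
print(commit_index([7, 4, 4]))  # 4
```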

The latency cost of consensus is a floor, not an average

The expensive case is multi-region. If your three replicas sit in regions separated by, say, 30–70 ms of round-trip network time, then every write must wait for at least one remote replica to acknowledge. Consensus needs a majority, so a write cannot commit faster than the round-trip to the nearest replica that forms a quorum with the leader. Spread three replicas across three distant regions and your write latency inherits the inter-region RTT as a hard floor — tens of milliseconds per commit, every commit, no caching around it. (These figures are order-of-magnitude, cloud-and-topology dependent, not benchmarks.)
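The floor is simple arithmetic, sketched below under loud assumptions: the function name is hypothetical, the RTT values are the illustrative figures from the paragraph above rather than measurements, and disk flush time, queueing, and multi-range coordination are all ignored.

```python
def commit_latency_floor_ms(follower_rtts_ms: list[float]) -> float:
    """Lower bound on commit latency for one Raft group.

    follower_rtts_ms: round-trip time from the leader to each follower.
    The leader counts toward the majority itself, so it needs
    (replicas // 2) follower acknowledgements, and it waits for the
    slowest of the fastest followers that complete the majority.
    """
    replicas = len(follower_rtts_ms) + 1
    acks_needed = replicas // 2
    if acks_needed == 0:
        return 0.0  # a single replica has nobody to wait for
    return sorted(follower_rtts_ms)[acks_needed - 1]


# Three replicas in nearby zones of one region.
print(commit_latency_floor_ms([1.0, 1.5]))               # 1.0
# Three replicas in three distant regions.
print(commit_latency_floor_ms([30.0, 70.0]))             # 30.0
# Five replicas: two follower acks needed, so one remote region is on the path.
print(commit_latency_floor_ms([1.0, 30.0, 70.0, 70.0]))  # 30.0
```

Note what the last case shows: adding replicas does not lower the floor unless enough of them sit close to the leader to form a majority on their own.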

This is the single most important number in the entire decision. A single-region Postgres primary commits a small transaction in roughly the time of one local WAL flush, typically around a millisecond or less. A globally distributed, strongly consistent write cannot, by the physics of the speed of light and the math of majority quorums, do the same. Distributed SQL does not eliminate this cost; it makes it survivable and consistent. Follower reads, which serve reads from a nearby non-leader replica at a slightly stale but consistent timestamp, are the standard mitigation for read latency, but they do nothing for write latency. If your workload is write-heavy, latency-sensitive, and multi-region, you are paying the consensus tax on the critical path and there is no way around it.

Range splits, hotspots, and the sequential-key trap

Distributed SQL scales by splitting ranges as they grow and rebalancing them across nodes. This is elegant until your access pattern fights it. The classic failure is a monotonically increasing key, such as an auto-increment ID or a timestamp, used as the leading column of a range-ordered primary key. Because ranges are ordered by key, all new inserts land at the end, in the single range that owns the highest keys, whose single Raft leader becomes a hotspot. You bought a distributed system and funneled all your writes through one node anyway. The fixes (hash-sharded indexes in CockroachDB, hash sharding on the leading key column in YugabyteDB) work, but they are things you must know to do; the failure is silent until it isn’t.
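A toy simulation shows the trap and the fix. Everything here is assumed for illustration: the split points, the key space of 0 to 999, and the SHA-256 prefix are hypothetical, and real engines split hot ranges dynamically (which does not help, because the newest keys still land in the last range).

```python
import bisect
import hashlib
from collections import Counter

SPLIT_POINTS = [250, 500, 750]  # hypothetical boundaries: four ranges over keys 0..999


def range_for_key(key: int) -> int:
    """Index of the range that owns a key in a key-ordered layout."""
    return bisect.bisect_right(SPLIT_POINTS, key)


def hashed_key(key: int, key_space: int = 1000) -> int:
    """Scatter a sequential key across the key space with a stable hash."""
    digest = hashlib.sha256(str(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % key_space


def writes_per_range(keys) -> Counter:
    """Count how many writes each range (and so each leader) receives."""
    return Counter(range_for_key(k) for k in keys)


newest_ids = range(900, 1000)  # the latest 100 auto-increment inserts
print(writes_per_range(newest_ids))                         # all 100 hit range 3
print(writes_per_range(hashed_key(k) for k in newest_ids))  # spread across ranges
```

The cost of the fix is that range scans in key order stop being local, which is why it is a deliberate schema choice and not a default you get for free.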

Distributed transactions and the 2PC cost

A transaction that touches keys owned by different ranges (different leaders on different nodes) cannot commit with a single Raft round. It needs a distributed commit protocol, effectively a form of two-phase commit coordinated across the involved ranges, layered on top of each range’s own consensus. That means more round-trips, intent/lock records written and later resolved, and a larger window in which contention between transactions turns into retries. Under serializable isolation, contended transactions get aborted and must be retried by the application, so your code must handle transaction-retry loops. That is a real difference from single-node Postgres at its default READ COMMITTED isolation, where conflicting writers mostly wait on row locks instead of aborting. Keep transactions single-range (single-tenant, thanks to a tenant_id-led key) and you avoid most of this; the moment you write across tenants or across ranges, you pay.
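A retry loop of the kind described might look like the sketch below. It is driver-agnostic on purpose: `SerializationFailure` is a hypothetical stand-in for whatever exception your driver raises for SQLSTATE 40001 (serialization_failure), and `run_with_retries` is an illustrative helper, not a library API.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class SerializationFailure(Exception):
    """Hypothetical stand-in for a driver error carrying SQLSTATE 40001."""


def run_with_retries(
    txn: Callable[[], T],
    max_attempts: int = 5,
    base_delay_s: float = 0.01,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run txn, retrying on serialization failures with jittered backoff.

    txn must open, execute, and commit the whole transaction itself, so
    that each retry starts from a clean slate.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return txn()
        except SerializationFailure:
            if attempt == max_attempts:
                raise  # give up and surface the error to the caller
            # Exponential backoff with full jitter to avoid retry stampedes.
            sleep(random.uniform(0, base_delay_s * 2 ** (attempt - 1)))
    raise AssertionError("unreachable")
```

The important constraint is that the retried function must be safe to run more than once: any side effect outside the database (sending an email, calling a payment API) belongs after the commit, not inside the loop.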

Postgres partitioning and Citus mechanics

Option A’s scaling is more manual but more predictable. Native partitioning turns one huge table into a set of child tables under a parent; the planner prunes to the relevant child, and operations like vacuum, index builds, and old-data drops act on one partition at a time. This is how you keep a billion-row events table manageable on a single primary. Citus goes further, hash-distributing a table across worker nodes on a chosen key. With tenant_id as the distribution column, a query filtered to one tenant is routed to exactly one shard and runs as ordinary local Postgres — you keep single-node latency for the common path while distributing storage and write load across workers. The Citus documentation describes the coordinator/worker model in detail. The catch is that cross-shard queries and cross-tenant transactions become distributed operations with their own coordination cost — the same fundamental tax as Option B, just opt-in.
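Partition pruning is easier to reason about with a toy model. The sketch below mimics the idea for monthly range partitions; the partition names and boundaries are hypothetical, and the real planner works from partition bounds in the catalog, not a Python list.

```python
import bisect
from datetime import date

# Hypothetical monthly partitions; each date is the inclusive start of one.
PARTITION_STARTS = [date(2026, 1, 1), date(2026, 2, 1), date(2026, 3, 1), date(2026, 4, 1)]
PARTITION_NAMES = ["events_2026_01", "events_2026_02", "events_2026_03", "events_2026_04"]


def partitions_for(start: date, end: date) -> list[str]:
    """Partitions that can hold rows with start <= ts < end.

    The last partition is treated as open-ended in this sketch.
    """
    if start >= end:
        return []
    first = max(bisect.bisect_right(PARTITION_STARTS, start) - 1, 0)
    last = bisect.bisect_left(PARTITION_STARTS, end) - 1
    return PARTITION_NAMES[first:last + 1]


# A ten-day window inside February touches one child table, not four.
print(partitions_for(date(2026, 2, 10), date(2026, 2, 20)))
# A window ending exactly on a boundary does not spill into the next partition.
print(partitions_for(date(2026, 1, 15), date(2026, 3, 1)))
```

The same boundary lookup is what makes dropping old data cheap: retiring a month is an operation on one child table instead of a large delete.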

Figure 3: Three tenancy models. Isolation, density, and operational cost trade off against each other; the right choice depends on tenant count and blast-radius tolerance.

Tenancy models decide more than you think

Figure 3 shows the three ways to lay out multi-tenant data, and the choice interacts strongly with both options. Database-per-tenant gives the strongest isolation and trivial per-tenant restore, but connection counts and catalog overhead explode past a few hundred tenants and it strains any pooler. Schema-per-tenant gives clean logical separation and easy per-tenant operations, but tens of thousands of schemas bloat the system catalog and slow planning. Row-level with a tenant_id column gives maximum density and one schema to migrate, at the cost of needing disciplined row-level security and every index and query carrying the tenant filter. For high tenant counts, row-level is usually the answer — and it is also the model that maps cleanly onto Citus distribution keys and onto distributed SQL range locality, so the tenancy decision and the scaling decision are not independent.

Trade-offs, Gotchas, and What Goes Wrong

The surprising latency is the first thing teams underestimate. Engineers benchmark a distributed SQL cluster in a single AZ, see numbers close to Postgres, and conclude the tax is negligible — then deploy multi-region and watch p99 write latency multiply as every commit waits on a cross-region quorum. The consensus cost was always there; the single-AZ test just hid it. Always benchmark in the topology you will actually run.

Migration cost is the second. Postgres-wire compatibility is real but not total. Sequences behave differently, some extensions are unavailable, certain pg_catalog internals and advisory-lock patterns differ, and query plans diverge because the optimizer is a different piece of software reasoning about distributed data. A query that was instant on Postgres can become a full-cluster scan on a distributed engine if it can’t be pushed down to the right ranges. Plan on re-validating your hot queries, not just re-pointing the connection string.

Operational maturity cuts the other way from the marketing. Distributed SQL is genuinely more self-healing — lose a node and the cluster re-replicates ranges and keeps serving without a manual failover. But when something goes wrong at the consensus or rebalancing layer, you are debugging Raft leadership, range hotspots, and clock skew, and the pool of engineers who can do that at 3 a.m. is far smaller than the pool who can read a Postgres EXPLAIN. You are trading a well-understood failure mode (primary failover) for an unfamiliar one (distributed-systems pathology).

Then there is cost and lock-in. Three-way replication means your storage and write work are multiplied before you add any nodes for capacity, so a distributed SQL cluster is structurally more expensive than one big primary plus replicas for the same logical dataset. A Postgres primary with two replicas also stores the data three times, but only the primary executes write transactions and the replicas are comparatively cheap followers that replay its WAL; in a distributed cluster every node is doing consensus work, so the compute multiplier is real, not just a storage one. Costs blow up further when cross-region traffic meets cloud data-transfer pricing: every cross-region consensus round is network egress, billed per gigabyte, and it adds up quickly under write-heavy multi-region load. And while the Postgres wire protocol eases entry, the operational surface, tuning knobs, and behavioral quirks are vendor-specific; leaving is not as simple as arriving, because your schema choices, retry logic, and topology assumptions all bake in over time.

One more gotcha deserves naming: schema changes. On a single Postgres primary, most DDL is fast and well understood, and the community has mature online-migration tooling. In a distributed cluster, schema changes are themselves distributed operations that roll out across ranges and nodes, and while both CockroachDB and YugabyteDB support online schema change, the semantics, timing, and failure behavior differ from what a Postgres-native team expects. Do not assume your existing migration playbook transfers unchanged.

Practical Recommendations

Choose PostgreSQL unless you can name the specific requirement that Postgres cannot meet. “We might scale someday” is not that requirement; “we contractually must accept strongly consistent writes in three regions and survive a full region loss without manual failover” is. The default is Option A, and the burden of proof sits on Option B.

Figure 4: The decision path. Single-region and within a primary’s write ceiling routes to Postgres; hard multi-region writes or beyond-single-node throughput routes to distributed SQL.

Work Figure 4 top to bottom. If you are single-region and below a single primary’s write ceiling, choose Postgres with replicas — done. If you are single-region but past the write ceiling, and you can shard cleanly by tenant, choose Postgres plus Citus before you leave the ecosystem. Only if you need multi-region writes, or cannot shard by tenant and are genuinely past a single node, does distributed SQL earn the adoption.
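The decision path is small enough to write down as a function, which is a useful way to check that your team agrees on the inputs. This is a sketch of the path as described above, with hypothetical field names; it is an aid to discussion, not a substitute for benchmarking.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Requirements:
    needs_multi_region_writes: bool          # writes, not just reads, in several regions
    past_single_primary_write_ceiling: bool  # measured, not projected
    can_shard_by_tenant: bool                # workload is mostly single-tenant


def recommend(req: Requirements) -> str:
    """Walk the decision path top to bottom and return the option it lands on."""
    if req.needs_multi_region_writes:
        return "distributed SQL"
    if not req.past_single_primary_write_ceiling:
        return "PostgreSQL + replicas"
    if req.can_shard_by_tenant:
        return "PostgreSQL + Citus"
    return "distributed SQL"


# Single-region, under the ceiling: the default answer.
print(recommend(Requirements(False, False, True)))
# Single-region, past the ceiling, tenant-shardable: stay in the ecosystem.
print(recommend(Requirements(False, True, True)))
```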

Decision checklist — choose Postgres unless:
– You need low-latency writes served from multiple regions, not just reads.
– You must survive the loss of an entire region with no manual failover.
– Your write throughput is provably beyond a single large primary and cannot be sharded by tenant.
– Your team has (or will hire) genuine distributed-systems operational depth.
– You have benchmarked the distributed option in your real multi-region topology and accept the write-latency floor.

If fewer than two of those are true, Option A is almost certainly correct, and you should invest the saved complexity budget in partitioning, pooling, and a tested failover runbook instead. Revisit the decision at each 10x of growth, not continuously — thrash is its own cost. Pair this with your broader data-platform choices, such as how you store analytical data in the Iceberg vs Delta vs Hudi lakehouse ADR, so the transactional and analytical layers are decided coherently rather than one panic at a time.

Frequently Asked Questions

Is CockroachDB just a faster PostgreSQL?

No. CockroachDB speaks the PostgreSQL wire protocol and supports much of its SQL, but it is a different engine: range-based storage, Raft consensus per range, serializable isolation, and hybrid logical clocks. That makes it horizontally scalable and multi-region capable, but every write pays a consensus round and cross-region writes inherit inter-region latency as a floor. For single-region OLTP a Postgres primary is typically lower latency and cheaper. Treat CockroachDB as a distributed system with a Postgres-shaped front door, not as Postgres with the scaling limits removed.

When does a single Postgres primary actually run out of write capacity?

Later than most teams assume. A large cloud instance with fast NVMe, generous RAM for cache, PgBouncer transaction pooling, and well-designed indexes handles tens of thousands of OLTP transactions per second for many workloads. You usually hit connection-management, vacuum, or IOPS problems — all fixable — before you hit a true single-node write wall. Native partitioning and then Citus sharding extend the ceiling further. Benchmark your real workload before assuming you’ve reached the limit; the limit is often a tuning problem in disguise, not an architectural one.

How is YugabyteDB’s architecture different from CockroachDB?

YugabyteDB separates its layers explicitly. Its query layer, YSQL, reuses actual PostgreSQL query-processing code, which improves compatibility. Underneath sits DocDB, a distributed document store that shards data into tablets and replicates each tablet via Raft. CockroachDB is more monolithic, with its own SQL layer over range-based storage and HLC-ordered serializable transactions. Both give horizontal scale and multi-region strong consistency through Raft, but Yugabyte’s reuse of Postgres query code is its headline compatibility argument. See the vendors’ own architecture docs for the authoritative details rather than relying on summaries.

What is the real latency cost of distributed consensus?

In a single region with nearby nodes, consensus adds roughly a millisecond or two per write versus a local fsync — usually acceptable. Across regions it is dominated by network round-trip time: a write cannot commit faster than the trip to the nearest replica that completes a majority, so distant regions add tens of milliseconds per commit, unavoidably. Follower reads reduce read latency but not write latency. If your workload is write-heavy, latency-sensitive, and multi-region, budget for that floor explicitly — it is physics plus quorum math, not a tuning artifact.

Can I stay on Postgres and still go multi-region?

Partially. You can place read replicas in other regions to serve low-latency local reads, and you can run disaster-recovery standbys elsewhere. What single-primary Postgres cannot do is accept low-latency writes in multiple regions simultaneously with strong consistency — all writes still funnel to the one primary’s region. If local reads and DR are enough, Postgres covers it. If you truly need active-active multi-region writes, that is the requirement that justifies distributed SQL, and it is worth confirming the need is real before paying for it.

Does Citus make Postgres a distributed SQL database?

Citus makes Postgres horizontally scalable for the right workloads, but it is not the same as CockroachDB or YugabyteDB. It shards tables across workers from a coordinator and keeps the Postgres wire protocol and most SQL. For multi-tenant SaaS sharded by tenant_id, single-tenant queries stay fast and local while storage and write load spread across workers. What Citus does not natively give you is transparent multi-region strong-consistency writes with automatic region-loss survival; cross-shard transactions still carry distributed-coordination cost. It extends the Postgres model rather than replacing its consistency semantics.
