Every serious database supports replication. The configuration options sound similar across systems — primary, replica, sync, async, logical, physical, quorum — but the choices you make under those names determine how your application behaves when things break. Replication is where database internals leak into application design: the difference between a write being “durable” and “durable enough to survive the primary going down” is the difference between losing one transaction and losing fifteen minutes of orders.
This post walks through how replication actually works in PostgreSQL, the topology and consistency choices you’ll have to make, and a decision tree for picking what to use when. We’ll stick to Postgres for code, but the patterns transfer to MySQL, MongoDB, and most other systems with minor adjustments.
This is Part 2 of the Distributed Systems series. Part 1 covered sharding, and assumed you understood replication well enough to design the migration path. If that section felt fast, this is where it slows down.
1. Why Replicate
Five reasons, in roughly the order they show up as a system grows.
- High availability. The primary dies — the replica is promoted and serves traffic. Without a replica, you’re restoring from backup and apologizing to users for an hour.
- Read scaling. Analytics dashboards, reporting jobs, and bulk exports run against replicas instead of competing with transactional traffic on the primary.
- Geographic distribution. A replica in another region serves local reads at 5ms instead of 150ms across the ocean.
- Disaster recovery. A replica in another data centre (or another cloud) means a fire, a misconfiguration, or a deletion in one location doesn’t end your business.
- Major-version upgrades. Logical replication lets you stream from a Postgres 14 primary to a Postgres 17 replica, then cut over with seconds of downtime. The alternative is
pg_upgradewith hours of downtime, and a much scarier rollback story.
Most teams reach for replication for HA first, then add read replicas as they grow, then geography when international users complain.
2. The Vocabulary
Before any setup, the words. A lot of replication confusion comes from these terms being used loosely.
Primary (or leader, master in older docs) — the node that accepts writes. Postgres has exactly one primary per cluster. The modern term is primary; master is being phased out across most projects.
Replica (or standby, follower, secondary) — a node that receives the primary’s changes and applies them locally. May or may not serve reads, depending on configuration.
Write-Ahead Log (WAL) — Postgres writes every change to the WAL before applying it to the heap. Replication, at its core, is “ship the WAL from primary to replica, and replay it there.”
Replication lag — how far behind the replica is, measured either in bytes (how much WAL hasn’t been applied yet) or time (how long ago the latest applied transaction happened on the primary). Lag is almost always non-zero in async setups.
Synchronous vs asynchronous — does the primary wait for the replica to acknowledge each commit before returning success to the client? Sync = wait (safer, slower). Async = don’t wait (faster, you may lose committed data on a primary crash).
Physical replication — replicas are byte-for-byte copies of the primary at the storage level. The WAL is shipped and replayed as-is. Same Postgres version on both sides required.
Logical replication — replicas receive row changes (INSERT, UPDATE, DELETE) decoded from the WAL, not raw byte changes. The replica can be a different Postgres major version, a different schema (with caveats), or even a different database system.
Topology — the shape of the replication relationships. Single-leader (one primary, many replicas), multi-leader (multiple primaries replicating to each other), or leaderless (no designated primary).
Quorum — a write or read is considered successful if it’s acknowledged by a majority (or some configured threshold) of nodes. Used in leaderless systems and in Postgres synchronous-with-N configurations.
Consistency model — what a client can observe across reads after a write. Strong / linearizable means “as if there’s one machine.” Eventual means “you’ll see it eventually.” More nuance in Section 6.
3. The Three Real Replication Strategies
There are technically four ways databases replicate. Three of them matter.
Statement-based (historical, mostly avoided)
The primary writes a log of every SQL statement; replicas replay them. MySQL’s original replication worked this way.
Problem: non-deterministic statements. INSERT INTO logs (id, created_at) VALUES (DEFAULT, NOW()) produces different results on every machine. Same with RAND(), UUID(), statements with LIMIT and no ORDER BY, and anything depending on table contents. You end up with replicas that silently diverge.
Postgres has never used statement-based replication. MySQL moved away from it as the default in 5.6+. If your database advertises “SQL replication,” investigate carefully — you’re almost certainly looking at this category.
Physical / WAL Streaming (Postgres default)
The primary streams its raw WAL records to the replica. The replica writes them to disk and replays them, producing a byte-for-byte identical copy of the primary’s storage.
# postgresql.conf on the primary
wal_level = replica # WAL detail level (replica is enough for streaming)
max_wal_senders = 10 # how many concurrent replicas can stream
max_replication_slots = 10 # slots reserved for replicas
# On the new replica machine: clone the primary's data directory
pg_basebackup -h primary.internal -U replicator -D /var/lib/postgresql/data \
-P -R -X stream
# pg_basebackup with -R creates standby.signal and configures
# primary_conninfo, so the replica starts up and immediately
# starts streaming.
When to use: the default for HA and read replicas. It’s fast, low-overhead, and supports streaming replication of every operation including DDL (Data Definition Language — ALTER TABLE, CREATE INDEX, etc.).
When not to use:
- Replica needs to be a different Postgres major version → use logical instead.
- Replica needs a different schema (some tables only, transformed columns) → use logical.
- Replicating between Postgres and a different database system → not possible with physical.
Logical Replication (publication / subscription)
The primary’s WAL is decoded into a stream of row-level changes (row INSERTED into orders, column updated on users.id=42) which subscribers consume. Subscribers can be different Postgres versions, different schemas, or external systems.
-- on the primary (publisher)
ALTER SYSTEM SET wal_level = 'logical'; -- requires a restart
SELECT pg_reload_conf();
CREATE PUBLICATION app_pub FOR TABLE users, orders, payments;
-- or: CREATE PUBLICATION all_pub FOR ALL TABLES;
-- on the replica (subscriber)
CREATE SUBSCRIPTION app_sub
CONNECTION 'host=primary.internal dbname=app user=replicator password=...'
PUBLICATION app_pub;
Once the subscription is created, Postgres performs an initial table copy, then streams ongoing changes. Lag is visible in pg_stat_subscription.
When to use:
- Major-version upgrades (PG 14 → PG 17 with no downtime).
- Replicating a subset of tables (instead of the whole cluster).
- Streaming changes to non-Postgres systems (Debezium reads logical replication slots and emits to Kafka).
- Cross-database migrations.
When not to use:
- HA — you want physical streaming for that.
- Replicating DDL changes — logical replication doesn’t carry schema changes. You apply them on both sides manually or with tooling.
- Replicating sequences — they’re not replicated. You manage them out-of-band during cutover.
Trigger-based (last-resort, application-level)
Triggers on every table capture changes into a queue, which an application process consumes. Slonik and Bucardo are the classic examples. Useful before logical replication existed (pre-Postgres 10); rarely the right answer today.
4. Sync Modes — The Latency / Durability Trade
Replication is a stream by default — the primary doesn’t wait for replicas to acknowledge anything. That’s fine for most workloads. When zero data loss matters (financial transactions, audit logs), you make it synchronous.
Postgres exposes this via synchronous_commit and synchronous_standby_names.
# postgresql.conf on the primary
# Which replicas count as synchronous, and how many must ack
synchronous_standby_names = 'ANY 2 (replica1, replica2, replica3)'
# What "ack" means at commit time
synchronous_commit = 'on' # default
synchronous_commit has five levels, ordered from fastest-and-riskiest to safest-and-slowest:
| Level | Primary returns success when… | You lose on a primary crash |
|---|---|---|
off | WAL is in memory, not yet on disk | Last few transactions, even on local disk |
local | WAL is fsync’d on the primary | Nothing locally; replicas may lag |
remote_write | WAL is received by the replica’s memory | Tiny window if replica also crashes |
on (default) | WAL is fsync’d on the replica’s disk | Effectively zero data loss on failover |
remote_apply | WAL is applied on the replica (rows visible) | Zero — replica is fully caught up |
The cost of moving up this ladder is latency. Every commit now includes a round-trip to the replica. A remote_apply setup with a replica in another region adds ~10–50ms to every write transaction. That’s fine for a billing system; catastrophic for a high-write event ingestion service.
The pragmatic answer most teams settle on:
- Default to async (
synchronous_commit = local) for typical workloads. remote_writewith one nearby replica when you want HA with low write latency.onorremote_applywith a quorum (ANY 2 (...)) for genuinely-can’t-lose-this data, with multiple replicas so any one can ack without blocking.
Two-replica gotcha: synchronous_standby_names = 'FIRST 1 (r1, r2)' means “always wait for r1 specifically.” If r1 dies, all your writes block. Use ANY 1 (r1, r2) to allow either replica to ack — much better availability profile.
5. Topologies — Single-Leader, Multi-Leader, Leaderless
The shape of who-replicates-to-whom. Three families.
Single-Leader (95% of systems)
One primary accepts all writes. N replicas receive changes. Reads can come from either, with consistency caveats.
┌──────────┐
│ Primary │ ← all writes
└────┬─────┘
WAL ───┼───────┐
┌────▼─────┐ │ ┌──────────┐
│ Replica1 │ └─► Replica2 │
└──────────┘ └──────────┘
▲ ▲
reads reads
Strengths: simple to reason about, no write conflicts, vanilla Postgres supports it natively, every transactional guarantee you’re used to still holds.
Weaknesses: the primary is a single point of failure for writes. Failover takes seconds to minutes (until a replica is promoted). Writes can’t be served from multiple regions simultaneously.
This is what your Postgres setup is using today unless you’ve actively configured otherwise. Use it until you have a specific reason not to.
Multi-Leader
Two or more primaries, each accepting writes, replicating to each other.
┌──────────┐ ┌──────────┐
│ Primary │◄───────►│ Primary │
│ us │ bi-dir │ eu │
└──────────┘ └──────────┘
▲ ▲
writes writes
Useful when writes need to be served close to users in multiple regions, or for active-active disaster recovery.
The hard part is conflict resolution. If us and eu both update users.email for the same row at roughly the same time, which one wins? Standard strategies:
- Last-write-wins — keep the update with the latest timestamp. Simple, but you silently drop one of the writes.
- Application-defined merge — application sees both versions and picks. Works for additive structures (sets, counters via CRDTs).
- Region-pinned writes — each row is “owned” by one region, only that region writes to it. Avoids conflicts by design but loses some of the locality benefit.
Postgres in this space: core Postgres doesn’t support multi-leader replication. The mature options are extensions:
- pgEdge (formerly EDB BDR) — commercial distributed Postgres with built-in multi-master and conflict resolution.
- pgactive — AWS’s open-source fork of BDR, used inside RDS Multi-AZ for active-active.
- Citus — primarily a sharding extension but supports some multi-master patterns through its coordinator.
The honest read: multi-leader Postgres is a niche, expert-only configuration. Most teams that think they need it actually need single-leader with regional read replicas — writes go cross-region to the primary, reads stay local.
Leaderless
No designated primary. Any node can accept writes and any node can serve reads. Writes are sent to multiple nodes in parallel; reads are also sent to multiple nodes and the most recent wins.
This is the Dynamo model — Cassandra, ScyllaDB, DynamoDB, Riak all use variants of it.
┌─────────┐ ┌─────────┐ ┌─────────┐
│ Node A │ │ Node B │ │ Node C │
└─────────┘ └─────────┘ └─────────┘
▲ ▲ ▲
└────────────┼────────────┘
any node accepts
reads and writes
The consistency knob is the R + W > N rule:
N= total replicas of each piece of data.W= how many replicas must acknowledge a write.R= how many replicas must respond to a read.
If R + W > N, the read and write quorums overlap by at least one node — so any read is guaranteed to see at least one node that has the latest write. That’s how leaderless systems give you “tunable consistency” without a primary.
Common configurations:
N=3, W=2, R=2 → strong-ish reads, can tolerate 1 node down
N=3, W=3, R=1 → fast reads, slower writes, can't tolerate write-side failures
N=3, W=1, R=1 → fastest possible, eventual consistency only
Conflict resolution still has to happen. Leaderless systems usually attach a version vector or timestamp to every write and use last-write-wins (sometimes vector clocks for more correctness).
Postgres in this space: Postgres does not natively do leaderless replication. The model is fundamentally incompatible with Postgres’s strict serializability and ACID transactions. If you need leaderless, you’re using Cassandra, DynamoDB, ScyllaDB, or similar — not Postgres.
Worth knowing about anyway because:
- Production systems often pair Postgres for transactional data with a leaderless store for high-write event data (session state, feature flags, telemetry). Knowing both models lets you pick the right tool.
- The R+W>N idea is portable — it shows up in custom consensus protocols, in CRDTs, and in some edge-caching strategies.
6. Consistency Models in Plain Language
Replication is when consistency models stop being academic and start being a bug in your application. Five worth knowing.
Strong consistency / linearizability. Reads always see the most recent committed write, as if there’s only one machine. Single-node Postgres gives you this for free. Across replicas, you get it only if you always read from the primary, or use synchronous replication with remote_apply, or use a system with consensus (Raft) baked in. You’d notice this when: you don’t notice it. Things just work.
Eventual consistency. Reads may return stale data, but if writes stop, all replicas eventually converge. Default for async replication. You’d notice this when: a user updates their profile, refreshes the page, and sees the old value because the page hit a replica that hasn’t caught up.
Read-your-writes consistency. A given user always sees their own writes immediately, even if other users might see stale data. Stronger than eventual, weaker than strong. You’d notice this when: you implement this poorly and a user’s “Save” button appears to do nothing on refresh — the most common cause of users repeatedly clicking Save and producing duplicates.
Monotonic reads. Once a user has seen a particular write, they will not see any earlier state of that data on a subsequent read, even if they hit a different replica. Prevents the “looks like it went backwards” experience. You’d notice this when: a user sees their order in the order list, refreshes, and it disappears because the second read hit a less-caught-up replica.
Bounded staleness. Reads may be stale, but never by more than X seconds or Y transactions. Useful when “freshness within 5 seconds” is acceptable but “minutes-old data” is not. You’d notice this when: you intentionally configure it to keep dashboards responsive without making them outright wrong.
Strong consistency is the safest default. Every weakening is something the application has to handle. Most production systems land on eventual consistency with read-your-writes and monotonic reads — the application is allowed to see slightly stale data from other users’ updates, but always sees its own writes correctly.
7. Replica Reads — The Rules
Reading from replicas is the most common way teams hurt themselves with replication. The lag is real, the application usually ignores it, and the bugs are subtle.
When stale data is fine:
- Analytics dashboards (a 10-second-old aggregate is indistinguishable from a current one).
- Recommendation systems (the model doesn’t update that often anyway).
- Search results.
- Public-facing read-only views (blog posts, product catalogs, documentation).
When stale data is not fine:
- Anything immediately after a write by the same user (“did my comment post?”).
- Authorization checks (“does this user still have admin access?”).
- Inventory or financial decisions (“does this user have funds?”).
- Any read that drives a subsequent write decision.
Four patterns for handling this:
Pattern 1: Read primary after write
The simplest correct approach. After a write, the same request reads from the primary, not a replica. The application explicitly tracks which connection to use.
# pseudocode
def update_profile(user_id, changes):
with primary_db.transaction() as tx:
tx.execute("UPDATE users SET ... WHERE id = ?", user_id, ...)
# Read from primary for the rest of this request
return tx.execute("SELECT * FROM users WHERE id = ?", user_id)
Strength: rock solid. Weakness: offloads no reads to replicas during write-paths.
Pattern 2: Sticky sessions
A user’s session is bound to a specific replica (or to the primary) for some window after a write. Used in larger systems with many replicas — keeps things consistent for that user without sending all their traffic to the primary.
Pattern 3: LSN tracking (read-your-writes)
After a write on the primary, you record the Log Sequence Number (LSN) — Postgres’s monotonically-increasing position in the WAL. On the next read, you pass the LSN; the replica delays the query (or routes to primary) until it has replayed up to that LSN.
-- after a write on the primary, record the LSN
SELECT pg_current_wal_lsn(); -- e.g., '0/3001028'
-- on the replica side, wait until that LSN has been applied
SELECT pg_wal_replay_wait('0/3001028', timeout => '500ms');
SELECT ... -- now safe to query
This is the closest thing to “always see your own writes” without sending all reads to the primary. It’s slightly more code but cleaner than sticky sessions.
Pattern 4: Read-only replica with a freshness contract
Document explicitly which workloads are allowed to use the replica (analytics, exports, reports) and which must use the primary (anything user-facing immediately after a write). Enforce in code via separate connection pools — primary_pool and replica_pool — and require any new query to be reviewed against the contract.
This is what most production teams settle on. It’s not magic; it’s just “be deliberate.”
8. Postgres Replication, End to End
Putting it together. A canonical HA setup looks like this:
┌──────────────────────────┐
│ Application tier │
└──────────────┬───────────┘
pgbouncer
┌──────────────┴───────────┐
│ Patroni-managed Postgres │
│ │
│ ┌──────────────────┐ │
│ │ Primary │ │
│ │ (accepts writes)│ │
│ └────────┬─────────┘ │
│ │ WAL stream │
│ ┌────────▼─────────┐ │
│ │ Standby #1 │ │
│ │ sync replica │ │
│ └──────────────────┘ │
│ ┌────────▼─────────┐ │
│ │ Standby #2 │ │
│ │ async replica │ │
│ │ serves reads │ │
│ └──────────────────┘ │
└──────────────────────────┘
The pieces:
- Patroni (or Stolon) — automated failover manager. Watches the primary’s health, promotes a standby when needed, reconfigures pgbouncer or DNS so the application reconnects.
- pgbouncer — connection pooler. Important not just for connection economy but because it handles the brief disconnect during failover gracefully.
- Sync replica (#1) — close-network replica with
synchronous_commit = on. Guarantees zero data loss on failover. - Async replica (#2) — further from the primary, lag is allowed. Handles read traffic, also a candidate for promotion in cross-region disaster scenarios.
Monitoring lag is non-negotiable. On the primary:
SELECT
application_name,
state,
pg_wal_lsn_diff(pg_current_wal_lsn(), sent_lsn) AS pending_bytes,
pg_wal_lsn_diff(sent_lsn, write_lsn) AS write_lag_bytes,
pg_wal_lsn_diff(write_lsn, flush_lsn) AS flush_lag_bytes,
pg_wal_lsn_diff(flush_lsn, replay_lsn) AS replay_lag_bytes,
write_lag,
flush_lag,
replay_lag
FROM pg_stat_replication;
Alert on replay_lag exceeding your SLA. A common threshold: anything over 30 seconds on a standby that serves reads triggers a page; anything over 5 minutes on any standby triggers an investigation.
Failover testing is the difference between a system that fails over and a system that advertises it can fail over. The expected discipline:
- Do a controlled failover every quarter.
- Time it. Verify the application reconnects.
- Verify replica lag was zero at the moment of promotion (otherwise you’ve lost data).
- Document the recovery time and any rough edges.
Teams that don’t do this discover during the actual outage that their Patroni config was subtly wrong, or that pgbouncer didn’t refresh DNS, or that one application server didn’t reconnect cleanly.
9. Decision Tree
| What you need | Use |
|---|---|
| HA against primary loss, single region | Single-leader, async streaming, Patroni for failover |
| HA with zero-data-loss tolerance | Single-leader with synchronous_commit = on, ANY N quorum across ≥ 2 nearby replicas |
| Read scaling for reports / analytics | Async streaming replicas, route OLAP traffic via separate connection pool |
| Major-version upgrade with seconds of downtime | Logical replication PG-old → PG-new, application cutover |
| Replicate a subset of tables to another system | Logical replication + Debezium / CDC |
| Cross-region active-active writes | Multi-leader (pgEdge / pgactive) — or rethink with regional read replicas + cross-region primary |
| High-write event data, no transactions needed | Leaderless DB (Cassandra / DynamoDB) — not Postgres |
| Strict read-your-writes after a user action | Read from primary, or LSN tracking via pg_wal_replay_wait |
| Globally-consistent reporting | Read primary only, or use synchronous quorum + read from sync replica |
10. Common Mistakes Checklist
- Treating replicas as backups. Replicas mirror corruption and accidental
DELETEs in real time. You still needpg_basebackup+ WAL archives. - Reading from replicas in a write-then-read flow without thinking about lag. Save button appears to “not work.”
- Synchronous commit on every replica. Latency dies; one slow replica blocks all writes.
-
synchronous_standby_names = 'FIRST 1 (r1)'instead of'ANY 1 (r1, r2)'. If r1 dies, writes block. - No failover automation. Manual failover during a 3 AM outage takes 30+ minutes; automated takes 30 seconds.
- Untested failover. The first real failover discovers the bug.
- No lag monitoring. You find out the replica is hours behind because someone queried it for a report.
- Multi-leader without a conflict-resolution strategy. Data silently diverges.
- Letting application code use
primary_poolandreplica_poolinterchangeably. Subtle bugs from stale reads everywhere. - Forgetting that logical replication doesn’t carry DDL. Schema changes have to be applied to both sides manually.
- Forgetting that sequences aren’t replicated by logical replication. After cutover, your
bigserialcolumns start at 1.
11. Closing
Replication is one of those topics where the building blocks are straightforward but every combination has a real-world implication. Single-leader async streaming is the right answer 95% of the time. Logical replication is the right answer for version upgrades and CDC. Multi-leader and leaderless are real but rare, and worth understanding mostly so you can recognize when someone is reaching for them prematurely.
The skill isn’t memorizing which knob does what. It’s developing the instinct for “where would lag show up here?” before you ship the feature — because every replicated database has lag, and every async system can lose its most recent writes, and the application either accounts for that or doesn’t.
Next in the Distributed Systems series: consensus protocols. When you need stronger guarantees than single-leader async can give you — distributed transactions that actually commit atomically, or leader election that doesn’t split-brain — you reach for Raft or Paxos. We’ll walk through what they actually do, why they’re hard, and where you find them in production (etcd, Consul, CockroachDB, every modern distributed system you can name).
Related reading
- Sharding Postgres: When One Database Stops Being Enough — Part 1 of this series; the architectural step that often precedes thinking about replication topology.
- Postgres Partitioning: Strategies, Advantages, and Pitfalls — how partitioning interacts with WAL streaming and logical replication.
- ACID, Read Phenomena, and Isolation Levels — the local-transaction guarantees that replication is trying (or not trying) to preserve.
- PostgreSQL Indexing Deep Dive — relevant when designing replica-only read workloads and partial-index reads against followers.