10. Replication: Single-Leader, Multi-Leader, Leaderless
~6 min read·updated 5/29/2026
Replication = keeping the same data on multiple machines. Why? Availability (survive node loss), read scaling (more replicas = more read capacity), latency (replica close to user), disaster recovery (cross-region copies).
The hard part is what happens when writes conflict, replicas fall behind, or a leader fails.
10.1 The three flavors
| Architecture | Examples |
|---|---|
| Single-leader | Postgres, MySQL, MongoDB (replica set), Kafka partition |
| Multi-leader | Bidirectional MySQL/Postgres setups, CouchDB |
| Leaderless | Dynamo, Cassandra, Riak |
10.2 Single-leader (primary/replica, "master/slave")
One node is the leader; clients send writes only to the leader. The leader propagates the changes to followers. Reads can go to any node (leader or follower).
Synchronous vs asynchronous replication
- Sync: leader waits for follower acknowledgment before committing. No data loss on leader failure. But if the follower is slow/down, the leader stalls.
- Async: leader commits immediately, ships to followers in the background. Fast. Leader failure can lose un-shipped writes.
- Semi-sync: at least one follower must ack. Compromise. Used by MySQL, Postgres (`synchronous_standby_names`).
In practice, most deployments run async replication, optionally with one synchronous (semi-sync) follower for safety.
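The three modes differ only in how many follower acks the leader waits for before reporting a commit. A minimal sketch of that idea, assuming a toy in-memory model (the `AckFollower` class and `commit` function are hypothetical names, and a real leader would block or time out rather than return `False`):

```python
from dataclasses import dataclass, field


@dataclass
class AckFollower:
    name: str
    up: bool = True
    log: list = field(default_factory=list)

    def replicate(self, entry) -> bool:
        """Apply the entry and return True as the ack; a down follower never acks."""
        if self.up:
            self.log.append(entry)
        return self.up


def commit(entry, followers, required_acks):
    """Return True if the leader may report the write as committed.

    required_acks = 0               -> async
    required_acks = 1               -> semi-sync
    required_acks = len(followers)  -> fully sync
    """
    acks = sum(f.replicate(entry) for f in followers)
    return acks >= required_acks


followers = [AckFollower("f1"), AckFollower("f2", up=False)]
async_ok = commit("w1", followers, required_acks=0)       # commits with no acks needed
semi_sync_ok = commit("w2", followers, required_acks=1)   # commits because f1 acked
full_sync_ok = commit("w3", followers, required_acks=2)   # cannot commit while f2 is down
```

With one follower down, the async and semi-sync writes succeed while the fully synchronous write cannot, which is the stall described above.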
Implementing replication
- Statement-based: replicate the SQL statement. Breaks on nondeterministic functions (`NOW()`, `RAND()`), triggers, side effects. Mostly abandoned.
- WAL shipping (Write-Ahead Log): copy the byte-level log. Tightly couples to storage version. Postgres native.
- Logical (row-based) replication: capture row-level changes (insert/update/delete with values). Decoupled from storage. MySQL row-based binlog, Postgres logical replication. Enables CDC.
- Trigger-based: app or DB triggers write to a side table; ext process ships. Flexible, slow.
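To see why statement-based replication breaks on nondeterministic functions, here is a small sketch that assumes a dict stands in for a table and a seeded `random.Random` stands in for each node's `RAND()` state. The helper names are illustrative, not any database's API.

```python
import random


def apply_statement(db, rng):
    # Statement-based: each node re-executes "SET token = RAND()" on its own.
    db["token"] = rng.random()


def apply_row_change(db, change):
    # Row-based: the follower applies the value the leader already computed.
    db[change["column"]] = change["value"]


leader, follower_stmt, follower_row = {}, {}, {}

apply_statement(leader, random.Random(1))          # leader's RNG state
apply_statement(follower_stmt, random.Random(2))   # follower's RNG state differs

apply_row_change(follower_row, {"column": "token", "value": leader["token"]})

statement_diverged = follower_stmt["token"] != leader["token"]   # replicas disagree
row_matches = follower_row["token"] == leader["token"]           # replicas agree
```

Shipping the resulting row value instead of the statement is what keeps the logical (row-based) follower identical to the leader.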
Adding a new follower
Snapshot the leader at a known LSN (log sequence number) → copy snapshot to follower → follower starts streaming WAL from that LSN. No downtime needed.
Handling follower failure
Follower has its log up to LSN X. On reconnect, requests changes since X from leader. Leader retains WAL until all followers catch up (or hits configured retention).
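Bootstrapping a new follower and recovering a disconnected one use the same mechanism: replay the log from a known LSN. A minimal sketch, assuming an in-memory list as the WAL and LSNs that simply count entries (`LogLeader` and `LogFollower` are hypothetical names):

```python
class LogLeader:
    def __init__(self):
        self.wal = []      # wal[i] holds the change with LSN i + 1
        self.state = {}

    def write(self, key, value):
        self.wal.append((key, value))
        self.state[key] = value

    def snapshot(self):
        """Return a copy of the data plus the LSN it corresponds to."""
        return dict(self.state), len(self.wal)

    def changes_since(self, lsn):
        return self.wal[lsn:]


class LogFollower:
    def __init__(self, snapshot, lsn):
        self.state, self.lsn = snapshot, lsn

    def catch_up(self, leader):
        """Apply every change after our LSN; used for bootstrap and for reconnect."""
        for key, value in leader.changes_since(self.lsn):
            self.state[key] = value
            self.lsn += 1


leader = LogLeader()
leader.write("a", 1)
follower = LogFollower(*leader.snapshot())   # bootstrap from a snapshot at LSN 1
leader.write("b", 2)                         # writes continue while the copy runs
leader.write("a", 3)
follower.catch_up(leader)                    # stream the changes after LSN 1
```

The sketch keeps the whole log forever; a real leader has to bound WAL retention, which is why a follower that stays away too long needs a fresh snapshot.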
Handling leader failure (failover)
Hard. Steps:
- Detect failure (timeout heuristics; risky — gray failures fool you).
- Promote a replica (the most up-to-date one).
- Reconfigure clients to write to the new leader.
- Old leader recovers; demote it.
Things that go wrong:
- Data loss: async replicas are behind; promoted leader missing writes that the old leader committed.
- Split brain: both old and new leaders alive simultaneously, accepting conflicting writes. The usual remedy is fencing, e.g. STONITH ("Shoot The Other Node In The Head"): brutal but necessary.
- Operator confusion: stale dashboards, wrong promotion target.
GitHub had a famous 2018 outage from a botched MySQL failover; Postgres has Patroni / pg_auto_failover; managed services (RDS, Aurora, Cloud SQL) hide most of this.
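The promotion step and the split-brain hazard can be sketched together. This example picks the most up-to-date healthy replica and then uses an epoch number (a fencing token) so that a stale leader's writes are rejected. Epoch fencing is a different technique from STONITH, swapped in here because it is easy to show in code; all class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    lsn: int              # last log position this replica has applied
    healthy: bool = True


def pick_new_leader(replicas):
    """Promote the healthy replica with the highest LSN to minimise lost writes."""
    candidates = [r for r in replicas if r.healthy]
    if not candidates:
        raise RuntimeError("no healthy replica to promote")
    return max(candidates, key=lambda r: r.lsn)


class FencedStorage:
    """Accepts writes only from the newest epoch, which fences off a stale leader."""

    def __init__(self):
        self.current_epoch = 0
        self.data = {}

    def write(self, epoch, key, value):
        if epoch < self.current_epoch:
            return False          # stale leader: reject instead of diverging
        self.current_epoch = epoch
        self.data[key] = value
        return True


replicas = [Candidate("r1", 120), Candidate("r2", 135), Candidate("r3", 140, healthy=False)]
new_leader = pick_new_leader(replicas)        # r2: r3 is further ahead but unhealthy

storage = FencedStorage()
storage.write(epoch=1, key="x", value="old leader")
storage.write(epoch=2, key="x", value="new leader")                # after promotion
stale_accepted = storage.write(epoch=1, key="x", value="zombie")   # rejected
```

Note that the writes between LSN 135 and 140 are exactly the data-loss case from the list above: they existed only on nodes that did not become leader.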
Read scaling: replication lag
Async replication ⇒ followers lag the leader. Lag = seconds, sometimes minutes under load. Three anomalies follow, each named after the guarantee it breaks:
- Read your own writes: user updates profile, immediately reads it from a replica that hasn't caught up → sees old data. Solution: route their reads to the leader for a window post-write, or use a session token (replica refuses if its log < session token).
- Monotonic reads: user refreshes; first read hits replica A (caught up), second hits replica B (behind) → time appears to go backward. Solution: route the user's reads to the same replica.
- Consistent prefix: causally ordered events appear out of order. (Question/answer chat with reply arriving before question.) Solution: same partition for causally related ops; or use causal consistency primitives.
These are consistency models, covered in chapter 12.
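The first two fixes are routing decisions, so they fit in a few lines. A minimal sketch, assuming the session token is the LSN of the user's last write and that each replica reports the LSN it has applied (`ReadReplica`, `pick_replica`, and `route_read` are hypothetical names):

```python
import hashlib


class ReadReplica:
    def __init__(self, name, applied_lsn):
        self.name, self.applied_lsn = name, applied_lsn


def pick_replica(user_id, replicas):
    """Monotonic reads: hash the user to one replica so reads never jump backward."""
    digest = hashlib.sha256(user_id.encode()).digest()
    return replicas[int.from_bytes(digest[:8], "big") % len(replicas)]


def route_read(user_id, session_lsn, replicas, leader):
    """Read-your-writes: use the leader if the chosen replica is behind the session token."""
    replica = pick_replica(user_id, replicas)
    return replica if replica.applied_lsn >= session_lsn else leader


leader = ReadReplica("leader", applied_lsn=100)
lagging = [ReadReplica("a", applied_lsn=90), ReadReplica("b", applied_lsn=90)]

# The user just wrote at LSN 95; both replicas are behind, so the read goes to the leader.
target = route_read("alice", session_lsn=95, replicas=lagging, leader=leader)
```

Sticky routing by hash stops working if the pinned replica fails and the user is moved to one that is further behind, so the session-token check is still worth keeping.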
10.3 Multi-leader
More than one node accepts writes. Each leader replicates its writes to the others.
When to use
- Multi-datacenter: leader in each DC; local writes are fast; cross-DC replication is async. Used by some MySQL/Postgres deployments and CouchDB.
- Offline clients: each client is a "leader" of its local data; sync on reconnect (think Apple Notes, Trello).
- Collaborative editing: similar to offline; CRDTs / OT (Google Docs).
Conflict resolution
The hard problem of multi-leader: same row updated on two leaders concurrently.
Strategies:
- Last-writer-wins (LWW): pick by timestamp. Loses data silently. Common but dangerous.
- App-defined merge: merge function reconciles (e.g., sum counters, union sets).
- CRDTs (Conflict-free Replicated Data Types): data structures designed to merge deterministically. Counters, sets, maps. Used in Riak, Redis (CRDT module), collaborative editing.
- OT (Operational Transformation): alternative; transform operations against each other. Google Docs uses OT.
- Manual reconciliation: surface conflicts to the user (Git merge conflicts, Trello "version conflict").
Multi-leader adds operational complexity. Most teams should not pick it lightly.
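The difference between LWW and a CRDT is easiest to see on a counter. The sketch below assumes two leaders (`us` and `eu`, hypothetical names) that each record likes concurrently, starting from 10. `GCounter` is a standard grow-only counter, written here as an illustration rather than any library's API.

```python
def lww_merge(a, b):
    """Last-writer-wins on (timestamp, value) pairs: the older write is discarded."""
    return max(a, b, key=lambda pair: pair[0])


class GCounter:
    """Grow-only counter CRDT: one slot per node, merge takes the per-node maximum."""

    def __init__(self, counts=None):
        self.counts = dict(counts or {})

    def increment(self, node, amount=1):
        self.counts[node] = self.counts.get(node, 0) + amount

    def merge(self, other):
        nodes = self.counts.keys() | other.counts.keys()
        return GCounter({n: max(self.counts.get(n, 0), other.counts.get(n, 0))
                         for n in nodes})

    def value(self):
        return sum(self.counts.values())


# LWW: one leader wrote 13 at t=1001, the other wrote 12 at t=1002.
lww_result = lww_merge((1001, 10 + 3), (1002, 10 + 2))   # keeps (1002, 12); 3 likes are lost

# CRDT: each leader increments only its own slot, so nothing is overwritten.
us, eu = GCounter({"us": 10}), GCounter({"us": 10})
us.increment("us", 3)
eu.increment("eu", 2)
merged = us.merge(eu)     # value() is 15, and eu.merge(us) yields the same counts
```

Because the merge is a per-node maximum, it is commutative and idempotent, so replicas converge regardless of the order in which they exchange state.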
10.4 Leaderless replication (Dynamo style)
Any replica accepts writes. There is no leader. Reads contact multiple replicas; clients (or a coordinator) reconcile.
Examples: Dynamo (the 2007 Amazon paper, not the DynamoDB service), Cassandra, Riak, Voldemort.
Quorum reads/writes
N = number of replicas; W = writes acked before success; R = reads contacted.
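Strong consistency (every read sees the latest acknowledged write) requires W + R > N, so that any read set and any write set share at least one replica.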
- N=3, W=2, R=2 → strong (overlap of 1 replica between any read and write).
- N=3, W=1, R=1 → fast but eventual.
- N=3, W=3, R=1 → fast reads, slow writes.
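The overlap claim can be checked by brute force for small N. This sketch computes the minimum guaranteed overlap, W + R - N, and verifies it by enumerating every possible write set and read set. The function names are illustrative.

```python
from itertools import combinations


def min_overlap(n, w, r):
    """Smallest possible number of replicas shared by a write set and a read set."""
    return max(0, w + r - n)


def always_overlaps(n, w, r):
    """Brute force: does every W-subset intersect every R-subset of N replicas?"""
    nodes = range(n)
    return all(set(ws) & set(rs)
               for ws in combinations(nodes, w)
               for rs in combinations(nodes, r))


results = {cfg: (min_overlap(*cfg), always_overlaps(*cfg))
           for cfg in [(3, 2, 2), (3, 1, 1), (3, 3, 1)]}
# (3, 2, 2) -> (1, True)    quorum reads and writes always share a replica
# (3, 1, 1) -> (0, False)   a read can miss the only replica that took the write
# (3, 3, 1) -> (1, True)    every replica has the write, so any single read sees it
```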
Sloppy quorum + hinted handoff
If a target replica is down, the write goes to a "stand-in" replica with a hint. When the original comes back, the hint is replayed. Improves availability; weakens consistency (a quorum may not actually overlap with future reads).
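A minimal sketch of hinted handoff, assuming in-memory nodes and a fixed list of stand-ins that are themselves up (`HintNode`, `sloppy_write`, and `replay_hints` are hypothetical names). It shows why the ack count can reach W even though fewer than W home replicas hold the value.

```python
class HintNode:
    def __init__(self, name):
        self.name, self.up = name, True
        self.data = {}
        self.hints = []     # (intended_owner, key, value) parked for a down node


def sloppy_write(key, value, home_nodes, standins):
    """Write to each home replica; if one is down, park the write on a stand-in."""
    spare = iter(standins)
    acks = 0
    for node in home_nodes:
        if node.up:
            node.data[key] = value
            acks += 1
        else:
            standin = next(spare, None)
            if standin is not None:
                standin.hints.append((node, key, value))
                acks += 1       # counts toward W, but no home replica holds it yet
    return acks


def replay_hints(standin):
    """Hand parked writes back to owners that have recovered."""
    remaining = []
    for owner, key, value in standin.hints:
        if owner.up:
            owner.data[key] = value
        else:
            remaining.append((owner, key, value))
    standin.hints = remaining


a, b, c, d = HintNode("a"), HintNode("b"), HintNode("c"), HintNode("d")
c.up = False
acks = sloppy_write("k", "v1", home_nodes=[a, b, c], standins=[d])   # 3 acks, one via d
c.up = True
replay_hints(d)     # c now holds "k" and d's hint list is empty
```

Until `replay_hints` runs, a read that contacts `c` would miss the write even though the write was acknowledged three times, which is the weakened overlap mentioned above.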
Read repair + anti-entropy
- Read repair: on read, if responses disagree, write the latest version to the stale replica.
- Anti-entropy (background): a process compares replicas (Merkle trees, see chapter 17) and resyncs differences.
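Read repair can be sketched in a few lines. This assumes each replica is a dict mapping a key to a `(version, value)` pair with a simple integer version. Real systems compare timestamps or vector clocks instead, and `read_with_repair` is an illustrative name.

```python
def read_with_repair(key, replicas):
    """Read from all given replicas, return the newest value, and fix stale copies."""
    responses = [(replica, replica.get(key, (0, None))) for replica in replicas]
    newest = max((versioned for _, versioned in responses), key=lambda v: v[0])
    for replica, versioned in responses:
        if versioned[0] < newest[0]:
            replica[key] = newest       # repair the stale replica in passing
    return newest[1]


r1 = {"user:1": (2, "new@example.com")}
r2 = {"user:1": (1, "old@example.com")}
r3 = {}                                  # missed the write entirely
value = read_with_repair("user:1", [r1, r2, r3])
```

Read repair only fixes keys that are actually read, which is why the background anti-entropy process is still needed for cold data.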
Concurrent writes & vector clocks
Two clients write to different replicas concurrently. Without ordering, you get conflicts. Vector clocks (chapter 14) detect concurrency; the system either picks one (LWW), keeps both (siblings), or asks the app to merge.
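Detecting concurrency with vector clocks comes down to a pairwise comparison. A minimal sketch, assuming each clock is a dict from replica name to counter with missing entries treated as 0 (the replica names `r1` and `r2` are made up for the example):

```python
def compare(a, b):
    """Return 'before', 'after', 'equal', or 'concurrent' for two vector clocks."""
    nodes = a.keys() | b.keys()
    a_le_b = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    b_le_a = all(b.get(n, 0) <= a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"
    if b_le_a:
        return "after"
    return "concurrent"


base = {"r1": 1}
write_x = {"r1": 2}               # client X updates via replica r1
write_y = {"r1": 1, "r2": 1}      # client Y updates via replica r2 from the same base

causal = compare(base, write_x)          # "before": write_x supersedes base
conflict = compare(write_x, write_y)     # "concurrent": keep siblings or merge
```

Only the `"concurrent"` case needs conflict handling; when one clock is strictly before the other, the older version can be dropped safely.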
When leaderless is right
- Multi-region writes with high availability.
- Workloads tolerant of conflict resolution / convergence.
- Predictable, uniform write distribution.
When it's wrong
- Strict ordering or consistency required.
- Frequent concurrent updates to the same key.
10.5 Cross-region replication
Three patterns:
- Single-leader in one region; async followers in others. Reads local, writes cross-region. Failover slow. Common.
- Multi-leader, one per region. Writes local in each. Conflict-prone.
- Synchronous quorum across regions (Spanner, CockroachDB). Strong consistency globally; writes pay cross-region RTT. Expensive but powerful.
Spanner's trick: TrueTime (chapter 14) lets it order transactions globally without a single global leader (each shard still has its own Paxos leader), while still serializing transactions.
10.6 Backups vs replication
Replication ≠ backup.
- A replica copies your writes, including `DELETE FROM users`.
- Backups protect against logic errors, ransomware, human deletes.
- Always have point-in-time-recovery (PITR) backups separate from replicas.
- Test restores. An untested backup is not a backup.
10.7 Operational notes
- Monitor replication lag as a first-class SLI. Pages should fire at ~10-30s lag.
- Failover drills: schedule them; practice in staging. Inexperienced teams fumble the first real failover.
- Connection routing: clients need to discover the new leader. Use a proxy (PgBouncer with leader detection, ProxySQL), service discovery (Consul/etcd), or managed solution.
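As a sketch of the lag-alerting rule, the check below pages only when lag stays above the threshold for several consecutive samples, so a single spike does not wake anyone. The 30 s default comes from the range above; the `sustained` parameter and the function name are assumptions added for illustration.

```python
def should_page(lag_samples, threshold_s=30.0, sustained=3):
    """Page only if the last `sustained` lag samples all exceed the threshold."""
    recent = lag_samples[-sustained:]
    return len(recent) == sustained and all(lag > threshold_s for lag in recent)


spike = should_page([1.2, 45.0, 2.0, 1.5])             # one spike, then recovery: no page
sustained_lag = should_page([5.0, 31.0, 40.0, 55.0])   # three samples over 30 s: page
```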
10.8 What Google interviewers want
- You can articulate sync vs async trade-offs.
- You know failover is hard and what fails.
- You can reason about read-your-writes and explain how to fix it.
- You know quorum math (W + R > N).
- You know when you'd pick multi-leader (rarely) and what you give up.
Key takeaways
- Single-leader is the default. Async + semi-sync is the production sweet spot.
- Replication lag forces consistency-model decisions: read-your-writes, monotonic, consistent prefix.
- Multi-leader trades simplicity for write availability. Resolve conflicts deliberately.
- Leaderless (Dynamo) trades strict ordering for write availability + uniform load. Quorums (W + R > N) make reads overlap the latest write, but sloppy quorums and concurrent writes still weaken the guarantee.
- Replication ≠ backup.