system-design/replication.md

10. Replication: Single-Leader, Multi-Leader, Leaderless

Replication = keeping the same data on multiple machines. Why? Availability (survive node loss), read scaling (more replicas = more read capacity), latency (replica close to user), disaster recovery (cross-region copi…

~6 min read·updated 5/29/2026

10. Replication: Single-Leader, Multi-Leader, Leaderless

Replication = keeping the same data on multiple machines. Why? Availability (survive node loss), read scaling (more replicas = more read capacity), latency (replica close to user), disaster recovery (cross-region copies).

The hard part is what happens when writes conflict, replicas fall behind, or a leader fails.

10.1 The three flavors

ArchitectureExamples
Single-leaderPostgres, MySQL, MongoDB (replica set), Kafka partition
Multi-leaderBidirectional MySQL/Postgres setups, CouchDB, Cassandra (with quorum), Spanner cross-region
LeaderlessDynamo, Cassandra, Riak

10.2 Single-leader (primary/replica, "master/slave")

One node is the leader; clients send writes only to the leader. The leader propagates the changes to followers. Reads can go to any node (leader or follower).

Synchronous vs asynchronous replication

  • Sync: leader waits for follower acknowledgment before committing. No data loss on leader failure. But if the follower is slow/down, the leader stalls.
  • Async: leader commits immediately, ships to followers in the background. Fast. Leader failure can lose un-shipped writes.
  • Semi-sync: at least one follower must ack. Compromise. Used by MySQL, Postgres (synchronous_standby_names).

In practice almost everyone runs async with at least one optional sync follower for safety.

Implementing replication

  1. Statement-based: replicate the SQL statement. Breaks on nondeterministic functions (NOW(), RAND()), triggers, side effects. Mostly abandoned.
  2. WAL shipping (Write-Ahead Log): copy the byte-level log. Tightly couples to storage version. Postgres native.
  3. Logical (row-based) replication: capture row-level changes (insert/update/delete with values). Decoupled from storage. MySQL row-based binlog, Postgres logical replication. Enables CDC.
  4. Trigger-based: app or DB triggers write to a side table; ext process ships. Flexible, slow.

Adding a new follower

Snapshot the leader at a known LSN (log sequence number) → copy snapshot to follower → follower starts streaming WAL from that LSN. No downtime needed.

Handling follower failure

Follower has its log up to LSN X. On reconnect, requests changes since X from leader. Leader retains WAL until all followers catch up (or hits configured retention).

Handling leader failure (failover)

Hard. Steps:

  1. Detect failure (timeout heuristics; risky — gray failures fool you).
  2. Promote a replica (the most up-to-date one).
  3. Reconfigure clients to write to the new leader.
  4. Old leader recovers; demote it.

Things that go wrong:

  • Data loss: async replicas are behind; promoted leader missing writes that the old leader committed.
  • Split brain: both old and new leaders alive simultaneously, accepting conflicting writes.
  • STONITH ("Shoot The Other Node In The Head"): brutal but necessary fencing.
  • Operator confusion: stale dashboards, wrong promotion target.

GitHub had a famous 2018 outage from a botched MySQL failover; Postgres has Patroni / pg_auto_failover; managed services (RDS, Aurora, Cloud SQL) hide most of this.

Read scaling: replication lag

Async replication ⇒ followers lag the leader. Lag = seconds, sometimes minutes under load. Three pathologies:

  1. Read your own writes: user updates profile, immediately reads it from a replica that hasn't caught up → sees old data. Solution: route their reads to the leader for a window post-write, or use a session token (replica refuses if its log < session token).
  2. Monotonic reads: user refreshes; first read hits replica A (caught up), second hits replica B (behind) → time appears to go backward. Solution: route the user's reads to the same replica.
  3. Consistent prefix: causally ordered events appear out of order. (Question/answer chat with reply arriving before question.) Solution: same partition for causally related ops; or use causal consistency primitives.

These are consistency models, covered in chapter 12.

10.3 Multi-leader

More than one node accepts writes. Each leader replicates its writes to the others.

When to use

  • Multi-datacenter: leader in each DC; local writes are fast; cross-DC replication is async. Used by some MySQL/Postgres deployments and CouchDB.
  • Offline clients: each client is a "leader" of its local data; sync on reconnect (think Apple Notes, Trello).
  • Collaborative editing: similar to offline; CRDTs / OT (Google Docs).

Conflict resolution

The hard problem of multi-leader: same row updated on two leaders concurrently.

Strategies:

  • Last-writer-wins (LWW): pick by timestamp. Loses data silently. Common but dangerous.
  • App-defined merge: merge function reconciles (e.g., sum counters, union sets).
  • CRDTs (Conflict-free Replicated Data Types): data structures designed to merge deterministically. Counters, sets, maps. Used in Riak, Redis (CRDT module), collaborative editing.
  • OT (Operational Transformation): alternative; transform operations against each other. Google Docs uses OT.
  • Manual reconciliation: surface conflicts to the user (Git merge conflicts, Trello "version conflict").

Multi-leader adds operational complexity. Most teams should not pick it lightly.

10.4 Leaderless replication (Dynamo style)

Any replica accepts writes. There is no leader. Reads contact multiple replicas; clients (or a coordinator) reconcile.

Examples: DynamoDB-the-paper (Amazon, 2007), Cassandra, Riak, Voldemort.

Quorum reads/writes

N = number of replicas; W = writes acked before success; R = reads contacted.

Strong consistency requires W + R > N.

  • N=3, W=2, R=2 → strong (overlap of 1 replica between any read and write).
  • N=3, W=1, R=1 → fast but eventual.
  • N=3, W=3, R=1 → fast reads, slow writes.

Sloppy quorum + hinted handoff

If a target replica is down, the write goes to a "stand-in" replica with a hint. When the original comes back, the hint is replayed. Improves availability; weakens consistency (a quorum may not actually overlap with future reads).

Read repair + anti-entropy

  • Read repair: on read, if responses disagree, write the latest version to the stale replica.
  • Anti-entropy (background): a process compares replicas (Merkle trees, see chapter 17) and resyncs differences.

Concurrent writes & vector clocks

Two clients write to different replicas concurrently. Without ordering, you get conflicts. Vector clocks (chapter 14) detect concurrency; the system either picks one (LWW), keeps both (siblings), or asks the app to merge.

When leaderless is right

  • Multi-region writes with high availability.
  • Workloads tolerant of conflict resolution / convergence.
  • Predictable, uniform write distribution.

When it's wrong

  • Strict ordering or consistency required.
  • Frequent concurrent updates to the same key.

10.5 Cross-region replication

Three patterns:

  1. Single-leader in one region; async followers in others. Reads local, writes cross-region. Failover slow. Common.
  2. Multi-leader, one per region. Writes local in each. Conflict-prone.
  3. Synchronous quorum across regions (Spanner, CockroachDB). Strong consistency globally; writes pay cross-region RTT. Expensive but powerful.

Spanner's trick: TrueTime (chapter 14) lets it order globally without a single leader, while still serializing transactions.

10.6 Backups vs replication

Replication ≠ backup.

  • A replica copies your writes — including DELETE FROM users.
  • Backups protect against logic errors, ransomware, human deletes.
  • Always have point-in-time-recovery (PITR) backups separate from replicas.
  • Test restores. An untested backup is not a backup.

10.7 Operational notes

  • Monitor replication lag as a first-class SLI. Pages should fire at ~10-30s lag.
  • Failover drills: schedule them; practice in staging. Inexperienced teams fumble the first real failover.
  • Connection routing: clients need to discover the new leader. Use a proxy (PgBouncer with leader detection, ProxySQL), service discovery (Consul/etcd), or managed solution.

10.8 What Google interviewers want

  • You can articulate sync vs async trade-offs.
  • You know failover is hard and what fails.
  • You can reason about read-your-writes and explain how to fix it.
  • You know quorum math (W + R > N).
  • You know when you'd pick multi-leader (rarely) and what you give up.

Key takeaways

  • Single-leader is the default. Async + semi-sync is the production sweet spot.
  • Replication lag forces consistency-model decisions: read-your-writes, monotonic, consistent prefix.
  • Multi-leader trades simplicity for write availability. Resolve conflicts deliberately.
  • Leaderless (Dynamo) trades strict ordering for write availability + uniform load. Quorum gives partial consistency.
  • Replication ≠ backup.

// 1 view

main
UTF-8·typescript