system-design/consensus.md

15. Consensus: Paxos, Raft, ZAB

Consensus = a group of nodes agreeing on a single value (or a sequence of values), even when some nodes fail. It's the kernel of every reliable distributed system.

~6 min read·updated 5/29/2026

15. Consensus: Paxos, Raft, ZAB

Consensus = a group of nodes agreeing on a single value (or a sequence of values), even when some nodes fail. It's the kernel of every reliable distributed system.

15.1 The problem

A set of N nodes must agree on a value. Constraints:

  • Agreement: all non-faulty nodes decide the same value.
  • Validity: the decided value was proposed by some node.
  • Termination: every non-faulty node eventually decides.

The FLP impossibility (chapter 13) says you can't satisfy all three deterministically in a fully asynchronous system with one crash. Real algorithms add timing assumptions (failure detectors).

15.2 What consensus enables

  • Leader election (the most common use): elect one node as primary; if it dies, elect another.
  • Atomic commit across multiple participants.
  • Distributed locks.
  • Group membership (which nodes are alive?).
  • Replicated state machine: all replicas apply the same operations in the same order — same final state.

15.3 Paxos

Lamport (1989, published 1998). Foundational; legendary for being notoriously hard to understand. Powers Chubby (Google), Spanner (variant), Cassandra (lightweight transactions).

Roles

  • Proposer: proposes values.
  • Acceptor: votes on proposals.
  • Learner: learns the chosen value.

In practice, every node plays all roles.

Two phases

Phase 1 (Prepare):

  • Proposer picks proposal number N (must be unique and monotonic per proposer).
  • Sends prepare(N) to majority of acceptors.
  • Acceptor: if N > any prior N seen, promise not to accept lower; reply with the highest-numbered accepted value (if any).

Phase 2 (Accept):

  • Proposer collects responses. If any response had a previously-accepted value, the proposer must use that value (preserves consistency). Otherwise it can pick its own.
  • Sends accept(N, value).
  • Acceptor: if it hasn't promised anything higher, accept.

If a majority accept, the value is chosen.

Why it's safe

A majority must overlap any other majority. So once a value is accepted by a majority, any future proposer will see it (in phase 1) and propose the same value.

Why it's hard

  • Single-decree Paxos chooses one value. For a sequence (state machine), you need Multi-Paxos: stable leader, each instance is one decree, log of decrees.
  • Live-lock if proposers race: they keep preempting each other. Solution: stable leader election.
  • The paper is dense; production implementations diverge wildly.

Variants

  • Multi-Paxos: stable leader, batch-friendly.
  • Fast Paxos: 2 message rounds in the no-conflict case.
  • EPaxos (Egalitarian Paxos): no leader; each node can propose; useful for geo-distributed systems.

15.4 Raft

Ongaro & Ousterhout (2014). Designed explicitly for understandability. Now the dominant choice for new systems: etcd, Consul, CockroachDB, TiKV, RethinkDB, MongoDB (custom Raft variant).

Three roles

  • Leader: the only node accepting client writes. Only one at a time.
  • Followers: passive; respond to leader's requests.
  • Candidate: a follower who timed out and is running for election.

Terms

Time is divided into terms (integers, monotonic). Each term has at most one leader.

Leader election

  • Followers expect heartbeats from leader.
  • If timeout (randomized 150-300ms typical), follower → candidate, increments term, votes for itself, requests votes from others.
  • Each node votes for at most one candidate per term, granting if candidate's log is at least as up-to-date as its own.
  • Majority votes → become leader; send heartbeats.

Log replication

  • Client sends command to leader.
  • Leader appends to its log; sends AppendEntries to followers.
  • Once majority ack, leader marks committed; applies to state machine; replies to client.
  • Followers apply the same command in order.

Safety properties

  • Leader completeness: any committed entry is in every future leader's log.
  • State machine safety: same index, same command across all nodes.

Achieved by: only granting votes to candidates with up-to-date logs; leader doesn't commit entries from previous terms until at least one current-term entry is committed.

Why Raft beat Paxos in popularity

  • Clear leader election separated from log replication.
  • Strong leader (followers are simple).
  • Membership change protocol (joint consensus).
  • Snapshotting protocol included.
  • The paper is genuinely readable.

Operational quirks

  • Quorum size: N=3 tolerates 1 failure, N=5 tolerates 2. Going N=4 doesn't help (still need 3 to commit).
  • Latency: every commit is at least 1 RTT to the slowest of (N+1)/2 nodes.
  • Throughput: limited by the leader's I/O.

15.5 ZAB (ZooKeeper Atomic Broadcast)

The protocol behind Apache ZooKeeper. Predates Raft, similar spirit. Provides total order broadcast.

ZooKeeper itself: a coordination service with a hierarchical namespace (znodes), watch notifications, and ephemeral nodes (auto-cleaned on session loss). Used for: leader election, locks, config, service discovery.

Used by: Kafka (until KIP-500 / KRaft removed it), HBase, Solr, many big systems.

15.6 What you build with consensus

Leader election

"Whoever holds the lock at path /election is leader." Lock is fenced (holds a sequence number); if leader dies (session timeout), lock auto-released, others race.

Distributed lock

PUT /lock/foo with a TTL via consensus. Most expensive primitive. Use sparingly.

Atomic counters / sequence numbers

Increment a value via consensus. Slow but sometimes necessary (e.g., generating tweet IDs).

Configuration store

etcd / ZooKeeper / Consul stores cluster config. Watchers notify on change.

Membership / service discovery

The cluster's view of "who's alive."

15.7 Where consensus is wrong to use

It's slow (RTTs across quorum). Don't use for:

  • Every user request (use a single leader and replicate the log via Raft, but serve reads from leader/cache).
  • High-throughput data writes (use partitioned single-leader replication).
  • Decisions that don't actually need cluster-wide agreement (most don't).

The principle: consensus serializes a small set of metadata decisions. Data flows on the side, replicated through other means.

15.8 Strong consistency is consensus (for writes)

Linearizable register = consensus. If you want every read to see the latest write, every write must go through agreement.

Therefore:

  • Linearizable systems are slow under load (consensus latency).
  • They have a hard cap on write throughput (single leader bottleneck).
  • They don't span continents well (RTT × commit count).

This is why most planet-scale systems are NOT linearizable across regions; they shard, and each shard has its own consensus group.

15.9 Spanner-specific: Paxos groups + TrueTime

Spanner shards data into groups; each group runs Paxos for replication. TrueTime orders transactions across groups.

The combination = distributed strict serializability. Unique in the industry until very recently. CockroachDB, YugabyteDB approximate it without atomic clocks.

15.10 Byzantine fault tolerance (BFT)

Crash faults: nodes stop. Byzantine faults: nodes lie or send conflicting info. BFT consensus tolerates Byzantine faults at cost of higher message complexity (PBFT, Tendermint).

You generally only need BFT for:

  • Public blockchains (no trust between nodes).
  • Hostile environments (defense, finance edge cases).

Internal systems assume crash faults; trust-but-verify with audits is enough.

15.11 Common consensus pitfalls

  • Cluster size: 3 or 5, not 4. 7 only if you have a real reason. Bigger = slower commits.
  • Geographic distribution: spreading the quorum across regions = high commit latency. Most systems keep the quorum within a region; cross-region replication is async.
  • Disk failures: a corrupt log on one node can poison the protocol. Snapshot regularly; verify.
  • Split brain via clock skew: stale leader keeps committing because it didn't notice the term changed. Fence with leader leases.

15.12 What you should be able to do in an interview

  • Explain why consensus is hard (FLP).
  • Sketch Raft's leader election and log replication in 5 minutes.
  • Justify quorum size (3 or 5).
  • Articulate why "every write through consensus" doesn't scale and how partitioning + per-partition consensus does.
  • Recognize when a problem reduces to consensus (locks, leader, atomic commit, membership).

Key takeaways

  • Consensus = N nodes agreeing despite failures. Raft is the practical default.
  • Quorum requires majority; cluster sizes 3 or 5.
  • Use consensus for metadata (leader, config, membership) and per-partition replication, not every user write.
  • Spanner = Paxos per shard + TrueTime for global ordering.
  • BFT is overkill for trusted internal systems.

// 1 view

main
UTF-8·typescript