◐ system-design/consensus.md
15. Consensus: Paxos, Raft, ZAB
Consensus = a group of nodes agreeing on a single value (or a sequence of values), even when some nodes fail. It's the kernel of every reliable distributed system.
~6 min read·updated 5/29/2026
15. Consensus: Paxos, Raft, ZAB
Consensus = a group of nodes agreeing on a single value (or a sequence of values), even when some nodes fail. It's the kernel of every reliable distributed system.
15.1 The problem
A set of N nodes must agree on a value. Constraints:
- Agreement: all non-faulty nodes decide the same value.
- Validity: the decided value was proposed by some node.
- Termination: every non-faulty node eventually decides.
The FLP impossibility (chapter 13) says you can't satisfy all three deterministically in a fully asynchronous system with one crash. Real algorithms add timing assumptions (failure detectors).
15.2 What consensus enables
- Leader election (the most common use): elect one node as primary; if it dies, elect another.
- Atomic commit across multiple participants.
- Distributed locks.
- Group membership (which nodes are alive?).
- Replicated state machine: all replicas apply the same operations in the same order — same final state.
15.3 Paxos
Lamport (1989, published 1998). Foundational; legendary for being notoriously hard to understand. Powers Chubby (Google), Spanner (variant), Cassandra (lightweight transactions).
Roles
- Proposer: proposes values.
- Acceptor: votes on proposals.
- Learner: learns the chosen value.
In practice, every node plays all roles.
Two phases
Phase 1 (Prepare):
- Proposer picks proposal number N (must be unique and monotonic per proposer).
- Sends
prepare(N)to majority of acceptors. - Acceptor: if N > any prior N seen, promise not to accept lower; reply with the highest-numbered accepted value (if any).
Phase 2 (Accept):
- Proposer collects responses. If any response had a previously-accepted value, the proposer must use that value (preserves consistency). Otherwise it can pick its own.
- Sends
accept(N, value). - Acceptor: if it hasn't promised anything higher, accept.
If a majority accept, the value is chosen.
Why it's safe
A majority must overlap any other majority. So once a value is accepted by a majority, any future proposer will see it (in phase 1) and propose the same value.
Why it's hard
- Single-decree Paxos chooses one value. For a sequence (state machine), you need Multi-Paxos: stable leader, each instance is one decree, log of decrees.
- Live-lock if proposers race: they keep preempting each other. Solution: stable leader election.
- The paper is dense; production implementations diverge wildly.
Variants
- Multi-Paxos: stable leader, batch-friendly.
- Fast Paxos: 2 message rounds in the no-conflict case.
- EPaxos (Egalitarian Paxos): no leader; each node can propose; useful for geo-distributed systems.
15.4 Raft
Ongaro & Ousterhout (2014). Designed explicitly for understandability. Now the dominant choice for new systems: etcd, Consul, CockroachDB, TiKV, RethinkDB, MongoDB (custom Raft variant).
Three roles
- Leader: the only node accepting client writes. Only one at a time.
- Followers: passive; respond to leader's requests.
- Candidate: a follower who timed out and is running for election.
Terms
Time is divided into terms (integers, monotonic). Each term has at most one leader.
Leader election
- Followers expect heartbeats from leader.
- If timeout (randomized 150-300ms typical), follower → candidate, increments term, votes for itself, requests votes from others.
- Each node votes for at most one candidate per term, granting if candidate's log is at least as up-to-date as its own.
- Majority votes → become leader; send heartbeats.
Log replication
- Client sends command to leader.
- Leader appends to its log; sends
AppendEntriesto followers. - Once majority ack, leader marks committed; applies to state machine; replies to client.
- Followers apply the same command in order.
Safety properties
- Leader completeness: any committed entry is in every future leader's log.
- State machine safety: same index, same command across all nodes.
Achieved by: only granting votes to candidates with up-to-date logs; leader doesn't commit entries from previous terms until at least one current-term entry is committed.
Why Raft beat Paxos in popularity
- Clear leader election separated from log replication.
- Strong leader (followers are simple).
- Membership change protocol (joint consensus).
- Snapshotting protocol included.
- The paper is genuinely readable.
Operational quirks
- Quorum size: N=3 tolerates 1 failure, N=5 tolerates 2. Going N=4 doesn't help (still need 3 to commit).
- Latency: every commit is at least 1 RTT to the slowest of (N+1)/2 nodes.
- Throughput: limited by the leader's I/O.
15.5 ZAB (ZooKeeper Atomic Broadcast)
The protocol behind Apache ZooKeeper. Predates Raft, similar spirit. Provides total order broadcast.
ZooKeeper itself: a coordination service with a hierarchical namespace (znodes), watch notifications, and ephemeral nodes (auto-cleaned on session loss). Used for: leader election, locks, config, service discovery.
Used by: Kafka (until KIP-500 / KRaft removed it), HBase, Solr, many big systems.
15.6 What you build with consensus
Leader election
"Whoever holds the lock at path /election is leader." Lock is fenced (holds a sequence number); if leader dies (session timeout), lock auto-released, others race.
Distributed lock
PUT /lock/foo with a TTL via consensus. Most expensive primitive. Use sparingly.
Atomic counters / sequence numbers
Increment a value via consensus. Slow but sometimes necessary (e.g., generating tweet IDs).
Configuration store
etcd / ZooKeeper / Consul stores cluster config. Watchers notify on change.
Membership / service discovery
The cluster's view of "who's alive."
15.7 Where consensus is wrong to use
It's slow (RTTs across quorum). Don't use for:
- Every user request (use a single leader and replicate the log via Raft, but serve reads from leader/cache).
- High-throughput data writes (use partitioned single-leader replication).
- Decisions that don't actually need cluster-wide agreement (most don't).
The principle: consensus serializes a small set of metadata decisions. Data flows on the side, replicated through other means.
15.8 Strong consistency is consensus (for writes)
Linearizable register = consensus. If you want every read to see the latest write, every write must go through agreement.
Therefore:
- Linearizable systems are slow under load (consensus latency).
- They have a hard cap on write throughput (single leader bottleneck).
- They don't span continents well (RTT × commit count).
This is why most planet-scale systems are NOT linearizable across regions; they shard, and each shard has its own consensus group.
15.9 Spanner-specific: Paxos groups + TrueTime
Spanner shards data into groups; each group runs Paxos for replication. TrueTime orders transactions across groups.
The combination = distributed strict serializability. Unique in the industry until very recently. CockroachDB, YugabyteDB approximate it without atomic clocks.
15.10 Byzantine fault tolerance (BFT)
Crash faults: nodes stop. Byzantine faults: nodes lie or send conflicting info. BFT consensus tolerates Byzantine faults at cost of higher message complexity (PBFT, Tendermint).
You generally only need BFT for:
- Public blockchains (no trust between nodes).
- Hostile environments (defense, finance edge cases).
Internal systems assume crash faults; trust-but-verify with audits is enough.
15.11 Common consensus pitfalls
- Cluster size: 3 or 5, not 4. 7 only if you have a real reason. Bigger = slower commits.
- Geographic distribution: spreading the quorum across regions = high commit latency. Most systems keep the quorum within a region; cross-region replication is async.
- Disk failures: a corrupt log on one node can poison the protocol. Snapshot regularly; verify.
- Split brain via clock skew: stale leader keeps committing because it didn't notice the term changed. Fence with leader leases.
15.12 What you should be able to do in an interview
- Explain why consensus is hard (FLP).
- Sketch Raft's leader election and log replication in 5 minutes.
- Justify quorum size (3 or 5).
- Articulate why "every write through consensus" doesn't scale and how partitioning + per-partition consensus does.
- Recognize when a problem reduces to consensus (locks, leader, atomic commit, membership).
Key takeaways
- Consensus = N nodes agreeing despite failures. Raft is the practical default.
- Quorum requires majority; cluster sizes 3 or 5.
- Use consensus for metadata (leader, config, membership) and per-partition replication, not every user write.
- Spanner = Paxos per shard + TrueTime for global ordering.
- BFT is overkill for trusted internal systems.
// 1 view