15. Consensus: Paxos, Raft, ZAB

Consensus = a group of nodes agreeing on a single value (or a sequence of values), even when some nodes fail. It's the kernel of every reliable distributed system.

15.1 The problem

A set of N nodes must agree on a value. Constraints:

Agreement: all non-faulty nodes decide the same value.
Validity: the decided value was proposed by some node.
Termination: every non-faulty node eventually decides.

The FLP impossibility (chapter 13) says you can't satisfy all three deterministically in a fully asynchronous system with one crash. Real algorithms add timing assumptions (failure detectors).

15.2 What consensus enables

Leader election (the most common use): elect one node as primary; if it dies, elect another.
Atomic commit across multiple participants.
Distributed locks.
Group membership (which nodes are alive?).
Replicated state machine: all replicas apply the same operations in the same order — same final state.

15.3 Paxos

Lamport (1989, published 1998). Foundational; legendary for being notoriously hard to understand. Powers Chubby (Google), Spanner (variant), Cassandra (lightweight transactions).

Roles

Proposer: proposes values.
Acceptor: votes on proposals.
Learner: learns the chosen value.

In practice, every node plays all roles.

Two phases

Phase 1 (Prepare):

Proposer picks proposal number N (must be unique and monotonic per proposer).
Sends prepare(N) to majority of acceptors.
Acceptor: if N > any prior N seen, promise not to accept lower; reply with the highest-numbered accepted value (if any).

Phase 2 (Accept):

Proposer collects responses. If any response had a previously-accepted value, the proposer must use that value (preserves consistency). Otherwise it can pick its own.
Sends accept(N, value).
Acceptor: if it hasn't promised anything higher, accept.

If a majority accept, the value is chosen.

Why it's safe

A majority must overlap any other majority. So once a value is accepted by a majority, any future proposer will see it (in phase 1) and propose the same value.

Why it's hard

Single-decree Paxos chooses one value. For a sequence (state machine), you need Multi-Paxos: stable leader, each instance is one decree, log of decrees.
Live-lock if proposers race: they keep preempting each other. Solution: stable leader election.
The paper is dense; production implementations diverge wildly.

Variants

Multi-Paxos: stable leader, batch-friendly.
Fast Paxos: 2 message rounds in the no-conflict case.
EPaxos (Egalitarian Paxos): no leader; each node can propose; useful for geo-distributed systems.

15.4 Raft

Ongaro & Ousterhout (2014). Designed explicitly for understandability. Now the dominant choice for new systems: etcd, Consul, CockroachDB, TiKV, RethinkDB, MongoDB (custom Raft variant).

Three roles

Leader: the only node accepting client writes. Only one at a time.
Followers: passive; respond to leader's requests.
Candidate: a follower who timed out and is running for election.

Terms

Time is divided into terms (integers, monotonic). Each term has at most one leader.

Leader election

Followers expect heartbeats from leader.
If timeout (randomized 150-300ms typical), follower → candidate, increments term, votes for itself, requests votes from others.
Each node votes for at most one candidate per term, granting if candidate's log is at least as up-to-date as its own.
Majority votes → become leader; send heartbeats.

Log replication

Client sends command to leader.
Leader appends to its log; sends AppendEntries to followers.
Once majority ack, leader marks committed; applies to state machine; replies to client.
Followers apply the same command in order.

Safety properties

Leader completeness: any committed entry is in every future leader's log.
State machine safety: same index, same command across all nodes.

Achieved by: only granting votes to candidates with up-to-date logs; leader doesn't commit entries from previous terms until at least one current-term entry is committed.

Why Raft beat Paxos in popularity

Clear leader election separated from log replication.
Strong leader (followers are simple).
Membership change protocol (joint consensus).
Snapshotting protocol included.
The paper is genuinely readable.

Operational quirks

Quorum size: N=3 tolerates 1 failure, N=5 tolerates 2. Going N=4 doesn't help (still need 3 to commit).
Latency: every commit is at least 1 RTT to the slowest of (N+1)/2 nodes.
Throughput: limited by the leader's I/O.

15.5 ZAB (ZooKeeper Atomic Broadcast)

The protocol behind Apache ZooKeeper. Predates Raft, similar spirit. Provides total order broadcast.

ZooKeeper itself: a coordination service with a hierarchical namespace (znodes), watch notifications, and ephemeral nodes (auto-cleaned on session loss). Used for: leader election, locks, config, service discovery.

Used by: Kafka (until KIP-500 / KRaft removed it), HBase, Solr, many big systems.

15.6 What you build with consensus

Leader election

"Whoever holds the lock at path /election is leader." Lock is fenced (holds a sequence number); if leader dies (session timeout), lock auto-released, others race.

Distributed lock

PUT /lock/foo with a TTL via consensus. Most expensive primitive. Use sparingly.

Atomic counters / sequence numbers

Increment a value via consensus. Slow but sometimes necessary (e.g., generating tweet IDs).

Configuration store

etcd / ZooKeeper / Consul stores cluster config. Watchers notify on change.

Membership / service discovery

The cluster's view of "who's alive."

15.7 Where consensus is wrong to use

It's slow (RTTs across quorum). Don't use for:

Every user request (use a single leader and replicate the log via Raft, but serve reads from leader/cache).
High-throughput data writes (use partitioned single-leader replication).
Decisions that don't actually need cluster-wide agreement (most don't).

The principle: consensus serializes a small set of metadata decisions. Data flows on the side, replicated through other means.

15.8 Strong consistency is consensus (for writes)

Linearizable register = consensus. If you want every read to see the latest write, every write must go through agreement.

Therefore:

Linearizable systems are slow under load (consensus latency).
They have a hard cap on write throughput (single leader bottleneck).
They don't span continents well (RTT × commit count).

This is why most planet-scale systems are NOT linearizable across regions; they shard, and each shard has its own consensus group.

15.9 Spanner-specific: Paxos groups + TrueTime

Spanner shards data into groups; each group runs Paxos for replication. TrueTime orders transactions across groups.

The combination = distributed strict serializability. Unique in the industry until very recently. CockroachDB, YugabyteDB approximate it without atomic clocks.

15.10 Byzantine fault tolerance (BFT)

Crash faults: nodes stop. Byzantine faults: nodes lie or send conflicting info. BFT consensus tolerates Byzantine faults at cost of higher message complexity (PBFT, Tendermint).

You generally only need BFT for:

Public blockchains (no trust between nodes).
Hostile environments (defense, finance edge cases).

Internal systems assume crash faults; trust-but-verify with audits is enough.

15.11 Common consensus pitfalls

Cluster size: 3 or 5, not 4. 7 only if you have a real reason. Bigger = slower commits.
Geographic distribution: spreading the quorum across regions = high commit latency. Most systems keep the quorum within a region; cross-region replication is async.
Disk failures: a corrupt log on one node can poison the protocol. Snapshot regularly; verify.
Split brain via clock skew: stale leader keeps committing because it didn't notice the term changed. Fence with leader leases.

15.12 What you should be able to do in an interview

Explain why consensus is hard (FLP).
Sketch Raft's leader election and log replication in 5 minutes.
Justify quorum size (3 or 5).
Articulate why "every write through consensus" doesn't scale and how partitioning + per-partition consensus does.
Recognize when a problem reduces to consensus (locks, leader, atomic commit, membership).

Key takeaways

Consensus = N nodes agreeing despite failures. Raft is the practical default.
Quorum requires majority; cluster sizes 3 or 5.
Use consensus for metadata (leader, config, membership) and per-partition replication, not every user write.
Spanner = Paxos per shard + TrueTime for global ordering.
BFT is overkill for trusted internal systems.