system-design/distributed-fundamentals.md

13. Distributed Systems Fundamentals: Faults & Networks

A distributed system is one where "a computer you didn't even know existed can render your computer unusable" (Lamport). Understanding *why* distributed systems are hard is the first step to designing them well.

~6 min read·updated 5/29/2026

13. Distributed Systems Fundamentals: Faults & Networks

A distributed system is one where "a computer you didn't even know existed can render your computer unusable" (Lamport). Understanding why distributed systems are hard is the first step to designing them well.

13.1 The fundamental problem: partial failure

In a single process, things mostly work or fully crash. Across machines, anything can fail independently. A request might:

  • arrive
  • not arrive
  • arrive late
  • arrive multiple times
  • arrive but the response is lost

You usually can't tell which. Timeouts are guesses, not verdicts.

This single fact drives most distributed-systems design.

13.2 The unreliable network

Networks drop packets, delay packets, reorder packets, partition (split between groups of nodes), route asymmetrically.

Causes you'll meet in production:

  • TCP buffer overflow at intermediate switches → drops
  • BGP misconfiguration → routes through wrong continent
  • Microbursts at top-of-rack switches → ms-scale spikes
  • Rolling DNS update → some clients see old IPs for hours
  • A misbehaving NIC → ~7% packet loss to one rack
  • A datacenter loses power → instant partition
  • "Clock-source mismatch" → TLS handshake fails

The model you should always assume: asynchronous network with crash-recovery faults. No upper bound on message delay. Nodes can crash and come back.

13.3 Detecting failures

A node that doesn't reply might be:

  • Dead (crashed)
  • Stuck (GC pause, swapping, infinite loop)
  • Slow (overloaded, queueing)
  • Network-isolated (partition)
  • Just slow this RTT

You set a timeout. If it expires, you decide "failed" — and you're occasionally wrong both ways.

Timeouts: how to choose

  • Too short: false positives → unnecessary failovers, retries amplify load.
  • Too long: real failures linger → cascading slowness.
  • Adaptive timeouts (track RTT distribution per peer, set to P99 + margin).
  • Phi-accrual failure detector: outputs a confidence ("how dead is this node?") rather than a binary. Used in Cassandra, Akka.

Heartbeats

Periodic "I'm alive" messages. Common but flawed: a node can heartbeat fine but be unable to serve real requests (deadlock, exhausted threads, GC paused on critical path).

Better: synthetic transactions that exercise the actual code path.

13.4 The eight fallacies of distributed computing

Coined at Sun in the 90s. Every one of these is a thing engineers assume that isn't true:

  1. The network is reliable.
  2. Latency is zero.
  3. Bandwidth is infinite.
  4. The network is secure.
  5. Topology doesn't change.
  6. There is one administrator.
  7. Transport cost is zero.
  8. The network is homogeneous.

If you find yourself assuming any of these, you have a bug.

13.5 Unreliable clocks

Two kinds:

Time-of-day clock (wall clock)

Returns "current calendar time." Synchronized via NTP. Drifts. Can jump backward (NTP correction) or forward.

Don't use it for ordering or duration measurement.

Monotonic clock

Counts elapsed time from an arbitrary epoch. Never goes backward. Use this for: timeouts, durations, rate limits.

Clock skew across machines

Even with NTP, machines drift ~10s of ms apart routinely; sometimes seconds during NTP issues.

Implications:

  • "Last-writer-wins" using wall-clock timestamps loses writes when clocks skew.
  • Database transaction ordering by physical clock is broken.
  • Distributed locks with wall-clock TTLs can be released early/late.

(Chapter 14 covers logical clocks and Spanner's TrueTime.)

13.6 Process pauses

Even on one machine, processes pause. JVM GC pauses can be 100s of ms. Linux can swap a process out for seconds under memory pressure. VM live migration can pause guests.

Implications:

  • A leader might believe its lock is still valid; but the rest of the cluster has long since timed it out and elected a new leader. Old leader writes → split brain.
  • Fence with fencing tokens: a monotonically increasing number granted with the lock; storage refuses any token < its highest seen.

13.7 Knowledge, truth, and lies

Important framing from DDIA:

  • A node cannot trust its own judgment about whether it's working. (It might be the slow one.)
  • Truth is defined by the majority (quorum). What most nodes agree on is the truth.
  • A liar (Byzantine fault) is when a node sends conflicting info or wrong data. Most non-blockchain systems assume no Byzantine faults.

The split-brain problem

Two leaders both think they're in charge. They issue conflicting writes. Recovery requires reconciling — often requires manual intervention or losing one side's data.

Prevention: only one quorum can elect a leader; fencing tokens; STONITH.

13.8 Consensus is the answer

Many distributed problems reduce to consensus: leader election, distributed locks, atomic commits, group membership. Solving consensus = solving them all.

Consensus algorithms (Paxos, Raft, Zab) have a majority requirement: at least ⌊N/2⌋ + 1 nodes must agree. With N=3, you tolerate 1 failure; N=5 tolerates 2. (See chapter 15.)

This is why ZooKeeper / etcd clusters are typically 3 or 5 nodes.

13.9 The FLP impossibility

Fischer, Lynch, Paterson (1985): in a fully asynchronous system with even one crash failure, no consensus algorithm always terminates.

Practical implication: real consensus algorithms add timing assumptions (failure detectors). They terminate in practice; theoretical guarantees are weakened.

13.10 The two-generals problem

Two generals on opposite hills must coordinate an attack. Every messenger can be captured. Can they ever be sure they'll attack at the same time?

Provably no. Each protocol that requires a final ack adds another ack required.

Implication: you cannot build a perfectly-reliable distributed atomic protocol over an unreliable network. Real protocols (2PC, distributed transactions) work in practice but have failure modes.

13.11 Idempotency: the saving grace

If an operation is idempotent (safe to repeat), retries become safe. The network can lose acks; you retry; everything's fine.

Make operations idempotent by:

  • Idempotency keys: client generates a UUID; server stores (key, result) for a window; if same key arrives again, return cached result.
  • Conditional writes: UPDATE ... WHERE version = X (CAS).
  • Set-based ops: "set status to X" is idempotent; "increment by 1" is not.

Stripe's API has Idempotency-Key header on every POST. Every payment system you build should follow.

13.12 At-most-once vs at-least-once vs exactly-once

When messages flow through a queue or RPC:

  • At-most-once: send and pray; never retry. Simple, lose data.
  • At-least-once: retry until ack. Always works; duplicates possible.
  • Exactly-once: at-least-once + idempotent receiver = effectively once.

There is no true "exactly once delivery." There is only at-least-once delivery + idempotent processing. Kafka's "exactly-once" is semantic exactly-once via transactional producers + idempotent consumers + offsets.

13.13 Cascading failures

A single slow node → backed-up requests → upstream queues fill → upstream times out → upstream's clients retry → load amplifies → more nodes overloaded → cascade.

Patterns to prevent:

  • Circuit breakers: open after N failures; fail fast; periodically test.
  • Bulkheads: separate connection pools per dependency, so one failure doesn't drain all.
  • Backpressure: signal upstream to slow down.
  • Load shedding: drop low-priority requests under load, preserve critical ones.
  • Rate limiting: cap inbound work.
  • Hedging: send the same request to two replicas after a small delay; use whichever responds. Tames tail latency.

13.14 Designing for failure

Mantras:

  • Assume everything fails. Replicas die, networks split, processes pause, disks corrupt.
  • Design rollback before rollforward. Every change must be reversible.
  • Health is a question, not a fact. Use synthetic traffic.
  • Prefer idempotency over correctness-via-luck.
  • Test failure modes. Chaos engineering is not optional.

Key takeaways

  • Partial failure = the defining property of distributed systems.
  • Networks lose, delay, reorder, and partition. Clocks lie. Processes pause.
  • Truth = majority. Single nodes can't be trusted alone.
  • Consensus solves leader election, locks, atomic commit — at the cost of latency.
  • "Exactly once" is at-least-once + idempotency.
  • Cascading failures are how systems die. Plan for them: circuit breakers, bulkheads, load shedding.

// 1 view

main
UTF-8·typescript