system-design/distributed-data-structures.md

17. Probabilistic Structures: Bloom, Consistent Hashing, Merkle, HLL, Count-Min

A handful of clever data structures show up everywhere in distributed systems. Knowing them is interview table stakes.

~6 min read·updated 5/29/2026

17. Probabilistic Structures: Bloom, Consistent Hashing, Merkle, HLL, Count-Min

A handful of clever data structures show up everywhere in distributed systems. Knowing them is interview table stakes.

17.1 Bloom Filter

Question: "Is X in this huge set?" Bloom filter answers: "definitely not" or "probably yes."

How it works

  • Bit array of size m (initially all zero).
  • k independent hash functions.
  • Insert(x): set bits at h1(x), h2(x), ..., hk(x).
  • Query(x): check those bits; if any is 0, x is definitely absent. If all are 1, x is probably present (false positives possible).

Properties

  • No false negatives.
  • False positive rate: ~(1 - e^(-kn/m))^k. For 1% FPR with n=1M, m≈10MB.
  • Insert and query are O(k), independent of n.
  • Cannot delete from a standard Bloom filter (would corrupt). Counting Bloom filter (each "bit" is a small counter) supports delete.

Where you'll see it

  • LSM trees: each SSTable has a Bloom filter; reads skip files that "definitely don't" contain the key.
  • Caches: pre-check before going to backing store ("cache penetration" defense).
  • Web crawlers: "have I seen this URL?"
  • Distributed systems: deduplication of seen events.
  • Spell checkers: word in dictionary?

Variants

  • Counting Bloom filter (deletes).
  • Cuckoo filter (deletes + slightly better FPR; more complex).
  • Quotient filter (cache-friendly).

17.2 Consistent Hashing

Problem: distribute keys across N nodes such that adding/removing a node moves few keys.

Naive hash(key) mod N fails: change N, every key remaps.

How it works

  • Place all nodes on a "ring" by hashing their identifiers (e.g., name + IP).
  • For each key: hash to a point on the ring; assign to the next node clockwise.
  • Add a node: it takes over keys from its successor — only ~K/N keys move.
  • Remove a node: its keys go to the next node — only those keys move.

Virtual nodes (vnodes)

Each physical node gets multiple positions on the ring (e.g., 100-200 vnodes per node). Smooths out load (otherwise random position assignment causes large imbalances).

Where it shines

  • Memcached / Redis client-side sharding.
  • Cassandra / Dynamo: data partitioning + replication (replicate to next R nodes clockwise).
  • CDN cache servers.
  • Load balancers with sticky-by-key routing.

Trade-off

  • Solves rebalancing pain.
  • Doesn't solve hot keys: if one key is super-hot, it still lives on one (set of) node(s).

Rendezvous (HRW) hashing

Alternative: for each key, compute hash(key, node) for every node; pick the node with the highest score. Same K/N rebalancing property; doesn't need a ring; weights nodes naturally. Often simpler to implement.

17.3 Merkle Trees

A binary tree of hashes. Leaves = hashes of data blocks; each internal node = hash of its children.

Properties

  • Root hash uniquely identifies the entire data set.
  • Detect difference between two trees in O(log N) by comparing roots, then recursing only on differing branches.

Used in

  • Git: every commit, tree, and blob is hashed; commit history is Merkle DAG.
  • Blockchains: Bitcoin block headers contain a Merkle root of all transactions.
  • Anti-entropy in Cassandra / Dynamo: replicas exchange Merkle trees to find diffs and resync only the differing chunks.
  • Filesystems: ZFS, IPFS use Merkle DAGs for integrity.
  • Synchronization protocols: rsync-like diffing.

Implementation note

For varying-size data, balance leaves carefully. Otherwise depth differs and comparison breaks.

17.4 HyperLogLog (HLL)

Question: count distinct items in a stream/large set with limited memory.

Exact COUNT(DISTINCT x) requires storing every item — infeasible at billions.

How it works (simplified)

  • Hash each input to a bit string.
  • Track the longest run of leading zeros seen across all hashes.
  • The longer the run, the more distinct items you've probably seen (rare events imply large universe).
  • Use multiple "buckets" by partitioning hash bits and average for accuracy.

Properties

  • Memory: ~12 KB to estimate cardinalities up to billions with ~2% error.
  • Mergeable: union of HLLs is straightforward (max per bucket).
  • O(1) update.

Used in

  • Redis (PFADD, PFCOUNT, PFMERGE).
  • BigQuery (APPROX_COUNT_DISTINCT).
  • Druid, Presto, Snowflake, ClickHouse.
  • Real-time analytics: unique users per ad campaign per minute, across many shards.

Gotcha

For small cardinalities (< ~100), the estimate can be inaccurate. HLL++ (Google's improvement) corrects with linear counting at low cardinalities.

17.5 Count-Min Sketch

Question: estimate frequency of items in a stream.

Exact frequency table = O(distinct items) memory. Count-min uses fixed memory.

How it works

  • 2D array of d rows × w columns. d hash functions.
  • Increment(x): for each row i, increment arr[i][hi(x)].
  • Query(x): return min over rows of arr[i][hi(x)].

Properties

  • Always overestimates (collisions inflate counts), never underestimates.
  • Error bound: with d = ln(1/δ), w = e/ε, P(error > εN) < δ. Tunable.
  • Tiny memory (KBs to handle gigabytes of input).

Used in

  • Heavy-hitters detection: top-K most-frequent items.
  • Network monitoring: detect DDoS sources without per-IP counters.
  • Stream analytics: trending searches.

Variants

  • Conservative update: only increment min cells; better accuracy.
  • Top-K sketch: maintain heap of top items, with frequencies via CMS.

17.6 Skip List

Probabilistic alternative to balanced BSTs. Layered linked lists; higher layers skip more nodes.

  • Search/insert/delete: O(log N) expected.
  • Used in: Redis sorted sets (ZSET) internals; LevelDB / RocksDB memtables; many lock-free concurrent structures.

Why over a tree: easier to implement lock-free; simpler concurrent updates.

17.7 Trie

Tree where edges represent characters. Useful for prefix queries.

  • Autocomplete: type "go" → traverse to "go" node; emit completions.
  • IP routing: trie of IP prefixes (BGP routers use compressed tries).
  • Spell check: check word existence + suggest.

Variants:

  • Compressed / radix trie (PATRICIA): collapse single-child chains.
  • Suffix trie / suffix array: substring queries; used in bioinformatics.

17.8 Geohash / S2 / H3

(See chapter 35 — Uber.) Encode lat/lon to a 1D string/integer that preserves locality (nearby locations → similar codes).

  • Geohash (alphanumeric): coarse but easy.
  • S2 (Google): hierarchical, spherical (doesn't distort at poles).
  • H3 (Uber): hexagonal cells; uniform distance; no border discontinuities.

17.9 LSM Tree (recap from chapter 5)

Yes, it's a "data structure." The combination of memtable (skip list) + SSTables (sorted runs on disk) + Bloom filters + compaction.

The LSM is the unsung hero of modern distributed storage.

17.10 Vector Clocks (recap from chapter 14)

Map of node ID → counter. Tracks causality across distributed events.

17.11 CRDTs (Conflict-Free Replicated Data Types)

Data structures where all replicas converge regardless of order. Two flavors:

  • State-based (CvRDT): each replica sends its state; merge function is commutative, associative, idempotent.
  • Operation-based (CmRDT): replicas send operations; reliable causal broadcast.

Common types:

  • G-Counter: grow-only counter (one slot per replica; sum).
  • PN-Counter: increment + decrement (two G-counters).
  • G-Set: grow-only set.
  • OR-Set: observed-remove set; each element tagged with unique add-ID.
  • LWW-Register: timestamped scalar.
  • RGA / WOOT / Yjs / Automerge: collaborative text editing.

Used in: Riak (CRDT types), Redis CRDT module, collaborative apps (Figma, Linear, Notion partly).

17.12 SSTables (Sorted String Tables)

Immutable, sorted file of key-value pairs. Internal block structure for fast lookup. Foundation of all LSM-based storage.

  • Read: binary search index → seek to block → scan.
  • Compression: per-block, with prefix sharing in keys.
  • Bloom filter accompanies each SSTable.

17.13 RAFT Log (recap)

Log + state machine. Each entry is (term, index, command). Replicated via consensus. Apply to state machine in order.

17.14 What you should be able to derive in an interview

  • Bloom filter: false positives only; used in LSM trees and caches.
  • Consistent hashing: K/N keys move on resize; vnodes for balance.
  • Merkle tree: efficient diff between large sets.
  • HLL: count distinct in tiny memory; mergeable.
  • Count-min sketch: frequency estimation; always overestimates.

Key takeaways

  • Probabilistic structures trade exactness for huge memory savings.
  • Bloom + consistent hashing + Merkle = the trio that makes Cassandra/Dynamo possible.
  • HLL and Count-Min Sketch power real-time analytics at scale.
  • These are interview gold: knowing them signals you understand how big systems work.

// 1 view

main
UTF-8·typescript