◐ system-design/distributed-data-structures.md
17. Probabilistic Structures: Bloom, Consistent Hashing, Merkle, HLL, Count-Min
A handful of clever data structures show up everywhere in distributed systems. Knowing them is interview table stakes.
~6 min read·updated 5/29/2026
17. Probabilistic Structures: Bloom, Consistent Hashing, Merkle, HLL, Count-Min
A handful of clever data structures show up everywhere in distributed systems. Knowing them is interview table stakes.
17.1 Bloom Filter
Question: "Is X in this huge set?" Bloom filter answers: "definitely not" or "probably yes."
How it works
- Bit array of size m (initially all zero).
- k independent hash functions.
- Insert(x): set bits at h1(x), h2(x), ..., hk(x).
- Query(x): check those bits; if any is 0, x is definitely absent. If all are 1, x is probably present (false positives possible).
Properties
- No false negatives.
- False positive rate: ~(1 - e^(-kn/m))^k. For 1% FPR with n=1M, m≈10MB.
- Insert and query are O(k), independent of n.
- Cannot delete from a standard Bloom filter (would corrupt). Counting Bloom filter (each "bit" is a small counter) supports delete.
Where you'll see it
- LSM trees: each SSTable has a Bloom filter; reads skip files that "definitely don't" contain the key.
- Caches: pre-check before going to backing store ("cache penetration" defense).
- Web crawlers: "have I seen this URL?"
- Distributed systems: deduplication of seen events.
- Spell checkers: word in dictionary?
Variants
- Counting Bloom filter (deletes).
- Cuckoo filter (deletes + slightly better FPR; more complex).
- Quotient filter (cache-friendly).
17.2 Consistent Hashing
Problem: distribute keys across N nodes such that adding/removing a node moves few keys.
Naive hash(key) mod N fails: change N, every key remaps.
How it works
- Place all nodes on a "ring" by hashing their identifiers (e.g., name + IP).
- For each key: hash to a point on the ring; assign to the next node clockwise.
- Add a node: it takes over keys from its successor — only ~K/N keys move.
- Remove a node: its keys go to the next node — only those keys move.
Virtual nodes (vnodes)
Each physical node gets multiple positions on the ring (e.g., 100-200 vnodes per node). Smooths out load (otherwise random position assignment causes large imbalances).
Where it shines
- Memcached / Redis client-side sharding.
- Cassandra / Dynamo: data partitioning + replication (replicate to next R nodes clockwise).
- CDN cache servers.
- Load balancers with sticky-by-key routing.
Trade-off
- Solves rebalancing pain.
- Doesn't solve hot keys: if one key is super-hot, it still lives on one (set of) node(s).
Rendezvous (HRW) hashing
Alternative: for each key, compute hash(key, node) for every node; pick the node with the highest score. Same K/N rebalancing property; doesn't need a ring; weights nodes naturally. Often simpler to implement.
17.3 Merkle Trees
A binary tree of hashes. Leaves = hashes of data blocks; each internal node = hash of its children.
Properties
- Root hash uniquely identifies the entire data set.
- Detect difference between two trees in O(log N) by comparing roots, then recursing only on differing branches.
Used in
- Git: every commit, tree, and blob is hashed; commit history is Merkle DAG.
- Blockchains: Bitcoin block headers contain a Merkle root of all transactions.
- Anti-entropy in Cassandra / Dynamo: replicas exchange Merkle trees to find diffs and resync only the differing chunks.
- Filesystems: ZFS, IPFS use Merkle DAGs for integrity.
- Synchronization protocols: rsync-like diffing.
Implementation note
For varying-size data, balance leaves carefully. Otherwise depth differs and comparison breaks.
17.4 HyperLogLog (HLL)
Question: count distinct items in a stream/large set with limited memory.
Exact COUNT(DISTINCT x) requires storing every item — infeasible at billions.
How it works (simplified)
- Hash each input to a bit string.
- Track the longest run of leading zeros seen across all hashes.
- The longer the run, the more distinct items you've probably seen (rare events imply large universe).
- Use multiple "buckets" by partitioning hash bits and average for accuracy.
Properties
- Memory: ~12 KB to estimate cardinalities up to billions with ~2% error.
- Mergeable: union of HLLs is straightforward (max per bucket).
- O(1) update.
Used in
- Redis (PFADD, PFCOUNT, PFMERGE).
- BigQuery (
APPROX_COUNT_DISTINCT). - Druid, Presto, Snowflake, ClickHouse.
- Real-time analytics: unique users per ad campaign per minute, across many shards.
Gotcha
For small cardinalities (< ~100), the estimate can be inaccurate. HLL++ (Google's improvement) corrects with linear counting at low cardinalities.
17.5 Count-Min Sketch
Question: estimate frequency of items in a stream.
Exact frequency table = O(distinct items) memory. Count-min uses fixed memory.
How it works
- 2D array of d rows × w columns. d hash functions.
- Increment(x): for each row i, increment
arr[i][hi(x)]. - Query(x): return min over rows of
arr[i][hi(x)].
Properties
- Always overestimates (collisions inflate counts), never underestimates.
- Error bound: with d = ln(1/δ), w = e/ε, P(error > εN) < δ. Tunable.
- Tiny memory (KBs to handle gigabytes of input).
Used in
- Heavy-hitters detection: top-K most-frequent items.
- Network monitoring: detect DDoS sources without per-IP counters.
- Stream analytics: trending searches.
Variants
- Conservative update: only increment min cells; better accuracy.
- Top-K sketch: maintain heap of top items, with frequencies via CMS.
17.6 Skip List
Probabilistic alternative to balanced BSTs. Layered linked lists; higher layers skip more nodes.
- Search/insert/delete: O(log N) expected.
- Used in: Redis sorted sets (ZSET) internals; LevelDB / RocksDB memtables; many lock-free concurrent structures.
Why over a tree: easier to implement lock-free; simpler concurrent updates.
17.7 Trie
Tree where edges represent characters. Useful for prefix queries.
- Autocomplete: type "go" → traverse to "go" node; emit completions.
- IP routing: trie of IP prefixes (BGP routers use compressed tries).
- Spell check: check word existence + suggest.
Variants:
- Compressed / radix trie (PATRICIA): collapse single-child chains.
- Suffix trie / suffix array: substring queries; used in bioinformatics.
17.8 Geohash / S2 / H3
(See chapter 35 — Uber.) Encode lat/lon to a 1D string/integer that preserves locality (nearby locations → similar codes).
- Geohash (alphanumeric): coarse but easy.
- S2 (Google): hierarchical, spherical (doesn't distort at poles).
- H3 (Uber): hexagonal cells; uniform distance; no border discontinuities.
17.9 LSM Tree (recap from chapter 5)
Yes, it's a "data structure." The combination of memtable (skip list) + SSTables (sorted runs on disk) + Bloom filters + compaction.
The LSM is the unsung hero of modern distributed storage.
17.10 Vector Clocks (recap from chapter 14)
Map of node ID → counter. Tracks causality across distributed events.
17.11 CRDTs (Conflict-Free Replicated Data Types)
Data structures where all replicas converge regardless of order. Two flavors:
- State-based (CvRDT): each replica sends its state; merge function is commutative, associative, idempotent.
- Operation-based (CmRDT): replicas send operations; reliable causal broadcast.
Common types:
- G-Counter: grow-only counter (one slot per replica; sum).
- PN-Counter: increment + decrement (two G-counters).
- G-Set: grow-only set.
- OR-Set: observed-remove set; each element tagged with unique add-ID.
- LWW-Register: timestamped scalar.
- RGA / WOOT / Yjs / Automerge: collaborative text editing.
Used in: Riak (CRDT types), Redis CRDT module, collaborative apps (Figma, Linear, Notion partly).
17.12 SSTables (Sorted String Tables)
Immutable, sorted file of key-value pairs. Internal block structure for fast lookup. Foundation of all LSM-based storage.
- Read: binary search index → seek to block → scan.
- Compression: per-block, with prefix sharing in keys.
- Bloom filter accompanies each SSTable.
17.13 RAFT Log (recap)
Log + state machine. Each entry is (term, index, command). Replicated via consensus. Apply to state machine in order.
17.14 What you should be able to derive in an interview
- Bloom filter: false positives only; used in LSM trees and caches.
- Consistent hashing: K/N keys move on resize; vnodes for balance.
- Merkle tree: efficient diff between large sets.
- HLL: count distinct in tiny memory; mergeable.
- Count-min sketch: frequency estimation; always overestimates.
Key takeaways
- Probabilistic structures trade exactness for huge memory savings.
- Bloom + consistent hashing + Merkle = the trio that makes Cassandra/Dynamo possible.
- HLL and Count-Min Sketch power real-time analytics at scale.
- These are interview gold: knowing them signals you understand how big systems work.
// 1 view