17. Probabilistic Structures: Bloom, Consistent Hashing, Merkle, HLL, Count-Min

A handful of clever data structures show up everywhere in distributed systems. Knowing them is interview table stakes.

17.1 Bloom Filter

Question: "Is X in this huge set?" Bloom filter answers: "definitely not" or "probably yes."

How it works

Bit array of size m (initially all zero).
k independent hash functions.
Insert(x): set bits at h1(x), h2(x), ..., hk(x).
Query(x): check those bits; if any is 0, x is definitely absent. If all are 1, x is probably present (false positives possible).

Properties

No false negatives.
False positive rate: ~(1 - e^(-kn/m))^k. For 1% FPR with n=1M, m≈10MB.
Insert and query are O(k), independent of n.
Cannot delete from a standard Bloom filter (would corrupt). Counting Bloom filter (each "bit" is a small counter) supports delete.

Where you'll see it

LSM trees: each SSTable has a Bloom filter; reads skip files that "definitely don't" contain the key.
Caches: pre-check before going to backing store ("cache penetration" defense).
Web crawlers: "have I seen this URL?"
Distributed systems: deduplication of seen events.
Spell checkers: word in dictionary?

Variants

Counting Bloom filter (deletes).
Cuckoo filter (deletes + slightly better FPR; more complex).
Quotient filter (cache-friendly).

17.2 Consistent Hashing

Problem: distribute keys across N nodes such that adding/removing a node moves few keys.

Naive hash(key) mod N fails: change N, every key remaps.

How it works

Place all nodes on a "ring" by hashing their identifiers (e.g., name + IP).
For each key: hash to a point on the ring; assign to the next node clockwise.
Add a node: it takes over keys from its successor — only ~K/N keys move.
Remove a node: its keys go to the next node — only those keys move.

Virtual nodes (vnodes)

Each physical node gets multiple positions on the ring (e.g., 100-200 vnodes per node). Smooths out load (otherwise random position assignment causes large imbalances).

Where it shines

Memcached / Redis client-side sharding.
Cassandra / Dynamo: data partitioning + replication (replicate to next R nodes clockwise).
CDN cache servers.
Load balancers with sticky-by-key routing.

Trade-off

Solves rebalancing pain.
Doesn't solve hot keys: if one key is super-hot, it still lives on one (set of) node(s).

Rendezvous (HRW) hashing

Alternative: for each key, compute hash(key, node) for every node; pick the node with the highest score. Same K/N rebalancing property; doesn't need a ring; weights nodes naturally. Often simpler to implement.

17.3 Merkle Trees

A binary tree of hashes. Leaves = hashes of data blocks; each internal node = hash of its children.

Properties

Root hash uniquely identifies the entire data set.
Detect difference between two trees in O(log N) by comparing roots, then recursing only on differing branches.

Used in

Git: every commit, tree, and blob is hashed; commit history is Merkle DAG.
Blockchains: Bitcoin block headers contain a Merkle root of all transactions.
Anti-entropy in Cassandra / Dynamo: replicas exchange Merkle trees to find diffs and resync only the differing chunks.
Filesystems: ZFS, IPFS use Merkle DAGs for integrity.
Synchronization protocols: rsync-like diffing.

Implementation note

For varying-size data, balance leaves carefully. Otherwise depth differs and comparison breaks.

17.4 HyperLogLog (HLL)

Question: count distinct items in a stream/large set with limited memory.

Exact COUNT(DISTINCT x) requires storing every item — infeasible at billions.

How it works (simplified)

Hash each input to a bit string.
Track the longest run of leading zeros seen across all hashes.
The longer the run, the more distinct items you've probably seen (rare events imply large universe).
Use multiple "buckets" by partitioning hash bits and average for accuracy.

Properties

Memory: ~12 KB to estimate cardinalities up to billions with ~2% error.
Mergeable: union of HLLs is straightforward (max per bucket).
O(1) update.

Used in

Redis (PFADD, PFCOUNT, PFMERGE).
BigQuery (APPROX_COUNT_DISTINCT).
Druid, Presto, Snowflake, ClickHouse.
Real-time analytics: unique users per ad campaign per minute, across many shards.

Gotcha

For small cardinalities (< ~100), the estimate can be inaccurate. HLL++ (Google's improvement) corrects with linear counting at low cardinalities.

17.5 Count-Min Sketch

Question: estimate frequency of items in a stream.

Exact frequency table = O(distinct items) memory. Count-min uses fixed memory.

How it works

2D array of d rows × w columns. d hash functions.
Increment(x): for each row i, increment arr[i][hi(x)].
Query(x): return min over rows of arr[i][hi(x)].

Properties

Always overestimates (collisions inflate counts), never underestimates.
Error bound: with d = ln(1/δ), w = e/ε, P(error > εN) < δ. Tunable.
Tiny memory (KBs to handle gigabytes of input).

Used in

Heavy-hitters detection: top-K most-frequent items.
Network monitoring: detect DDoS sources without per-IP counters.
Stream analytics: trending searches.

Variants

Conservative update: only increment min cells; better accuracy.
Top-K sketch: maintain heap of top items, with frequencies via CMS.

17.6 Skip List

Probabilistic alternative to balanced BSTs. Layered linked lists; higher layers skip more nodes.

Search/insert/delete: O(log N) expected.
Used in: Redis sorted sets (ZSET) internals; LevelDB / RocksDB memtables; many lock-free concurrent structures.

Why over a tree: easier to implement lock-free; simpler concurrent updates.

17.7 Trie

Tree where edges represent characters. Useful for prefix queries.

Autocomplete: type "go" → traverse to "go" node; emit completions.
IP routing: trie of IP prefixes (BGP routers use compressed tries).
Spell check: check word existence + suggest.

Variants:

Compressed / radix trie (PATRICIA): collapse single-child chains.
Suffix trie / suffix array: substring queries; used in bioinformatics.

17.8 Geohash / S2 / H3

(See chapter 35 — Uber.) Encode lat/lon to a 1D string/integer that preserves locality (nearby locations → similar codes).

Geohash (alphanumeric): coarse but easy.
S2 (Google): hierarchical, spherical (doesn't distort at poles).
H3 (Uber): hexagonal cells; uniform distance; no border discontinuities.

17.9 LSM Tree (recap from chapter 5)

Yes, it's a "data structure." The combination of memtable (skip list) + SSTables (sorted runs on disk) + Bloom filters + compaction.

The LSM is the unsung hero of modern distributed storage.

17.10 Vector Clocks (recap from chapter 14)

Map of node ID → counter. Tracks causality across distributed events.

17.11 CRDTs (Conflict-Free Replicated Data Types)

Data structures where all replicas converge regardless of order. Two flavors:

State-based (CvRDT): each replica sends its state; merge function is commutative, associative, idempotent.
Operation-based (CmRDT): replicas send operations; reliable causal broadcast.

Common types:

G-Counter: grow-only counter (one slot per replica; sum).
PN-Counter: increment + decrement (two G-counters).
G-Set: grow-only set.
OR-Set: observed-remove set; each element tagged with unique add-ID.
LWW-Register: timestamped scalar.
RGA / WOOT / Yjs / Automerge: collaborative text editing.

Used in: Riak (CRDT types), Redis CRDT module, collaborative apps (Figma, Linear, Notion partly).

17.12 SSTables (Sorted String Tables)

Immutable, sorted file of key-value pairs. Internal block structure for fast lookup. Foundation of all LSM-based storage.

Read: binary search index → seek to block → scan.
Compression: per-block, with prefix sharing in keys.
Bloom filter accompanies each SSTable.

17.13 RAFT Log (recap)

Log + state machine. Each entry is (term, index, command). Replicated via consensus. Apply to state machine in order.

17.14 What you should be able to derive in an interview

Bloom filter: false positives only; used in LSM trees and caches.
Consistent hashing: K/N keys move on resize; vnodes for balance.
Merkle tree: efficient diff between large sets.
HLL: count distinct in tiny memory; mergeable.
Count-min sketch: frequency estimation; always overestimates.

Key takeaways

Probabilistic structures trade exactness for huge memory savings.
Bloom + consistent hashing + Merkle = the trio that makes Cassandra/Dynamo possible.
HLL and Count-Min Sketch power real-time analytics at scale.
These are interview gold: knowing them signals you understand how big systems work.