2. Numbers & Capacity Estimation

You will estimate in every interview. The math is trivial; what matters is *speed and confidence*. Memorize the numbers below cold.

~5 min read·updated 5/29/2026

2.1 Powers of 10

Bytes:

  • KB = 10³ bytes (1,000 bytes)
  • MB = 10⁶
  • GB = 10⁹
  • TB = 10¹²
  • PB = 10¹⁵
  • EB = 10¹⁸

Time:

  • 1 day = 86,400 ≈ 10⁵ seconds
  • 1 month (30 days) ≈ 2.6 × 10⁶ seconds
  • 1 year = ~3.15 × 10⁷ seconds

People:

  • 1 million users with 10% DAU = 100K DAU
  • DAU to QPS rule of thumb: assume each user does N actions/day, distributed unevenly with peak ~3× average.
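The DAU-to-QPS rule of thumb can be sketched as a small helper. This is a minimal illustration, not a prescribed formula: the function name `dau_to_qps` and the value of 10 actions per user per day are assumptions chosen for the example, and the 3× peak factor is the one quoted above.

```python
SECONDS_PER_DAY = 86_400


def dau_to_qps(dau, actions_per_user_per_day, peak_factor=3.0):
    """Return (average_qps, peak_qps) for a given daily active user count."""
    average_qps = dau * actions_per_user_per_day / SECONDS_PER_DAY
    return average_qps, average_qps * peak_factor


# 1M registered users at 10% DAU; 10 actions/user/day is an assumed value.
dau = 1_000_000 * 10 // 100
average_qps, peak_qps = dau_to_qps(dau, actions_per_user_per_day=10)
print(f"DAU={dau:,} avg={average_qps:.1f} QPS peak={peak_qps:.1f} QPS")
```

In an interview you would round 86,400 to 10⁵ and get roughly 10 QPS average and 30 QPS peak in your head; the exact division only confirms the rounding is safe.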

2.2 Latency numbers every programmer should know

(Jeff Dean's classic; rounded for memorization.)

| Operation | Latency |
|---|---|
| L1 cache reference | 0.5 ns |
| Branch mispredict | 5 ns |
| L2 cache reference | 7 ns |
| Mutex lock/unlock | 25 ns |
| Main memory reference | 100 ns |
| Compress 1KB with Snappy | 3 µs |
| Send 1KB over 1Gbps network | 10 µs |
| Read 4KB random from SSD | 150 µs |
| Read 1MB sequentially from memory | 250 µs |
| Round trip in same datacenter | 500 µs |
| Read 1MB sequentially from SSD | 1 ms |
| Disk seek (HDD) | 10 ms |
| Read 1MB sequentially from HDD | 20 ms |
| Round trip US ↔ Europe | 150 ms |

What this implies

  • Memory is ~100,000× faster than disk seek. Caching is the biggest lever you have.
  • A network round trip in-DC (500 µs) is 5,000× slower than memory access. Batch your RPCs.
  • Cross-continent RTT (150 ms) costs as much as reading ~600 MB sequentially from memory or ~150 MB from SSD. Don't chat with the user across the ocean.
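  • Sequential disk access is far faster than random access: an HDD streams 1 MB in 20 ms (~50 MB/s), but at 10 ms per seek it manages only ~100 random 4 KB reads per second (~0.4 MB/s), a gap of roughly 100×. Append-only logs (LSM trees, Kafka) crush random-write workloads.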
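These ratios fall straight out of the latency table, and it helps to be able to re-derive them instead of memorizing them separately. The sketch below is one way to check the arithmetic; the dictionary keys and the `ratio` helper are names invented for this example.

```python
# Latencies from the table above, all converted to nanoseconds.
US = 1_000        # nanoseconds per microsecond
MS = 1_000_000    # nanoseconds per millisecond

LATENCY_NS = {
    "main_memory_ref": 100,
    "read_1mb_memory": 250 * US,
    "dc_round_trip": 500 * US,
    "read_1mb_ssd": 1 * MS,
    "hdd_seek": 10 * MS,
    "us_europe_round_trip": 150 * MS,
}


def ratio(slow, fast):
    """How many times longer the `slow` operation takes than the `fast` one."""
    return LATENCY_NS[slow] / LATENCY_NS[fast]


print(ratio("hdd_seek", "main_memory_ref"))              # 100000.0
print(ratio("dc_round_trip", "main_memory_ref"))         # 5000.0
# Megabytes you could read sequentially during one US <-> Europe round trip:
print(ratio("us_europe_round_trip", "read_1mb_memory"))  # 600.0 (from memory)
print(ratio("us_europe_round_trip", "read_1mb_ssd"))     # 150.0 (from SSD)
```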

2.3 Storage capacity (typical 2026 hardware)

  • Single SSD: 1–8 TB common, up to 30 TB
  • Single HDD: 16–24 TB common
  • Modern server RAM: 256 GB – 4 TB
  • 1 commodity rack: ~40 servers
  • Datacenter: 10K–100K+ servers

2.4 Throughput numbers

  • 1 Gbps NIC: ~125 MB/s at line rate (slightly less after protocol overhead)
  • 10 Gbps NIC: ~1.25 GB/s
  • Sequential SSD write: ~500 MB/s (SATA) to ~3 GB/s (NVMe)
  • Random SSD read: ~10K–1M IOPS depending on tier
  • HDD throughput: ~150 MB/s sequential, ~100 IOPS random
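A common use of these throughput figures is estimating how long a bulk copy or re-replication takes. The sketch below assumes a 1 TB transfer (an illustrative size, not from the text) and uses the rates listed above; `transfer_seconds` is a hypothetical helper name.

```python
MB, GB, TB = 10**6, 10**9, 10**12


def transfer_seconds(size_bytes, bytes_per_second):
    """Time to move size_bytes at a sustained throughput, ignoring overhead."""
    return size_bytes / bytes_per_second


# Sustained rates taken from the list above.
RATES = {
    "1 Gbps NIC": 125 * MB,
    "10 Gbps NIC": 1250 * MB,
    "HDD sequential": 150 * MB,
}

for name, rate in RATES.items():
    hours = transfer_seconds(1 * TB, rate) / 3600
    print(f"{name}: {hours:.1f} h to move 1 TB")
```

Under these assumptions a single 1 Gbps link needs a little over two hours per terabyte, which is why rebuilding a failed multi-terabyte node is usually bounded by the network or the disk rather than by CPU.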

2.5 The estimation framework

Always answer in this order:

  1. Clarify the load parameter. "By 'design Twitter' do you mean 300M MAU? What's the read:write ratio?"
  2. Compute QPS. Average and peak.
  3. Compute storage (per day, per year, with replication factor).
  4. Compute bandwidth (in and out).
  5. Compute memory (for cache, indexes).
  6. Compute server count (QPS / per-server capacity).

Worked example: "Design Twitter"

Assumptions:

  • 300M MAU, 50% DAU = 150M DAU
  • Each user reads 100 tweets/day, writes 0.5 tweets/day
  • Average tweet = 200 bytes text + 200 bytes metadata = 400 bytes; 10% have media (avg 500 KB)

Write QPS (tweets):

  • 150M × 0.5 = 75M tweets/day
  • Average: 75M / 86,400 ≈ 870 tweets/sec
  • Peak (3×): ~2,600 tweets/sec → call it 3K writes/sec

Read QPS (timeline reads):

  • 150M × 100 = 15B reads/day
  • Average: 15B / 86,400 ≈ 174K reads/sec
  • Peak: ~500K reads/sec

Storage per day (text):

  • 75M × 400 bytes = 30 GB/day
  • Per year: ~11 TB
  • 5 years: ~55 TB
  • With 3× replication: ~165 TB. Fits on tens of nodes.

Storage per day (media):

  • 7.5M × 500 KB = 3.75 TB/day
  • Per year: ~1.4 PB
  • With 3× replication: ~4 PB. Use object storage (S3/GCS).
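The storage arithmetic above can be reproduced in a few lines. This is a sketch under the same assumptions as the worked example (75M tweets/day, 400 bytes of text, 10% with ~500 KB of media, 3× replication); the helper name `stored_bytes` is made up for illustration.

```python
GB, TB, PB = 10**9, 10**12, 10**15
DAYS_PER_YEAR = 365


def stored_bytes(bytes_per_day, years, replication=3):
    """Total bytes on disk after `years`, including replicas."""
    return bytes_per_day * DAYS_PER_YEAR * years * replication


tweets_per_day = 75_000_000
text_per_day = tweets_per_day * 400                # 400 bytes per tweet
media_per_day = (tweets_per_day // 10) * 500_000   # 10% carry ~500 KB of media

print(f"text/day  = {text_per_day / GB:.0f} GB")
print(f"media/day = {media_per_day / TB:.2f} TB")
print(f"text, 5 yr, 3x  = {stored_bytes(text_per_day, 5) / TB:.0f} TB")
print(f"media, 1 yr, 3x = {stored_bytes(media_per_day, 1) / PB:.1f} PB")
```

The exact text figure comes out near 164 TB instead of 165 TB because the worked example rounds the yearly total to 11 TB before multiplying. That difference is well inside the noise of the assumptions.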

Bandwidth in (writes):

  • Text: 30 GB/day = 350 KB/s — trivial
  • Media: 3.75 TB/day = 43 MB/s — also small

Bandwidth out (reads):

  • 15B reads/day, each ~5KB serialized timeline payload (cached snippets)
  • 15B × 5KB = 75 TB/day ≈ 870 MB/s ≈ 7 Gbps on average, or ~21 Gbps at 3× peak. Need multiple 10G NICs spread across multiple servers.
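Converting a daily volume into a link rate is where unit slips (bytes vs. bits) tend to happen, so a quick check is worthwhile. The sketch below uses the read volume and 5 KB payload assumed above, and applies the 3× peak factor from section 2.1; `bytes_per_day_to_gbps` is a hypothetical helper.

```python
SECONDS_PER_DAY = 86_400


def bytes_per_day_to_gbps(bytes_per_day):
    """Average rate in gigabits per second for a given daily byte volume."""
    bytes_per_second = bytes_per_day / SECONDS_PER_DAY
    return bytes_per_second * 8 / 1e9   # 8 bits per byte


reads_per_day = 15_000_000_000
payload_bytes = 5_000                   # ~5 KB serialized timeline payload
egress_per_day = reads_per_day * payload_bytes

average_gbps = bytes_per_day_to_gbps(egress_per_day)
peak_gbps = average_gbps * 3            # 3x peak-to-average factor
print(f"egress avg={average_gbps:.1f} Gbps peak={peak_gbps:.1f} Gbps")
```

The average (about 7 Gbps) would nearly fit on one 10G NIC; it is the peak (about 21 Gbps) that forces the traffic onto several NICs and servers.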

Cache memory:

  • Cache last 800 tweets per active user (covers 99% of timeline reads) × 150M DAU × ~1 KB = 120 TB. Spread across hundreds of cache nodes.
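To turn the 120 TB figure into a node count, divide by the RAM you can dedicate to cache on each node. The sketch below tries three per-node sizes drawn from the server RAM range in section 2.3; treating all of that RAM as usable is an optimistic assumption, and `cache_nodes` is a name invented for the example.

```python
import math

GB = 10**9


def cache_nodes(total_cache_bytes, usable_ram_per_node):
    """Minimum number of nodes needed to hold the cache entirely in RAM."""
    return math.ceil(total_cache_bytes / usable_ram_per_node)


# 800 tweets x 150M DAU x ~1 KB per cached tweet = 120 TB.
cache_bytes = 800 * 150_000_000 * 1_000

for ram_gb in (256, 512, 1024):
    nodes = cache_nodes(cache_bytes, ram_gb * GB)
    print(f"{ram_gb} GB/node -> {nodes} nodes")   # 469, 235, 118
```

Any of these sizes lands in the "hundreds of nodes" range before replication or headroom, which matches the estimate above.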

Common ratios to assume

| Workload | Read:write |
|---|---|
| Social feed (Twitter, IG) | 100:1 |
| Wiki / docs | 1000:1 |
| Email | 1:1 |
| Search | 10000:1 |
| Analytics ingest | 1:100 (write-heavy) |
| Logging | 1:1000 (write-heavy) |
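When the interviewer gives only a total request rate, an assumed ratio lets you split it into read and write QPS. A minimal sketch, with `split_qps` and the 101K total chosen purely for illustration:

```python
def split_qps(total_qps, read_parts, write_parts):
    """Split a total request rate into (read_qps, write_qps) by ratio."""
    total_parts = read_parts + write_parts
    read_qps = total_qps * read_parts / total_parts
    write_qps = total_qps * write_parts / total_parts
    return read_qps, write_qps


# Social feed at 100:1, with an assumed total of 101K requests/sec.
read_qps, write_qps = split_qps(101_000, read_parts=100, write_parts=1)
print(read_qps, write_qps)   # 100000.0 1000.0
```

The worked Twitter example comes out closer to 200:1 (15B reads against 75M writes per day), a reminder that these ratios are starting points to state and then adjust.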

2.6 Server count math

A modern web server with a typical stack handles:

  • ~1,000–10,000 QPS for simple GETs (cache-served, no DB)
  • ~100–1,000 QPS for typical app endpoints
  • ~10–100 QPS for complex DB-bound endpoints

Postgres on a strong box: ~10K–50K simple reads/sec, ~1K–10K writes/sec. Past that, add read replicas, then shard.

Redis: ~100K ops/sec per node. Cluster scales horizontally.

Kafka: ~1M messages/sec per broker (small messages, batched).
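Server count is then a ceiling division of peak QPS by per-server capacity. The sketch below adds a target utilization of 70% so the fleet is not sized to run flat out; that headroom figure and the name `servers_needed` are assumptions for this example, not numbers from the text.

```python
import math


def servers_needed(peak_qps, per_server_qps, target_utilization=0.7):
    """Servers required so each one runs at or below the target utilization."""
    return math.ceil(peak_qps / (per_server_qps * target_utilization))


# 500K peak reads/sec against app servers that each handle ~1,000 QPS.
print(servers_needed(500_000, 1_000))                          # 715
print(servers_needed(500_000, 1_000, target_utilization=1.0))  # 500
```

The same function applies to databases: 3K peak writes/sec sits inside the 1K–10K range quoted for a single strong Postgres box, so writes alone would not force sharding in this example.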

2.7 Quick rules-of-thumb

  • Cache hit ratio. If 80% hit, only 20% of QPS reaches DB.
  • Replication factor 3. The standard default for distributed storage systems (Cassandra, GFS, HDFS).
  • Hot keys are 80/20. Top 20% of keys get 80% of traffic; top 1% can get 50%.
  • Image sizes. Thumbnail ~10 KB, web image ~200 KB, photo ~2 MB, raw ~30 MB.
  • Video. 1080p ~5 Mbps, 4K ~25 Mbps.
  • Text. Average tweet 200 chars. Average web page 2 MB (75% images/JS).
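The cache hit ratio rule is the one most worth turning into arithmetic, because small changes in hit ratio move database load a lot. A minimal sketch, reusing the ~500K peak reads/sec from the worked example; `db_qps` is a hypothetical helper name.

```python
def db_qps(total_qps, cache_hit_ratio):
    """Requests per second that miss the cache and reach the database."""
    return total_qps * (1.0 - cache_hit_ratio)


for hit in (0.80, 0.95, 0.99):
    print(f"hit={hit:.0%} -> {db_qps(500_000, hit):,.0f} QPS reach the DB")
```

At an 80% hit ratio, 100K QPS still reaches the database, above the 10K–50K simple reads/sec quoted for a single Postgres box. Going from 95% to 99% cuts database load by 5×.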

2.8 Show your work

The interviewer doesn't want exact numbers; they want reasoning. Always:

  1. State assumptions out loud.
  2. Round to easy numbers (use 100K seconds per day, not 86,400).
  3. Sanity check ("~4 PB a year of replicated media sounds plausible for global Twitter").
  4. Update the design once you see a number is too big or too small.

If you say "3M QPS hits Postgres" without flinching, you have failed. Numbers must drive architecture.

Key takeaways

  • Memorize Jeff Dean's latency numbers.
  • Estimate in this order: QPS → storage → bandwidth → memory → server count.
  • Read-heavy systems get caches first; write-heavy systems get queues and partitioning first.
  • State assumptions and round aggressively.
