2. Numbers & Capacity Estimation
You will estimate in every interview. The math is trivial; what matters is *speed and confidence*. Memorize the numbers below cold.
~5 min read·updated 5/29/2026
2.1 Powers of 10
Bytes:
- KB = 10³ = 1,000 bytes
- MB = 10⁶
- GB = 10⁹
- TB = 10¹²
- PB = 10¹⁵
- EB = 10¹⁸
Time:
- 1 day = 86,400 ≈ 10⁵ seconds
- 1 month (30 days) ≈ 2.6 × 10⁶ seconds
- 1 year = ~3.15 × 10⁷ seconds
People:
- 1 million users with 10% DAU = 100K DAU
- DAU to QPS rule of thumb: if each user performs N actions/day, average QPS ≈ DAU × N / 86,400. Traffic is uneven, so assume peak ≈ 3× average.
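The DAU-to-QPS rule of thumb can be sketched as a small helper. The function name `estimate_qps`, the default peak factor of 3, and the 20 actions per user in the example are illustrative assumptions, not fixed conventions.

```python
SECONDS_PER_DAY = 86_400  # exact; round to 10**5 for mental math


def estimate_qps(dau, actions_per_user_per_day, peak_factor=3.0):
    """Return (average_qps, peak_qps) for a daily active user count."""
    average = dau * actions_per_user_per_day / SECONDS_PER_DAY
    return average, average * peak_factor


# 1M users at 10% DAU = 100K DAU; 20 actions/user/day is an assumed figure.
avg_qps, peak_qps = estimate_qps(100_000, 20)
print(f"average ~{avg_qps:.0f} QPS, peak ~{peak_qps:.0f} QPS")
```

In an interview you would round 86,400 up to 10⁵ and do the same division in your head.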
2.2 Latency numbers every programmer should know
(Jeff Dean's classic; rounded for memorization.)
| Operation | Latency |
|---|---|
| L1 cache reference | 0.5 ns |
| Branch mispredict | 5 ns |
| L2 cache reference | 7 ns |
| Mutex lock/unlock | 25 ns |
| Main memory reference | 100 ns |
| Compress 1KB with Snappy | 3 µs |
| Send 1KB over 1Gbps network | 10 µs |
| Read 4KB random from SSD | 150 µs |
| Read 1MB sequentially from memory | 250 µs |
| Round trip in same datacenter | 500 µs |
| Read 1MB sequentially from SSD | 1 ms |
| Disk seek (HDD) | 10 ms |
| Read 1MB sequentially from HDD | 20 ms |
| Round trip US ↔ Europe | 150 ms |
What this implies
- Memory is ~100,000× faster than disk seek. Caching is the biggest lever you have.
- A network round trip in-DC (500 µs) is 5,000× slower than memory access. Batch your RPCs.
- Cross-continent RTT (150 ms) costs as much as reading ~600 MB sequentially from memory, or ~150 MB from SSD. Don't chat with the user across the ocean.
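- Sequential disk is far faster than random disk: an HDD sustains ~150 MB/s sequentially, but at ~100 random IOPS with 4 KB reads it delivers only ~0.4 MB/s. Append-only logs (LSM trees, Kafka) crush random-write workloads.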
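The ratios in this list can be recomputed directly from the latency table. The following is a minimal sketch; the dictionary keys and the `ratio` helper are hypothetical names chosen for illustration.

```python
# Latencies from the table above, in nanoseconds.
LATENCY_NS = {
    "main_memory_ref": 100,
    "read_1mb_memory": 250_000,
    "dc_round_trip": 500_000,
    "read_1mb_ssd": 1_000_000,
    "hdd_seek": 10_000_000,
    "us_europe_round_trip": 150_000_000,
}


def ratio(slow, fast):
    """How many times slower `slow` is than `fast`."""
    return LATENCY_NS[slow] / LATENCY_NS[fast]


seek_vs_memory = ratio("hdd_seek", "main_memory_ref")      # 100,000x
rtt_vs_memory = ratio("dc_round_trip", "main_memory_ref")  # 5,000x

# Megabytes readable sequentially during one US-Europe round trip.
mb_from_memory_per_rtt = ratio("us_europe_round_trip", "read_1mb_memory")  # 600 MB
mb_from_ssd_per_rtt = ratio("us_europe_round_trip", "read_1mb_ssd")        # 150 MB
```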
2.3 Storage capacity (typical 2026 hardware)
- Single SSD: 1–8 TB common, up to 30 TB
- Single HDD: 16–24 TB common
- Modern server RAM: 256 GB – 4 TB
- 1 commodity rack: ~40 servers
- Datacenter: 10K–100K+ servers
2.4 Throughput numbers
- 1 Gbps NIC: ~125 MB/s at line rate (slightly less after protocol overhead)
- 10 Gbps NIC: ~1.25 GB/s
- Sequential SSD write: ~500 MB/s – 3 GB/s (NVMe)
- Random SSD read: ~10K–1M IOPS depending on tier
- HDD throughput: ~150 MB/s sequential, ~100 IOPS random
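A common use of these throughput figures is estimating how long a bulk copy takes. The sketch below assumes the link or disk is the only bottleneck and ignores protocol overhead; `transfer_seconds` and the device keys are hypothetical names.

```python
# Approximate sustained throughput in bytes per second, from the list above.
THROUGHPUT_BYTES_PER_SEC = {
    "nic_1gbps": 125e6,
    "nic_10gbps": 1.25e9,
    "hdd_sequential": 150e6,
}

TB = 1e12


def transfer_seconds(num_bytes, device):
    """Seconds to move `num_bytes` at the device's sustained throughput."""
    return num_bytes / THROUGHPUT_BYTES_PER_SEC[device]


hours_over_1g = transfer_seconds(1 * TB, "nic_1gbps") / 3600    # ~2.2 hours
minutes_over_10g = transfer_seconds(1 * TB, "nic_10gbps") / 60  # ~13 minutes
hours_from_hdd = transfer_seconds(1 * TB, "hdd_sequential") / 3600  # ~1.9 hours
```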
2.5 The estimation framework
Always answer in this order:
- Clarify the load parameter. "By 'design Twitter' do you mean 300M MAU? What's the read:write ratio?"
- Compute QPS. Average and peak.
- Compute storage (per day, per year, with replication factor).
- Compute bandwidth (in and out).
- Compute memory (for cache, indexes).
- Compute server count (QPS / per-server capacity).
Worked example: "Design Twitter"
Assumptions:
- 300M MAU, 50% DAU = 150M DAU
- Each user reads 100 tweets/day, writes 0.5 tweets/day
- Average tweet = 200 bytes text + 200 bytes metadata = 400 bytes; 10% have media (avg 500 KB)
Write QPS (tweets):
- 150M × 0.5 = 75M tweets/day
- Average: 75M / 86,400 ≈ 870 tweets/sec
- Peak (3×): ~2,600 tweets/sec → call it 3K writes/sec
Read QPS (timeline reads):
- 150M × 100 = 15B reads/day
- Average: 15B / 86,400 ≈ 174K reads/sec
- Peak: ~500K reads/sec
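The write and read QPS figures above can be checked with a few lines of arithmetic. This is only a sketch of the calculation under the stated assumptions; the variable names are illustrative.

```python
SECONDS_PER_DAY = 86_400
PEAK_FACTOR = 3  # peak ~3x average, as assumed earlier

dau = 150_000_000            # 300M MAU at 50% DAU
writes_per_day = dau * 0.5   # 0.5 tweets per user per day
reads_per_day = dau * 100    # 100 tweet reads per user per day

avg_write_qps = writes_per_day / SECONDS_PER_DAY  # ~870
avg_read_qps = reads_per_day / SECONDS_PER_DAY    # ~174K
peak_write_qps = avg_write_qps * PEAK_FACTOR      # ~2.6K
peak_read_qps = avg_read_qps * PEAK_FACTOR        # ~520K

print(f"writes: avg {avg_write_qps:,.0f}/s, peak {peak_write_qps:,.0f}/s")
print(f"reads:  avg {avg_read_qps:,.0f}/s, peak {peak_read_qps:,.0f}/s")
```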
Storage per day (text):
- 75M × 400 bytes = 30 GB/day
- Per year: ~11 TB
- 5 years: ~55 TB
- With 3× replication: ~165 TB. Fits on tens of nodes.
Storage per day (media):
- 7.5M × 500 KB = 3.75 TB/day
- Per year: ~1.4 PB
- With 3× replication: ~4 PB. Use object storage (S3/GCS).
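The storage totals can be reproduced the same way. The sketch below uses decimal units (as in section 2.1) and the assumptions from the worked example; the names are illustrative.

```python
KB, GB, TB, PB = 10**3, 10**9, 10**12, 10**15
REPLICATION = 3

tweets_per_day = 75_000_000
text_bytes_per_day = tweets_per_day * 400               # 30 GB/day
media_bytes_per_day = tweets_per_day // 10 * 500 * KB   # 10% have media: 3.75 TB/day

text_5y_replicated = text_bytes_per_day * 365 * 5 * REPLICATION   # ~165 TB
media_1y_replicated = media_bytes_per_day * 365 * REPLICATION     # ~4 PB

print(f"text, 5 years, 3x:  {text_5y_replicated / TB:.0f} TB")
print(f"media, 1 year, 3x:  {media_1y_replicated / PB:.1f} PB")
```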
Bandwidth in (writes):
- Text: 30 GB/day = 350 KB/s — trivial
- Media: 3.75 TB/day = 43 MB/s — also small
Bandwidth out (reads):
- 15B reads/day, each ~5KB serialized timeline payload (cached snippets)
- 15B × 5KB = 75 TB/day = 870 MB/s = ~7 Gbps. Need multiple 10G NICs / multiple servers.
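Converting daily volume to bandwidth is a single division by 86,400, plus a factor of 8 for bits. A minimal sketch, with `daily_bytes_to_rate` as a hypothetical helper name:

```python
SECONDS_PER_DAY = 86_400


def daily_bytes_to_rate(bytes_per_day):
    """Return (bytes_per_second, bits_per_second) averaged over a day."""
    bytes_per_sec = bytes_per_day / SECONDS_PER_DAY
    return bytes_per_sec, bytes_per_sec * 8


text_in, _ = daily_bytes_to_rate(30e9)        # ~347 KB/s
media_in, _ = daily_bytes_to_rate(3.75e12)    # ~43 MB/s
reads_out_bytes, reads_out_bits = daily_bytes_to_rate(15e9 * 5e3)  # ~868 MB/s, ~6.9 Gbps
```

These are daily averages; applying the 3× peak factor to the read path gives roughly 21 Gbps at peak.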
Cache memory:
- Cache last 800 tweets per active user (covers 99% of timeline reads) × 150M DAU × ~1 KB = 120 TB. Spread across hundreds of cache nodes.
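To turn the cache size into a node count, divide by usable RAM per node. The 256 GB per node below is an assumption taken from the low end of the server RAM range in section 2.3; real deployments reserve some of that for overhead.

```python
import math

GB, TB = 10**9, 10**12

tweets_cached_per_user = 800
dau = 150_000_000
bytes_per_cached_tweet = 1_000  # ~1 KB

cache_bytes = tweets_cached_per_user * dau * bytes_per_cached_tweet  # 120 TB

ram_per_node = 256 * GB  # assumed usable cache RAM per node
cache_nodes = math.ceil(cache_bytes / ram_per_node)

print(f"{cache_bytes / TB:.0f} TB of cache across {cache_nodes} nodes")
```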
Common ratios to assume
| Workload | Read:write |
|---|---|
| Social feed (Twitter, IG) | 100:1 |
| Wiki / docs | 1000:1 |
|  | 1:1 |
| Search | 10000:1 |
| Analytics ingest | 1:100 (write-heavy) |
| Logging | 1:1000 (write-heavy) |
2.6 Server count math
A modern web server with a typical stack handles:
- ~1,000–10,000 QPS for simple GETs (cache-served, no DB)
- ~100–1,000 QPS for typical app endpoints
- ~10–100 QPS for complex DB-bound endpoints
Postgres on a strong box: ~10K–50K simple reads/sec, ~1K–10K writes/sec. Past that, add read replicas, then shard.
Redis: ~100K ops/sec per node. Cluster scales horizontally.
Kafka: ~1M messages/sec per broker (small messages, batched).
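Server count is peak QPS divided by per-server capacity, rounded up. The sketch below applies that to the Twitter read path using capacities from the ranges above; `servers_needed` is a hypothetical helper, and it deliberately leaves out headroom and redundancy.

```python
import math


def servers_needed(peak_qps, per_server_qps):
    """Minimum servers to absorb peak_qps, with no headroom."""
    return math.ceil(peak_qps / per_server_qps)


peak_read_qps = 500_000  # from the Twitter worked example

app_servers = servers_needed(peak_read_qps, 1_000)     # typical app endpoint at 1K QPS
redis_nodes = servers_needed(peak_read_qps, 100_000)   # ~100K ops/sec per Redis node

print(f"{app_servers} app servers, {redis_nodes} Redis nodes")
```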
2.7 Quick rules-of-thumb
- Cache hit ratio. If 80% hit, only 20% of QPS reaches DB.
- Replication factor 3. Standard for distributed storage systems (Cassandra, GFS, HDFS).
- Hot keys are 80/20. Top 20% of keys get 80% of traffic; top 1% can get 50%.
- Image sizes. Thumbnail ~10 KB, web image ~200 KB, photo ~2 MB, raw ~30 MB.
- Video. 1080p ~5 Mbps, 4K ~25 Mbps.
- Text. Average tweet 200 chars. Average web page 2 MB (75% images/JS).
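The cache hit ratio rule is worth computing explicitly, because small changes in hit ratio cause large changes in database load. A minimal sketch, with `db_qps` as a hypothetical helper and the 500K QPS figure borrowed from the Twitter example:

```python
def db_qps(total_qps, cache_hit_ratio):
    """QPS that misses the cache and reaches the database."""
    return total_qps * (1 - cache_hit_ratio)


total = 500_000
at_80 = db_qps(total, 0.80)  # ~100K QPS reaches the DB
at_99 = db_qps(total, 0.99)  # ~5K QPS reaches the DB

print(f"80% hit: ~{at_80:,.0f} QPS to DB; 99% hit: ~{at_99:,.0f} QPS to DB")
```

Going from 80% to 99% hits cuts database load by about 20×, which can be the difference between a sharded cluster and a single primary with replicas.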
2.8 Show your work
The interviewer doesn't want exact numbers; they want reasoning. Always:
- State assumptions out loud.
- Round to easy numbers (use 100K not 86,400).
- Sanity check ("~4 PB a year sounds right for global Twitter media with replication").
- Update the design once you see a number is too big or too small.
If you say "3M QPS hits Postgres" without flinching, you have failed. Numbers must drive architecture.
Key takeaways
- Memorize Jeff Dean's latency numbers.
- Estimate in this order: QPS → storage → bandwidth → memory → server count.
- Read-heavy systems get caches first; write-heavy systems get queues and partitioning first.
- State assumptions and round aggressively.