◐ system-design/design-web-crawler.md
38. Design a Web Crawler
Tests: BFS at planet scale, politeness, deduplication, fault tolerance, distributed coordination.
~4 min read·updated 5/29/2026
38. Design a Web Crawler
Tests: BFS at planet scale, politeness, deduplication, fault tolerance, distributed coordination.
38.1 Requirements
Functional
- Crawl the web starting from seed URLs.
- Discover new URLs from pages.
- Store crawled content for later processing (indexing).
- Respect robots.txt.
- Re-crawl periodically to detect changes.
Non-functional
- Crawl billions of pages.
- High throughput: tens of thousands of pages/sec.
- Polite: don't hammer any single host.
- Fault-tolerant: pages, hosts, even datacenters fail.
- Extensible: support different content types.
38.2 Estimates
- 100B pages / 30 days = ~40K pages/sec.
- Avg page ~500 KB → ~20 GB/sec storage write rate.
- Storage: 100B × 500 KB × compression × dedup = ~5-20 PB.
38.3 High-level architecture
[ URL frontier ] ─→ [ Fetcher pool ] ─→ [ Content store ]
↑ │ │
│ [ HTML parser ] [ Async pipeline ]
│ │ │
│ [ URL extractor ] [ Indexer / etc. ]
│ │
discovered URLs ────────┘
The frontier is the queue of "URLs to crawl." The fetcher pulls from it; new discovered URLs feed back in.
38.4 The URL frontier
A priority queue of URLs to crawl. Two key properties:
Priority (which URL next?)
- Freshness (re-crawl popular pages more often).
- Importance (PageRank-ish).
- Crawl depth.
Politeness (don't hammer one host)
- No more than 1 request per N seconds per host.
- Respect
Crawl-Delayfrom robots.txt.
Implementation
Mercator-style frontier: two layers.
- Front queues by priority (e.g., 5 priority levels).
- Back queues per host (FIFO).
- Round-robin sample back queues so no host floods.
- Front queue selection probabilistic per priority.
Distributed: shard frontier by host hash. Each shard owns a slice of hosts and is the only crawler for them.
38.5 Politeness
For each host:
- Fetch robots.txt; parse (cache 24h).
- Respect
Disallowrules. - Respect
Crawl-Delayif present, else default (e.g., 1 sec per request). - Identify with User-Agent string.
Failure to be polite gets you blocked, sued, or shamed. Real production crawlers keep per-host fetch logs.
38.6 The fetcher
For each URL:
- DNS resolve (cache aggressively).
- HTTP GET with timeout.
- Handle redirects (max 5).
- Detect content type; only continue for HTML / docs of interest.
- Compute hash of content; check if we've seen this content before (skip if dup).
- Store raw content.
- Pass to parser.
Throughput: each fetcher handles ~100-1000 conn/sec; many fetchers in parallel.
38.7 Parser & URL extractor
- Parse HTML.
- Extract
<a href>URLs. - Normalize: resolve relative URLs, strip fragments, canonicalize (lowercase host, sort query params, remove trailing slashes).
- Filter out unwanted (mailto:, javascript:, etc.).
- Dedup against "seen" set.
- Add new URLs to frontier with priority.
38.8 Deduplication
Two kinds:
URL dedup
"Have I seen this URL before?"
- A "seen" set, possibly billions large.
- Bloom filter in front of an exact set (chapter 17). Bloom says no → definitely not seen; Bloom says yes → check exact.
- Exact set: distributed KV (sharded by URL hash).
Content dedup
Pages with different URLs but same content (mirrors, syndicated copies):
- Hash content (SHA-256 or shingles for near-dup).
- Store content by hash.
- Skip parsing if seen.
For near-duplicate detection: simhash or MinHash to detect 95% similar pages without exact equality.
38.9 Storage
Content store: object storage for raw HTML/PDF/etc. (S3, HDFS, GCS).
Indexed for downstream processing. Often in compressed columnar format (Parquet) for analytics.
Metadata DB: per-URL stats (last crawl time, status, hash, depth).
38.10 Re-crawl scheduling
Pages change at different rates. Static pages rarely; news every minute.
- Adaptive: page changed since last crawl → schedule sooner; unchanged → defer.
- Politeness still wins (can't hit too often).
A scheduler queries metadata DB for "pages due for re-crawl" and feeds frontier.
38.11 Distributed coordination
Many crawler nodes. Coordination needed for:
- Frontier ownership (which node owns which host).
- Avoid duplicate crawls.
- Failure recovery (node dies → reassign hosts).
Approaches:
- Hash-based partitioning of hosts (consistent hashing). Each node owns its slice.
- Coordination service (ZooKeeper / etcd) for membership and reassignment.
38.12 robots.txt and ethics
- Always honor.
- Identify with proper User-Agent.
- Respect copyright (some sites disallow archiving).
- Rate limit aggressively.
38.13 Failure modes
- Hung connection: timeout (e.g., 30s). Don't block forever.
- Host returns errors continuously: back off; eventually drop from frontier for a day.
- Parsing error: log + skip; never crash the worker.
- Content too large: cap downloads (e.g., 10 MB per page).
- Spider trap: page generates infinite URLs (calendar with year=X, month=Y forever). Detect via URL patterns and depth limits.
38.14 Distributed system aspects
This is a beautifully distributed problem:
- Partitioning: by host.
- Coordination: per-host owner.
- Replication: content store replicated for durability.
- Throttling: per-host politeness.
- Eventual consistency: seen-set converges; some duplicates briefly OK.
38.15 Scaling tactics
- Add fetcher nodes to increase aggregate request rate.
- Add frontier shards when seen-set grows.
- DNS caching to avoid hammering DNS.
- Async I/O in fetchers (one node can hold thousands of in-flight fetches).
- HTTP/2 connection reuse to same host where allowed.
38.16 Sketch for interview
[ Seed URLs ] ──→ [ URL frontier (sharded by host) ]
│
[ Fetcher pool ] (async, per-host throttle)
│
[ DNS cache ] [ robots.txt cache ]
│
[ Content hash dedup ] ─→ [ Content store (object) ]
│
[ HTML parser ] [ URL extractor ]
│
[ URL dedup (Bloom + exact) ]
│
──→ back to frontier
│
[ Metadata DB: per-URL state ]
[ Re-crawl scheduler ]
38.17 What an interviewer wants
- BFS over the web with priority and politeness.
- Mercator-style frontier (priority + per-host queue).
- Bloom + exact for URL dedup.
- Content hash dedup (or simhash for near-dup).
- Distributed coordination for host ownership.
- robots.txt respect.
38.18 Common pitfalls
- No politeness → banned.
- No URL dedup → infinite work.
- No content dedup → wasted storage.
- No spider trap detection → frontier explodes.
- Synchronous I/O in fetcher → 1000× fewer pages/sec.
Key takeaways
- The frontier is a priority + politeness queue.
- Bloom filters keep the seen-URL set tractable.
- Content hashing dedupes mirrors.
- Distributed crawling = host-partitioned ownership + coordination service.
- robots.txt and politeness are non-negotiable.
// 1 view