38. Design a Web Crawler

Tests: BFS at planet scale, politeness, deduplication, fault tolerance, distributed coordination.

38.1 Requirements

Functional

Crawl the web starting from seed URLs.
Discover new URLs from pages.
Store crawled content for later processing (indexing).
Respect robots.txt.
Re-crawl periodically to detect changes.

Non-functional

Crawl billions of pages.
High throughput: tens of thousands of pages/sec.
Polite: don't hammer any single host.
Fault-tolerant: pages, hosts, even datacenters fail.
Extensible: support different content types.

38.2 Estimates

100B pages / 30 days = ~40K pages/sec.
Avg page ~500 KB → ~20 GB/sec storage write rate.
Storage: 100B × 500 KB × compression × dedup = ~5-20 PB.

38.3 High-level architecture

[ URL frontier ] ─→ [ Fetcher pool ] ─→ [ Content store ]
       ↑                  │                   │
       │            [ HTML parser ]      [ Async pipeline ]
       │                  │                   │
       │            [ URL extractor ]   [ Indexer / etc. ]
       │                  │
   discovered URLs ────────┘

The frontier is the queue of "URLs to crawl." The fetcher pulls from it; new discovered URLs feed back in.

38.4 The URL frontier

A priority queue of URLs to crawl. Two key properties:

Priority (which URL next?)

Freshness (re-crawl popular pages more often).
Importance (PageRank-ish).
Crawl depth.

Politeness (don't hammer one host)

No more than 1 request per N seconds per host.
Respect Crawl-Delay from robots.txt.

Implementation

Mercator-style frontier: two layers.

Front queues by priority (e.g., 5 priority levels).
Back queues per host (FIFO).
Round-robin sample back queues so no host floods.
Front queue selection probabilistic per priority.

Distributed: shard frontier by host hash. Each shard owns a slice of hosts and is the only crawler for them.

38.5 Politeness

For each host:

Fetch robots.txt; parse (cache 24h).
Respect Disallow rules.
Respect Crawl-Delay if present, else default (e.g., 1 sec per request).
Identify with User-Agent string.

Failure to be polite gets you blocked, sued, or shamed. Real production crawlers keep per-host fetch logs.

38.6 The fetcher

For each URL:

DNS resolve (cache aggressively).
HTTP GET with timeout.
Handle redirects (max 5).
Detect content type; only continue for HTML / docs of interest.
Compute hash of content; check if we've seen this content before (skip if dup).
Store raw content.
Pass to parser.

Throughput: each fetcher handles ~100-1000 conn/sec; many fetchers in parallel.

38.7 Parser & URL extractor

Parse HTML.
Extract <a href> URLs.
Normalize: resolve relative URLs, strip fragments, canonicalize (lowercase host, sort query params, remove trailing slashes).
Filter out unwanted (mailto:, javascript:, etc.).
Dedup against "seen" set.
Add new URLs to frontier with priority.

38.8 Deduplication

Two kinds:

URL dedup

"Have I seen this URL before?"

A "seen" set, possibly billions large.
Bloom filter in front of an exact set (chapter 17). Bloom says no → definitely not seen; Bloom says yes → check exact.
Exact set: distributed KV (sharded by URL hash).

Content dedup

Pages with different URLs but same content (mirrors, syndicated copies):

Hash content (SHA-256 or shingles for near-dup).
Store content by hash.
Skip parsing if seen.

For near-duplicate detection: simhash or MinHash to detect 95% similar pages without exact equality.

38.9 Storage

Content store: object storage for raw HTML/PDF/etc. (S3, HDFS, GCS).

Indexed for downstream processing. Often in compressed columnar format (Parquet) for analytics.

Metadata DB: per-URL stats (last crawl time, status, hash, depth).

38.10 Re-crawl scheduling

Pages change at different rates. Static pages rarely; news every minute.

Adaptive: page changed since last crawl → schedule sooner; unchanged → defer.
Politeness still wins (can't hit too often).

A scheduler queries metadata DB for "pages due for re-crawl" and feeds frontier.

38.11 Distributed coordination

Many crawler nodes. Coordination needed for:

Frontier ownership (which node owns which host).
Avoid duplicate crawls.
Failure recovery (node dies → reassign hosts).

Approaches:

Hash-based partitioning of hosts (consistent hashing). Each node owns its slice.
Coordination service (ZooKeeper / etcd) for membership and reassignment.

38.12 robots.txt and ethics

Always honor.
Identify with proper User-Agent.
Respect copyright (some sites disallow archiving).
Rate limit aggressively.

38.13 Failure modes

Hung connection: timeout (e.g., 30s). Don't block forever.
Host returns errors continuously: back off; eventually drop from frontier for a day.
Parsing error: log + skip; never crash the worker.
Content too large: cap downloads (e.g., 10 MB per page).
Spider trap: page generates infinite URLs (calendar with year=X, month=Y forever). Detect via URL patterns and depth limits.

38.14 Distributed system aspects

This is a beautifully distributed problem:

Partitioning: by host.
Coordination: per-host owner.
Replication: content store replicated for durability.
Throttling: per-host politeness.
Eventual consistency: seen-set converges; some duplicates briefly OK.

38.15 Scaling tactics

Add fetcher nodes to increase aggregate request rate.
Add frontier shards when seen-set grows.
DNS caching to avoid hammering DNS.
Async I/O in fetchers (one node can hold thousands of in-flight fetches).
HTTP/2 connection reuse to same host where allowed.

38.16 Sketch for interview

[ Seed URLs ] ──→ [ URL frontier (sharded by host) ]
                            │
                  [ Fetcher pool ] (async, per-host throttle)
                            │
                   [ DNS cache ] [ robots.txt cache ]
                            │
                  [ Content hash dedup ] ─→ [ Content store (object) ]
                            │
                   [ HTML parser ] [ URL extractor ]
                            │
                   [ URL dedup (Bloom + exact) ]
                            │
                   ──→ back to frontier
                            │
                  [ Metadata DB: per-URL state ]
                  [ Re-crawl scheduler ]

38.17 What an interviewer wants

BFS over the web with priority and politeness.
Mercator-style frontier (priority + per-host queue).
Bloom + exact for URL dedup.
Content hash dedup (or simhash for near-dup).
Distributed coordination for host ownership.
robots.txt respect.

38.18 Common pitfalls

No politeness → banned.
No URL dedup → infinite work.
No content dedup → wasted storage.
No spider trap detection → frontier explodes.
Synchronous I/O in fetcher → 1000× fewer pages/sec.

Key takeaways

The frontier is a priority + politeness queue.
Bloom filters keep the seen-URL set tractable.
Content hashing dedupes mirrors.
Distributed crawling = host-partitioned ownership + coordination service.
robots.txt and politeness are non-negotiable.