system-design/design-web-crawler.md

38. Design a Web Crawler

Tests: BFS at planet scale, politeness, deduplication, fault tolerance, distributed coordination.

~4 min read·updated 5/29/2026

38. Design a Web Crawler

Tests: BFS at planet scale, politeness, deduplication, fault tolerance, distributed coordination.

38.1 Requirements

Functional

  • Crawl the web starting from seed URLs.
  • Discover new URLs from pages.
  • Store crawled content for later processing (indexing).
  • Respect robots.txt.
  • Re-crawl periodically to detect changes.

Non-functional

  • Crawl billions of pages.
  • High throughput: tens of thousands of pages/sec.
  • Polite: don't hammer any single host.
  • Fault-tolerant: pages, hosts, even datacenters fail.
  • Extensible: support different content types.

38.2 Estimates

  • 100B pages / 30 days = ~40K pages/sec.
  • Avg page ~500 KB → ~20 GB/sec storage write rate.
  • Storage: 100B × 500 KB × compression × dedup = ~5-20 PB.

38.3 High-level architecture

[ URL frontier ] ─→ [ Fetcher pool ] ─→ [ Content store ]
       ↑                  │                   │
       │            [ HTML parser ]      [ Async pipeline ]
       │                  │                   │
       │            [ URL extractor ]   [ Indexer / etc. ]
       │                  │
   discovered URLs ────────┘

The frontier is the queue of "URLs to crawl." The fetcher pulls from it; new discovered URLs feed back in.

38.4 The URL frontier

A priority queue of URLs to crawl. Two key properties:

Priority (which URL next?)

  • Freshness (re-crawl popular pages more often).
  • Importance (PageRank-ish).
  • Crawl depth.

Politeness (don't hammer one host)

  • No more than 1 request per N seconds per host.
  • Respect Crawl-Delay from robots.txt.

Implementation

Mercator-style frontier: two layers.

  • Front queues by priority (e.g., 5 priority levels).
  • Back queues per host (FIFO).
  • Round-robin sample back queues so no host floods.
  • Front queue selection probabilistic per priority.

Distributed: shard frontier by host hash. Each shard owns a slice of hosts and is the only crawler for them.

38.5 Politeness

For each host:

  • Fetch robots.txt; parse (cache 24h).
  • Respect Disallow rules.
  • Respect Crawl-Delay if present, else default (e.g., 1 sec per request).
  • Identify with User-Agent string.

Failure to be polite gets you blocked, sued, or shamed. Real production crawlers keep per-host fetch logs.

38.6 The fetcher

For each URL:

  1. DNS resolve (cache aggressively).
  2. HTTP GET with timeout.
  3. Handle redirects (max 5).
  4. Detect content type; only continue for HTML / docs of interest.
  5. Compute hash of content; check if we've seen this content before (skip if dup).
  6. Store raw content.
  7. Pass to parser.

Throughput: each fetcher handles ~100-1000 conn/sec; many fetchers in parallel.

38.7 Parser & URL extractor

  • Parse HTML.
  • Extract <a href> URLs.
  • Normalize: resolve relative URLs, strip fragments, canonicalize (lowercase host, sort query params, remove trailing slashes).
  • Filter out unwanted (mailto:, javascript:, etc.).
  • Dedup against "seen" set.
  • Add new URLs to frontier with priority.

38.8 Deduplication

Two kinds:

URL dedup

"Have I seen this URL before?"

  • A "seen" set, possibly billions large.
  • Bloom filter in front of an exact set (chapter 17). Bloom says no → definitely not seen; Bloom says yes → check exact.
  • Exact set: distributed KV (sharded by URL hash).

Content dedup

Pages with different URLs but same content (mirrors, syndicated copies):

  • Hash content (SHA-256 or shingles for near-dup).
  • Store content by hash.
  • Skip parsing if seen.

For near-duplicate detection: simhash or MinHash to detect 95% similar pages without exact equality.

38.9 Storage

Content store: object storage for raw HTML/PDF/etc. (S3, HDFS, GCS).

Indexed for downstream processing. Often in compressed columnar format (Parquet) for analytics.

Metadata DB: per-URL stats (last crawl time, status, hash, depth).

38.10 Re-crawl scheduling

Pages change at different rates. Static pages rarely; news every minute.

  • Adaptive: page changed since last crawl → schedule sooner; unchanged → defer.
  • Politeness still wins (can't hit too often).

A scheduler queries metadata DB for "pages due for re-crawl" and feeds frontier.

38.11 Distributed coordination

Many crawler nodes. Coordination needed for:

  • Frontier ownership (which node owns which host).
  • Avoid duplicate crawls.
  • Failure recovery (node dies → reassign hosts).

Approaches:

  • Hash-based partitioning of hosts (consistent hashing). Each node owns its slice.
  • Coordination service (ZooKeeper / etcd) for membership and reassignment.

38.12 robots.txt and ethics

  • Always honor.
  • Identify with proper User-Agent.
  • Respect copyright (some sites disallow archiving).
  • Rate limit aggressively.

38.13 Failure modes

  • Hung connection: timeout (e.g., 30s). Don't block forever.
  • Host returns errors continuously: back off; eventually drop from frontier for a day.
  • Parsing error: log + skip; never crash the worker.
  • Content too large: cap downloads (e.g., 10 MB per page).
  • Spider trap: page generates infinite URLs (calendar with year=X, month=Y forever). Detect via URL patterns and depth limits.

38.14 Distributed system aspects

This is a beautifully distributed problem:

  • Partitioning: by host.
  • Coordination: per-host owner.
  • Replication: content store replicated for durability.
  • Throttling: per-host politeness.
  • Eventual consistency: seen-set converges; some duplicates briefly OK.

38.15 Scaling tactics

  • Add fetcher nodes to increase aggregate request rate.
  • Add frontier shards when seen-set grows.
  • DNS caching to avoid hammering DNS.
  • Async I/O in fetchers (one node can hold thousands of in-flight fetches).
  • HTTP/2 connection reuse to same host where allowed.

38.16 Sketch for interview

[ Seed URLs ] ──→ [ URL frontier (sharded by host) ]
                            │
                  [ Fetcher pool ] (async, per-host throttle)
                            │
                   [ DNS cache ] [ robots.txt cache ]
                            │
                  [ Content hash dedup ] ─→ [ Content store (object) ]
                            │
                   [ HTML parser ] [ URL extractor ]
                            │
                   [ URL dedup (Bloom + exact) ]
                            │
                   ──→ back to frontier
                            │
                  [ Metadata DB: per-URL state ]
                  [ Re-crawl scheduler ]

38.17 What an interviewer wants

  • BFS over the web with priority and politeness.
  • Mercator-style frontier (priority + per-host queue).
  • Bloom + exact for URL dedup.
  • Content hash dedup (or simhash for near-dup).
  • Distributed coordination for host ownership.
  • robots.txt respect.

38.18 Common pitfalls

  • No politeness → banned.
  • No URL dedup → infinite work.
  • No content dedup → wasted storage.
  • No spider trap detection → frontier explodes.
  • Synchronous I/O in fetcher → 1000× fewer pages/sec.

Key takeaways

  • The frontier is a priority + politeness queue.
  • Bloom filters keep the seen-URL set tractable.
  • Content hashing dedupes mirrors.
  • Distributed crawling = host-partitioned ownership + coordination service.
  • robots.txt and politeness are non-negotiable.

// 1 view

main
UTF-8·typescript