system-design/design-url-shortener.md

32. Design a URL Shortener (TinyURL / bit.ly)

The classic warm-up. Easy to start, deceptively rich. Gets at: ID generation, KV stores, caching, redirect mechanics, analytics fan-out.

~4 min read·updated 5/29/2026

32. Design a URL Shortener (TinyURL / bit.ly)

The classic warm-up. Easy to start, deceptively rich. Gets at: ID generation, KV stores, caching, redirect mechanics, analytics fan-out.

32.1 Requirements

Functional

  • Shorten a long URL → return a short URL.
  • Visit short URL → redirect to original.
  • Custom aliases (optional).
  • Analytics: clicks, geo, referer.
  • Optional: expiration, TTL, password-protect.

Non-functional

  • High availability (read-heavy: ~100:1).
  • Low latency redirect (< 50 ms p99).
  • Durable storage of mappings.
  • Scale: 100M new URLs/year, 10B redirects/day.

32.2 Capacity estimate

  • 100M new URLs/year ÷ 86,400 × 365 ≈ 3 writes/sec average; peak ~10/sec.
  • 10B redirects/day ÷ 86,400 ≈ 116K reads/sec; peak ~500K reads/sec.
  • Read:write = ~10,000:1.

Storage:

  • 100M new × 5 years = 500M URLs.
  • ~500 bytes per record (long URL avg 100 bytes + meta) → 250 GB. Easy.

32.3 The short ID

A short URL tinyurl.com/abc123 needs a unique ID for abc123. How do we generate it?

Option A: Hash the long URL

md5(url) → first 7 chars (base62) → 62^7 ≈ 3.5 trillion possible IDs.

Pros: deterministic — same URL → same short URL (dedup for free). Cons: collisions possible (rare, but must handle); custom aliases conflict.

Option B: Auto-increment counter, encode base62

Counter goes 1, 2, 3 ... encoded as 1, 2, ..., a, b, ..., zz etc.

Pros: simple, sequential, no collisions. Cons: predictable (security/scrape risk); single-point counter is bottleneck — sharded counter needed.

Option C: Distributed ID generator

Twitter Snowflake-style: 64-bit ID = timestamp + machine ID + sequence. Or KGS (Key Generation Service) that pre-generates and caches a pool of unused IDs per app server.

Pros: no central counter; high throughput. Cons: longer than minimum possible.

Recommended

Hash + collision retry, OR KGS pre-allocated pool of base62 codes. Use KGS for production: hands out ranges (e.g., 10K codes) to app servers; codes < 8 chars guaranteed.

32.4 Storage

Single KV store: code → {long_url, user_id, created_at, expires_at}.

Choices:

  • Postgres (or MySQL): plenty for 500M rows; great consistency; strong indexes.
  • DynamoDB / Bigtable / Cassandra: trivially scalable; eventual consistency OK.
  • Redis: only as cache, not source of truth.

Use Postgres + Redis cache for the workload size given. Beyond ~5B rows, scale to DynamoDB / Bigtable.

Schema:

CREATE TABLE urls ( code TEXT PRIMARY KEY, long_url TEXT NOT NULL, user_id BIGINT, created_at TIMESTAMPTZ DEFAULT NOW(), expires_at TIMESTAMPTZ ); CREATE INDEX urls_user_idx ON urls(user_id);

32.5 Read path (the redirect)

Hot path: 116K req/sec average. Optimize aggressively.

GET /abc123
  → Cache lookup (Redis): code → long_url (TTL: 1 day)
  → Cache hit (target ~99%): return 301/302
  → Cache miss: Postgres lookup → set in cache → return
  → Not found: 404

301 vs 302 redirect

  • 301 Permanent: browsers cache; subsequent visits don't even hit your server. Good for stable URLs; bad for analytics (you don't see the click).
  • 302 Found / 307 Temp: every visit hits your server; you get analytics.

For analytics-driven products: 302. For pure redirect / max performance: 301 with separate beacon.

CDN

Put a CDN in front. For 301s, CDN can cache the redirect. Origin only sees uncached or expired entries.

32.6 Write path (shorten)

POST /shorten { long_url }
  → Validate URL.
  → Check existing (hash lookup if dedup desired).
  → Generate code (KGS pool or hash).
  → INSERT into Postgres.
  → Populate cache.
  → Return short URL.

Async work:

  • Crawl long URL for title/meta (off the hot path).
  • Safe-browsing check (Google Safe Browsing API).

32.7 Custom aliases

User specifies code: tinyurl.com/launch. Steps:

  • Validate (allowed chars, length).
  • Check uniqueness.
  • INSERT with that code.

Reserve a "vanity codes" namespace separate from generated codes to avoid conflict.

32.8 Analytics

Each click is a row in an event store. At 500K/sec peak, batch:

Redirect → log event to Kafka (`url_clicks` topic)
  → Stream processor aggregates (per-code, per-hour counters)
  → Writes to ClickHouse / BigQuery for queries
  → Async job summarizes geo, referer, UA

Don't write per-click to Postgres — that kills your DB. Kafka handles the spike; downstream digests.

32.9 Sharding

For 500M rows, single Postgres is fine. At 50B rows, shard by code (hash to N shards).

Code generation: each shard pre-allocates from KGS so codes don't collide across shards.

32.10 Caching strategy

  • L1 in-process LRU for top-N hottest codes (10K codes).
  • L2 distributed Redis for warm set (TTL 24h).
  • CDN edge cache for 301 redirects.

For typical click distribution (Zipf): top 1% of codes get ~50% of clicks. Caching covers them effortlessly.

32.11 Expiration / TTL

For URLs with expires_at:

  • On read: check expiry; serve 410 Gone if past.
  • Background job: GC expired records.
  • Or: TTL in Cassandra / DynamoDB does it natively.

32.12 Security

  • Phishing: validate against Google Safe Browsing on shorten and periodically.
  • Open redirect: that's literally the product, but warn + tag suspect URLs.
  • Rate limit per IP / API key on shorten.
  • Block known bad domains.
  • CAPTCHA on suspicious patterns.

32.13 Multi-region

For global low-latency reads:

  • Replicate Redis cache to each region.
  • Postgres single-leader (writes still cross-region) or Spanner-style distributed SQL.
  • DNS-based geo routing OR anycast.
  • Eventual consistency across regions for analytics is OK.

32.14 Failure modes

  • Cache cluster down: fall back to Postgres; latency rises but service continues.
  • Postgres primary down: read replicas serve reads (eventual consistency briefly).
  • KGS down: app servers can serve from local pool; replenish when back.
  • Hot URL (a celebrity tweets a link): CDN absorbs; no origin impact.

32.15 Estimates revisited

  • 116K reads/sec average × 99% cache hit = ~1.2K Postgres reads/sec. Trivial.
  • 500K peak × 99% cache hit = 5K reads/sec. Still fine.
  • Click events at 500K/sec to Kafka: needs ~5-10 brokers depending on size.

32.16 What an interviewer wants

  • ID generation strategy with reasoning (hash vs counter vs KGS).
  • Cache + Postgres + CDN architecture.
  • 301 vs 302 trade-off.
  • Analytics off the hot path via Kafka.
  • Sharding plan if asked to scale.

Key takeaways

  • ID generation is the central decision: KGS or hash-with-retry.
  • Read-heavy → cache aggressively + CDN.
  • Analytics async via event log; never on the redirect path.
  • Postgres is enough until tens of billions of rows; then shard or Cassandra/DDB.
  • 302 + Kafka for analytics; 301 + CDN for raw speed.

// 1 view

main
UTF-8·typescript