32. Design a URL Shortener (TinyURL / bit.ly)

The classic warm-up. Easy to start, deceptively rich. Gets at: ID generation, KV stores, caching, redirect mechanics, analytics fan-out.

32.1 Requirements

Functional

Shorten a long URL → return a short URL.
Visit short URL → redirect to original.
Custom aliases (optional).
Analytics: clicks, geo, referer.
Optional: expiration, TTL, password-protect.

Non-functional

High availability (read-heavy: ~100:1).
Low latency redirect (< 50 ms p99).
Durable storage of mappings.
Scale: 100M new URLs/year, 10B redirects/day.

32.2 Capacity estimate

100M new URLs/year ÷ 86,400 × 365 ≈ 3 writes/sec average; peak ~10/sec.
10B redirects/day ÷ 86,400 ≈ 116K reads/sec; peak ~500K reads/sec.
Read:write = ~10,000:1.

Storage:

100M new × 5 years = 500M URLs.
~500 bytes per record (long URL avg 100 bytes + meta) → 250 GB. Easy.

32.3 The short ID

A short URL tinyurl.com/abc123 needs a unique ID for abc123. How do we generate it?

Option A: Hash the long URL

md5(url) → first 7 chars (base62) → 62^7 ≈ 3.5 trillion possible IDs.

Pros: deterministic — same URL → same short URL (dedup for free). Cons: collisions possible (rare, but must handle); custom aliases conflict.

Option B: Auto-increment counter, encode base62

Counter goes 1, 2, 3 ... encoded as 1, 2, ..., a, b, ..., zz etc.

Pros: simple, sequential, no collisions. Cons: predictable (security/scrape risk); single-point counter is bottleneck — sharded counter needed.

Option C: Distributed ID generator

Twitter Snowflake-style: 64-bit ID = timestamp + machine ID + sequence. Or KGS (Key Generation Service) that pre-generates and caches a pool of unused IDs per app server.

Pros: no central counter; high throughput. Cons: longer than minimum possible.

32.4 Storage

Single KV store: code → {long_url, user_id, created_at, expires_at}.

Choices:

Postgres (or MySQL): plenty for 500M rows; great consistency; strong indexes.
DynamoDB / Bigtable / Cassandra: trivially scalable; eventual consistency OK.
Redis: only as cache, not source of truth.

Use Postgres + Redis cache for the workload size given. Beyond ~5B rows, scale to DynamoDB / Bigtable.

Schema:

CREATE TABLE urls (
  code TEXT PRIMARY KEY,
  long_url TEXT NOT NULL,
  user_id BIGINT,
  created_at TIMESTAMPTZ DEFAULT NOW(),
  expires_at TIMESTAMPTZ
);
CREATE INDEX urls_user_idx ON urls(user_id);

32.5 Read path (the redirect)

Hot path: 116K req/sec average. Optimize aggressively.

GET /abc123
  → Cache lookup (Redis): code → long_url (TTL: 1 day)
  → Cache hit (target ~99%): return 301/302
  → Cache miss: Postgres lookup → set in cache → return
  → Not found: 404

301 vs 302 redirect

301 Permanent: browsers cache; subsequent visits don't even hit your server. Good for stable URLs; bad for analytics (you don't see the click).
302 Found / 307 Temp: every visit hits your server; you get analytics.

For analytics-driven products: 302. For pure redirect / max performance: 301 with separate beacon.

CDN

Put a CDN in front. For 301s, CDN can cache the redirect. Origin only sees uncached or expired entries.

32.6 Write path (shorten)

POST /shorten { long_url }
  → Validate URL.
  → Check existing (hash lookup if dedup desired).
  → Generate code (KGS pool or hash).
  → INSERT into Postgres.
  → Populate cache.
  → Return short URL.

Async work:

Crawl long URL for title/meta (off the hot path).
Safe-browsing check (Google Safe Browsing API).

32.7 Custom aliases

User specifies code: tinyurl.com/launch. Steps:

Validate (allowed chars, length).
Check uniqueness.
INSERT with that code.

Reserve a "vanity codes" namespace separate from generated codes to avoid conflict.

32.8 Analytics

Each click is a row in an event store. At 500K/sec peak, batch:

Redirect → log event to Kafka (`url_clicks` topic)
  → Stream processor aggregates (per-code, per-hour counters)
  → Writes to ClickHouse / BigQuery for queries
  → Async job summarizes geo, referer, UA

Don't write per-click to Postgres — that kills your DB. Kafka handles the spike; downstream digests.

32.9 Sharding

For 500M rows, single Postgres is fine. At 50B rows, shard by code (hash to N shards).

Code generation: each shard pre-allocates from KGS so codes don't collide across shards.

32.10 Caching strategy

L1 in-process LRU for top-N hottest codes (10K codes).
L2 distributed Redis for warm set (TTL 24h).
CDN edge cache for 301 redirects.

For typical click distribution (Zipf): top 1% of codes get ~50% of clicks. Caching covers them effortlessly.

32.11 Expiration / TTL

For URLs with expires_at:

On read: check expiry; serve 410 Gone if past.
Background job: GC expired records.
Or: TTL in Cassandra / DynamoDB does it natively.

32.12 Security

Phishing: validate against Google Safe Browsing on shorten and periodically.
Open redirect: that's literally the product, but warn + tag suspect URLs.
Rate limit per IP / API key on shorten.
Block known bad domains.
CAPTCHA on suspicious patterns.

32.13 Multi-region

For global low-latency reads:

Replicate Redis cache to each region.
Postgres single-leader (writes still cross-region) or Spanner-style distributed SQL.
DNS-based geo routing OR anycast.
Eventual consistency across regions for analytics is OK.

32.14 Failure modes

Cache cluster down: fall back to Postgres; latency rises but service continues.
Postgres primary down: read replicas serve reads (eventual consistency briefly).
KGS down: app servers can serve from local pool; replenish when back.
Hot URL (a celebrity tweets a link): CDN absorbs; no origin impact.

32.15 Estimates revisited

116K reads/sec average × 99% cache hit = ~1.2K Postgres reads/sec. Trivial.
500K peak × 99% cache hit = 5K reads/sec. Still fine.
Click events at 500K/sec to Kafka: needs ~5-10 brokers depending on size.

32.16 What an interviewer wants

ID generation strategy with reasoning (hash vs counter vs KGS).
Cache + Postgres + CDN architecture.
301 vs 302 trade-off.
Analytics off the hot path via Kafka.
Sharding plan if asked to scale.

Key takeaways

ID generation is the central decision: KGS or hash-with-retry.
Read-heavy → cache aggressively + CDN.
Analytics async via event log; never on the redirect path.
Postgres is enough until tens of billions of rows; then shard or Cassandra/DDB.
302 + Kafka for analytics; 301 + CDN for raw speed.