◐ system-design/design-url-shortener.md
32. Design a URL Shortener (TinyURL / bit.ly)
The classic warm-up. Easy to start, deceptively rich. Gets at: ID generation, KV stores, caching, redirect mechanics, analytics fan-out.
~4 min read·updated 5/29/2026
32. Design a URL Shortener (TinyURL / bit.ly)
The classic warm-up. Easy to start, deceptively rich. Gets at: ID generation, KV stores, caching, redirect mechanics, analytics fan-out.
32.1 Requirements
Functional
- Shorten a long URL → return a short URL.
- Visit short URL → redirect to original.
- Custom aliases (optional).
- Analytics: clicks, geo, referer.
- Optional: expiration, TTL, password-protect.
Non-functional
- High availability (read-heavy: ~100:1).
- Low latency redirect (< 50 ms p99).
- Durable storage of mappings.
- Scale: 100M new URLs/year, 10B redirects/day.
32.2 Capacity estimate
- 100M new URLs/year ÷ 86,400 × 365 ≈ 3 writes/sec average; peak ~10/sec.
- 10B redirects/day ÷ 86,400 ≈ 116K reads/sec; peak ~500K reads/sec.
- Read:write = ~10,000:1.
Storage:
- 100M new × 5 years = 500M URLs.
- ~500 bytes per record (long URL avg 100 bytes + meta) → 250 GB. Easy.
32.3 The short ID
A short URL tinyurl.com/abc123 needs a unique ID for abc123. How do we generate it?
Option A: Hash the long URL
md5(url) → first 7 chars (base62) → 62^7 ≈ 3.5 trillion possible IDs.
Pros: deterministic — same URL → same short URL (dedup for free). Cons: collisions possible (rare, but must handle); custom aliases conflict.
Option B: Auto-increment counter, encode base62
Counter goes 1, 2, 3 ... encoded as 1, 2, ..., a, b, ..., zz etc.
Pros: simple, sequential, no collisions. Cons: predictable (security/scrape risk); single-point counter is bottleneck — sharded counter needed.
Option C: Distributed ID generator
Twitter Snowflake-style: 64-bit ID = timestamp + machine ID + sequence. Or KGS (Key Generation Service) that pre-generates and caches a pool of unused IDs per app server.
Pros: no central counter; high throughput. Cons: longer than minimum possible.
Recommended
Hash + collision retry, OR KGS pre-allocated pool of base62 codes. Use KGS for production: hands out ranges (e.g., 10K codes) to app servers; codes < 8 chars guaranteed.
32.4 Storage
Single KV store: code → {long_url, user_id, created_at, expires_at}.
Choices:
- Postgres (or MySQL): plenty for 500M rows; great consistency; strong indexes.
- DynamoDB / Bigtable / Cassandra: trivially scalable; eventual consistency OK.
- Redis: only as cache, not source of truth.
Use Postgres + Redis cache for the workload size given. Beyond ~5B rows, scale to DynamoDB / Bigtable.
Schema:
CREATE TABLE urls ( code TEXT PRIMARY KEY, long_url TEXT NOT NULL, user_id BIGINT, created_at TIMESTAMPTZ DEFAULT NOW(), expires_at TIMESTAMPTZ ); CREATE INDEX urls_user_idx ON urls(user_id);
32.5 Read path (the redirect)
Hot path: 116K req/sec average. Optimize aggressively.
GET /abc123
→ Cache lookup (Redis): code → long_url (TTL: 1 day)
→ Cache hit (target ~99%): return 301/302
→ Cache miss: Postgres lookup → set in cache → return
→ Not found: 404
301 vs 302 redirect
- 301 Permanent: browsers cache; subsequent visits don't even hit your server. Good for stable URLs; bad for analytics (you don't see the click).
- 302 Found / 307 Temp: every visit hits your server; you get analytics.
For analytics-driven products: 302. For pure redirect / max performance: 301 with separate beacon.
CDN
Put a CDN in front. For 301s, CDN can cache the redirect. Origin only sees uncached or expired entries.
32.6 Write path (shorten)
POST /shorten { long_url }
→ Validate URL.
→ Check existing (hash lookup if dedup desired).
→ Generate code (KGS pool or hash).
→ INSERT into Postgres.
→ Populate cache.
→ Return short URL.
Async work:
- Crawl long URL for title/meta (off the hot path).
- Safe-browsing check (Google Safe Browsing API).
32.7 Custom aliases
User specifies code: tinyurl.com/launch. Steps:
- Validate (allowed chars, length).
- Check uniqueness.
- INSERT with that code.
Reserve a "vanity codes" namespace separate from generated codes to avoid conflict.
32.8 Analytics
Each click is a row in an event store. At 500K/sec peak, batch:
Redirect → log event to Kafka (`url_clicks` topic)
→ Stream processor aggregates (per-code, per-hour counters)
→ Writes to ClickHouse / BigQuery for queries
→ Async job summarizes geo, referer, UA
Don't write per-click to Postgres — that kills your DB. Kafka handles the spike; downstream digests.
32.9 Sharding
For 500M rows, single Postgres is fine. At 50B rows, shard by code (hash to N shards).
Code generation: each shard pre-allocates from KGS so codes don't collide across shards.
32.10 Caching strategy
- L1 in-process LRU for top-N hottest codes (10K codes).
- L2 distributed Redis for warm set (TTL 24h).
- CDN edge cache for 301 redirects.
For typical click distribution (Zipf): top 1% of codes get ~50% of clicks. Caching covers them effortlessly.
32.11 Expiration / TTL
For URLs with expires_at:
- On read: check expiry; serve 410 Gone if past.
- Background job: GC expired records.
- Or: TTL in Cassandra / DynamoDB does it natively.
32.12 Security
- Phishing: validate against Google Safe Browsing on shorten and periodically.
- Open redirect: that's literally the product, but warn + tag suspect URLs.
- Rate limit per IP / API key on shorten.
- Block known bad domains.
- CAPTCHA on suspicious patterns.
32.13 Multi-region
For global low-latency reads:
- Replicate Redis cache to each region.
- Postgres single-leader (writes still cross-region) or Spanner-style distributed SQL.
- DNS-based geo routing OR anycast.
- Eventual consistency across regions for analytics is OK.
32.14 Failure modes
- Cache cluster down: fall back to Postgres; latency rises but service continues.
- Postgres primary down: read replicas serve reads (eventual consistency briefly).
- KGS down: app servers can serve from local pool; replenish when back.
- Hot URL (a celebrity tweets a link): CDN absorbs; no origin impact.
32.15 Estimates revisited
- 116K reads/sec average × 99% cache hit = ~1.2K Postgres reads/sec. Trivial.
- 500K peak × 99% cache hit = 5K reads/sec. Still fine.
- Click events at 500K/sec to Kafka: needs ~5-10 brokers depending on size.
32.16 What an interviewer wants
- ID generation strategy with reasoning (hash vs counter vs KGS).
- Cache + Postgres + CDN architecture.
- 301 vs 302 trade-off.
- Analytics off the hot path via Kafka.
- Sharding plan if asked to scale.
Key takeaways
- ID generation is the central decision: KGS or hash-with-retry.
- Read-heavy → cache aggressively + CDN.
- Analytics async via event log; never on the redirect path.
- Postgres is enough until tens of billions of rows; then shard or Cassandra/DDB.
- 302 + Kafka for analytics; 301 + CDN for raw speed.
// 1 view