18. Load Balancing: L4 vs L7, Algorithms
A load balancer (LB) distributes incoming requests across many servers. It's the first line of defense against an unhealthy node and the first lever for horizontal scale.
~6 min read·updated 5/29/2026
18.1 Why load balance
- Scale: aggregate capacity of N servers > one big server (and cheaper).
- Availability: if one server dies, traffic shifts.
- Maintenance: drain a server gracefully; deploy without downtime.
- Geo distribution: send users to the closest region.
- SSL termination: offload TLS overhead from app servers.
18.2 L4 vs L7
The OSI layer at which the LB makes decisions.
Layer 4 (transport)
Operates on TCP/UDP. Sees IPs, ports, byte stream. Doesn't parse HTTP.
- Throughput: very high (millions of conn/sec). Mostly forwards packets.
- Routing: based on connection state (5-tuple).
- Persistence: source IP hash → same backend.
- Examples: AWS NLB, HAProxy in TCP mode, Linux IPVS (the LVS kernel load balancer), Google Maglev.
Layer 7 (application)
Operates on application protocols, most often HTTP. Reads the URL, headers, and cookies, and can route on payload content.
- Throughput: lower per box (parsing cost).
- Routing: by host, path, header, cookie, weighted, A/B.
- Persistence: cookie-based (sticky sessions), header-based.
- Features: rewriting, compression, caching, SSL termination, WAF.
- Examples: AWS ALB, Nginx, HAProxy in HTTP mode, Envoy, Google Cloud Load Balancing, Cloudflare.
When to pick
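- L4 when: connections are long-lived and opaque to the LB (databases, raw TCP), throughput matters more than smarts, or you need protocol-agnostic LB. Note that L4 balances connections, not requests: gRPC multiplexes many calls over one long-lived HTTP/2 connection, so spreading individual calls needs L7 or client-side balancing.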
- L7 when: HTTP routing rules, content-based routing, mid-request decisions, SSL termination.
Most modern stacks use both: an L4 LB (transport layer) at the network edge in front of L7 LBs (per-service routing).
18.3 Algorithms
How does the LB pick a backend?
Round robin
Cycle through servers in order. Simple. Ignores load. Bad if servers differ in capacity or requests differ in cost.
Weighted round robin
Each server gets a weight; pick proportionally. Good when server sizes differ.
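One way to pick proportionally without sending bursts to the heaviest server is "smooth" weighted round robin. The sketch below is a minimal illustration of that variant, not the method of any particular LB; the class name and backend names are hypothetical.

```python
class SmoothWeightedRoundRobin:
    """Smooth weighted round robin: interleaves picks instead of bursting."""

    def __init__(self, weights):
        # weights: dict of backend name -> positive integer weight
        # (names are illustrative, not from any real deployment).
        self.weights = dict(weights)
        self.current = {name: 0 for name in self.weights}
        self.total = sum(self.weights.values())

    def pick(self):
        # Raise every backend's running score by its weight.
        for name, weight in self.weights.items():
            self.current[name] += weight
        # Choose the highest score, then charge it the total weight.
        chosen = max(self.current, key=self.current.get)
        self.current[chosen] -= self.total
        return chosen


lb = SmoothWeightedRoundRobin({"big": 5, "medium": 1, "small": 1})
picks = [lb.pick() for _ in range(7)]
print(picks)
```

Over any 7 consecutive picks, "big" is chosen 5 times and the others once each, with the smaller servers interleaved rather than appended at the end of the cycle.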
Least connections
Send to the server with the fewest in-flight connections. Better balance for variable-cost requests.
Least time
Send to the server with the lowest combination of (active requests × response time). Better still; harder to implement.
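The difference between least connections and least time shows up when a lightly loaded backend is also slow. The sketch below scores backends both ways. The `Backend` fields and the sample numbers are made up, and the `+ 1` in the least-time score (counting the request about to be added, so an idle backend does not score zero) is an assumption of this sketch.

```python
from dataclasses import dataclass


@dataclass
class Backend:
    name: str
    active: int            # in-flight requests right now
    avg_latency_ms: float  # moving average of recent response times


def least_connections(backends):
    # Fewest in-flight requests wins; latency is ignored.
    return min(backends, key=lambda b: b.active)


def least_time(backends):
    # Score = (active requests + the new one) x average response time.
    return min(backends, key=lambda b: (b.active + 1) * b.avg_latency_ms)


pool = [
    Backend("a", active=2, avg_latency_ms=200.0),
    Backend("b", active=4, avg_latency_ms=20.0),
    Backend("c", active=3, avg_latency_ms=50.0),
]
print(least_connections(pool).name)  # a: fewest connections, but slow
print(least_time(pool).name)         # b: busiest, but fast enough to win
```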
Random
Pick a random backend. Surprisingly good with large server pools (low variance).
Power of two choices ("P2C")
Pick two random servers; choose the less loaded one. Near-optimal load balance with O(1) work, no global state. Used by HAProxy, Nginx Plus, Envoy. A go-to in modern designs.
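A small simulation makes the benefit concrete. This is a simplified sketch: "load" is just a count of assigned requests that never complete, and the function names, pool size, and request count are arbitrary choices for illustration.

```python
import random


def pick_random(loads, rng):
    # Baseline: one uniformly random backend.
    return rng.randrange(len(loads))


def pick_p2c(loads, rng):
    # Power of two choices: sample two distinct backends, keep the less loaded.
    a, b = rng.sample(range(len(loads)), 2)
    return a if loads[a] <= loads[b] else b


def simulate(n_backends, n_requests, picker, seed=0):
    rng = random.Random(seed)
    loads = [0] * n_backends
    for _ in range(n_requests):
        loads[picker(loads, rng)] += 1
    return loads


rnd = simulate(100, 10_000, pick_random)
p2c = simulate(100, 10_000, pick_p2c)
print("max load with random:", max(rnd))
print("max load with P2C:   ", max(p2c))
```

The average load is 100 requests per backend in both runs; the interesting number is how far the busiest backend sits above that average. Each P2C decision only reads two load counters, which is why it needs no global view.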
Consistent hashing
Hash the request key (e.g., user ID, URL) to a position on a ring; pick the next backend clockwise from that position. The same key always lands on the same backend → cache locality, sticky sessions without cookies. Adding or removing a backend remaps only a small fraction of keys (chapter 17).
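A minimal hash ring, to show why few keys move when a backend leaves. This is a sketch under assumptions: the class name, the use of SHA-256, 100 virtual nodes per backend, and the backend and key names are all illustrative choices.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    # Any stable hash works; SHA-256 truncated to 64 bits is used here.
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


class HashRing:
    def __init__(self, backends, vnodes=100):
        self.vnodes = vnodes
        self._ring = []  # sorted list of (position, backend)
        for backend in backends:
            self.add(backend)

    def add(self, backend):
        # Virtual nodes spread each backend around the ring.
        for i in range(self.vnodes):
            bisect.insort(self._ring, (_hash(f"{backend}#{i}"), backend))

    def remove(self, backend):
        self._ring = [(pos, b) for pos, b in self._ring if b != backend]

    def lookup(self, key):
        # First virtual node at or after the key's position, wrapping around.
        idx = bisect.bisect_left(self._ring, (_hash(key), ""))
        return self._ring[idx % len(self._ring)][1]


ring = HashRing(["app-1", "app-2", "app-3"])
keys = [f"user-{i}" for i in range(1000)]
before = {k: ring.lookup(k) for k in keys}
ring.remove("app-3")
after = {k: ring.lookup(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
print(f"{len(moved)} of {len(keys)} keys moved")
```

Only keys that lived on the removed backend move; every other key keeps its backend, so caches on the surviving servers stay warm.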
IP hash
Hash source IP. Coarse stickiness. Bad behind NAT (everyone behind one IP).
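The NAT problem is easy to see in a sketch. The function name, the hash choice, and the addresses (from the documentation range 203.0.113.0/24) are illustrative.

```python
import hashlib


def ip_hash(source_ip, backends):
    # Hash the source IP and map it onto the backend list with a modulo.
    digest = hashlib.sha256(source_ip.encode()).digest()
    return backends[int.from_bytes(digest[:8], "big") % len(backends)]


backends = ["app-1", "app-2", "app-3"]

# 1000 office users behind one NAT gateway all present the same source IP,
# so every one of them is sent to the same backend.
office_traffic = {ip_hash("203.0.113.7", backends) for _ in range(1000)}
print(office_traffic)
```

Because the mapping is a plain modulo over the backend list, changing the number of backends reshuffles most clients, which is the weakness consistent hashing addresses.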
18.4 Health checks
LBs continuously probe backends to know which are alive.
Active health checks
LB sends GET /health periodically. Marks backend down after N failures, up after M successes (hysteresis prevents flapping).
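The N-failures / M-successes rule can be sketched as a small state machine. The class name and the default thresholds are assumptions; the parameter names `fall` and `rise` borrow HAProxy's terms for these two counts.

```python
class HealthTracker:
    """Tracks one backend; flips state only after a run of contrary probes."""

    def __init__(self, fall=3, rise=2):
        self.fall = fall      # consecutive failures before marking down
        self.rise = rise      # consecutive successes before marking up
        self.healthy = True
        self._streak = 0      # consecutive probes contradicting current state

    def record(self, probe_ok: bool) -> bool:
        if probe_ok == self.healthy:
            # Probe agrees with the current state: reset the streak.
            self._streak = 0
        else:
            self._streak += 1
            threshold = self.rise if probe_ok else self.fall
            if self._streak >= threshold:
                self.healthy = probe_ok
                self._streak = 0
        return self.healthy


tracker = HealthTracker(fall=3, rise=2)
probes = [True, False, False, True, False, False, False, True, True]
states = [tracker.record(ok) for ok in probes]
print(states)
```

The two isolated failures early in the sequence do not take the backend out; only the run of three does, and it takes two consecutive successes to bring it back.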
/health endpoint should:
- Check critical dependencies (DB connectivity, downstream services).
- Return shallow vs deep health (/health quick, /health/ready thorough).
- Be cheap (don't hammer the DB).
- Indicate readiness (Kubernetes distinguishes liveness vs readiness).
Passive health checks
LB watches real traffic; if a backend errors out N times, mark down.
Outlier detection (Envoy)
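Statistical: a backend whose error rate is well above the rest of the cluster (for example, more than 2× the cluster average) is ejected for a cool-off period. Envoy itself offers consecutive-error checks and a success-rate check that compares each host against the cluster mean.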
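The 2× rule of thumb can be sketched in a few lines. This illustrates the idea only and is not Envoy's algorithm; the function name, the `min_requests` floor (to avoid judging backends on too little traffic), and the sample counts are assumptions.

```python
def find_outliers(stats, factor=2.0, min_requests=20):
    """stats: dict of backend -> (errors, requests). Returns backends to eject."""
    # Ignore backends with too little traffic to judge.
    eligible = {b: (e, r) for b, (e, r) in stats.items() if r >= min_requests}
    total_err = sum(e for e, _ in eligible.values())
    total_req = sum(r for _, r in eligible.values())
    if total_req == 0 or total_err == 0:
        return []
    cluster_rate = total_err / total_req
    return sorted(
        b for b, (e, r) in eligible.items() if e / r > factor * cluster_rate
    )


window = {
    "a": (1, 100),   # 1% errors
    "b": (2, 100),   # 2% errors
    "c": (30, 100),  # 30% errors
    "d": (0, 5),     # too few requests to judge
}
print(find_outliers(window))
```

Here the cluster average is 33 errors over 300 requests (11%), so only "c", at 30%, crosses the 22% threshold.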
18.5 Connection draining
When taking a backend out of service:
- Stop sending new connections.
- Let in-flight requests finish (~30-60s grace).
- Then shut down.
Critical for zero-downtime deploys. Kubernetes does this via preStop hook + readiness probe flip.
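The three drain steps map directly onto code. The sketch below is single-threaded and hypothetical (class and method names are made up); a real server would update `in_flight` from its request handlers.

```python
import time


class DrainableBackend:
    def __init__(self, name):
        self.name = name
        self.accepting = True
        self.in_flight = 0

    def start_request(self):
        if not self.accepting:
            raise RuntimeError(f"{self.name} is draining")
        self.in_flight += 1

    def finish_request(self):
        self.in_flight -= 1

    def drain(self, grace_seconds=30.0, poll_seconds=0.1,
              sleep=time.sleep, clock=time.monotonic):
        """Stop new work, wait for in-flight requests, report if drain was clean."""
        self.accepting = False                    # 1. stop new connections
        deadline = clock() + grace_seconds
        while self.in_flight > 0 and clock() < deadline:
            sleep(poll_seconds)                   # 2. let in-flight work finish
        return self.in_flight == 0                # 3. caller may now shut down


backend = DrainableBackend("app-1")
backend.start_request()
# Simulate the in-flight request completing while the drain loop waits.
clean = backend.drain(grace_seconds=1.0, sleep=lambda _: backend.finish_request())
print("drained cleanly:", clean)
```

A `False` return means the grace period expired with requests still running, which is the "drain time too short" case: shutting down at that point cuts those requests off.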
18.6 Sticky sessions (session affinity)
Send the same client to the same backend, usually because the backend holds in-memory session state.
- Cookie-based (L7): LB sets a cookie on first response; reads it on subsequent.
- IP-based (L4): hash source IP. Crude; breaks behind NAT/CDN.
- Header-based (L7): explicit header.
Better alternative: externalize session state (Redis), so any backend can serve any request. Stateless backends scale linearly; sticky sessions limit you.
Sticky is sometimes unavoidable: WebSockets (long-lived connection), JVM warm caches, GPU model loading.
18.7 LB topology
Single LB → backends
The classic. Bottleneck and SPOF.
LB pair (active/standby)
Two LBs, one passive; failover via VIP / VRRP. Standard for hardware LBs (F5).
LB cluster (active/active)
Multiple LBs share work via DNS round robin or anycast IP. The usual choice at large scale.
Anycast
Same IP advertised from multiple locations; BGP routes to nearest. Used by DNS roots, CDNs, Google Public DNS. Failure of one site routes traffic to others automatically.
Maglev (Google)
Software L4 LB at Google scale. Uses consistent hashing so that every LB in the cluster picks the same backend for a given connection; the LBs need no shared state. Described in a 2016 NSDI paper.
18.8 Layer 7 features in production
- Path-based routing: /api/payments/* → payments-service; /api/users/* → users-service.
- Host-based routing: api.example.com vs admin.example.com.
- Header-based routing: X-Region: eu → EU pool.
- Weighted routing for canary: 95% to v1, 5% to v2.
- Mirror traffic: dup request to staging; ignore response.
- Retries: configurable per route, with budgets.
- Circuit breaker: outlier detection, fail fast.
- Rate limit: per-API-key throttling.
- Auth: validate JWT before forwarding.
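Several of these rules reduce to "first matching predicate wins" plus a weighted pick. The sketch below is a toy router, not the configuration model of any real proxy; the request shape (a dict with `host`, `path`, `headers`), the rule order, and the pool names are assumptions.

```python
import random

# First match wins. Pool names are illustrative.
ROUTES = [
    (lambda req: req["headers"].get("X-Region") == "eu", "eu-pool"),
    (lambda req: req["host"] == "admin.example.com", "admin-service"),
    (lambda req: req["path"].startswith("/api/payments/"), "payments-service"),
    (lambda req: req["path"].startswith("/api/users/"), "users-service"),
]


def route(req, default="default-pool"):
    for matches, pool in ROUTES:
        if matches(req):
            return pool
    return default


def pick_version(weights, rng=random):
    # Weighted canary split, e.g. {"v1": 95, "v2": 5}.
    versions = list(weights)
    return rng.choices(versions, weights=[weights[v] for v in versions], k=1)[0]


request = {"host": "api.example.com", "path": "/api/payments/charge", "headers": {}}
print(route(request))                      # payments-service
print(pick_version({"v1": 95, "v2": 5}))   # v1 about 95% of the time
```

Rule order matters: with the header rule first, a request carrying X-Region: eu goes to the EU pool regardless of its path.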
18.9 Service mesh (preview, chapter 20)
In a microservice world, every service is also a "load balancer" for its dependencies. A service mesh (Istio, Linkerd) gives every pod a sidecar proxy (Envoy in Istio, linkerd2-proxy in Linkerd) that handles:
- Service discovery
- L7 routing
- Retries, timeouts, circuit breakers
- mTLS between services
- Telemetry (metrics, traces)
The mesh is essentially a programmable, pervasive L7 LB.
18.10 SSL/TLS termination
Decrypt at the LB so backends speak plain HTTP internally. Saves CPU on backends.
Trade-off: traffic between LB and backend is unencrypted. In zero-trust networks (Google's BeyondCorp), this is unacceptable; you re-encrypt or use mTLS internally.
18.11 Global load balancing
Direct user to nearest healthy region.
DNS-based
Authoritative DNS returns different A records per region (geo-DNS). Limited by DNS TTL caching.
Anycast IPs
One IP, many advertisements. BGP routes to nearest. Used by Cloudflare, Fastly, Google Cloud.
Application-level redirect
Initial server determines best region, redirects with HTTP 302.
Gotchas
- Regional failover: when a region dies, send to next-nearest. DNS TTL needs to be short.
- "Stickiness" across regions: sessions belong to one region; cross-region failover may force re-login.
18.12 Common pitfalls
- No health checks — broken backends get traffic until detected by users.
- All-or-nothing routing — one backend serves 80% of traffic because hash is bad.
- Sticky sessions everywhere — limits scale; deploys are painful.
- Drain time too short — kill in-flight requests, get angry users.
- L4 when you needed L7 — can't do header-based routing.
- L7 when you needed L4 — too much overhead; bottleneck on TLS termination.
18.13 Sizing
Rough order-of-magnitude figures for HTTP traffic on a modern Nginx box:
- ~50K-100K req/sec, single instance, depending on TLS, HTTP/2, payload size.
- ~1M concurrent connections (with kernel tuning).
For TCP LB (HAProxy, NLB):
- ~1M+ packets/sec per core.
- 10s of millions of concurrent connections.
Beyond that, scale out the LB itself.
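A back-of-envelope calculation turns these figures into an instance count. The 400K req/sec peak, the 50% utilisation target, and the one-spare (N+1) policy below are assumptions for illustration; the 50K req/sec per instance is the conservative end of the range above.

```python
import math


def instances_needed(peak_rps, per_instance_rps, headroom=0.5, spare=1):
    """LB instances for a peak load, each run at `headroom` utilisation, plus spares."""
    usable_rps = per_instance_rps * headroom
    return math.ceil(peak_rps / usable_rps) + spare


# 400K req/sec peak, 50K req/sec per box, run at 50% utilisation, plus one spare:
# 400,000 / 25,000 = 16 instances, + 1 spare = 17.
print(instances_needed(400_000, 50_000))
```

Running each box at half its rated throughput leaves room for traffic spikes and for absorbing the load of an instance that fails or is being drained.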
18.14 What the interviewer wants
- Know L4 vs L7 cold.
- Pick the right algorithm for the case (P2C is the modern pick).
- Discuss health checks, session stickiness, and when to externalize state.
- Discuss zero-downtime deploys (drain + readiness).
- Mention global LB and CDNs for scale.
Key takeaways
- L4 = fast, dumb (TCP/UDP). L7 = smart, slower (HTTP).
- Power-of-two-choices is the modern algorithm default.
- Health checks + draining = zero-downtime deploys.
- Sticky sessions are usually a smell; externalize state.
- For global scale: anycast, geo-DNS, regional failover.