18. Load Balancing: L4 vs L7, Algorithms

A load balancer (LB) distributes incoming requests across many servers. It's the first line of defense against an unhealthy node and the first lever for horizontal scale.

18.1 Why load balance

Scale: aggregate capacity of N servers > one big server (and cheaper).
Availability: if one server dies, traffic shifts.
Maintenance: drain a server gracefully; deploy without downtime.
Geo distribution: send users to the closest region.
SSL termination: offload TLS overhead from app servers.

18.2 L4 vs L7

The labels refer to the OSI layer at which the LB makes its routing decision.

Layer 4 (transport)

Operates on TCP/UDP. Sees IPs, ports, byte stream. Doesn't parse HTTP.

Throughput: very high (millions of conn/sec). Mostly forwards packets.
Routing: based on connection state (5-tuple).
Persistence: source IP hash → same backend.
Examples: AWS NLB, HAProxy in TCP mode, Linux IPVS (the kernel component of LVS), Google Maglev.
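To make 5-tuple routing concrete, here is a minimal sketch of a stateless L4 decision: hash the flow's 5-tuple and map it to a backend. The `FiveTuple` type, the `pick_backend` name, and the addresses are illustrative assumptions, not any product's API.

```python
import hashlib
from typing import NamedTuple


class FiveTuple(NamedTuple):
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    protocol: str  # "tcp" or "udp"


def pick_backend(flow: FiveTuple, backends: list[str]) -> str:
    """Map a flow to a backend by hashing its 5-tuple (no per-flow state)."""
    key = "|".join(str(field) for field in flow).encode()
    digest = hashlib.sha256(key).digest()
    index = int.from_bytes(digest[:8], "big") % len(backends)
    return backends[index]


backends = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]
flow = FiveTuple("203.0.113.7", 51234, "198.51.100.1", 443, "tcp")

# Every packet of the same flow hashes to the same backend.
chosen = pick_backend(flow, backends)
```

Plain modulo hashing remaps most flows when the backend list changes, which is why production L4 LBs typically add a connection table or consistent hashing on top.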

Layer 7 (application)

Operates on HTTP. Reads URL, headers, cookies, can decide based on payload.

Throughput: lower per box (parsing cost).
Routing: by host, path, header, cookie, weighted, A/B.
Persistence: cookie-based (sticky sessions), header-based.
Features: rewriting, compression, caching, SSL termination, WAF.
Examples: AWS ALB, Nginx, HAProxy in HTTP mode, Envoy, Google Cloud Load Balancing, Cloudflare.

When to pick

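L4 when: connections are long-lived and opaque to the LB (databases, raw TCP), throughput matters more than smarts, or you need protocol-agnostic LB. Caveat for gRPC: it multiplexes many requests over one HTTP/2 connection, so an L4 LB balances connections, not requests; use L7 or client-side balancing when per-request balance matters.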
L7 when: HTTP routing rules, content-based routing, mid-request decisions, SSL termination.

Most modern stacks use both: an L4 LB (transport layer) in front of L7 LBs (per-service routing).

18.3 Algorithms

How does the LB pick a backend?

Round robin

Cycle through servers in order. Simple. Ignores load. Bad if servers differ in capacity or requests differ in cost.

Weighted round robin

Each server gets a weight; pick proportionally. Good when server sizes differ.
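One common way to implement weighted round robin is the "smooth" variant, which interleaves picks instead of sending bursts to the heaviest server. The sketch below is a minimal version under that assumption; the server names and weights are made up. With equal weights it degenerates to plain round robin.

```python
def smooth_wrr(weights: dict[str, int], n: int) -> list[str]:
    """Return n picks using smooth weighted round robin."""
    current = {name: 0 for name in weights}
    total = sum(weights.values())
    picks = []
    for _ in range(n):
        # Every server earns credit proportional to its weight.
        for name, weight in weights.items():
            current[name] += weight
        # The server with the most credit is picked and pays back the total.
        chosen = max(current, key=current.get)
        current[chosen] -= total
        picks.append(chosen)
    return picks


picks = smooth_wrr({"big": 5, "medium": 1, "small": 1}, 7)
# One full cycle of 7 picks: "big" is chosen 5 times, the others once each,
# and the two lighter servers are spread through the cycle, not batched at the end.
```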

Least connections

Send to the server with the fewest in-flight connections. Better balance for variable-cost requests.
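A minimal sketch of least connections, assuming the LB can count in-flight connections per backend (the class and method names are illustrative):

```python
class LeastConnections:
    def __init__(self, backends: list[str]):
        self.in_flight = {backend: 0 for backend in backends}

    def acquire(self) -> str:
        """Pick the backend with the fewest in-flight connections."""
        backend = min(self.in_flight, key=self.in_flight.get)
        self.in_flight[backend] += 1
        return backend

    def release(self, backend: str) -> None:
        """Call when the connection closes."""
        self.in_flight[backend] -= 1


lb = LeastConnections(["a", "b", "c"])
first = lb.acquire()   # all idle, ties go to the first backend: "a"
second = lb.acquire()  # "a" is busy now, so "b"
lb.release(first)      # "a" finishes its request
third = lb.acquire()   # "a" is idle again and wins the tie with "c"
```

The linear scan in `min` is fine for small pools; with many backends or many LB instances, the counts become stale or expensive, which is part of the motivation for power of two choices.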

Least time

Send to the server with the lowest combination of (active requests × response time). Better still; harder to implement.
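A rough sketch of a least-time score, assuming latency is tracked as an exponentially weighted moving average (EWMA). Two choices here are assumptions, not a specific product's formula: the smoothing factor `alpha`, and scoring with `active + 1` so that an idle but slow server does not score zero.

```python
from dataclasses import dataclass


@dataclass
class BackendStats:
    name: str
    active: int = 0        # requests currently in flight
    ewma_ms: float = 1.0   # smoothed response time

    def observe(self, latency_ms: float, alpha: float = 0.2) -> None:
        """Fold a new latency sample into the moving average."""
        self.ewma_ms = alpha * latency_ms + (1 - alpha) * self.ewma_ms

    def score(self) -> float:
        """Estimated wait if one more request is sent here; lower is better."""
        return (self.active + 1) * self.ewma_ms


def pick_least_time(stats: list[BackendStats]) -> BackendStats:
    return min(stats, key=BackendStats.score)


fast_busy = BackendStats("fast-busy", active=9, ewma_ms=10.0)  # score 100
slow_idle = BackendStats("slow-idle", active=0, ewma_ms=80.0)  # score 80
chosen = pick_least_time([fast_busy, slow_idle])               # slow-idle wins
```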

Random

Pick a random backend. Surprisingly good with large server pools (low variance).

Power of two choices ("P2C")

Pick two random servers; choose the less loaded one. Near-optimal load balance with O(1) work and no global state. Available in HAProxy (random with two draws), Nginx (random two), and Envoy (least-request). A go-to in modern designs.
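A small simulation helps show why the second choice matters. This is a sketch under simplifying assumptions: requests never finish, so "load" is just a running count, and the pool size, request count, and seed are arbitrary.

```python
import random


def purely_random(loads: dict[str, int], rng: random.Random) -> str:
    return rng.choice(list(loads))


def p2c(loads: dict[str, int], rng: random.Random) -> str:
    """Sample two distinct backends and keep the less loaded one."""
    a, b = rng.sample(list(loads), 2)
    return a if loads[a] <= loads[b] else b


def simulate(pick, n_backends: int = 100, n_requests: int = 10_000, seed: int = 1) -> int:
    """Return the load on the busiest backend after all requests are placed."""
    rng = random.Random(seed)
    loads = {f"s{i}": 0 for i in range(n_backends)}
    for _ in range(n_requests):
        loads[pick(loads, rng)] += 1
    return max(loads.values())


max_random = simulate(purely_random)
max_p2c = simulate(p2c)
# The mean load is 100 per backend; P2C keeps the busiest backend much closer
# to that mean than purely random placement does.
```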

Consistent hashing

Hash the request key (e.g., user ID, URL) to a position on a ring; pick the next backend clockwise from that position. Same key always lands on the same backend → cache locality, sticky session without cookies. Adding/removing a backend rebalances few keys (chapter 17).
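A minimal hash-ring sketch with virtual nodes, to show the "few keys move" property. The `HashRing` class, the choice of SHA-256, and 100 virtual nodes per backend are assumptions for illustration.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


class HashRing:
    def __init__(self, backends: list[str], vnodes: int = 100):
        # Each backend owns many points on the ring to even out its share.
        self._ring = sorted(
            (_hash(f"{backend}#{i}"), backend)
            for backend in backends
            for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def lookup(self, key: str) -> str:
        """Return the backend owning the first point after the key's hash."""
        idx = bisect.bisect_right(self._points, _hash(key)) % len(self._points)
        return self._ring[idx][1]


keys = [f"user-{i}" for i in range(1000)]
before = HashRing(["a", "b", "c"])
after = HashRing(["a", "b", "c", "d"])

# Only keys captured by the new backend "d" change owner (roughly a quarter).
moved = [k for k in keys if before.lookup(k) != after.lookup(k)]
```

With naive `hash(key) % N`, going from 3 to 4 backends would instead remap about three quarters of the keys.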

IP hash

Hash source IP. Coarse stickiness. Bad behind NAT (everyone behind one IP).

18.4 Health checks

LBs continuously probe backends to know which are alive.

Active health checks

LB sends GET /health periodically. Marks backend down after N failures, up after M successes (hysteresis prevents flapping).
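The N-failures / M-successes rule can be sketched as a small state machine. The `fall` and `rise` parameter names are borrowed from common LB configuration vocabulary; the class itself is hypothetical and leaves out the actual HTTP probe.

```python
class HealthTracker:
    def __init__(self, fall: int = 3, rise: int = 2):
        self.fall = fall      # consecutive failures needed to mark down
        self.rise = rise      # consecutive successes needed to mark up
        self.healthy = True
        self._streak = 0      # consecutive probes contradicting current state

    def record(self, probe_ok: bool) -> bool:
        """Feed one probe result; return the backend's state afterwards."""
        if probe_ok == self.healthy:
            self._streak = 0  # the probe agrees with the current state
        else:
            self._streak += 1
            threshold = self.fall if self.healthy else self.rise
            if self._streak >= threshold:
                self.healthy = not self.healthy
                self._streak = 0
        return self.healthy


tracker = HealthTracker(fall=3, rise=2)
probes = [False, True, False, False, False, True, True]
states = [tracker.record(ok) for ok in probes]
# A single failed probe does not flip the state; three in a row mark the
# backend down, and two successes in a row bring it back.
```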

/health endpoint should:

Check critical dependencies (DB connectivity, downstream services).
Separate shallow from deep checks (/health quick, /health/ready thorough).
Be cheap (don't hammer the DB).
Indicate readiness (Kubernetes distinguishes liveness vs readiness).

Passive health checks

LB watches real traffic; if a backend errors out N times, mark down.

Outlier detection (Envoy)

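Statistical: a backend whose success rate falls well below the cluster average (Envoy uses a threshold of the mean minus a multiple of the standard deviation) is ejected for a cool-off period. Envoy also ejects on consecutive 5xx responses.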
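One way to make the statistical check concrete is a standard-deviation threshold on per-host success rate. The sketch below assumes success rates have already been aggregated over a time window; the function name, host names, and the factor of 1.9 are illustrative.

```python
import statistics


def find_outliers(success_rates: dict[str, float], stdev_factor: float = 1.9) -> list[str]:
    """Return hosts whose success rate is below mean - factor * stdev."""
    rates = list(success_rates.values())
    mean = statistics.fmean(rates)
    stdev = statistics.pstdev(rates)
    threshold = mean - stdev_factor * stdev
    return [host for host, rate in success_rates.items() if rate < threshold]


# Nine hosts succeed 99% of the time; one succeeds only 60% of the time.
rates = {f"host-{i}": 0.99 for i in range(9)}
rates["host-9"] = 0.60

ejected = find_outliers(rates)  # only host-9 falls below the threshold
```

A real implementation would also cap the fraction of hosts that can be ejected at once, so that a cluster-wide problem does not empty the pool.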

18.5 Connection draining

When taking a backend out of service:

Stop sending new connections.
Let in-flight requests finish (~30-60s grace).
Then shut down.

Critical for zero-downtime deploys. Kubernetes does this via preStop hook + readiness probe flip.
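The three drain steps can be sketched with an in-flight counter and a grace timeout. This is a simplified single-process model; the `DrainableBackend` class is hypothetical and stands in for whatever the LB or server framework provides.

```python
import threading


class DrainableBackend:
    def __init__(self):
        self._cond = threading.Condition()
        self._in_flight = 0
        self._draining = False

    def try_begin(self) -> bool:
        """Admit a new request unless the backend is draining."""
        with self._cond:
            if self._draining:
                return False
            self._in_flight += 1
            return True

    def end(self) -> None:
        """Mark one request as finished."""
        with self._cond:
            self._in_flight -= 1
            self._cond.notify_all()

    def drain(self, grace_seconds: float) -> bool:
        """Stop admitting work, then wait for in-flight requests to finish.

        Returns True if everything finished within the grace period.
        """
        with self._cond:
            self._draining = True
            return self._cond.wait_for(
                lambda: self._in_flight == 0, timeout=grace_seconds
            )


backend = DrainableBackend()
admitted = backend.try_begin()                    # one request in flight
threading.Timer(0.05, backend.end).start()        # it finishes shortly after
drained_cleanly = backend.drain(grace_seconds=2.0)
rejected_new = not backend.try_begin()            # new work is refused
```

If `drain` returns False, the grace period expired with requests still running, and the caller has to choose between waiting longer and cutting them off.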

18.6 Sticky sessions (session affinity)

Send the same client to the same backend, usually because the backend holds in-memory session state.

Cookie-based (L7): LB sets a cookie on first response; reads it on subsequent.
IP-based (L4): hash source IP. Crude; breaks behind NAT/CDN.
Header-based (L7): explicit header.
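Cookie-based affinity can be sketched as follows. The cookie name `lb_backend` and the `route` function are hypothetical; real LBs usually encode or sign the cookie value rather than exposing the backend name.

```python
import random

COOKIE_NAME = "lb_backend"  # hypothetical cookie name


def route(cookies: dict[str, str], healthy: list[str], rng: random.Random):
    """Return (backend, cookies_to_set) for one request."""
    pinned = cookies.get(COOKIE_NAME)
    if pinned in healthy:
        return pinned, {}                    # honor the existing pin
    backend = rng.choice(healthy)            # first visit, or the pin is gone
    return backend, {COOKIE_NAME: backend}


rng = random.Random(7)
healthy = ["app-1", "app-2", "app-3"]

first_backend, set_cookie = route({}, healthy, rng)          # LB sets the cookie
second_backend, no_cookie = route(set_cookie, healthy, rng)  # same backend again

# If the pinned backend goes away, the client is re-pinned to another one
# and loses whatever in-memory session state lived on the old backend.
remaining = [b for b in healthy if b != first_backend]
third_backend, re_pin = route(set_cookie, remaining, rng)
```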

Better alternative: externalize session state (Redis), so any backend can serve any request. Stateless backends scale linearly; sticky sessions limit you.

Sticky is sometimes unavoidable: WebSockets (long-lived connection), JVM warm caches, GPU model loading.

18.7 LB topology

Single LB → backends

The classic. Bottleneck and SPOF.

LB pair (active/standby)

Two LBs, one passive; failover via VIP / VRRP. Standard for hardware LBs (F5).

LB cluster (active/active)

Multiple LBs share work via DNS round robin or an anycast IP. Capacity grows by adding LBs, so the pattern works from small clusters to very large ones.

Anycast

Same IP advertised from multiple locations; BGP routes to nearest. Used by DNS roots, CDNs, Google Public DNS. Failure of one site routes traffic to others automatically.

Maglev (Google)

Software L4 LB at Google scale. Each LB machine combines consistent hashing with local connection tracking, so any machine picks the same backend for a given flow without state shared across the cluster. Paper published in 2016 (NSDI).
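The core of Maglev's consistent hashing is a fixed-size lookup table that every LB machine can build independently. The sketch below follows the table-population idea described in the paper in simplified form; the hash function, the salts, and the table size of 251 are stand-ins chosen for illustration.

```python
import hashlib


def _h(value: str, salt: str) -> int:
    digest = hashlib.sha256((salt + value).encode()).digest()
    return int.from_bytes(digest[:8], "big")


def build_table(backends: list[str], size: int = 251) -> list[str]:
    """Fill a lookup table; size must be prime so each backend's
    (offset, skip) sequence visits every slot."""
    offsets = [_h(b, "offset") % size for b in backends]
    skips = [_h(b, "skip") % (size - 1) + 1 for b in backends]
    next_idx = [0] * len(backends)
    table = [None] * size
    filled = 0
    while True:
        # Backends take turns claiming their next preferred free slot.
        for i, backend in enumerate(backends):
            slot = (offsets[i] + next_idx[i] * skips[i]) % size
            while table[slot] is not None:
                next_idx[i] += 1
                slot = (offsets[i] + next_idx[i] * skips[i]) % size
            table[slot] = backend
            next_idx[i] += 1
            filled += 1
            if filled == size:
                return table


def lookup(table: list[str], flow_key: str) -> str:
    return table[_h(flow_key, "flow") % len(table)]


backends = ["b0", "b1", "b2", "b3"]
table = build_table(backends)
counts = {b: table.count(b) for b in backends}  # shares differ by at most 1
```

Because the table depends only on the backend list, two LB machines with the same list send a given flow key to the same backend without talking to each other.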

18.8 Layer 7 features in production

Path-based routing: /api/payments/* → payments-service; /api/users/* → users-service.
Host-based routing: api.example.com vs admin.example.com.
Header-based routing: X-Region: eu → EU pool.
Weighted routing for canary: 95% to v1, 5% to v2.
Mirror traffic: duplicate each request to staging; discard the mirrored response.
Retries: configurable per route, with budgets.
Circuit breaker: outlier detection, fail fast.
Rate limit: per-API-key throttling.
Auth: validate JWT before forwarding.
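Two of the features above, path-based routing and weighted canary routing, can be sketched together as a small route table. The pool names and the `route_request` function are hypothetical; the 95/5 split mirrors the canary example in the list.

```python
import random

# (path prefix, [(pool, weight), ...]); the longest matching prefix wins.
ROUTES = [
    ("/api/payments/", [("payments-v1", 95), ("payments-v2", 5)]),
    ("/api/users/", [("users-service", 100)]),
    ("/", [("default-pool", 100)]),
]


def route_request(path: str, rng: random.Random) -> str:
    """Pick a pool for a request path (paths are assumed to start with '/')."""
    matches = [route for route in ROUTES if path.startswith(route[0])]
    _, pools = max(matches, key=lambda route: len(route[0]))
    names = [name for name, _ in pools]
    weights = [weight for _, weight in pools]
    return rng.choices(names, weights=weights, k=1)[0]


rng = random.Random(42)
users_pool = route_request("/api/users/42", rng)
fallback_pool = route_request("/healthz", rng)

# Over many requests, roughly 5% of payments traffic reaches the canary.
sample = [route_request("/api/payments/charge", rng) for _ in range(10_000)]
canary_share = sample.count("payments-v2") / len(sample)
```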

18.9 Service mesh (preview, chapter 20)

In a microservice world, every service is also a "load balancer" for its dependencies. A service mesh (Istio, Linkerd) gives every pod a sidecar proxy (Envoy in Istio, linkerd2-proxy in Linkerd) that handles:

Service discovery
L7 routing
Retries, timeouts, circuit breakers
mTLS between services
Telemetry (metrics, traces)

The mesh is essentially a programmable, pervasive L7 LB.

18.10 SSL/TLS termination

Decrypt at the LB so backends speak plain HTTP internally. Saves CPU on backends.

Trade-off: traffic between LB and backend is unencrypted. In zero-trust networks (Google's BeyondCorp), this is unacceptable; you re-encrypt or use mTLS internally.

18.11 Global load balancing

Direct user to nearest healthy region.

DNS-based

Authoritative DNS returns different A records per region (geo-DNS). Limited by DNS TTL caching.

Anycast IPs

One IP, many advertisements. BGP routes to nearest. Used by Cloudflare, Fastly, Google Cloud.

Application-level redirect

Initial server determines best region, redirects with HTTP 302.

Gotchas

Regional failover: when a region dies, send to next-nearest. DNS TTL needs to be short.
"Stickiness" across regions: sessions belong to one region; cross-region failover may force re-login.
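Regional failover reduces to "nearest region that is still healthy". A minimal sketch, assuming the LB has a latency estimate per client group; the region names and numbers are made up.

```python
from typing import Optional

# Hypothetical client-to-region latency estimates in milliseconds.
LATENCY_MS = {
    "eu-user": {"eu-west": 20, "us-east": 90, "ap-south": 160},
}


def pick_region(client: str, healthy: set[str]) -> Optional[str]:
    """Return the lowest-latency healthy region, or None if all are down."""
    candidates = {
        region: ms
        for region, ms in LATENCY_MS[client].items()
        if region in healthy
    }
    if not candidates:
        return None
    return min(candidates, key=candidates.get)


all_up = {"eu-west", "us-east", "ap-south"}
normal = pick_region("eu-user", all_up)                   # "eu-west"
failover = pick_region("eu-user", all_up - {"eu-west"})   # "us-east"
```

With DNS-based steering, clients keep using the old answer until their cached record expires, so the worst-case failover delay is roughly the failure detection time plus the DNS TTL.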

18.12 Common pitfalls

No health checks: broken backends keep receiving traffic until users notice.
Skewed hashing: one backend serves 80% of traffic because the hash key or function distributes poorly.
Sticky sessions everywhere: limits scale; deploys are painful.
Drain time too short: in-flight requests are killed mid-response, and users see errors.
L4 when you needed L7: can't do header-based routing.
L7 when you needed L4: too much overhead; bottleneck on TLS termination.

18.13 Sizing

For HTTP traffic on a modern Nginx box:

~50K-100K req/sec, single instance, depending on TLS, HTTP/2, payload size.
~1M concurrent connections (with kernel tuning).

For TCP LB (HAProxy, NLB):

~1M+ packets/sec per core.
10s of millions of concurrent connections.

Beyond that, scale out the LB itself.
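A back-of-the-envelope calculation turns these figures into an instance count. The 50% headroom target and the single spare instance are assumptions for illustration, not fixed rules.

```python
import math


def lb_instances_needed(
    peak_rps: int,
    per_instance_rps: int,
    headroom: float = 0.5,   # run each LB at 50% of its measured capacity
    redundancy: int = 1,     # spare instances to survive a failure (N+1)
) -> int:
    usable_rps = per_instance_rps * headroom
    return math.ceil(peak_rps / usable_rps) + redundancy


# 400K req/sec at peak, using the low end of the Nginx estimate (50K req/sec):
# 400,000 / (50,000 * 0.5) = 16 instances, plus 1 spare = 17.
n = lb_instances_needed(peak_rps=400_000, per_instance_rps=50_000)
```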

18.14 What the interviewer wants

Know L4 vs L7 cold.
Pick the right algorithm for the case (P2C is the modern pick).
Discuss health checks, session stickiness, and when to externalize state.
Discuss zero-downtime deploys (drain + readiness).
Mention global LB and CDNs for scale.

Key takeaways

L4 = fast, dumb (TCP/UDP). L7 = smart, slower (HTTP).
Power-of-two-choices is the modern algorithm default.
Health checks + draining = zero-downtime deploys.
Sticky sessions are usually a smell; externalize state.
For global scale: anycast, geo-DNS, regional failover.