Load Balancing & Proxies
No single server survives production traffic. Load balancers distribute requests across pools, detect unhealthy backends, and terminate TLS at the edge. This chapter covers algorithms, L4 vs L7 trade-offs, the proxy/gateway/mesh spectrum, health check semantics, and global routing with DNS and anycast.
Load Balancing Algorithms
The algorithm decides which backend receives each request. Wrong choice causes hot spots, uneven CPU, or broken session affinity. Match the algorithm to request homogeneity, backend capacity, and statefulness.
All algorithms assume a pool of healthy backends. Health checks (covered below) remove failed nodes first; the algorithm only distributes among survivors.
Round robin
Requests cycle sequentially: A → B → C → A → B → C. Simple, stateless, works when all backends are identical and requests have similar cost (~50 ms each).
- Pros — even distribution over long windows; zero configuration
- Cons — ignores current load; long requests create hot spots; no affinity
- Variants — weighted round robin (see below); DNS round robin (coarse, TTL issues)
Weighted round robin
Assign each server a weight proportional to capacity. Server A (weight 3) gets 3× the traffic of Server B (weight 1). Use when instances differ in CPU/RAM (mixed instance types) or during gradual rollout (canary at weight 1, stable at weight 9).
# NGINX upstream example
upstream api_backend {
    server 10.0.1.10:8080 weight=3;
    server 10.0.1.11:8080 weight=3;
    server 10.0.1.12:8080 weight=1; # canary / smaller instance
}
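One way to realize these weights is the smooth weighted round-robin scheme, which interleaves picks instead of sending a burst to the heaviest server. The sketch below is illustrative only: the function name and the A/B/C labels are assumptions, and the weights simply mirror the upstream block above.

```python
from collections import Counter


def smooth_weighted_round_robin(weights, n):
    """Return n picks; each server is chosen in proportion to its weight."""
    current = {name: 0 for name in weights}
    total = sum(weights.values())
    picks = []
    for _ in range(n):
        # Every server earns credit equal to its weight each round.
        for name, weight in weights.items():
            current[name] += weight
        # The server with the most credit wins and pays back the total.
        chosen = max(current, key=current.get)
        current[chosen] -= total
        picks.append(chosen)
    return picks


weights = {"A": 3, "B": 3, "C": 1}  # hypothetical labels for the three servers
picks = smooth_weighted_round_robin(weights, 7)
counts = Counter(picks)
```

Over one full cycle of 7 picks, each server is selected exactly as many times as its weight.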
Least connections
Route to the backend with fewest active connections. Adapts to long-lived requests (WebSockets, gRPC streams, slow DB queries) where round robin would pile onto busy servers.
- Best for — heterogeneous request duration, persistent connections, streaming
- Caveat — connection count ≠ CPU load; a server with few heavy connections may still be saturated
- Weighted least connections — divide active connections by weight before comparing
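The selection rule can be sketched in a few lines. This is a minimal illustration, assuming the balancer already tracks active connection counts per backend; the function name and sample numbers are hypothetical.

```python
def pick_least_connections(active, weights=None):
    """Pick the backend with the fewest active connections per unit of weight."""
    weights = weights or {}
    # Dividing by weight lets a larger server carry proportionally more connections.
    return min(active, key=lambda server: active[server] / weights.get(server, 1))


active = {"A": 12, "B": 5, "C": 4}  # hypothetical connection counts
plain = pick_least_connections(active)
weighted = pick_least_connections(active, {"A": 4, "B": 1, "C": 1})
```

Without weights the pick is C (4 connections); with A weighted 4×, A's normalized load of 3 is the lowest, so A is picked.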
Least response time
Route to the backend with the lowest average response time (or RTT for L4). Requires active measurement: HAProxy tracks server response times; AWS ALB uses target response latency signals.
- Pros — adapts to degraded nodes before health checks fail
- Cons — noisy with small sample sizes; oscillation if metrics fluctuate
- Hybrid — least connections with response time tie-breaker (common in Envoy)
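To dampen the noise mentioned above, one common choice is an exponentially weighted moving average (EWMA) of observed latencies. The sketch below assumes that smoothing method; the class name and the `alpha` value are illustrative, not taken from any specific load balancer.

```python
class LatencyTracker:
    """Track a smoothed response time per backend and pick the fastest."""

    def __init__(self, alpha=0.2):
        self.alpha = alpha  # weight of the newest sample (assumed value)
        self.ewma_ms = {}

    def observe(self, server, latency_ms):
        previous = self.ewma_ms.get(server)
        if previous is None:
            self.ewma_ms[server] = float(latency_ms)
        else:
            self.ewma_ms[server] = self.alpha * latency_ms + (1 - self.alpha) * previous

    def pick(self):
        return min(self.ewma_ms, key=self.ewma_ms.get)


tracker = LatencyTracker()
tracker.observe("A", 50)
tracker.observe("B", 60)
tracker.observe("A", 500)  # one slow response from A
choice = tracker.pick()
```

A single 500 ms outlier raises A's average to 140 ms, enough to shift the next pick to B without treating A as failed.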
IP hash (source IP affinity)
Hash client IP → map to backend index. Same client always hits same server (until pool size changes). Enables sticky sessions without cookies—but breaks when clients share NAT (corporate egress, mobile carriers).
IP hash behind carrier-grade NAT sends thousands of users to one backend. Prefer cookie-based stickiness at L7 or consistent hashing on a session ID header when affinity is required.
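The NAT problem is easy to see in a small sketch. The hash function, helper name, and addresses below are assumptions for illustration; real balancers use their own hashing.

```python
import hashlib


def backend_for_ip(client_ip, backends):
    """Map a client IP to a backend by hashing it modulo the pool size."""
    digest = hashlib.sha256(client_ip.encode()).digest()
    index = int.from_bytes(digest[:8], "big") % len(backends)
    return backends[index]


backends = ["10.0.1.10", "10.0.1.11", "10.0.1.12"]
nat_egress_ip = "198.51.100.7"  # hypothetical shared NAT address
# A thousand different users behind the same NAT all present the same source IP.
targets = {backend_for_ip(nat_egress_ip, backends) for _ in range(1000)}
```

However many users sit behind the NAT address, `targets` contains a single backend.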
Consistent hashing
Map both backends and request keys (user ID, cache key, partition ID) onto a hash ring (0 to 2³² − 1). A request goes to the first server clockwise from its hash. When a server is added or removed, only about K/N keys remap on average (K = keys, N = servers), unlike modulo hashing, where most keys move.
flowchart LR
subgraph ring["Hash ring"]
S1[Server A]
S2[Server B]
S3[Server C]
K1[key: user_42]
K2[key: user_99]
end
K1 --> S2
K2 --> S3
- Virtual nodes — each physical server gets 100–200 virtual points on ring for even distribution
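- Use cases — distributed caches (Memcached clients using ketama-style hashing), CDNs, sharded data stores (Cassandra, Dynamo-style stores). Redis Cluster and Kafka use related but different schemes: fixed hash slots and hash-modulo partitioning, respectively.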
- Hot spots — popular keys still overload one node; add replication or sub-partition hot keys
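A ring with virtual nodes can be sketched with a sorted list and binary search. This is a minimal illustration, not a production implementation: the MD5-based 32-bit hash, the `HashRing` class, and the `server#i` virtual-node naming are all assumptions.

```python
import bisect
import hashlib


def _hash(value):
    """Return a 32-bit position on the ring."""
    return int.from_bytes(hashlib.md5(value.encode()).digest()[:4], "big")


class HashRing:
    def __init__(self, servers, vnodes=100):
        # Each physical server is placed on the ring at `vnodes` positions.
        self._ring = sorted(
            (_hash(f"{server}#{i}"), server)
            for server in servers
            for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def lookup(self, key):
        # First virtual node clockwise from the key's hash, wrapping at the end.
        idx = bisect.bisect_left(self._points, _hash(key)) % len(self._points)
        return self._ring[idx][1]


keys = [f"user_{i}" for i in range(1000)]
before = HashRing(["A", "B", "C"])
after = HashRing(["A", "B"])  # server C removed
moved = [k for k in keys if before.lookup(k) != after.lookup(k)]
```

Every key in `moved` previously lived on C; keys owned by A and B stay where they were, which is the property that modulo hashing lacks.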
| Algorithm | State needed | Session affinity | Handles variable load |
|---|---|---|---|
| Round robin | Counter | No | No |
| Weighted RR | Counter + weights | No | Partial (capacity) |
| Least connections | Per-server counts | No | Yes |
| Least response time | Latency metrics | No | Yes |
| IP hash | Client IP | Yes (fragile) | No |
| Consistent hash | Request key | Yes (by key) | Partial |
"Design a distributed cache." Lead with consistent hashing on cache key, mention virtual nodes, and explain what happens when a node fails (only 1/N keys remap, thundering herd on replacement—use singleflight/mutex on miss).
L4 vs L7 Load Balancing
OSI layer determines how much of the packet/stream the load balancer inspects. L4 is fast and protocol-agnostic; L7 is intelligent but CPU-heavy. Production stacks use both in tandem.
Layer 4 (Transport)
Operates on TCP/UDP headers: source/dest IP, port, protocol. Forwards packets without reading HTTP body or TLS plaintext. Can do SSL passthrough (encrypted stream forwarded intact to backend).
- Speed — millions of packets/sec; minimal per-connection state
- Examples — AWS NLB, HAProxy TCP mode, Linux IPVS, Google Cloud TCP/SSL proxy
- Routing — IP + port only; cannot route /api vs /static differently
- NAT — often SNATs client IP; backend sees LB IP unless PROXY protocol enabled
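With PROXY protocol enabled, the backend reads one extra header before the application data. A minimal sketch of parsing the human-readable version 1 line follows; the function name and sample addresses are hypothetical, and the sketch deliberately rejects the `UNKNOWN` form and the binary version 2.

```python
def parse_proxy_v1(line: bytes):
    """Parse a PROXY protocol v1 header line into (client_ip, client_port)."""
    if not line.endswith(b"\r\n"):
        raise ValueError("header must end with CRLF")
    parts = line[:-2].decode("ascii").split(" ")
    if len(parts) != 6 or parts[0] != "PROXY" or parts[1] not in ("TCP4", "TCP6"):
        raise ValueError("unsupported or malformed PROXY v1 header")
    # Field order: PROXY <proto> <src ip> <dst ip> <src port> <dst port>
    _, _, src_ip, _dst_ip, src_port, _dst_port = parts
    return src_ip, int(src_port)


header = b"PROXY TCP4 203.0.113.9 10.0.1.10 56324 443\r\n"
client = parse_proxy_v1(header)
```

Here `client` holds the original client address rather than the load balancer's own IP.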
Layer 7 (Application)
Terminates HTTP, inspects headers, path, cookies, JWT claims. Routes based on content. Terminates TLS (unless re-encrypting to backend). Can modify requests (add headers, strip paths).
- Intelligence — path routing, header-based auth, A/B testing, canary by percentage
- Examples — AWS ALB, NGINX, Envoy, Traefik, HAProxy HTTP mode
- Cost — TLS decrypt + HTTP parse per request; lower max RPS than L4
- HTTP/2 & gRPC — must understand multiplexed streams; L4 treats as opaque TCP
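Path routing at L7 usually comes down to a longest-prefix match over configured routes. The sketch below assumes two hypothetical pools and is not tied to any particular proxy's configuration model.

```python
ROUTES = {"/api/": "api_pool", "/assets/": "static_pool"}  # hypothetical pools


def route(path, routes=ROUTES, default="default_pool"):
    """Return the pool for the longest configured prefix that matches the path."""
    matches = [prefix for prefix in routes if path.startswith(prefix)]
    if not matches:
        return default
    return routes[max(matches, key=len)]


api_target = route("/api/v1/users")
static_target = route("/assets/logo.png")
fallback = route("/")
```

An L4 balancer cannot make this decision because it never sees the request path.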
| Dimension | L4 (Transport) | L7 (Application) |
|---|---|---|
| Inspected data | IP, port, TCP flags | HTTP headers, URL, cookies, body (optional) |
| TLS handling | Passthrough or TCP SSL offload | Terminate at LB; optional re-encrypt to backend |
| Routing criteria | IP:port | Host, path, headers, method, query params |
| Throughput | Very high (millions conn) | High (100K–500K RPS typical per node) |
| Latency overhead | < 1 ms | 1–5 ms (TLS + parse) |
| WebSocket | TCP pass-through works | Requires upgrade-aware LB (connection stickiness) |
| Client IP visibility | Needs PROXY protocol / X-Forwarded-For injection at L7 | X-Forwarded-For, X-Real-IP standard |
| Health checks | TCP connect or TCP+TLS handshake | HTTP GET /health with expected status/body |
| Typical placement | First hop, DB/cache TCP pools, gaming UDP | Public API edge, microservice ingress |
flowchart TB
Client[Internet clients]
L4[L4 LB — NLB\nTCP distribution]
L7[L7 LB — ALB/Envoy\nTLS + path routing]
API[API servers]
Static[Static CDN origin]
Client --> L4
L4 --> L7
L7 -->|/api/*| API
L7 -->|/assets/*| Static
Two-tier (NLB → ALB) is common on AWS: NLB handles DDoS scale and static IPs; ALB does path routing. Single L7 is fine for moderate traffic. L4-only works for non-HTTP (PostgreSQL read replicas, Redis, custom TCP).
"We'll put a load balancer in front" is an L4-level answer. "NLB for static IP + DDoS absorption, ALB terminates TLS with ACM certs, routes /v1 to ECS blue pool and /v2 canary at 5% weight, PROXY protocol preserves client IP to Envoy sidecar" is an L6-level answer. (L4 and L6 here are engineering seniority levels, not OSI layers.)
Reverse Proxy vs Forward Proxy vs API Gateway vs Service Mesh
"Proxy" is overloaded. These four patterns sit at different network positions and solve different problems. Conflating them leads to putting business logic in the wrong layer.
Forward proxy
Sits in front of clients. Client configures proxy address; proxy fetches on client's behalf. Used for: corporate egress filtering, caching, anonymization, geo-unblocking.
- Client knows about the proxy
- Server sees proxy IP, not client IP
- Examples: Squid, corporate HTTP proxies, VPN exit nodes
Reverse proxy
Sits in front of servers. Client thinks it talks directly to the server; reverse proxy forwards to backend pool. Combines load balancing, TLS termination, caching, compression, and security filtering.
- Client is unaware of backend topology
- Examples: NGINX, HAProxy, Caddy, Envoy as ingress
- Single entry point hides internal network structure
NGINX reverse proxy uses event-driven async I/O—one worker handles thousands of idle keep-alive connections. Backend connection pooling reuses TCP connections to upstreams, avoiding per-request TCP+TLS handshakes.
API Gateway
Specialized reverse proxy for API traffic (north-south: external clients → services). Adds API-centric features beyond generic reverse proxy:
- Authentication/authorization — JWT validation, API keys, OAuth token introspection
- Rate limiting & quotas — per-client, per-API, burst + sustained limits
- Request/response transformation — REST ↔ gRPC transcoding, header injection
- Analytics & billing — per-consumer usage metering
- Developer portal — API keys, documentation, versioning lifecycle
Examples: Kong, AWS API Gateway, Apigee, Azure API Management, Envoy Gateway with Gateway API.
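The "burst + sustained" limits mentioned above are commonly modeled as a token bucket. The sketch below is one such model under assumed names (`TokenBucket`, `rate`, `burst`); the clock is passed in explicitly so the behavior is easy to follow.

```python
class TokenBucket:
    """Allow `burst` requests at once and `rate` requests per second sustained."""

    def __init__(self, rate, burst, now=0.0):
        self.rate = rate          # tokens added per second
        self.burst = burst        # bucket capacity
        self.tokens = float(burst)
        self.last = now

    def allow(self, now):
        # Refill based on elapsed time, capped at the bucket capacity.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


bucket = TokenBucket(rate=1.0, burst=2)
results = [bucket.allow(0.0), bucket.allow(0.0), bucket.allow(0.0), bucket.allow(1.0)]
```

The first two requests use the burst allowance, the third is rejected, and one more is admitted after a second of refill. A gateway would typically keep one bucket per API key or client.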
Service mesh
Infrastructure layer for east-west traffic (service-to-service inside the cluster). Sidecar proxy (Envoy, Linkerd-proxy) injected alongside each pod; control plane (Istio, Linkerd, Consul) pushes config.
- mTLS everywhere — automatic mutual TLS between services; no app code changes
- Traffic management — retries, timeouts, circuit breaking, fault injection, canary splits
- Observability — distributed tracing headers, per-route metrics without SDK
- Cost — sidecar CPU/memory (~50–100 MB per pod); latency +1–3 ms per hop
| Pattern | Traffic direction | Primary purpose | Examples |
|---|---|---|---|
| Forward proxy | Client → internet | Egress control, caching | Squid, Zscaler |
| Reverse proxy | Internet → server | LB, TLS, caching, hide backends | NGINX, HAProxy |
| API Gateway | External → services | Auth, rate limits, API lifecycle | Kong, AWS API GW |
| Service mesh | Service → service | mTLS, retries, observability | Istio, Linkerd |
flowchart TB
Ext[External client]
GW[API Gateway\nnorth-south]
S1[Service A + sidecar]
S2[Service B + sidecar]
S3[Service C + sidecar]
Ext --> GW
GW --> S1
S1 <-->|mTLS east-west| S2
S2 <-->|mTLS east-west| S3
Service mesh adds operational complexity—start without it for < 20 services. API gateway is almost always warranted for public APIs. Don't duplicate: gateway handles external auth; mesh handles internal mTLS—avoid double JWT validation.
Lyft pioneered Envoy (now CNCF) as both ingress gateway and sidecar. Netflix Zuul (gateway) + internal Eureka discovery predates mesh; newer stacks adopt mesh selectively. Stripe uses edge proxies for rate limiting without a full mesh on every internal call.
Health Checks
Load balancers and orchestrators must know when a backend is failing, starting up, or shutting down. Wrong health check semantics cause traffic to dead nodes—or cut traffic to nodes still serving in-flight requests.
Active vs passive health checks
| Type | Mechanism | Detection speed | Overhead |
|---|---|---|---|
| Active | LB periodically probes (GET /health, TCP connect) | Interval + threshold (e.g. 3 fails in 30 s) | Probe traffic on every backend |
| Passive | LB observes real request outcomes (5xx, timeout, connection refused) | Immediate on real traffic; blind if no traffic | Zero extra probes |
Production uses both: active checks catch silent failures (process hung but port open); passive checks catch slow degradation faster when traffic is flowing. AWS NLB combines active and passive checks; HAProxy can mark a server down after N failed real requests.
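The threshold logic is the same whether an outcome comes from a probe or from a real request. A minimal sketch follows, assuming consecutive-failure and consecutive-success thresholds; the `fall` and `rise` parameter names and their defaults are illustrative.

```python
class BackendHealth:
    """Flip health state only after consecutive failures or successes."""

    def __init__(self, fall=3, rise=2):
        self.fall = fall    # consecutive failures before marking down
        self.rise = rise    # consecutive successes before marking up again
        self.healthy = True
        self._fails = 0
        self._oks = 0

    def record(self, ok):
        """Record one probe result or observed request outcome."""
        if ok:
            self._fails = 0
            self._oks += 1
            if not self.healthy and self._oks >= self.rise:
                self.healthy = True
        else:
            self._oks = 0
            self._fails += 1
            if self.healthy and self._fails >= self.fall:
                self.healthy = False
        return self.healthy


backend = BackendHealth()
history = [backend.record(ok) for ok in (False, False, False, True, True)]
```

Requiring several consecutive results in each direction is what keeps a single slow response from causing flapping.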
Liveness vs readiness
Kubernetes popularized the distinction; it applies to any orchestrated system:
- Liveness — "Is the process alive?" Failure → restart the container. Should be cheap (return 200 if event loop not deadlocked). Don't check dependencies here—DB blip shouldn't kill the pod.
- Readiness — "Can this instance accept traffic?" Failure → remove from LB pool, don't restart. Check dependencies: DB connection pool warmed, cache connected, migration complete.
- Startup — slow-starting apps (JVM 60 s boot). Disables liveness until first success; prevents restart loops.
# Kubernetes probes
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 10
  failureThreshold: 3
readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  periodSeconds: 5
  failureThreshold: 2
startupProbe:
  httpGet:
    path: /healthz
    port: 8080
  failureThreshold: 30
  periodSeconds: 10
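On the application side, the two endpoints answer different questions. The sketch below shows only the decision logic, with no HTTP server; `InstanceState` and its fields are hypothetical names for this instance's own startup and drain state.

```python
from dataclasses import dataclass


@dataclass
class InstanceState:
    db_pool_warmed: bool = False
    cache_connected: bool = False
    draining: bool = False


def probe_status(path, state):
    """Return the HTTP status a probe on `path` should receive."""
    if path == "/healthz":
        return 200  # liveness: the process is responsive; no dependency checks
    if path == "/ready":
        ready = state.db_pool_warmed and state.cache_connected and not state.draining
        return 200 if ready else 503
    return 404


state = InstanceState()
booting = probe_status("/ready", state)
state.db_pool_warmed = state.cache_connected = True
serving = probe_status("/ready", state)
state.draining = True  # e.g. set when shutdown begins
draining = probe_status("/ready", state)
```

Liveness stays at 200 throughout, so the instance is taken out of rotation while booting or draining without being restarted.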
Graceful drain (connection draining)
On deploy or scale-down, yanking a node immediately drops in-flight requests. Graceful drain:
- Mark unhealthy — readiness fails; LB stops sending new connections
- Wait drain period — existing connections complete (30–300 s depending on longest request)
- Force close — terminate stragglers; shutdown process
- AWS ALB — deregistration delay (default 300 s, configurable)
- Kubernetes — preStop hook + terminationGracePeriodSeconds
- WebSockets — send close frame, wait for client ack, or migrate via pub/sub before kill
- HTTP keep-alive — client may reuse connection to dying node until idle timeout
Readiness probe that checks downstream DB causes cascade removal: DB slow → all pods not ready → total outage. Use circuit breakers; readiness should reflect this instance's ability to serve, not global dependency health.
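Deploy 50 pods with a 30 s drain and rolling maxUnavailable 25%: Kubernetes rounds 12.5 down to 12 pods per wave, so the rollout needs at least ceil(50 / 12) = 5 waves × 30 s = 2.5 min of drain time (ignoring maxSurge and pod startup). Long-polling requests with a 60 s hold need a 60 s+ drain or forced disconnects.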
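This kind of estimate can be written as a small helper. It is a rough sketch assuming a percentage maxUnavailable is rounded down to whole pods and that each wave waits one full drain period; it ignores maxSurge and pod startup time. The function name and the sample numbers are hypothetical.

```python
import math


def rollout_drain_estimate(pods, max_unavailable_pct, drain_s):
    """Return (waves, total drain seconds) for a rolling update."""
    # Percentage is rounded down to whole pods, but at least one pod per wave.
    per_wave = max(1, math.floor(pods * max_unavailable_pct / 100))
    waves = math.ceil(pods / per_wave)
    return waves, waves * drain_s


# Example: 20 pods, 25% unavailable at a time, 30 s drain per wave.
waves, seconds = rollout_drain_estimate(pods=20, max_unavailable_pct=25, drain_s=30)
```

For this example the helper yields 4 waves and 120 s of drain time.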
Health endpoint should be unauthenticated but rate-limited internally. Returning dependency status in JSON helps debugging; returning 503 on readiness when any dependency fails helps LB—but tune thresholds to avoid flapping.
Global Load Balancing
Single-region LB isn't enough for global users or disaster recovery. Global LB routes clients to the nearest or healthiest region using DNS, anycast IP, or GeoDNS—with failover when an entire datacenter goes dark.
Global routing adds minutes of DNS TTL delay to failover, cross-region data consistency challenges, and the cost of multi-region active-active infrastructure.
DNS-based load balancing
DNS returns multiple A/AAAA records for one hostname; clients pick one (often first, behavior varies). Low TTL (30–60 s) enables faster failover at cost of DNS query load and resolver caching unpredictability.
- Simple round-robin DNS — no health awareness; dead IP stays in rotation until manual removal
- Managed DNS — Route 53, Cloudflare, NS1 integrate health checks; remove unhealthy records automatically
- Limitation — client/resolver caching delays failover; not true load balancing (no connection awareness)
Anycast
Same IP address announced from multiple geographic locations via BGP. Internet routing delivers packets to the topologically nearest PoP. User in Tokyo hits Tokyo edge; user in London hits London edge—same IP.
- Used by — Cloudflare, Google 8.8.8.8, AWS CloudFront edge, Global Accelerator
- Failover — withdraw BGP announcement from failed PoP; routes shift in seconds to minutes
- DDoS absorption — attack traffic distributed across global edge network
- Stateful apps — anycast works best for stateless edge (CDN, DNS, UDP); TCP sessions break on PoP failure
flowchart TB
U1[User — Tokyo]
U2[User — London]
IP["Same anycast IP\n203.0.113.1"]
P1[PoP Tokyo]
P2[PoP London]
Origin[Origin region]
U1 --> IP
U2 --> IP
IP --> P1
IP --> P2
P1 --> Origin
P2 --> Origin
GeoDNS (Geolocation routing)
DNS server returns different answers based on resolver's geographic location (GeoIP of DNS resolver, not always end user). Policies: route Europeans to eu-west-1, Americans to us-east-1; or latency-based routing (Route 53 Latency Routing).
- Latency-based routing — AWS measures inter-region latency; returns lowest-latency healthy region
- Geoproximity — bias traffic with configurable "bias" to scale region capacity
- Geolocation — hard geographic rules (GDPR data residency: EU users → EU only)
- Weighted routing — 90% primary region, 10% DR for continuous validation
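Latency-based routing combined with health checks reduces to "lowest latency among healthy regions". The sketch below illustrates only that decision; the function name and the latency figures are hypothetical, and a managed DNS service makes this choice on its own side.

```python
def choose_region(latency_ms, healthy):
    """Return the lowest-latency region that is currently healthy."""
    candidates = {region: ms for region, ms in latency_ms.items() if region in healthy}
    if not candidates:
        raise RuntimeError("no healthy region available")
    return min(candidates, key=candidates.get)


latency_ms = {"eu-west-1": 25, "us-east-1": 95}  # hypothetical measurements
normal = choose_region(latency_ms, healthy={"eu-west-1", "us-east-1"})
failover = choose_region(latency_ms, healthy={"us-east-1"})
```

When eu-west-1 fails its health check, the same client is answered with us-east-1 despite the higher latency.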
Failover strategies
| Strategy | RTO | Cost | Data consistency |
|---|---|---|---|
| Active-passive | Minutes (DNS TTL + warm-up) | Low (DR idle) | Async replication lag accepted |
| Active-active | Seconds (remove bad region) | High (all regions live) | Conflict resolution needed (CRDT, last-write-wins) |
| Pilot light | 10–30 min (scale DR) | Medium | Replica lag |
| Warm standby | Minutes | Medium | Near-sync replication |
Multi-region architecture pattern
- Global anycast or GeoDNS → regional L4/L7 LB
- Regional stack — compute, cache, read replicas local to region
- Write path — single primary region, or multi-primary with conflict resolution
- Health checks — Route 53 health checker on regional endpoint; auto-failover DNS
- Chaos testing — regularly kill a region (Netflix Chaos Kong style) to validate failover
"Design a globally available service." Clarify RPO/RTO, read vs write ratio, consistency requirements. Active-active reads + single write primary is the safe default. Mention DNS TTL trade-off (60 s TTL = up to 60 s stale routing after failover).
Cloudflare anycast network (~300 cities) terminates TLS at edge; origin connection reused. AWS Route 53 health-checked failover shifted traffic during us-east-1 outages for prepared customers. Google Spanner enables true multi-region strong consistency—at latency and cost premium.
Quantify failover: "Route 53 health check every 30 s, 3 failures to mark unhealthy, TTL 60 s → worst case ~2.5 min before all clients reroute. For faster RTO, use anycast edge (sub-minute) with stateless workers and regional read replicas."
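The arithmetic behind that estimate is simple enough to keep as a helper. This is a sketch under the simplifying assumption that detection time is interval × threshold and that resolvers honor the TTL; the function name is hypothetical.

```python
def worst_case_failover_s(check_interval_s, failure_threshold, dns_ttl_s):
    """Worst-case seconds until all clients stop resolving to the failed region."""
    detection = check_interval_s * failure_threshold  # time to mark unhealthy
    return detection + dns_ttl_s                      # plus cached DNS answers


total_s = worst_case_failover_s(check_interval_s=30, failure_threshold=3, dns_ttl_s=60)
```

With the numbers above, `total_s` is 150 s, matching the 2.5 min figure. Shortening the check interval or the TTL lowers the bound at the cost of more probe and DNS traffic.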