Load Balancing & Proxies

No single server survives production traffic. Load balancers distribute requests across pools, detect unhealthy backends, and terminate TLS at the edge. This chapter covers algorithms, L4 vs L7 trade-offs, the proxy/gateway/mesh spectrum, health check semantics, and global routing with DNS and anycast.

Load Balancing Algorithms

The algorithm decides which backend receives each request. Wrong choice causes hot spots, uneven CPU, or broken session affinity. Match the algorithm to request homogeneity, backend capacity, and statefulness.

L4 All algorithms assume a pool of healthy backends. Health checks (covered below) remove failed nodes first; the algorithm only distributes among survivors.

Round robin

Requests cycle sequentially: A → B → C → A → B → C. Simple, stateless, works when all backends are identical and requests have similar cost (~50 ms each).

  • Pros — even distribution over long windows; zero configuration
  • Cons — ignores current load; long requests create hot spots; no affinity
  • Variants — weighted round robin (see below); DNS round robin (coarse, TTL issues)

Weighted round robin

Assign each server a weight proportional to capacity. Server A (weight 3) gets 3× the traffic of Server B (weight 1). Use when instances differ in CPU/RAM (mixed instance types) or during gradual rollout (canary at weight 1, stable at weight 9).

# NGINX upstream example
upstream api_backend {
  server 10.0.1.10:8080 weight=3;
  server 10.0.1.11:8080 weight=3;
  server 10.0.1.12:8080 weight=1;  # canary / smaller instance
}
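
One way to realize these weights is the "smooth" weighted round robin variant, which interleaves picks instead of sending three requests in a row to the heavy server. The sketch below is a minimal illustration under that assumption; the class name is hypothetical and the addresses are the ones from the upstream block above.

```python
from collections import Counter


class SmoothWeightedRoundRobin:
    """Spreads picks so that heavy servers are not hit in bursts."""

    def __init__(self, weights):
        # weights: dict of backend name -> positive integer weight
        self.weights = dict(weights)
        self.current = {name: 0 for name in self.weights}
        self.total = sum(self.weights.values())

    def pick(self):
        # Every backend earns credit equal to its weight on each pick.
        for name, weight in self.weights.items():
            self.current[name] += weight
        # The backend with the most credit wins and pays back the total weight.
        chosen = max(self.current, key=self.current.get)
        self.current[chosen] -= self.total
        return chosen


lb = SmoothWeightedRoundRobin({"10.0.1.10": 3, "10.0.1.11": 3, "10.0.1.12": 1})
picks = [lb.pick() for _ in range(7)]  # one full cycle (sum of weights = 7)
counts = Counter(picks)
```

Over one full cycle each backend is picked exactly as many times as its weight, and the weight-1 server is chosen mid-cycle rather than only at the end.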

Least connections

Route to the backend with fewest active connections. Adapts to long-lived requests (WebSockets, gRPC streams, slow DB queries) where round robin would pile onto busy servers.

  • Best for — heterogeneous request duration, persistent connections, streaming
  • Caveat — connection count ≠ CPU load; a server with few heavy connections may still be saturated
  • Weighted least connections — divide active connections by weight before comparing
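
The weighted comparison in the last bullet can be sketched as follows. This is an illustration only: the class, the backend names, and the explicit acquire/release calls are assumptions, and a real balancer tracks connection counts itself.

```python
class LeastConnections:
    """Picks the backend with the fewest active connections per unit of weight."""

    def __init__(self, weights):
        self.weights = dict(weights)
        self.active = {name: 0 for name in self.weights}

    def acquire(self):
        # Normalize by weight so a weight-2 server may hold twice the connections.
        chosen = min(self.active, key=lambda n: self.active[n] / self.weights[n])
        self.active[chosen] += 1
        return chosen

    def release(self, name):
        # Call when the connection closes so the count reflects current load.
        self.active[name] -= 1


lb = LeastConnections({"a": 1, "b": 1, "big": 2})
first_four = [lb.acquire() for _ in range(4)]
lb.release("a")
next_pick = lb.acquire()
```

After four connections the larger server holds two of them, and a released connection makes its server the next choice.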

Least response time

Route to the backend with lowest average response time (or RTT for L4). Requires active measurement: HAProxy tracks server response times; AWS ALB uses target response latency signals.

  • Pros — adapts to degraded nodes before health checks fail
  • Cons — noisy with small sample sizes; oscillation if metrics fluctuate
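  • Hybrid — least connections with response time tie-breaker; Envoy's least-request policy is a related approach that compares two randomly sampled hosts (power of two choices) by active requests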
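
A common way to damp the noise and oscillation noted above is an exponentially weighted moving average (EWMA) of observed latency. The sketch below assumes that approach; the class name, the smoothing factor alpha=0.3, and the sample latencies are all illustrative choices, not values from any particular load balancer.

```python
class EwmaLatencyPicker:
    """Routes to the backend with the lowest smoothed response time."""

    def __init__(self, backends, alpha=0.3):
        self.alpha = alpha  # higher alpha reacts faster but is noisier
        self.ewma = {name: None for name in backends}

    def observe(self, name, latency_ms):
        prev = self.ewma[name]
        if prev is None:
            self.ewma[name] = float(latency_ms)
        else:
            self.ewma[name] = self.alpha * latency_ms + (1 - self.alpha) * prev

    def pick(self):
        # Unmeasured backends are tried first so every server gets a sample.
        for name, value in self.ewma.items():
            if value is None:
                return name
        return min(self.ewma, key=self.ewma.get)


picker = EwmaLatencyPicker(["a", "b"])
picker.observe("a", 50)
picker.observe("b", 50)
picker.observe("a", 500)  # one slow response pulls a's average up to 185 ms
choice = picker.pick()
```

A single slow response shifts traffic away from the backend without waiting for a health check to fail, and later fast responses let it recover gradually.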

IP hash (source IP affinity)

Hash client IP → map to backend index. Same client always hits same server (until pool size changes). Enables sticky sessions without cookies—but breaks when clients share NAT (corporate egress, mobile carriers).

⚠️ Pitfall

IP hash behind carrier-grade NAT sends thousands of users to one backend. Prefer cookie-based stickiness at L7 or consistent hashing on a session ID header when affinity is required.

Consistent hashing

Map both backends and request keys (user ID, cache key, partition ID) onto a hash ring (positions 0 to 2³²−1). A request goes to the first server clockwise from its hash. When a server is added or removed, only about K/N keys remap on average (K = keys, N = servers), whereas with modulo hashing (hash mod N) most keys move.

flowchart LR
  subgraph ring["Hash ring"]
    S1[Server A]
    S2[Server B]
    S3[Server C]
    K1[key: user_42]
    K2[key: user_99]
  end
  K1 --> S2
  K2 --> S3
  • Virtual nodes — each physical server gets 100–200 virtual points on ring for even distribution
  • Use cases — distributed caches (Memcached, Redis Cluster), CDNs, sharded data stores, Kafka partitions
  • Hot spots — popular keys still overload one node; add replication or sub-partition hot keys
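
The ring with virtual nodes can be sketched in a few lines. This is a minimal illustration, not a production implementation: MD5 truncated to 32 bits is an arbitrary choice of hash, and the class name, the "node#i" labeling of virtual nodes, and the sample keys are assumptions.

```python
import bisect
import hashlib


def ring_position(value: str) -> int:
    # First 4 bytes of MD5 give a stable position in 0 .. 2**32 - 1.
    return int.from_bytes(hashlib.md5(value.encode()).digest()[:4], "big")


class HashRing:
    def __init__(self, nodes=(), vnodes=100):
        self.vnodes = vnodes
        self._points = []  # sorted list of (position, node)
        for node in nodes:
            self.add(node)

    def add(self, node):
        # Each physical node contributes `vnodes` points for an even spread.
        for i in range(self.vnodes):
            bisect.insort(self._points, (ring_position(f"{node}#{i}"), node))

    def remove(self, node):
        self._points = [p for p in self._points if p[1] != node]

    def lookup(self, key):
        # First point at or clockwise from the key, wrapping past the end.
        idx = bisect.bisect_left(self._points, (ring_position(key), ""))
        return self._points[idx % len(self._points)][1]


ring = HashRing(["A", "B", "C"])
keys = [f"user_{i}" for i in range(1000)]
before = {k: ring.lookup(k) for k in keys}
ring.remove("C")
after = {k: ring.lookup(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
```

Only keys that lived on the removed server change owner; every key on A or B stays put, which is the property that modulo hashing lacks.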

Algorithm           | State needed       | Session affinity | Handles variable load
--------------------|--------------------|------------------|----------------------
Round robin         | Counter            | No               | No
Weighted RR         | Counter + weights  | No               | Partial (capacity)
Least connections   | Per-server counts  | No               | Yes
Least response time | Latency metrics    | No               | Yes
IP hash             | Client IP          | Yes (fragile)    | No
Consistent hash     | Request key        | Yes (by key)     | Partial

🎯 Interview Tip

"Design a distributed cache." Lead with consistent hashing on cache key, mention virtual nodes, and explain what happens when a node fails (only 1/N keys remap, thundering herd on replacement—use singleflight/mutex on miss).

L4 vs L7 Load Balancing

OSI layer determines how much of the packet/stream the load balancer inspects. L4 is fast and protocol-agnostic; L7 is intelligent but CPU-heavy. Production stacks use both in tandem.

Layer 4 (Transport)

Operates on TCP/UDP headers: source/dest IP, port, protocol. Forwards packets without reading HTTP body or TLS plaintext. Can do SSL passthrough (encrypted stream forwarded intact to backend).

  • Speed — millions of packets/sec; minimal per-connection state
  • Examples — AWS NLB, HAProxy TCP mode, Linux IPVS, Google Cloud TCP/SSL proxy
  • Routing — IP + port only; cannot route /api vs /static differently
  • NAT — often SNATs client IP; backend sees LB IP unless PROXY protocol enabled
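
On the backend side, recovering the client address means reading the PROXY protocol line that the balancer prepends to the TCP stream. The sketch below handles only the human-readable version 1 format; the function name and the returned dictionary keys are hypothetical, and the binary version 2 format is out of scope.

```python
import ipaddress


def parse_proxy_v1(header: bytes):
    """Parse one PROXY protocol v1 line, e.g. 'PROXY TCP4 <src> <dst> <sport> <dport>' + CRLF."""
    if not header.endswith(b"\r\n"):
        raise ValueError("header must end with CRLF")
    parts = header[:-2].decode("ascii").split(" ")
    if len(parts) < 2 or parts[0] != "PROXY":
        raise ValueError("not a PROXY v1 header")
    if parts[1] == "UNKNOWN":
        return None  # the sender could not determine the addresses
    if parts[1] not in ("TCP4", "TCP6") or len(parts) != 6:
        raise ValueError("malformed PROXY v1 header")
    # ip_address() validates the textual addresses and raises on garbage.
    src_ip = ipaddress.ip_address(parts[2])
    dst_ip = ipaddress.ip_address(parts[3])
    return {
        "client": (str(src_ip), int(parts[4])),
        "destination": (str(dst_ip), int(parts[5])),
    }


info = parse_proxy_v1(b"PROXY TCP4 192.0.2.10 10.0.0.5 51000 443\r\n")
```

The backend must only accept this header from trusted balancer addresses, since any client could otherwise forge its source IP.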

Layer 7 (Application)

Terminates TLS and parses HTTP, so it can inspect headers, path, cookies, and JWT claims and route on content. Traffic to the backend is either plaintext or re-encrypted. Can modify requests (add headers, strip paths).

  • Intelligence — path routing, header-based auth, A/B testing, canary by percentage
  • Examples — AWS ALB, NGINX, Envoy, Traefik, HAProxy HTTP mode
  • Cost — TLS decrypt + HTTP parse per request; lower max RPS than L4
  • HTTP/2 & gRPC — must understand multiplexed streams; L4 treats as opaque TCP
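
Path routing and percentage canaries from the first bullet reduce to two small decisions, sketched here. The pool names, the route table, and hashing a request ID into 100 buckets are illustrative assumptions rather than the behavior of any specific product.

```python
import hashlib


def route(path, routes, default=None):
    # Longest matching prefix wins, so /api/v2/ can override /api/.
    best = None
    for prefix, pool in routes.items():
        if path.startswith(prefix) and (best is None or len(prefix) > len(best[0])):
            best = (prefix, pool)
    return best[1] if best else default


def in_canary(request_id, percent):
    # Hash the ID into 100 buckets so one caller always lands on the same side.
    bucket = int(hashlib.sha256(request_id.encode()).hexdigest(), 16) % 100
    return bucket < percent


ROUTES = {
    "/api/": "api-pool",
    "/api/v2/": "api-v2-pool",
    "/assets/": "static-origin",
}

pool = route("/api/v2/users", ROUTES)
```

An L4 balancer cannot make either decision, because both depend on data that only exists after TLS termination and HTTP parsing.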

Dimension            | L4 (Transport)                                         | L7 (Application)
---------------------|--------------------------------------------------------|-------------------------------------------------
Inspected data       | IP, port, TCP flags                                    | HTTP headers, URL, cookies, body (optional)
TLS handling         | Passthrough or TCP SSL offload                         | Terminate at LB; optional re-encrypt to backend
Routing criteria     | IP:port                                                | Host, path, headers, method, query params
Throughput           | Very high (millions conn)                              | High (100K–500K RPS typical per node)
Latency overhead     | < 1 ms                                                 | 1–5 ms (TLS + parse)
WebSocket            | TCP pass-through works                                 | Requires upgrade-aware LB (connection stickiness)
Client IP visibility | Needs PROXY protocol / X-Forwarded-For injection at L7 | X-Forwarded-For, X-Real-IP standard
Health checks        | TCP connect or TCP+TLS handshake                       | HTTP GET /health with expected status/body
Typical placement    | First hop, DB/cache TCP pools, gaming UDP              | Public API edge, microservice ingress

flowchart TB
  Client[Internet clients]
  L4[L4 LB — NLB\nTCP distribution]
  L7[L7 LB — ALB/Envoy\nTLS + path routing]
  API[API servers]
  Static[Static CDN origin]
  Client --> L4
  L4 --> L7
  L7 -->|/api/*| API
  L7 -->|/assets/*| Static
⚖️ Trade-off

Two-tier (NLB → ALB) is common on AWS: NLB handles DDoS scale and static IPs; ALB does path routing. Single L7 is fine for moderate traffic. L4-only works for non-HTTP (PostgreSQL read replicas, Redis, custom TCP).

🏆 Senior Signal

"We'll put a load balancer in front" is L4. "NLB for static IP + DDoS absorption, ALB terminates TLS with ACM certs, routes /v1 to ECS blue pool and /v2 canary at 5% weight, PROXY protocol preserves client IP to Envoy sidecar" is L6.

Reverse Proxy vs Forward Proxy vs API Gateway vs Service Mesh

"Proxy" is overloaded. These four patterns sit at different network positions and solve different problems. Conflating them leads to putting business logic in the wrong layer.

Forward proxy

Sits in front of clients. Client configures proxy address; proxy fetches on client's behalf. Used for: corporate egress filtering, caching, anonymization, geo-unblocking.

  • Client knows about the proxy
  • Server sees proxy IP, not client IP
  • Examples: Squid, corporate HTTP proxies, VPN exit nodes

Reverse proxy

Sits in front of servers. Client thinks it talks directly to the server; reverse proxy forwards to backend pool. Combines load balancing, TLS termination, caching, compression, and security filtering.

  • Client is unaware of backend topology
  • Examples: NGINX, HAProxy, Caddy, Envoy as ingress
  • Single entry point hides internal network structure
🔬 Under the Hood

NGINX reverse proxy uses event-driven async I/O—one worker handles thousands of idle keep-alive connections. Backend connection pooling reuses TCP connections to upstreams, avoiding per-request TCP+TLS handshakes.

API Gateway

Specialized reverse proxy for API traffic (north-south: external clients → services). Adds API-centric features beyond generic reverse proxy:

  • Authentication/authorization — JWT validation, API keys, OAuth token introspection
  • Rate limiting & quotas — per-client, per-API, burst + sustained limits
  • Request/response transformation — REST ↔ gRPC transcoding, header injection
  • Analytics & billing — per-consumer usage metering
  • Developer portal — API keys, documentation, versioning lifecycle

Examples: Kong, AWS API Gateway, Apigee, Azure API Management, Envoy Gateway with Gateway API.
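
The "burst + sustained" limits above are commonly modeled as a token bucket: capacity sets the burst, refill rate sets the sustained limit. The sketch below assumes that model; the class name and the numbers are illustrative, and the clock is passed in explicitly so the behavior is easy to follow.

```python
class TokenBucket:
    """Allows bursts up to `burst` requests, refilled at `rate_per_s`."""

    def __init__(self, rate_per_s, burst, now=0.0):
        self.rate = rate_per_s
        self.capacity = burst
        self.tokens = float(burst)
        self.last = now

    def allow(self, now, cost=1):
        # Refill for the elapsed time, capped at the burst size.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False  # caller would answer with HTTP 429


bucket = TokenBucket(rate_per_s=2, burst=5)
burst_results = [bucket.allow(now=0.0) for _ in range(6)]
after_half_second = bucket.allow(now=0.5)   # 0.5 s * 2 tokens/s = 1 token
immediately_again = bucket.allow(now=0.5)
```

A gateway would keep one bucket per API key or client, typically in a shared store so that all gateway nodes enforce the same quota.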

Service mesh

Infrastructure layer for east-west traffic (service-to-service inside the cluster). Sidecar proxy (Envoy, Linkerd-proxy) injected alongside each pod; control plane (Istio, Linkerd, Consul) pushes config.

  • mTLS everywhere — automatic mutual TLS between services; no app code changes
  • Traffic management — retries, timeouts, circuit breaking, fault injection, canary splits
  • Observability — distributed tracing headers, per-route metrics without SDK
  • Cost — sidecar CPU/memory (~50–100 MB per pod); latency +1–3 ms per hop

Pattern       | Traffic direction   | Primary purpose                 | Examples
--------------|---------------------|---------------------------------|-----------------
Forward proxy | Client → internet   | Egress control, caching         | Squid, Zscaler
Reverse proxy | Internet → server   | LB, TLS, caching, hide backends | NGINX, HAProxy
API Gateway   | External → services | Auth, rate limits, API lifecycle | Kong, AWS API GW
Service mesh  | Service → service   | mTLS, retries, observability    | Istio, Linkerd

flowchart TB
  Ext[External client]
  GW[API Gateway\nnorth-south]
  S1[Service A + sidecar]
  S2[Service B + sidecar]
  S3[Service C + sidecar]
  Ext --> GW
  GW --> S1
  S1 <-->|mTLS east-west| S2
  S2 <-->|mTLS east-west| S3
⚖️ Trade-off

Service mesh adds operational complexity—start without it for < 20 services. API gateway is almost always warranted for public APIs. Don't duplicate: gateway handles external auth; mesh handles internal mTLS—avoid double JWT validation.
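
For teams that start without a mesh, the circuit breaking a sidecar would otherwise provide can live in a small client-side state machine. The following is a simplified sketch under assumed thresholds (3 failures, 30 s reset); the class and its method names are hypothetical, and real implementations usually limit how many trial requests pass while half-open.

```python
class CircuitBreaker:
    """Closed -> open after N consecutive failures -> half-open after a timeout."""

    def __init__(self, failure_threshold=3, reset_timeout_s=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def allow_request(self, now):
        if self.opened_at is None:
            return True
        # Half-open: once the timeout has passed, let a trial request through.
        return now - self.opened_at >= self.reset_timeout_s

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self, now):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = now  # open (or re-open) the circuit


cb = CircuitBreaker(failure_threshold=3, reset_timeout_s=30.0)
for t in (0.0, 1.0, 2.0):
    cb.record_failure(now=t)
blocked = cb.allow_request(now=3.0)     # circuit is open
trial = cb.allow_request(now=32.0)      # timeout elapsed, trial allowed
```

Failing fast while the circuit is open keeps a slow dependency from tying up threads and connections in every caller.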

📦 Real World

Lyft pioneered Envoy (now CNCF) as both ingress gateway and sidecar. Netflix Zuul (gateway) + internal Eureka discovery predates mesh; newer stacks adopt mesh selectively. Stripe uses edge proxies for rate limiting without a full mesh on every internal call.

Health Checks

Load balancers and orchestrators must know when a backend is failing, starting up, or shutting down. Wrong health check semantics cause traffic to dead nodes—or cut traffic to nodes still serving in-flight requests.

Active vs passive health checks

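Type    | Mechanism                                                                 | Detection speed                               | Overhead
--------|---------------------------------------------------------------------------|-----------------------------------------------|------------------------------
Active  | LB periodically probes (GET /health, TCP connect)                         | Interval + threshold (e.g. 3 fails in 30 s)   | Probe traffic on every backend
Passive | LB observes real request outcomes (5xx, timeout, connection refused)      | Immediate on real traffic; blind if no traffic | Zero extra probes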

Production uses both: active catches silent failures (process hung but port open); passive catches slow degradation faster when traffic is flowing. AWS ALB supports both; HAProxy can mark server down after N failed real requests.
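
The "interval + threshold" behavior of active checks is a small hysteresis loop: several consecutive failures mark a server down, several consecutive successes bring it back. The sketch below borrows the names fall and rise from HAProxy's server options; the class itself and the probe sequence are illustrative.

```python
class HealthTracker:
    """Flips state only after `fall` straight failures or `rise` straight successes."""

    def __init__(self, fall=3, rise=2):
        self.fall, self.rise = fall, rise
        self.healthy = True
        self._streak = 0  # consecutive results that contradict the current state

    def record(self, probe_ok):
        if probe_ok == self.healthy:
            self._streak = 0  # result agrees with current state; reset the streak
            return self.healthy
        self._streak += 1
        limit = self.fall if self.healthy else self.rise
        if self._streak >= limit:
            self.healthy = not self.healthy
            self._streak = 0
        return self.healthy


tracker = HealthTracker(fall=3, rise=2)
probes = [True, False, False, True, False, False, False, True, True]
states = [tracker.record(ok) for ok in probes]
```

Two isolated failures do not remove the server; three in a row do, and it returns to the pool only after two consecutive successes. This is what prevents flapping.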

Liveness vs readiness

Kubernetes popularized the distinction; it applies to any orchestrated system:

  • Liveness — "Is the process alive?" Failure → restart the container. Should be cheap (return 200 if event loop not deadlocked). Don't check dependencies here—DB blip shouldn't kill the pod.
  • Readiness — "Can this instance accept traffic?" Failure → remove from LB pool, don't restart. Check dependencies: DB connection pool warmed, cache connected, migration complete.
  • Startup — slow-starting apps (JVM 60 s boot). Disables liveness until first success; prevents restart loops.
# Kubernetes probes
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 10
  failureThreshold: 3

readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  periodSeconds: 5
  failureThreshold: 2

startupProbe:
  httpGet:
    path: /healthz
    port: 8080
  failureThreshold: 30
  periodSeconds: 10
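
On the application side, the two paths probed above can be served by a handler like the one below. It is a minimal sketch using only the standard library: the paths match the probe config, while the `READY` event and the helper names are assumptions about how an app might track warm-up.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# Set once warm-up finishes (pools opened, caches connected); cleared when draining.
READY = threading.Event()


def probe_status(path, ready):
    if path == "/healthz":
        return 200  # liveness: the process can answer at all; no dependency checks
    if path == "/ready":
        return 200 if ready else 503  # readiness: safe to receive traffic?
    return 404


class ProbeHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(probe_status(self.path, READY.is_set()))
        self.send_header("Content-Length", "0")
        self.end_headers()

    def log_message(self, fmt, *args):
        pass  # keep probe traffic out of the logs in this sketch


def start_probe_server(port=8080):
    # Serves probes on a background thread so it stays independent of app work.
    server = HTTPServer(("0.0.0.0", port), ProbeHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

The app calls `READY.set()` after warm-up and `READY.clear()` on SIGTERM, so readiness fails during drain while liveness keeps passing.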

Graceful drain (connection draining)

On deploy or scale-down, yanking a node immediately drops in-flight requests. Graceful drain:

  1. Mark unhealthy — readiness fails; LB stops sending new connections
  2. Wait drain period — existing connections complete (30–300 s depending on longest request)
  3. Force close — terminate stragglers; shutdown process
  • AWS ALB — deregistration delay (default 300 s, configurable)
  • Kubernetes — preStop hook + terminationGracePeriodSeconds
  • WebSockets — send close frame, wait for client ack, or migrate via pub/sub before kill
  • HTTP keep-alive — client may reuse connection to dying node until idle timeout
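
Inside the process, the three drain steps can be sketched as a gate that refuses new work and waits for in-flight requests with a deadline. The class and method names are hypothetical, and the short timeouts exist only to keep the example quick.

```python
import threading


class DrainGate:
    """Tracks in-flight requests and blocks shutdown until they finish or time out."""

    def __init__(self):
        self._cond = threading.Condition()
        self._in_flight = 0
        self.draining = False

    def try_enter(self):
        with self._cond:
            if self.draining:
                return False  # step 1: readiness has failed, refuse new work
            self._in_flight += 1
            return True

    def leave(self):
        with self._cond:
            self._in_flight -= 1
            self._cond.notify_all()

    def drain(self, timeout_s):
        with self._cond:
            self.draining = True
            # Step 2: wait for in-flight work; False means stragglers remain (step 3).
            return self._cond.wait_for(lambda: self._in_flight == 0, timeout=timeout_s)


gate = DrainGate()
gate.try_enter()                    # one request is now in flight
drained_early = gate.drain(0.05)    # times out: the request has not finished
accepted_late = gate.try_enter()    # refused: the gate is draining
gate.leave()                        # the straggler completes
drained = gate.drain(0.05)
```

When `drain` returns False the caller force-closes the remaining connections, which is why the timeout should exceed the longest expected request.
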
⚠️ Pitfall

Readiness probe that checks downstream DB causes cascade removal: DB slow → all pods not ready → total outage. Use circuit breakers; readiness should reflect this instance's ability to serve, not global dependency health.

📐 Estimation

Deploy 50 pods with 30 s drain, rolling maxUnavailable 25%: Kubernetes rounds 12.5 down to 12 pods per wave, so at least ⌈50/12⌉ = 5 waves × 30 s = 2.5 min of drain time, before counting pod startup. Long-polling requests with 60 s hold → need 60 s+ drain or forced disconnects.

💡 Pro Tip

Health endpoint should be unauthenticated but rate-limited internally. Returning dependency status in JSON helps debugging; returning 503 on readiness when any dependency fails helps LB—but tune thresholds to avoid flapping.

Global Load Balancing

Single-region LB isn't enough for global users or disaster recovery. Global LB routes clients to the nearest or healthiest region using DNS, anycast IP, or GeoDNS—with failover when an entire datacenter goes dark.

L6 Global routing adds minutes of DNS TTL delay to failover, cross-region data consistency challenges, and cost of multi-region active-active infrastructure.

DNS-based load balancing

DNS returns multiple A/AAAA records for one hostname; clients pick one (often first, behavior varies). Low TTL (30–60 s) enables faster failover at cost of DNS query load and resolver caching unpredictability.

  • Simple round-robin DNS — no health awareness; dead IP stays in rotation until manual removal
  • Managed DNS — Route 53, Cloudflare, NS1 integrate health checks; remove unhealthy records automatically
  • Limitation — client/resolver caching delays failover; not true load balancing (no connection awareness)

Anycast

Same IP address announced from multiple geographic locations via BGP. Internet routing delivers packets to the topologically nearest PoP. User in Tokyo hits Tokyo edge; user in London hits London edge—same IP.

  • Used by — Cloudflare, Google 8.8.8.8, AWS CloudFront edge, Global Accelerator
  • Failover — withdraw BGP announcement from failed PoP; routes shift in seconds to minutes
  • DDoS absorption — attack traffic distributed across global edge network
  • Stateful apps — anycast works best for stateless edge (CDN, DNS, UDP); TCP sessions break on PoP failure
flowchart TB
  U1[User — Tokyo]
  U2[User — London]
  IP["Same anycast IP\n203.0.113.1"]
  P1[PoP Tokyo]
  P2[PoP London]
  Origin[Origin region]
  U1 --> IP
  U2 --> IP
  IP --> P1
  IP --> P2
  P1 --> Origin
  P2 --> Origin

GeoDNS (Geolocation routing)

DNS server returns different answers based on resolver's geographic location (GeoIP of DNS resolver, not always end user). Policies: route Europeans to eu-west-1, Americans to us-east-1; or latency-based routing (Route 53 Latency Routing).

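  • Latency-based routing — Route 53 uses latency measurements between the client's network and each AWS region; returns the lowest-latency healthy region
  • Geoproximity — routes by distance to each resource; a configurable bias grows or shrinks the area served by a region, shifting load toward or away from it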
  • Geolocation — hard geographic rules (GDPR data residency: EU users → EU only)
  • Weighted routing — 90% primary region, 10% DR for continuous validation
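
How these policies combine can be sketched as a single decision function: hard geolocation rules first, then health, then latency. This is an illustration of the logic only, not how any managed DNS service is implemented; the function name, the latency figures, and the residency table are made up for the example.

```python
def pick_region(client_country, latency_ms, healthy, residency=None):
    """Return the region a GeoDNS-style resolver might answer with, or None."""
    residency = residency or {}
    # Hard geolocation rule: pinned countries only ever get their allowed regions.
    allowed = residency.get(client_country, list(latency_ms))
    candidates = [r for r in allowed if r in healthy]
    if not candidates:
        return None  # nothing healthy satisfies the rule; caller decides what to serve
    # Among permitted healthy regions, prefer the lowest measured latency.
    return min(candidates, key=lambda r: latency_ms[r])


LATENCY = {"eu-west-1": 80, "us-east-1": 20, "ap-northeast-1": 150}
RESIDENCY = {"DE": ["eu-west-1"]}  # hypothetical data-residency pin
ALL_HEALTHY = set(LATENCY)

us_user = pick_region("US", LATENCY, ALL_HEALTHY, RESIDENCY)
de_user = pick_region("DE", LATENCY, ALL_HEALTHY, RESIDENCY)
```

Note the last case in practice: a residency-pinned user gets no answer if their only permitted region is down, so compliance rules need their own in-jurisdiction failover target.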

Failover strategies

Strategy       | RTO                         | Cost                    | Data consistency
---------------|-----------------------------|-------------------------|--------------------------------------------------
Active-passive | Minutes (DNS TTL + warm-up) | Low (DR idle)           | Async replication lag accepted
Active-active  | Seconds (remove bad region) | High (all regions live) | Conflict resolution needed (CRDT, last-write-wins)
Pilot light    | 10–30 min (scale DR)        | Medium                  | Replica lag
Warm standby   | Minutes                     | Medium                  | Near-sync replication

Multi-region architecture pattern

  1. Global anycast or GeoDNS → regional L4/L7 LB
  2. Regional stack — compute, cache, read replicas local to region
  3. Write path — single primary region, or multi-primary with conflict resolution
  4. Health checks — Route 53 health checker on regional endpoint; auto-failover DNS
  5. Chaos testing — regularly fail over an entire region (Netflix Chaos Kong style) to validate failover
🎯 Interview Tip

"Design a globally available service." Clarify RPO/RTO, read vs write ratio, consistency requirements. Active-active reads + single write primary is the safe default. Mention DNS TTL trade-off (60 s TTL = up to 60 s stale routing after failover).

📦 Real World

Cloudflare anycast network (~300 cities) terminates TLS at edge; origin connection reused. AWS Route 53 health-checked failover shifted traffic during us-east-1 outages for prepared customers. Google Spanner enables true multi-region strong consistency—at latency and cost premium.

🏆 Senior Signal

Quantify failover: "Route 53 health check every 30 s, 3 failures to mark unhealthy, TTL 60 s → worst case ~2.5 min before all clients reroute. For faster RTO, use anycast edge (sub-minute) with stateless workers and regional read replicas."
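
The arithmetic behind that quote is simple enough to keep as a helper when comparing settings. The function name is hypothetical, and the model assumes clients cache the old record for a full TTL after the health check trips.

```python
def dns_failover_worst_case_s(check_interval_s, failure_threshold, ttl_s):
    # Time to detect the outage: consecutive failed checks at the given interval.
    detection_s = check_interval_s * failure_threshold
    # A client that resolved just before the record changed waits one full TTL.
    return detection_s + ttl_s


worst = dns_failover_worst_case_s(check_interval_s=30, failure_threshold=3, ttl_s=60)
faster = dns_failover_worst_case_s(check_interval_s=10, failure_threshold=3, ttl_s=30)
```

With the quoted settings the result is 150 s (2.5 min); tightening the interval to 10 s and the TTL to 30 s brings it to 60 s at the cost of more probe and DNS query traffic.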