Load Balancing & Proxies

Load Balancing Algorithms

The algorithm decides which backend receives each request. A poor choice causes hot spots, uneven CPU usage, or broken session affinity. Match the algorithm to request homogeneity, backend capacity, and statefulness.

L4 All algorithms assume a pool of healthy backends. Health checks (covered below) remove failed nodes first; the algorithm only distributes among survivors.

Round robin

Requests cycle sequentially: A → B → C → A → B → C. Simple, stateless, works when all backends are identical and requests have similar cost (~50 ms each).

Pros — even distribution over long windows; zero configuration
Cons — ignores current load; long requests create hot spots; no affinity
Variants — weighted round robin (see below); DNS round robin (coarse, TTL issues)

Weighted round robin

Assign each server a weight proportional to capacity. Server A (weight 3) gets 3× the traffic of Server B (weight 1). Use when instances differ in CPU/RAM (mixed instance types) or during a gradual rollout (canary at weight 1, stable at weight 9).

# NGINX upstream example
upstream api_backend {
  server 10.0.1.10:8080 weight=3;
  server 10.0.1.11:8080 weight=3;
  server 10.0.1.12:8080 weight=1;  # canary / smaller instance
}
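
To see how weights turn into a request sequence, here is a minimal sketch of smooth weighted round robin in Python. It is an illustrative model rather than NGINX's actual source; the `Backend` class and `pick` function are hypothetical names, and the weights mirror the upstream block above.

```python
from dataclasses import dataclass


@dataclass
class Backend:
    address: str
    weight: int
    current: int = 0  # running score used by the smooth algorithm


def pick(backends: list[Backend]) -> Backend:
    """Raise every score by its weight, choose the highest score,
    then lower the winner by the total weight."""
    total = sum(b.weight for b in backends)
    for b in backends:
        b.current += b.weight
    chosen = max(backends, key=lambda b: b.current)
    chosen.current -= total
    return chosen


pool = [
    Backend("10.0.1.10:8080", 3),
    Backend("10.0.1.11:8080", 3),
    Backend("10.0.1.12:8080", 1),  # canary / smaller instance
]
sequence = [pick(pool).address for _ in range(7)]
```

Over any seven consecutive picks, each weight-3 server receives three requests and the weight-1 server receives one. The heavier servers are interleaved rather than hit in bursts.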

Least connections

Route to the backend with the fewest active connections. Adapts to long-lived requests (WebSockets, gRPC streams, slow DB queries) where round robin would pile onto busy servers.

Best for — heterogeneous request duration, persistent connections, streaming
Caveat — connection count ≠ CPU load; a server with few heavy connections may still be saturated
Weighted least connections — divide active connections by weight before comparing
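
The weighted variant in the last bullet can be sketched in a few lines of Python. The `Server` class and `least_connections` function are hypothetical names for illustration; a real balancer would update `active` as connections open and close.

```python
from dataclasses import dataclass


@dataclass
class Server:
    name: str
    weight: int
    active: int = 0  # currently open connections


def least_connections(servers: list[Server]) -> Server:
    """Pick the server with the lowest active/weight ratio.
    With all weights equal to 1 this is plain least connections."""
    return min(servers, key=lambda s: s.active / s.weight)


servers = [Server("big", weight=3, active=6), Server("small", weight=1, active=3)]
choice = least_connections(servers)
```

Here `big` wins even though it holds more connections in absolute terms, because 6 / 3 is lower than 3 / 1.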

Least response time

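Route to the backend with the lowest average response time (or RTT for L4). This requires continuous measurement: NGINX Plus exposes it as the `least_time` method, and HAProxy records per-server response times in its statistics. AWS ALB offers the related least outstanding requests algorithm rather than a purely latency-based one.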

Pros — adapts to degraded nodes before health checks fail
Cons — noisy with small sample sizes; oscillation if metrics fluctuate
Hybrid — least connections with a response-time tie-breaker; Envoy's least-request policy is a related load-aware approach based on active request counts
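
One common way to reduce the noise mentioned above is to smooth samples with an exponentially weighted moving average (EWMA). The sketch below assumes that approach; `LatencyTracker` and the smoothing factor `alpha=0.2` are hypothetical choices, not the behavior of any particular load balancer.

```python
class LatencyTracker:
    def __init__(self, alpha: float = 0.2):
        self.alpha = alpha  # weight given to the newest sample
        self.ewma: dict[str, float] = {}

    def observe(self, backend: str, seconds: float) -> None:
        """Fold one measured response time into the backend's average."""
        prev = self.ewma.get(backend)
        if prev is None:
            self.ewma[backend] = seconds
        else:
            self.ewma[backend] = self.alpha * seconds + (1 - self.alpha) * prev

    def pick(self, backends: list[str]) -> str:
        # Unmeasured backends count as 0.0 so they receive a first sample.
        return min(backends, key=lambda b: self.ewma.get(b, 0.0))


tracker = LatencyTracker()
tracker.observe("a", 0.05)
tracker.observe("b", 0.20)
first = tracker.pick(["a", "b"])
tracker.observe("a", 1.0)  # one slow response pushes a's average above b's
second = tracker.pick(["a", "b"])
```

A smaller `alpha` reacts more slowly but oscillates less; a larger one tracks degradation faster at the cost of more flapping.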

IP hash (source IP affinity)

Hash client IP → map to backend index. Same client always hits same server (until pool size changes). Enables sticky sessions without cookies—but breaks when clients share NAT (corporate egress, mobile carriers).

⚠️ Pitfall

IP hash behind carrier-grade NAT sends thousands of users to one backend. Prefer cookie-based stickiness at L7 or consistent hashing on a session ID header when affinity is required.

Consistent hashing

Map both backends and request keys (user ID, cache key, partition ID) onto a hash ring (0 to 2³²−1). A request goes to the first server clockwise from its hash. When a server is added or removed, only about K/N keys remap on average (K = keys, N = servers), unlike modulo hashing, where most keys move.

flowchart LR
  subgraph ring["Hash ring"]
    S1[Server A]
    S2[Server B]
    S3[Server C]
    K1[key: user_42]
    K2[key: user_99]
  end
  K1 --> S2
  K2 --> S3

Virtual nodes — each physical server gets 100–200 virtual points on ring for even distribution
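Use cases — distributed caches (Memcached clients with ketama hashing), CDNs, and sharded data stores (Cassandra, Dynamo-style rings). Redis Cluster and Kafka are related but different: Redis Cluster maps keys to 16,384 fixed hash slots, and Kafka's default partitioner hashes the key modulo the partition count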
Hot spots — popular keys still overload one node; add replication or sub-partition hot keys
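
The ring and virtual nodes described above can be sketched as follows. This is a minimal illustration, assuming MD5 truncated to 32 bits as the hash and 150 virtual nodes per server; `HashRing` and `ring_hash` are hypothetical names. The last lines compare how many keys move when a fourth server joins, against naive modulo placement.

```python
import bisect
import hashlib


def ring_hash(value: str) -> int:
    """Map a string to a point on a 32-bit ring (first 4 bytes of MD5)."""
    return int.from_bytes(hashlib.md5(value.encode()).digest()[:4], "big")


class HashRing:
    def __init__(self, servers, vnodes=150):
        self.vnodes = vnodes
        self._points = []  # sorted (ring position, server) pairs
        for server in servers:
            self.add(server)

    def add(self, server):
        for i in range(self.vnodes):
            bisect.insort(self._points, (ring_hash(f"{server}#{i}"), server))

    def remove(self, server):
        self._points = [p for p in self._points if p[1] != server]

    def lookup(self, key):
        # First virtual node at or after the key's position, wrapping around.
        idx = bisect.bisect_left(self._points, (ring_hash(key), ""))
        return self._points[idx % len(self._points)][1]


keys = [f"user_{i}" for i in range(10_000)]
ring = HashRing(["A", "B", "C"])
before = {k: ring.lookup(k) for k in keys}
ring.add("D")
after = {k: ring.lookup(k) for k in keys}

moved = [k for k in keys if before[k] != after[k]]
ring_moved = len(moved) / len(keys)

# Same experiment with naive modulo placement, 3 servers -> 4 servers.
modulo_moved = sum(ring_hash(k) % 3 != ring_hash(k) % 4 for k in keys) / len(keys)
```

Every key that moves on the ring lands on the new server `D`, and the moved fraction should be close to 1/4, while modulo placement reshuffles most keys.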

Algorithm	State needed	Session affinity	Handles variable load
Round robin	Counter	No	No
Weighted RR	Counter + weights	No	Partial (capacity)
Least connections	Per-server counts	No	Yes
Least response time	Latency metrics	No	Yes
IP hash	Client IP	Yes (fragile)	No
Consistent hash	Request key	Yes (by key)	Partial

🎯 Interview Tip

"Design a distributed cache." Lead with consistent hashing on the cache key, mention virtual nodes, and explain what happens when a node fails: only about 1/N of the keys remap, but their misses can cause a thundering herd on the replacement, so use singleflight or a mutex on miss.

L4 vs L7 Load Balancing

The OSI layer determines how much of the packet or stream the load balancer inspects. L4 is fast and protocol-agnostic; L7 is intelligent but CPU-heavy. Production stacks often use both in tandem.

Layer 4 (Transport)

Operates on TCP/UDP headers: source/dest IP, port, protocol. Forwards packets without reading HTTP body or TLS plaintext. Can do SSL passthrough (encrypted stream forwarded intact to backend).

Speed — millions of packets/sec; minimal per-connection state
Examples — AWS NLB, HAProxy TCP mode, Linux IPVS, Google Cloud TCP/SSL proxy
Routing — IP + port only; cannot route /api vs /static differently
NAT — often SNATs client IP; backend sees LB IP unless PROXY protocol enabled
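
To make the PROXY protocol point concrete, here is a sketch of the human-readable version 1 header, which the load balancer prepends to the TCP stream so the backend can recover the original client address. The helper names are hypothetical, and the sketch deliberately skips the `UNKNOWN` family and the binary version 2 format.

```python
def build_proxy_v1(src_ip: str, dst_ip: str, src_port: int, dst_port: int) -> bytes:
    """Build the single text line sent before any application bytes."""
    family = "TCP6" if ":" in src_ip else "TCP4"
    line = f"PROXY {family} {src_ip} {dst_ip} {src_port} {dst_port}\r\n"
    return line.encode("ascii")


def parse_proxy_v1(data: bytes):
    """Return ((client_ip, client_port), (dest_ip, dest_port), remaining_bytes)."""
    line, sep, rest = data.partition(b"\r\n")
    if not sep:
        raise ValueError("incomplete PROXY header")
    parts = line.decode("ascii").split(" ")
    if len(parts) != 6 or parts[0] != "PROXY" or parts[1] not in ("TCP4", "TCP6"):
        raise ValueError("unsupported PROXY header")
    _, _, src_ip, dst_ip, src_port, dst_port = parts
    return (src_ip, int(src_port)), (dst_ip, int(dst_port)), rest


header = build_proxy_v1("203.0.113.7", "10.0.1.10", 51234, 8080)
client, dest, payload = parse_proxy_v1(header + b"GET / HTTP/1.1\r\n")
```

Both sides must agree to use the header: a backend that does not expect it will treat the line as malformed application data.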

Layer 7 (Application)

Terminates HTTP and inspects headers, path, cookies, and JWT claims. Routes based on content. Terminates TLS at the load balancer and can optionally re-encrypt traffic to the backend. Can modify requests (add headers, strip paths).

Intelligence — path routing, header-based auth, A/B testing, canary by percentage
Examples — AWS ALB, NGINX, Envoy, Traefik, HAProxy HTTP mode
Cost — TLS decrypt + HTTP parse per request; lower max RPS than L4
HTTP/2 & gRPC — must understand multiplexed streams; L4 treats as opaque TCP
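
Path routing, the first capability in the list above, can be sketched as a longest-prefix match. The route table and pool names below are hypothetical; real proxies add host, header, and method matching on top of this.

```python
ROUTES = [
    ("/api/", "api_pool"),
    ("/api/v2/", "api_v2_canary_pool"),
    ("/assets/", "static_pool"),
]


def route(path: str, default: str = "default_pool") -> str:
    """Return the pool for the longest prefix that matches the path."""
    matches = [(prefix, pool) for prefix, pool in ROUTES if path.startswith(prefix)]
    if not matches:
        return default
    return max(matches, key=lambda m: len(m[0]))[1]


pool_for_v2 = route("/api/v2/users")
pool_for_v1 = route("/api/v1/users")
```

An L4 balancer cannot make this decision because the path only becomes visible after TLS termination and HTTP parsing.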

Dimension	L4 (Transport)	L7 (Application)
Inspected data	IP, port, TCP flags	HTTP headers, URL, cookies, body (optional)
TLS handling	Passthrough or TCP SSL offload	Terminate at LB; optional re-encrypt to backend
Routing criteria	IP:port	Host, path, headers, method, query params
Throughput	Very high (millions of connections)	High (100K–500K RPS typical per node)
Latency overhead	< 1 ms	1–5 ms (TLS + parse)
WebSocket	TCP pass-through works	Requires upgrade-aware LB (connection stickiness)
Client IP visibility	Needs PROXY protocol / X-Forwarded-For injection at L7	X-Forwarded-For, X-Real-IP standard
Health checks	TCP connect or TCP+TLS handshake	HTTP GET /health with expected status/body
Typical placement	First hop, DB/cache TCP pools, gaming UDP	Public API edge, microservice ingress

flowchart TB
  Client[Internet clients]
  L4[L4 LB — NLB\nTCP distribution]
  L7[L7 LB — ALB/Envoy\nTLS + path routing]
  API[API servers]
  Static[Static CDN origin]
  Client --> L4
  L4 --> L7
  L7 -->|/api/*| API
  L7 -->|/assets/*| Static

⚖️ Trade-off

Two-tier (NLB → ALB) is common on AWS: NLB handles DDoS scale and static IPs; ALB does path routing. Single L7 is fine for moderate traffic. L4-only works for non-HTTP (PostgreSQL read replicas, Redis, custom TCP).

🏆 Senior Signal

"We'll put a load balancer in front" is an L4-level answer (engineering level, not OSI layer). "NLB for static IP + DDoS absorption, ALB terminates TLS with ACM certs, routes /v1 to ECS blue pool and /v2 canary at 5% weight, PROXY protocol preserves client IP to Envoy sidecar" is an L6-level answer.

Reverse Proxy vs Forward Proxy vs API Gateway vs Service Mesh

"Proxy" is overloaded. These four patterns sit at different network positions and solve different problems. Conflating them leads to putting business logic in the wrong layer.

Forward proxy

Sits in front of clients. Client configures proxy address; proxy fetches on client's behalf. Used for: corporate egress filtering, caching, anonymization, geo-unblocking.

Client knows about the proxy
Server sees proxy IP, not client IP
Examples: Squid, corporate HTTP proxies, VPN exit nodes

Reverse proxy

Sits in front of servers. Client thinks it talks directly to the server; reverse proxy forwards to backend pool. Combines load balancing, TLS termination, caching, compression, and security filtering.

Client is unaware of backend topology
Examples: NGINX, HAProxy, Caddy, Envoy as ingress
Single entry point hides internal network structure

🔬 Under the Hood

NGINX reverse proxy uses event-driven async I/O—one worker handles thousands of idle keep-alive connections. Backend connection pooling reuses TCP connections to upstreams, avoiding per-request TCP+TLS handshakes.

API Gateway

Specialized reverse proxy for API traffic (north-south: external clients → services). Adds API-centric features beyond generic reverse proxy:

Authentication/authorization — JWT validation, API keys, OAuth token introspection
Rate limiting & quotas — per-client, per-API, burst + sustained limits
Request/response transformation — REST ↔ gRPC transcoding, header injection
Analytics & billing — per-consumer usage metering
Developer portal — API keys, documentation, versioning lifecycle
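
The "burst + sustained" limits above are often implemented with a token bucket. The sketch below assumes that design: `rate` is the sustained requests per second and `burst` is the bucket size. `TokenBucket` is a hypothetical class, and the clock is passed in explicitly so the behavior is easy to reason about.

```python
class TokenBucket:
    def __init__(self, rate: float, burst: float, now: float = 0.0):
        self.rate = rate            # tokens added per second (sustained limit)
        self.burst = burst          # bucket capacity (burst limit)
        self.tokens = float(burst)  # start full
        self.updated = now

    def allow(self, now: float, cost: float = 1.0) -> bool:
        """Refill for the elapsed time, then spend `cost` tokens if available."""
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.updated = max(self.updated, now)
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


bucket = TokenBucket(rate=2.0, burst=3.0)
burst_results = [bucket.allow(now=0.0) for _ in range(4)]
later_result = bucket.allow(now=0.5)  # half a second refills one token
```

A gateway would typically keep one bucket per client or API key, often in a shared store so that all gateway nodes enforce the same quota.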

Examples: Kong, AWS API Gateway, Apigee, Azure API Management, Envoy Gateway with Gateway API.

Service mesh

Infrastructure layer for east-west traffic (service-to-service inside the cluster). Sidecar proxy (Envoy, Linkerd-proxy) injected alongside each pod; control plane (Istio, Linkerd, Consul) pushes config.

mTLS everywhere — automatic mutual TLS between services; no app code changes
Traffic management — retries, timeouts, circuit breaking, fault injection, canary splits
Observability — distributed tracing headers, per-route metrics without SDK
Cost — sidecar CPU/memory (~50–100 MB per pod); latency +1–3 ms per hop

Pattern	Traffic direction	Primary purpose	Examples
Forward proxy	Client → internet	Egress control, caching	Squid, Zscaler
Reverse proxy	Internet → server	LB, TLS, caching, hide backends	NGINX, HAProxy
API Gateway	External → services	Auth, rate limits, API lifecycle	Kong, AWS API GW
Service mesh	Service → service	mTLS, retries, observability	Istio, Linkerd

flowchart TB
  Ext[External client]
  GW[API Gateway\nnorth-south]
  S1[Service A + sidecar]
  S2[Service B + sidecar]
  S3[Service C + sidecar]
  Ext --> GW
  GW --> S1
  S1 <-->|mTLS east-west| S2
  S2 <-->|mTLS east-west| S3

⚖️ Trade-off

Service mesh adds operational complexity—start without it for < 20 services. API gateway is almost always warranted for public APIs. Don't duplicate: gateway handles external auth; mesh handles internal mTLS—avoid double JWT validation.

📦 Real World

Lyft pioneered Envoy (now CNCF) as both ingress gateway and sidecar. Netflix Zuul (gateway) + internal Eureka discovery predates mesh; newer stacks adopt mesh selectively. Stripe uses edge proxies for rate limiting without a full mesh on every internal call.

Health Checks

Load balancers and orchestrators must know when a backend is failing, starting up, or shutting down. Wrong health check semantics cause traffic to dead nodes—or cut traffic to nodes still serving in-flight requests.

Active vs passive health checks

Type	Mechanism	Detection speed	Overhead
Active	LB periodically probes (`GET /health`, TCP connect)	Interval + threshold (e.g. 3 fails in 30 s)	Probe traffic on every backend
Passive	LB observes real request outcomes (5xx, timeout, connection refused)	Immediate on real traffic; blind if no traffic	Zero extra probes

Production uses both: active checks catch silent failures (process hung but port open); passive checks catch degradation faster when traffic is flowing. AWS NLB combines active and passive checks; HAProxy can mark a server down after N failed real requests.
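
Both kinds of check usually feed the same threshold logic: a backend changes state only after several consecutive contradicting results. The sketch below assumes that model; `HealthTracker` and the `fall` and `rise` thresholds are hypothetical names.

```python
class HealthTracker:
    def __init__(self, fall: int = 3, rise: int = 2):
        self.fall = fall      # consecutive failures before marking down
        self.rise = rise      # consecutive successes before marking up
        self.healthy = True
        self._streak = 0      # consecutive results contradicting current state

    def record(self, ok: bool) -> bool:
        """Feed one probe result or real-request outcome; return current health."""
        if ok == self.healthy:
            self._streak = 0
        else:
            self._streak += 1
            threshold = self.rise if ok else self.fall
            if self._streak >= threshold:
                self.healthy = ok
                self._streak = 0
        return self.healthy


tracker = HealthTracker()
states = [tracker.record(ok) for ok in (False, False, False, True, True)]
```

Requiring a streak in both directions is what prevents a single slow probe or a single lucky response from flapping the backend in and out of rotation.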

Liveness vs readiness

Kubernetes popularized the distinction; it applies to any orchestrated system:

Liveness — "Is the process alive?" Failure → restart the container. Should be cheap (return 200 if event loop not deadlocked). Don't check dependencies here—DB blip shouldn't kill the pod.
Readiness — "Can this instance accept traffic?" Failure → remove from LB pool, don't restart. Check dependencies: DB connection pool warmed, cache connected, migration complete.
Startup — for slow-starting apps (e.g. a JVM with a 60 s boot). Liveness and readiness probes are disabled until the startup probe first succeeds, which prevents restart loops.

# Kubernetes probes
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 10
  failureThreshold: 3

readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  periodSeconds: 5
  failureThreshold: 2

startupProbe:
  httpGet:
    path: /healthz
    port: 8080
  failureThreshold: 30
  periodSeconds: 10
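
On the application side, the two endpoints probed above might be backed by logic like the following sketch. `AppState`, `healthz`, and `ready` are hypothetical names, and wiring them into an HTTP server is left out; the functions only return the status code each endpoint would send.

```python
from dataclasses import dataclass


@dataclass
class AppState:
    started: bool = False        # set once boot work (migrations, warm-up) finishes
    db_pool_ready: bool = False  # this instance's own connection pool is warmed
    draining: bool = False       # set when shutdown begins


def healthz(state: AppState) -> int:
    """Liveness: answering at all shows the process is responsive.
    Dependencies are ignored so a DB blip cannot trigger a restart."""
    return 200


def ready(state: AppState) -> int:
    """Readiness: 200 only when this instance can take new traffic."""
    ok = state.started and state.db_pool_ready and not state.draining
    return 200 if ok else 503


state = AppState()
booting = (healthz(state), ready(state))
state.started = True
state.db_pool_ready = True
serving = (healthz(state), ready(state))
```

Setting `draining` during shutdown makes readiness fail while liveness keeps passing, so the instance leaves the pool without being restarted.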

Graceful drain (connection draining)

On deploy or scale-down, yanking a node immediately drops in-flight requests. Graceful drain:

Mark unhealthy — readiness fails; LB stops sending new connections
Wait drain period — existing connections complete (30–300 s depending on longest request)
Force close — terminate stragglers; shutdown process

AWS ALB — deregistration delay (default 300 s, configurable)
Kubernetes — preStop hook + terminationGracePeriodSeconds
WebSockets — send close frame, wait for client ack, or migrate via pub/sub before kill
HTTP keep-alive — client may reuse connection to dying node until idle timeout
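
The three drain steps can be sketched inside a single process as an in-flight counter plus a deadline. This is an illustrative model with hypothetical names (`Drainer`, `try_begin`, `end`, `drain`); it does not cover load balancer deregistration, which must happen alongside it.

```python
import threading
import time


class Drainer:
    def __init__(self):
        self._lock = threading.Lock()
        self._in_flight = 0
        self.draining = False

    def try_begin(self) -> bool:
        """Call at request start; returns False once draining has begun."""
        with self._lock:
            if self.draining:
                return False
            self._in_flight += 1
            return True

    def end(self) -> None:
        """Call when a request finishes."""
        with self._lock:
            self._in_flight -= 1

    def drain(self, timeout: float, poll: float = 0.01) -> bool:
        """Stop accepting work and wait for in-flight requests.
        Returns True if everything finished before the deadline."""
        with self._lock:
            self.draining = True
        deadline = time.monotonic() + timeout
        while time.monotonic() < deadline:
            with self._lock:
                if self._in_flight == 0:
                    return True
            time.sleep(poll)
        with self._lock:
            return self._in_flight == 0
```

If `drain` returns False, the caller has reached the force-close step: the remaining requests are stragglers that will be cut off when the process exits.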

⚠️ Pitfall

Readiness probe that checks downstream DB causes cascade removal: DB slow → all pods not ready → total outage. Use circuit breakers; readiness should reflect this instance's ability to serve, not global dependency health.

📐 Estimation

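Deploy 50 pods with a 30 s drain and rolling maxUnavailable of 25%: Kubernetes rounds 12.5 down to 12 pods per wave, so the rollout needs at least 5 waves × 30 s = 2.5 min, before counting startup time. Long-polling requests with a 60 s hold need a 60 s+ drain or forced disconnects.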
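
A rough lower bound of this kind can be computed with a small helper. The sketch rests on simplifying assumptions that are also noted in the comments: a percentage maxUnavailable is rounded down to whole pods, every wave takes exactly one drain period, and pod startup time and maxSurge are ignored. `rollout_estimate` is a hypothetical name.

```python
import math


def rollout_estimate(pods: int, max_unavailable_pct: float, drain_seconds: float):
    """Return (waves, minimum_seconds) for a rolling update.

    Assumes the percentage is rounded down to whole pods (at least 1)
    and that each wave costs one full drain period and nothing else.
    """
    per_wave = max(1, math.floor(pods * max_unavailable_pct / 100))
    waves = math.ceil(pods / per_wave)
    return waves, waves * drain_seconds


waves, seconds = rollout_estimate(pods=50, max_unavailable_pct=25, drain_seconds=30)
```

Real rollouts take longer because new pods must also start and pass readiness before the next wave begins.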

💡 Pro Tip

Health endpoints should be unauthenticated but rate-limited internally. Returning dependency status in JSON helps debugging. Returning 503 from readiness when a hard dependency of this instance fails lets the LB route around it, but tune thresholds to avoid flapping.

Global Load Balancing

Single-region LB isn't enough for global users or disaster recovery. Global LB routes clients to the nearest or healthiest region using DNS, anycast IP, or GeoDNS—with failover when an entire datacenter goes dark.

L6 Global routing adds minutes of DNS TTL delay to failover, cross-region data consistency challenges, and cost of multi-region active-active infrastructure.

DNS-based load balancing

DNS returns multiple A/AAAA records for one hostname; clients pick one (often first, behavior varies). Low TTL (30–60 s) enables faster failover at cost of DNS query load and resolver caching unpredictability.

Simple round-robin DNS — no health awareness; dead IP stays in rotation until manual removal
Managed DNS — Route 53, Cloudflare, NS1 integrate health checks; remove unhealthy records automatically
Limitation — client/resolver caching delays failover; not true load balancing (no connection awareness)

Anycast

Same IP address announced from multiple geographic locations via BGP. Internet routing delivers packets to the topologically nearest PoP. User in Tokyo hits Tokyo edge; user in London hits London edge—same IP.

Used by — Cloudflare, Google 8.8.8.8, AWS CloudFront edge, Global Accelerator
Failover — withdraw BGP announcement from failed PoP; routes shift in seconds to minutes
DDoS absorption — attack traffic distributed across global edge network
Stateful apps — anycast works best for stateless edge (CDN, DNS, UDP); TCP sessions break on PoP failure

flowchart TB
  U1[User — Tokyo]
  U2[User — London]
  IP["Same anycast IP\n203.0.113.1"]
  P1[PoP Tokyo]
  P2[PoP London]
  Origin[Origin region]
  U1 --> IP
  U2 --> IP
  IP --> P1
  IP --> P2
  P1 --> Origin
  P2 --> Origin

GeoDNS (Geolocation routing)

DNS server returns different answers based on resolver's geographic location (GeoIP of DNS resolver, not always end user). Policies: route Europeans to eu-west-1, Americans to us-east-1; or latency-based routing (Route 53 Latency Routing).

Latency-based routing — AWS measures latency between client networks and its regions; returns the lowest-latency healthy region
Geoproximity — routes by geographic distance, with a configurable bias that grows or shrinks the area a region serves
Geolocation — hard geographic rules (GDPR data residency: EU users → EU only)
Weighted routing — 90% primary region, 10% DR for continuous validation
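
Weighted routing amounts to a weighted random choice per DNS response. The sketch below shows that selection with the random roll passed in explicitly; `pick_weighted` and the record names are hypothetical, and a real DNS service also applies health checks before choosing.

```python
import bisect
import itertools


def pick_weighted(records, roll: float) -> str:
    """records: list of (target, weight); roll: a float in [0, 1)."""
    cumulative = list(itertools.accumulate(weight for _, weight in records))
    idx = bisect.bisect_right(cumulative, roll * cumulative[-1])
    idx = min(idx, len(records) - 1)  # guard against float rounding at the top end
    return records[idx][0]


records = [("primary-region", 90), ("dr-region", 10)]
low_roll = pick_weighted(records, 0.42)
high_roll = pick_weighted(records, 0.95)
```

In use, `roll` would come from `random.random()`, so about 10% of responses point at the DR region and keep it continuously exercised.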

Failover strategies

Strategy	RTO	Cost	Data consistency
Active-passive	Minutes (DNS TTL + warm-up)	Low (DR idle)	Async replication lag accepted
Active-active	Seconds (remove bad region)	High (all regions live)	Conflict resolution needed (CRDT, last-write-wins)
Pilot light	10–30 min (scale DR)	Medium	Replica lag
Warm standby	Minutes	Medium	Near-sync replication

Multi-region architecture pattern

Global anycast or GeoDNS → regional L4/L7 LB
Regional stack — compute, cache, read replicas local to region
Write path — single primary region, or multi-primary with conflict resolution
Health checks — Route 53 health checker on regional endpoint; auto-failover DNS
Chaos testing — regularly evacuate or kill a region (Netflix Chaos Kong style) to validate failover

🎯 Interview Tip

"Design a globally available service." Clarify RPO/RTO, read vs write ratio, consistency requirements. Active-active reads + single write primary is the safe default. Mention DNS TTL trade-off (60 s TTL = up to 60 s stale routing after failover).

📦 Real World

Cloudflare anycast network (~300 cities) terminates TLS at edge; origin connection reused. AWS Route 53 health-checked failover shifted traffic during us-east-1 outages for prepared customers. Google Spanner enables true multi-region strong consistency—at latency and cost premium.

🏆 Senior Signal

Quantify failover: "Route 53 health check every 30 s, 3 failures to mark unhealthy, TTL 60 s → worst case ~2.5 min before all clients reroute. For faster RTO, use anycast edge (sub-minute) with stateless workers and regional read replicas."
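
The arithmetic in that quote can be written down as a small helper, which makes it easy to try other intervals and TTLs. `worst_case_failover_seconds` is a hypothetical name, and the model assumes detection time and DNS cache expiry simply add up, with resolvers honoring the TTL.

```python
def worst_case_failover_seconds(check_interval: float,
                                failure_threshold: int,
                                dns_ttl: float) -> float:
    """Detection time (interval x threshold) plus the longest a cached
    DNS answer can keep pointing at the failed region."""
    detection = check_interval * failure_threshold
    return detection + dns_ttl


default_case = worst_case_failover_seconds(30, 3, 60)   # values from the quote
faster_case = worst_case_failover_seconds(10, 3, 30)    # tighter checks and TTL
```

The first call reproduces the ~2.5 min figure (150 s); shortening both the check interval and the TTL is what brings the bound down, at the cost of more probe and DNS query traffic.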