System Design Fundamentals

Latency Numbers Every Engineer Must Know

Jeff Dean's latency numbers (updated for modern hardware) are the single most valuable cheat sheet in system design. They explain why caching works, why you can't hit the database on every request, and why cross-region replication is never "free."

Computers are fast at some things and catastrophically slow at others—and the gap spans six orders of magnitude from L1 cache to cross-region network round-trips. When you design a system, every component choice is really a bet on where data lives relative to the CPU that needs it.

Operation	Latency	Human-scale analogy	Design implication
L1 cache reference	~1 ns	1 step	In-process data structures; avoid cache misses in hot loops
L2 cache reference	~4 ns	4 steps	CPU-bound work stays fast if working set fits L2 (~256 KB per core)
L3 cache reference	~40 ns	40 steps (~1 minute if 1 step = 1 second)	Shared L3 (~32 MB) is the last level before main memory
Main memory (RAM) reference	~100 ns	100 steps	In-memory caches (Redis, Memcached) are ~100× slower than L3 but ~1000× faster than SSD
SSD random read	~100 μs (100,000 ns)	1.5 days	NVMe SSD: ~100K IOPS; still 1000× slower than RAM
HDD seek + read	~5 ms (5,000,000 ns)	~2 months	Sequential reads only; random I/O is death for OLTP
Same-datacenter RTT	~500 μs (0.5 ms)	~6 months	Every RPC adds ~0.5 ms; 10 sequential calls = 5 ms minimum
Cross-region RTT (US ↔ EU)	~50 ms	~1.5 years	Multi-region strong consistency requires accepting 50+ ms writes
Read 1 MB from RAM	~250 μs	—	Memory bandwidth ~40 GB/s on modern server
Read 1 MB from SSD	~1 ms	—	Sequential SSD ~1 GB/s; random small reads much worse
Read 1 MB from HDD	~20 ms	—	Sequential ~100 MB/s; seek dominates random access
Send 1 MB over 1 Gbps network	~10 ms	—	Bandwidth-bound; compression and CDN edge caching reduce payload

The hierarchy in one sentence

If your hot path touches disk or network more than once per user request, you have a latency problem that no amount of "faster CPUs" will fix. The fix is always move data closer (cache), touch fewer things (denormalize, batch), or do less work (async, precompute).

🔬 Under the Hood

A single PostgreSQL query with an index lookup on SSD might take 1–5 ms for the disk I/O alone, plus 0.1–0.5 ms for query planning and execution. Add a same-DC network hop (0.5 ms) and JSON serialization (0.1 ms), and a "simple" read is already 2–6 ms—before your application logic runs. At 200 ms p99 budget, you have room for ~30–100 such hops if perfectly parallel; in practice, serial chains dominate.

🎯 Interview Tip

When asked "why cache?", cite numbers: "RAM is ~100 ns; SSD is ~100 μs—that's 1000×. A Redis hit at 1 ms still beats a DB round-trip at 5–10 ms." Interviewers want quantitative reasoning, not "caching makes things faster."

Practical rules derived from the numbers

Keep hot data in RAM. Redis/Memcached for session, feed fragments, rate-limit counters.
Minimize serial RPC chains. Each same-DC hop costs ~0.5 ms; fan-out in parallel, not sequence.
Batch disk writes. One 10 KB write is nearly as expensive as one 1 MB write on SSD (amortize seek/flush).
Never assume cross-region is fast. 50 ms RTT means synchronous replication caps write throughput and inflates p99.
Profile before optimizing CPU. If you're I/O-bound, faster algorithms won't help—change the data path.

📦 Real World

Google published that a 500 ms delay in search results reduces engagement by 20%. Amazon found every 100 ms of latency cost 1% in sales. Netflix precomputes recommendation rows in batch jobs so the API path stays under 50 ms p99— they never compute collaborative filtering on the request hot path.

💡 Pro Tip

Memorize the ratios, not exact nanoseconds: L1 → RAM is ~100×, RAM → SSD is ~1000×, SSD → cross-region is ~500×. These three ratios cover 90% of interview latency discussions.

Throughput vs Latency

Throughput is how many requests you handle per second. Latency is how long one request takes. They are related but not interchangeable—and optimizing one often degrades the other.

Throughput (requests/sec, QPS, TPS) measures system capacity. Latency measures user-perceived speed for a single operation. A system can have high throughput with terrible latency (batch processing) or low latency with low throughput (single-threaded optimized path).

Little's Law: the bridge between them

L = λ × W — average concurrent requests (L) equals arrival rate (λ) times average time in system (W). If your API handles 10,000 RPS with 100 ms average latency, you have ~1,000 in-flight requests at any moment. Double traffic without scaling → latency doubles (queue buildup).

Example: checkout service
  Arrival rate (λ)  = 2,000 RPS
  Avg latency (W)   = 50 ms = 0.05 s
  Concurrent (L)    = 2,000 × 0.05 = 100 in-flight requests

  If λ doubles to 4,000 RPS without adding capacity:
  W increases → queue depth grows → p99 explodes

Percentiles: why averages lie

Average latency hides tail behavior. A service with 10 ms average but 2 s p99 feels broken to 1% of users—that's millions of users at scale. Always design and commit to percentile SLOs, not averages.

Percentile	Meaning	Typical target	What causes tail latency
p50 (median)	Half of requests faster, half slower	20–50 ms for read APIs	Usually healthy; not sufficient alone
p90	90% of requests faster	50–100 ms	Occasional cache miss, GC pause
p99	99% of requests faster	100–300 ms (product-dependent)	DB slow query, retry storm, lock contention
p999	99.9% of requests faster	500 ms – 2 s	Cold start, cross-region failover, full GC
max	Worst observed request	Not used for SLOs (outliers unbounded)	Timeouts, deadlocks, cascading failure

⚖️ Trade-off

Batching increases throughput, increases latency. Kafka producers batch 16 KB–1 MB before sending; a single message may wait up to linger.ms (default 5 ms) for batch fill. High-throughput pipelines accept this; interactive APIs cannot. Choose per use case.

SLI, SLO, SLA — the observability contract

SLI (Indicator): What you measure — e.g., "proportion of requests completing in < 200 ms."
SLO (Objective): Target — e.g., "99.9% of requests < 200 ms over 30 days."
SLA (Agreement): Business contract with consequences — e.g., "99.9% uptime or credits."

Error budget = 1 − SLO. At 99.9% availability, you have 43.8 minutes/month of allowed downtime. Spend error budget on risky launches; conserve it during incidents.

The four golden signals (Google SRE)

Signal	Question it answers	Example metric
Latency	How long do requests take?	p50/p99 response time by endpoint
Traffic	How much demand?	RPS, concurrent users, bytes/sec
Errors	How many requests fail?	HTTP 5xx rate, timeout rate, error budget burn
Saturation	How full is the system?	CPU %, memory %, DB connection pool usage, queue depth

flowchart LR
  Client[Client]
  LB[Load Balancer\n~0.1 ms]
  API[API Server\n~5 ms]
  Cache[Redis Cache\n~1 ms]
  DB[(PostgreSQL\n~5 ms)]
  Client --> LB --> API
  API --> Cache
  Cache -->|miss| DB
  Cache -->|hit| API
  DB --> API
  API --> LB --> Client

🏆 Senior Signal

L6 Staff candidates discuss coordinated omission: load generators that wait for responses before sending the next request hide latency spikes under overload. Use open-loop load testing (fixed RPS regardless of response time) to find real breaking points.

⚠️ Pitfall

Saying "we need low latency" without a number is a red flag. Strong answer: "Feed load p99 < 200 ms; timeline can tolerate 500 ms. We'll measure p50/p99 separately and alert on SLO burn rate."

Scalability

Scalability is the ability to handle growth—more users, more data, more requests—by adding resources without redesigning the architecture. The question is always: what kind of growth, and at what cost?

Vertical vs horizontal scaling

Dimension	Vertical (scale up)	Horizontal (scale out)
Method	Bigger machine: more CPU, RAM, faster disk	More machines: add nodes behind load balancer
Limits	Hardware ceiling (e.g., 768 GB RAM, 128 cores)	Coordination overhead, data partitioning complexity
Cost curve	Non-linear—2× RAM often costs 3× price	Near-linear with commodity hardware
Downtime	Often requires restart for resize	Rolling add—zero downtime if stateless
Best for	Databases (initially), monoliths, quick fixes	Web tiers, workers, read replicas, sharded stores
Example	db.r6g.16xlarge → db.r6g.24xlarge	10 → 100 API pods behind ALB

⚖️ Trade-off

Vertical scaling is simpler until it isn't. A PostgreSQL primary on a 96-core, 768 GB machine handles ~50K simple reads/sec—but one hot row or long transaction blocks everyone. Horizontal read replicas scale reads; sharding scales writes. Most systems vertical-scale the DB first, then shard when write QPS exceeds single-node capacity (~10K–50K writes/sec depending on row size and indexes).

The AKF Scale Cube

Martin Abbott and Michael Fisher's AKF Scale Cube defines three independent axes of scaling. Mature systems scale on all three simultaneously.

X-axis — Horizontal duplication: Clone the entire service N times behind a load balancer. Stateless web/API tier. Easiest; no code changes.
Y-axis — Functional decomposition: Split by feature/service—User Service, Order Service, Payment Service. Microservices. Reduces blast radius; adds network overhead.
Z-axis — Data partitioning: Split by shard key—user_id % N, geographic region, tenant_id. Scales data and write throughput. Hardest; requires routing and rebalancing.

🔬 Under the Hood

Twitter's early architecture scaled X-axis (more Ruby app servers) until the monolith couldn't compile fast enough. They moved Y-axis (split into services) and Z-axis (Snowflake IDs, Gizzard sharding for timelines). No single axis solved everything—the combination did.

Stateless vs stateful services

Property	Stateless	Stateful
Session data	Externalized to Redis/DB; any node serves any request	Sticky sessions or in-memory state; node affinity required
Scaling	Add/remove nodes freely; auto-scale on CPU/RPS	Rebalancing state on scale-down; complex failover
Examples	REST API, CDN edge, Lambda functions	WebSocket servers, game servers, Kafka consumers with local state
Failure handling	Retry on any node; trivial	State recovery, partition reassignment, split-brain risk

🎯 Interview Tip

L4 When drawing architecture, explicitly label stateless boxes (API, workers) vs stateful (DB, cache, queue). Say: "API tier is stateless—we scale horizontally. Session in Redis, not in-memory."

📦 Real World

Discord moved from stateful Elixir nodes (millions of WebSocket connections per machine) to a hybrid: stateless API gateway + stateful connection nodes with consistent hashing for user→node mapping. Uber scales dispatch statelessly but keeps driver location state in Redis geospatial indexes (stateful data store, stateless readers).

Reliability & Availability

Reliability is correctness: the system does the right thing. Availability is uptime: the system responds when called. You can be available but wrong (stale reads) or correct but unavailable (CP during partition).

The nines table

Availability	Downtime / year	Downtime / month	Downtime / week	Typical tier
99% (two nines)	3.65 days	7.2 hours	1.68 hours	Internal tools, dev environments
99.9% (three nines)	8.76 hours	43.8 minutes	10.1 minutes	B2B SaaS, non-critical APIs
99.99% (four nines)	52.6 minutes	4.38 minutes	1.01 minutes	Payment systems, core product APIs
99.999% (five nines)	5.26 minutes	26.3 seconds	6.05 seconds	Telco, hospital systems, trading infra
99.9999% (six nines)	31.5 seconds	2.63 seconds	0.605 seconds	Requires active-active multi-region + formal methods

Each additional nine costs roughly 10× in engineering and infrastructure. Most consumer products target 99.9%–99.99%. Don't over-engineer five nines for a blog.

MTBF and MTTR

Availability = MTBF / (MTBF + MTTR)

MTBF (Mean Time Between Failures): How long the system runs before breaking. Increase via redundancy, chaos testing, better hardware.
MTTR (Mean Time To Recovery): How long to restore service after failure. Decrease via automation, runbooks, circuit breakers, graceful degradation.

Example: API cluster
  MTBF = 720 hours (30 days between incidents)
  MTTR = 0.5 hours (30 min to recover)
  Availability = 720 / (720 + 0.5) = 99.93%

  To reach 99.99% without improving MTBF:
  MTTR must drop to ~0.05 hours (3 minutes)

Redundancy patterns

Pattern	Description	Failure behavior	Cost
Active-passive	Standby replica idle until primary fails	Failover delay (seconds–minutes); possible data loss window	Low (1× idle capacity)
Active-active	All nodes serve traffic simultaneously	Instant reroute; requires conflict resolution	High (N× full capacity)
N+1	N nodes needed + 1 spare for one failure	One failure absorbed; second failure degrades	Moderate (~33% overhead for N=3)
N+2	Two simultaneous failures tolerated	Survives double failure during maintenance	Higher (~50% overhead for N=4)
Geographic redundancy	Multi-AZ or multi-region deployment	Survives datacenter/region loss	2–3× infra + replication lag complexity

Common failure modes

Single Point of Failure (SPOF): One component whose failure takes down the system—eliminate with redundancy.
Cascading failure: Overloaded node fails → retries flood siblings → they fail too. Fix: circuit breakers, bulkheads, rate limits.
Split brain: Network partition causes two nodes to both think they're primary. Fix: quorum, fencing tokens, STONITH.
Thundering herd: Cache expires → all requests hit DB simultaneously. Fix: jittered TTL, mutex on miss, pre-warming.
Retry storm: Clients retry failed requests → amplifies load 3–10×. Fix: exponential backoff with jitter, idempotency keys.

⚠️ Pitfall

Redundancy without health checks is theater. An active-passive DB pair where the standby hasn't replicated the last 30 seconds of writes will fail over into data loss. Test failover quarterly—most teams discover broken automation during the real incident.

🏆 Senior Signal

L5 Proactively identify SPOFs in your diagram: "The load balancer is a SPOF—we use managed ALB with cross-AZ. The config service is a SPOF—we cache config locally with 5-minute TTL for graceful degradation."

CAP Theorem

Eric Brewer's CAP theorem (formalized by Gilbert and Lynch, 2002) states that a distributed system can guarantee at most two of three properties simultaneously during a network partition.

The three properties

Consistency (C): Every read receives the most recent write or an error. All nodes see the same data at the same time (linearizable).
Availability (A): Every request receives a non-error response, without guarantee that it reflects the latest write.
Partition tolerance (P): The system continues operating despite arbitrary message loss or delay between nodes (network splits).

In practice, partitions happen—switch failures, GC pauses masquerading as timeouts, cross-AZ link cuts. P is not optional in distributed systems. The real choice is CP vs AP during partition: reject requests (sacrifice availability) or serve potentially stale data (sacrifice consistency).

CP systems (HBase, ZooKeeper, etcd): sacrifice availability during partition — reject requests rather than return stale data. Choose for leader election, distributed locks, financial consistency.

Choice	During partition	Examples	Use when
CP	Minority partition unavailable; rejects writes/reads	etcd, ZooKeeper, HBase, traditional RDBMS with sync replication	Financial transactions, leader election, distributed locks, inventory counts
AP	Both partitions serve requests; may diverge temporarily	Cassandra, DynamoDB, CouchDB, DNS, Riak	Shopping carts, social likes, analytics, session data, DNS
CA	Only on single node (no partition possible)	Single-node PostgreSQL/MySQL (no replication)	Not truly distributed; breaks when you add a replica

⚖️ Trade-off

CAP is often misquoted as "pick two at design time." Partitions are rare but inevitable—the choice is what happens during the partition, not a permanent architectural label. Many systems are mostly available with tunable consistency (DynamoDB: eventual by default, strong reads optional at 2× latency cost).

🎯 Interview Tip

L5 Don't stop at "we pick AP." Say: "Shopping cart is AP—merging conflicting carts on reconnect is acceptable. Payment authorization is CP—we fail closed if we can't reach the ledger." Same system, different CAP choices per operation.

🔬 Under the Hood

ZooKeeper (CP) uses ZAB consensus: during partition, the minority side stops accepting writes. Cassandra (AP) uses tunable consistency with QUORUM reads/writes—during partition, each side continues independently; hint handoff and read repair reconcile later.

PACELC

Daniel Abadi's PACELC extension (2012) adds: Else, even when there is no partition, you still trade off Latency (L) vs Consistency (C). CAP alone misses the normal-case trade-off.

Full form: If Partition → choose A or C. Else (normal operation) → choose L or C.

System	If Partition (PA or PC)	Else (EL or EC)	Classification	Notes
MySQL (InnoDB, single primary)	PC — primary unavailable blocks writes	EC — sync replication waits for replica ack (strong) or async (faster, weaker)	PC/EC	Read replicas give EL for reads; primary writes are EC with semi-sync
Cassandra	PA — both sides accept writes	EL — `ONE` reads/writes are fast; `QUORUM` is slower but consistent	PA/EL (default)	Tunable per query: `LOCAL_QUORUM`, `ALL`
DynamoDB	PA — multi-AZ continues serving	EL — eventual consistency default (~1 ms); strong consistency option (~2 ms, 2× RCU)	PA/EL	Strongly consistent reads hit the leader replica only
MongoDB	PC — minority partition loses primary	EC — default write concern `w:1` is fast; `w:majority` waits for replication	PC/EC (configurable)	Replica set elections cause brief unavailability (CP behavior)

⚖️ Trade-off

DynamoDB strong reads cost 2× read capacity units and add ~1 ms latency vs eventual. For a product catalog where stale price for 1 second is acceptable, use eventual (EL). For wallet balance, pay the EC cost or use conditional writes with version checks.

📦 Real World

Amazon Dynamo (the paper behind DynamoDB) explicitly chose PA/EL: shopping cart writes must never fail, and 200 ms staleness on item count is fine. Google Spanner chose PC/EC with TrueTime—paying latency cost (~5–10 ms cross-region) for external consistency globally.

🏆 Senior Signal

Principal PACELC reframes database selection as a per-operation decision, not a system-wide label. A Principal engineer maps each API endpoint to its PACELC quadrant and documents the business justification.

Consistency Models

Consistency defines what guarantees readers see after writers commit. Stronger guarantees cost latency and availability; weaker guarantees enable scale and speed.

Consistency is not binary. It's a spectrum from "every read sees the latest write globally" (linearizability) to "reads may be arbitrarily stale" (eventual). Pick the weakest model that satisfies your product requirements.

Model	Guarantee	Latency cost	Example use case
Strong / Linearizable	Every read sees the most recent write; global total order	High — cross-region: 50+ ms; requires consensus or single leader	Bank balances, inventory deduction, distributed locks
Sequential	All nodes agree on operation order, but reads may lag slightly	Medium-high	Collaborative editing with ordering
Causal	Cause precedes effect; concurrent ops may be seen in different order	Medium — vector clocks or version chains	Social comments, messaging threads
Read-your-writes	User always sees their own recent writes (session consistency)	Low-medium — sticky routing or session tokens	User profile edits, form submissions, settings pages
Monotonic reads	User never sees older data on subsequent reads	Low — route to same replica or track version	Feed pagination, timeline scrolling
Eventual	Given no new writes, all replicas converge given enough time	Lowest — async replication, no coordination	DNS, CDN, view counts, recommendation scores

Read-your-writes in practice

After a user updates their avatar, they must see the new avatar immediately—even if other users see the old one for seconds. Implementation options:

Sticky sessions: Route user to the same replica that handled the write (fragile on scale-down).
Session token with version: Client sends last-write timestamp; server reads from replica at least that fresh.
Write-through cache: Update cache synchronously on write; reads hit cache first.
Primary reads for session: User's writes go to primary; their reads also hit primary (others hit replicas).

🔬 Under the Hood

Facebook's TAO cache layer provides read-your-writes by routing a user's reads to the cache partition that processed their write. Cross-user reads may be eventually consistent—exactly the product requirement for social feeds.

⚖️ Trade-off

Strong consistency across regions adds 50–100 ms to every write (consensus round-trip). For a global chat app, causal consistency per conversation is sufficient and 10× faster. For a global ledger, you pay the latency or partition by region with async settlement.

When to use each model

Scenario	Recommended model	Why
Payment / transfer	Strong (linearizable)	Double-spend is unacceptable; fail closed on uncertainty
User profile edit	Read-your-writes	User must see own changes; others can wait seconds
Twitter timeline	Eventual + monotonic reads	Stale tweet for 2 s is fine; going backward in time is not
Inventory (flash sale)	Strong or optimistic locking	Overselling 10 units costs money and trust
Page view counter	Eventual	Exact count unnecessary; approximate is 100× cheaper
Collaborative document	Causal or CRDT	Concurrent edits merge; total order unnecessary

🎯 Interview Tip

Ask the interviewer: "Does the user need to see their own write immediately, or do all users need the latest data globally?" That one question narrows strong → read-your-writes → eventual and saves 10 minutes of over-engineering.

⚠️ Pitfall

Defaulting to "strong consistency everywhere" is the distributed systems equivalent of premature optimization in reverse— you pay 10× latency and availability cost where eventual would suffice. Match the model to the operation, not the system.

Back-of-Envelope Estimation

Every system design interview includes estimation. The goal is not precision—it's demonstrating structured thinking, stated assumptions, and order-of-magnitude correctness. Off by 2× is fine; off by 100× is not.

Powers of two (memorize these)

Power	Exact value	Approximation
2^10	1,024	~1 Thousand (1 KB)
2^20	1,048,576	~1 Million (1 MB)
2^30	1,073,741,824	~1 Billion (1 GB)
2^40	1,099,511,627,776	~1 Trillion (1 TB)
2^50	1,125,899,906,842,624	~1 Quadrillion (1 PB)

Other useful numbers

1 day = 86,400 seconds ≈ 100K seconds (round for mental math)
1 month ≈ 2.5M seconds
1 year ≈ 30M seconds
Peak QPS ≈ 2–3× average (sometimes 10× for viral events)
1 server (well-tuned API) ≈ 1K–10K RPS (use 1K for conservative estimates)
1 SSD ≈ 100K IOPS; 1 HDD ≈ 100 IOPS
1 Gbps = 125 MB/s; 10 Gbps = 1.25 GB/s

Estimation framework (5 steps)

Clarify assumptions: DAU, actions per user per day, read/write ratio, payload size.
Calculate QPS: (DAU × ops/user) / 86400; multiply by peak factor.
Calculate storage: DAU × ops × bytes/record × retention days.
Calculate bandwidth: peak QPS × average response/request size.
Calculate servers: peak QPS / per-server capacity; add 30% headroom.

📊 Estimation Template

Say aloud: "I'll assume 300M DAU, 2 tweets read per user per day, peak 3× average. Read QPS = 300M × 2 / 100K = 6K average, ~18K peak. Write QPS = 300M × 0.5 tweets / 100K = 1.5K average, ~4.5K peak."

Worked example: Twitter-scale timeline reads

Assumptions:

300M DAU
Each user reads timeline 10×/day (scroll sessions)
Each user posts 0.5 tweets/day (150M tweets/day)
Peak traffic = 3× average
Tweet size ≈ 300 bytes; timeline page ≈ 20 tweets = 6 KB response
Retention: 5 years of tweets

READ QPS
  = 300M DAU × 10 reads / 86,400 sec
  = 3B / 86,400 ≈ 35,000 avg
  Peak = 35K × 3 ≈ 105,000 read QPS

WRITE QPS
  = 300M × 0.5 / 86,400 ≈ 1,750 avg
  Peak ≈ 5,250 write QPS

STORAGE (5 years)
  = 150M tweets/day × 300 bytes × 365 × 5
  = 150M × 300 × 1,825 ≈ 82 × 10^15 bytes ≈ 82 PB raw
  With indexes (3×) ≈ 250 PB

BANDWIDTH (peak reads)
  = 105K RPS × 6 KB ≈ 630 MB/s ≈ 5 Gbps

APP SERVERS (stateless API @ 1K RPS each)
  = 105K / 1K ≈ 105 servers + 30% headroom ≈ 140 servers

💡 Pro Tip

Round aggressively: 86,400 → 100K, 300M × 10 = 3B. The interviewer cares about your formula, not whether you remembered that a day has exactly 86,400 seconds.

Interactive estimation calculator

Adjust inputs to see QPS, storage, bandwidth, and server count update in real time.

Daily Active Users (DAU)

Operations per user per day

Bytes per operation

Peak multiplier (× average)

Peak QPS —

Daily storage —

Peak bandwidth —

App servers needed —

🏆 Senior Signal

L6 After computing numbers, sanity-check against known systems: "105K read QPS is roughly Twitter 2012 scale— they'd need fan-out on write or hybrid model, not naive per-request DB joins." Connect estimation to architecture decisions.

📦 Real World

Instagram engineers report that back-of-envelope estimates during design reviews catch 80% of scaling issues before code is written. A 10-minute estimation at whiteboard time saves months of re-architecture after launch.