System Design Fundamentals
Every architecture decision rests on a handful of quantitative truths: how fast memory is, how far apart datacenters are, what "99.9% available" actually costs, and whether your database picks consistency or availability when the network splits. This chapter is the foundation—memorize the numbers, internalize the trade-offs, and practice estimation until it becomes reflex.
Latency Numbers Every Engineer Must Know
Jeff Dean's latency numbers (updated for modern hardware) are the single most valuable cheat sheet in system design. They explain why caching works, why you can't hit the database on every request, and why cross-region replication is never "free."
Computers are fast at some things and catastrophically slow at others, and the gap spans more than seven orders of magnitude, from an L1 cache reference (~1 ns) to a cross-region network round-trip (~50 ms). When you design a system, every component choice is really a bet on where data lives relative to the CPU that needs it.
| Operation | Latency | Human-scale analogy | Design implication |
|---|---|---|---|
| L1 cache reference | ~1 ns | 1 step | In-process data structures; avoid cache misses in hot loops |
| L2 cache reference | ~4 ns | 4 steps | CPU-bound work stays fast if working set fits L2 (~256 KB per core) |
| L3 cache reference | ~40 ns | 40 steps (~40 seconds if 1 step = 1 second) | Shared L3 (~32 MB) is the last level before main memory |
| Main memory (RAM) reference | ~100 ns | 100 steps | In-memory caches (Redis, Memcached) are ~100× slower than L1 but ~1000× faster than SSD |
| SSD random read | ~100 μs (100,000 ns) | ~1.2 days | NVMe SSD: ~100K IOPS; still 1000× slower than RAM |
| HDD seek + read | ~5 ms (5,000,000 ns) | ~2 months | Sequential reads only; random I/O is death for OLTP |
| Same-datacenter RTT | ~500 μs (0.5 ms) | ~6 days | Every RPC adds ~0.5 ms; 10 sequential calls = 5 ms minimum |
| Cross-region RTT (US ↔ EU) | ~50 ms | ~1.5 years | Multi-region strong consistency requires accepting 50+ ms writes |
| Read 1 MB from RAM | ~25 μs | — | Memory bandwidth ~40 GB/s on modern server (the classic ~250 μs figure assumes ~4 GB/s) |
| Read 1 MB from SSD | ~1 ms | — | Sequential SSD ~1 GB/s; random small reads much worse |
| Read 1 MB from HDD | ~20 ms | — | Sequential ~100 MB/s; seek dominates random access |
| Send 1 MB over 1 Gbps network | ~10 ms | — | Bandwidth-bound; compression and CDN edge caching reduce payload |
The hierarchy in one sentence
If your hot path touches disk or network more than once per user request, you have a latency problem that no amount of "faster CPUs" will fix. The fix is always move data closer (cache), touch fewer things (denormalize, batch), or do less work (async, precompute).
A single PostgreSQL query with an index lookup on SSD might take 1–5 ms for the disk I/O alone, plus 0.1–0.5 ms for query planning and execution. Add a same-DC network hop (0.5 ms) and JSON serialization (0.1 ms), and a "simple" read is already 2–6 ms, before your application logic runs. A 200 ms p99 budget therefore fits only ~30–100 such reads if they run one after another. Parallel fan-out stretches the budget, but in practice serial chains dominate.
When asked "why cache?", cite numbers: "RAM is ~100 ns; SSD is ~100 μs—that's 1000×. A Redis hit at 1 ms still beats a DB round-trip at 5–10 ms." Interviewers want quantitative reasoning, not "caching makes things faster."
Practical rules derived from the numbers
- Keep hot data in RAM. Redis/Memcached for session, feed fragments, rate-limit counters.
- Minimize serial RPC chains. Each same-DC hop costs ~0.5 ms; fan-out in parallel, not sequence.
- Batch disk writes. Each write carries fixed overhead (system call, fsync/flush), so many small writes cost far more than one large write of the same total size; group them to amortize the flush.
- Never assume cross-region is fast. 50 ms RTT means synchronous replication caps write throughput and inflates p99.
- Profile before optimizing CPU. If you're I/O-bound, faster algorithms won't help—change the data path.
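To make the serial-chain rule concrete, here is a minimal sketch that compares a chain of calls made one after another with the same calls fanned out in parallel. The function names are hypothetical and the hop latencies are illustrative assumptions; real fan-out also pays for scheduling and for the slowest dependency's tail.

```python
def serial_latency_ms(hops_ms):
    """Calls made one after another: latencies add up."""
    return sum(hops_ms)


def parallel_latency_ms(hops_ms):
    """Independent calls issued at once: the slowest call sets the total."""
    return max(hops_ms)


# Assumed example: ten same-datacenter RPCs at ~0.5 ms each.
hops = [0.5] * 10
print(serial_latency_ms(hops))    # 5.0
print(parallel_latency_ms(hops))  # 0.5
```

The model only applies when the calls are independent; a call that needs the result of a previous one stays on the serial path.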
Google published that a 500 ms delay in search results reduces engagement by 20%. Amazon found every 100 ms of latency cost 1% in sales. Netflix precomputes recommendation rows in batch jobs so the API path stays under 50 ms p99; they never compute collaborative filtering on the request hot path.
Memorize the ratios, not exact nanoseconds: L1 → RAM is ~100×, RAM → SSD is ~1000×, SSD → cross-region is ~500×. These three ratios cover 90% of interview latency discussions.
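The three ratios can be checked directly against the table. The sketch below is only a sanity check: the dictionary keys and helper names are made up for illustration, and the latencies are the approximate values used in this chapter.

```python
# Approximate latencies in nanoseconds, taken from the table in this chapter.
LATENCY_NS = {
    "l1_cache": 1,
    "ram": 100,
    "ssd_random_read": 100_000,
    "cross_region_rtt": 50_000_000,
}


def ratio(slower, faster):
    """How many times slower one operation is than another."""
    return LATENCY_NS[slower] / LATENCY_NS[faster]


def human_scale_days(op):
    """Scale 1 ns to 1 second and express the result in days."""
    return LATENCY_NS[op] / 86_400


print(ratio("ram", "l1_cache"))                      # 100.0
print(ratio("ssd_random_read", "ram"))               # 1000.0
print(ratio("cross_region_rtt", "ssd_random_read"))  # 500.0
print(round(human_scale_days("cross_region_rtt")))   # 579
```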
Throughput vs Latency
Throughput is how many requests you handle per second. Latency is how long one request takes. They are related but not interchangeable—and optimizing one often degrades the other.
Throughput (requests/sec, QPS, TPS) measures system capacity. Latency measures user-perceived speed for a single operation. A system can have high throughput with terrible latency (batch processing) or low latency with low throughput (single-threaded optimized path).
Little's Law: the bridge between them
L = λ × W: average concurrent requests (L) equals arrival rate (λ) times average time in system (W). If your API handles 10,000 RPS with 100 ms average latency, you have ~1,000 in-flight requests at any moment. If traffic doubles, in-flight requests double too as long as the system has headroom; once it is saturated, extra requests queue and latency climbs well beyond its original value.
Example: checkout service
Arrival rate (λ) = 2,000 RPS
Avg latency (W) = 50 ms = 0.05 s
Concurrent (L) = 2,000 × 0.05 = 100 in-flight requests
If λ doubles to 4,000 RPS without adding capacity:
W increases → queue depth grows → p99 explodes
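The same arithmetic as a small helper, in case you want to plug in your own numbers. This is a sketch of Little's Law only; the function names are illustrative, and latency is taken in milliseconds to keep the arithmetic exact.

```python
def in_flight(arrival_rate_rps, avg_latency_ms):
    """Little's Law, L = lambda * W, with W given in milliseconds."""
    return arrival_rate_rps * avg_latency_ms / 1000


def max_sustainable_rps(max_concurrency, avg_latency_ms):
    """Rearranged as lambda = L / W: throughput ceiling for a fixed concurrency limit."""
    return max_concurrency * 1000 / avg_latency_ms


print(in_flight(2_000, 50))           # 100.0  (checkout example)
print(in_flight(10_000, 100))         # 1000.0
print(max_sustainable_rps(100, 50))   # 2000.0
```

The second helper shows why an unscaled service cannot simply absorb double the traffic: with 100 concurrent slots and 50 ms per request, anything above 2,000 RPS has to wait in a queue.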
Percentiles: why averages lie
Average latency hides tail behavior. A service with a 30 ms average can still have a 2 s p99, which means 1 in 100 requests feels broken; at scale that is millions of requests a day. Always design and commit to percentile SLOs, not averages.
| Percentile | Meaning | Typical target | What causes tail latency |
|---|---|---|---|
| p50 (median) | Half of requests faster, half slower | 20–50 ms for read APIs | Usually healthy; not sufficient alone |
| p90 | 90% of requests faster | 50–100 ms | Occasional cache miss, GC pause |
| p99 | 99% of requests faster | 100–300 ms (product-dependent) | DB slow query, retry storm, lock contention |
| p999 | 99.9% of requests faster | 500 ms – 2 s | Cold start, cross-region failover, full GC |
| max | Worst observed request | Not used for SLOs (outliers unbounded) | Timeouts, deadlocks, cascading failure |
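A short sketch of how an average can look acceptable while the tail is terrible. It uses the nearest-rank definition of a percentile, which is one of several common definitions, and a made-up sample of 100 request latencies.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical workload: 98 fast requests and 2 very slow ones.
latencies_ms = [10] * 98 + [2000] * 2

print(sum(latencies_ms) / len(latencies_ms))  # 49.8
print(percentile(latencies_ms, 50))           # 10
print(percentile(latencies_ms, 99))           # 2000
```

The mean of about 50 ms describes no request that actually happened: typical requests take 10 ms and the slow ones take 2 s.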
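Batching raises throughput at the cost of latency. Kafka producers accumulate records into batches (batch.size defaults to 16 KB and is often tuned up toward 1 MB) before sending, so a single message may wait up to linger.ms for the batch to fill (the default is 5 ms as of Kafka 4.0; earlier releases default to 0). High-throughput pipelines accept this; interactive APIs cannot. Choose per use case.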
SLI, SLO, SLA — the observability contract
- SLI (Indicator): What you measure — e.g., "proportion of requests completing in < 200 ms."
- SLO (Objective): Target — e.g., "99.9% of requests < 200 ms over 30 days."
- SLA (Agreement): Business contract with consequences — e.g., "99.9% uptime or credits."
Error budget = 1 − SLO. At 99.9% availability, you have 43.8 minutes/month of allowed downtime. Spend error budget on risky launches; conserve it during incidents.
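The error-budget arithmetic can be sketched as follows. The function names are hypothetical, and the 30.44-day month is an assumption (the average month length, which reproduces the 43.8-minute figure).

```python
def allowed_downtime_minutes(slo_percent, window_days=30.44):
    """Minutes of downtime permitted by an availability SLO over the window."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (100 - slo_percent) / 100


def allowed_bad_requests(slo_percent, total_requests):
    """Requests that may miss the SLO before the budget is exhausted."""
    return total_requests * (100 - slo_percent) / 100


def budget_remaining(slo_percent, downtime_so_far_minutes, window_days=30.44):
    """Fraction of the error budget still unspent (negative means overspent)."""
    budget = allowed_downtime_minutes(slo_percent, window_days)
    return 1 - downtime_so_far_minutes / budget


print(round(allowed_downtime_minutes(99.9), 1))      # 43.8
print(round(allowed_bad_requests(99.9, 1_000_000)))  # 1000
print(round(budget_remaining(99.9, 20.0), 2))        # 0.54
```

A 20-minute incident therefore consumes roughly half of a 99.9% monthly budget, which is the kind of number that informs whether a risky launch should wait.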
The four golden signals (Google SRE)
| Signal | Question it answers | Example metric |
|---|---|---|
| Latency | How long do requests take? | p50/p99 response time by endpoint |
| Traffic | How much demand? | RPS, concurrent users, bytes/sec |
| Errors | How many requests fail? | HTTP 5xx rate, timeout rate, error budget burn |
| Saturation | How full is the system? | CPU %, memory %, DB connection pool usage, queue depth |
flowchart LR
    Client[Client]
    LB[Load Balancer\n~0.1 ms]
    API[API Server\n~5 ms]
    Cache[Redis Cache\n~1 ms]
    DB[(PostgreSQL\n~5 ms)]
    Client --> LB --> API
    API --> Cache
    Cache -->|miss| DB
    Cache -->|hit| API
    DB --> API
    API --> LB --> Client
L6: Staff candidates discuss coordinated omission, where load generators that wait for responses before sending the next request hide latency spikes under overload. Use open-loop load testing (fixed RPS regardless of response time) to find real breaking points.
Saying "we need low latency" without a number is a red flag. Strong answer: "Feed load p99 < 200 ms; timeline can tolerate 500 ms. We'll measure p50/p99 separately and alert on SLO burn rate."
Scalability
Scalability is the ability to handle growth—more users, more data, more requests—by adding resources without redesigning the architecture. The question is always: what kind of growth, and at what cost?
Vertical vs horizontal scaling
| Dimension | Vertical (scale up) | Horizontal (scale out) |
|---|---|---|
| Method | Bigger machine: more CPU, RAM, faster disk | More machines: add nodes behind load balancer |
| Limits | Hardware ceiling (e.g., 768 GB RAM, 128 cores) | Coordination overhead, data partitioning complexity |
| Cost curve | Non-linear—2× RAM often costs 3× price | Near-linear with commodity hardware |
| Downtime | Often requires restart for resize | Rolling add—zero downtime if stateless |
| Best for | Databases (initially), monoliths, quick fixes | Web tiers, workers, read replicas, sharded stores |
| Example | db.r6g.8xlarge → db.r6g.16xlarge | 10 → 100 API pods behind ALB |
Vertical scaling is simpler until it isn't. A PostgreSQL primary on a 96-core, 768 GB machine handles ~50K simple reads/sec—but one hot row or long transaction blocks everyone. Horizontal read replicas scale reads; sharding scales writes. Most systems vertical-scale the DB first, then shard when write QPS exceeds single-node capacity (~10K–50K writes/sec depending on row size and indexes).
The AKF Scale Cube
Martin Abbott and Michael Fisher's AKF Scale Cube defines three independent axes of scaling. Mature systems scale on all three simultaneously.
- X-axis — Horizontal duplication: Clone the entire service N times behind a load balancer. Stateless web/API tier. Easiest; no code changes.
- Y-axis — Functional decomposition: Split by feature/service—User Service, Order Service, Payment Service. Microservices. Reduces blast radius; adds network overhead.
- Z-axis — Data partitioning: Split by shard key—user_id % N, geographic region, tenant_id. Scales data and write throughput. Hardest; requires routing and rebalancing.
Twitter's early architecture scaled on the X-axis (more Ruby app servers) until the Rails monolith became the bottleneck. They then moved along the Y-axis (splitting into services) and the Z-axis (Snowflake IDs, Gizzard sharding for timelines). No single axis solved everything; the combination did.
Stateless vs stateful services
| Property | Stateless | Stateful |
|---|---|---|
| Session data | Externalized to Redis/DB; any node serves any request | Sticky sessions or in-memory state; node affinity required |
| Scaling | Add/remove nodes freely; auto-scale on CPU/RPS | Rebalancing state on scale-down; complex failover |
| Examples | REST API, CDN edge, Lambda functions | WebSocket servers, game servers, Kafka consumers with local state |
| Failure handling | Retry on any node; trivial | State recovery, partition reassignment, split-brain risk |
L4: When drawing architecture, explicitly label stateless boxes (API, workers) vs stateful (DB, cache, queue). Say: "API tier is stateless, so we scale horizontally. Session in Redis, not in-memory."
Discord moved from stateful Elixir nodes (millions of WebSocket connections per machine) to a hybrid: stateless API gateway + stateful connection nodes with consistent hashing for user→node mapping. Uber scales dispatch statelessly but keeps driver location state in Redis geospatial indexes (stateful data store, stateless readers).
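Consistent hashing for a user→node mapping can be sketched in a few lines. This is a generic illustration with virtual nodes, not Discord's implementation; the class name, node names, and the choice of SHA-256 are assumptions.

```python
import bisect
import hashlib


def _hash(key):
    """Stable 64-bit hash (Python's built-in hash() is randomized per process)."""
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


class HashRing:
    """Consistent-hash ring with virtual nodes to even out the key distribution."""

    def __init__(self, nodes, vnodes=100):
        self._ring = sorted(
            (_hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def node_for(self, key):
        """Walk clockwise from the key's hash to the next virtual node, wrapping around."""
        idx = bisect.bisect_right(self._points, _hash(key)) % len(self._ring)
        return self._ring[idx][1]


before = HashRing(["node-a", "node-b", "node-c"])
after = HashRing(["node-a", "node-b", "node-c", "node-d"])
users = [f"user-{i}" for i in range(1000)]
moved = [u for u in users if before.node_for(u) != after.node_for(u)]
print(len(moved), "of", len(users), "users change node")
```

The point of the design is that adding a node only moves the keys the new node takes over, whereas `user_id % N` would remap most users whenever N changes.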
Reliability & Availability
Reliability is correctness: the system does the right thing. Availability is uptime: the system responds when called. You can be available but wrong (stale reads) or correct but unavailable (CP during partition).
The nines table
| Availability | Downtime / year | Downtime / month | Downtime / week | Typical tier |
|---|---|---|---|---|
| 99% (two nines) | 3.65 days | 7.3 hours | 1.68 hours | Internal tools, dev environments |
| 99.9% (three nines) | 8.76 hours | 43.8 minutes | 10.1 minutes | B2B SaaS, non-critical APIs |
| 99.99% (four nines) | 52.6 minutes | 4.38 minutes | 1.01 minutes | Payment systems, core product APIs |
| 99.999% (five nines) | 5.26 minutes | 26.3 seconds | 6.05 seconds | Telco, hospital systems, trading infra |
| 99.9999% (six nines) | 31.5 seconds | 2.63 seconds | 0.605 seconds | Requires active-active multi-region + formal methods |
Each additional nine costs roughly 10× in engineering and infrastructure. Most consumer products target 99.9%–99.99%. Don't over-engineer five nines for a blog.
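The table can be regenerated from the availability percentage alone. The sketch below assumes a 365.25-day year and an average month of one twelfth of that; the names are illustrative.

```python
SECONDS_PER_DAY = 86_400
PERIODS = {
    "year": 365.25 * SECONDS_PER_DAY,
    "month": 365.25 / 12 * SECONDS_PER_DAY,
    "week": 7 * SECONDS_PER_DAY,
}


def downtime_seconds(availability_percent, period):
    """Allowed downtime in seconds for the given availability over one period."""
    return PERIODS[period] * (100 - availability_percent) / 100


for nines in (99.0, 99.9, 99.99, 99.999):
    row = {p: round(downtime_seconds(nines, p)) for p in PERIODS}
    print(nines, row)
```

Small differences from published tables usually come from the assumed length of a year or month, not from the formula.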
MTBF and MTTR
Availability = MTBF / (MTBF + MTTR)
- MTBF (Mean Time Between Failures): How long the system runs before breaking. Increase via redundancy, chaos testing, better hardware.
- MTTR (Mean Time To Recovery): How long to restore service after failure. Decrease via automation, runbooks, circuit breakers, graceful degradation.
Example: API cluster
MTBF = 720 hours (30 days between incidents)
MTTR = 0.5 hours (30 min to recover)
Availability = 720 / (720 + 0.5) = 99.93%
To reach 99.99% without improving MTBF:
MTTR must drop to ~0.072 hours (~4.3 minutes)
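A sketch of the formula and its inverse, useful for asking how fast recovery must be to hit a target. The function names are hypothetical, and the second example (1,000-hour MTBF, 99.9% target) is an assumed scenario, not one from the text above.

```python
def availability(mtbf_hours, mttr_hours):
    """Availability = MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def max_mttr_hours(mtbf_hours, target_availability):
    """Solve the availability formula for MTTR at a given target (as a fraction)."""
    return mtbf_hours * (1 - target_availability) / target_availability


print(round(availability(720, 0.5) * 100, 2))    # 99.93
print(round(max_mttr_hours(1000, 0.999), 2))     # 1.0
```

Because MTTR appears only in the denominator, halving recovery time is often cheaper than doubling the time between failures for the same availability gain.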
Redundancy patterns
| Pattern | Description | Failure behavior | Cost |
|---|---|---|---|
| Active-passive | Standby replica idle until primary fails | Failover delay (seconds–minutes); possible data loss window | Low (1× idle capacity) |
| Active-active | All nodes serve traffic simultaneously | Instant reroute; requires conflict resolution | High (N× full capacity) |
| N+1 | N nodes needed + 1 spare for one failure | One failure absorbed; second failure degrades | Moderate (~33% overhead for N=3) |
| N+2 | Two simultaneous failures tolerated | Survives double failure during maintenance | Higher (~50% overhead for N=4) |
| Geographic redundancy | Multi-AZ or multi-region deployment | Survives datacenter/region loss | 2–3× infra + replication lag complexity |
Common failure modes
- Single Point of Failure (SPOF): One component whose failure takes down the system—eliminate with redundancy.
- Cascading failure: Overloaded node fails → retries flood siblings → they fail too. Fix: circuit breakers, bulkheads, rate limits.
- Split brain: Network partition causes two nodes to both think they're primary. Fix: quorum, fencing tokens, STONITH.
- Thundering herd: Cache expires → all requests hit DB simultaneously. Fix: jittered TTL, mutex on miss, pre-warming.
- Retry storm: Clients retry failed requests → amplifies load 3–10×. Fix: exponential backoff with jitter, idempotency keys.
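Exponential backoff with jitter, named above as the fix for retry storms, can be sketched like this. It uses the "full jitter" variant (a uniform delay between zero and the exponential ceiling); the base, cap, and function name are illustrative assumptions.

```python
import random


def backoff_delays(max_retries, base_s=0.1, cap_s=10.0, rng=random):
    """Full jitter: each delay is uniform in [0, min(cap, base * 2**attempt)]."""
    delays = []
    for attempt in range(max_retries):
        ceiling = min(cap_s, base_s * 2 ** attempt)
        delays.append(rng.uniform(0, ceiling))
    return delays


# Each client draws different delays, so retries spread out instead of arriving in waves.
for delay in backoff_delays(5):
    print(f"sleep {delay:.3f}s before next retry")
```

Pair this with a retry limit and idempotency keys; jitter spreads the load but does not make a non-idempotent request safe to repeat.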
Redundancy without health checks is theater. An active-passive DB pair where the standby hasn't replicated the last 30 seconds of writes will fail over into data loss. Test failover quarterly—most teams discover broken automation during the real incident.
L5: Proactively identify SPOFs in your diagram: "The load balancer is a SPOF, so we use managed ALB with cross-AZ. The config service is a SPOF, so we cache config locally with 5-minute TTL for graceful degradation."
CAP Theorem
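Eric Brewer's CAP theorem (formalized by Gilbert and Lynch, 2002) states that a distributed system cannot guarantee all three of the following properties at once. Because partitions cannot be ruled out, the practical consequence is that during a network partition the system must give up either consistency or availability.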
The three properties
- Consistency (C): Every read receives the most recent write or an error. All nodes see the same data at the same time (linearizable).
- Availability (A): Every request receives a non-error response, without guarantee that it reflects the latest write.
- Partition tolerance (P): The system continues operating despite arbitrary message loss or delay between nodes (network splits).
In practice, partitions happen—switch failures, GC pauses masquerading as timeouts, cross-AZ link cuts. P is not optional in distributed systems. The real choice is CP vs AP during partition: reject requests (sacrifice availability) or serve potentially stale data (sacrifice consistency).
| Choice | During partition | Examples | Use when |
|---|---|---|---|
| CP | Minority partition unavailable; rejects writes/reads | etcd, ZooKeeper, HBase, traditional RDBMS with sync replication | Financial transactions, leader election, distributed locks, inventory counts |
| AP | Both partitions serve requests; may diverge temporarily | Cassandra, DynamoDB, CouchDB, DNS, Riak | Shopping carts, social likes, analytics, session data, DNS |
| CA | Only on single node (no partition possible) | Single-node PostgreSQL/MySQL (no replication) | Not truly distributed; breaks when you add a replica |
CAP is often misquoted as "pick two at design time." Partitions are rare but inevitable, so the choice is what happens during the partition, not a permanent architectural label. Many systems are mostly available with tunable consistency (DynamoDB: eventual by default, strongly consistent reads optional at 2× the read-capacity cost).
L5: Don't stop at "we pick AP." Say: "Shopping cart is AP; merging conflicting carts on reconnect is acceptable. Payment authorization is CP; we fail closed if we can't reach the ledger." Same system, different CAP choices per operation.
ZooKeeper (CP) uses ZAB consensus: during a partition, the minority side stops accepting writes. Cassandra (AP) offers tunable consistency: at consistency level ONE, each side of a partition keeps serving requests independently, while QUORUM reads/writes succeed only where a majority of replicas is reachable. Hinted handoff and read repair reconcile divergent replicas later.
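The quorum arithmetic behind tunable consistency is small enough to sketch. This is a simplified model, assuming a replication factor of 3 and a partition that leaves two replicas on one side and one on the other; the function names are illustrative and do not correspond to any client API.

```python
def quorum(replication_factor):
    """Majority of replicas, e.g. 2 of 3 or 3 of 5."""
    return replication_factor // 2 + 1


def read_sees_latest_write(n, r, w):
    """Read and write replica sets must overlap: R + W > N."""
    return r + w > n


def side_can_serve(reachable_replicas, required_acks):
    """A partition side can serve a request only if enough replicas answer."""
    return reachable_replicas >= required_acks


N = 3
majority_side, minority_side = 2, 1

print(side_can_serve(majority_side, quorum(N)))         # True
print(side_can_serve(minority_side, quorum(N)))         # False
print(side_can_serve(minority_side, 1))                 # True
print(read_sees_latest_write(N, quorum(N), quorum(N)))  # True
print(read_sees_latest_write(N, 1, 1))                  # False
```

The last two lines show the trade: requiring one replica keeps every side available but gives up the overlap guarantee, while requiring a majority keeps the guarantee and gives up the minority side.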
PACELC
Daniel Abadi's PACELC extension (2012) adds: Else, even when there is no partition, you still trade off Latency (L) vs Consistency (C). CAP alone misses the normal-case trade-off.
Full form: If Partition → choose A or C. Else (normal operation) → choose L or C.
| System | If Partition (PA or PC) | Else (EL or EC) | Classification | Notes |
|---|---|---|---|---|
| MySQL (InnoDB, single primary) | PC — primary unavailable blocks writes | EC — sync replication waits for replica ack (strong) or async (faster, weaker) | PC/EC | Read replicas give EL for reads; primary writes are EC with semi-sync |
| Cassandra | PA — both sides accept writes | EL — ONE reads/writes are fast; QUORUM is slower but consistent | PA/EL (default) | Tunable per query: LOCAL_QUORUM, ALL |
| DynamoDB | PA — multi-AZ continues serving | EL — eventual consistency default (~1 ms); strong consistency option (~2 ms, 2× RCU) | PA/EL | Strongly consistent reads hit the leader replica only |
| MongoDB | PC — minority partition loses primary | EC — w:majority (the default write concern since MongoDB 5.0) waits for replication; w:1 is faster but weaker | PC/EC (configurable) | Replica set elections cause brief unavailability (CP behavior) |
DynamoDB strong reads cost 2× read capacity units and add ~1 ms latency vs eventual. For a product catalog where stale price for 1 second is acceptable, use eventual (EL). For wallet balance, pay the EC cost or use conditional writes with version checks.
Amazon Dynamo (the paper behind DynamoDB) explicitly chose PA/EL: shopping cart writes must never fail, and 200 ms staleness on item count is fine. Google Spanner chose PC/EC with TrueTime—paying latency cost (~5–10 ms cross-region) for external consistency globally.
Principal: PACELC reframes database selection as a per-operation decision, not a system-wide label. A Principal engineer maps each API endpoint to its PACELC quadrant and documents the business justification.
Consistency Models
Consistency defines what guarantees readers see after writers commit. Stronger guarantees cost latency and availability; weaker guarantees enable scale and speed.
Consistency is not binary. It's a spectrum from "every read sees the latest write globally" (linearizability) to "reads may be arbitrarily stale" (eventual). Pick the weakest model that satisfies your product requirements.
| Model | Guarantee | Latency cost | Example use case |
|---|---|---|---|
| Strong / Linearizable | Every read sees the most recent write; global total order | High — cross-region: 50+ ms; requires consensus or single leader | Bank balances, inventory deduction, distributed locks |
| Sequential | All nodes agree on operation order, but reads may lag slightly | Medium-high | Collaborative editing with ordering |
| Causal | Cause precedes effect; concurrent ops may be seen in different order | Medium — vector clocks or version chains | Social comments, messaging threads |
| Read-your-writes | User always sees their own recent writes (session consistency) | Low-medium — sticky routing or session tokens | User profile edits, form submissions, settings pages |
| Monotonic reads | User never sees older data on subsequent reads | Low — route to same replica or track version | Feed pagination, timeline scrolling |
| Eventual | If writes stop, all replicas converge to the same value given enough time | Lowest — async replication, no coordination | DNS, CDN, view counts, recommendation scores |
Read-your-writes in practice
After a user updates their avatar, they must see the new avatar immediately—even if other users see the old one for seconds. Implementation options:
- Sticky sessions: Route user to the same replica that handled the write (fragile on scale-down).
- Session token with version: Client sends last-write timestamp; server reads from replica at least that fresh.
- Write-through cache: Update cache synchronously on write; reads hit cache first.
- Primary reads for session: User's writes go to primary; their reads also hit primary (others hit replicas).
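The session-token option can be sketched as a replica-selection rule. This is a minimal illustration under assumptions: versions are monotonically increasing integers (a log sequence number would play this role in practice), and the `Replica` type and `pick_replica` helper are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    name: str
    applied_version: int  # highest write version this replica has applied


def pick_replica(replicas, min_version, primary):
    """Return a replica that has applied at least min_version, else fall back to the primary."""
    for replica in replicas:
        if replica.applied_version >= min_version:
            return replica
    return primary


primary = Replica("primary", applied_version=42)
replicas = [Replica("replica-1", 40), Replica("replica-2", 42)]

# The client's session token says its last write was version 42.
print(pick_replica(replicas, 42, primary).name)  # replica-2
# No replica has caught up to version 43 yet, so read from the primary.
print(pick_replica(replicas, 43, primary).name)  # primary
```

Other users send no token (or an old one), so their reads can land on any replica and stay cheap.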
Facebook's TAO cache layer provides read-your-writes by routing a user's reads to the cache partition that processed their write. Cross-user reads may be eventually consistent—exactly the product requirement for social feeds.
Strong consistency across regions adds 50–100 ms to every write (consensus round-trip). For a global chat app, causal consistency per conversation is sufficient and 10× faster. For a global ledger, you pay the latency or partition by region with async settlement.
When to use each model
| Scenario | Recommended model | Why |
|---|---|---|
| Payment / transfer | Strong (linearizable) | Double-spend is unacceptable; fail closed on uncertainty |
| User profile edit | Read-your-writes | User must see own changes; others can wait seconds |
| Twitter timeline | Eventual + monotonic reads | Stale tweet for 2 s is fine; going backward in time is not |
| Inventory (flash sale) | Strong or optimistic locking | Overselling 10 units costs money and trust |
| Page view counter | Eventual | Exact count unnecessary; approximate is 100× cheaper |
| Collaborative document | Causal or CRDT | Concurrent edits merge; total order unnecessary |
Ask the interviewer: "Does the user need to see their own write immediately, or do all users need the latest data globally?" That one question narrows strong → read-your-writes → eventual and saves 10 minutes of over-engineering.
Defaulting to "strong consistency everywhere" is the distributed systems equivalent of premature optimization in reverse: you pay 10× latency and availability cost where eventual would suffice. Match the model to the operation, not the system.
Back-of-Envelope Estimation
Every system design interview includes estimation. The goal is not precision—it's demonstrating structured thinking, stated assumptions, and order-of-magnitude correctness. Off by 2× is fine; off by 100× is not.
Powers of two (memorize these)
| Power | Exact value | Approximation |
|---|---|---|
| 2^10 | 1,024 | ~1 Thousand (1 KB) |
| 2^20 | 1,048,576 | ~1 Million (1 MB) |
| 2^30 | 1,073,741,824 | ~1 Billion (1 GB) |
| 2^40 | 1,099,511,627,776 | ~1 Trillion (1 TB) |
| 2^50 | 1,125,899,906,842,624 | ~1 Quadrillion (1 PB) |
Other useful numbers
- 1 day = 86,400 seconds ≈ 100K seconds (round for mental math)
- 1 month ≈ 2.5M seconds
- 1 year ≈ 30M seconds
- Peak QPS ≈ 2–3× average (sometimes 10× for viral events)
- 1 server (well-tuned API) ≈ 1K–10K RPS (use 1K for conservative estimates)
- 1 SSD ≈ 100K IOPS; 1 HDD ≈ 100 IOPS
- 1 Gbps = 125 MB/s; 10 Gbps = 1.25 GB/s
Estimation framework (5 steps)
- Clarify assumptions: DAU, actions per user per day, read/write ratio, payload size.
- Calculate QPS: (DAU × ops/user) / 86400; multiply by peak factor.
- Calculate storage: DAU × writes/user/day × bytes/record × retention days.
- Calculate bandwidth: peak QPS × average response/request size.
- Calculate servers: peak QPS / per-server capacity; add 30% headroom.
Say aloud: "I'll assume 300M DAU, 2 tweets read per user per day, peak 3× average. Read QPS = 300M × 2 / 100K = 6K average, ~18K peak. Write QPS = 300M × 0.5 tweets / 100K = 1.5K average, ~4.5K peak."
Worked example: Twitter-scale timeline reads
Assumptions:
- 300M DAU
- Each user reads timeline 10×/day (scroll sessions)
- Each user posts 0.5 tweets/day (150M tweets/day)
- Peak traffic = 3× average
- Tweet size ≈ 300 bytes; timeline page ≈ 20 tweets = 6 KB response
- Retention: 5 years of tweets
READ QPS
= 300M DAU × 10 reads / 86,400 sec
= 3B / 86,400 ≈ 35,000 avg
Peak = 35K × 3 ≈ 105,000 read QPS
WRITE QPS
= 300M × 0.5 / 86,400 ≈ 1,750 avg
Peak ≈ 5,250 write QPS
STORAGE (5 years)
= 150M tweets/day × 300 bytes × 365 × 5
= 150M × 300 × 1,825 ≈ 82 × 10^12 bytes ≈ 82 TB raw
With indexes (3×) ≈ 250 TB
BANDWIDTH (peak reads)
= 105K RPS × 6 KB ≈ 630 MB/s ≈ 5 Gbps
APP SERVERS (stateless API @ 1K RPS each)
= 105K / 1K ≈ 105 servers + 30% headroom ≈ 140 servers
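The walkthrough above can be turned into a small calculator so the formulas are explicit and the unit conversions are checked by machine. This is a sketch: the function and parameter names are made up, and it uses exact arithmetic (86,400 seconds per day) where the walkthrough rounds.

```python
import math

SECONDS_PER_DAY = 86_400


def estimate(dau, reads_per_user, writes_per_user, peak_factor,
             record_bytes, response_bytes, retention_days,
             rps_per_server, headroom=0.30):
    """Back-of-envelope capacity numbers from a handful of stated assumptions."""
    avg_read_qps = dau * reads_per_user / SECONDS_PER_DAY
    avg_write_qps = dau * writes_per_user / SECONDS_PER_DAY
    peak_read_qps = avg_read_qps * peak_factor
    return {
        "avg_read_qps": avg_read_qps,
        "peak_read_qps": peak_read_qps,
        "avg_write_qps": avg_write_qps,
        "peak_write_qps": avg_write_qps * peak_factor,
        "storage_bytes": dau * writes_per_user * record_bytes * retention_days,
        "peak_read_gbps": peak_read_qps * response_bytes * 8 / 1e9,
        "app_servers": math.ceil(peak_read_qps / rps_per_server * (1 + headroom)),
    }


result = estimate(
    dau=300_000_000, reads_per_user=10, writes_per_user=0.5, peak_factor=3,
    record_bytes=300, response_bytes=6_000, retention_days=365 * 5,
    rps_per_server=1_000,
)
for name, value in result.items():
    print(f"{name}: {value:,.0f}")
```

Exact arithmetic gives slightly smaller request rates than the rounded walkthrough (about 104K peak read QPS rather than 105K), which is the expected effect of rounding 34.7K up to 35K before multiplying.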
Round aggressively: 86,400 → 100K, 300M × 10 = 3B. The interviewer cares about your formula, not whether you remembered that a day has exactly 86,400 seconds.
L6: After computing numbers, sanity-check against known systems: "105K read QPS is roughly Twitter 2012 scale; they'd need fan-out on write or hybrid model, not naive per-request DB joins." Connect estimation to architecture decisions.
Instagram engineers report that back-of-envelope estimates during design reviews catch 80% of scaling issues before code is written. A 10-minute estimation at whiteboard time saves months of re-architecture after launch.