System Design Fundamentals

Every architecture decision rests on a handful of quantitative truths: how fast memory is, how far apart datacenters are, what "99.9% available" actually costs, and whether your database picks consistency or availability when the network splits. This chapter is the foundation—memorize the numbers, internalize the trade-offs, and practice estimation until it becomes reflex.


Latency Numbers Every Engineer Must Know

Jeff Dean's latency numbers (updated for modern hardware) are the single most valuable cheat sheet in system design. They explain why caching works, why you can't hit the database on every request, and why cross-region replication is never "free."

Computers are fast at some things and catastrophically slow at others, and the gap spans more than seven orders of magnitude, from ~1 ns for an L1 cache hit to ~50 ms for a cross-region round-trip. When you design a system, every component choice is really a bet on where data lives relative to the CPU that needs it.

Operation | Latency | Human-scale analogy (1 ns = 1 second) | Design implication
L1 cache reference | ~1 ns | 1 second | In-process data structures; avoid cache misses in hot loops
L2 cache reference | ~4 ns | 4 seconds | CPU-bound work stays fast if working set fits L2 (~256 KB per core)
L3 cache reference | ~40 ns | 40 seconds | Shared L3 (~32 MB) is the last level before main memory
Main memory (RAM) reference | ~100 ns | ~1.5 minutes | In-memory caches (Redis, Memcached) keep data here: ~100× slower than L1 but ~1000× faster than SSD
SSD random read | ~100 μs (100,000 ns) | ~1 day | NVMe SSD: ~100K IOPS; still 1000× slower than RAM
HDD seek + read | ~5 ms (5,000,000 ns) | ~2 months | Sequential reads only; random I/O is death for OLTP
Same-datacenter RTT | ~500 μs (0.5 ms) | ~6 days | Every RPC adds ~0.5 ms; 10 sequential calls = 5 ms minimum
Cross-region RTT (US ↔ EU) | ~50 ms | ~1.5 years | Multi-region strong consistency requires accepting 50+ ms writes
Read 1 MB from RAM | ~25 μs | n/a | Memory bandwidth ~40 GB/s on a modern server (the often-quoted ~250 μs figure dates from older hardware)
Read 1 MB from SSD | ~1 ms | n/a | Sequential SSD ~1 GB/s; random small reads much worse
Read 1 MB from HDD | ~20 ms | n/a | Sequential ~100 MB/s; seek dominates random access
Send 1 MB over 1 Gbps network | ~10 ms | n/a | Bandwidth-bound; compression and CDN edge caching reduce payload

The hierarchy in one sentence

If your hot path touches disk or network more than once per user request, you have a latency problem that no amount of "faster CPUs" will fix. The fix is always one of three moves: bring data closer (cache), touch fewer things (denormalize, batch), or do less work on the request path (async, precompute).

🔬 Under the Hood

A single PostgreSQL query with an index lookup on SSD might take 1–5 ms for the disk I/O alone, plus 0.1–0.5 ms for query planning and execution. Add a same-DC network hop (0.5 ms) and JSON serialization (0.1 ms), and a "simple" read is already 2–6 ms before your application logic runs. With a 200 ms p99 budget, that leaves room for at most ~30–100 such reads back to back. Parallel fan-out costs only as much as its slowest branch, but in practice serial chains dominate.
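The budget arithmetic above can be sketched in a few lines. This is a rough illustration, not a measurement: the per-stage costs are picked from the ranges quoted in this section, and the names (STAGE_MS, serial_latency_ms, parallel_latency_ms) are hypothetical.

```python
# Assumed per-stage costs in milliseconds, chosen from the ranges quoted above.
STAGE_MS = {
    "ssd_index_lookup": 3.0,   # 1-5 ms disk I/O
    "plan_and_execute": 0.3,   # 0.1-0.5 ms
    "same_dc_hop": 0.5,
    "json_serialization": 0.1,
}


def serial_latency_ms(costs_ms):
    """Stages that run one after another add up."""
    return sum(costs_ms)


def parallel_latency_ms(branch_costs_ms):
    """Branches that run concurrently cost as much as the slowest branch."""
    return max(branch_costs_ms)


one_read_ms = serial_latency_ms(STAGE_MS.values())
budget_ms = 200.0
max_serial_reads = int(budget_ms // one_read_ms)  # reads that fit back to back

if __name__ == "__main__":
    print(f"one read: {one_read_ms:.1f} ms")
    print(f"serial reads within {budget_ms:.0f} ms: {max_serial_reads}")
    print(f"10 reads in parallel: {parallel_latency_ms([one_read_ms] * 10):.1f} ms")
```

The point of the comparison is that ten parallel reads cost about as much as one, while ten serial reads cost ten times as much.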

🎯 Interview Tip

When asked "why cache?", cite numbers: "RAM is ~100 ns; SSD is ~100 μs—that's 1000×. A Redis hit at 1 ms still beats a DB round-trip at 5–10 ms." Interviewers want quantitative reasoning, not "caching makes things faster."
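To put a number on the caching argument, here is a minimal sketch of average read latency for a cache-aside path. It assumes a miss pays for the cache lookup and then the database round-trip; the function name and the default costs (1 ms cache, 5 ms database) are illustrative values taken from the quote above.

```python
def effective_latency_ms(hit_ratio, cache_ms=1.0, db_ms=5.0):
    """Average read latency when a miss costs cache lookup + DB round-trip."""
    if not 0.0 <= hit_ratio <= 1.0:
        raise ValueError("hit_ratio must be between 0 and 1")
    miss_ms = cache_ms + db_ms
    return hit_ratio * cache_ms + (1.0 - hit_ratio) * miss_ms


if __name__ == "__main__":
    for ratio in (0.0, 0.5, 0.9, 0.99):
        print(f"hit ratio {ratio:.2f}: {effective_latency_ms(ratio):.2f} ms")
```

Under these assumptions a 90% hit ratio brings the average from 6 ms down to 1.5 ms, which is why hit ratio is usually the first cache metric to watch.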

Practical rules derived from the numbers

  • Keep hot data in RAM. Redis/Memcached for session, feed fragments, rate-limit counters.
  • Minimize serial RPC chains. Each same-DC hop costs ~0.5 ms; fan-out in parallel, not sequence.
  • Batch disk writes. On SSD, a 10 KB write followed by a flush costs nearly as much as a 1 MB write, because the fixed per-write overhead (flush/fsync) dominates; batching amortizes it.
  • Never assume cross-region is fast. 50 ms RTT means synchronous replication caps write throughput and inflates p99.
  • Profile before optimizing CPU. If you're I/O-bound, faster algorithms won't help—change the data path.
📦 Real World

Google has reported that a 500 ms delay in search results reduces engagement by 20%. Amazon found that every 100 ms of latency cost 1% in sales. Netflix precomputes recommendation rows in batch jobs so the API path stays under 50 ms p99; collaborative filtering never runs on the request hot path.

💡 Pro Tip

Memorize the ratios, not exact nanoseconds: L1 → RAM is ~100×, RAM → SSD is ~1000×, SSD → cross-region is ~500×. These three ratios cover 90% of interview latency discussions.

Throughput vs Latency

Throughput is how many requests you handle per second. Latency is how long one request takes. They are related but not interchangeable—and optimizing one often degrades the other.

Throughput (requests/sec, QPS, TPS) measures system capacity. Latency measures user-perceived speed for a single operation. A system can have high throughput with terrible latency (batch processing) or low latency with low throughput (single-threaded optimized path).

Little's Law: the bridge between them

L = λ × W: the average number of concurrent requests (L) equals the arrival rate (λ) times the average time in system (W). If your API handles 10,000 RPS with 100 ms average latency, you have ~1,000 in-flight requests at any moment. If traffic doubles without added capacity, requests queue and W grows; once arrivals exceed what the system can serve, latency climbs without bound rather than merely doubling.

Example: checkout service
  Arrival rate (λ)  = 2,000 RPS
  Avg latency (W)   = 50 ms = 0.05 s
  Concurrent (L)    = 2,000 × 0.05 = 100 in-flight requests

  If λ doubles to 4,000 RPS without adding capacity:
  W increases → queue depth grows → p99 explodes
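The checkout example can be reproduced with a short sketch. The first function is Little's Law as stated. The second is an assumption added for illustration: it models the service as a single M/M/1 queue, whose mean time in system is 1 / (μ − λ), to show how latency grows non-linearly as load approaches capacity. The 4,500 RPS capacity is a made-up figure.

```python
def in_flight_requests(arrival_rps, avg_latency_s):
    """Little's Law: L = lambda * W."""
    return arrival_rps * avg_latency_s


def mm1_latency_s(arrival_rps, service_rps):
    """Mean time in system for an M/M/1 queue: W = 1 / (mu - lambda)."""
    if arrival_rps >= service_rps:
        raise ValueError("arrival rate must stay below service rate")
    return 1.0 / (service_rps - arrival_rps)


if __name__ == "__main__":
    print(in_flight_requests(2_000, 0.05))  # checkout example: ~100 in flight
    capacity_rps = 4_500  # hypothetical service rate
    for load_rps in (2_000, 4_000, 4_400):
        latency_ms = mm1_latency_s(load_rps, capacity_rps) * 1000
        print(f"{load_rps} RPS -> {latency_ms:.1f} ms")
```

In this toy model, doubling load from 2,000 to 4,000 RPS multiplies latency by five, not two, because the queue is much closer to saturation.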

Percentiles: why averages lie

Average latency hides tail behavior. A service with 10 ms average but 2 s p99 feels broken to 1% of users—that's millions of users at scale. Always design and commit to percentile SLOs, not averages.
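A small sketch makes the gap between average and tail concrete. It uses the nearest-rank definition of a percentile (one of several common definitions) on a synthetic sample: 980 fast requests and 20 slow ones. The sample is invented for illustration.

```python
import math
from statistics import mean


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of
    all samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


# Synthetic latencies in ms: 98% fast, 2% very slow.
latencies_ms = [10] * 980 + [2000] * 20

if __name__ == "__main__":
    print(f"mean: {mean(latencies_ms):.1f} ms")
    print(f"p50:  {percentile(latencies_ms, 50)} ms")
    print(f"p99:  {percentile(latencies_ms, 99)} ms")
```

The mean of this sample is about 50 ms, the median is 10 ms, and p99 is 2 s: three very different stories from the same data.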

Percentile | Meaning | Typical target | What causes tail latency
p50 (median) | Half of requests faster, half slower | 20–50 ms for read APIs | Usually healthy; not sufficient alone
p90 | 90% of requests faster | 50–100 ms | Occasional cache miss, GC pause
p99 | 99% of requests faster | 100–300 ms (product-dependent) | DB slow query, retry storm, lock contention
p999 | 99.9% of requests faster | 500 ms – 2 s | Cold start, cross-region failover, full GC
max | Worst observed request | Not used for SLOs (outliers unbounded) | Timeouts, deadlocks, cascading failure
⚖️ Trade-off

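Batching increases throughput but also increases latency. Kafka producers accumulate records into batches (batch.size defaults to 16 KB and is often raised toward 1 MB), and a single message may wait up to linger.ms for the batch to fill (the default is 0 ms in older clients and 5 ms from Kafka 4.0). High-throughput pipelines accept this delay; interactive APIs cannot. Choose per use case.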

SLI, SLO, SLA — the observability contract

  • SLI (Indicator): What you measure — e.g., "proportion of requests completing in < 200 ms."
  • SLO (Objective): Target — e.g., "99.9% of requests < 200 ms over 30 days."
  • SLA (Agreement): Business contract with consequences — e.g., "99.9% uptime or credits."

Error budget = 1 − SLO. At 99.9% availability, you have 43.8 minutes/month of allowed downtime. Spend error budget on risky launches; conserve it during incidents.

The four golden signals (Google SRE)

Signal | Question it answers | Example metric
Latency | How long do requests take? | p50/p99 response time by endpoint
Traffic | How much demand? | RPS, concurrent users, bytes/sec
Errors | How many requests fail? | HTTP 5xx rate, timeout rate, error budget burn
Saturation | How full is the system? | CPU %, memory %, DB connection pool usage, queue depth
flowchart LR
  Client[Client]
  LB[Load Balancer\n~0.1 ms]
  API[API Server\n~5 ms]
  Cache[Redis Cache\n~1 ms]
  DB[(PostgreSQL\n~5 ms)]
  Client --> LB --> API
  API --> Cache
  Cache -->|miss| DB
  Cache -->|hit| API
  DB --> API
  API --> LB --> Client
🏆 Senior Signal

[L6] Staff candidates discuss coordinated omission: load generators that wait for responses before sending the next request hide latency spikes under overload. Use open-loop load testing (fixed RPS regardless of response time) to find real breaking points.

⚠️ Pitfall

Saying "we need low latency" without a number is a red flag. Strong answer: "Feed load p99 < 200 ms; timeline can tolerate 500 ms. We'll measure p50/p99 separately and alert on SLO burn rate."

Scalability

Scalability is the ability to handle growth—more users, more data, more requests—by adding resources without redesigning the architecture. The question is always: what kind of growth, and at what cost?

Vertical vs horizontal scaling

Dimension | Vertical (scale up) | Horizontal (scale out)
Method | Bigger machine: more CPU, RAM, faster disk | More machines: add nodes behind load balancer
Limits | Hardware ceiling (e.g., 768 GB RAM, 128 cores) | Coordination overhead, data partitioning complexity
Cost curve | Non-linear: 2× RAM often costs 3× the price | Near-linear with commodity hardware
Downtime | Often requires restart for resize | Rolling add; zero downtime if stateless
Best for | Databases (initially), monoliths, quick fixes | Web tiers, workers, read replicas, sharded stores
Example | db.r6g.8xlarge → db.r6g.16xlarge | 10 → 100 API pods behind ALB
⚖️ Trade-off

Vertical scaling is simpler until it isn't. A PostgreSQL primary on a 96-core, 768 GB machine handles ~50K simple reads/sec—but one hot row or long transaction blocks everyone. Horizontal read replicas scale reads; sharding scales writes. Most systems vertical-scale the DB first, then shard when write QPS exceeds single-node capacity (~10K–50K writes/sec depending on row size and indexes).

The AKF Scale Cube

Martin Abbott and Michael Fisher's AKF Scale Cube defines three independent axes of scaling. Mature systems scale on all three simultaneously.

  • X-axis — Horizontal duplication: Clone the entire service N times behind a load balancer. Stateless web/API tier. Easiest; no code changes.
  • Y-axis — Functional decomposition: Split by feature/service—User Service, Order Service, Payment Service. Microservices. Reduces blast radius; adds network overhead.
  • Z-axis — Data partitioning: Split by shard key—user_id % N, geographic region, tenant_id. Scales data and write throughput. Hardest; requires routing and rebalancing.
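The Z-axis bullet mentions user_id % N and notes that rebalancing is hard. The sketch below shows both: a modulo router, a variant for string keys, and a helper that counts how many keys change shard when N changes. All function names are hypothetical, and SHA-256 is used only because Python's built-in hash() for strings is not stable across processes.

```python
import hashlib


def shard_for_user(user_id: int, num_shards: int) -> int:
    """Route a numeric key with plain modulo, as in user_id % N."""
    return user_id % num_shards


def shard_for_key(key: str, num_shards: int) -> int:
    """Route a string key (e.g. tenant_id) via a stable hash."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def moved_fraction(num_keys: int, old_shards: int, new_shards: int) -> float:
    """Fraction of keys 0..num_keys-1 that land on a different shard
    after resizing from old_shards to new_shards with modulo routing."""
    moved = sum(
        1 for k in range(num_keys)
        if shard_for_user(k, old_shards) != shard_for_user(k, new_shards)
    )
    return moved / num_keys


if __name__ == "__main__":
    print(shard_for_user(12345, 4))
    print(f"keys moved going 4 -> 5 shards: {moved_fraction(10_000, 4, 5):.0%}")
```

With modulo routing, growing from 4 to 5 shards relocates about 80% of keys, which is one reason schemes such as consistent hashing or directory-based routing are often preferred when shard counts change.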
🔬 Under the Hood

Twitter's early architecture scaled on the X-axis (more Ruby app servers) until the monolith itself became the bottleneck. They then moved along the Y-axis (splitting into services) and the Z-axis (Snowflake IDs, Gizzard sharding for timelines). No single axis solved everything; the combination did.

Stateless vs stateful services

Property | Stateless | Stateful
Session data | Externalized to Redis/DB; any node serves any request | Sticky sessions or in-memory state; node affinity required
Scaling | Add/remove nodes freely; auto-scale on CPU/RPS | Rebalancing state on scale-down; complex failover
Examples | REST API, CDN edge, Lambda functions | WebSocket servers, game servers, Kafka consumers with local state
Failure handling | Retry on any node; trivial | State recovery, partition reassignment, split-brain risk
🎯 Interview Tip

[L4] When drawing architecture, explicitly label stateless boxes (API, workers) vs stateful (DB, cache, queue). Say: "API tier is stateless, so we scale horizontally. Session in Redis, not in-memory."

📦 Real World

Discord moved from stateful Elixir nodes (millions of WebSocket connections per machine) to a hybrid: stateless API gateway + stateful connection nodes with consistent hashing for user→node mapping. Uber scales dispatch statelessly but keeps driver location state in Redis geospatial indexes (stateful data store, stateless readers).

Reliability & Availability

Reliability is correctness: the system does the right thing. Availability is uptime: the system responds when called. You can be available but wrong (stale reads) or correct but unavailable (CP during partition).

The nines table

Availability | Downtime / year | Downtime / month | Downtime / week | Typical tier
99% (two nines) | 3.65 days | 7.3 hours | 1.68 hours | Internal tools, dev environments
99.9% (three nines) | 8.76 hours | 43.8 minutes | 10.1 minutes | B2B SaaS, non-critical APIs
99.99% (four nines) | 52.6 minutes | 4.38 minutes | 1.01 minutes | Payment systems, core product APIs
99.999% (five nines) | 5.26 minutes | 26.3 seconds | 6.05 seconds | Telco, hospital systems, trading infra
99.9999% (six nines) | 31.5 seconds | 2.63 seconds | 0.605 seconds | Requires active-active multi-region + formal methods

Each additional nine costs roughly 10× in engineering and infrastructure. Most consumer products target 99.9%–99.99%. Don't over-engineer five nines for a blog.
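The downtime figures in the nines table follow from one multiplication. A minimal sketch, assuming a 365-day year and a month of one twelfth of that (730 hours); the names are illustrative.

```python
# Hours per period; a month is taken as 1/12 of a 365-day year (730 h).
PERIOD_HOURS = {"year": 365 * 24, "month": 365 * 24 / 12, "week": 7 * 24}


def allowed_downtime_minutes(availability_pct, period="month"):
    """Downtime budget in minutes for a given availability target."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability_pct must be between 0 and 100")
    unavailable_fraction = 1 - availability_pct / 100
    return unavailable_fraction * PERIOD_HOURS[period] * 60


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99, 99.999):
        per_month = allowed_downtime_minutes(target, "month")
        per_year = allowed_downtime_minutes(target, "year")
        print(f"{target}%: {per_month:.2f} min/month, {per_year:.1f} min/year")
```

The same function doubles as an error-budget calculator: a 99.9% monthly SLO leaves roughly 43.8 minutes to spend.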

MTBF and MTTR

Availability = MTBF / (MTBF + MTTR)

  • MTBF (Mean Time Between Failures): How long the system runs before breaking. Increase via redundancy, chaos testing, better hardware.
  • MTTR (Mean Time To Recovery): How long to restore service after failure. Decrease via automation, runbooks, circuit breakers, graceful degradation.
Example: API cluster
  MTBF = 720 hours (30 days between incidents)
  MTTR = 0.5 hours (30 min to recover)
  Availability = 720 / (720 + 0.5) = 99.93%

  To reach 99.99% without improving MTBF:
  MTTR must drop to ~0.072 hours (about 4.3 minutes)
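The availability formula can also be solved for MTTR, which is handy when the target is fixed and recovery time is the lever. A small sketch with hypothetical function names:

```python
def availability(mtbf_hours, mttr_hours):
    """Availability = MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def max_mttr_hours(target_availability, mtbf_hours):
    """Largest MTTR that still meets the target, from solving
    A = MTBF / (MTBF + MTTR) for MTTR."""
    if not 0 < target_availability < 1:
        raise ValueError("target_availability must be between 0 and 1")
    return mtbf_hours * (1 - target_availability) / target_availability


if __name__ == "__main__":
    print(f"{availability(720, 0.5):.4%}")
    limit = max_mttr_hours(0.9999, 720)
    print(f"MTTR limit for 99.99%: {limit:.3f} h ({limit * 60:.1f} min)")
```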

Redundancy patterns

Pattern | Description | Failure behavior | Cost
Active-passive | Standby replica idle until primary fails | Failover delay (seconds–minutes); possible data loss window | Low (1× idle capacity)
Active-active | All nodes serve traffic simultaneously | Instant reroute; requires conflict resolution | High (N× full capacity)
N+1 | N nodes needed + 1 spare for one failure | One failure absorbed; second failure degrades | Moderate (~33% overhead for N=3)
N+2 | Two simultaneous failures tolerated | Survives double failure during maintenance | Higher (~50% overhead for N=4)
Geographic redundancy | Multi-AZ or multi-region deployment | Survives datacenter/region loss | 2–3× infra + replication lag complexity

Common failure modes

  • Single Point of Failure (SPOF): One component whose failure takes down the system—eliminate with redundancy.
  • Cascading failure: Overloaded node fails → retries flood siblings → they fail too. Fix: circuit breakers, bulkheads, rate limits.
  • Split brain: Network partition causes two nodes to both think they're primary. Fix: quorum, fencing tokens, STONITH.
  • Thundering herd: Cache expires → all requests hit DB simultaneously. Fix: jittered TTL, mutex on miss, pre-warming.
  • Retry storm: Clients retry failed requests → amplifies load 3–10×. Fix: exponential backoff with jitter, idempotency keys.
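The retry-storm fix named above, exponential backoff with jitter, can be sketched as follows. This uses the "full jitter" variant (a uniform delay between zero and an exponentially growing ceiling); the function names and default values are assumptions for illustration, and real code should retry only errors known to be transient.

```python
import random
import time


def backoff_delays(max_retries, base_s=0.1, cap_s=10.0, rng=random.random):
    """Yield one jittered delay per retry: uniform in
    [0, min(cap_s, base_s * 2**attempt))."""
    for attempt in range(max_retries):
        ceiling = min(cap_s, base_s * (2 ** attempt))
        yield rng() * ceiling


def call_with_retries(operation, max_retries=5, sleep=time.sleep):
    """Call operation(); on failure wait a jittered delay and try again.
    Makes at most max_retries + 1 attempts, then re-raises the last error."""
    delays = backoff_delays(max_retries)
    while True:
        try:
            return operation()
        except Exception:  # sketch only: catch specific retryable errors in practice
            delay = next(delays, None)
            if delay is None:
                raise
            sleep(delay)
```

Retries are only safe when the operation is idempotent, which is why the bullet pairs backoff with idempotency keys. The jitter spreads clients out so they do not all retry at the same instant.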
⚠️ Pitfall

Redundancy without health checks is theater. An active-passive DB pair where the standby hasn't replicated the last 30 seconds of writes will fail over into data loss. Test failover quarterly—most teams discover broken automation during the real incident.

🏆 Senior Signal

[L5] Proactively identify SPOFs in your diagram: "The load balancer is a SPOF, so we use managed ALB with cross-AZ. The config service is a SPOF, so we cache config locally with 5-minute TTL for graceful degradation."

CAP Theorem

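Eric Brewer's CAP theorem (formalized by Gilbert and Lynch, 2002) states that a distributed system cannot guarantee all three of the following properties at once. Since partitions cannot be ruled out, the practical consequence is that during a network partition the system must give up either consistency or availability.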

The three properties

  • Consistency (C): Every read receives the most recent write or an error. All nodes see the same data at the same time (linearizable).
  • Availability (A): Every request receives a non-error response, without guarantee that it reflects the latest write.
  • Partition tolerance (P): The system continues operating despite arbitrary message loss or delay between nodes (network splits).

In practice, partitions happen—switch failures, GC pauses masquerading as timeouts, cross-AZ link cuts. P is not optional in distributed systems. The real choice is CP vs AP during partition: reject requests (sacrifice availability) or serve potentially stale data (sacrifice consistency).

CP systems (HBase, ZooKeeper, etcd): sacrifice availability during partition — reject requests rather than return stale data. Choose for leader election, distributed locks, financial consistency.
Choice | During partition | Examples | Use when
CP | Minority partition unavailable; rejects writes/reads | etcd, ZooKeeper, HBase, traditional RDBMS with sync replication | Financial transactions, leader election, distributed locks, inventory counts
AP | Both partitions serve requests; may diverge temporarily | Cassandra, DynamoDB, CouchDB, DNS, Riak | Shopping carts, social likes, analytics, session data, DNS
CA | Only on single node (no partition possible) | Single-node PostgreSQL/MySQL (no replication) | Not truly distributed; breaks when you add a replica
⚖️ Trade-off

CAP is often misquoted as "pick two at design time." Partitions are rare but inevitable, so the choice is what happens during the partition, not a permanent architectural label. Many systems are mostly available with tunable consistency (DynamoDB: eventually consistent reads by default, strongly consistent reads optional at 2× the read-capacity cost).

🎯 Interview Tip

[L5] Don't stop at "we pick AP." Say: "Shopping cart is AP: merging conflicting carts on reconnect is acceptable. Payment authorization is CP: we fail closed if we can't reach the ledger." Same system, different CAP choices per operation.

🔬 Under the Hood

ZooKeeper (CP) uses the ZAB consensus protocol: during a partition, the minority side stops accepting writes. Cassandra (AP) uses tunable consistency: at consistency level ONE each side of a partition keeps serving independently, and hinted handoff and read repair reconcile replicas later, while QUORUM operations fail on a side that cannot reach a majority of replicas.
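The quorum behaviour described above reduces to two small checks: whether read and write sets must overlap (R + W > N), and whether a side of a partition can still reach enough replicas. The sketch below is a simplified model with hypothetical function names; it ignores hinted handoff, read repair, and multi-datacenter consistency levels.

```python
def quorum(replication_factor: int) -> int:
    """Majority of replicas: floor(N / 2) + 1."""
    return replication_factor // 2 + 1


def reads_see_latest_write(n: int, r: int, w: int) -> bool:
    """Read and write replica sets must intersect when R + W > N."""
    return r + w > n


def can_serve(reachable_replicas: int, required_acks: int) -> bool:
    """A request succeeds only if enough replicas are reachable."""
    return reachable_replicas >= required_acks


if __name__ == "__main__":
    n = 3
    majority_side, minority_side = 2, 1  # a 2 | 1 network split
    print("QUORUM on majority side:", can_serve(majority_side, quorum(n)))
    print("QUORUM on minority side:", can_serve(minority_side, quorum(n)))
    print("ONE on minority side:   ", can_serve(minority_side, 1))
    print("QUORUM/QUORUM overlaps: ", reads_see_latest_write(n, quorum(n), quorum(n)))
    print("ONE/ONE overlaps:       ", reads_see_latest_write(n, 1, 1))
```

The output shows the trade in miniature: level ONE stays available on both sides of the split but gives up the overlap guarantee, while QUORUM keeps the guarantee and loses the minority side.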

PACELC

Daniel Abadi's PACELC extension (2012) adds: Else, even when there is no partition, you still trade off Latency (L) vs Consistency (C). CAP alone misses the normal-case trade-off.

Full form: If Partition → choose A or C. Else (normal operation) → choose L or C.

System | If Partition (PA or PC) | Else (EL or EC) | Classification | Notes
MySQL (InnoDB, single primary) | PC: primary unavailable blocks writes | EC: sync replication waits for replica ack (strong) or async (faster, weaker) | PC/EC | Read replicas give EL for reads; primary writes are EC with semi-sync
Cassandra | PA: both sides accept writes | EL: ONE reads/writes are fast; QUORUM is slower but consistent | PA/EL (default) | Tunable per query: LOCAL_QUORUM, ALL
DynamoDB | PA: multi-AZ continues serving | EL: eventually consistent reads by default; strongly consistent reads optional (2× RCU, somewhat higher latency) | PA/EL | Strongly consistent reads hit the leader replica only
MongoDB | PC: minority partition loses primary | EC: write concern w:1 is fast; w:majority (the default since MongoDB 5.0) waits for replication | PC/EC (configurable) | Replica set elections cause brief unavailability (CP behavior)
⚖️ Trade-off

DynamoDB strong reads cost 2× read capacity units and add ~1 ms latency vs eventual. For a product catalog where stale price for 1 second is acceptable, use eventual (EL). For wallet balance, pay the EC cost or use conditional writes with version checks.

📦 Real World

Amazon Dynamo (the paper behind DynamoDB) explicitly chose PA/EL: shopping cart writes must never fail, and 200 ms staleness on item count is fine. Google Spanner chose PC/EC with TrueTime—paying latency cost (~5–10 ms cross-region) for external consistency globally.

🏆 Senior Signal

[Principal] PACELC reframes database selection as a per-operation decision, not a system-wide label. A Principal engineer maps each API endpoint to its PACELC quadrant and documents the business justification.

Consistency Models

Consistency defines what guarantees readers see after writers commit. Stronger guarantees cost latency and availability; weaker guarantees enable scale and speed.

Consistency is not binary. It's a spectrum from "every read sees the latest write globally" (linearizability) to "reads may be arbitrarily stale" (eventual). Pick the weakest model that satisfies your product requirements.

Model | Guarantee | Latency cost | Example use case
Strong / Linearizable | Every read sees the most recent write; global total order | High: cross-region 50+ ms; requires consensus or single leader | Bank balances, inventory deduction, distributed locks
Sequential | All nodes agree on operation order, but reads may lag slightly | Medium-high | Collaborative editing with ordering
Causal | Cause precedes effect; concurrent ops may be seen in different order | Medium: vector clocks or version chains | Social comments, messaging threads
Read-your-writes | User always sees their own recent writes (session consistency) | Low-medium: sticky routing or session tokens | User profile edits, form submissions, settings pages
Monotonic reads | User never sees older data on subsequent reads | Low: route to same replica or track version | Feed pagination, timeline scrolling
Eventual | If writes stop, all replicas converge given enough time | Lowest: async replication, no coordination | DNS, CDN, view counts, recommendation scores

Read-your-writes in practice

After a user updates their avatar, they must see the new avatar immediately—even if other users see the old one for seconds. Implementation options:

  • Sticky sessions: Route user to the same replica that handled the write (fragile on scale-down).
  • Session token with version: Client sends last-write timestamp; server reads from replica at least that fresh.
  • Write-through cache: Update cache synchronously on write; reads hit cache first.
  • Primary reads for session: User's writes go to primary; their reads also hit primary (others hit replicas).
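The "session token with version" option can be sketched with a toy in-memory store. Everything here (Replica, Store, the integer version used as a token) is hypothetical and stands in for a real replication log position or timestamp; replication is simulated by an explicit method call.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    name: str
    applied_version: int = 0            # last write this node has applied
    data: dict = field(default_factory=dict)


class Store:
    def __init__(self, primary, replicas):
        self.primary = primary
        self.replicas = replicas

    def write(self, key, value):
        """Apply the write on the primary and return its version.
        The client keeps this version as its session token."""
        self.primary.applied_version += 1
        self.primary.data[key] = value
        return self.primary.applied_version

    def replicate(self, replica):
        """Simulate asynchronous replication catching up."""
        replica.data = dict(self.primary.data)
        replica.applied_version = self.primary.applied_version

    def read(self, key, min_version=0):
        """Serve from any replica at least as fresh as the session token;
        fall back to the primary if none has caught up."""
        for replica in self.replicas:
            if replica.applied_version >= min_version:
                return replica.data.get(key), replica.name
        return self.primary.data.get(key), self.primary.name
```

In use, the writer passes its token and is routed to the primary until a replica catches up, while other readers (token 0) may still see the old value from a lagging replica. That is the read-your-writes guarantee without forcing every read onto the primary.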
🔬 Under the Hood

Facebook's TAO cache layer provides read-your-writes by routing a user's reads to the cache partition that processed their write. Cross-user reads may be eventually consistent—exactly the product requirement for social feeds.

⚖️ Trade-off

Strong consistency across regions adds 50–100 ms to every write (consensus round-trip). For a global chat app, causal consistency per conversation is sufficient and 10× faster. For a global ledger, you pay the latency or partition by region with async settlement.

When to use each model

Scenario | Recommended model | Why
Payment / transfer | Strong (linearizable) | Double-spend is unacceptable; fail closed on uncertainty
User profile edit | Read-your-writes | User must see own changes; others can wait seconds
Twitter timeline | Eventual + monotonic reads | Stale tweet for 2 s is fine; going backward in time is not
Inventory (flash sale) | Strong or optimistic locking | Overselling 10 units costs money and trust
Page view counter | Eventual | Exact count unnecessary; approximate is 100× cheaper
Collaborative document | Causal or CRDT | Concurrent edits merge; total order unnecessary
🎯 Interview Tip

Ask the interviewer: "Does the user need to see their own write immediately, or do all users need the latest data globally?" That one question narrows strong → read-your-writes → eventual and saves 10 minutes of over-engineering.

⚠️ Pitfall

Defaulting to "strong consistency everywhere" is the distributed systems equivalent of premature optimization in reverse: you pay 10× latency and availability cost where eventual would suffice. Match the model to the operation, not the system.

Back-of-Envelope Estimation

Every system design interview includes estimation. The goal is not precision—it's demonstrating structured thinking, stated assumptions, and order-of-magnitude correctness. Off by 2× is fine; off by 100× is not.

Powers of two (memorize these)

Power | Exact value | Approximation
2^10 | 1,024 | ~1 thousand (1 KB)
2^20 | 1,048,576 | ~1 million (1 MB)
2^30 | 1,073,741,824 | ~1 billion (1 GB)
2^40 | 1,099,511,627,776 | ~1 trillion (1 TB)
2^50 | 1,125,899,906,842,624 | ~1 quadrillion (1 PB)

Other useful numbers

  • 1 day = 86,400 seconds ≈ 100K seconds (round for mental math)
  • 1 month ≈ 2.5M seconds
  • 1 year ≈ 30M seconds
  • Peak QPS ≈ 2–3× average (sometimes 10× for viral events)
  • 1 server (well-tuned API) ≈ 1K–10K RPS (use 1K for conservative estimates)
  • 1 SSD ≈ 100K IOPS; 1 HDD ≈ 100 IOPS
  • 1 Gbps = 125 MB/s; 10 Gbps = 1.25 GB/s

Estimation framework (5 steps)

  1. Clarify assumptions: DAU, actions per user per day, read/write ratio, payload size.
  2. Calculate QPS: (DAU × ops/user) / 86400; multiply by peak factor.
  3. Calculate storage: DAU × ops × bytes/record × retention days.
  4. Calculate bandwidth: peak QPS × average response/request size.
  5. Calculate servers: peak QPS / per-server capacity; add 30% headroom.
📊 Estimation Template

Say aloud: "I'll assume 300M DAU, 2 tweets read per user per day, peak 3× average. Read QPS = 300M × 2 / 100K = 6K average, ~18K peak. Write QPS = 300M × 0.5 tweets / 100K = 1.5K average, ~4.5K peak."

Worked example: Twitter-scale timeline reads

Assumptions:

  • 300M DAU
  • Each user reads timeline 10×/day (scroll sessions)
  • Each user posts 0.5 tweets/day (150M tweets/day)
  • Peak traffic = 3× average
  • Tweet size ≈ 300 bytes; timeline page ≈ 20 tweets = 6 KB response
  • Retention: 5 years of tweets
READ QPS
  = 300M DAU × 10 reads / 86,400 sec
  = 3B / 86,400 ≈ 35,000 avg
  Peak = 35K × 3 ≈ 105,000 read QPS

WRITE QPS
  = 300M × 0.5 / 86,400 ≈ 1,750 avg
  Peak ≈ 5,250 write QPS

STORAGE (5 years)
  = 150M tweets/day × 300 bytes × 365 × 5
  = 150M × 300 × 1,825 ≈ 82 × 10^12 bytes ≈ 82 TB raw
  With indexes (3×) ≈ 250 TB

BANDWIDTH (peak reads)
  = 105K RPS × 6 KB ≈ 630 MB/s ≈ 5 Gbps

APP SERVERS (stateless API @ 1K RPS each)
  = 105K / 1K ≈ 105 servers + 30% headroom ≈ 140 servers
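The worked example can be turned into a small calculator so the assumptions are easy to vary. This is a sketch of the five-step framework, not a capacity-planning tool; the Workload fields and the estimate function are hypothetical names, with defaults copied from the assumptions listed above. It uses the exact 86,400 seconds per day, so results differ slightly from the rounded hand calculation.

```python
import math
from dataclasses import dataclass

SECONDS_PER_DAY = 86_400


@dataclass
class Workload:
    dau: int = 300_000_000
    reads_per_user_per_day: float = 10
    writes_per_user_per_day: float = 0.5
    peak_factor: float = 3
    bytes_per_record: int = 300
    response_bytes: int = 6_000       # 20 tweets x 300 bytes
    retention_days: int = 5 * 365
    rps_per_server: int = 1_000
    headroom: float = 0.30


def estimate(w: Workload) -> dict:
    """Back-of-envelope QPS, storage, bandwidth, and server count."""
    peak_read_qps = w.dau * w.reads_per_user_per_day / SECONDS_PER_DAY * w.peak_factor
    peak_write_qps = w.dau * w.writes_per_user_per_day / SECONDS_PER_DAY * w.peak_factor
    storage_bytes = (
        w.dau * w.writes_per_user_per_day * w.bytes_per_record * w.retention_days
    )
    bandwidth_bytes_per_s = peak_read_qps * w.response_bytes
    servers = math.ceil(peak_read_qps / w.rps_per_server * (1 + w.headroom))
    return {
        "peak_read_qps": peak_read_qps,
        "peak_write_qps": peak_write_qps,
        "storage_tb": storage_bytes / 1e12,
        "peak_bandwidth_gbps": bandwidth_bytes_per_s * 8 / 1e9,
        "servers": servers,
    }


if __name__ == "__main__":
    for name, value in estimate(Workload()).items():
        print(f"{name}: {value:,.1f}")
```

Changing one field, for example Workload(peak_factor=10) for a viral event, shows immediately which downstream number moves and by how much.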
💡 Pro Tip

Round aggressively: 86,400 → 100K, 300M × 10 = 3B. The interviewer cares about your formula, not whether you remembered that a day has exactly 86,400 seconds.

🏆 Senior Signal

[L6] After computing numbers, sanity-check against known systems: "105K read QPS is roughly Twitter 2012 scale; they'd need fan-out on write or a hybrid model, not naive per-request DB joins." Connect estimation to architecture decisions.

📦 Real World

Instagram engineers report that back-of-envelope estimates during design reviews catch 80% of scaling issues before code is written. A 10-minute estimation at whiteboard time saves months of re-architecture after launch.