System Design Interview Framework

The RADIO Framework

Every strong system design answer follows the same skeleton. RADIO keeps you structured under time pressure and ensures you cover non-functional requirements, data modeling, and trade-offs—not just boxes and arrows.

R Requirements

Functional, non-functional, explicit out-of-scope, clarifying questions

A Architecture

High-level components, data flow, major technology choices

D Data model

Schema, indexes, access patterns, partition/shard key rationale

I Interface

Key APIs, request/response shapes, pagination, error codes

O Optimizations

Bottlenecks, caching, sharding, monitoring, trade-off summary

What to deliver in each phase

Phase	Deliverable	Time budget	Common mistake
R — Requirements	Written list: functional features, NFRs with numbers, out-of-scope items	5 min	Skipping NFRs; vague "low latency" without p99 target
A — Architecture	Diagram: client → edge → services → data stores → async workers	10 min	Drawing microservices for a problem that needs a monolith
D — Data model	Tables/collections, primary keys, indexes, read vs write paths	5 min (often in deep dive)	Schema before knowing access patterns
I — Interface	3–5 critical API endpoints with HTTP method, path, key fields	5 min	REST for everything when WebSocket or gRPC fits better
O — Optimizations	Cache layer, CDN, sharding plan, SPOF elimination, cost note	10–15 min	Optimizing before identifying the actual bottleneck

🎯 Interview Tip

Announce your framework upfront: "I'll use RADIO—start with requirements, then architecture, data model, APIs, and optimizations." Interviewers relax when they see structure; it signals senior communication habits.

45-Minute Time Allocation

Time-box ruthlessly. Candidates who spend 20 minutes on requirements never reach architecture. Candidates who skip estimation guess wrong on every component count.

0–5 min Requirements

Clarify functional scope, DAU/QPS, latency p99, consistency, availability, out-of-scope. Ask 3–5 questions.

5–10 min Estimation

Back-of-envelope: QPS, storage, bandwidth, server count. State every assumption aloud.

10–20 min High-level design

Draw boxes: client, CDN, LB, API, cache, DB, queue, workers. Walk through read and write paths.

20–35 min Deep dive

Interviewer picks: DB schema, cache strategy, fan-out, consistency, API design, failure modes.

35–45 min Scale & wrap

10× traffic, single points of failure, monitoring, trade-offs recap, what you'd build in v1 vs v2.

Adaptive timing by level

Level	Requirements	Deep dive depth	Wrap-up focus
L4	Interviewer may provide numbers; 3 min sufficient	One component (e.g., cache or DB choice)	Happy path + one failure mode
L5	You drive clarification; 5 min expected	Two components + trade-off reasoning	Scale 10×, monitoring, cost awareness
L6	Proactively scope v1/v2; edge cases in requirements	Cross-cutting: consistency + ops + org implications	Multi-region, incident response, evolutionary path
Principal	Frame as platform/product decision, not just technical	Build vs buy, team topology, multi-year migration	TCO, vendor risk, paved-road strategy

⚠️ Pitfall

Starting to draw before clarifying requirements is the #1 red flag. Passive silence while drawing is #2. Think out loud—interviewers cannot score reasoning they cannot hear.

Requirements Deep Dive

Requirements can account for as much as 40% of the interview score at L5+. A crisp requirements section proves you are building the right system, not just a system. Separate functional, non-functional, and explicit out-of-scope items.

Functional requirements

What the system does—user-visible features and core workflows. List 3–5 MVP features; defer nice-to-haves to v2.

URL shortener: Create short URL, redirect to original, optional custom alias, analytics (clicks)
Twitter feed: Post tweet, follow users, view home timeline, like/retweet
Uber: Request ride, match driver, track location, complete payment, rate trip
Chat: Send/receive messages, delivery receipts, group chat, online presence

Non-functional requirements (always quantify)

NFR category	Questions to ask	Example strong answer
Scale	DAU? MAU? Read/write ratio? Data retention?	"100M DAU, 10:1 read:write, 5-year retention"
Latency	p50 vs p99? Which endpoints? Mobile vs web?	"Feed load p99 < 200 ms; post tweet p99 < 500 ms"
Availability	How many nines? Acceptable downtime window?	"99.9% overall (about 8.8 hr/year of downtime); chat needs 99.99%, analytics 99.5%"
Consistency	Strong everywhere or per-operation? Read-your-writes?	"Read-your-writes for user's own posts; eventual for global feed"
Durability	Can we lose data? Backup RPO/RTO?	"Zero message loss; RPO < 1 min, RTO < 15 min"
Security	Auth model? PII? Compliance (GDPR, HIPAA, PCI)?	"OAuth2, encrypted at rest, GDPR right-to-delete"
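The downtime budget behind an availability target is simple arithmetic. A minimal Python sketch, where the function name `downtime_per_year_hours` is illustrative and a 365-day year is assumed:

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def downtime_per_year_hours(availability_pct: float) -> float:
    """Return the yearly downtime budget, in hours, for an availability target."""
    return HOURS_PER_YEAR * (1 - availability_pct / 100)


for target in (99.5, 99.9, 99.99):
    print(f"{target}% -> {downtime_per_year_hours(target):.2f} h/year")
```

Each extra nine cuts the budget by a factor of ten, which is why a 99.99% target (under an hour per year) usually implies automated failover rather than manual recovery.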

Explicit out-of-scope (critical at L5+)

Naming what you are not building saves time and shows product sense:

"v1 excludes: search, recommendations, multi-region, admin dashboard"
"v1 excludes: video upload—text and images only"
"v1 excludes: real-time collaborative editing—single-writer model"
"v1 excludes: ML ranking—chronological feed only"

Questions to ask the interviewer (pick 5–7)

What is the expected scale? (DAU, QPS, data size)
What are the latency requirements? (p50 vs p99, which endpoints)
What consistency guarantees do we need? (strong, eventual, read-your-writes)
What is the read vs write ratio?
How long do we retain data?
Is this mobile-first, web-first, or API for third parties?
Are there compliance requirements? (GDPR, PCI, HIPAA)
Do we need multi-region from day one?
What is acceptable downtime? (availability target)
Are there geographic constraints? (data residency)

🎯 Interview Tip

Write requirements on the board before drawing. Interviewers at Google and Meta explicitly score "problem clarification." A candidate who asks "Does the user need to see their own write immediately?" signals L5+ consistency thinking right away.

📋 Requirements Template

Functional: [feature 1], [feature 2], [feature 3]
Non-functional: [DAU], [p99 latency], [availability], [consistency model]
Out of scope: [v2 item 1], [v2 item 2]

Estimation Template — 6 Steps

Every system design interview includes estimation. Off by 2× is fine; off by 100× fails. The goal is structured thinking with stated assumptions—not precision.

The 6-step framework

Clarify assumptions: DAU, actions per user per day, read/write ratio, payload size, retention period.
Calculate QPS: (DAU × ops/user) / 100K seconds; separate read and write; apply peak multiplier (2–3×).
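Calculate storage: daily writes × bytes/record × retention days; multiply by an overhead factor for indexes and replicas (3× as a rule of thumb; the URL shortener example below uses 2× for indexes and 3× for replicas).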
Calculate bandwidth: peak QPS × average request/response size; convert to Gbps.
Calculate capacity: peak QPS / per-server throughput; add 30% headroom; size DB IOPS separately.
Sanity check & connect to architecture: Compare to known systems; state which component breaks first.
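Steps 2 to 5 are mechanical enough to sketch as a small Python function. This is only an illustration: the function name, the argument names, and the 300-byte record and 2,000-byte response used in the usage note are assumptions, and step 6 (the sanity check) remains a judgment call that code cannot make.

```python
import math

SECONDS_PER_DAY = 100_000  # 86,400 rounded up for mental math


def estimate(dau, reads_per_user, writes_per_user, bytes_per_record,
             retention_days, response_bytes, overhead_factor,
             peak_multiplier=3.0, per_server_rps=1_000, headroom=0.30):
    """Steps 2-5 of the framework; every argument is a stated assumption."""
    # Step 2: average and peak QPS, reads and writes kept separate.
    read_qps = dau * reads_per_user / SECONDS_PER_DAY
    write_qps = dau * writes_per_user / SECONDS_PER_DAY
    peak_read_qps = read_qps * peak_multiplier
    peak_write_qps = write_qps * peak_multiplier
    # Step 3: raw storage times an overhead factor for indexes and replicas.
    storage_bytes = (dau * writes_per_user * bytes_per_record
                     * retention_days * overhead_factor)
    # Step 4: peak read bandwidth in Gbps (bytes -> bits -> giga).
    peak_bandwidth_gbps = peak_read_qps * response_bytes * 8 / 1e9
    # Step 5: API servers for total peak QPS plus headroom.
    total_peak = peak_read_qps + peak_write_qps
    servers = math.ceil(total_peak / per_server_rps * (1 + headroom))
    return {
        "read_qps": read_qps,
        "write_qps": write_qps,
        "peak_read_qps": peak_read_qps,
        "peak_write_qps": peak_write_qps,
        "storage_bytes": storage_bytes,
        "peak_bandwidth_gbps": peak_bandwidth_gbps,
        "servers": servers,
    }
```

Calling `estimate(dau=300_000_000, reads_per_user=10, writes_per_user=0.5, bytes_per_record=300, retention_days=365, response_bytes=2_000, overhead_factor=3.0)` gives 30K average and 90K peak read QPS, and 1.5K average and 4.5K peak write QPS.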

Reference numbers (memorize)

1 day ≈ 100K seconds (86,400 rounded up for mental math)
Peak QPS ≈ 2–3× average (10× for viral events)
1 API server ≈ 1K RPS conservative (10K for static/cached)
1 SSD ≈ 100K IOPS; 1 HDD ≈ 100 IOPS
2^30 bytes ≈ 1 GB; 2^40 bytes ≈ 1 TB; 2^50 bytes ≈ 1 PB

📊 Say This Aloud

"I'll assume 300M DAU, 10 timeline reads per user per day, 0.5 posts per user per day, peak 3× average. Read QPS = 300M × 10 / 100K = 30K avg, ~90K peak. Write QPS = 300M × 0.5 / 100K = 1.5K avg, ~4.5K peak."

Worked example: URL shortener at scale

Assumptions: 500M URLs created over 5 years; 100:1 redirect:creation ratio; 500 bytes per record; peak 2× average.

STEP 1 — Assumptions
  Total URLs: 500M over 5 years
  New URLs/day: 500M / (5 × 365) ≈ 274K/day
  Redirects: 274K × 100 ≈ 27.4M/day
  Record size: 500 bytes (short_code + long_url + metadata)

STEP 2 — QPS
  Write QPS = 274K / 100K ≈ 2.7 avg → 5.4 peak
  Read QPS  = 27.4M / 100K ≈ 274 avg → 548 peak

STEP 3 — Storage (5 years)
  = 500M × 500 bytes ≈ 250 GB raw
  With indexes (2×) + replicas (3×) ≈ 1.5 TB

STEP 4 — Bandwidth (peak redirects)
  = 548 RPS × 1 KB response ≈ 548 KB/s ≈ 4.4 Mbps (negligible)

STEP 5 — Capacity
  API servers: 548 / 1000 ≈ 1 server (reads are cacheable!)
  Cache: 80/20 rule → 20% URLs = 80% traffic → cache top 100M URLs in Redis (~50 GB)

STEP 6 — Sanity check
  548 peak read QPS is modest—cache-aside Redis handles this easily.
  Bottleneck shifts to DB only on cache miss; 99%+ hit rate keeps DB load < 6 QPS.
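The cache sizing in step 5 and the DB load in step 6 can be checked with a few lines of Python. This is a sketch under the example's own assumptions (500M URLs, 500 bytes per record, 548 peak read QPS); the function names and the extra hit rates are illustrative.

```python
TOTAL_URLS = 500_000_000
RECORD_BYTES = 500
PEAK_READ_QPS = 548


def cache_size_gb(total_items, item_bytes, hot_fraction=0.20):
    """Memory needed to hold the hot fraction of items (80/20 rule)."""
    return total_items * hot_fraction * item_bytes / 1e9


def db_qps_on_miss(peak_read_qps, hit_rate):
    """Read QPS that falls through the cache to the database."""
    return peak_read_qps * (1 - hit_rate)


print(f"cache size: {cache_size_gb(TOTAL_URLS, RECORD_BYTES):.0f} GB")
for hit_rate in (0.99, 0.95, 0.80):
    qps = db_qps_on_miss(PEAK_READ_QPS, hit_rate)
    print(f"hit rate {hit_rate:.0%}: {qps:.1f} DB QPS")
```

Even at an 80% hit rate the database sees only about 110 QPS, which supports the conclusion that cache hit rate, not sharding, is the lever that matters at this scale.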

🎯 Interview Tip

Step 6 is what separates L5 from L4: "548 QPS is small—I'd invest in cache hit rate, not sharding. Sharding becomes relevant at 100K+ write QPS or TB-scale hot keys." Connect numbers to design decisions.

High-Level Architecture Template

Interviewers expect a standard diagram shape. Start with this skeleton and specialize per problem. Label read path and write path with different colors or numbered steps.

flowchart TB
  subgraph clients [Clients]
    WEB[Web / Mobile]
    API_CLIENT[Third-party API]
  end

  subgraph edge [Edge Layer]
    CDN[CDN / Static Assets]
    LB[Load Balancer]
    RL[Rate Limiter]
  end

  subgraph services [Application Layer]
    API[API Servers - stateless]
    WORK[Async Workers]
  end

  subgraph cache [Cache Layer]
    REDIS[(Redis / Memcached)]
  end

  subgraph data [Data Layer]
    DB[(Primary Database)]
    REPLICA[(Read Replicas)]
    SEARCH[(Search Index)]
  end

  subgraph async [Async Layer]
    QUEUE[Message Queue / Kafka]
    DLQ[Dead Letter Queue]
  end

  subgraph obs [Observability]
    LOGS[Logs / Metrics / Traces]
  end

  WEB --> CDN
  WEB --> LB
  API_CLIENT --> LB
  LB --> RL --> API
  API --> REDIS
  API --> DB
  API --> REPLICA
  API --> QUEUE
  QUEUE --> WORK
  WORK --> DB
  WORK --> SEARCH
  DB --> REPLICA
  API --> LOGS
  QUEUE -.-> DLQ

Read path walkthrough (say this while drawing)

Client request hits CDN for static assets; dynamic requests go to load balancer
Rate limiter checks token bucket per user/IP
Stateless API server checks cache (Redis) first
On cache miss, query a read replica rather than the primary (read-your-writes paths are the exception and may need the primary)
Populate cache on miss; return response with cache-control headers
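The cache-then-replica part of the read path can be sketched in Python as follows. In-memory dicts stand in for Redis and the read replica, and the class name `CacheAsideReader` and the 60-second TTL are assumptions made for illustration.

```python
import time


class CacheAsideReader:
    """Cache-aside read: check the cache, fall back to a replica, populate."""

    def __init__(self, replica: dict, ttl_seconds: float = 60.0):
        self.replica = replica  # stand-in for a read replica
        self.cache = {}         # key -> (value, expires_at); stand-in for Redis
        self.ttl = ttl_seconds
        self.replica_reads = 0  # counter so the hit/miss behavior is visible

    def get(self, key):
        entry = self.cache.get(key)
        if entry is not None and entry[1] > time.monotonic():
            return entry[0]                      # cache hit
        self.replica_reads += 1                  # cache miss: query the replica
        value = self.replica.get(key)
        if value is not None:                    # populate the cache on a miss
            self.cache[key] = (value, time.monotonic() + self.ttl)
        return value
```

Two consecutive `get` calls for the same key reach the replica only once; the second is served from the cache until the TTL expires.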

Write path walkthrough

API validates request, checks idempotency key
Write to primary database synchronously
Invalidate or update cache entry (write-through or cache-aside invalidation)
Publish event to queue for async side effects (search index, notifications, analytics)
Worker processes event; failures go to DLQ after N retries
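The five write-path steps can be sketched in Python as below. Dicts, a deque, and a list stand in for the primary database, cache, queue, and DLQ; the class name `WritePath` and the retry limit of 3 are illustrative assumptions.

```python
from collections import deque


class WritePath:
    def __init__(self, max_retries=3):
        self.db = {}          # stand-in for the primary database
        self.cache = {}       # stand-in for Redis
        self.seen_keys = {}   # idempotency_key -> stored result
        self.queue = deque()  # stand-in for the message queue
        self.dlq = []         # dead letter queue
        self.max_retries = max_retries

    def handle_write(self, idempotency_key, record_id, payload):
        if idempotency_key in self.seen_keys:        # replayed request
            return self.seen_keys[idempotency_key]
        self.db[record_id] = payload                 # synchronous primary write
        self.cache.pop(record_id, None)              # invalidate the cache entry
        self.queue.append({"record_id": record_id, "attempts": 0})
        result = {"status": "created", "record_id": record_id}
        self.seen_keys[idempotency_key] = result
        return result

    def drain(self, handler):
        """Run queued events; move an event to the DLQ after max_retries."""
        while self.queue:
            event = self.queue.popleft()
            try:
                handler(event)
            except Exception:
                event["attempts"] += 1
                if event["attempts"] >= self.max_retries:
                    self.dlq.append(event)
                else:
                    self.queue.append(event)         # retry later
```

A real system would store the idempotency key in the same transaction as the write and add backoff between retries; both are omitted here to keep the flow visible.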

Component checklist

Component	When to include	Technology examples
CDN	Static assets, cacheable API responses, global users	CloudFront, Cloudflare, Akamai
Load balancer	Always for multi-instance deployments	ALB, NGINX, HAProxy
Rate limiter	Public APIs, abuse-prone endpoints	Redis token bucket, API gateway
Cache	Read-heavy, hot keys, session data	Redis, Memcached
Message queue	Async processing, decoupling, peak smoothing	Kafka, SQS, RabbitMQ
Search index	Full-text search, faceted filtering	Elasticsearch, OpenSearch
Blob storage	Images, videos, files	S3, GCS, MinIO
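The token bucket named in the rate limiter row is small enough to sketch in Python. This single-process version is an assumption-laden illustration: a production limiter would keep the bucket state in a shared store such as Redis, and the class and parameter names here are hypothetical. Time is passed in explicitly so the behavior is easy to test.

```python
class TokenBucket:
    """Allow bursts up to `capacity`, refilled at `refill_per_second`."""

    def __init__(self, capacity, refill_per_second, now=0.0):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.tokens = float(capacity)   # start with a full bucket
        self.last = now

    def allow(self, now):
        # Refill for the time elapsed since the last call, capped at capacity.
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity,
                          self.tokens + elapsed * self.refill_per_second)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0          # spend one token for this request
            return True
        return False                    # caller would respond with HTTP 429
```

With `TokenBucket(capacity=2, refill_per_second=1.0)`, two requests at the same instant pass, a third is rejected, and one more is allowed a second later.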

🎯 Interview Tip

Draw the read path first (interviewers usually care most about the hot path), then add write path and async workers. Mention observability last—it signals L5+ operational maturity without eating clock time.

Deep-Dive Decision Framework

The deep-dive phase is where levels separate. Use these decision trees when the interviewer says "let's go deeper on the database" or "how would you handle consistency?"

Database selection

Signal	Choose	Avoid
ACID transactions, complex joins, <10K TPS	PostgreSQL / MySQL	Cassandra for relational queries
High write throughput, flexible schema, partition key known	Cassandra / DynamoDB	Single PostgreSQL primary
Sub-ms reads, ephemeral data, leaderboards	Redis	PostgreSQL for hot cache data
Full-text search, faceted queries	Elasticsearch	SQL LIKE queries at scale
Graph traversals (friends-of-friends)	Neo4j / graph layer on SQL	Recursive SQL at billion-edge scale
Time-series metrics, append-only	TimescaleDB / InfluxDB	Row-oriented OLTP

Cache strategy

Pattern	When	Risk
Cache-aside	Read-heavy, cache miss acceptable	Stampede on miss; stale after write without invalidation
Write-through	Cache and DB must stay in sync	Write latency; cache memory limits
Write-behind	Write-heavy, eventual DB sync OK	Data loss if cache fails before flush
CDN edge	Static or semi-static content, global users	Invalidation propagation delay
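The stampede risk listed for cache-aside is commonly reduced by letting only one caller load a missing key while the others wait. A minimal single-process Python sketch of that idea follows; the class name `SingleFlightCache` is hypothetical, and a distributed setup would need a shared lock or request coalescing at the cache tier instead.

```python
import threading


class SingleFlightCache:
    """Coalesce concurrent misses so the loader runs once per key."""

    def __init__(self, loader):
        self._loader = loader            # function that reads from the DB
        self._cache = {}
        self._locks = {}                 # one lock per key
        self._guard = threading.Lock()   # protects the lock table

    def get(self, key):
        if key in self._cache:
            return self._cache[key]      # fast path: cache hit
        with self._guard:
            lock = self._locks.setdefault(key, threading.Lock())
        with lock:
            # Re-check: another thread may have loaded the key while we waited.
            if key not in self._cache:
                self._cache[key] = self._loader(key)
            return self._cache[key]
```

Eight threads requesting the same cold key trigger a single database load instead of eight.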

Sharding

Shard key: Choose a high-cardinality key that matches the dominant access pattern (user_id or tenant_id, not country_code)
Strategy: Hash-based (even distribution) vs range-based (range queries, hot spots)
Resharding: Consistent hashing with virtual nodes minimizes remapping
Cross-shard queries: Scatter-gather (expensive) or denormalize to avoid them
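Consistent hashing with virtual nodes can be sketched in a few lines of Python. This is an illustrative ring, not any particular database's implementation: the class name `HashRing`, the use of MD5 as the hash, and the default of 100 virtual nodes are all assumptions.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    """Map a string to a position on the ring (MD5 chosen for spread only)."""
    return int(hashlib.md5(value.encode()).hexdigest(), 16)


class HashRing:
    def __init__(self, nodes=(), vnodes=100):
        self._vnodes = vnodes
        self._ring = []  # sorted (position, node) pairs
        for node in nodes:
            self.add_node(node)

    def add_node(self, node):
        # Each physical node owns many small arcs of the ring.
        for i in range(self._vnodes):
            bisect.insort(self._ring, (_hash(f"{node}#{i}"), node))

    def get_node(self, key):
        if not self._ring:
            raise LookupError("ring is empty")
        # First virtual node at or after the key's position, wrapping around.
        idx = bisect.bisect_left(self._ring, (_hash(key),))
        return self._ring[idx % len(self._ring)][1]
```

Adding a fifth node to a four-node ring moves only the keys that the new node takes over, roughly one fifth of them, whereas `hash(key) % N` would remap most keys.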

Replication

Topology	Consistency	Use when
Single-leader	Strong on leader; eventual on replicas	Most OLTP; PostgreSQL, MySQL
Multi-leader	Conflict resolution needed	Multi-region writes; CRDTs or LWW
Leaderless (quorum)	Tunable (W+R>N)	High write availability; Cassandra, Riak, and other Dynamo-style stores
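The W+R>N condition in the leaderless row is easy to reason about with a small Python helper. The function names are illustrative; N is the replica count, W the write acknowledgements required, and R the replicas consulted per read.

```python
def quorum_overlaps(n: int, w: int, r: int) -> bool:
    """True when every read set must intersect every write set."""
    return w + r > n


def tolerated_failures(n: int, w: int, r: int) -> int:
    """Replicas that can be down while reads and writes both still succeed."""
    return n - max(w, r)


for n, w, r in ((3, 2, 2), (3, 1, 1), (5, 3, 3)):
    print(f"N={n} W={w} R={r}: overlap={quorum_overlaps(n, w, r)}, "
          f"tolerates {tolerated_failures(n, w, r)} down")
```

N=3 with W=2 and R=2 overlaps and survives one failed replica; W=1 and R=1 is faster and more available but allows stale reads.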

API design

Concern	Decision
CRUD resources	REST with nouns, HTTP verbs, proper status codes
Real-time bidirectional	WebSocket (SSE is enough when only server-to-client push is needed); avoid polling
Internal service-to-service	gRPC (binary, streaming, schema via Protobuf)
Pagination	Cursor-based (stable under concurrent inserts); avoid offset at scale
Versioning	URL prefix (/v1/) or Accept header; never break existing clients
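Cursor-based pagination from the table above can be sketched in Python as follows. The cursor format (base64-encoded JSON holding the last seen id) and the function names are assumptions chosen for illustration; rows are assumed to be sorted newest first by a unique, increasing id.

```python
import base64
import json


def encode_cursor(last_id: int) -> str:
    raw = json.dumps({"last_id": last_id}).encode()
    return base64.urlsafe_b64encode(raw).decode()


def decode_cursor(cursor: str) -> int:
    return json.loads(base64.urlsafe_b64decode(cursor.encode()))["last_id"]


def list_page(rows, limit, cursor=None):
    """Return (page, next_cursor) for rows sorted by descending id."""
    last_id = decode_cursor(cursor) if cursor else None
    # Equivalent to: WHERE id < :last_id ORDER BY id DESC LIMIT :limit
    eligible = [r for r in rows if last_id is None or r["id"] < last_id]
    page = eligible[:limit]
    has_more = len(eligible) > limit
    next_cursor = encode_cursor(page[-1]["id"]) if has_more else None
    return page, next_cursor
```

Because the cursor anchors on the last id rather than a row count, a new row inserted at the top between two requests does not shift the second page; an offset-based query would return a duplicate in that case.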

Messaging

Queue (point-to-point): Task distribution, work stealing, one consumer per message
Pub/sub: Event broadcast, multiple subscribers, fan-out decoupling
Event log (Kafka): Replay, audit trail, stream processing, retention
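Delivery: At-least-once + idempotent consumer for most cases; when money moves, aim for effectively-once processing (idempotency keys plus transactional writes), since exactly-once delivery cannot be guaranteed end to end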
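An idempotent consumer for at-least-once delivery can be sketched in Python like this. The in-memory set stands in for a dedup table keyed by `message_id`, and the class name is hypothetical.

```python
class IdempotentConsumer:
    """Apply each message at most once, even if it is delivered repeatedly."""

    def __init__(self, handler):
        self._handler = handler
        self._processed = set()  # stand-in for a dedup table keyed by message_id

    def consume(self, message) -> bool:
        message_id = message["message_id"]
        if message_id in self._processed:
            return False          # duplicate delivery: skip the side effect
        self._handler(message)
        # Recorded only after the handler succeeds, so failures are retried.
        self._processed.add(message_id)
        return True
```

In a real system the handler's side effect and the dedup record should commit in the same transaction; otherwise a crash between the two steps can still apply a message twice.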

Consistency

Requirement	Model	Implementation
Bank balance, inventory	Strong / linearizable	Single leader, consensus, or conditional writes
User sees own edit immediately	Read-your-writes	Sticky routing, session token, write-through cache
Feed/timeline ordering	Monotonic reads + eventual	Per-user cursor, versioned cache entries
View counts, analytics	Eventual	Async aggregation, approximate counters
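One way to implement the read-your-writes row is a session token that records the version of the user's last write, with reads routed to the primary until the replica has caught up. The Python sketch below assumes a single replica and a simple integer version counter; the class and method names are illustrative.

```python
class ReplicatedStore:
    def __init__(self):
        self.primary = {}
        self.primary_version = 0
        self.replica = {}
        self.replica_version = 0   # last primary version the replica applied

    def write(self, key, value) -> int:
        """Write to the primary; the caller keeps the version as a session token."""
        self.primary_version += 1
        self.primary[key] = value
        return self.primary_version

    def replicate(self):
        """Simulate asynchronous replication catching up."""
        self.replica = dict(self.primary)
        self.replica_version = self.primary_version

    def read(self, key, session_version=0):
        """Serve from the replica only if it has seen the session's last write."""
        if self.replica_version >= session_version:
            return self.replica.get(key), "replica"
        return self.primary.get(key), "primary"
```

The writer sees the new value immediately from the primary, while other users keep reading from the replica and see it once replication catches up.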

🎯 Interview Tip

For each deep-dive choice, state: requirement → option A vs B → pick B because [metric]. "We need read-your-writes for profile edits, so reads hit primary for the session, not stale replicas."

Scaling Deep Dive

Interviewers often end with "what happens at 10× traffic?" or "what's your biggest bottleneck?" Have a structured answer ready—identify the constraint, propose mitigation, state the trade-off.

Scaling checklist (apply in order)

Optimize before scale: Caching, DB indexes, query tuning, connection pooling
Scale stateless tiers: Horizontal pod/instance scaling behind LB
Scale reads: Read replicas, CDN, materialized views
Scale writes: Sharding, partitioning, async write-behind
Scale data: Archival, tiered storage, compression
Scale geography: Multi-region cells, geo-DNS routing

Identify the bottleneck

Symptom	Likely bottleneck	Mitigation
High p99 on reads	DB or cache miss storm	Cache warming, read replicas, denormalize
Write latency spikes	Primary DB saturation	Shard writes, async queue, batch writes
Queue lag growing	Consumer throughput	Scale consumers (≤ partition count), optimize handler
Single shard hot	Bad shard key (celebrity user)	Key splitting, separate hot-key cache tier
Cross-region latency	Physics (50ms RTT)	Regional cells, async replication, CRDTs
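Key splitting for a hot shard can be illustrated with a salted counter in Python: writes to one logical key are spread across several sub-keys, and reads sum them. The class name, the `key#salt` naming scheme, and the default of 8 splits are assumptions.

```python
import random
from collections import defaultdict


class SaltedCounter:
    """Spread increments for one hot key across `splits` physical keys."""

    def __init__(self, splits=8, rng=None):
        self.splits = splits
        self._rng = rng or random.Random()
        self.store = defaultdict(int)   # stand-in for a sharded key-value store

    def increment(self, key, amount=1):
        salt = self._rng.randrange(self.splits)   # pick a sub-key at random
        self.store[f"{key}#{salt}"] += amount

    def read(self, key) -> int:
        # Reads pay for the split: one lookup per sub-key, then a sum.
        return sum(self.store.get(f"{key}#{i}", 0) for i in range(self.splits))
```

The trade-off is explicit: write load on the hot key drops by roughly the number of splits, while each read fans out to every sub-key.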

Single points of failure (always address)

Load balancer → DNS failover or multi-LB with anycast
Primary database → automated failover to replica (RDS Multi-AZ, Patroni)
Redis single node → Redis Cluster or Sentinel
Single Kafka broker → replication factor ≥ 3, min.insync.replicas = 2
Single region → multi-AZ minimum; multi-region for DR

Monitoring & alerting (L5+ signal)

Google's four golden signals applied to your design:

Latency: p50/p99 per endpoint; alert on p99 SLO breach
Traffic: QPS, queue depth, connection count
Errors: 5xx rate, DLQ depth, failed health checks
Saturation: CPU, memory, DB connections, disk IOPS, cache memory
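The latency signal depends on percentiles rather than averages. A minimal Python sketch of a p99 SLO check follows, using the nearest-rank method; the function names and the sample values in the usage note are illustrative, and real monitoring systems typically compute percentiles from histograms instead of raw samples.

```python
import math


def percentile(samples, pct: int):
    """Nearest-rank percentile of a non-empty list of numbers."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def breaches_slo(latencies_ms, p99_target_ms) -> bool:
    """True when the observed p99 exceeds the SLO target."""
    return percentile(latencies_ms, 99) > p99_target_ms
```

For 100 samples where 97 take 100 ms and three take 150, 900, and 950 ms, the median is still 100 ms while the p99 is 900 ms, so a 200 ms p99 target is breached even though the average looks healthy.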

🎯 Interview Tip

Close the scaling discussion with a prioritization: "First bottleneck at 10× is the primary DB on the write path, so I'd add sharding by user_id before investing in multi-region. Multi-region is a v2 concern unless compliance requires it."

Common Follow-Up Q&A

Interviewers probe with "what if" scenarios. Strong answers acknowledge the failure, describe detection, mitigation, and user-visible impact—without panicking or over-engineering.

Reliability & failure scenarios

Question	Strong answer skeleton
What if the cache goes down?	Cache is an optimization, not source of truth. Circuit breaker to DB; rate-limit DB queries; auto-scale read replicas; cache rebuilds on recovery. Expect 10× latency spike—degrade non-critical features.
What if the database primary fails?	Automated failover to replica (30–60s). Brief write unavailability. Queue writes in Kafka during failover if zero-loss required. Monitor replication lag to ensure promoted replica is fresh.
What if a message is processed twice?	Design for at-least-once delivery. Idempotent consumer with dedup table keyed by message_id. Natural idempotency (PUT with idempotency key) preferred over application-level dedup.
What if traffic spikes 100× (viral event)?	Pre-warmed auto-scaling; CDN absorbs read spike; queue buffers write spike; load shedding on non-critical paths; rate limit new signups if needed. Katy Perry tweet → fan-out on read for celebrity accounts.
How do you prevent duplicate payments?	Idempotency key per checkout session; DB unique constraint on (user_id, idempotency_key); return cached result on replay within 24h window.
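The unique-constraint approach to duplicate payments can be demonstrated with Python's built-in `sqlite3` module. The table layout and the `charge` function are assumptions made for illustration, and the 24-hour replay window is not modeled.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE payments (
           user_id TEXT NOT NULL,
           idempotency_key TEXT NOT NULL,
           amount_cents INTEGER NOT NULL,
           UNIQUE (user_id, idempotency_key)
       )"""
)


def charge(user_id: str, idempotency_key: str, amount_cents: int) -> str:
    """Insert the payment once; a replay hits the unique constraint."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO payments VALUES (?, ?, ?)",
                (user_id, idempotency_key, amount_cents),
            )
        return "charged"
    except sqlite3.IntegrityError:
        return "replayed"  # a real service would return the stored result
```

Letting the database enforce uniqueness avoids the check-then-insert race that an application-level lookup would have under concurrent retries.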

Design choice probes

Question	Strong answer skeleton
Why not use a monolith?	At current scale ([QPS]), monolith is fine—simpler ops. I'd extract services when team boundaries or independent scaling needs emerge (Conway's law). Premature microservices add network latency and distributed tracing burden.
Why SQL over NoSQL?	Access pattern needs ACID transactions and joins ([example query]). Write QPS is [X], within single-node PostgreSQL capacity. Would revisit at [threshold] with sharding or Cassandra.
How would you migrate without downtime?	Dual-write to old and new store → backfill historical data → verify consistency → switch reads to new → stop dual-write → decommission old. Expand-contract for schema changes.
How do you handle hot keys?	Detect via metrics (single shard QPS anomaly). Mitigate: local in-process cache for hot key, read replicas dedicated to hot shard, key splitting with salting, separate celebrity tier.
How do you ensure security?	OAuth2/JWT auth, RBAC, TLS everywhere, encrypt PII at rest, rate limiting, input validation, audit log for sensitive ops, principle of least privilege for service accounts.

🎯 Interview Tip

Answer failure questions with a template: Detect → Mitigate → User impact → Long-term fix. "We'd detect via Redis health check in 10s, mitigate by circuit-breaking to DB with rate limit, users see 2× latency for 30s, long-term fix is Redis Cluster with automatic failover."
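The "circuit-breaking to DB" mitigation can be sketched as a small Python class. The thresholds, the class name, and the explicit `now` argument are assumptions for illustration; production code would more likely use an existing resilience library.

```python
class CircuitBreaker:
    """Skip a failing dependency for a cool-down period after repeated errors."""

    def __init__(self, failure_threshold=3, reset_after_s=30.0):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.failures = 0
        self.opened_at = None   # None means the circuit is closed

    def call(self, primary, fallback, now):
        if self.opened_at is not None:
            if now - self.opened_at < self.reset_after_s:
                return fallback()                      # open: do not touch primary
            # Half-open: allow one trial call; a single failure reopens.
            self.opened_at = None
            self.failures = self.failure_threshold - 1
        try:
            result = primary()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = now                   # trip the breaker
            return fallback()
        self.failures = 0                              # success closes the circuit
        return result
```

Here `primary` would be the cache read and `fallback` a rate-limited database read, which matches the Detect → Mitigate steps of the template.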

Red Flags & Green Flags

Interviewers maintain mental checklists. Avoid the red flags that fail candidates in the first 15 minutes. Hit the green flags that signal senior engineering instincts.

Red flags

Behaviors that fail interviews

Drawing before clarifying — jumping to architecture without requirements
Silent drawing — not narrating thought process aloud
Buzzword bingo — "we'll use Kafka and microservices" without justification
No numbers — "low latency" and "high scale" without QPS or p99 targets
Single design — presenting one solution without mentioning alternatives considered
Ignoring failures — no mention of SPOFs, failover, or degraded mode
Over-engineering L4 problems — multi-region Kubernetes for a URL shortener at 1K QPS
Under-engineering L6 problems — single PostgreSQL for global write-heavy system at 100K TPS
No trade-offs — claiming design has no downsides
Arguing with interviewer — defensive when probed; collaboration beats combat
Infinite scope — trying to build every feature instead of scoping v1
Wrong consistency default — strong consistency everywhere without business justification

Green flags

Behaviors that pass interviews

Structured framework — announces RADIO or equivalent before starting
Clarifying questions — asks about scale, latency, consistency before designing
Stated assumptions — "I'll assume 100M DAU unless you have a different number"
Back-of-envelope math — QPS, storage, bandwidth with rounding shown
Trade-off articulation — "I chose X over Y because [requirement]; downside is [cost]"
Failure awareness — proactively identifies SPOFs and proposes mitigation
Appropriate scope — clear v1 vs v2 boundary with out-of-scope list
Operational maturity — mentions monitoring, alerting, runbooks, SLOs
Access-pattern-driven design — schema and shard key from query patterns, not nouns
Sanity check — compares estimates to known real-world systems
Collaborative tone — "Does this direction make sense?" invites interviewer input
Depth on demand — goes deep when probed, stays high-level when not

⚠️ Pitfall

Saying "we need low latency" without a number is a red flag. Strong answer: "Feed load p99 < 200 ms; post creation p99 < 500 ms." Quantified NFRs prove you've shipped production systems.

Level-Specific Expectations

The same question has different passing bars at L4 vs L6. Calibrate depth, scope, and trade-off sophistication to your target level—over-preparing wastes time; under-preparing fails loops.

L4 · Mid

Fundamentals & happy path

Complete RADIO with interviewer providing some numbers
Correct high-level diagram with client, LB, API, DB, cache
Basic estimation (order of magnitude QPS and storage)
One deep-dive area handled competently (e.g., cache-aside)
Knows CAP at high level; picks SQL or NoSQL with simple justification
Typical problems: URL shortener, rate limiter, key-value store, parking lot

L5 · Senior

Trade-offs & failure modes

Drives requirements clarification independently; quantifies all NFRs
Estimation connected to architecture ("90K read QPS → need fan-out on write")
Multiple deep dives with explicit trade-offs (fan-out on write vs read)
Failure scenarios: cache down, DB failover, duplicate messages
Operational awareness: monitoring, DLQ, idempotency, graceful degradation
Typical problems: Twitter feed, YouTube, Uber, WhatsApp, Dropbox, notification system

L6 · Staff

Cross-cutting & scale edge cases

Scopes v1/v2 with evolutionary architecture path (strangler fig, dual-write migration)
Consistency model chosen per operation with PACELC reasoning
Multi-region, hot-key, and celebrity-user edge cases proactively addressed
Cross-team implications: data contracts, schema evolution, on-call burden
Cost analysis: "$X/month at this scale; tiered storage saves Y%"
Typical problems: Google Maps, Facebook search, distributed DB, ad aggregation, stock exchange

Principal

Platform & org-level strategy

Frames problem as product/platform decision, not purely technical
Build vs buy with TCO, vendor lock-in, and team capability assessment
Conway's law and team topologies inform architecture boundaries
Multi-year migration strategy with risk milestones and rollback plans
Paved roads and golden paths for organizational scale
Typical topics: platform design, multi-year re-architecture, org design implications

🏆 Senior Signal

L6: End every interview with a 30-second recap: "v1 is single-region with fan-out on write for users under 10K followers, Redis cache for hot timelines, Cassandra for tweet storage sharded by tweet_id. v2 adds multi-region read replicas and search. Biggest risk is celebrity tweet hot keys, mitigated with hybrid fan-out."

🎯 Interview Tip

Practice 3 problems at your target level and 1 problem one level above. L5 candidates should nail Twitter and attempt Google Maps. Depth at the right altitude matters more than breadth across 20 problems.