Real-World Case Studies
Eight canonical system design problems—each one appeared in FAANG interviews and powers production systems at billion-user scale. Every section walks through requirements, back-of-envelope numbers, architecture diagrams, deep dives into the hard parts, and the trade-offs elite teams actually make. These are not toy diagrams: they are the patterns behind bit.ly, Twitter, YouTube, Uber, WhatsApp, and Google Maps.
How to use these case studies
Each case study follows the same structure: functional requirements → non-functional requirements → scale estimation → high-level architecture → deep dives → trade-offs. Switch to Interview track for tips on what interviewers probe at each level.
System design interviews rarely ask you to reproduce production systems verbatim. They test whether you can clarify requirements, estimate scale, draw a coherent architecture, and reason about failure modes. These eight problems cover the majority of pattern combinations you'll encounter: read-heavy caching, write-heavy fan-out, geospatial indexing, stream processing, distributed consensus trade-offs, and CDN delivery.
| Case study | Primary patterns | Typical level | Hardest deep dive |
|---|---|---|---|
| URL Shortener | ID generation, cache-aside, async analytics | L4 | 301 vs 302, collision-free IDs at scale |
| Rate Limiter | Token bucket, Redis Lua, distributed sync | L4–L5 | Sliding window accuracy vs memory |
| Key-Value Store | Hash table, WAL, replication, consistent hashing | L5 | Quorum reads/writes, anti-entropy repair |
| Twitter Feed | Hybrid fan-out, sorted sets, celebrity problem | L5–L6 | Fan-out on write vs read for 50M followers |
| YouTube | Upload pipeline, transcoding, CDN, recommendations | L5–L6 | Adaptive bitrate, cold-start recommendations |
| Uber | Geospatial index, matching, surge, state machine | L5–L6 | Real-time driver location at city scale |
| Cassandra messaging, E2E encryption, receipts | L6 | 100B msgs/day partition strategy | |
| Google Maps | Graph routing, tile CDN, real-time traffic | L6 | Contraction hierarchies, sub-second routing |
Pick 3–4 case studies and practice the full 45-minute RADIO flow on each: Requirements (5 min), Architecture (10 min), Data model (5 min), Interface/API (5 min), Optimizations & deep dives (15 min), Bottlenecks (5 min). Interviewers reward structured thinking over perfect diagrams.
1. URL Shortener (bit.ly)
The canonical L4 interview question. Shorten long URLs into compact aliases, resolve them at read time with sub-10 ms latency, and optionally track click analytics without blocking the redirect path.
Functional requirements
- Create short URL — given a long URL, return a unique short code (custom alias optional for premium users)
- Redirect —
GET /{shortCode}resolves to original URL via HTTP redirect - Analytics (optional) — track clicks: timestamp, referrer, geo, user-agent
- Expiration — TTL on links; soft-delete vs hard-delete policy
- Custom aliases — user-chosen codes with uniqueness check
Non-functional requirements
| Requirement | Target | Why it matters |
|---|---|---|
| Read latency | p99 < 10 ms | Redirects are user-facing; every ms hurts conversion |
| Availability | 99.99% on read path | Broken links damage brand trust instantly |
| Write throughput | ~1K–10K new URLs/sec peak | Lower than reads but must never block redirects |
| Read/write ratio | 100:1 typical | Optimize aggressively for read path |
| URL uniqueness | Global, collision-free | Duplicate codes = wrong redirects = security incident |
Scale estimation
Assumptions: 500M DAU, 2 link creations + 20 redirects per user per day.
Writes: 500M × 2 / 86400 ≈ 12K writes/sec (peak ×3 → 36K/sec).
Reads: 500M × 20 / 86400 ≈ 116K reads/sec (peak ×3 → 350K RPS).
Storage (10 years): 500M × 2 × 365 × 10 = 3.65T URLs × ~500 bytes = ~1.8 PB (plan tiered archival).
Bandwidth: 350K RPS × 500 bytes redirect response ≈ 175 MB/s (negligible vs CDN).
High-level architecture
flowchart TB Client[Client / Browser] LB[Load Balancer] API[Redirect Service] Create[Create Service] Redis[(Redis Cache\nshortCode → longURL)] MySQL[(MySQL\nsource of truth)] Kafka[Kafka\nclick events] Analytics[Analytics Pipeline\nFlink → ClickHouse] Client -->|GET /abc123| LB --> API API --> Redis Redis -->|miss| MySQL API -->|async| Kafka Client -->|POST /shorten| LB --> Create Create --> MySQL Create --> Redis Kafka --> Analytics
ID generation strategies
| Strategy | Example | Pros | Cons |
|---|---|---|---|
| Auto-increment + Base62 | 123456789 → "8M0kX" |
Simple, dense, no collision check | Single DB writer bottleneck; predictable/enumerable IDs |
| Snowflake / UUID | 64-bit time-ordered ID | Distributed, no DB coordination | Longer codes; need encoding scheme |
| Hash of URL + counter | MD5 prefix + retry on collision | Same long URL → same short URL (dedup) | Collision handling complexity at scale |
| Pre-generated ID pool | Range allocator per server | Zero contention at write time | Pool exhaustion, range management ops |
L5
Production choice for bit.ly-style systems: Snowflake IDs encoded in Base62 for distributed creation,
with MySQL as source of truth and Redis cache-aside on read. Base62 uses [a-zA-Z0-9] — 7 chars = 627 ≈ 3.5 trillion unique codes.
# Base62 encode — 7 chars covers 3.5T IDs
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
def encode_base62(num: int) -> str:
if num == 0:
return ALPHABET[0]
chars = []
while num:
num, rem = divmod(num, 62)
chars.append(ALPHABET[rem])
return "".join(reversed(chars))
# Snowflake: 41-bit timestamp | 10-bit machine | 12-bit sequence
# → ~4096 IDs/ms per machine, sortable, no DB round-trip
301 vs 302 redirect — deep dive
This is the most common L4 trap question. The choice affects SEO, analytics accuracy, and cache behavior.
| Status | Browser behavior | Use when |
|---|---|---|
| 301 Moved Permanently | Browser caches redirect; may skip shortener on repeat visits | Permanent links, SEO consolidation, minimal server load after first hit |
| 302 Found (Temporary) | Browser always hits shortener first | Analytics on every click, A/B destination testing, expiring links |
| 307 Temporary | Preserves HTTP method (POST stays POST) | API redirects, non-GET flows |
301 reduces server load (browser caches) but under-counts analytics after first visit. 302 guarantees every click is counted but doubles latency on repeat visits. bit.ly uses 302 for marketing/analytics products; internal/permanent links may use 301. Say both options in interviews and tie to product requirements.
MySQL + Redis read path
GET shortCodefrom Redis (short:abc123 → longURL)- On miss: query MySQL by primary key (short_code indexed), populate Redis with TTL (24h–7d)
- Return 302 redirect; fire-and-forget click event to Kafka
- On create: INSERT MySQL, SET Redis, return short URL
MySQL schema: short_code VARCHAR(10) PK, long_url TEXT, user_id,
created_at, expires_at, click_count BIGINT (approximate via periodic flush from analytics).
Shard MySQL by hash(short_code) when write QPS exceeds single-node capacity (~10K/sec sustained writes).
Analytics via Kafka
sequenceDiagram participant Redirect as Redirect Service participant Kafka as Kafka participant Flink as Flink / Spark participant CH as ClickHouse participant Dash as Analytics Dashboard Redirect->>Kafka: click event (async, non-blocking) Note over Redirect: never wait for analytics Kafka->>Flink: stream aggregate Flink->>CH: hourly rollups CH->>Dash: query p99 < 2s
Click events: { short_code, timestamp, ip, user_agent, referrer, geo }.
Partition Kafka by short_code for ordered per-link aggregation.
Never block redirect on analytics write — if Kafka is down, sample/drop events rather than fail redirects.
Don't put analytics synchronously on the redirect path. At 350K RPS, even 1 ms extra Kafka latency adds 350 seconds of aggregate wait per second. Use async fire-and-forget with local buffer + backpressure drop.
bit.ly processes billions of redirects monthly. They use multi-region Redis clusters for hot codes, MySQL/Aurora for persistence, and real-time analytics pipelines. TinyURL early architecture was single MySQL + mod-hash — worked until read QPS forced caching layer.
"Read path: Redis cache-aside, 302 for analytics, async Kafka click stream. Write path: Snowflake ID + Base62, MySQL INSERT, cache warm. Scale reads with Redis Cluster; scale writes with DB sharding by short_code hash. 350K peak RPS needs ~4 Redis nodes at 100K ops each with 80% hit rate."
L4 Must mention: Base62/Snowflake ID, Redis cache, 301 vs 302 trade-off, read-heavy optimization.
L5 Add: Kafka analytics decoupling, sharding strategy, cache stampede mutex on miss.
L6 Discuss: multi-region active-active, custom alias race conditions (DB unique constraint + Redis SETNX),
abuse detection (malware URL scanning async on create).
2. Rate Limiter
Protect APIs and infrastructure from abuse, ensure fair usage across tenants, and enforce SLA tiers. Rate limiting appears in every gateway, every public API, and almost every L4–L5 interview.
Functional requirements
- Limit requests per user, IP, API key, or endpoint
- Return 429 Too Many Requests with
Retry-Afterheader when exceeded - Support multiple tiers: free (100/min), pro (10K/min), enterprise (custom)
- Configurable limits per route (
POST /uploadstricter thanGET /health) - Whitelist/blacklist bypass rules
Non-functional requirements & scale
| Metric | Target |
|---|---|
| Decision latency | < 1 ms p99 (on critical path) |
| Throughput | Match API gateway RPS (100K–1M+ RPS) |
| Accuracy | Within 1% of configured limit (sliding window) |
| Availability | Fail-open vs fail-closed policy explicit |
| Memory | ~100 bytes per active key (user/IP combo) |
1M RPS API, 10M unique users/day active → ~500K concurrent rate-limit keys in Redis. 500K × 100 bytes ≈ 50 MB — trivial for Redis Cluster. At 1M RPS, need ~10 Redis nodes (100K ops each) or local token bucket + Redis sync for global limits.
Algorithm comparison
| Algorithm | Behavior | Burst | Memory | Accuracy |
|---|---|---|---|---|
| Fixed window | Count requests per clock minute | 2× burst at window boundary | O(1) per key | Poor at edges |
| Sliding window log | Store timestamp of each request | None | O(limit) per key | Exact |
| Sliding window counter | Weighted blend of current + previous window | Minimal | O(1) per key | ~99% accurate |
| Token bucket | Tokens refill at fixed rate; burst up to bucket size | Configurable burst | O(1) per key | Good |
| Leaky bucket | Requests queue; processed at fixed rate | Smooths traffic | O(queue depth) | Strict output rate |
flowchart LR
subgraph tb [Token Bucket]
Refill[Refill 10/sec] --> Bucket[(Bucket\ncapacity 100)]
Req1[Request] -->|take 1 token| Bucket
Bucket -->|tokens > 0| Allow[Allow]
Bucket -->|empty| Deny429[429 Deny]
end
subgraph lb [Leaky Bucket]
In[Incoming burst] --> Queue[(Queue)]
Queue -->|leak 10/sec| Out[Steady output]
end
Token bucket — deep dive
Each client has a bucket with capacity B and refill rate R tokens/sec.
On request: if tokens ≥ 1, decrement and allow; else reject with 429.
Allows bursts up to B while maintaining average rate R over time.
Used by AWS API Gateway, Stripe, and most SaaS APIs.
-- Redis Lua: atomic token bucket (EVAL script)
local key = KEYS[1]
local capacity = tonumber(ARGV[1])
local rate = tonumber(ARGV[2])
local now = tonumber(ARGV[3])
local requested = tonumber(ARGV[4])
local data = redis.call('HMGET', key, 'tokens', 'last_refill')
local tokens = tonumber(data[1]) or capacity
local last = tonumber(data[2]) or now
local elapsed = math.max(0, now - last)
tokens = math.min(capacity, tokens + elapsed * rate)
if tokens >= requested then
tokens = tokens - requested
redis.call('HMSET', key, 'tokens', tokens, 'last_refill', now)
redis.call('EXPIRE', key, math.ceil(capacity / rate) + 1)
return 1 -- allowed
else
return 0 -- denied
end
Sliding window — deep dive
Fixed window allows 2× burst at minute boundaries (59 requests at 00:59 + 60 at 01:00 = 119 in 2 seconds). Sliding window counter fixes this with O(1) memory:
count = prev_window_count × (1 - elapsed/window) + curr_window_count
Store two counters per key: current and previous window. Redis key: ratelimit:{user}:{window_id}.
Accuracy ~99%; used by Cloudflare and Kong gateway at scale.
Distributed rate limiting
flowchart TB Client --> Edge[Edge / CDN\n coarse IP limits] Edge --> GW[API Gateway\n per-key limits] GW --> Local[Local token bucket\n 80% decisions] Local -->|sync every 100ms| Redis[(Redis Cluster\n global truth)] GW --> Service[Microservice] Service -->|optional 2nd check| Redis
- Centralized (Redis) — accurate global limits; adds ~0.5–1 ms latency; single cluster is bottleneck
- Local + sync — each node holds local bucket, syncs quota to Redis periodically; faster but briefly over-limit during sync
- Edge limiting — Cloudflare/Kong at PoP; stops abuse before origin; coarse granularity (IP/CIDR)
- Consistent hashing — route same user to same Redis shard; avoids cross-shard coordination
Fail-open (allow on Redis down): prioritizes availability; risk of abuse during outage. Fail-closed (deny all): protects infrastructure; causes user-facing outage. Stripe fails closed on payment endpoints, fail-open on read-only catalog. State your policy explicitly.
Rate limiting by IP alone fails behind NAT (corporate offices, mobile carriers). Combine IP + API key + user ID. For anonymous endpoints, use IP + fingerprint + progressive CAPTCHA escalation.
Stripe publishes rate limits per API key with Retry-After headers.
GitHub uses sliding window with response headers: X-RateLimit-Remaining, X-RateLimit-Reset.
AWS API Gateway uses token bucket at account and stage level.
Draw three boxes: client → gateway (local bucket) → Redis (global sync). Explain token bucket for burst tolerance. Mention Redis Lua for atomicity — without it, race conditions double-count under concurrency. L6: discuss race between limiter and billing meter, clock skew across nodes, and adaptive limits under DDoS.
3. Key-Value Store (Dynamo-style)
Design a distributed, highly available key-value store like DynamoDB, Riak, or the storage layer behind Redis Cluster. Tests understanding of partitioning, replication, consistency trade-offs, and failure recovery.
Functional requirements
put(key, value)— store arbitrary blob (≤ 1 MB per value typical)get(key)— retrieve latest value or not-founddelete(key)— tombstone for eventual removal- Optional: conditional writes (
put if not exists), TTL expiration - No complex queries — point lookups only (by design)
Non-functional requirements
| Requirement | Target |
|---|---|
| Availability | 99.99% — always accept writes during partition |
| Latency | GET p99 < 10 ms; PUT p99 < 20 ms |
| Throughput | 1M+ ops/sec cluster-wide |
| Durability | No acknowledged write loss on single-node failure |
| Scalability | Linear scale-out by adding nodes |
1B keys × 1 KB average = 1 TB raw data. Replication factor 3 → 3 TB cluster. 100K ops/sec per node × 30 nodes = 3M ops/sec cluster capacity. WAL write: 100K × 1 KB = 100 MB/s disk per node — NVMe handles easily.
High-level architecture
flowchart TB
Client[Client SDK]
Coord[Coordinator / Router]
subgraph ring [Consistent Hash Ring]
N1[Node A\nvnodes 0-85]
N2[Node B\nvnodes 86-170]
N3[Node C\nvnodes 171-255]
end
subgraph nodeA [Each Storage Node]
HT[(In-memory\nHash Table)]
WAL[(Write-Ahead Log)]
SST[(SSTable / Disk)]
HT --> WAL --> SST
end
Client --> Coord --> ring
N1 --> nodeA
In-memory hash table — deep dive
Hot data lives in an in-memory hash map (open addressing or chained buckets).
Each entry: { key, value, version, expiry }.
Memory budget: if node has 64 GB RAM, reserve ~48 GB for hash table, rest for buffers/OS.
Eviction: LRU for cache tier; for durable store, flush cold entries to SSTable on disk.
Write-Ahead Log (WAL) — deep dive
Every write appends to WAL before acknowledging the client. On crash recovery: replay WAL to rebuild in-memory state. WAL segment rotation: new file every 64–128 MB; old segments deleted after SSTable flush.
- Client sends PUT
- Append
(key, value, seq, timestamp)to WAL (fsync or group commit batch) - Update in-memory hash table
- Replicate to N-1 nodes (async or sync per consistency level)
- Return success to client
fsync every write: durable but ~1K writes/sec on HDD, ~100K on NVMe. Group commit (batch fsync every 1 ms): 10× throughput, lose up to 1 ms of writes on crash. Dynamo-style systems use group commit + replication for durability without single-disk fsync bottleneck.
Replication & quorum
Each key maps to a preference list of N nodes (typically N=3) via consistent hashing. Coordinator sends write to all N replicas; read from R replicas.
| Config | W + R | Behavior |
|---|---|---|
| Strong | W=2, R=2 (N=3) | Quorum overlap guarantees latest read; higher latency |
| Eventual | W=1, R=1 | Fastest; may read stale; repair via read-repair |
| Default Dynamo | W=2, R=1 or W=1, R=2 | Tunable per client request |
sequenceDiagram participant C as Client participant Co as Coordinator participant R1 as Replica 1 participant R2 as Replica 2 participant R3 as Replica 3 C->>Co: PUT key (W=2) Co->>R1: write v5 Co->>R2: write v5 Co->>R3: write v5 (async) R1-->>Co: ack R2-->>Co: ack Co-->>C: success (2 of 3) C->>Co: GET key (R=2) Co->>R1: read Co->>R2: read Co-->>C: return max version
Consistent hashing — deep dive
Place nodes and keys on a hash ring (0 to 232-1). Key maps to first node clockwise on ring. Virtual nodes: each physical node owns 100–256 vnodes for even load distribution. Adding a node: only adjacent key ranges migrate (~K/N keys).
Failure handling
- Hinted handoff — if preferred node down, write to next node on ring; return when original recovers
- Read repair — on read, if replicas disagree, write latest version to stale replicas
- Anti-entropy — background Merkle tree comparison between replicas; sync divergent keys
- Gossip membership — nodes discover failures via SWIM protocol; update ring state
"AP system by default: W=1 for low-latency writes, R=2 for read-your-writes when needed. Vector clocks resolve conflicts on concurrent writes to same key — client merges or LWW with timestamp. This is exactly the Dynamo paper trade-off space."
L5+: draw the hash ring with vnodes. Explain W/R/N quorum math: W + R > N guarantees overlap. L6: compare to Raft (strong consistency, leader bottleneck) vs Dynamo (always writable, eventual consistency). Mention vector clocks vs last-write-wins — LWW loses updates silently.
4. Twitter Feed (Home Timeline)
Design the home timeline: when a user opens Twitter/X, show recent tweets from people they follow— sorted, paginated, and fresh within seconds. The celebrity problem makes this one of the hardest L5–L6 questions.
Functional requirements
- Post tweet — user publishes ≤ 280 chars (+ media metadata)
- Home timeline — paginated feed of tweets from followed users, newest first
- Follow / unfollow — update social graph
- Search & trends — trending topics (optional scope extension)
- Media — images/video stored separately; timeline holds references
Scale numbers
300M DAU, 2 tweets + 50 timeline reads per user/day.
Write: 300M × 2 / 86400 ≈ 7K tweets/sec (peak ×3 → 21K/sec).
Read: 300M × 50 / 86400 ≈ 174K timeline reads/sec (peak → 520K/sec).
Storage: 7K/sec × 86400 × 365 × 500 bytes ≈ 110 TB/year tweets alone.
Fan-out: avg 200 followers × 7K tweets = 1.4M timeline writes/sec if pure fan-out-on-write.
Fan-out on write vs fan-out on read
| Strategy | Write cost | Read cost | Best for |
|---|---|---|---|
| Fan-out on write | O(followers) per tweet | O(1) — precomputed timeline | Normal users (<10K followers) |
| Fan-out on read | O(1) — store tweet once | O(following) — merge at read | Celebrities (>1M followers) |
| Hybrid | O(min(followers, 10K)) | O(1) + merge celebrity tweets | Production Twitter approach |
flowchart TB
Post[User posts tweet]
Post --> Check{Followers > 10K?}
Check -->|No| FOW[Fan-out on write\npush to follower timelines]
Check -->|Yes| Celeb[Store in celebrity bucket\nno fan-out]
FOW --> Redis[(Redis Sorted Set\ntimeline:userId)]
Celeb --> TweetStore[(Cassandra\ntweets by user)]
Read[User reads timeline] --> Redis
Read --> Merge[Merge celebrity tweets\nfrom Cassandra]
Merge --> Response[Sorted merged feed]
Redis sorted sets — deep dive
Each user's home timeline is a Redis sorted set (ZSET):
score = tweet timestamp (or snowflake ID), member = tweet_id.
ZADD timeline:{userId} {timestamp} {tweetId}
ZRANGE timeline:{userId} 0 49 REV → latest 50 tweets in O(log N + 50).
- Cap timeline at ~800 entries; trim oldest on insert (sliding window cache)
- 800 × 8 bytes tweet_id × 300M users = ~192 GB if every user had full cache — most inactive; use tiered storage
- Hot users' timelines in Redis; cold users fan-out-on-read from Cassandra
Cassandra for tweet storage
Partition key design drives everything:
| Table | Partition key | Clustering | Query |
|---|---|---|---|
tweets_by_user |
user_id |
tweet_id DESC |
User profile timeline |
timeline_cache |
viewer_id |
tweet_id DESC |
Fallback precomputed timeline |
social_graph |
follower_id |
followee_id |
Who does user follow? |
The celebrity problem — deep dive
Katy Perry (~100M+ followers): one tweet = 100M Redis ZADD operations. At 1 ms each (absurdly optimistic), that's 27+ hours. Even at 100K ops/sec cluster-wide, one celebrity tweet consumes the entire cluster for 1000 seconds.
Solution — hybrid fan-out:
- Users with <10K followers: fan-out on write to Redis timelines
- Celebrities (>10K): tweet stored only in their Cassandra partition
- On timeline read: fetch precomputed timeline + fetch recent tweets from each followed celebrity + merge sort
- Cache merged celebrity slice separately (5–30 sec TTL)
Trends — HyperLogLog
Counting unique users per hashtag at Twitter scale is impossible with exact counters.
HyperLogLog in Redis: ~12 KB memory per key, ~0.81% error, counts billions of uniques.
Pipeline: tweet → Kafka → Flink extracts hashtags → PFADD trend:{hashtag} {userId} →
periodic PFCOUNT for ranking. Top-K via Count-Min Sketch + heap.
Twitter's 2012 architecture blog described Redis timelines + MySQL/Cassandra + hybrid fan-out. They migrated to Manhattan (internal KV store) for timelines. The hybrid fan-out pattern remains the standard answer — Instagram uses similar approach for Stories feed.
Always compute fan-out cost: 7K tweets × 200 avg followers = 1.4M writes/sec — say this number out loud. Then explain hybrid. L6: discuss ranking (ML score vs chronological), ads insertion, and read-your-own-writes consistency when you tweet and immediately refresh feed.
5. YouTube (Video Platform)
Design a video upload, processing, and streaming platform serving billions of hours watched monthly. Spans blob storage, async transcoding pipelines, CDN delivery, and recommendation systems.
Functional requirements
- Upload video — resumable upload, metadata (title, description, tags)
- Transcode — multiple resolutions (360p–4K) and formats (H.264, VP9, AV1)
- Stream — adaptive bitrate playback with low startup latency (<2 sec)
- Search & recommendations — home feed, related videos
- Analytics — view counts, watch time, creator dashboard
Scale numbers
2B logged-in users/month, 500M hours watched/day, avg 720p @ 2 Mbps.
Uploads: ~500 hours/minute uploaded → ~8 uploads/sec average, peak ×10 → 80/sec.
Views: billions/day → ~100K–500K stream starts/sec peak globally.
Storage: 500 hr/min × 60 × 24 × 1 GB/hr (raw) ≈ 720 TB/day raw; transcoded 3–5× → 2–3 PB/day (most cold/archived).
CDN egress: 500M hr × 2 Mbps ≈ 278 Tbps daily average — CDN is non-negotiable.
Upload pipeline
flowchart LR Creator[Creator App] Upload[Upload Service\nresumable chunks] Blob[(Object Storage\nS3 / GCS)] Meta[(Metadata DB\nvideo_id, status)] Queue[SQS / Pub-Sub] Transcode[Transcoding Farm\nFFmpeg workers] CDN[CDN Origin] Catalog[Catalog Service] Creator -->|chunked PUT| Upload Upload --> Blob Upload --> Meta Upload -->|video.uploaded| Queue Queue --> Transcode Transcode -->|read raw| Blob Transcode -->|write HLS segments| CDN Transcode --> Meta Meta --> Catalog
Resumable upload — deep dive
- Client requests upload session → server returns
upload_id+ signed chunk URLs - Upload 5–50 MB chunks with
Content-Rangeheaders - Server tracks received byte ranges in Redis/DB
- On disconnect: client resumes from last confirmed byte
- Complete: assemble manifest, trigger transcoding job
YouTube uses resumable HTTP uploads (similar to Google Cloud Storage protocol). Never hold full video in app server memory.
Transcoding with FFmpeg — deep dive
| Resolution | Bitrate | Transcode time (1 hr video) | Segment size |
|---|---|---|---|
| 360p | 800 Kbps | ~15 min (1× realtime) | 2–6 sec HLS segments |
| 720p | 2.5 Mbps | ~30 min | 2–6 sec |
| 1080p | 5 Mbps | ~45 min | 2–6 sec |
| 4K | 15 Mbps | ~2 hr (GPU accelerated) | 2–6 sec |
Transcoding farm: Kubernetes jobs, one FFmpeg process per job, GPU nodes for 4K/AV1.
Priority queue: popular creators / live replay first. Spot/preemptible instances for 70% cost savings.
Output: HLS (.m3u8 manifest + .ts segments) and DASH (.mpd + .m4s) for browser compatibility.
# FFmpeg: generate HLS ladder from source
ffmpeg -i input.mp4 \
-filter_complex "[0:v]split=3[v1][v2][v3]; \
[v1]scale=640:360[v1out]; [v2]scale=1280:720[v2out]; [v3]scale=1920:1080[v3out]" \
-map "[v1out]" -c:v libx264 -b:v 800k -map a:0 -c:a aac -b:a 96k -f hls stream_360.m3u8 \
-map "[v2out]" -c:v libx264 -b:v 2500k -map a:0 -c:a aac -b:a 128k -f hls stream_720.m3u8 \
-map "[v3out]" -c:v libx264 -b:v 5000k -map a:0 -c:a aac -b:a 192k -f hls stream_1080.m3u8
CDN & HLS/DASH delivery
sequenceDiagram participant Player participant CDN as CDN Edge participant Origin as Origin / Storage Player->>CDN: GET master.m3u8 CDN-->>Player: 360p/720p/1080p variants Player->>Player: measure bandwidth Player->>CDN: GET 720p segment 001.ts CDN-->>Player: video chunk Note over Player: switch to 1080p if bandwidth allows Player->>CDN: GET 1080p segment 002.ts CDN->>Origin: cache miss fetch Origin-->>CDN: segment CDN-->>Player: video chunk
Adaptive bitrate (ABR): player monitors buffer and bandwidth, switches quality per segment. CDN caching: popular videos cached at edge; long-tail served from origin. Cache key: video_id + resolution + segment_index. Immutable segments = infinite TTL.
Recommendations — deep dive
- Candidate generation — collaborative filtering, content similarity, subscribed channels (millions → hundreds)
- Ranking — ML model scores candidates on watch probability, satisfaction, diversity
- Two-tower model — user embedding × video embedding dot product for fast retrieval at scale
- Feature store — real-time features (last 10 watches) + batch features (user demographics)
- Freshness — new uploads indexed within minutes via stream processing
Transcode-all-upfront: instant playback after processing delay (minutes–hours for 4K). Progressive / just-in-time: start playback after first segments ready (~30 sec); lower storage for unpopular videos. YouTube transcodes popular resolutions first (360p + 720p), adds 4K asynchronously.
Separate upload (write-heavy, async) from streaming (read-heavy, CDN). Don't transcode synchronously on upload request. L5: HLS segment structure, ABR algorithm. L6: live streaming (WebRTC/LL-HLS), copyright Content ID fingerprinting, regional compliance (geo-block metadata).
6. Uber (Ride Matching)
Match riders with nearby drivers in real time, calculate ETAs and fares, handle surge pricing, and manage trip lifecycle from request to payment. Geospatial indexing at city scale is the core challenge.
Functional requirements
- Request ride — rider specifies pickup, dropoff, ride type
- Match driver — find available driver within ~3 km, ETA < 5 min
- Track trip — real-time location updates for rider and driver
- Surge pricing — dynamic multiplier when demand exceeds supply
- Payment & receipt — charge on trip completion
- Rating — post-trip feedback
Scale numbers
130M MAU, 25M trips/day globally.
Trips: 25M / 86400 ≈ 290 trips/sec average; NYC peak alone ~30/sec, global peak ×5 → 1.5K/sec.
Location updates: 5M active drivers × 1 update/4 sec ≈ 1.25M location writes/sec peak.
Storage: location history 30 days × 1.25M × 86400 × 50 bytes ≈ 162 TB (tiered to cold storage).
Geospatial queries: each ride request searches ~3 km radius in <200 ms.
High-level architecture
flowchart TB Rider[Rider App] Driver[Driver App] GW[API Gateway] Trip[Trip Service\nstate machine] Match[Matching Service] Geo[(Geospatial Index\nRedis GEO / H3)] Surge[Surge Service] Loc[Location Service] Pay[Payment Service] WS[WebSocket / Push] Rider --> GW --> Trip Trip --> Match Match --> Geo Match --> Surge Driver -->|GPS every 4s| Loc --> Geo Trip --> WS WS --> Rider WS --> Driver Trip --> Pay
Geospatial indexing — deep dive
| Approach | Query | Update | Scale |
|---|---|---|---|
| Redis GEO | GEORADIUS in O(N+log M) | GEOADD O(log N) | Good for city-level (<100K drivers/region) |
| Geohash grid | Query cell + 8 neighbors | Re-index on move across cell | Simple; edge cases at cell boundaries |
| H3 hexagonal index | k-ring expansion from center hex | Update on hex change (~100m resolution) | Uber's production choice; uniform adjacency |
| Quadtree / R-tree | Range search O(log N) | Rebalance on update | PostGIS; heavier for 1M+ updates/sec |
Redis GEO for hot path: GEOADD drivers:{cityId} {lng} {lat} {driverId},
GEORADIUS drivers:{cityId} {lng} {lat} 3 km COUNT 10 ASC.
Partition by city/region — NYC, SF, London each own Redis cluster shard.
Driver location updates every 3–4 seconds; stale drivers (>30 sec) excluded from matching.
H3 & surge pricing — deep dive
Uber open-sourced H3 — hexagonal hierarchical geospatial index.
Resolution 7 hex ≈ 1.2 km²; resolution 9 ≈ 0.1 km².
Surge computed per H3 cell: surge = demand / supply smoothed over 5-min window.
- Count ride requests per H3 cell (demand signal)
- Count available drivers per H3 cell (supply signal)
- If ratio > threshold (e.g., 2.0): surge multiplier = min(ratio, 3.0)
- Push surge map to rider app via WebSocket; refresh every 60 sec
- Drivers incentivized to move to high-surge hexes (supply rebalancing)
Trip state machine
stateDiagram-v2 [*] --> Requested: rider requests Requested --> Matching: find drivers Matching --> Accepted: driver accepts Matching --> Cancelled: no driver / timeout Accepted --> Arriving: driver en route Arriving --> InProgress: rider picked up InProgress --> Completed: dropped off InProgress --> Cancelled: mid-trip cancel Completed --> Paid: payment processed Paid --> [*] Cancelled --> [*]
State stored in Trip Service (PostgreSQL or DynamoDB). Each transition emits domain event to Kafka:
trip.accepted, trip.completed. Payment, notification, and analytics services subscribe.
Idempotency keys on transitions prevent double-charge on retry.
Matching algorithm
- Fetch top 10 nearest available drivers (GEORADIUS ASC)
- Filter: correct ride type, rating > 4.5, not on another trip
- Score: ETA (60%) + driver acceptance rate (20%) + direction of travel (20%)
- Dispatch to best driver; if no accept in 15 sec, offer to next
- Batch matching in dense areas: solve assignment problem (Hungarian algorithm) every 2 sec
Don't query all drivers globally. Partition by city/geofence. At 1.25M location updates/sec, a single Redis instance dies — shard by region with consistent hashing on city_id.
Uber's Dispatch service handles matching; Ramen (location store) ingests GPS at massive scale. They migrated from PostGIS to H3 + custom indexing. Surge uses ML models beyond simple supply/demand ratio.
Lead with geospatial index (Redis GEO or H3), then state machine, then surge. Quantify location write QPS — it's the hidden scale driver. L6: discuss batched matching vs greedy nearest, driver supply prediction, and fraud (GPS spoofing).
7. WhatsApp (Messaging at Scale)
Design a global messaging platform delivering end-to-end encrypted messages with delivery receipts, offline support, and media sharing — at 100+ billion messages per day.
Functional requirements
- Send message — text, image, video, voice note to individual or group (≤1024 members)
- Delivery receipts — sent ✓, delivered ✓✓, read (blue ✓✓)
- Offline delivery — store-and-forward when recipient offline
- E2E encryption — server cannot read message content
- Last seen / online status — privacy-controlled presence
- Multi-device sync — same account on phone + desktop
Scale numbers
2B+ users, 100B messages/day (Meta reported ~2020).
Write: 100B / 86400 ≈ 1.16M messages/sec average; peak ×3 → 3.5M/sec.
Connections: ~500M concurrent WebSocket/long-poll connections globally.
Storage: 100B × 200 bytes metadata/day = 20 TB/day (content encrypted, media in object store).
Media: ~20% of messages include media; 20B × 500 KB avg = 10 PB/day (CDN + tiered storage).
High-level architecture
flowchart TB Sender[Sender Device] ConnGW[Connection Gateway\n500M WebSockets] MsgSvc[Message Service] Router[User Router\nuser_id → server] Cassandra[(Cassandra\nmessage store)] Offline[(Offline Queue\nper user partition)] Push[Push Notification\nAPNs / FCM] Receiver[Receiver Device] Media[(Object Storage\nencrypted blobs)] Signal[Signal Protocol\nE2E layer] Sender -->|encrypted payload| ConnGW --> MsgSvc MsgSvc --> Router Router -->|online| ConnGW --> Receiver Router -->|offline| Offline MsgSvc --> Cassandra Offline --> Push Sender --> Media Signal -.->|keys on device| Sender Signal -.->|keys on device| Receiver
Cassandra message storage — deep dive
Partition key = conversation_id (or recipient_user_id for 1:1).
| Table | Partition key | Clustering | Purpose |
|---|---|---|---|
messages |
conversation_id |
message_id DESC |
Chat history pagination |
offline_inbox |
user_id |
received_at ASC |
Pending delivery queue |
user_sessions |
user_id |
device_id |
Multi-device routing |
receipts |
message_id |
user_id |
Delivery/read status |
WhatsApp famously uses Erlang for connection handling (soft real-time, millions of processes) and FreeBSD tuning. Message metadata in Cassandra; content is encrypted blob the server cannot decrypt. Groups: fan-out on write to each member's offline queue (similar to Twitter, but encrypted envelope per recipient).
Delivery receipts — deep dive
sequenceDiagram participant S as Sender participant Server participant R as Receiver S->>Server: send encrypted msg (msg_id) Server-->>S: ack sent (1 grey tick) Server->>R: push message R->>Server: ack received Server-->>S: delivered (2 grey ticks) R->>R: user opens chat R->>Server: ack read Server-->>S: read (2 blue ticks)
Receipts are lightweight control messages (~50 bytes), not full message re-transmission.
Stored in Cassandra with TTL; aggregated for group chats (delivered when all members ack).
Ordering: server-assigned monotonic message_id per conversation (not wall clock — avoids clock skew).
E2E encryption — Signal Protocol
- X3DH — extended triple Diffie-Hellman for initial key agreement (first message to new contact)
- Double Ratchet — forward secrecy: compromise of current key doesn't expose past messages
- Sender Keys — for group messages: one encrypted payload, per-member keys for fan-out efficiency
- Server stores only:
{ from, to, timestamp, encrypted_blob, msg_id }— no plaintext ever - Key transparency: safety numbers for verifying contact identity
E2E encryption prevents server-side spam/malware scanning, search indexing, and ML moderation. WhatsApp uses metadata analysis + user reports for abuse. Multi-device sync requires Sender Key distribution across devices — significant protocol complexity.
Connection layer at 500M concurrent
- Connection gateway shards by
user_id hash— sticky routing to same Erlang node - Heartbeat every 30 sec; disconnect after 3 missed → mark offline, route to push
- Long-poll fallback for restrictive networks (HTTP/2 multiplexing)
- Cross-region: user homed in primary region; travel = cross-region relay with +50 ms latency acceptable for async chat
WhatsApp scaled to 900M users with 50 engineers (2016) — extreme focus on Erlang + FreeBSD + Cassandra. Acquired by Facebook/Meta for $19B. Architecture prioritizes reliability and simplicity over feature richness. Group limit raised from 256 → 1024 required rethinking fan-out queues.
L6 question — show you know the numbers (100B/day = 1.16M/sec). Separate connection layer (WebSocket scale) from storage layer (Cassandra partitions). Mention E2E as constraint that blocks server-side features. Don't forget offline queue + push notification fallback — half of messages delivered to offline users.
8. Google Maps (Navigation & Routing)
Design a mapping service: serve map tiles globally, compute routes in sub-second time across continental road networks, incorporate real-time traffic, and support turn-by-turn navigation for hundreds of millions of users.
Functional requirements
- Map display — pan, zoom, satellite/terrain layers at 60 fps
- Route planning — origin → destination, multiple waypoints, avoid tolls/highways
- Turn-by-turn navigation — real-time rerouting on deviation or traffic change
- ETA — accurate arrival time using live traffic
- Search — places, addresses, POI autocomplete
- Traffic overlay — color-coded congestion on map
Scale numbers
1B+ MAU, ~1B km driven/day with Google Maps navigation.
Routing requests: ~50M/day explicit + navigation reroutes → ~600/sec average, peak ×20 → 12K/sec.
Tile requests: each map view ≈ 20 tiles; 1B users × 10 views/day × 20 = 200B tiles/day → 2.3M tiles/sec (CDN absorbs 99%).
Road graph: global road network ≈ 200M nodes, 400M edges (~50 GB compressed).
Traffic probes: Android/iOS location at scale → billions of speed samples/hour.
High-level architecture
flowchart TB Client[Mobile / Web Client] CDN[CDN\nmap tiles] TileSvc[Tile Server\n256×256 PNG/vector] RouteAPI[Routing API] CH[Contraction Hierarchies\npreprocessed graph] Traffic[Traffic Service\nreal-time speeds] Graph[(Road Graph Store)] Probe[Probe Ingestion\nKafka] Search[Places Search\nElasticsearch] ETA[ETA ML Model] Client --> CDN Client --> RouteAPI CDN -->|miss| TileSvc RouteAPI --> CH RouteAPI --> Traffic CH --> Graph Traffic --> Probe Traffic --> ETA Client --> Search
Map tiles — deep dive
World divided into zoom levels 0–22. At zoom Z, earth split into 4Z tiles. Each tile: 256×256 pixels (raster PNG) or vector (Mapbox Vector Tiles / protobuf).
| Zoom | Coverage | Tile size | Cache strategy |
|---|---|---|---|
| 0–5 | World / continent | ~5–20 KB vector | CDN cache 30+ days (rarely changes) |
| 10–14 | City / neighborhood | ~20–50 KB vector | CDN cache 7 days; invalidate on map update |
| 15–18 | Street level | ~50–200 KB | CDN + client disk cache |
| 19–22 | Building detail | ~200 KB–1 MB | On-demand; limited pre-generation |
Tile URL pattern: /tiles/{z}/{x}/{y}.pbf. CDN serves 99%+ from edge.
Traffic overlay is a separate transparent tile layer refreshed every 2–5 minutes — not baked into base map.
Graph routing — deep dive
Road network = weighted directed graph: nodes = intersections, edges = road segments, weight = travel time. Naive Dijkstra on 200M nodes: O(E log V) → seconds to minutes. Unacceptable for interactive routing.
Contraction Hierarchies (CH)
Preprocess graph offline: iteratively contract least-important nodes, adding shortcut edges that preserve shortest paths. Result: hierarchy where upward search from source and downward search from target meet in middle — explored nodes drop from millions to thousands.
- Preprocessing: hours on cluster for continent-sized graph; rerun weekly or on map update
- Query: bidirectional search on hierarchy → <100 ms on continental graph
- Memory: ~10–20× original graph size with shortcuts (acceptable for in-memory serving)
- Dynamic traffic: edge weights updated from traffic service; CH structure unchanged, weights hot-reloaded
flowchart TB
subgraph preprocess [Offline Preprocessing]
Raw[Raw road graph\n200M nodes] --> Contract[Iterative contraction\nadd shortcuts]
Contract --> CHGraph[CH Graph\nin memory]
end
subgraph query [Online Query — sub 100ms]
Src[Source] --> Up[Upward search\nin hierarchy]
Tgt[Target] --> Down[Downward search\nin hierarchy]
Up --> Meet[Meet at node]
Down --> Meet
Meet --> Path[Unwrap shortcuts\nfull route]
end
CHGraph --> query
Real-time traffic — deep dive
- Probe ingestion: anonymized GPS speed from Android/iOS Maps users → Kafka (billions/day)
- Aggregation: Flink windows compute median speed per road segment (edge) per 2-min bucket
- Weight update:
travel_time = segment_length / max(median_speed, 5 km/h) - Push to routing: in-memory edge weight table refreshed every 2–5 min
- ETA ML: gradient-boosted model on historical + live traffic + time-of-day + weather
CH preprocessing is fast at query but slow to update graph topology (new roads). A* with landmarks (ALT) or Customizable Route Planning (CRP) trade preprocessing flexibility for query speed. Google uses CH + CRP hybrid; OSRM uses Contraction Hierarchies open-source.
Alternative algorithms (know for L6)
| Algorithm | Query time | Preprocessing | Dynamic weights |
|---|---|---|---|
| Dijkstra | Seconds–minutes | None | Easy |
| A* + landmarks | 100 ms–1 sec | Moderate | Easy |
| Contraction Hierarchies | <100 ms | Hours | Weights only |
| Customizable Route Planning | <50 ms | Moderate | Fast customization |
Google Maps routing research published Contraction Hierarchies (Geisberger et al., 2008). OpenStreetMap + OSRM provides open CH-based routing. Apple Maps uses similar preprocessed graph techniques. Map tiles served via Google CDN with vector tile format replacing raster at high zoom levels.
"Tiles on CDN (2.3M RPS). Routing via Contraction Hierarchies — sub-100 ms on 200M node graph. Traffic weights from probe aggregation, refreshed every 2 min. Separate tile layer for traffic overlay. ETA uses ML on top of CH base time — cite 15–25% accuracy improvement over static routing."
Split into two sub-problems: tile serving (CDN, easy) and routing (hard). Never say Dijkstra on full graph — immediately follow with CH or A* landmarks. L6: discuss CRP, multi-criteria routing (tolls + time), and rerouting frequency during navigation (every 30 sec or on deviation >50m).