Database Design & Selection
The database is the hardest component to change after launch. Wrong choice costs months of migration; right choice buys years of scale. This chapter is the capstone data layer reference: a selection framework for every major store family, deep dives into RDBMS internals, sharding, NoSQL variants, search, graphs, and replication topologies—with numbers and trade-offs you can defend in a staff-level interview.
1. Database Selection Framework
Start with access patterns, not brand names. Every store optimizes for a workload shape: point reads, range scans, aggregations, full-text search, graph traversals, or append-only time series. The framework below maps requirements to families—and explicitly states when not to use each.
The five questions before you pick
- Read vs write ratio — 90% reads favors replicas + cache; write-heavy favors LSM/wide-column.
- Query shape — point lookup, range scan, join, aggregation, geo, full-text, traversal?
- Consistency needs — financial ledger (strong) vs social likes (eventual)?
- Scale trajectory — single node OK for 2 years, or 100M rows day one?
- Operational maturity — managed service vs self-hosted Cassandra cluster?
flowchart TD
start(["New data store needed"]) --> q1{"Structured schema\nwith joins?"}
q1 -->|Yes| sql["Relational SQL\nPostgreSQL / MySQL"]
q1 -->|No| q2{"Document-shaped\nnested JSON?"}
q2 -->|Yes| doc["Document DB\nMongoDB / DynamoDB"]
q2 -->|No| q3{"Massive write\nthroughput?"}
q3 -->|Yes| wc["Wide-column\nCassandra / HBase"]
q3 -->|No| q4{"Sub-ms latency\ncache/session?"}
q4 -->|Yes| kv["Key-value\nRedis / Memcached"]
q4 -->|No| q5{"Time-ordered\nmetrics/logs?"}
q5 -->|Yes| ts["Time-series\nInfluxDB / TimescaleDB / ClickHouse"]
q5 -->|No| q6{"Full-text or\nfaceted search?"}
q6 -->|Yes| search["Search engine\nElasticsearch / OpenSearch"]
q6 -->|No| q7{"Relationship\ntraversals?"}
q7 -->|Yes| graphdb["Graph DB\nNeo4j / Neptune"]
q7 -->|No| sql
Decision matrix by store family
| Family | Best for | Weak at | Typical scale | Examples |
|---|---|---|---|---|
| Relational (SQL) | ACID transactions, joins, ad-hoc queries, reporting | Horizontal write scale without sharding; schema rigidity | 10 TB single node; sharded to PB with Vitess/Citus | PostgreSQL, MySQL, Aurora |
| Document | Flexible schema, nested objects, rapid iteration | Multi-document ACID (limited), unbounded document growth | 100M+ docs per collection with sharding | MongoDB, CouchDB, DynamoDB (doc model) |
| Wide-column | High write throughput, time-series-ish rows, geo-distribution | Ad-hoc joins, secondary index fan-out, range queries without partition key | PB across datacenters | Cassandra, HBase, ScyllaDB |
| Key-value | Cache, sessions, rate limits, feature flags, leaderboards | Complex queries, durability without tuning (Redis) | 100K–1M ops/sec per node | Redis, Memcached, DynamoDB |
| Time-series | Metrics, IoT, observability, downsampling | General-purpose OLTP, mutable row updates | Millions of writes/sec with batching | InfluxDB, TimescaleDB, Prometheus, ClickHouse |
| Search | Full-text, fuzzy match, facets, relevance ranking | Source of truth, strong consistency with primary DB | TB indexes with sharding | Elasticsearch, OpenSearch, Solr |
| Graph | Short-path queries, fraud rings, recommendations, IAM | Bulk analytics, simple key lookups at extreme scale | 10B edges (Neptune), smaller for Neo4j | Neo4j, Amazon Neptune, JanusGraph |
Polyglot persistence — the norm at scale
Production systems rarely use one database. Uber runs MySQL (core trips), Cassandra (location), Redis (cache), Elasticsearch (search). The pattern: PostgreSQL as system of record, specialized stores as derived views. Sync via CDC (Debezium → Kafka) or dual-write (riskier). Interview signal: name the primary store and justify satellites.
L5: "We'll use PostgreSQL." L6: "PostgreSQL for orders and inventory with row-level locking on stock decrements; Redis cache-aside for product catalog (95% read); Elasticsearch fed by Kafka CDC for search—eventual consistency acceptable with 2s refresh SLA. DynamoDB only if we need single-digit-ms global reads without managing replicas."
One database vs many — operational simplicity and JOINs vs best-fit performance per workload. Start monolithic SQL; split when measured pain (write QPS, query latency, team boundaries) exceeds migration cost.
When asked "SQL or NoSQL?", reframe: "What's the access pattern?" Walk through the five questions aloud. Interviewers reward structured elimination over premature technology choice.
2. Relational Databases
Four decades of battle-testing make RDBMS the default system of record. Understanding ACID, isolation anomalies, indexing, and connection economics separates engineers who "use Postgres" from those who keep it alive at 50K QPS.
ACID properties
- Atomicity — all statements in a transaction commit or none do (WAL + rollback segments).
- Consistency — constraints (FK, CHECK, UNIQUE) hold after every committed transaction.
- Isolation — concurrent transactions behave as if serialized (level-dependent).
- Durability — committed data survives crash (fsync WAL before ACK).
Isolation levels and anomalies
ANSI SQL defines four isolation levels. PostgreSQL implements three (Read Uncommitted behaves as Read Committed). Higher isolation = fewer anomalies, more locking/retries, lower throughput.
| Level | Dirty read | Non-repeatable read | Phantom read | Implementation notes |
|---|---|---|---|---|
| Read Uncommitted | Possible | Possible | Possible | PostgreSQL upgrades to Read Committed; MySQL InnoDB rarely used |
| Read Committed | No | Possible | Possible | Default in PostgreSQL; each statement sees latest committed snapshot |
| Repeatable Read | No | No | Possible* | MySQL InnoDB default; PostgreSQL blocks phantoms via predicate locks |
| Serializable | No | No | No | SSI in PostgreSQL; highest safety, serialization failures → retry |
* PostgreSQL Repeatable Read prevents phantom reads for most cases via MVCC + predicate locking.
Lost updates happen at Read Committed without explicit locking.
Use SELECT … FOR UPDATE, optimistic versioning (UPDATE … WHERE version = ?),
or Serializable isolation for inventory/financial counters.
Indexes — B-tree, hash, GIN, partial
B-tree (default) supports equality and range on sortable columns. Composite index column order matters:
(user_id, created_at) serves WHERE user_id = ? and
WHERE user_id = ? AND created_at > ? but not WHERE created_at > ? alone.
- Covering index — INCLUDE columns avoid heap fetch (index-only scan).
- Partial index —
WHERE status = 'active'shrinks index size 10× for hot subsets. - GIN/GiST — JSONB, full-text, geo in PostgreSQL.
- Write amplification — every index doubles insert cost; over-indexing kills write-heavy tables.
-- Composite + partial index for active user orders
CREATE INDEX idx_orders_user_active
ON orders (user_id, created_at DESC)
WHERE status NOT IN ('cancelled', 'archived');
-- Covering index: avoid heap lookup for list view
CREATE INDEX idx_products_list
ON products (category_id, price)
INCLUDE (name, thumbnail_url);
-- Explain analyze before shipping
EXPLAIN (ANALYZE, BUFFERS)
SELECT id, name, price FROM products
WHERE category_id = 42 AND price BETWEEN 10 AND 100
ORDER BY price LIMIT 20;
The N+1 query problem
Loading 100 posts then issuing 100 queries for authors = 101 round trips (~101 × 1ms = 101ms vs 2ms with JOIN). ORMs lazy-load by default—fix with eager loading, JOIN FETCH, or batch IN queries.
// BAD: N+1 — 1 query for posts + N for each author
List<Post> posts = postRepo.findByUserId(userId);
posts.forEach(p -> p.getAuthor().getName()); // lazy load each author
// GOOD: single JOIN or @EntityGraph
@Query("SELECT p FROM Post p JOIN FETCH p.author WHERE p.userId = :uid")
List<Post> findWithAuthor(@Param("uid") Long userId);
Connection pooling
PostgreSQL uses ~10 MB RAM per connection and a dedicated backend process. Opening a connection costs 1–5 ms. At 500 app instances × 20 connections = 10,000 backends — the database dies before your app does. PgBouncer (transaction pooling) multiplexes thousands of clients onto ~100 server connections.
- Pool size rule of thumb —
connections = (core_count × 2) + effective_spindle_count(PgBouncer docs); often 20–50 per primary. - Transaction vs session pooling — transaction mode breaks prepared statements and temp tables; session mode safer for ORMs.
- RDS Proxy / Aurora — managed pooling + IAM auth + failover.
Read replicas
Streaming replication ships WAL to standbys. Replicas serve read traffic; lag typically 10ms–seconds under load. Replica lag means read-your-writes is NOT guaranteed unless routed to primary or tracked LSN.
flowchart LR app[Application] primary[(Primary PostgreSQL)] replica1[(Read Replica 1)] replica2[(Read Replica 2)] app -->|writes + strong reads| primary app -->|analytics / lists| replica1 app -->|search indexing| replica2 primary -->|WAL stream| replica1 primary -->|WAL stream| replica2
Read replicas vs caching — replicas give fresh-ish data without cache invalidation complexity; cache gives sub-ms latency. Common pattern: cache hot keys, replica for heavy reporting queries that tolerate lag.
PostgreSQL vs MySQL
| Dimension | PostgreSQL | MySQL (InnoDB) |
|---|---|---|
| SQL compliance / features | Window functions, CTEs, JSONB, arrays, full-text — richer ad-hoc analytics | Simpler subset; 8.0+ CTEs/window functions; JSON type less ergonomic |
| Replication | Streaming + logical decoding (CDC); built-in partitioning | Binlog (row/statement); mature multi-source; Vitess ecosystem |
| Default isolation | Read Committed | Repeatable Read (gap locks differ from PG) |
| Typical adopters | Startups, analytics-heavy, GIS (PostGIS), greenfield | Legacy web, Shopify-scale sharded MySQL, China ecosystem |
| Interview default | Prefer unless interviewer says MySQL or existing fleet | Cite when discussing Uber/Vitess, Facebook history |
Instagram started PostgreSQL, migrated to Cassandra for feed storage at scale — but metadata stayed relational. Notion runs PostgreSQL with aggressive indexing and sharding via Vitess-like patterns for block storage metadata.
3. Sharding
When vertical scaling and read replicas exhaust headroom, you partition data across nodes. Sharding is irreversible architecture — shard key choice determines whether you scale linearly or fight hot spots forever.
When to shard
- Single-node storage > 2–5 TB or write QPS exceeds ~10–50K sustained (hardware dependent).
- Replication lag unacceptable and writes are the bottleneck.
- Regulatory data residency requires geographic partitions.
Before sharding: archive cold data, denormalize, connection pool, read replicas, caching — each buys 10× cheaper than sharding.
Shard key selection
The shard key routes every query to the correct partition. Ideal properties:
- High cardinality — uniform distribution (user_id ✓, country_code ✗).
- Query alignment — most queries include shard key (avoid cross-shard scatter-gather).
- Immutability — changing shard key requires row migration.
- Monotonic keys avoided — auto-increment IDs concentrate writes on latest shard (use hash or snowflake IDs).
Sharding by tenant_id when one tenant is 80% of traffic recreates a monolith on one shard.
Mitigate: dedicated shard for whale tenants, sub-shard by user_id within tenant, or rate-limit enterprise tier.
Sharding strategies
| Strategy | Routing | Pros | Cons |
|---|---|---|---|
| Hash sharding | shard = hash(key) mod N |
Even distribution with good keys | Resharding requires remapping most keys when N changes |
| Range sharding | Key ranges per shard (A–M, N–Z) | Range scans local; easy time-based archival | Hot spots on latest range (time-ordered keys) |
| Directory sharding | Lookup table maps key → shard | Flexible reassignment, whale tenant isolation | Lookup service is SPOF; must cache and replicate |
| Geographic sharding | Region in key or directory | GDPR, latency compliance | Cross-region queries expensive; user migration hard |
Consistent hashing
Plain mod N remaps almost all keys when adding a shard. Consistent hashing places shards and keys on a ring;
adding a node only moves keys between adjacent neighbors (~K/N keys move). Used in DynamoDB, Cassandra, Redis Cluster, CDNs.
flowchart TB
subgraph ring["Hash ring (virtual nodes)"]
direction TB
n1["Shard A\nvnodes 0-85"]
n2["Shard B\nvnodes 86-170"]
n3["Shard C\nvnodes 171-255"]
n1 --> n2 --> n3 --> n1
end
key1["key: user_4821\n→ Shard B"]
key2["key: user_9901\n→ Shard A"]
Virtual nodes (vnodes) — each physical shard owns many ring positions for even load after node failure. Typical: 256 vnodes per Cassandra node.
Cross-shard operations
Operations without shard key in the WHERE clause hit every shard:
- Scatter-gather — query all shards, merge in coordinator (latency = slowest shard).
- Global secondary indexes — index table sharded differently = double write + eventual consistency.
- Cross-shard transactions — 2PC is slow and fragile; prefer saga, per-shard transactions, or redesign.
- JOINs across shards — application-level join or denormalize; SQL JOINs don't span Vitess shards without careful schema.
Propose colocated tables — orders and order_items share user_id shard key so order history
queries stay single-shard. Reserve cross-shard for rare admin/analytics with async aggregation.
Vitess — sharded MySQL at YouTube scale
Vitess sits between app and MySQL shards: VTGate routes SQL by shard key, VReplication handles resharding with minimal downtime, connection pooling built-in. Used by Slack, HubSpot, Square. Alternative: Citus for PostgreSQL (distributed tables + reference tables).
-- Vitess: colocated sharding schema
CREATE TABLE users (
user_id BIGINT NOT NULL,
email VARCHAR(255),
PRIMARY KEY (user_id)
);
CREATE TABLE orders (
order_id BIGINT NOT NULL,
user_id BIGINT NOT NULL,
total DECIMAL(10,2),
PRIMARY KEY (order_id),
KEY idx_user (user_id)
) VITESS_SHARDING_KEY = user_id; -- same shard as users row
100 shards × 5K write QPS each = 500K aggregate writes/sec. If p99 shard size diverges 3×, hottest shard caps effective throughput at 167K — monitor shard QPS variance, not just aggregate.
4. NoSQL Document Stores
Document databases model data as self-contained JSON/BSON documents. Schema flexibility accelerates product iteration; the design question shifts from normalization to embed vs reference and aggregation pipeline economics.
When documents win
- Heterogeneous records (CMS, catalogs, user profiles with optional fields).
- Read-heavy access patterns that fetch whole object graphs.
- Rapid schema evolution without ALTER TABLE migrations.
- Horizontal scale via sharded collections (MongoDB sharded cluster, DynamoDB partitions).
Embed vs reference
| Pattern | Use when | Avoid when |
|---|---|---|
| Embed (denormalize) | 1:few relationship; data read together; child doesn't exist independently | Unbounded arrays (comments on viral post → 16 MB doc limit); shared entities updated often |
| Reference (normalize) | 1:many unbounded; entity shared across parents; independent lifecycle | Every read needs join-like $lookup → latency + complexity |
| Hybrid | Embed summary + reference ID for detail (author name + author_id) | Stale embedded copies if update frequency high |
// EMBED: order with line items (bounded, always read together)
{
"_id": ObjectId("..."),
"userId": "u_4821",
"items": [
{ "sku": "WIDGET-1", "qty": 2, "price": 19.99 },
{ "sku": "WIDGET-2", "qty": 1, "price": 49.99 }
],
"total": 89.97,
"status": "shipped"
}
// REFERENCE: product reviews (unbounded, paginated separately)
{
"_id": ObjectId("..."),
"productId": "prod_991",
"rating": 5,
"text": "Great widget",
"userId": "u_100"
}
MongoDB 16 MB document limit — embedding all messages in a chat room document breaks at scale. Bucket pattern: one doc per 100 messages or reference collection with time-range queries.
Aggregation pipeline
Server-side pipeline stages: $match → $group → $sort → $lookup (join).
Push filtering early ($match first) to use indexes. Heavy analytics often belongs in warehouse (BigQuery) not OLTP Mongo.
// Top 10 products by revenue last 30 days
db.orders.aggregate([
{ $match: { createdAt: { $gte: ISODate("2026-05-08") }, status: "paid" } },
{ $unwind: "$items" },
{ $group: {
_id: "$items.sku",
revenue: { $sum: { $multiply: ["$items.qty", "$items.price"] } },
units: { $sum: "$items.qty" }
}},
{ $sort: { revenue: -1 } },
{ $limit: 10 }
]);
MongoDB sharding recap
- Shard key — immutable; choose compound key for locality (tenant_id + user_id).
- Chunk migration — balancer moves 64 MB chunks; jumbo chunks block balance.
- Transactions — multi-doc ACID within same shard since 4.0; cross-shard since 4.2 with performance cost.
Document vs SQL — documents eliminate JOIN latency for object reads; SQL wins multi-entity reporting, strict constraints, and mature tooling. Many teams use PostgreSQL JSONB as middle ground.
5. Wide-Column Stores
Cassandra and HBase optimize for write-heavy, geographically distributed workloads with predictable partition-key access. Data model is query-driven — you design tables around queries, not entities. Getting partition/cluster keys wrong is expensive to fix.
Data model: partition key + cluster key
- Partition key — determines which node(s) store the row; all rows with same partition key live on same replica set.
- Cluster key (clustering columns) — sort order within partition; enables range scans inside one partition.
- Primary key = partition key + optional cluster key(s).
-- Messages by user: partition = user_id, cluster = conversation_id, sent_at
CREATE TABLE messages_by_user (
user_id UUID,
conversation_id UUID,
sent_at TIMESTAMP,
message_id UUID,
body TEXT,
PRIMARY KEY ((user_id), conversation_id, sent_at)
) WITH CLUSTERING ORDER BY (conversation_id ASC, sent_at DESC);
-- Query: all messages in one conversation for user (single partition)
SELECT * FROM messages_by_user
WHERE user_id = ? AND conversation_id = ? LIMIT 50;
flowchart LR
subgraph node1["Node 1"]
p1["Partition: user_A\nrows sorted by cluster key"]
end
subgraph node2["Node 2"]
p2["Partition: user_B"]
end
subgraph node3["Node 3"]
p3["Partition: user_C"]
end
router["Coordinator"] -->|hash partition key| node1
router --> node2
router --> node3
Tunable consistency
Cassandra trades latency vs staleness per query via ConsistencyLevel and replication factor (RF).
QUORUM with RF=3 means 2 replicas must respond. LOCAL_QUORUM avoids cross-DC latency in multi-region.
| Level | Behavior (RF=3) | Use case |
|---|---|---|
| ONE | Single replica responds | Metrics, low-stakes reads; may be stale |
| QUORUM | Majority (2/3) agree | Default strong-ish read/write balance |
| ALL | All replicas must respond | Rare; one slow node blocks everything |
| LOCAL_QUORUM | Quorum in local DC only | Multi-DC without cross-WAN on every read |
Hinted handoff + read repair — temporary replica failure doesn't lose writes; anti-entropy fixes divergence on read. Lightweight transactions (LWT / Paxos) provide compare-and-set but are 4× slower — use sparingly.
Anti-patterns
- Hot partitions — celebrity user_id absorbs all writes for partition; salt key (
user_id:bucket). - ALLOW FILTERING — full table scan disguised as query; kills cluster at scale.
- Secondary indexes on high-cardinality columns — hidden scatter-gather; use materialized views or duplicate tables per query.
- Large partitions — >100 MB or millions of rows per partition causes compaction/repair pain; target <10 MB.
- Multi-DC strong consistency everywhere — WAN latency dominates; use LOCAL_QUORUM per DC.
HBase (brief)
HBase runs on HDFS — strong consistency per row, sorted by row key, built for billions of rows scanned in range. Write path through WAL + MemStore; reads merge StoreFiles. Used when Hadoop ecosystem already present (Facebook Messages history). Ops-heavy vs Cassandra's peer-to-peer simplicity; prefer Cassandra/Scylla for greenfield unless HDFS integration required.
Netflix uses Cassandra for viewing history and recommendations metadata. Apple runs one of the largest Cassandra fleets (Petabyte scale). Discord migrated message storage to ScyllaDB (Cassandra-compatible, C++ performance).
6. Key-Value Stores
The simplest interface — get/put by key — enables sub-millisecond latency at massive QPS. Redis is a data structures server; Memcached is pure cache; DynamoDB is durable managed KV with pay-per-request scaling.
Redis data structures
| Structure | Operations | Use cases |
|---|---|---|
| String | GET/SET/INCR, BITOP | Cache values, counters, distributed locks (with caution) |
| Hash | HGET/HSET field-level | Object cache without serializing whole JSON |
| List | LPUSH/RPOP, blocking pop | Queues, recent items (cap with LTRIM) |
| Set / Sorted Set | SADD, ZADD with score | Unique tags, leaderboards, rate limit windows |
| HyperLogLog | PFADD/PFCOUNT | Cardinality (~12 KB for billions of unique visitors) |
| Stream | XADD/XREADGROUP | Lightweight event log (not Kafka replacement) |
# Leaderboard: top 10 by score
ZADD leaderboard 9842.5 user:4821
ZADD leaderboard 9100.0 user:9901
ZREVRANGE leaderboard 0 9 WITHSCORES
# Rate limit: 100 requests per minute per IP
INCR ratelimit:192.168.1.1:202606071430
EXPIRE ratelimit:192.168.1.1:202606071430 60
# Cache-aside pattern (pseudo)
# val = GET key; if miss: val = db.query(); SET key val EX 300
Redis persistence
- RDB — point-in-time snapshots; fast restart; may lose last minutes of writes.
- AOF — append every write;
appendfsync everysectypical; more durable, larger files. - Neither — pure cache; acceptable if source of truth elsewhere.
- Redis Cluster — 16,384 hash slots across masters; replica failover; no multi-key ACID across slots.
flowchart TB client[Client] n1["Master 1\nslots 0-5460"] n2["Master 2\nslots 5461-10922"] n3["Master 3\nslots 10923-16383"] r1[Replica] r2[Replica] r3[Replica] client -->|CRC16 key → slot| n1 client --> n2 client --> n3 n1 --> r1 n2 --> r2 n3 --> r3
DynamoDB
- Partition key — required; 10 GB per partition soft limit; hot keys throttle.
- Sort key — optional; enables range queries within partition.
- GSI (Global Secondary Index) — alternate access pattern with own partition key; eventual consistency; extra WCU/RCU.
- Streams — ordered change feed per shard; triggers Lambda, CDC to Elasticsearch, audit.
- On-demand vs provisioned — on-demand for spiky/unknown; provisioned + auto-scaling for predictable cost at scale.
{
"TableName": "Orders",
"KeySchema": [
{ "AttributeName": "userId", "KeyType": "HASH" },
{ "AttributeName": "orderId", "KeyType": "RANGE" }
],
"GlobalSecondaryIndexes": [{
"IndexName": "ByStatus",
"KeySchema": [
{ "AttributeName": "status", "KeyType": "HASH" },
{ "AttributeName": "createdAt", "KeyType": "RANGE" }
],
"Projection": { "ProjectionType": "INCLUDE", "NonKeyAttributes": ["total", "userId"] }
}],
"StreamSpecification": { "StreamViewType": "NEW_AND_OLD_IMAGES" }
}
Memcached vs Redis
| Aspect | Memcached | Redis |
|---|---|---|
| Data model | Strings only | Rich structures, Lua scripting |
| Persistence | None (pure cache) | Optional RDB/AOF |
| Threading | Multi-threaded (scale vertical) | Single-threaded command loop (I/O threads in 6+) |
| When to pick | Simple cache layer at Facebook scale | Cache + structures + pub/sub + lightweight queue |
Redis as primary database without persistence/replication — data loss on restart. If durability matters, use DynamoDB/RDS or Redis with AOF + replicas + explicit backup strategy.
7. Time-Series Databases
Metrics, IoT sensor readings, and financial ticks share a pattern: append-only, time-ordered, high cardinality labels, aggregations over windows, aggressive retention downsampling. General OLTP databases fail here — inserts become index bloat hell.
Workload characteristics
- Write-heavy — millions of points/sec (Prometheus remote write, Influx line protocol).
- Time-range queries — last 1h, GROUP BY 1m buckets; rarely point lookups on old data.
- Retention tiers — raw 7d → 1min aggregates 90d → 1h aggregates 2y.
- Cardinality explosion — high-cardinality labels (user_id on every metric) OOM the index.
| Engine | Model | Strengths | Weaknesses |
|---|---|---|---|
| InfluxDB | Tag/set schema, TSM storage | DevOps metrics native; Flux/SQL queries | Cardinality limits; clustering in enterprise tier |
| TimescaleDB | PostgreSQL extension (hypertables) | SQL familiarity, JOIN with relational data | Not as fast as dedicated TSDB at extreme ingest |
| Prometheus | Pull metrics, local TSDB | K8s ecosystem standard; PromQL | Long-term storage needs Thanos/Mimir/Cortex |
| ClickHouse | Columnar OLAP | Billions rows/sec ingest; SQL analytics | Not a metrics scraper; batch insert tuning required |
-- TimescaleDB hypertable + continuous aggregate
CREATE TABLE metrics (
time TIMESTAMPTZ NOT NULL,
service TEXT NOT NULL,
endpoint TEXT NOT NULL,
latency_ms DOUBLE PRECISION,
status INT
);
SELECT create_hypertable('metrics', 'time');
CREATE MATERIALIZED VIEW metrics_hourly
WITH (timescaledb.continuous) AS
SELECT time_bucket('1 hour', time) AS bucket,
service, endpoint,
avg(latency_ms) AS avg_latency,
percentile_cont(0.99) WITHIN GROUP (ORDER BY latency_ms) AS p99,
count(*) AS requests
FROM metrics
GROUP BY bucket, service, endpoint;
# Prometheus recording rule + alert
groups:
- name: api_slo
rules:
- record: job:http_request_duration_p99:5m
expr: histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, job))
- alert: HighLatencyP99
expr: job:http_request_duration_p99:5m > 0.5
for: 5m
labels:
severity: page
Cap label cardinality in Prometheus: use http_route="/users/:id" not raw path with IDs.
Drop high-cardinality labels at scrape or via relabel configs — prevents TSDB corruption.
Prometheus pull vs push (Influx) — pull gives service discovery and health; push (StatsD, Telegraf) needed for short-lived jobs and IoT devices behind NAT.
8. Search Engines
Relational LIKE '%foo%' scans every row. Search engines invert the problem: pre-tokenize text into postings lists
for microsecond lookups at billion-document scale. Elasticsearch is a derived view — never your financial source of truth.
Inverted index
Documents are tokenized (analyzed) into terms. Each term maps to a postings list of (doc_id, positions, scores). Query "quick brown fox" intersects postings lists; BM25 ranks by term frequency and document length normalization.
flowchart LR d1["Doc 1: quick brown fox"] d2["Doc 2: slow red fox"] d3["Doc 3: quick rabbit"] idx["Inverted index"] t1["fox → doc1, doc2"] t2["quick → doc1, doc3"] t3["brown → doc1"] d1 --> idx d2 --> idx d3 --> idx idx --> t1 idx --> t2 idx --> t3
Core concepts
- Index — logical namespace (like database).
- Shard — Lucene index instance; primary + replicas for HA and read scale.
- Analyzers — tokenizer + filters (lowercase, stemming, stop words); language-specific.
- Mappings — field types (
textanalyzed vskeywordexact); wrong mapping requires reindex. - Aggregations — facets, histograms, nested metrics without hitting primary DB.
PUT /products
{
"mappings": {
"properties": {
"title": { "type": "text", "analyzer": "english" },
"sku": { "type": "keyword" },
"category": { "type": "keyword" },
"price": { "type": "float" },
"tags": { "type": "keyword" },
"created_at": { "type": "date" }
}
}
}
GET /products/_search
{
"query": {
"bool": {
"must": [{ "match": { "title": "wireless keyboard" } }],
"filter": [{ "term": { "category": "electronics" } }]
}
},
"aggs": { "by_tag": { "terms": { "field": "tags", "size": 10 } } }
}
ELK sync via Kafka (CDC pattern)
Primary DB commits → Debezium captures binlog/WAL → Kafka topic → Elasticsearch sink connector indexes documents. Near-real-time search (1–5s lag) without dual-write race conditions.
flowchart LR pg[(PostgreSQL)] debezium[Debezium CDC] kafka[[Kafka topic:\nproducts.changes]] connect[ES Sink Connector] es[(Elasticsearch)] app[Search API] pg -->|WAL| debezium debezium --> kafka kafka --> connect connect -->|bulk index| es app -->|query| es
Dual-write to DB + ES — crash between writes causes drift. Use outbox pattern or CDC. Reindex jobs required when mapping changes — plan blue/green index alias swap.
Shopify indexes catalog via pipeline from MySQL to Elasticsearch. Uber uses Elasticsearch for trip/search geo queries with dedicated indexing fleet fed by Kafka.
9. Graph Databases
When the query IS the relationship — "friends of friends within 3 hops," "shortest path between two routers," "detect fraud ring" — relational JOINs on edge tables degrade as depth grows. Graph stores index adjacency natively.
Property graph model
- Nodes — entities (User, Account, Device) with properties.
- Edges — typed relationships (FOLLOWS, PAID, LOGGED_IN_FROM) with properties and direction.
- Traversal — follow pointers index-free per hop; cost ~O(edges traversed) not O(table size).
Neo4j vs Amazon Neptune
| Aspect | Neo4j | Amazon Neptune |
|---|---|---|
| Query languages | Cypher (declarative, SQL-like) | Gremlin (imperative), openCypher, SPARQL (RDF) |
| Deployment | Self-hosted, AuraDB managed | Fully managed AWS; IAM/VPC integration |
| Scale | Billions of relationships; sharding immature (Fabric for federated) | Up to 128 TB storage; read replicas |
| Best for | Dev velocity, complex Cypher, on-prem | AWS-native, compliance, multi-model (property + RDF) |
// Friends-of-friends recommendation (depth 2)
MATCH (me:User {id: $userId})-[:FOLLOWS]->(friend)-[:FOLLOWS]->(fof)
WHERE NOT (me)-[:FOLLOWS]->(fof) AND me <> fof
RETURN fof.id, count(*) AS mutualFriends
ORDER BY mutualFriends DESC
LIMIT 20;
// Fraud: shared device within 2 hops of flagged account
MATCH (flagged:Account {risk: 'high'})-[:USED_DEVICE|TRANSFERRED*1..2]-(suspect:Account)
WHERE suspect.risk IS NULL
RETURN suspect.id, count(DISTINCT flagged) AS proximityScore;
Use cases
- Social graphs — follow suggestions, feed ranking inputs (often hybrid with cache).
- Fraud / AML — connected components, cycle detection in payment graphs.
- Identity / IAM — permission inheritance, group nesting (Google Zanzibar-inspired).
- Network / IT ops — dependency mapping, blast radius analysis.
- Knowledge graphs — RAG entity linking, semantic search enrichment.
Graph vs relational — depth-1 relationships (user → orders) stay in SQL. Graph wins at depth ≥3 or variable-length paths. Most "graph problems" at FAANG scale use custom adjacency in KV + batch precompute.
For "design Facebook friend suggestions," mention graph traversal but also offline batch jobs (Spark) computing candidates, Redis serving precomputed lists — pure online graph traversal doesn't scale to 3B users.
10. Database Replication
Replication provides durability, read scale, and geographic proximity — at the cost of consistency complexity. Three topologies dominate system design interviews: single-leader, multi-leader, and leaderless (Dynamo-style).
Single-leader (primary/replica)
All writes go to the leader; replicas apply the replication log (PostgreSQL WAL, MySQL binlog). Reads can hit replicas with lag. Failover promotes a replica — risk of split-brain without fencing (STONITH, etcd lease).
flowchart TB w1[Writer clients] r1[Reader clients] leader[(Leader DB)] rep1[(Replica 1)] rep2[(Replica 2)] w1 -->|writes| leader r1 -->|reads| rep1 r1 -->|strong read| leader leader -->|async/sync replication| rep1 leader --> rep2
- Synchronous replica — write ACK after replica confirms; survives leader loss without data loss; higher latency.
- Asynchronous replica — leader ACK before replica; faster; failover may lose last seconds.
- Semi-sync — at least one replica ACK (MySQL, PostgreSQL synchronous_standby_names).
Multi-leader (active/active)
Multiple nodes accept writes; changes replicate bidirectionally. Enables multi-region write locality (user writes to nearest DC). Write conflicts inevitable — resolve via LWW (last-write-wins), version vectors, or application merge.
flowchart LR dc1["DC US-East\nLeader A"] dc2["DC EU-West\nLeader B"] dc3["DC APAC\nLeader C"] dc1 <-->|async repl| dc2 dc2 <-->|async repl| dc3 dc1 <-->|async repl| dc3 u1[US users] --> dc1 u2[EU users] --> dc2
Multi-leader on same entity — two users editing same document in different regions → conflict. Restrict multi-leader to disjoint data (user_id region affinity) or use CRDTs for merge-safe fields.
Leaderless (Dynamo-style quorum)
No leader — client writes to W replicas, reads from R replicas. Quorum condition R + W > N (N = replication factor) guarantees overlap: read sees at least one node with latest write (with version reconciliation).
| Parameter | Typical (N=3) | Effect |
|---|---|---|
| N (replication factor) | 3 | Copies of each key across cluster |
| W (write quorum) | 2 | Higher W → stronger durability, slower writes |
| R (read quorum) | 2 | Higher R → fresher reads, slower reads |
| R + W > N | 2 + 2 > 3 ✓ | Strong consistency possible with version merge |
| R + W ≤ N | 1 + 1 ≤ 3 | Eventual consistency only (faster, stale reads) |
sequenceDiagram participant C as Client participant N1 as Node 1 participant N2 as Node 2 participant N3 as Node 3 Note over N1,N3: N=3, W=2, R=2 C->>N1: write v2 C->>N2: write v2 N1-->>C: ACK N2-->>C: ACK Note over C: Write quorum satisfied C->>N2: read C->>N3: read N2-->>C: v2 N3-->>C: v1 stale Note over C: Return v2 (latest timestamp wins)
Topology comparison
| Topology | Consistency | Write latency | Conflict handling | Examples |
|---|---|---|---|---|
| Single-leader | Strong on leader; eventual on replicas | Low (local leader) | None (serial writes) | PostgreSQL, MySQL, MongoDB default |
| Multi-leader | Eventual; conflict resolution required | Low per region | LWW, merge, CRDT | CouchDB, old MySQL circular repl |
| Leaderless | Tunable via R/W/N | Depends on W and RTT | Version vectors, read repair | Cassandra, DynamoDB, Riak |
Read repair — during quorum read, stale replicas get updated in background. Anti-entropy — Merkle tree comparison between replicas fixes drift without waiting for reads. DynamoDB uses leaderless per partition with leader for coordination internally — interview simplification OK.
State explicit numbers: "RF=3, LOCAL_QUORUM writes in Cassandra → tolerate 1 node loss per DC without write failure. For checkout, stick to single-leader PostgreSQL with sync replica — leaderless is wrong tool for financial invariants."
Cross-region sync replication adds RTT (~150ms US-EU) to every write. Multi-region active/active trades that latency for local writes — quantify: 10K writes/sec × 150ms = need massive write pipelining or accept async replication lag.