Database Design & Selection

1. Database Selection Framework

Start with access patterns, not brand names. Every store optimizes for a workload shape: point reads, range scans, aggregations, full-text search, graph traversals, or append-only time series. The framework below maps requirements to families—and explicitly states when not to use each.

The five questions before you pick

Read vs write ratio — 90% reads favors replicas + cache; write-heavy favors LSM/wide-column.
Query shape — point lookup, range scan, join, aggregation, geo, full-text, traversal?
Consistency needs — financial ledger (strong) vs social likes (eventual)?
Scale trajectory — single node OK for 2 years, or 100M rows day one?
Operational maturity — managed service vs self-hosted Cassandra cluster?

flowchart TD
  start(["New data store needed"]) --> q1{"Structured schema\nwith joins?"}
  q1 -->|Yes| sql["Relational SQL\nPostgreSQL / MySQL"]
  q1 -->|No| q2{"Document-shaped\nnested JSON?"}
  q2 -->|Yes| doc["Document DB\nMongoDB / DynamoDB"]
  q2 -->|No| q3{"Massive write\nthroughput?"}
  q3 -->|Yes| wc["Wide-column\nCassandra / HBase"]
  q3 -->|No| q4{"Sub-ms latency\ncache/session?"}
  q4 -->|Yes| kv["Key-value\nRedis / Memcached"]
  q4 -->|No| q5{"Time-ordered\nmetrics/logs?"}
  q5 -->|Yes| ts["Time-series\nInfluxDB / TimescaleDB / ClickHouse"]
  q5 -->|No| q6{"Full-text or\nfaceted search?"}
  q6 -->|Yes| search["Search engine\nElasticsearch / OpenSearch"]
  q6 -->|No| q7{"Relationship\ntraversals?"}
  q7 -->|Yes| graphdb["Graph DB\nNeo4j / Neptune"]
  q7 -->|No| sql

Decision matrix by store family

Family	Best for	Weak at	Typical scale	Examples
Relational (SQL)	ACID transactions, joins, ad-hoc queries, reporting	Horizontal write scale without sharding; schema rigidity	10 TB single node; sharded to PB with Vitess/Citus	PostgreSQL, MySQL, Aurora
Document	Flexible schema, nested objects, rapid iteration	Multi-document ACID (limited), unbounded document growth	100M+ docs per collection with sharding	MongoDB, CouchDB, DynamoDB (doc model)
Wide-column	High write throughput, time-series-ish rows, geo-distribution	Ad-hoc joins, secondary index fan-out, range queries without partition key	PB across datacenters	Cassandra, HBase, ScyllaDB
Key-value	Cache, sessions, rate limits, feature flags, leaderboards	Complex queries, durability without tuning (Redis)	100K–1M ops/sec per node	Redis, Memcached, DynamoDB
Time-series	Metrics, IoT, observability, downsampling	General-purpose OLTP, mutable row updates	Millions of writes/sec with batching	InfluxDB, TimescaleDB, Prometheus, ClickHouse
Search	Full-text, fuzzy match, facets, relevance ranking	Source of truth, strong consistency with primary DB	TB indexes with sharding	Elasticsearch, OpenSearch, Solr
Graph	Short-path queries, fraud rings, recommendations, IAM	Bulk analytics, simple key lookups at extreme scale	10B edges (Neptune), smaller for Neo4j	Neo4j, Amazon Neptune, JanusGraph

Polyglot persistence — the norm at scale

Production systems rarely use one database. Uber runs MySQL (core trips), Cassandra (location), Redis (cache), Elasticsearch (search). The pattern: PostgreSQL as system of record, specialized stores as derived views. Sync via CDC (Debezium → Kafka) or dual-write (riskier). Interview signal: name the primary store and justify satellites.

🏆 Senior Signal

L5: "We'll use PostgreSQL." L6: "PostgreSQL for orders and inventory with row-level locking on stock decrements; Redis cache-aside for product catalog (95% read); Elasticsearch fed by Kafka CDC for search—eventual consistency acceptable with 2s refresh SLA. DynamoDB only if we need single-digit-ms global reads without managing replicas."

⚖️ Trade-off

One database vs many — operational simplicity and JOINs vs best-fit performance per workload. Start monolithic SQL; split when measured pain (write QPS, query latency, team boundaries) exceeds migration cost.

🎯 Interview Tip

When asked "SQL or NoSQL?", reframe: "What's the access pattern?" Walk through the five questions aloud. Interviewers reward structured elimination over premature technology choice.

2. Relational Databases

Four decades of battle-testing make RDBMS the default system of record. Understanding ACID, isolation anomalies, indexing, and connection economics separates engineers who "use Postgres" from those who keep it alive at 50K QPS.

ACID properties

Atomicity — all statements in a transaction commit or none do (WAL + rollback segments).
Consistency — constraints (FK, CHECK, UNIQUE) hold after every committed transaction.
Isolation — concurrent transactions behave as if serialized (level-dependent).
Durability — committed data survives crash (fsync WAL before ACK).

Isolation levels and anomalies

ANSI SQL defines four isolation levels. PostgreSQL implements three (Read Uncommitted behaves as Read Committed). Higher isolation = fewer anomalies, more locking/retries, lower throughput.

Level	Dirty read	Non-repeatable read	Phantom read	Implementation notes
Read Uncommitted	Possible	Possible	Possible	PostgreSQL upgrades to Read Committed; MySQL InnoDB rarely used
Read Committed	No	Possible	Possible	Default in PostgreSQL; each statement sees latest committed snapshot
Repeatable Read	No	No	Possible*	MySQL InnoDB default; PostgreSQL blocks phantoms via predicate locks
Serializable	No	No	No	SSI in PostgreSQL; highest safety, serialization failures → retry

* PostgreSQL Repeatable Read prevents phantom reads for most cases via MVCC + predicate locking.

⚠️ Pitfall

Lost updates happen at Read Committed without explicit locking. Use SELECT … FOR UPDATE, optimistic versioning (UPDATE … WHERE version = ?), or Serializable isolation for inventory/financial counters.

Indexes — B-tree, hash, GIN, partial

B-tree (default) supports equality and range on sortable columns. Composite index column order matters: (user_id, created_at) serves WHERE user_id = ? and WHERE user_id = ? AND created_at > ? but not WHERE created_at > ? alone.

Covering index — INCLUDE columns avoid heap fetch (index-only scan).
Partial index — WHERE status = 'active' shrinks index size 10× for hot subsets.
GIN/GiST — JSONB, full-text, geo in PostgreSQL.
Write amplification — every index doubles insert cost; over-indexing kills write-heavy tables.

-- Composite + partial index for active user orders
CREATE INDEX idx_orders_user_active
  ON orders (user_id, created_at DESC)
  WHERE status NOT IN ('cancelled', 'archived');

-- Covering index: avoid heap lookup for list view
CREATE INDEX idx_products_list
  ON products (category_id, price)
  INCLUDE (name, thumbnail_url);

-- Explain analyze before shipping
EXPLAIN (ANALYZE, BUFFERS)
SELECT id, name, price FROM products
WHERE category_id = 42 AND price BETWEEN 10 AND 100
ORDER BY price LIMIT 20;

The N+1 query problem

Loading 100 posts then issuing 100 queries for authors = 101 round trips (~101 × 1ms = 101ms vs 2ms with JOIN). ORMs lazy-load by default—fix with eager loading, JOIN FETCH, or batch IN queries.

// BAD: N+1 — 1 query for posts + N for each author
List<Post> posts = postRepo.findByUserId(userId);
posts.forEach(p -> p.getAuthor().getName()); // lazy load each author

// GOOD: single JOIN or @EntityGraph
@Query("SELECT p FROM Post p JOIN FETCH p.author WHERE p.userId = :uid")
List<Post> findWithAuthor(@Param("uid") Long userId);

Connection pooling

PostgreSQL uses ~10 MB RAM per connection and a dedicated backend process. Opening a connection costs 1–5 ms. At 500 app instances × 20 connections = 10,000 backends — the database dies before your app does. PgBouncer (transaction pooling) multiplexes thousands of clients onto ~100 server connections.

Pool size rule of thumb — connections = (core_count × 2) + effective_spindle_count (PgBouncer docs); often 20–50 per primary.
Transaction vs session pooling — transaction mode breaks prepared statements and temp tables; session mode safer for ORMs.
RDS Proxy / Aurora — managed pooling + IAM auth + failover.

Read replicas

Streaming replication ships WAL to standbys. Replicas serve read traffic; lag typically 10ms–seconds under load. Replica lag means read-your-writes is NOT guaranteed unless routed to primary or tracked LSN.

flowchart LR
  app[Application]
  primary[(Primary PostgreSQL)]
  replica1[(Read Replica 1)]
  replica2[(Read Replica 2)]
  app -->|writes + strong reads| primary
  app -->|analytics / lists| replica1
  app -->|search indexing| replica2
  primary -->|WAL stream| replica1
  primary -->|WAL stream| replica2

⚖️ Trade-off

Read replicas vs caching — replicas give fresh-ish data without cache invalidation complexity; cache gives sub-ms latency. Common pattern: cache hot keys, replica for heavy reporting queries that tolerate lag.

PostgreSQL vs MySQL

Dimension	PostgreSQL	MySQL (InnoDB)
SQL compliance / features	Window functions, CTEs, JSONB, arrays, full-text — richer ad-hoc analytics	Simpler subset; 8.0+ CTEs/window functions; JSON type less ergonomic
Replication	Streaming + logical decoding (CDC); built-in partitioning	Binlog (row/statement); mature multi-source; Vitess ecosystem
Default isolation	Read Committed	Repeatable Read (gap locks differ from PG)
Typical adopters	Startups, analytics-heavy, GIS (PostGIS), greenfield	Legacy web, Shopify-scale sharded MySQL, China ecosystem
Interview default	Prefer unless interviewer says MySQL or existing fleet	Cite when discussing Uber/Vitess, Facebook history

📦 Real World

Instagram started PostgreSQL, migrated to Cassandra for feed storage at scale — but metadata stayed relational. Notion runs PostgreSQL with aggressive indexing and sharding via Vitess-like patterns for block storage metadata.

L5+ L6: isolation + pooling + replica lag

3. Sharding

When vertical scaling and read replicas exhaust headroom, you partition data across nodes. Sharding is irreversible architecture — shard key choice determines whether you scale linearly or fight hot spots forever.

When to shard

Single-node storage > 2–5 TB or write QPS exceeds ~10–50K sustained (hardware dependent).
Replication lag unacceptable and writes are the bottleneck.
Regulatory data residency requires geographic partitions.

Before sharding: archive cold data, denormalize, connection pool, read replicas, caching — each buys 10× cheaper than sharding.

Shard key selection

The shard key routes every query to the correct partition. Ideal properties:

High cardinality — uniform distribution (user_id ✓, country_code ✗).
Query alignment — most queries include shard key (avoid cross-shard scatter-gather).
Immutability — changing shard key requires row migration.
Monotonic keys avoided — auto-increment IDs concentrate writes on latest shard (use hash or snowflake IDs).

⚠️ Pitfall

Sharding by tenant_id when one tenant is 80% of traffic recreates a monolith on one shard. Mitigate: dedicated shard for whale tenants, sub-shard by user_id within tenant, or rate-limit enterprise tier.

Sharding strategies

Strategy	Routing	Pros	Cons
Hash sharding	`shard = hash(key) mod N`	Even distribution with good keys	Resharding requires remapping most keys when N changes
Range sharding	Key ranges per shard (A–M, N–Z)	Range scans local; easy time-based archival	Hot spots on latest range (time-ordered keys)
Directory sharding	Lookup table maps key → shard	Flexible reassignment, whale tenant isolation	Lookup service is SPOF; must cache and replicate
Geographic sharding	Region in key or directory	GDPR, latency compliance	Cross-region queries expensive; user migration hard

Consistent hashing

Plain mod N remaps almost all keys when adding a shard. Consistent hashing places shards and keys on a ring; adding a node only moves keys between adjacent neighbors (~K/N keys move). Used in DynamoDB, Cassandra, Redis Cluster, CDNs.

flowchart TB
  subgraph ring["Hash ring (virtual nodes)"]
    direction TB
    n1["Shard A\nvnodes 0-85"]
    n2["Shard B\nvnodes 86-170"]
    n3["Shard C\nvnodes 171-255"]
    n1 --> n2 --> n3 --> n1
  end
  key1["key: user_4821\n→ Shard B"]
  key2["key: user_9901\n→ Shard A"]

Virtual nodes (vnodes) — each physical shard owns many ring positions for even load after node failure. Typical: 256 vnodes per Cassandra node.

Cross-shard operations

Operations without shard key in the WHERE clause hit every shard:

Scatter-gather — query all shards, merge in coordinator (latency = slowest shard).
Global secondary indexes — index table sharded differently = double write + eventual consistency.
Cross-shard transactions — 2PC is slow and fragile; prefer saga, per-shard transactions, or redesign.
JOINs across shards — application-level join or denormalize; SQL JOINs don't span Vitess shards without careful schema.

🏆 Senior Signal

Propose colocated tables — orders and order_items share user_id shard key so order history queries stay single-shard. Reserve cross-shard for rare admin/analytics with async aggregation.

Vitess — sharded MySQL at YouTube scale

Vitess sits between app and MySQL shards: VTGate routes SQL by shard key, VReplication handles resharding with minimal downtime, connection pooling built-in. Used by Slack, HubSpot, Square. Alternative: Citus for PostgreSQL (distributed tables + reference tables).

-- Vitess: colocated sharding schema
CREATE TABLE users (
  user_id BIGINT NOT NULL,
  email VARCHAR(255),
  PRIMARY KEY (user_id)
);

CREATE TABLE orders (
  order_id BIGINT NOT NULL,
  user_id BIGINT NOT NULL,
  total DECIMAL(10,2),
  PRIMARY KEY (order_id),
  KEY idx_user (user_id)
) VITESS_SHARDING_KEY = user_id;  -- same shard as users row

📐 Estimation

100 shards × 5K write QPS each = 500K aggregate writes/sec. If p99 shard size diverges 3×, hottest shard caps effective throughput at 167K — monitor shard QPS variance, not just aggregate.

4. NoSQL Document Stores

Document databases model data as self-contained JSON/BSON documents. Schema flexibility accelerates product iteration; the design question shifts from normalization to embed vs reference and aggregation pipeline economics.

When documents win

Heterogeneous records (CMS, catalogs, user profiles with optional fields).
Read-heavy access patterns that fetch whole object graphs.
Rapid schema evolution without ALTER TABLE migrations.
Horizontal scale via sharded collections (MongoDB sharded cluster, DynamoDB partitions).

Embed vs reference

Pattern	Use when	Avoid when
Embed (denormalize)	1:few relationship; data read together; child doesn't exist independently	Unbounded arrays (comments on viral post → 16 MB doc limit); shared entities updated often
Reference (normalize)	1:many unbounded; entity shared across parents; independent lifecycle	Every read needs join-like $lookup → latency + complexity
Hybrid	Embed summary + reference ID for detail (author name + author_id)	Stale embedded copies if update frequency high

// EMBED: order with line items (bounded, always read together)
{
  "_id": ObjectId("..."),
  "userId": "u_4821",
  "items": [
    { "sku": "WIDGET-1", "qty": 2, "price": 19.99 },
    { "sku": "WIDGET-2", "qty": 1, "price": 49.99 }
  ],
  "total": 89.97,
  "status": "shipped"
}

// REFERENCE: product reviews (unbounded, paginated separately)
{
  "_id": ObjectId("..."),
  "productId": "prod_991",
  "rating": 5,
  "text": "Great widget",
  "userId": "u_100"
}

⚠️ Pitfall

MongoDB 16 MB document limit — embedding all messages in a chat room document breaks at scale. Bucket pattern: one doc per 100 messages or reference collection with time-range queries.

Aggregation pipeline

Server-side pipeline stages: $match → $group → $sort → $lookup (join). Push filtering early ($match first) to use indexes. Heavy analytics often belongs in warehouse (BigQuery) not OLTP Mongo.

// Top 10 products by revenue last 30 days
db.orders.aggregate([
  { $match: { createdAt: { $gte: ISODate("2026-05-08") }, status: "paid" } },
  { $unwind: "$items" },
  { $group: {
      _id: "$items.sku",
      revenue: { $sum: { $multiply: ["$items.qty", "$items.price"] } },
      units: { $sum: "$items.qty" }
  }},
  { $sort: { revenue: -1 } },
  { $limit: 10 }
]);

MongoDB sharding recap

Shard key — immutable; choose compound key for locality (tenant_id + user_id).
Chunk migration — balancer moves 64 MB chunks; jumbo chunks block balance.
Transactions — multi-doc ACID within same shard since 4.0; cross-shard since 4.2 with performance cost.

⚖️ Trade-off

Document vs SQL — documents eliminate JOIN latency for object reads; SQL wins multi-entity reporting, strict constraints, and mature tooling. Many teams use PostgreSQL JSONB as middle ground.

5. Wide-Column Stores

Cassandra and HBase optimize for write-heavy, geographically distributed workloads with predictable partition-key access. Data model is query-driven — you design tables around queries, not entities. Getting partition/cluster keys wrong is expensive to fix.

Data model: partition key + cluster key

Partition key — determines which node(s) store the row; all rows with same partition key live on same replica set.
Cluster key (clustering columns) — sort order within partition; enables range scans inside one partition.
Primary key = partition key + optional cluster key(s).

-- Messages by user: partition = user_id, cluster = conversation_id, sent_at
CREATE TABLE messages_by_user (
  user_id UUID,
  conversation_id UUID,
  sent_at TIMESTAMP,
  message_id UUID,
  body TEXT,
  PRIMARY KEY ((user_id), conversation_id, sent_at)
) WITH CLUSTERING ORDER BY (conversation_id ASC, sent_at DESC);

-- Query: all messages in one conversation for user (single partition)
SELECT * FROM messages_by_user
WHERE user_id = ? AND conversation_id = ? LIMIT 50;

flowchart LR
  subgraph node1["Node 1"]
    p1["Partition: user_A\nrows sorted by cluster key"]
  end
  subgraph node2["Node 2"]
    p2["Partition: user_B"]
  end
  subgraph node3["Node 3"]
    p3["Partition: user_C"]
  end
  router["Coordinator"] -->|hash partition key| node1
  router --> node2
  router --> node3

Tunable consistency

Cassandra trades latency vs staleness per query via ConsistencyLevel and replication factor (RF). QUORUM with RF=3 means 2 replicas must respond. LOCAL_QUORUM avoids cross-DC latency in multi-region.

Level	Behavior (RF=3)	Use case
ONE	Single replica responds	Metrics, low-stakes reads; may be stale
QUORUM	Majority (2/3) agree	Default strong-ish read/write balance
ALL	All replicas must respond	Rare; one slow node blocks everything
LOCAL_QUORUM	Quorum in local DC only	Multi-DC without cross-WAN on every read

🔬 Under the Hood

Hinted handoff + read repair — temporary replica failure doesn't lose writes; anti-entropy fixes divergence on read. Lightweight transactions (LWT / Paxos) provide compare-and-set but are 4× slower — use sparingly.

Anti-patterns

Hot partitions — celebrity user_id absorbs all writes for partition; salt key (user_id:bucket).
ALLOW FILTERING — full table scan disguised as query; kills cluster at scale.
Secondary indexes on high-cardinality columns — hidden scatter-gather; use materialized views or duplicate tables per query.
Large partitions — >100 MB or millions of rows per partition causes compaction/repair pain; target <10 MB.
Multi-DC strong consistency everywhere — WAN latency dominates; use LOCAL_QUORUM per DC.

HBase (brief)

HBase runs on HDFS — strong consistency per row, sorted by row key, built for billions of rows scanned in range. Write path through WAL + MemStore; reads merge StoreFiles. Used when Hadoop ecosystem already present (Facebook Messages history). Ops-heavy vs Cassandra's peer-to-peer simplicity; prefer Cassandra/Scylla for greenfield unless HDFS integration required.

📦 Real World

Netflix uses Cassandra for viewing history and recommendations metadata. Apple runs one of the largest Cassandra fleets (Petabyte scale). Discord migrated message storage to ScyllaDB (Cassandra-compatible, C++ performance).

6. Key-Value Stores

The simplest interface — get/put by key — enables sub-millisecond latency at massive QPS. Redis is a data structures server; Memcached is pure cache; DynamoDB is durable managed KV with pay-per-request scaling.

Redis data structures

Structure	Operations	Use cases
String	GET/SET/INCR, BITOP	Cache values, counters, distributed locks (with caution)
Hash	HGET/HSET field-level	Object cache without serializing whole JSON
List	LPUSH/RPOP, blocking pop	Queues, recent items (cap with LTRIM)
Set / Sorted Set	SADD, ZADD with score	Unique tags, leaderboards, rate limit windows
HyperLogLog	PFADD/PFCOUNT	Cardinality (~12 KB for billions of unique visitors)
Stream	XADD/XREADGROUP	Lightweight event log (not Kafka replacement)

# Leaderboard: top 10 by score
ZADD leaderboard 9842.5 user:4821
ZADD leaderboard 9100.0 user:9901
ZREVRANGE leaderboard 0 9 WITHSCORES

# Rate limit: 100 requests per minute per IP
INCR ratelimit:192.168.1.1:202606071430
EXPIRE ratelimit:192.168.1.1:202606071430 60

# Cache-aside pattern (pseudo)
# val = GET key; if miss: val = db.query(); SET key val EX 300

Redis persistence

RDB — point-in-time snapshots; fast restart; may lose last minutes of writes.
AOF — append every write; appendfsync everysec typical; more durable, larger files.
Neither — pure cache; acceptable if source of truth elsewhere.
Redis Cluster — 16,384 hash slots across masters; replica failover; no multi-key ACID across slots.

flowchart TB
  client[Client]
  n1["Master 1\nslots 0-5460"]
  n2["Master 2\nslots 5461-10922"]
  n3["Master 3\nslots 10923-16383"]
  r1[Replica]
  r2[Replica]
  r3[Replica]
  client -->|CRC16 key → slot| n1
  client --> n2
  client --> n3
  n1 --> r1
  n2 --> r2
  n3 --> r3

DynamoDB

Partition key — required; 10 GB per partition soft limit; hot keys throttle.
Sort key — optional; enables range queries within partition.
GSI (Global Secondary Index) — alternate access pattern with own partition key; eventual consistency; extra WCU/RCU.
Streams — ordered change feed per shard; triggers Lambda, CDC to Elasticsearch, audit.
On-demand vs provisioned — on-demand for spiky/unknown; provisioned + auto-scaling for predictable cost at scale.

{
  "TableName": "Orders",
  "KeySchema": [
    { "AttributeName": "userId", "KeyType": "HASH" },
    { "AttributeName": "orderId", "KeyType": "RANGE" }
  ],
  "GlobalSecondaryIndexes": [{
    "IndexName": "ByStatus",
    "KeySchema": [
      { "AttributeName": "status", "KeyType": "HASH" },
      { "AttributeName": "createdAt", "KeyType": "RANGE" }
    ],
    "Projection": { "ProjectionType": "INCLUDE", "NonKeyAttributes": ["total", "userId"] }
  }],
  "StreamSpecification": { "StreamViewType": "NEW_AND_OLD_IMAGES" }
}

Memcached vs Redis

Aspect	Memcached	Redis
Data model	Strings only	Rich structures, Lua scripting
Persistence	None (pure cache)	Optional RDB/AOF
Threading	Multi-threaded (scale vertical)	Single-threaded command loop (I/O threads in 6+)
When to pick	Simple cache layer at Facebook scale	Cache + structures + pub/sub + lightweight queue

⚠️ Pitfall

Redis as primary database without persistence/replication — data loss on restart. If durability matters, use DynamoDB/RDS or Redis with AOF + replicas + explicit backup strategy.

7. Time-Series Databases

Metrics, IoT sensor readings, and financial ticks share a pattern: append-only, time-ordered, high cardinality labels, aggregations over windows, aggressive retention downsampling. General OLTP databases fail here — inserts become index bloat hell.

Workload characteristics

Write-heavy — millions of points/sec (Prometheus remote write, Influx line protocol).
Time-range queries — last 1h, GROUP BY 1m buckets; rarely point lookups on old data.
Retention tiers — raw 7d → 1min aggregates 90d → 1h aggregates 2y.
Cardinality explosion — high-cardinality labels (user_id on every metric) OOM the index.

Engine	Model	Strengths	Weaknesses
InfluxDB	Tag/set schema, TSM storage	DevOps metrics native; Flux/SQL queries	Cardinality limits; clustering in enterprise tier
TimescaleDB	PostgreSQL extension (hypertables)	SQL familiarity, JOIN with relational data	Not as fast as dedicated TSDB at extreme ingest
Prometheus	Pull metrics, local TSDB	K8s ecosystem standard; PromQL	Long-term storage needs Thanos/Mimir/Cortex
ClickHouse	Columnar OLAP	Billions rows/sec ingest; SQL analytics	Not a metrics scraper; batch insert tuning required

-- TimescaleDB hypertable + continuous aggregate
CREATE TABLE metrics (
  time        TIMESTAMPTZ NOT NULL,
  service     TEXT NOT NULL,
  endpoint    TEXT NOT NULL,
  latency_ms  DOUBLE PRECISION,
  status      INT
);
SELECT create_hypertable('metrics', 'time');

CREATE MATERIALIZED VIEW metrics_hourly
WITH (timescaledb.continuous) AS
SELECT time_bucket('1 hour', time) AS bucket,
       service, endpoint,
       avg(latency_ms) AS avg_latency,
       percentile_cont(0.99) WITHIN GROUP (ORDER BY latency_ms) AS p99,
       count(*) AS requests
FROM metrics
GROUP BY bucket, service, endpoint;

# Prometheus recording rule + alert
groups:
  - name: api_slo
    rules:
      - record: job:http_request_duration_p99:5m
        expr: histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, job))
      - alert: HighLatencyP99
        expr: job:http_request_duration_p99:5m > 0.5
        for: 5m
        labels:
          severity: page

💡 Pro Tip

Cap label cardinality in Prometheus: use http_route="/users/:id" not raw path with IDs. Drop high-cardinality labels at scrape or via relabel configs — prevents TSDB corruption.

⚖️ Trade-off

Prometheus pull vs push (Influx) — pull gives service discovery and health; push (StatsD, Telegraf) needed for short-lived jobs and IoT devices behind NAT.

8. Search Engines

Relational LIKE '%foo%' scans every row. Search engines invert the problem: pre-tokenize text into postings lists for microsecond lookups at billion-document scale. Elasticsearch is a derived view — never your financial source of truth.

Inverted index

Documents are tokenized (analyzed) into terms. Each term maps to a postings list of (doc_id, positions, scores). Query "quick brown fox" intersects postings lists; BM25 ranks by term frequency and document length normalization.

flowchart LR
  d1["Doc 1: quick brown fox"]
  d2["Doc 2: slow red fox"]
  d3["Doc 3: quick rabbit"]
  idx["Inverted index"]
  t1["fox → doc1, doc2"]
  t2["quick → doc1, doc3"]
  t3["brown → doc1"]
  d1 --> idx
  d2 --> idx
  d3 --> idx
  idx --> t1
  idx --> t2
  idx --> t3

Core concepts

Index — logical namespace (like database).
Shard — Lucene index instance; primary + replicas for HA and read scale.
Analyzers — tokenizer + filters (lowercase, stemming, stop words); language-specific.
Mappings — field types (text analyzed vs keyword exact); wrong mapping requires reindex.
Aggregations — facets, histograms, nested metrics without hitting primary DB.

PUT /products
{
  "mappings": {
    "properties": {
      "title": { "type": "text", "analyzer": "english" },
      "sku": { "type": "keyword" },
      "category": { "type": "keyword" },
      "price": { "type": "float" },
      "tags": { "type": "keyword" },
      "created_at": { "type": "date" }
    }
  }
}

GET /products/_search
{
  "query": {
    "bool": {
      "must": [{ "match": { "title": "wireless keyboard" } }],
      "filter": [{ "term": { "category": "electronics" } }]
    }
  },
  "aggs": { "by_tag": { "terms": { "field": "tags", "size": 10 } } }
}

ELK sync via Kafka (CDC pattern)

Primary DB commits → Debezium captures binlog/WAL → Kafka topic → Elasticsearch sink connector indexes documents. Near-real-time search (1–5s lag) without dual-write race conditions.

flowchart LR
  pg[(PostgreSQL)]
  debezium[Debezium CDC]
  kafka[[Kafka topic:\nproducts.changes]]
  connect[ES Sink Connector]
  es[(Elasticsearch)]
  app[Search API]
  pg -->|WAL| debezium
  debezium --> kafka
  kafka --> connect
  connect -->|bulk index| es
  app -->|query| es

⚠️ Pitfall

Dual-write to DB + ES — crash between writes causes drift. Use outbox pattern or CDC. Reindex jobs required when mapping changes — plan blue/green index alias swap.

📦 Real World

Shopify indexes catalog via pipeline from MySQL to Elasticsearch. Uber uses Elasticsearch for trip/search geo queries with dedicated indexing fleet fed by Kafka.

9. Graph Databases

When the query IS the relationship — "friends of friends within 3 hops," "shortest path between two routers," "detect fraud ring" — relational JOINs on edge tables degrade as depth grows. Graph stores index adjacency natively.

Property graph model

Nodes — entities (User, Account, Device) with properties.
Edges — typed relationships (FOLLOWS, PAID, LOGGED_IN_FROM) with properties and direction.
Traversal — follow pointers index-free per hop; cost ~O(edges traversed) not O(table size).

Neo4j vs Amazon Neptune

Aspect	Neo4j	Amazon Neptune
Query languages	Cypher (declarative, SQL-like)	Gremlin (imperative), openCypher, SPARQL (RDF)
Deployment	Self-hosted, AuraDB managed	Fully managed AWS; IAM/VPC integration
Scale	Billions of relationships; sharding immature (Fabric for federated)	Up to 128 TB storage; read replicas
Best for	Dev velocity, complex Cypher, on-prem	AWS-native, compliance, multi-model (property + RDF)

// Friends-of-friends recommendation (depth 2)
MATCH (me:User {id: $userId})-[:FOLLOWS]->(friend)-[:FOLLOWS]->(fof)
WHERE NOT (me)-[:FOLLOWS]->(fof) AND me <> fof
RETURN fof.id, count(*) AS mutualFriends
ORDER BY mutualFriends DESC
LIMIT 20;

// Fraud: shared device within 2 hops of flagged account
MATCH (flagged:Account {risk: 'high'})-[:USED_DEVICE|TRANSFERRED*1..2]-(suspect:Account)
WHERE suspect.risk IS NULL
RETURN suspect.id, count(DISTINCT flagged) AS proximityScore;

Use cases

Social graphs — follow suggestions, feed ranking inputs (often hybrid with cache).
Fraud / AML — connected components, cycle detection in payment graphs.
Identity / IAM — permission inheritance, group nesting (Google Zanzibar-inspired).
Network / IT ops — dependency mapping, blast radius analysis.
Knowledge graphs — RAG entity linking, semantic search enrichment.

⚖️ Trade-off

Graph vs relational — depth-1 relationships (user → orders) stay in SQL. Graph wins at depth ≥3 or variable-length paths. Most "graph problems" at FAANG scale use custom adjacency in KV + batch precompute.

🎯 Interview Tip

For "design Facebook friend suggestions," mention graph traversal but also offline batch jobs (Spark) computing candidates, Redis serving precomputed lists — pure online graph traversal doesn't scale to 3B users.

10. Database Replication

Replication provides durability, read scale, and geographic proximity — at the cost of consistency complexity. Three topologies dominate system design interviews: single-leader, multi-leader, and leaderless (Dynamo-style).

Single-leader (primary/replica)

All writes go to the leader; replicas apply the replication log (PostgreSQL WAL, MySQL binlog). Reads can hit replicas with lag. Failover promotes a replica — risk of split-brain without fencing (STONITH, etcd lease).

flowchart TB
  w1[Writer clients]
  r1[Reader clients]
  leader[(Leader DB)]
  rep1[(Replica 1)]
  rep2[(Replica 2)]
  w1 -->|writes| leader
  r1 -->|reads| rep1
  r1 -->|strong read| leader
  leader -->|async/sync replication| rep1
  leader --> rep2

Synchronous replica — write ACK after replica confirms; survives leader loss without data loss; higher latency.
Asynchronous replica — leader ACK before replica; faster; failover may lose last seconds.
Semi-sync — at least one replica ACK (MySQL, PostgreSQL synchronous_standby_names).

Multi-leader (active/active)

Multiple nodes accept writes; changes replicate bidirectionally. Enables multi-region write locality (user writes to nearest DC). Write conflicts inevitable — resolve via LWW (last-write-wins), version vectors, or application merge.

flowchart LR
  dc1["DC US-East\nLeader A"]
  dc2["DC EU-West\nLeader B"]
  dc3["DC APAC\nLeader C"]
  dc1 <-->|async repl| dc2
  dc2 <-->|async repl| dc3
  dc1 <-->|async repl| dc3
  u1[US users] --> dc1
  u2[EU users] --> dc2

⚠️ Pitfall

Multi-leader on same entity — two users editing same document in different regions → conflict. Restrict multi-leader to disjoint data (user_id region affinity) or use CRDTs for merge-safe fields.

Leaderless (Dynamo-style quorum)

No leader — client writes to W replicas, reads from R replicas. Quorum condition R + W > N (N = replication factor) guarantees overlap: read sees at least one node with latest write (with version reconciliation).

Parameter	Typical (N=3)	Effect
N (replication factor)	3	Copies of each key across cluster
W (write quorum)	2	Higher W → stronger durability, slower writes
R (read quorum)	2	Higher R → fresher reads, slower reads
R + W > N	2 + 2 > 3 ✓	Strong consistency possible with version merge
R + W ≤ N	1 + 1 ≤ 3	Eventual consistency only (faster, stale reads)

sequenceDiagram
  participant C as Client
  participant N1 as Node 1
  participant N2 as Node 2
  participant N3 as Node 3
  Note over N1,N3: N=3, W=2, R=2
  C->>N1: write v2
  C->>N2: write v2
  N1-->>C: ACK
  N2-->>C: ACK
  Note over C: Write quorum satisfied
  C->>N2: read
  C->>N3: read
  N2-->>C: v2
  N3-->>C: v1 stale
  Note over C: Return v2 (latest timestamp wins)

Topology comparison

Topology	Consistency	Write latency	Conflict handling	Examples
Single-leader	Strong on leader; eventual on replicas	Low (local leader)	None (serial writes)	PostgreSQL, MySQL, MongoDB default
Multi-leader	Eventual; conflict resolution required	Low per region	LWW, merge, CRDT	CouchDB, old MySQL circular repl
Leaderless	Tunable via R/W/N	Depends on W and RTT	Version vectors, read repair	Cassandra, DynamoDB, Riak

🔬 Under the Hood

Read repair — during quorum read, stale replicas get updated in background. Anti-entropy — Merkle tree comparison between replicas fixes drift without waiting for reads. DynamoDB uses leaderless per partition with leader for coordination internally — interview simplification OK.

🏆 Senior Signal

State explicit numbers: "RF=3, LOCAL_QUORUM writes in Cassandra → tolerate 1 node loss per DC without write failure. For checkout, stick to single-leader PostgreSQL with sync replica — leaderless is wrong tool for financial invariants."

📐 Estimation

Cross-region sync replication adds RTT (~150ms US-EU) to every write. Multi-region active/active trades that latency for local writes — quantify: 10K writes/sec × 150ms = need massive write pipelining or accept async replication lag.