Database Design & Selection

The database is the hardest component to change after launch. Wrong choice costs months of migration; right choice buys years of scale. This chapter is the capstone data layer reference: a selection framework for every major store family, deep dives into RDBMS internals, sharding, NoSQL variants, search, graphs, and replication topologies—with numbers and trade-offs you can defend in a staff-level interview.

L4 L5 L6 learning PostgreSQL 16 Cassandra 4.x

1. Database Selection Framework

Start with access patterns, not brand names. Every store optimizes for a workload shape: point reads, range scans, aggregations, full-text search, graph traversals, or append-only time series. The framework below maps requirements to families—and explicitly states when not to use each.

The five questions before you pick

  1. Read vs write ratio — 90% reads favors replicas + cache; write-heavy favors LSM/wide-column.
  2. Query shape — point lookup, range scan, join, aggregation, geo, full-text, traversal?
  3. Consistency needs — financial ledger (strong) vs social likes (eventual)?
  4. Scale trajectory — single node OK for 2 years, or 100M rows day one?
  5. Operational maturity — managed service vs self-hosted Cassandra cluster?
flowchart TD
  start(["New data store needed"]) --> q1{"Structured schema\nwith joins?"}
  q1 -->|Yes| sql["Relational SQL\nPostgreSQL / MySQL"]
  q1 -->|No| q2{"Document-shaped\nnested JSON?"}
  q2 -->|Yes| doc["Document DB\nMongoDB / DynamoDB"]
  q2 -->|No| q3{"Massive write\nthroughput?"}
  q3 -->|Yes| wc["Wide-column\nCassandra / HBase"]
  q3 -->|No| q4{"Sub-ms latency\ncache/session?"}
  q4 -->|Yes| kv["Key-value\nRedis / Memcached"]
  q4 -->|No| q5{"Time-ordered\nmetrics/logs?"}
  q5 -->|Yes| ts["Time-series\nInfluxDB / TimescaleDB / ClickHouse"]
  q5 -->|No| q6{"Full-text or\nfaceted search?"}
  q6 -->|Yes| search["Search engine\nElasticsearch / OpenSearch"]
  q6 -->|No| q7{"Relationship\ntraversals?"}
  q7 -->|Yes| graphdb["Graph DB\nNeo4j / Neptune"]
  q7 -->|No| sql

Decision matrix by store family

Family Best for Weak at Typical scale Examples
Relational (SQL) ACID transactions, joins, ad-hoc queries, reporting Horizontal write scale without sharding; schema rigidity 10 TB single node; sharded to PB with Vitess/Citus PostgreSQL, MySQL, Aurora
Document Flexible schema, nested objects, rapid iteration Multi-document ACID (limited), unbounded document growth 100M+ docs per collection with sharding MongoDB, CouchDB, DynamoDB (doc model)
Wide-column High write throughput, time-series-ish rows, geo-distribution Ad-hoc joins, secondary index fan-out, range queries without partition key PB across datacenters Cassandra, HBase, ScyllaDB
Key-value Cache, sessions, rate limits, feature flags, leaderboards Complex queries, durability without tuning (Redis) 100K–1M ops/sec per node Redis, Memcached, DynamoDB
Time-series Metrics, IoT, observability, downsampling General-purpose OLTP, mutable row updates Millions of writes/sec with batching InfluxDB, TimescaleDB, Prometheus, ClickHouse
Search Full-text, fuzzy match, facets, relevance ranking Source of truth, strong consistency with primary DB TB indexes with sharding Elasticsearch, OpenSearch, Solr
Graph Short-path queries, fraud rings, recommendations, IAM Bulk analytics, simple key lookups at extreme scale 10B edges (Neptune), smaller for Neo4j Neo4j, Amazon Neptune, JanusGraph

Polyglot persistence — the norm at scale

Production systems rarely use one database. Uber runs MySQL (core trips), Cassandra (location), Redis (cache), Elasticsearch (search). The pattern: PostgreSQL as system of record, specialized stores as derived views. Sync via CDC (Debezium → Kafka) or dual-write (riskier). Interview signal: name the primary store and justify satellites.

🏆 Senior Signal

L5: "We'll use PostgreSQL." L6: "PostgreSQL for orders and inventory with row-level locking on stock decrements; Redis cache-aside for product catalog (95% read); Elasticsearch fed by Kafka CDC for search—eventual consistency acceptable with 2s refresh SLA. DynamoDB only if we need single-digit-ms global reads without managing replicas."

⚖️ Trade-off

One database vs many — operational simplicity and JOINs vs best-fit performance per workload. Start monolithic SQL; split when measured pain (write QPS, query latency, team boundaries) exceeds migration cost.

🎯 Interview Tip

When asked "SQL or NoSQL?", reframe: "What's the access pattern?" Walk through the five questions aloud. Interviewers reward structured elimination over premature technology choice.

2. Relational Databases

Four decades of battle-testing make RDBMS the default system of record. Understanding ACID, isolation anomalies, indexing, and connection economics separates engineers who "use Postgres" from those who keep it alive at 50K QPS.

ACID properties

  • Atomicity — all statements in a transaction commit or none do (WAL + rollback segments).
  • Consistency — constraints (FK, CHECK, UNIQUE) hold after every committed transaction.
  • Isolation — concurrent transactions behave as if serialized (level-dependent).
  • Durability — committed data survives crash (fsync WAL before ACK).

Isolation levels and anomalies

ANSI SQL defines four isolation levels. PostgreSQL implements three (Read Uncommitted behaves as Read Committed). Higher isolation = fewer anomalies, more locking/retries, lower throughput.

Level Dirty read Non-repeatable read Phantom read Implementation notes
Read Uncommitted Possible Possible Possible PostgreSQL upgrades to Read Committed; MySQL InnoDB rarely used
Read Committed No Possible Possible Default in PostgreSQL; each statement sees latest committed snapshot
Repeatable Read No No Possible* MySQL InnoDB default; PostgreSQL blocks phantoms via predicate locks
Serializable No No No SSI in PostgreSQL; highest safety, serialization failures → retry

* PostgreSQL Repeatable Read prevents phantom reads for most cases via MVCC + predicate locking.

⚠️ Pitfall

Lost updates happen at Read Committed without explicit locking. Use SELECT … FOR UPDATE, optimistic versioning (UPDATE … WHERE version = ?), or Serializable isolation for inventory/financial counters.

Indexes — B-tree, hash, GIN, partial

B-tree (default) supports equality and range on sortable columns. Composite index column order matters: (user_id, created_at) serves WHERE user_id = ? and WHERE user_id = ? AND created_at > ? but not WHERE created_at > ? alone.

  • Covering index — INCLUDE columns avoid heap fetch (index-only scan).
  • Partial indexWHERE status = 'active' shrinks index size 10× for hot subsets.
  • GIN/GiST — JSONB, full-text, geo in PostgreSQL.
  • Write amplification — every index doubles insert cost; over-indexing kills write-heavy tables.
-- Composite + partial index for active user orders
CREATE INDEX idx_orders_user_active
  ON orders (user_id, created_at DESC)
  WHERE status NOT IN ('cancelled', 'archived');

-- Covering index: avoid heap lookup for list view
CREATE INDEX idx_products_list
  ON products (category_id, price)
  INCLUDE (name, thumbnail_url);

-- Explain analyze before shipping
EXPLAIN (ANALYZE, BUFFERS)
SELECT id, name, price FROM products
WHERE category_id = 42 AND price BETWEEN 10 AND 100
ORDER BY price LIMIT 20;

The N+1 query problem

Loading 100 posts then issuing 100 queries for authors = 101 round trips (~101 × 1ms = 101ms vs 2ms with JOIN). ORMs lazy-load by default—fix with eager loading, JOIN FETCH, or batch IN queries.

// BAD: N+1 — 1 query for posts + N for each author
List<Post> posts = postRepo.findByUserId(userId);
posts.forEach(p -> p.getAuthor().getName()); // lazy load each author

// GOOD: single JOIN or @EntityGraph
@Query("SELECT p FROM Post p JOIN FETCH p.author WHERE p.userId = :uid")
List<Post> findWithAuthor(@Param("uid") Long userId);

Connection pooling

PostgreSQL uses ~10 MB RAM per connection and a dedicated backend process. Opening a connection costs 1–5 ms. At 500 app instances × 20 connections = 10,000 backends — the database dies before your app does. PgBouncer (transaction pooling) multiplexes thousands of clients onto ~100 server connections.

  • Pool size rule of thumbconnections = (core_count × 2) + effective_spindle_count (PgBouncer docs); often 20–50 per primary.
  • Transaction vs session pooling — transaction mode breaks prepared statements and temp tables; session mode safer for ORMs.
  • RDS Proxy / Aurora — managed pooling + IAM auth + failover.

Read replicas

Streaming replication ships WAL to standbys. Replicas serve read traffic; lag typically 10ms–seconds under load. Replica lag means read-your-writes is NOT guaranteed unless routed to primary or tracked LSN.

flowchart LR
  app[Application]
  primary[(Primary PostgreSQL)]
  replica1[(Read Replica 1)]
  replica2[(Read Replica 2)]
  app -->|writes + strong reads| primary
  app -->|analytics / lists| replica1
  app -->|search indexing| replica2
  primary -->|WAL stream| replica1
  primary -->|WAL stream| replica2
⚖️ Trade-off

Read replicas vs caching — replicas give fresh-ish data without cache invalidation complexity; cache gives sub-ms latency. Common pattern: cache hot keys, replica for heavy reporting queries that tolerate lag.

PostgreSQL vs MySQL

Dimension PostgreSQL MySQL (InnoDB)
SQL compliance / features Window functions, CTEs, JSONB, arrays, full-text — richer ad-hoc analytics Simpler subset; 8.0+ CTEs/window functions; JSON type less ergonomic
Replication Streaming + logical decoding (CDC); built-in partitioning Binlog (row/statement); mature multi-source; Vitess ecosystem
Default isolation Read Committed Repeatable Read (gap locks differ from PG)
Typical adopters Startups, analytics-heavy, GIS (PostGIS), greenfield Legacy web, Shopify-scale sharded MySQL, China ecosystem
Interview default Prefer unless interviewer says MySQL or existing fleet Cite when discussing Uber/Vitess, Facebook history
📦 Real World

Instagram started PostgreSQL, migrated to Cassandra for feed storage at scale — but metadata stayed relational. Notion runs PostgreSQL with aggressive indexing and sharding via Vitess-like patterns for block storage metadata.

L5+ L6: isolation + pooling + replica lag

3. Sharding

When vertical scaling and read replicas exhaust headroom, you partition data across nodes. Sharding is irreversible architecture — shard key choice determines whether you scale linearly or fight hot spots forever.

When to shard

  • Single-node storage > 2–5 TB or write QPS exceeds ~10–50K sustained (hardware dependent).
  • Replication lag unacceptable and writes are the bottleneck.
  • Regulatory data residency requires geographic partitions.

Before sharding: archive cold data, denormalize, connection pool, read replicas, caching — each buys 10× cheaper than sharding.

Shard key selection

The shard key routes every query to the correct partition. Ideal properties:

  • High cardinality — uniform distribution (user_id ✓, country_code ✗).
  • Query alignment — most queries include shard key (avoid cross-shard scatter-gather).
  • Immutability — changing shard key requires row migration.
  • Monotonic keys avoided — auto-increment IDs concentrate writes on latest shard (use hash or snowflake IDs).
⚠️ Pitfall

Sharding by tenant_id when one tenant is 80% of traffic recreates a monolith on one shard. Mitigate: dedicated shard for whale tenants, sub-shard by user_id within tenant, or rate-limit enterprise tier.

Sharding strategies

Strategy Routing Pros Cons
Hash sharding shard = hash(key) mod N Even distribution with good keys Resharding requires remapping most keys when N changes
Range sharding Key ranges per shard (A–M, N–Z) Range scans local; easy time-based archival Hot spots on latest range (time-ordered keys)
Directory sharding Lookup table maps key → shard Flexible reassignment, whale tenant isolation Lookup service is SPOF; must cache and replicate
Geographic sharding Region in key or directory GDPR, latency compliance Cross-region queries expensive; user migration hard

Consistent hashing

Plain mod N remaps almost all keys when adding a shard. Consistent hashing places shards and keys on a ring; adding a node only moves keys between adjacent neighbors (~K/N keys move). Used in DynamoDB, Cassandra, Redis Cluster, CDNs.

flowchart TB
  subgraph ring["Hash ring (virtual nodes)"]
    direction TB
    n1["Shard A\nvnodes 0-85"]
    n2["Shard B\nvnodes 86-170"]
    n3["Shard C\nvnodes 171-255"]
    n1 --> n2 --> n3 --> n1
  end
  key1["key: user_4821\n→ Shard B"]
  key2["key: user_9901\n→ Shard A"]

Virtual nodes (vnodes) — each physical shard owns many ring positions for even load after node failure. Typical: 256 vnodes per Cassandra node.

Cross-shard operations

Operations without shard key in the WHERE clause hit every shard:

  • Scatter-gather — query all shards, merge in coordinator (latency = slowest shard).
  • Global secondary indexes — index table sharded differently = double write + eventual consistency.
  • Cross-shard transactions — 2PC is slow and fragile; prefer saga, per-shard transactions, or redesign.
  • JOINs across shards — application-level join or denormalize; SQL JOINs don't span Vitess shards without careful schema.
🏆 Senior Signal

Propose colocated tables — orders and order_items share user_id shard key so order history queries stay single-shard. Reserve cross-shard for rare admin/analytics with async aggregation.

Vitess — sharded MySQL at YouTube scale

Vitess sits between app and MySQL shards: VTGate routes SQL by shard key, VReplication handles resharding with minimal downtime, connection pooling built-in. Used by Slack, HubSpot, Square. Alternative: Citus for PostgreSQL (distributed tables + reference tables).

-- Vitess: colocated sharding schema
CREATE TABLE users (
  user_id BIGINT NOT NULL,
  email VARCHAR(255),
  PRIMARY KEY (user_id)
);

CREATE TABLE orders (
  order_id BIGINT NOT NULL,
  user_id BIGINT NOT NULL,
  total DECIMAL(10,2),
  PRIMARY KEY (order_id),
  KEY idx_user (user_id)
) VITESS_SHARDING_KEY = user_id;  -- same shard as users row
📐 Estimation

100 shards × 5K write QPS each = 500K aggregate writes/sec. If p99 shard size diverges 3×, hottest shard caps effective throughput at 167K — monitor shard QPS variance, not just aggregate.

4. NoSQL Document Stores

Document databases model data as self-contained JSON/BSON documents. Schema flexibility accelerates product iteration; the design question shifts from normalization to embed vs reference and aggregation pipeline economics.

When documents win

  • Heterogeneous records (CMS, catalogs, user profiles with optional fields).
  • Read-heavy access patterns that fetch whole object graphs.
  • Rapid schema evolution without ALTER TABLE migrations.
  • Horizontal scale via sharded collections (MongoDB sharded cluster, DynamoDB partitions).

Embed vs reference

Pattern Use when Avoid when
Embed (denormalize) 1:few relationship; data read together; child doesn't exist independently Unbounded arrays (comments on viral post → 16 MB doc limit); shared entities updated often
Reference (normalize) 1:many unbounded; entity shared across parents; independent lifecycle Every read needs join-like $lookup → latency + complexity
Hybrid Embed summary + reference ID for detail (author name + author_id) Stale embedded copies if update frequency high
// EMBED: order with line items (bounded, always read together)
{
  "_id": ObjectId("..."),
  "userId": "u_4821",
  "items": [
    { "sku": "WIDGET-1", "qty": 2, "price": 19.99 },
    { "sku": "WIDGET-2", "qty": 1, "price": 49.99 }
  ],
  "total": 89.97,
  "status": "shipped"
}

// REFERENCE: product reviews (unbounded, paginated separately)
{
  "_id": ObjectId("..."),
  "productId": "prod_991",
  "rating": 5,
  "text": "Great widget",
  "userId": "u_100"
}
⚠️ Pitfall

MongoDB 16 MB document limit — embedding all messages in a chat room document breaks at scale. Bucket pattern: one doc per 100 messages or reference collection with time-range queries.

Aggregation pipeline

Server-side pipeline stages: $match$group$sort$lookup (join). Push filtering early ($match first) to use indexes. Heavy analytics often belongs in warehouse (BigQuery) not OLTP Mongo.

// Top 10 products by revenue last 30 days
db.orders.aggregate([
  { $match: { createdAt: { $gte: ISODate("2026-05-08") }, status: "paid" } },
  { $unwind: "$items" },
  { $group: {
      _id: "$items.sku",
      revenue: { $sum: { $multiply: ["$items.qty", "$items.price"] } },
      units: { $sum: "$items.qty" }
  }},
  { $sort: { revenue: -1 } },
  { $limit: 10 }
]);

MongoDB sharding recap

  • Shard key — immutable; choose compound key for locality (tenant_id + user_id).
  • Chunk migration — balancer moves 64 MB chunks; jumbo chunks block balance.
  • Transactions — multi-doc ACID within same shard since 4.0; cross-shard since 4.2 with performance cost.
⚖️ Trade-off

Document vs SQL — documents eliminate JOIN latency for object reads; SQL wins multi-entity reporting, strict constraints, and mature tooling. Many teams use PostgreSQL JSONB as middle ground.

5. Wide-Column Stores

Cassandra and HBase optimize for write-heavy, geographically distributed workloads with predictable partition-key access. Data model is query-driven — you design tables around queries, not entities. Getting partition/cluster keys wrong is expensive to fix.

Data model: partition key + cluster key

  • Partition key — determines which node(s) store the row; all rows with same partition key live on same replica set.
  • Cluster key (clustering columns) — sort order within partition; enables range scans inside one partition.
  • Primary key = partition key + optional cluster key(s).
-- Messages by user: partition = user_id, cluster = conversation_id, sent_at
CREATE TABLE messages_by_user (
  user_id UUID,
  conversation_id UUID,
  sent_at TIMESTAMP,
  message_id UUID,
  body TEXT,
  PRIMARY KEY ((user_id), conversation_id, sent_at)
) WITH CLUSTERING ORDER BY (conversation_id ASC, sent_at DESC);

-- Query: all messages in one conversation for user (single partition)
SELECT * FROM messages_by_user
WHERE user_id = ? AND conversation_id = ? LIMIT 50;
flowchart LR
  subgraph node1["Node 1"]
    p1["Partition: user_A\nrows sorted by cluster key"]
  end
  subgraph node2["Node 2"]
    p2["Partition: user_B"]
  end
  subgraph node3["Node 3"]
    p3["Partition: user_C"]
  end
  router["Coordinator"] -->|hash partition key| node1
  router --> node2
  router --> node3

Tunable consistency

Cassandra trades latency vs staleness per query via ConsistencyLevel and replication factor (RF). QUORUM with RF=3 means 2 replicas must respond. LOCAL_QUORUM avoids cross-DC latency in multi-region.

Level Behavior (RF=3) Use case
ONE Single replica responds Metrics, low-stakes reads; may be stale
QUORUM Majority (2/3) agree Default strong-ish read/write balance
ALL All replicas must respond Rare; one slow node blocks everything
LOCAL_QUORUM Quorum in local DC only Multi-DC without cross-WAN on every read
🔬 Under the Hood

Hinted handoff + read repair — temporary replica failure doesn't lose writes; anti-entropy fixes divergence on read. Lightweight transactions (LWT / Paxos) provide compare-and-set but are 4× slower — use sparingly.

Anti-patterns

  • Hot partitions — celebrity user_id absorbs all writes for partition; salt key (user_id:bucket).
  • ALLOW FILTERING — full table scan disguised as query; kills cluster at scale.
  • Secondary indexes on high-cardinality columns — hidden scatter-gather; use materialized views or duplicate tables per query.
  • Large partitions — >100 MB or millions of rows per partition causes compaction/repair pain; target <10 MB.
  • Multi-DC strong consistency everywhere — WAN latency dominates; use LOCAL_QUORUM per DC.

HBase (brief)

HBase runs on HDFS — strong consistency per row, sorted by row key, built for billions of rows scanned in range. Write path through WAL + MemStore; reads merge StoreFiles. Used when Hadoop ecosystem already present (Facebook Messages history). Ops-heavy vs Cassandra's peer-to-peer simplicity; prefer Cassandra/Scylla for greenfield unless HDFS integration required.

📦 Real World

Netflix uses Cassandra for viewing history and recommendations metadata. Apple runs one of the largest Cassandra fleets (Petabyte scale). Discord migrated message storage to ScyllaDB (Cassandra-compatible, C++ performance).

6. Key-Value Stores

The simplest interface — get/put by key — enables sub-millisecond latency at massive QPS. Redis is a data structures server; Memcached is pure cache; DynamoDB is durable managed KV with pay-per-request scaling.

Redis data structures

Structure Operations Use cases
String GET/SET/INCR, BITOP Cache values, counters, distributed locks (with caution)
Hash HGET/HSET field-level Object cache without serializing whole JSON
List LPUSH/RPOP, blocking pop Queues, recent items (cap with LTRIM)
Set / Sorted Set SADD, ZADD with score Unique tags, leaderboards, rate limit windows
HyperLogLog PFADD/PFCOUNT Cardinality (~12 KB for billions of unique visitors)
Stream XADD/XREADGROUP Lightweight event log (not Kafka replacement)
# Leaderboard: top 10 by score
ZADD leaderboard 9842.5 user:4821
ZADD leaderboard 9100.0 user:9901
ZREVRANGE leaderboard 0 9 WITHSCORES

# Rate limit: 100 requests per minute per IP
INCR ratelimit:192.168.1.1:202606071430
EXPIRE ratelimit:192.168.1.1:202606071430 60

# Cache-aside pattern (pseudo)
# val = GET key; if miss: val = db.query(); SET key val EX 300

Redis persistence

  • RDB — point-in-time snapshots; fast restart; may lose last minutes of writes.
  • AOF — append every write; appendfsync everysec typical; more durable, larger files.
  • Neither — pure cache; acceptable if source of truth elsewhere.
  • Redis Cluster — 16,384 hash slots across masters; replica failover; no multi-key ACID across slots.
flowchart TB
  client[Client]
  n1["Master 1\nslots 0-5460"]
  n2["Master 2\nslots 5461-10922"]
  n3["Master 3\nslots 10923-16383"]
  r1[Replica]
  r2[Replica]
  r3[Replica]
  client -->|CRC16 key → slot| n1
  client --> n2
  client --> n3
  n1 --> r1
  n2 --> r2
  n3 --> r3

DynamoDB

  • Partition key — required; 10 GB per partition soft limit; hot keys throttle.
  • Sort key — optional; enables range queries within partition.
  • GSI (Global Secondary Index) — alternate access pattern with own partition key; eventual consistency; extra WCU/RCU.
  • Streams — ordered change feed per shard; triggers Lambda, CDC to Elasticsearch, audit.
  • On-demand vs provisioned — on-demand for spiky/unknown; provisioned + auto-scaling for predictable cost at scale.
{
  "TableName": "Orders",
  "KeySchema": [
    { "AttributeName": "userId", "KeyType": "HASH" },
    { "AttributeName": "orderId", "KeyType": "RANGE" }
  ],
  "GlobalSecondaryIndexes": [{
    "IndexName": "ByStatus",
    "KeySchema": [
      { "AttributeName": "status", "KeyType": "HASH" },
      { "AttributeName": "createdAt", "KeyType": "RANGE" }
    ],
    "Projection": { "ProjectionType": "INCLUDE", "NonKeyAttributes": ["total", "userId"] }
  }],
  "StreamSpecification": { "StreamViewType": "NEW_AND_OLD_IMAGES" }
}

Memcached vs Redis

Aspect Memcached Redis
Data model Strings only Rich structures, Lua scripting
Persistence None (pure cache) Optional RDB/AOF
Threading Multi-threaded (scale vertical) Single-threaded command loop (I/O threads in 6+)
When to pick Simple cache layer at Facebook scale Cache + structures + pub/sub + lightweight queue
⚠️ Pitfall

Redis as primary database without persistence/replication — data loss on restart. If durability matters, use DynamoDB/RDS or Redis with AOF + replicas + explicit backup strategy.

7. Time-Series Databases

Metrics, IoT sensor readings, and financial ticks share a pattern: append-only, time-ordered, high cardinality labels, aggregations over windows, aggressive retention downsampling. General OLTP databases fail here — inserts become index bloat hell.

Workload characteristics

  • Write-heavy — millions of points/sec (Prometheus remote write, Influx line protocol).
  • Time-range queries — last 1h, GROUP BY 1m buckets; rarely point lookups on old data.
  • Retention tiers — raw 7d → 1min aggregates 90d → 1h aggregates 2y.
  • Cardinality explosion — high-cardinality labels (user_id on every metric) OOM the index.
Engine Model Strengths Weaknesses
InfluxDB Tag/set schema, TSM storage DevOps metrics native; Flux/SQL queries Cardinality limits; clustering in enterprise tier
TimescaleDB PostgreSQL extension (hypertables) SQL familiarity, JOIN with relational data Not as fast as dedicated TSDB at extreme ingest
Prometheus Pull metrics, local TSDB K8s ecosystem standard; PromQL Long-term storage needs Thanos/Mimir/Cortex
ClickHouse Columnar OLAP Billions rows/sec ingest; SQL analytics Not a metrics scraper; batch insert tuning required
-- TimescaleDB hypertable + continuous aggregate
CREATE TABLE metrics (
  time        TIMESTAMPTZ NOT NULL,
  service     TEXT NOT NULL,
  endpoint    TEXT NOT NULL,
  latency_ms  DOUBLE PRECISION,
  status      INT
);
SELECT create_hypertable('metrics', 'time');

CREATE MATERIALIZED VIEW metrics_hourly
WITH (timescaledb.continuous) AS
SELECT time_bucket('1 hour', time) AS bucket,
       service, endpoint,
       avg(latency_ms) AS avg_latency,
       percentile_cont(0.99) WITHIN GROUP (ORDER BY latency_ms) AS p99,
       count(*) AS requests
FROM metrics
GROUP BY bucket, service, endpoint;
# Prometheus recording rule + alert
groups:
  - name: api_slo
    rules:
      - record: job:http_request_duration_p99:5m
        expr: histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, job))
      - alert: HighLatencyP99
        expr: job:http_request_duration_p99:5m > 0.5
        for: 5m
        labels:
          severity: page
💡 Pro Tip

Cap label cardinality in Prometheus: use http_route="/users/:id" not raw path with IDs. Drop high-cardinality labels at scrape or via relabel configs — prevents TSDB corruption.

⚖️ Trade-off

Prometheus pull vs push (Influx) — pull gives service discovery and health; push (StatsD, Telegraf) needed for short-lived jobs and IoT devices behind NAT.

8. Search Engines

Relational LIKE '%foo%' scans every row. Search engines invert the problem: pre-tokenize text into postings lists for microsecond lookups at billion-document scale. Elasticsearch is a derived view — never your financial source of truth.

Inverted index

Documents are tokenized (analyzed) into terms. Each term maps to a postings list of (doc_id, positions, scores). Query "quick brown fox" intersects postings lists; BM25 ranks by term frequency and document length normalization.

flowchart LR
  d1["Doc 1: quick brown fox"]
  d2["Doc 2: slow red fox"]
  d3["Doc 3: quick rabbit"]
  idx["Inverted index"]
  t1["fox → doc1, doc2"]
  t2["quick → doc1, doc3"]
  t3["brown → doc1"]
  d1 --> idx
  d2 --> idx
  d3 --> idx
  idx --> t1
  idx --> t2
  idx --> t3

Core concepts

  • Index — logical namespace (like database).
  • Shard — Lucene index instance; primary + replicas for HA and read scale.
  • Analyzers — tokenizer + filters (lowercase, stemming, stop words); language-specific.
  • Mappings — field types (text analyzed vs keyword exact); wrong mapping requires reindex.
  • Aggregations — facets, histograms, nested metrics without hitting primary DB.
PUT /products
{
  "mappings": {
    "properties": {
      "title": { "type": "text", "analyzer": "english" },
      "sku": { "type": "keyword" },
      "category": { "type": "keyword" },
      "price": { "type": "float" },
      "tags": { "type": "keyword" },
      "created_at": { "type": "date" }
    }
  }
}

GET /products/_search
{
  "query": {
    "bool": {
      "must": [{ "match": { "title": "wireless keyboard" } }],
      "filter": [{ "term": { "category": "electronics" } }]
    }
  },
  "aggs": { "by_tag": { "terms": { "field": "tags", "size": 10 } } }
}

ELK sync via Kafka (CDC pattern)

Primary DB commits → Debezium captures binlog/WAL → Kafka topic → Elasticsearch sink connector indexes documents. Near-real-time search (1–5s lag) without dual-write race conditions.

flowchart LR
  pg[(PostgreSQL)]
  debezium[Debezium CDC]
  kafka[[Kafka topic:\nproducts.changes]]
  connect[ES Sink Connector]
  es[(Elasticsearch)]
  app[Search API]
  pg -->|WAL| debezium
  debezium --> kafka
  kafka --> connect
  connect -->|bulk index| es
  app -->|query| es
⚠️ Pitfall

Dual-write to DB + ES — crash between writes causes drift. Use outbox pattern or CDC. Reindex jobs required when mapping changes — plan blue/green index alias swap.

📦 Real World

Shopify indexes catalog via pipeline from MySQL to Elasticsearch. Uber uses Elasticsearch for trip/search geo queries with dedicated indexing fleet fed by Kafka.

9. Graph Databases

When the query IS the relationship — "friends of friends within 3 hops," "shortest path between two routers," "detect fraud ring" — relational JOINs on edge tables degrade as depth grows. Graph stores index adjacency natively.

Property graph model

  • Nodes — entities (User, Account, Device) with properties.
  • Edges — typed relationships (FOLLOWS, PAID, LOGGED_IN_FROM) with properties and direction.
  • Traversal — follow pointers index-free per hop; cost ~O(edges traversed) not O(table size).

Neo4j vs Amazon Neptune

Aspect Neo4j Amazon Neptune
Query languages Cypher (declarative, SQL-like) Gremlin (imperative), openCypher, SPARQL (RDF)
Deployment Self-hosted, AuraDB managed Fully managed AWS; IAM/VPC integration
Scale Billions of relationships; sharding immature (Fabric for federated) Up to 128 TB storage; read replicas
Best for Dev velocity, complex Cypher, on-prem AWS-native, compliance, multi-model (property + RDF)
// Friends-of-friends recommendation (depth 2)
MATCH (me:User {id: $userId})-[:FOLLOWS]->(friend)-[:FOLLOWS]->(fof)
WHERE NOT (me)-[:FOLLOWS]->(fof) AND me <> fof
RETURN fof.id, count(*) AS mutualFriends
ORDER BY mutualFriends DESC
LIMIT 20;

// Fraud: shared device within 2 hops of flagged account
MATCH (flagged:Account {risk: 'high'})-[:USED_DEVICE|TRANSFERRED*1..2]-(suspect:Account)
WHERE suspect.risk IS NULL
RETURN suspect.id, count(DISTINCT flagged) AS proximityScore;

Use cases

  • Social graphs — follow suggestions, feed ranking inputs (often hybrid with cache).
  • Fraud / AML — connected components, cycle detection in payment graphs.
  • Identity / IAM — permission inheritance, group nesting (Google Zanzibar-inspired).
  • Network / IT ops — dependency mapping, blast radius analysis.
  • Knowledge graphs — RAG entity linking, semantic search enrichment.
⚖️ Trade-off

Graph vs relational — depth-1 relationships (user → orders) stay in SQL. Graph wins at depth ≥3 or variable-length paths. Most "graph problems" at FAANG scale use custom adjacency in KV + batch precompute.

🎯 Interview Tip

For "design Facebook friend suggestions," mention graph traversal but also offline batch jobs (Spark) computing candidates, Redis serving precomputed lists — pure online graph traversal doesn't scale to 3B users.

10. Database Replication

Replication provides durability, read scale, and geographic proximity — at the cost of consistency complexity. Three topologies dominate system design interviews: single-leader, multi-leader, and leaderless (Dynamo-style).

Single-leader (primary/replica)

All writes go to the leader; replicas apply the replication log (PostgreSQL WAL, MySQL binlog). Reads can hit replicas with lag. Failover promotes a replica — risk of split-brain without fencing (STONITH, etcd lease).

flowchart TB
  w1[Writer clients]
  r1[Reader clients]
  leader[(Leader DB)]
  rep1[(Replica 1)]
  rep2[(Replica 2)]
  w1 -->|writes| leader
  r1 -->|reads| rep1
  r1 -->|strong read| leader
  leader -->|async/sync replication| rep1
  leader --> rep2
  • Synchronous replica — write ACK after replica confirms; survives leader loss without data loss; higher latency.
  • Asynchronous replica — leader ACK before replica; faster; failover may lose last seconds.
  • Semi-sync — at least one replica ACK (MySQL, PostgreSQL synchronous_standby_names).

Multi-leader (active/active)

Multiple nodes accept writes; changes replicate bidirectionally. Enables multi-region write locality (user writes to nearest DC). Write conflicts inevitable — resolve via LWW (last-write-wins), version vectors, or application merge.

flowchart LR
  dc1["DC US-East\nLeader A"]
  dc2["DC EU-West\nLeader B"]
  dc3["DC APAC\nLeader C"]
  dc1 <-->|async repl| dc2
  dc2 <-->|async repl| dc3
  dc1 <-->|async repl| dc3
  u1[US users] --> dc1
  u2[EU users] --> dc2
⚠️ Pitfall

Multi-leader on same entity — two users editing same document in different regions → conflict. Restrict multi-leader to disjoint data (user_id region affinity) or use CRDTs for merge-safe fields.

Leaderless (Dynamo-style quorum)

No leader — client writes to W replicas, reads from R replicas. Quorum condition R + W > N (N = replication factor) guarantees overlap: read sees at least one node with latest write (with version reconciliation).

Parameter Typical (N=3) Effect
N (replication factor) 3 Copies of each key across cluster
W (write quorum) 2 Higher W → stronger durability, slower writes
R (read quorum) 2 Higher R → fresher reads, slower reads
R + W > N 2 + 2 > 3 ✓ Strong consistency possible with version merge
R + W ≤ N 1 + 1 ≤ 3 Eventual consistency only (faster, stale reads)
sequenceDiagram
  participant C as Client
  participant N1 as Node 1
  participant N2 as Node 2
  participant N3 as Node 3
  Note over N1,N3: N=3, W=2, R=2
  C->>N1: write v2
  C->>N2: write v2
  N1-->>C: ACK
  N2-->>C: ACK
  Note over C: Write quorum satisfied
  C->>N2: read
  C->>N3: read
  N2-->>C: v2
  N3-->>C: v1 stale
  Note over C: Return v2 (latest timestamp wins)

Topology comparison

Topology Consistency Write latency Conflict handling Examples
Single-leader Strong on leader; eventual on replicas Low (local leader) None (serial writes) PostgreSQL, MySQL, MongoDB default
Multi-leader Eventual; conflict resolution required Low per region LWW, merge, CRDT CouchDB, old MySQL circular repl
Leaderless Tunable via R/W/N Depends on W and RTT Version vectors, read repair Cassandra, DynamoDB, Riak
🔬 Under the Hood

Read repair — during quorum read, stale replicas get updated in background. Anti-entropy — Merkle tree comparison between replicas fixes drift without waiting for reads. DynamoDB uses leaderless per partition with leader for coordination internally — interview simplification OK.

🏆 Senior Signal

State explicit numbers: "RF=3, LOCAL_QUORUM writes in Cassandra → tolerate 1 node loss per DC without write failure. For checkout, stick to single-leader PostgreSQL with sync replica — leaderless is wrong tool for financial invariants."

📐 Estimation

Cross-region sync replication adds RTT (~150ms US-EU) to every write. Multi-region active/active trades that latency for local writes — quantify: 10K writes/sec × 150ms = need massive write pipelining or accept async replication lag.