Storage Patterns & Data Modeling

Schema design is not about normalizing tables—it is about matching storage layout to access patterns. Read-heavy feeds need different shapes than write-heavy ledgers. This chapter covers fan-out strategies, Lambda and Kappa architectures, CQRS, event sourcing, and partitioning—patterns that separate L5 candidates who draw boxes from L6 candidates who reason about data movement at scale.

L5 L6 learning CQRS · Event Sourcing · Cassandra · DynamoDB

Access-pattern-driven data modeling

Start with the queries, not the entities. In distributed systems—especially Cassandra, DynamoDB, and wide-column stores— you design tables around how data is read and written, not around object-oriented nouns.

Relational modeling normalizes to reduce redundancy. At scale, joins across shards are expensive or impossible. The alternative: denormalize deliberately, duplicate data across tables optimized for specific queries, and accept eventual consistency between copies.

Read:write ratio drives architecture

Ratio Profile Examples Storage strategy
100:1 read-heavy Reads dominate; writes infrequent Product catalog, user profiles, config Cache-aside Redis, read replicas, CDN, materialized views
10:1 read-heavy Mostly reads with moderate writes Social feeds, news sites, search indexes Denormalized read models, fan-out, Elasticsearch
1:1 balanced Reads ≈ writes Messaging threads, collaborative docs Partition by conversation/doc ID; careful indexing both directions
1:10 write-heavy Writes dominate reads IoT telemetry, clickstream, audit logs Append-only logs (Kafka), time-series DB, batch aggregation
1:100 write-heavy Massive ingest, rare queries Metrics, logs, sensor data Columnar stores (ClickHouse, BigQuery), S3 + Athena, downsampling

Query-first table design (NoSQL)

For DynamoDB/Cassandra, list every access pattern before creating a table:

  1. Get user profile by user_id
  2. List orders for user_id, sorted by date DESC
  3. Get order by order_id
  4. Lookup orders by status for admin dashboard

Pattern 1 and 2 might share a table with PK=user_id, SK=order#date. Pattern 3 needs a GSI or separate table with PK=order_id. Pattern 4 might use a GSI on status or export to Elasticsearch for flexible admin search.

# DynamoDB — one table, two access patterns via composite keys
# Pattern: get user profile
#   PK = USER#123, SK = PROFILE
# Pattern: list user orders newest first
#   PK = USER#123, SK = ORDER#2024-06-07T10:00:00#order-456
# Pattern: get order by ID (GSI)
#   GSI1PK = ORDER#456 → full item

item = {
    "PK": f"USER#{user_id}",
    "SK": f"ORDER#{timestamp}#{order_id}",
    "amount": 99.99,
    "status": "shipped",
    "GSI1PK": f"ORDER#{order_id}"  # inverted index for lookup by order_id
}
🔧 Under the Hood

DynamoDB GSIs are eventually consistent separate partitions—you pay duplicate storage and separate WCU/RCU. Each GSI is effectively a materialized view maintained asynchronously. Design GSIs only for high-QPS patterns; low-frequency admin queries belong in Redshift/BigQuery via CDC export.

⚖️ Trade-off

Denormalization trades write complexity for read speed. Updating a username stored in 5 denormalized tables requires multi-table writes or async propagation. At 100:1 read ratio, this trade almost always wins; at 1:1, normalized Postgres with careful indexing often suffices.

📐 Estimation

35K read QPS, 500 write QPS → 70:1 read-heavy. Cache 80% of reads in Redis → 7K DB QPS. One Postgres read replica handles ~10K indexed reads/sec; you're fine with primary + 1 replica. At 350K read QPS, you need CDN + Redis + read replicas or sharding—model the ratio first.

Fan-out on write vs fan-out on read

Building a home timeline is the canonical fan-out problem: when user A posts, how do followers B, C, D… see it? Push at write time, pull at read time, or hybrid—the answer depends on follower distribution.

Fan-out on write (push model)

When a user tweets, write the tweet ID into every follower's timeline cache/table. Read is O(1)—fetch precomputed timeline. Write is O(followers)—expensive for celebrities.

flowchart LR
  Post[User posts tweet] --> TW[Tweet store]
  Post --> F1[Follower 1 timeline]
  Post --> F2[Follower 2 timeline]
  Post --> FN[Follower N timeline]
  Read[User reads feed] --> F1

Write cost: Katy Perry with 100M followers → 100M writes per tweet. At 1 ms/write, that's 27+ hours serially—impossible. Even at 100K writes/sec parallel, 1000 seconds per tweet.

Fan-out on read (pull model)

Store tweets only in the author's outbox. On timeline read, fetch tweets from all followed users, merge-sort by time. Write is O(1); read is O(following) — expensive for users who follow 5,000 accounts.

flowchart LR
  Post[User posts tweet] --> TW[Author outbox]
  Read[User reads feed] --> Q[Query 500 followed users]
  Q --> TW
  Q --> Merge[Merge-sort by timestamp]
  Merge --> Timeline[Rendered timeline]

Hybrid model (Twitter's approach)

Regular users (< celebrity threshold): fan-out on write into Redis timeline sorted sets. Celebrities (> threshold, e.g. 1M followers): skip push; followers merge celebrity tweets at read time.

Model Write cost Read cost Best when
Fan-out on write O(followers) per post O(1) — precomputed Low follower counts; read-heavy; low latency reads critical
Fan-out on read O(1) per post O(following) per read High follower counts (celebrities); write-heavy posters
Hybrid O(min(followers, threshold)) + O(1) for celebrities O(1) + O(celebrities followed) merge Real social graphs with power-law follower distribution

Celebrity threshold calculation

Define threshold T where push cost exceeds pull cost. If average user follows F accounts and average post rate is P tweets/sec system-wide:

  • Push cost for user with F_followers: F_followers writes/post
  • Pull cost per read: F_following tweet lookups
  • Break-even roughly when F_followers × post_rate ≈ F_following × read_rate
📐 Estimation

Twitter circa 2012: ~300M users, ~500M tweets/day ≈ 6K tweets/sec. Median follower count ~200; push = 200 writes. Celebrity at 10M followers = 10M writes. Threshold ~100K–1M followers: below push, above pull-at-read. ~0.01% of users are celebrities—hybrid covers 99.99% with fast push reads.

📦 Real World

Twitter documented hybrid fan-out in engineering blogs: Redis timelines for normal users, merge-on-read for celebrities. Instagram uses a similar push model for feed ranking with ML-scored candidate generation—not pure chronological fan-out. LinkedIn feed uses multi-stage fetch (followed + recommended) with heavy precomputation for active users.

🎯 Interview Tip

"Design Twitter feed" answer arc: (1) fan-out on write for fast reads, (2) celebrity problem breaks push, (3) hybrid with 1M follower threshold, (4) Redis sorted sets for timelines, (5) cache tweet content separately by ID. Quantify: 6K tweets/sec × 200 avg followers = 1.2M timeline writes/sec—need sharded Redis cluster.

⚠️ Pitfall

Hot key on celebrity timeline reads. Even with pull model, millions reading a celebrity's profile simultaneously hammer one shard. Replicate hot user data across cache nodes, use read-through CDN for public profiles, or pre-warm caches before expected spikes (product launches, breaking news).

Lambda vs Kappa architecture

Big data systems need to answer queries over both real-time and historical data. Lambda duplicates logic in batch and stream paths; Kappa bets everything on a single stream replay model.

Lambda architecture

Three layers: batch layer (Hadoop/Spark on full dataset, high latency, accurate), speed layer (Storm/Flink on recent data, low latency, approximate), serving layer (merges batch + speed views for queries).

flowchart TB
  Events[Incoming events] --> Batch[Batch layer\nSpark daily job]
  Events --> Speed[Speed layer\nFlink streaming]
  Batch --> Serving[Serving layer\nmerged views]
  Speed --> Serving
  Query[Queries] --> Serving

Pros: batch recompute corrects speed layer errors; mature tooling for both paths.

Cons: duplicate business logic (batch + stream codebases diverge); operational complexity; merge logic in serving layer is fragile.

Kappa architecture

Single stream processing layer (Kafka + Flink/Kafka Streams). Historical reprocessing = replay the log from offset 0 with updated code. No separate batch pipeline.

flowchart TB
  Events[Incoming events] --> Log[Immutable event log\nKafka]
  Log --> Stream[Stream processor\nFlink job v2]
  Stream --> Views[Materialized views\nDB / cache]
  Query[Queries] --> Views
  Log -.->|replay on code change| Stream

Pros: one codebase; replay for bug fixes and new features; simpler mental model.

Cons: replay time proportional to log retention; complex aggregations over years of data may still need batch export to warehouse; stream processor must handle full history volume on redeploy.

Dimension Lambda Kappa
Code paths Two (batch + stream) — logic drift risk One stream job — single source of truth
Correction mechanism Batch layer overwrites speed approximations nightly Replay log from beginning with fixed code
Latency Speed layer: seconds; batch: hours Stream: seconds; full recompute: hours–days depending on log size
Best for Petabyte-scale analytics with mature Hadoop estate Event-driven microservices, moderate history, Kafka-native orgs
Examples Early Twitter analytics, LinkedIn data infra Confluent recommendation, many greenfield streaming platforms
🏆 Senior Signal

"We run Kappa for real-time dashboards—Flink job materializes to Redis/Postgres. Nightly Spark job on S3 parquet export for ad-hoc analytics is batch, not Lambda—it's a warehouse path, not a duplicate serving merge. Pragmatic hybrid beats pure ideology."

⚖️ Trade-off

Kappa replay of 30 days × 1M events/sec = 2.6 trillion events—reprocessing takes hours and doubles cluster cost during replay. Keep bounded state in stream processors; export aggregates to warehouse for unbounded historical queries instead of replaying everything.

💡 Pro Tip

In 2026 interviews, Lambda is historical context—most teams say "Kappa with a warehouse." Real-time path: Kafka → Flink → serving store. Historical/ad-hoc: CDC → BigQuery/Snowflake. Don't propose maintaining identical Spark and Flink jobs unless asked about Lambda specifically.

CQRS in practice

Command Query Responsibility Segregation separates write models (commands, business rules, normalization) from read models (denormalized projections optimized for queries). Not every app needs it—know when the complexity pays off.

Traditional CRUD uses one schema for reads and writes. CQRS splits them: command side validates and persists events/state; query side serves from purpose-built read stores updated asynchronously.

flowchart LR
  Client -->|command| API[Command API]
  API --> WDB[(Write DB\nnormalized)]
  WDB -->|CDC / events| Projector[Projection workers]
  Projector --> RDB1[(Read DB — list view)]
  Projector --> RDB2[(Elasticsearch — search)]
  Client -->|query| QAPI[Query API]
  QAPI --> RDB1
  QAPI --> RDB2

When CQRS makes sense

  • Read and write patterns diverge radically (complex writes, many read shapes)
  • Read QPS >> write QPS and read models need heavy denormalization
  • Different scaling requirements (scale read replicas/Elasticsearch independently)
  • Combined with event sourcing for audit and temporal queries

When CQRS is overkill

  • Simple CRUD app with similar read/write shapes
  • Team < 5 engineers without streaming/CDC experience
  • Strong consistency required on reads immediately after writes (CQRS is usually eventual)
// Command side — validate and persist
public OrderId placeOrder(PlaceOrderCommand cmd) {
    Order order = Order.create(cmd.items(), cmd.userId());
    orderRepo.save(order);                    // normalized write DB
    eventBus.publish(new OrderPlaced(order)); // triggers projections
    return order.id();
}

// Query side — denormalized read model (updated by projector)
// Table: orders_by_user — PK user_id, SK created_at, denormalized items + totals
public List<OrderSummary> getOrdersForUser(UserId userId) {
    return readRepo.queryByUser(userId);     // no joins — precomputed view
}
Aspect Traditional CRUD CQRS
Schema One model for read + write Separate write + N read models
Consistency Strong (same transaction) Eventual (projection lag typically 100ms–5s)
Complexity Low High — projectors, versioning, sync monitoring
Scaling Scale whole DB Scale read stores independently
📦 Real World

Microsoft documented CQRS in Azure patterns for high-scale LOB apps. Shopify uses read replicas and search indexes fed by CDC—not pure CQRS, but same separation spirit. Banking often uses CQRS + event sourcing for account ledgers where audit trail is regulatory requirement.

⚠️ Pitfall

Projection lag UX bugs. User creates order, redirects to list page, order missing for 2 seconds. Mitigations: read-your-writes routing (query command-side DB for recent IDs), optimistic UI, or synchronous projection for critical paths only.

🎯 Interview Tip

Propose CQRS when reads dominate and need multiple views: "Write to Postgres, CDC to Elasticsearch for search and Redis for dashboard aggregates. Accept 1–2 sec eventual consistency; monitor projection lag." Don't propose CQRS for a todo app.

Event sourcing in practice

Instead of storing current state, store the sequence of state-changing events. Rebuild any past state by replaying events—a time machine for your domain model.

Traditional: UPDATE accounts SET balance = 150 WHERE id = 1 — previous balance lost. Event sourcing: append MoneyDeposited($50), MoneyWithdrawn($20) — full history preserved; current balance = fold(events).

flowchart LR
  Cmd[Command] --> Agg[Aggregate]
  Agg -->|append| ES[(Event Store)]
  ES --> E1[Event 1: AccountOpened]
  ES --> E2[Event 2: MoneyDeposited]
  ES --> E3[Event 3: MoneyWithdrawn]
  ES --> Proj[Projector]
  Proj --> View[(Current state view)]
  ES -->|replay| Agg

Core components

  • Event store — append-only log (Kafka topic, EventStoreDB, custom Postgres table)
  • Aggregate — domain object rebuilt by folding events; enforces invariants on commands
  • Snapshots — periodic state checkpoints to avoid replaying 10K events on every load
  • Projections — read models built from event stream (often paired with CQRS)
# Rebuild account state from events
def apply_events(events):
    account = Account.empty()
    for event in events:
        match event.type:
            case "AccountOpened":
                account = Account.open(event.user_id)
            case "MoneyDeposited":
                account = account.deposit(event.amount)
            case "MoneyWithdrawn":
                account = account.withdraw(event.amount)
    return account

# Snapshot every 100 events — load snapshot + events since
def load_account(account_id):
    snapshot, version = snapshot_store.get(account_id)
    new_events = event_store.get_since(account_id, version)
    return apply_events([snapshot] + new_events)

Benefits

  • Audit trail — complete history for compliance (finance, healthcare)
  • Temporal queries — "what was balance on March 1?"
  • Debugging — replay production events in staging
  • Multiple projections — build new read models from existing event log

Costs

  • Event schema evolution (upcasting, versioning) is non-trivial
  • Queries on current state require projections—can't SQL JOIN the event log efficiently
  • Storage grows forever unless compaction/archival policy
  • Team must think in events, not CRUD—steep learning curve
🔧 Under the Hood

EventStoreDB stores events in streams keyed by aggregate ID with optimistic concurrency (expected version check). Kafka as event store uses compacted topics for snapshot-like behavior but lacks per-aggregate version CAS— use Kafka + Postgres event table for production event sourcing with Kafka as transport, not sole store.

🏆 Senior Signal

"Event sourcing for the ledger aggregate only—not the whole monolith. User profile stays CRUD. Account transactions are event-sourced for audit. Snapshots every 50 events; project to read DB for balance queries. GDPR delete = tombstone event, not physical erase from log (legal hold considerations)."

⚖️ Trade-off

Event sourcing + CQRS is the most powerful and most operationally expensive data pattern. Use for bounded contexts with clear audit/temporal requirements—not as default architecture. Most teams need "audit log table" + CDC, not full event sourcing.

Data partitioning strategies

When one node can't hold the data or serve the QPS, you partition. Strategy choice determines query flexibility, rebalance pain, and hot spot risk.

Horizontal partitioning (sharding)

Split rows across nodes by a shard key. Most common scaling approach for OLTP at billion-row scale.

  • Hash partitioningshard = hash(user_id) % N; even distribution; range queries cross all shards
  • Range partitioning — user_id 1–1M on shard 1; efficient range scans; hot spots on latest range
  • Consistent hashing — minimal key movement when adding nodes; used in DynamoDB, Cassandra, Redis Cluster
flowchart TB
  Router[Shard router] --> S1[Shard 1\nuser_id hash 0-33%]
  Router --> S2[Shard 2\nuser_id hash 34-66%]
  Router --> S3[Shard 3\nuser_id hash 67-100%]

Vertical partitioning

Split columns or tables by access pattern—hot columns separate from cold BLOBs. Example: users table (id, name, email) + user_profiles (id, bio, avatar_blob). Reads for user list skip heavy BLOB columns.

Functional partitioning

Split by domain/service—users DB, orders DB, inventory DB. Microservices default. Cross-domain queries require API composition or read-side CQRS joins. Independent scaling: orders shard differently than user profiles.

Temporal partitioning

Partition by time—monthly tables, date-keyed S3 prefixes, TimescaleDB hypertable chunks. Ideal for append-only time-series, logs, metrics. Drop old partitions for retention (cheap delete).

-- Temporal partitioning — Postgres declarative partitions
CREATE TABLE events (
    id          BIGSERIAL,
    created_at  TIMESTAMPTZ NOT NULL,
    payload     JSONB
) PARTITION BY RANGE (created_at);

CREATE TABLE events_2024_06 PARTITION OF events
    FOR VALUES FROM ('2024-06-01') TO ('2024-07-01');

-- Drop retention: detach and drop old partition (instant vs DELETE millions of rows)
DROP TABLE events_2023_01;

Composite partitioning

Combine strategies—e.g. functional (orders service) + horizontal (hash by order_id) + temporal (archive to S3 after 90 days). DynamoDB: PK for entity type + SK for time enables composite access patterns on one table.

Strategy Split dimension Pros Cons
Horizontal Rows by shard key Linear scale-out for row count and QPS Cross-shard queries expensive; resharding painful
Vertical Columns / LOB separation Faster hot-path queries; smaller working set Joins across partitions; app must know layout
Functional Domain / service boundary Team autonomy; independent deploy and scale No cross-domain transactions; distributed query pain
Temporal Time range Efficient retention; time-range scans local Cross-period queries; uneven load if recent bias
Composite Multiple dimensions Matches real access patterns precisely Highest complexity; operational overhead

Shard key selection checklist

  1. High cardinality — avoid skew (don't shard by country if 80% users in US)
  2. Co-locate related data — orders and order_items share user_id or order_id shard key
  3. Match hottest queries — if 95% queries filter by tenant_id, shard by tenant
  4. Plan for growth — consistent hashing or virtual nodes for adding shards without full reshuffle
📐 Estimation

1B users, 2 KB profile each ≈ 2 TB data. Postgres practical limit ~500 GB–1 TB per node before pain. Need 4+ shards minimum. At 50K read QPS, 4 shards = 12.5K QPS each—within Postgres capacity with replicas. Plan 8 shards for headroom and hot spot absorption.

⚠️ Pitfall

Shard key change is a migration project. Picking user_id then needing multi-tenant isolation by org_id later requires full re-shard. Interview answer: "Shard by tenant_id for B2B SaaS—co-locates all org data, enables per-tenant export/delete for GDPR."

💡 Pro Tip

Start with single DB + read replicas until proven otherwise. Shard when storage > 500 GB or write QPS exceeds single-node ceiling (~5–10K writes/sec indexed). Premature sharding is a top-3 engineering time sink.

🎯 Interview Tip

Partitioning answer structure: (1) estimate data size and QPS, (2) pick strategy with justification, (3) name shard key and why, (4) acknowledge cross-shard query cost, (5) mention resharding plan (consistent hashing, dual-write migration). "1B records → hash shard by user_id, 16 shards, ~62M users each."