Storage Patterns & Data Modeling
Schema design is not about normalizing tables—it is about matching storage layout to access patterns. Read-heavy feeds need different shapes than write-heavy ledgers. This chapter covers fan-out strategies, Lambda and Kappa architectures, CQRS, event sourcing, and partitioning—patterns that separate L5 candidates who draw boxes from L6 candidates who reason about data movement at scale.
Access-pattern-driven data modeling
Start with the queries, not the entities. In distributed systems—especially Cassandra, DynamoDB, and wide-column stores— you design tables around how data is read and written, not around object-oriented nouns.
Relational modeling normalizes to reduce redundancy. At scale, joins across shards are expensive or impossible. The alternative: denormalize deliberately, duplicate data across tables optimized for specific queries, and accept eventual consistency between copies.
Read:write ratio drives architecture
| Ratio | Profile | Examples | Storage strategy |
|---|---|---|---|
| 100:1 read-heavy | Reads dominate; writes infrequent | Product catalog, user profiles, config | Cache-aside Redis, read replicas, CDN, materialized views |
| 10:1 read-heavy | Mostly reads with moderate writes | Social feeds, news sites, search indexes | Denormalized read models, fan-out, Elasticsearch |
| 1:1 balanced | Reads ≈ writes | Messaging threads, collaborative docs | Partition by conversation/doc ID; careful indexing both directions |
| 1:10 write-heavy | Writes dominate reads | IoT telemetry, clickstream, audit logs | Append-only logs (Kafka), time-series DB, batch aggregation |
| 1:100 write-heavy | Massive ingest, rare queries | Metrics, logs, sensor data | Columnar stores (ClickHouse, BigQuery), S3 + Athena, downsampling |
Query-first table design (NoSQL)
For DynamoDB/Cassandra, list every access pattern before creating a table:
- Get user profile by user_id
- List orders for user_id, sorted by date DESC
- Get order by order_id
- Lookup orders by status for admin dashboard
Pattern 1 and 2 might share a table with PK=user_id, SK=order#date.
Pattern 3 needs a GSI or separate table with PK=order_id.
Pattern 4 might use a GSI on status or export to Elasticsearch for flexible admin search.
# DynamoDB — one table, two access patterns via composite keys
# Pattern: get user profile
# PK = USER#123, SK = PROFILE
# Pattern: list user orders newest first
# PK = USER#123, SK = ORDER#2024-06-07T10:00:00#order-456
# Pattern: get order by ID (GSI)
# GSI1PK = ORDER#456 → full item
item = {
"PK": f"USER#{user_id}",
"SK": f"ORDER#{timestamp}#{order_id}",
"amount": 99.99,
"status": "shipped",
"GSI1PK": f"ORDER#{order_id}" # inverted index for lookup by order_id
}
DynamoDB GSIs are eventually consistent separate partitions—you pay duplicate storage and separate WCU/RCU. Each GSI is effectively a materialized view maintained asynchronously. Design GSIs only for high-QPS patterns; low-frequency admin queries belong in Redshift/BigQuery via CDC export.
Denormalization trades write complexity for read speed. Updating a username stored in 5 denormalized tables requires multi-table writes or async propagation. At 100:1 read ratio, this trade almost always wins; at 1:1, normalized Postgres with careful indexing often suffices.
35K read QPS, 500 write QPS → 70:1 read-heavy. Cache 80% of reads in Redis → 7K DB QPS. One Postgres read replica handles ~10K indexed reads/sec; you're fine with primary + 1 replica. At 350K read QPS, you need CDN + Redis + read replicas or sharding—model the ratio first.
Fan-out on write vs fan-out on read
Building a home timeline is the canonical fan-out problem: when user A posts, how do followers B, C, D… see it? Push at write time, pull at read time, or hybrid—the answer depends on follower distribution.
Fan-out on write (push model)
When a user tweets, write the tweet ID into every follower's timeline cache/table. Read is O(1)—fetch precomputed timeline. Write is O(followers)—expensive for celebrities.
flowchart LR Post[User posts tweet] --> TW[Tweet store] Post --> F1[Follower 1 timeline] Post --> F2[Follower 2 timeline] Post --> FN[Follower N timeline] Read[User reads feed] --> F1
Write cost: Katy Perry with 100M followers → 100M writes per tweet. At 1 ms/write, that's 27+ hours serially—impossible. Even at 100K writes/sec parallel, 1000 seconds per tweet.
Fan-out on read (pull model)
Store tweets only in the author's outbox. On timeline read, fetch tweets from all followed users, merge-sort by time. Write is O(1); read is O(following) — expensive for users who follow 5,000 accounts.
flowchart LR Post[User posts tweet] --> TW[Author outbox] Read[User reads feed] --> Q[Query 500 followed users] Q --> TW Q --> Merge[Merge-sort by timestamp] Merge --> Timeline[Rendered timeline]
Hybrid model (Twitter's approach)
Regular users (< celebrity threshold): fan-out on write into Redis timeline sorted sets. Celebrities (> threshold, e.g. 1M followers): skip push; followers merge celebrity tweets at read time.
| Model | Write cost | Read cost | Best when |
|---|---|---|---|
| Fan-out on write | O(followers) per post | O(1) — precomputed | Low follower counts; read-heavy; low latency reads critical |
| Fan-out on read | O(1) per post | O(following) per read | High follower counts (celebrities); write-heavy posters |
| Hybrid | O(min(followers, threshold)) + O(1) for celebrities | O(1) + O(celebrities followed) merge | Real social graphs with power-law follower distribution |
Celebrity threshold calculation
Define threshold T where push cost exceeds pull cost. If average user follows F accounts and average post rate is P tweets/sec system-wide:
- Push cost for user with F_followers: F_followers writes/post
- Pull cost per read: F_following tweet lookups
- Break-even roughly when F_followers × post_rate ≈ F_following × read_rate
Twitter circa 2012: ~300M users, ~500M tweets/day ≈ 6K tweets/sec. Median follower count ~200; push = 200 writes. Celebrity at 10M followers = 10M writes. Threshold ~100K–1M followers: below push, above pull-at-read. ~0.01% of users are celebrities—hybrid covers 99.99% with fast push reads.
Twitter documented hybrid fan-out in engineering blogs: Redis timelines for normal users, merge-on-read for celebrities. Instagram uses a similar push model for feed ranking with ML-scored candidate generation—not pure chronological fan-out. LinkedIn feed uses multi-stage fetch (followed + recommended) with heavy precomputation for active users.
"Design Twitter feed" answer arc: (1) fan-out on write for fast reads, (2) celebrity problem breaks push, (3) hybrid with 1M follower threshold, (4) Redis sorted sets for timelines, (5) cache tweet content separately by ID. Quantify: 6K tweets/sec × 200 avg followers = 1.2M timeline writes/sec—need sharded Redis cluster.
Hot key on celebrity timeline reads. Even with pull model, millions reading a celebrity's profile simultaneously hammer one shard. Replicate hot user data across cache nodes, use read-through CDN for public profiles, or pre-warm caches before expected spikes (product launches, breaking news).
Lambda vs Kappa architecture
Big data systems need to answer queries over both real-time and historical data. Lambda duplicates logic in batch and stream paths; Kappa bets everything on a single stream replay model.
Lambda architecture
Three layers: batch layer (Hadoop/Spark on full dataset, high latency, accurate), speed layer (Storm/Flink on recent data, low latency, approximate), serving layer (merges batch + speed views for queries).
flowchart TB Events[Incoming events] --> Batch[Batch layer\nSpark daily job] Events --> Speed[Speed layer\nFlink streaming] Batch --> Serving[Serving layer\nmerged views] Speed --> Serving Query[Queries] --> Serving
Pros: batch recompute corrects speed layer errors; mature tooling for both paths.
Cons: duplicate business logic (batch + stream codebases diverge); operational complexity; merge logic in serving layer is fragile.
Kappa architecture
Single stream processing layer (Kafka + Flink/Kafka Streams). Historical reprocessing = replay the log from offset 0 with updated code. No separate batch pipeline.
flowchart TB Events[Incoming events] --> Log[Immutable event log\nKafka] Log --> Stream[Stream processor\nFlink job v2] Stream --> Views[Materialized views\nDB / cache] Query[Queries] --> Views Log -.->|replay on code change| Stream
Pros: one codebase; replay for bug fixes and new features; simpler mental model.
Cons: replay time proportional to log retention; complex aggregations over years of data may still need batch export to warehouse; stream processor must handle full history volume on redeploy.
| Dimension | Lambda | Kappa |
|---|---|---|
| Code paths | Two (batch + stream) — logic drift risk | One stream job — single source of truth |
| Correction mechanism | Batch layer overwrites speed approximations nightly | Replay log from beginning with fixed code |
| Latency | Speed layer: seconds; batch: hours | Stream: seconds; full recompute: hours–days depending on log size |
| Best for | Petabyte-scale analytics with mature Hadoop estate | Event-driven microservices, moderate history, Kafka-native orgs |
| Examples | Early Twitter analytics, LinkedIn data infra | Confluent recommendation, many greenfield streaming platforms |
"We run Kappa for real-time dashboards—Flink job materializes to Redis/Postgres. Nightly Spark job on S3 parquet export for ad-hoc analytics is batch, not Lambda—it's a warehouse path, not a duplicate serving merge. Pragmatic hybrid beats pure ideology."
Kappa replay of 30 days × 1M events/sec = 2.6 trillion events—reprocessing takes hours and doubles cluster cost during replay. Keep bounded state in stream processors; export aggregates to warehouse for unbounded historical queries instead of replaying everything.
In 2026 interviews, Lambda is historical context—most teams say "Kappa with a warehouse." Real-time path: Kafka → Flink → serving store. Historical/ad-hoc: CDC → BigQuery/Snowflake. Don't propose maintaining identical Spark and Flink jobs unless asked about Lambda specifically.
CQRS in practice
Command Query Responsibility Segregation separates write models (commands, business rules, normalization) from read models (denormalized projections optimized for queries). Not every app needs it—know when the complexity pays off.
Traditional CRUD uses one schema for reads and writes. CQRS splits them: command side validates and persists events/state; query side serves from purpose-built read stores updated asynchronously.
flowchart LR Client -->|command| API[Command API] API --> WDB[(Write DB\nnormalized)] WDB -->|CDC / events| Projector[Projection workers] Projector --> RDB1[(Read DB — list view)] Projector --> RDB2[(Elasticsearch — search)] Client -->|query| QAPI[Query API] QAPI --> RDB1 QAPI --> RDB2
When CQRS makes sense
- Read and write patterns diverge radically (complex writes, many read shapes)
- Read QPS >> write QPS and read models need heavy denormalization
- Different scaling requirements (scale read replicas/Elasticsearch independently)
- Combined with event sourcing for audit and temporal queries
When CQRS is overkill
- Simple CRUD app with similar read/write shapes
- Team < 5 engineers without streaming/CDC experience
- Strong consistency required on reads immediately after writes (CQRS is usually eventual)
// Command side — validate and persist
public OrderId placeOrder(PlaceOrderCommand cmd) {
Order order = Order.create(cmd.items(), cmd.userId());
orderRepo.save(order); // normalized write DB
eventBus.publish(new OrderPlaced(order)); // triggers projections
return order.id();
}
// Query side — denormalized read model (updated by projector)
// Table: orders_by_user — PK user_id, SK created_at, denormalized items + totals
public List<OrderSummary> getOrdersForUser(UserId userId) {
return readRepo.queryByUser(userId); // no joins — precomputed view
}
| Aspect | Traditional CRUD | CQRS |
|---|---|---|
| Schema | One model for read + write | Separate write + N read models |
| Consistency | Strong (same transaction) | Eventual (projection lag typically 100ms–5s) |
| Complexity | Low | High — projectors, versioning, sync monitoring |
| Scaling | Scale whole DB | Scale read stores independently |
Microsoft documented CQRS in Azure patterns for high-scale LOB apps. Shopify uses read replicas and search indexes fed by CDC—not pure CQRS, but same separation spirit. Banking often uses CQRS + event sourcing for account ledgers where audit trail is regulatory requirement.
Projection lag UX bugs. User creates order, redirects to list page, order missing for 2 seconds. Mitigations: read-your-writes routing (query command-side DB for recent IDs), optimistic UI, or synchronous projection for critical paths only.
Propose CQRS when reads dominate and need multiple views: "Write to Postgres, CDC to Elasticsearch for search and Redis for dashboard aggregates. Accept 1–2 sec eventual consistency; monitor projection lag." Don't propose CQRS for a todo app.
Event sourcing in practice
Instead of storing current state, store the sequence of state-changing events. Rebuild any past state by replaying events—a time machine for your domain model.
Traditional: UPDATE accounts SET balance = 150 WHERE id = 1 — previous balance lost.
Event sourcing: append MoneyDeposited($50), MoneyWithdrawn($20) — full history preserved;
current balance = fold(events).
flowchart LR Cmd[Command] --> Agg[Aggregate] Agg -->|append| ES[(Event Store)] ES --> E1[Event 1: AccountOpened] ES --> E2[Event 2: MoneyDeposited] ES --> E3[Event 3: MoneyWithdrawn] ES --> Proj[Projector] Proj --> View[(Current state view)] ES -->|replay| Agg
Core components
- Event store — append-only log (Kafka topic, EventStoreDB, custom Postgres table)
- Aggregate — domain object rebuilt by folding events; enforces invariants on commands
- Snapshots — periodic state checkpoints to avoid replaying 10K events on every load
- Projections — read models built from event stream (often paired with CQRS)
# Rebuild account state from events
def apply_events(events):
account = Account.empty()
for event in events:
match event.type:
case "AccountOpened":
account = Account.open(event.user_id)
case "MoneyDeposited":
account = account.deposit(event.amount)
case "MoneyWithdrawn":
account = account.withdraw(event.amount)
return account
# Snapshot every 100 events — load snapshot + events since
def load_account(account_id):
snapshot, version = snapshot_store.get(account_id)
new_events = event_store.get_since(account_id, version)
return apply_events([snapshot] + new_events)
Benefits
- Audit trail — complete history for compliance (finance, healthcare)
- Temporal queries — "what was balance on March 1?"
- Debugging — replay production events in staging
- Multiple projections — build new read models from existing event log
Costs
- Event schema evolution (upcasting, versioning) is non-trivial
- Queries on current state require projections—can't SQL JOIN the event log efficiently
- Storage grows forever unless compaction/archival policy
- Team must think in events, not CRUD—steep learning curve
EventStoreDB stores events in streams keyed by aggregate ID with optimistic concurrency (expected version check). Kafka as event store uses compacted topics for snapshot-like behavior but lacks per-aggregate version CAS— use Kafka + Postgres event table for production event sourcing with Kafka as transport, not sole store.
"Event sourcing for the ledger aggregate only—not the whole monolith. User profile stays CRUD. Account transactions are event-sourced for audit. Snapshots every 50 events; project to read DB for balance queries. GDPR delete = tombstone event, not physical erase from log (legal hold considerations)."
Event sourcing + CQRS is the most powerful and most operationally expensive data pattern. Use for bounded contexts with clear audit/temporal requirements—not as default architecture. Most teams need "audit log table" + CDC, not full event sourcing.
Data partitioning strategies
When one node can't hold the data or serve the QPS, you partition. Strategy choice determines query flexibility, rebalance pain, and hot spot risk.
Horizontal partitioning (sharding)
Split rows across nodes by a shard key. Most common scaling approach for OLTP at billion-row scale.
- Hash partitioning —
shard = hash(user_id) % N; even distribution; range queries cross all shards - Range partitioning — user_id 1–1M on shard 1; efficient range scans; hot spots on latest range
- Consistent hashing — minimal key movement when adding nodes; used in DynamoDB, Cassandra, Redis Cluster
flowchart TB Router[Shard router] --> S1[Shard 1\nuser_id hash 0-33%] Router --> S2[Shard 2\nuser_id hash 34-66%] Router --> S3[Shard 3\nuser_id hash 67-100%]
Vertical partitioning
Split columns or tables by access pattern—hot columns separate from cold BLOBs.
Example: users table (id, name, email) + user_profiles (id, bio, avatar_blob).
Reads for user list skip heavy BLOB columns.
Functional partitioning
Split by domain/service—users DB, orders DB, inventory DB. Microservices default. Cross-domain queries require API composition or read-side CQRS joins. Independent scaling: orders shard differently than user profiles.
Temporal partitioning
Partition by time—monthly tables, date-keyed S3 prefixes, TimescaleDB hypertable chunks. Ideal for append-only time-series, logs, metrics. Drop old partitions for retention (cheap delete).
-- Temporal partitioning — Postgres declarative partitions
CREATE TABLE events (
id BIGSERIAL,
created_at TIMESTAMPTZ NOT NULL,
payload JSONB
) PARTITION BY RANGE (created_at);
CREATE TABLE events_2024_06 PARTITION OF events
FOR VALUES FROM ('2024-06-01') TO ('2024-07-01');
-- Drop retention: detach and drop old partition (instant vs DELETE millions of rows)
DROP TABLE events_2023_01;
Composite partitioning
Combine strategies—e.g. functional (orders service) + horizontal (hash by order_id) + temporal (archive to S3 after 90 days). DynamoDB: PK for entity type + SK for time enables composite access patterns on one table.
| Strategy | Split dimension | Pros | Cons |
|---|---|---|---|
| Horizontal | Rows by shard key | Linear scale-out for row count and QPS | Cross-shard queries expensive; resharding painful |
| Vertical | Columns / LOB separation | Faster hot-path queries; smaller working set | Joins across partitions; app must know layout |
| Functional | Domain / service boundary | Team autonomy; independent deploy and scale | No cross-domain transactions; distributed query pain |
| Temporal | Time range | Efficient retention; time-range scans local | Cross-period queries; uneven load if recent bias |
| Composite | Multiple dimensions | Matches real access patterns precisely | Highest complexity; operational overhead |
Shard key selection checklist
- High cardinality — avoid skew (don't shard by
countryif 80% users in US) - Co-locate related data — orders and order_items share
user_idororder_idshard key - Match hottest queries — if 95% queries filter by
tenant_id, shard by tenant - Plan for growth — consistent hashing or virtual nodes for adding shards without full reshuffle
1B users, 2 KB profile each ≈ 2 TB data. Postgres practical limit ~500 GB–1 TB per node before pain. Need 4+ shards minimum. At 50K read QPS, 4 shards = 12.5K QPS each—within Postgres capacity with replicas. Plan 8 shards for headroom and hot spot absorption.
Shard key change is a migration project. Picking user_id then needing
multi-tenant isolation by org_id later requires full re-shard.
Interview answer: "Shard by tenant_id for B2B SaaS—co-locates all org data, enables per-tenant export/delete for GDPR."
Start with single DB + read replicas until proven otherwise. Shard when storage > 500 GB or write QPS exceeds single-node ceiling (~5–10K writes/sec indexed). Premature sharding is a top-3 engineering time sink.
Partitioning answer structure: (1) estimate data size and QPS, (2) pick strategy with justification, (3) name shard key and why, (4) acknowledge cross-shard query cost, (5) mention resharding plan (consistent hashing, dual-write migration). "1B records → hash shard by user_id, 16 shards, ~62M users each."