Architecture Patterns Library

A visual catalog of the patterns senior engineers reach for: not definitions, but when to use each, what it costs, and where it runs in production. Organized by category for quick lookup during mock interviews and design reviews. Pair with Interview Framework for structured delivery.

Levels L5 and L6 · Interview · 72 patterns

Scalability Patterns

Handle growth in users, traffic, and data volume without redesigning from scratch.

Problem · Read/write asymmetry

CQRS

Solution: Separate command (write) and query (read) models; writes go to OLTP, reads served from denormalized projections.

Trade-off: Eventual consistency on read side; dual-model maintenance and sync pipeline complexity.

Real-world: Microsoft's Azure architecture guidance documents CQRS, commonly paired with event sourcing so read models can be rebuilt from the event log.

Problem · Audit trail + temporal queries

Event Sourcing

Solution: Persist state changes as immutable events; current state derived by replaying event stream.

Trade-off: Storage grows unbounded; snapshotting required; replay latency for cold aggregates.

Real-world: LMAX exchange keeps its business logic in memory and recovers state by replaying journaled input events on top of a snapshot.
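
A minimal sketch of the replay idea, assuming a toy account balance as the aggregate. The event names ("deposited", "withdrew") and the helper names are hypothetical, chosen only for illustration. It also shows why snapshotting helps: replaying the tail on top of a snapshot gives the same state as a full replay.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    kind: str    # hypothetical event names: "deposited" or "withdrew"
    amount: int


def apply(balance: int, event: Event) -> int:
    """Pure transition: fold one immutable event into the current state."""
    if event.kind == "deposited":
        return balance + event.amount
    if event.kind == "withdrew":
        return balance - event.amount
    raise ValueError(f"unknown event kind: {event.kind}")


def replay(events, snapshot: int = 0) -> int:
    """Derive current state by replaying events on top of a snapshot."""
    balance = snapshot
    for event in events:
        balance = apply(balance, event)
    return balance


log = [Event("deposited", 100), Event("withdrew", 30), Event("deposited", 5)]
full = replay(log)                         # replay the whole stream
snapshot = replay(log[:2])                 # state captured after two events
from_snapshot = replay(log[2:], snapshot)  # replay only the tail
```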

Problem · Dataset exceeds single-node capacity

Sharding / Partitioning

Solution: Horizontally split data by shard key (user_id, tenant_id); route queries to correct shard.

Trade-off: Cross-shard queries expensive; hot shards; resharding is painful without consistent hashing.

Real-world: Instagram shards user data by user ID across logical PostgreSQL shards; Twitter built the Gizzard framework to shard data across many MySQL nodes.

Problem · Adding or removing nodes remaps most keys

Consistent Hashing

Solution: Map keys and nodes to hash ring; each key owned by next clockwise node; virtual nodes for balance.

Trade-off: Still possible hot spots; range queries unsupported; more complex than modulo hashing.

Real-world: Amazon's Dynamo (as described in the Dynamo paper) and Cassandra use consistent hashing for partition placement.
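
A small sketch of a hash ring with virtual nodes, under assumptions: MD5 is used only as a convenient stable hash, and the node names, vnode count, and `HashRing` class are hypothetical. It illustrates the key property: when a node joins, the only keys that move are the ones the new node takes over.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    # Stable 64-bit position on the ring (MD5 chosen for stability, not security).
    return int.from_bytes(hashlib.md5(value.encode()).digest()[:8], "big")


class HashRing:
    def __init__(self, nodes, vnodes: int = 100):
        # Each physical node appears at `vnodes` positions to smooth the load.
        self._ring = sorted(
            (_hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def owner(self, key: str) -> str:
        # First ring position clockwise of the key's hash, wrapping at the end.
        idx = bisect.bisect_right(self._points, _hash(key)) % len(self._ring)
        return self._ring[idx][1]


keys = [f"user:{i}" for i in range(1000)]
before = HashRing(["a", "b", "c"])
after = HashRing(["a", "b", "c", "d"])
moved = sum(before.owner(k) != after.owner(k) for k in keys)
```

With modulo hashing, going from three to four nodes would remap most keys; here `moved` covers only the keys now owned by "d".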

Problem · Expensive read path on every request

Materialized Views / Denormalization

Solution: Precompute join results into read-optimized tables or documents updated on write or via CDC.

Trade-off: Write amplification; stale reads unless sync is tight; schema duplication.

Real-world: Facebook TAO serves social-graph objects and associations from a read-optimized cache tier in front of MySQL.

Problem · Read-heavy feeds, slow timeline assembly

Fan-out on Write

Solution: On tweet/post, push to all follower feeds at write time; reads are O(1) cache lookup.

Trade-off: Write amplification for users with millions of followers (the celebrity problem); wasted work for inactive followers.

Real-world: Twitter hybrid: fan-out on write for normal users, fan-out on read for celebrities.

Problem · Millions of followers, write amplification

Fan-out on Read

Solution: Store posts in the author's timeline only; at read time, merge the timelines of the accounts the reader follows.

Trade-off: Read latency grows with the number of accounts followed; complex pagination.

Real-world: Twitter has described using fan-out on read for accounts with very large follower counts.

Problem · Mixed follower distribution

Hybrid Fan-out

Solution: Fan-out on write below threshold; fan-out on read above; tiered by follower count.

Trade-off: Two code paths to maintain; threshold tuning per product.

Real-world: Twitter's hybrid timeline model, described in its engineering talks and blog posts.
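
The threshold rule can be sketched as follows. This is an in-memory illustration under assumptions: the threshold value, the dictionaries standing in for feed storage, and all function names are hypothetical, and a real system would use persistent stores and tune the threshold per product.

```python
from collections import defaultdict

CELEBRITY_THRESHOLD = 3  # hypothetical cut-off; tuned per product in practice

followers = defaultdict(set)      # author -> readers following them
following = defaultdict(set)      # reader -> authors they follow
feeds = defaultdict(list)         # reader -> entries pushed at write time
author_posts = defaultdict(list)  # author -> their own posts


def follow(reader, author):
    followers[author].add(reader)
    following[reader].add(author)


def is_celebrity(author):
    return len(followers[author]) >= CELEBRITY_THRESHOLD


def post(author, ts, text):
    entry = (ts, author, text)
    author_posts[author].append(entry)
    if not is_celebrity(author):
        # Fan-out on write: push to every follower's precomputed feed.
        for reader in followers[author]:
            feeds[reader].append(entry)


def read_feed(reader):
    # Start from the precomputed feed, then pull celebrity posts at read time.
    merged = set(feeds[reader])  # a set avoids duplicates if an author changes tier
    for author in following[reader]:
        if is_celebrity(author):
            merged.update(author_posts[author])
    return sorted(merged, reverse=True)  # newest first


for reader in ("r1", "r2", "r3"):
    follow(reader, "star")
follow("r1", "friend")
post("friend", 1, "hi")
post("star", 2, "hello fans")
```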

Problem · Traffic exceeds single server

Horizontal Scaling

Solution: Add stateless app servers behind load balancer; partition stateful tiers separately.

Trade-off: Requires shared-nothing or externalized state; session stickiness pitfalls.

Real-world: Netflix scales API tiers horizontally; state lives in Cassandra and EVCache.

Problem · Variable or bursty traffic

Auto-scaling

Solution: Scale instance count on CPU, QPS, or queue depth metrics with cooldown periods.

Trade-off: Cold start latency; over-provisioning cost; overly aggressive scale-down causes flapping.

Real-world: AWS Auto Scaling Groups behind ALB for Black Friday e-commerce peaks.

Problem · Global users, latency

Geo-sharding / Multi-region

Solution: Deploy data and compute in regional cells; route users to nearest region.

Trade-off: Cross-region consistency hard; data residency compliance; operational complexity.

Real-world: Google Spanner multi-region with TrueTime; Uber geo-partitions ride data.

Problem · Read-heavy, write-rare

Read Replicas

Solution: Primary handles writes; N replicas serve read traffic with async replication lag.

Trade-off: Replication lag causes stale reads; failover promotion complexity.

Real-world: PostgreSQL read replicas on AWS RDS for analytics queries off primary.

Reliability Patterns

Survive partial failures, prevent cascades, and recover gracefully.

Problem · Downstream dependency failures cascade

Circuit Breaker

Solution: Track failure rate; open circuit after threshold; fail fast with fallback; half-open probe.

Trade-off: Tuning thresholds is hard; false opens under transient blips; needs metrics.

Real-world: Netflix Hystrix (now in maintenance mode; Resilience4j is the commonly recommended successor) protects microservice chains from cascade failures.
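
A minimal sketch of the closed, open, and half-open states. As a simplification it counts consecutive failures rather than tracking a failure rate, and the class name, parameters, and injectable clock are assumptions made for illustration, not the API of Hystrix or Resilience4j.

```python
import time


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock        # injectable so the timeout can be tested
        self.failures = 0
        self.opened_at = None     # None means the circuit is closed

    def call(self, fn, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                return fallback()  # open: fail fast without touching the dependency
            # Timeout elapsed: half-open, let this one call through as a probe.
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed probe
            return fallback()
        # Success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A production breaker would also limit concurrent probes in the half-open state and export metrics for each state change.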

Problem · One bad tenant consumes all resources

Bulkhead

Solution: Isolate thread pools, connection pools, or queues per tenant/service so failure is contained.

Trade-off: Lower overall utilization; more infrastructure to provision per bulkhead.

Real-world: Hystrix thread pools per dependency; Kubernetes resource quotas per namespace.

Problem · Transient network / service errors

Retry with Exponential Backoff

Solution: Retry failed calls with increasing delay and jitter; cap max attempts.

Trade-off: Retries amplify load on a struggling dependency; without jitter, clients retry in lockstep (thundering herd); idempotency required.

Real-world: AWS SDK built-in retry with jitter for S3 and DynamoDB API calls.
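
A sketch of capped exponential backoff with full jitter. The function name and parameters are hypothetical, and for brevity it retries on any exception; real code would retry only errors known to be transient.

```python
import random
import time


def retry_with_backoff(fn, max_attempts=5, base_delay=0.1, max_delay=5.0,
                       sleep=time.sleep, rng=random.random):
    """Call fn, retrying on failure with capped exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # attempts exhausted: surface the last error
            cap = min(max_delay, base_delay * 2 ** attempt)
            # Full jitter: sleep a random fraction of the cap so that
            # clients that failed together do not retry together.
            sleep(rng() * cap)
```

The `sleep` and `rng` parameters exist only so the delays can be observed in tests without waiting.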

Problem · Duplicate requests cause double-charge

Idempotency Keys

Solution: Client sends unique idempotency key; server stores result keyed by key for replay window.

Trade-off: Storage for idempotency records; TTL management; key collision handling.

Real-world: Stripe API accepts an optional Idempotency-Key header on POST requests and replays the stored response for a repeated key.
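
A minimal server-side sketch, assuming an in-memory dictionary as the idempotency store and a 24-hour replay window; both are illustrative choices, and `IdempotentHandler` is a hypothetical name. It does not handle two concurrent requests with the same key, which a real store would need an atomic insert for.

```python
import time


class IdempotentHandler:
    def __init__(self, handler, ttl_seconds=24 * 3600, clock=time.monotonic):
        self._handler = handler
        self._ttl = ttl_seconds
        self._clock = clock
        self._results = {}  # idempotency key -> (expires_at, stored result)

    def handle(self, idempotency_key, request):
        now = self._clock()
        cached = self._results.get(idempotency_key)
        if cached is not None and cached[0] > now:
            return cached[1]  # replay the stored result; do not run the handler again
        result = self._handler(request)
        self._results[idempotency_key] = (now + self._ttl, result)
        return result
```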

Problem · Partial failure in multi-step workflow

Compensating Transactions

Solution: Each step has undo operation; on failure, run compensations in reverse order.

Trade-off: Not all operations are reversible; saga orchestration complexity.

Real-world: Airline booking: cancel seat reservation if payment fails.
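
The reverse-order undo can be sketched in a few lines. `run_saga` is a hypothetical helper, and it assumes every compensation succeeds; real sagas also have to persist progress and retry failed compensations.

```python
def run_saga(steps):
    """Run (action, compensation) pairs; on failure, undo completed steps in reverse."""
    completed = []
    try:
        for action, compensation in steps:
            action()
            completed.append(compensation)  # registered only once its action succeeded
    except Exception:
        for compensation in reversed(completed):
            compensation()
        return False
    return True
```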

Problem · Service unavailable, must degrade

Graceful Degradation

Solution: Serve cached/stale data or reduced feature set when dependencies fail.

Trade-off: User sees degraded UX; must define acceptable degradation levels.

Real-world: Netflix serves stale recommendations when personalization service is down.

Problem · Unknown failure state after timeout

Fail-safe Defaults

Solution: On timeout or error, return safe default rather than error or stale critical data.

Trade-off: May hide real problems; defaults must be product-acceptable.

Real-world: Rate limiter fails open vs closed depending on abuse risk tolerance.

Problem · Process crash loses in-flight work

Checkpointing

Solution: Periodically persist processing state so restart resumes from last checkpoint.

Trade-off: Frequent checkpoints add runtime overhead; sparse ones mean more reprocessing after a restart; storage cost.

Real-world: Kafka consumer offsets as checkpoints; Flink savepoints for stream jobs.

Problem · Need 99.99% despite component failures

Redundancy + HA pairs

Solution: Deploy N+1 or active-active across AZs; health checks trigger failover.

Trade-off: Cost doubles or more; split-brain risk without proper fencing.

Real-world: RDS Multi-AZ automatic failover; Kubernetes pod anti-affinity across nodes.

Problem · Detect failures quickly

Health Checks + Heartbeats

Solution: Liveness/readiness probes; periodic heartbeats with timeout-based failure detection.

Trade-off: False positives from GC pauses; phi-accrual improves on fixed timeouts.

Real-world: Kubernetes liveness probes; Consul health checks for service mesh.

Problem · Chaos validation

Fault Injection

Solution: Deliberately inject latency, errors, and kills in staging/prod to validate resilience.

Trade-off: Risk in production without blast-radius controls; requires mature observability.

Real-world: Netflix Chaos Monkey randomly terminates instances in production.

Problem · Overload causes total collapse

Load Shedding

Solution: Drop or reject low-priority requests when saturation detected to protect core path.

Trade-off: Dropped requests = errors for some users; priority classification needed.

Real-world: Google SRE practice: return 503 early rather than queue infinitely.

Data Patterns

Consistency, caching, replication, and integrity across distributed data stores.

Problem · Dual-write DB + queue race condition

Transactional Outbox

Solution: Write business row + outbox row in same DB transaction; relay publishes outbox to queue.

Trade-off: Relay lag; at-least-once delivery requires consumer idempotency.

Real-world: Debezium CDC from outbox table; Uber uses outbox for reliable event publishing.
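
A sketch of the pattern using SQLite from the standard library. The table layout, the `place_order` and `relay` functions, and the polling relay are assumptions for illustration; a CDC-based relay such as Debezium would tail the database log instead of polling.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, total INTEGER NOT NULL);
    CREATE TABLE outbox (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        payload TEXT NOT NULL,
        published INTEGER NOT NULL DEFAULT 0
    );
""")


def place_order(order_id, total):
    # One transaction: the business row and the outbox row commit together or not at all.
    with conn:
        conn.execute("INSERT INTO orders (id, total) VALUES (?, ?)", (order_id, total))
        conn.execute(
            "INSERT INTO outbox (payload) VALUES (?)",
            (json.dumps({"type": "order_placed", "order_id": order_id}),),
        )


def relay(publish):
    """Publish unpublished outbox rows in order, marking each one afterwards."""
    rows = conn.execute(
        "SELECT id, payload FROM outbox WHERE published = 0 ORDER BY id"
    ).fetchall()
    for row_id, payload in rows:
        # A crash between publish and the UPDATE causes a re-publish: at-least-once.
        publish(json.loads(payload))
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
    return len(rows)
```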

Problem · Distributed transaction across services

Saga (Orchestration)

Solution: Central orchestrator coordinates local transactions with compensating steps on failure.

Trade-off: Orchestrator SPOF unless HA; complex state machine; debugging distributed flows.

Real-world: Uber's Cadence, and its fork Temporal, run multi-step workflows such as ride and payment sagas.

Problem · Distributed transaction, choreography

Saga (Choreography)

Solution: Services publish events; each reacts and publishes next step; no central coordinator.

Trade-off: Hard to trace; cyclic dependencies; compensations harder to reason about.

Real-world: Microservices order flow: Order → Payment → Inventory via domain events.

Problem · Repeated reads hammer the DB

Cache-Aside (Lazy Loading)

Solution: App reads cache first; on miss, read DB and populate cache; writes update DB then invalidate cache.

Trade-off: Cache stampede on miss; thundering herd after invalidation; stale window between DB write and invalidation.

Real-world: Redis cache-aside for product catalog at most e-commerce sites.
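
The read and write paths can be sketched with plain dictionaries standing in for the cache and the database. The `CacheAside` class and its `db_reads` counter are hypothetical and exist only to make the behavior visible.

```python
class CacheAside:
    def __init__(self, db: dict):
        self.db = db          # stands in for the database
        self.cache = {}       # stands in for Redis or similar
        self.db_reads = 0

    def get(self, key):
        if key in self.cache:
            return self.cache[key]      # cache hit
        self.db_reads += 1              # miss: fall through to the database
        value = self.db.get(key)
        if value is not None:
            self.cache[key] = value     # populate for later readers
        return value

    def put(self, key, value):
        self.db[key] = value            # write the source of truth first
        self.cache.pop(key, None)       # then invalidate rather than update
```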

Problem · Reads must see fresh writes from the cache

Write-Through Cache

Solution: App writes to cache; cache synchronously writes to DB before acknowledging.

Trade-off: Write latency includes cache + DB; cache becomes write bottleneck.

Real-world: Amazon DynamoDB Accelerator (DAX) acts as a write-through cache in front of DynamoDB.

Problem · DB change must invalidate cache

CDC (Change Data Capture)

Solution: Stream DB WAL/binlog changes to consumers for cache invalidation or search index sync.

Trade-off: Ordering guaranteed only per partition; schema evolution; lag monitoring.

Real-world: LinkedIn Databus; Debezium streaming PostgreSQL WAL to Kafka.

Problem · Hot key overloads single partition

Key Splitting / Salting

Solution: Append random salt to hot keys to spread writes across partitions; merge on read.

Trade-off: Read path must aggregate salted shards; application complexity.

Real-world: YouTube view counts split across multiple counter shards.
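
A sketch of a salted counter, assuming a fixed number of salts and a `Counter` standing in for a partitioned store; the key format and function names are hypothetical. Writes pick a random sub-key, and reads sum every sub-key.

```python
import random
from collections import Counter

NUM_SALTS = 8  # hypothetical; more salts spread writes further but cost more on read

counters = Counter()  # stands in for a partitioned key-value store


def increment(key, rng=random):
    # Append a random salt so writes for one hot key land on different partitions.
    salt = rng.randrange(NUM_SALTS)
    counters[f"{key}#{salt}"] += 1


def read_total(key):
    # The read path must aggregate every salted sub-key.
    return sum(counters[f"{key}#{salt}"] for salt in range(NUM_SALTS))


for _ in range(1000):
    increment("video:42")
```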

Problem · Need approximate counts at scale

Probabilistic Data Structures

Solution: HyperLogLog for cardinality, Count-Min Sketch for frequency, Bloom filter for membership.

Trade-off: Approximate only; tuning parameters; not suitable for exact financial data.

Real-world: Redis HyperLogLog for unique visitor counts; Cassandra Bloom filters on SSTables.

Problem · Multi-master write conflicts

Last-Write-Wins (LWW)

Solution: Resolve conflicts by timestamp; latest write wins across replicas.

Trade-off: Clock skew causes data loss; not suitable for counters or collaborative edits.

Real-world: Cassandra resolves concurrent writes per cell by timestamp; DynamoDB global tables use last-writer-wins across regions.

Problem · Collaborative editing without locks

CRDTs

Solution: Conflict-free replicated data types merge concurrent updates deterministically.

Trade-off: Limited to specific data types; memory overhead; garbage collection of tombstones.

Real-world: Figma's multiplayer sync is inspired by CRDTs but relies on a central server to order changes, so it is not a pure CRDT.
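
One of the simplest CRDTs, a grow-only counter (G-Counter), shows what "merge deterministically" means in practice. This is an illustrative sketch with a hypothetical `GCounter` class, not the structure any particular product uses.

```python
class GCounter:
    """Grow-only counter: each replica increments its own slot; merge takes the max."""

    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.counts = {}  # replica id -> increments observed from that replica

    def increment(self, amount=1):
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + amount

    def merge(self, other):
        # Element-wise max is commutative, associative, and idempotent,
        # so replicas converge regardless of merge order or repetition.
        for rid, count in other.counts.items():
            self.counts[rid] = max(self.counts.get(rid, 0), count)

    def value(self):
        return sum(self.counts.values())
```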

Problem · Strong ordering in distributed log

Single-Leader Replication

Solution: One leader accepts writes; followers replicate log; reads from leader or followers.

Trade-off: Leader bottleneck; failover election window; AZ failure risk.

Real-world: PostgreSQL streaming replication; Kafka partition leader.

Problem · High write availability multi-region

Leaderless / Quorum

Solution: Writes go to W nodes; reads from R nodes where W+R>N for overlap consistency.

Trade-off: Conflict resolution on concurrent writes; read repair complexity.

Real-world: Cassandra QUORUM reads/writes; the Dynamo paper's common configuration of N=3, W=2, R=2.
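
The overlap condition can be checked by brute force for small clusters. The sketch below, with a hypothetical `always_overlap` helper, enumerates every possible write set and read set and confirms they share a node exactly when W+R>N.

```python
from itertools import combinations


def always_overlap(n: int, w: int, r: int) -> bool:
    """True if every set of w written nodes intersects every set of r read nodes."""
    nodes = range(n)
    return all(
        set(write_set) & set(read_set)
        for write_set in combinations(nodes, w)
        for read_set in combinations(nodes, r)
    )
```

For example, N=3 with W=2 and R=2 always overlaps, while W=1 and R=2 can miss the only node holding the latest write.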

Messaging Patterns

Async communication, ordering, delivery guarantees, and event-driven architecture.

Problem · Async decoupling of producers/consumers

Message Queue (Point-to-Point)

Solution: Producer sends to queue; one consumer processes each message; ack deletes.

Trade-off: No replay once a message is acked and deleted; ordering is best-effort unless the queue is FIFO; backpressure management.

Real-world: AWS SQS for order processing workers; RabbitMQ task queues.

Problem · Multiple subscribers, event broadcast

Pub/Sub

Solution: Publisher sends to topic; all subscribers receive copy; decoupled fan-out.

Trade-off: Subscriber slow → backlog; need retention policy; ordering per partition only.

Real-world: Google Pub/Sub; Kafka topics with multiple consumer groups.

Problem · Event log with replay

Event Streaming (Kafka)

Solution: Durable ordered log; consumers track offset; replay from any point; retention policy.

Trade-off: Operational complexity; partition count planning; exactly-once is hard.

Real-world: Kafka originated at LinkedIn; Uber reports processing trillions of messages per day on Kafka.

Problem · Guaranteed processing despite failures

At-Least-Once + Idempotent Consumer

Solution: Ack after processing; retry on failure; consumer deduplicates by message ID.

Trade-off: Duplicate processing window; idempotency store overhead.

Real-world: Payment processors dedupe by transaction_id on Kafka consumer replay.

Problem · No duplicate processing ever

Exactly-Once Semantics

Solution: Idempotent, transactional producer; consumed offsets committed in the same transaction as the output records; downstream consumers read only committed data.

Trade-off: Performance cost; limited broker support; end-to-end still needs idempotent sinks.

Real-world: Kafka transactions API; Flink exactly-once with checkpointed sinks.

Problem · Poison messages block queue

Dead Letter Queue (DLQ)

Solution: After N failed attempts, move message to DLQ for manual inspection and replay.

Trade-off: DLQ monitoring required; replay tooling; alert fatigue if DLQ fills.

Real-world: SQS redrive policy to DLQ after maxReceiveCount exceeded.

Problem · Priority handling under load

Priority Queues

Solution: Separate queues or priority levels; high-priority messages processed first.

Trade-off: Starvation of low priority; complexity in routing; monitoring per tier.

Real-world: Hospital triage systems; payment fraud checks on high-priority queue.

Problem · Backpressure when consumer slow

Consumer Lag Monitoring + Scale

Solution: Track consumer lag metric; auto-scale consumers or throttle producers when lag exceeds SLO.

Trade-off: Consumer scaling is capped by partition count; lag that outlasts the retention window means data loss.

Real-world: Kafka consumer lag alerts in Datadog; K8s HPA on consumer deployment.

Problem · Ordered processing per entity

Partition Key Ordering

Solution: Route all events for same entity (user_id) to same partition for FIFO guarantee.

Trade-off: Hot partitions if key distribution skewed; limits parallelism per key.

Real-world: Kafka partition by order_id ensures ordered payment state transitions.
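
A sketch of key-based routing. It uses CRC32 as a stand-in hash (Kafka's default partitioner uses a different hash function), and the partition count and event data are made up for illustration.

```python
import zlib


def partition_for(key: str, num_partitions: int) -> int:
    # Same key always hashes to the same partition, so its events stay in order.
    return zlib.crc32(key.encode()) % num_partitions


partitions = [[] for _ in range(4)]
events = [
    ("order-1", "created"),
    ("order-2", "created"),
    ("order-1", "paid"),
    ("order-1", "shipped"),
]
for key, state in events:
    partitions[partition_for(key, 4)].append((key, state))

# Events for one entity, read back from its single partition, keep publish order.
order_1_states = [
    state
    for key, state in partitions[partition_for("order-1", 4)]
    if key == "order-1"
]
```

Changing the partition count changes the mapping, which is one reason partition counts are planned up front.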

Problem · Schema evolution in events

Schema Registry

Solution: Central registry for Avro/Protobuf schemas; compatibility checks on publish.

Trade-off: Operational dependency; breaking schema changes need migration plan.

Real-world: Confluent Schema Registry with BACKWARD compatibility for Kafka producers.

Problem · Request-reply over async bus

Correlation ID Pattern

Solution: Request includes correlation_id; reply routed back via temporary reply queue or callback.

Trade-off: Timeout handling; orphaned replies; callback endpoint availability.

Real-world: RabbitMQ reply-to header; AWS SNS/SQS request-response with correlation IDs.

Problem · Scheduled / delayed delivery

Delayed Message Queue

Solution: Message visible after delay interval; used for retries, reminders, scheduled jobs.

Trade-off: Clock precision; max delay limits; not a cron replacement at scale.

Real-world: SQS delay queues; RabbitMQ delayed message plugin for retry backoff.

Deployment Patterns

Ship safely, roll back fast, and migrate legacy without downtime.

Problem · Zero-downtime releases

Rolling Deployment

Solution: Replace instances incrementally; old and new versions run simultaneously during rollout.

Trade-off: Mixed-version bugs; slow rollback; schema compatibility required.

Real-world: Kubernetes default RollingUpdate strategy for Deployments.

Problem · Instant rollback, two versions live

Blue-Green Deployment

Solution: Deploy new version (green) alongside old (blue); switch traffic via LB; keep blue for rollback.

Trade-off: Double infrastructure cost during switch; database migration complexity.

Real-world: AWS CodeDeploy blue-green with ALB target group swap.

Problem · Gradual risk reduction

Canary Release

Solution: Route small % of traffic to new version; monitor metrics; incrementally increase.

Trade-off: Metric noise at low traffic; requires automated promotion/rollback.

Real-world: Spinnaker canary analysis; Istio traffic splitting 5% → 25% → 100%.

Problem · Per-user or per-tenant testing

Feature Flags / Toggles

Solution: Runtime config routes users to new features without redeploy; kill switch included.

Trade-off: Flag debt accumulates; testing matrix explosion; needs flag management platform.

Real-world: LaunchDarkly and similar platforms; internal flag systems at most large tech companies.

Problem · Incremental legacy replacement

Strangler Fig

Solution: New functionality built on new system; proxy routes traffic gradually from monolith.

Trade-off: Long migration timeline; dual maintenance; routing layer complexity.

Real-world: Amazon's gradual extraction of services from original monolithic codebase.

Problem · Configuration drift from in-place patching

Immutable Deployments

Solution: Never patch running servers; replace entire instance/image on every deploy.

Trade-off: Slower deploy if images large; state must be externalized.

Real-world: Docker containers are replaced rather than patched; Netflix bakes a fresh AMI per release (Aminator) instead of patching instances.

Problem · Database schema changes safely

Expand-Contract Migration

Solution: Expand: add the new column as nullable, write to both old and new, and backfill existing rows; switch reads to the new column; contract: remove the old column in a later release.

Trade-off: Requires multiple deploy cycles; application handles both schemas temporarily.

Real-world: Stripe's "Online migrations at scale" write-up describes a staged, dual-write approach to zero-downtime migrations.

Problem · Traffic shift for maintenance

Traffic Draining

Solution: Stop sending new requests to instance; wait for in-flight to complete; then terminate.

Trade-off: Drain timeout must exceed longest request; health check coordination.

Real-world: Kubernetes preStop hook + terminationGracePeriodSeconds for graceful shutdown.

Problem · Multi-environment promotion

Environment Promotion Pipeline

Solution: Dev → staging → prod with gates; config differs per env; same artifact promoted.

Trade-off: Staging never matches prod; environment drift; test data gaps.

Real-world: CI/CD pipelines in GitHub Actions with manual prod approval gate.

Problem · Container orchestration at scale

Kubernetes Deployment

Solution: Declarative desired state; scheduler places pods; self-healing; rolling updates.

Trade-off: Operational learning curve; networking complexity; stateful workload challenges.

Real-world: Spotify, Airbnb, Pinterest run production on Kubernetes.

Problem · Edge traffic management

CDN + Edge Deployment

Solution: Serve static and cacheable content from edge PoPs; an origin shield collapses duplicate cache misses to protect the origin.

Trade-off: Cache invalidation latency; dynamic content harder; cost at scale.

Real-world: Cloudflare and Akamai serve majority of static web assets globally.

Problem · Manual, unreproducible infrastructure changes

Infrastructure as Code

Solution: Terraform/Pulumi define infra declaratively; plan/apply with review; drift detection.

Trade-off: State file management; module versioning; blast radius of bad apply.

Real-world: Many cloud-native companies manage infra via Terraform or CDK.

Distributed Systems Patterns

Consensus, clocks, failure detection, and coordination at scale.

Problem · Network partition, need consensus

Raft / Paxos Consensus

Solution: Leader election + log replication; majority quorum for commit; linearizable writes.

Trade-off: Latency of quorum round; leader bottleneck; not for high-throughput writes.

Real-world: etcd uses Raft; CockroachDB and TiKV built on Raft for distributed SQL.

Problem · Leader election without split-brain

Leader Election (etcd/ZK)

Solution: Candidates compete via ephemeral nodes; fencing tokens prevent stale leader writes.

Trade-off: ZooKeeper operational burden; etcd quorum sizing; watch latency.

Real-world: Kubernetes controller-manager elects a leader via Lease objects written through the API server (backed by etcd).

Problem · Distributed mutual exclusion

Distributed Locks (fencing tokens)

Solution: Acquire lock via Redis SETNX or etcd; fencing token invalidates stale lock holders.

Trade-off: Redlock controversy; lock holder crash leaves lock until TTL; always use fencing.

Real-world: Google Chubby for GFS master election; Martin Kleppmann's fencing token critique.
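
A sketch of why the fencing token matters: the storage layer, not the lock holder, rejects writes that carry an older token. `LockService` and `FencedStorage` are hypothetical stand-ins; a real lock service would also handle TTLs and release.

```python
import itertools


class LockService:
    """Hands out a strictly increasing fencing token with each lock grant."""

    def __init__(self):
        self._tokens = itertools.count(1)

    def acquire(self) -> int:
        return next(self._tokens)


class FencedStorage:
    """Accepts a write only if its token is at least the highest seen so far."""

    def __init__(self):
        self.highest_token = 0
        self.value = None

    def write(self, token: int, value) -> bool:
        if token < self.highest_token:
            return False  # stale holder (for example, paused past its lock TTL)
        self.highest_token = token
        self.value = value
        return True
```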

Problem · Global timestamps without perfect clocks

Hybrid Logical Clocks (HLC)

Solution: Combine physical timestamp with logical counter for causally consistent ordering.

Trade-off: Not true wall-clock; drift handling; not linearizable globally.

Real-world: CockroachDB uses HLC for distributed transaction timestamps.

Problem · True global ordering

TrueTime (Google Spanner)

Solution: GPS + atomic clocks bound clock uncertainty; commit wait for uncertainty window.

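Trade-off: Specialized clock hardware; commit wait on the order of milliseconds (the Spanner paper reports clock uncertainty of roughly 1 to 7 ms); outside Google it is available only through managed Cloud Spanner.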

Real-world: Google Spanner external consistency via TrueTime API.

Problem · Membership and failure detection

Gossip Protocol

Solution: Nodes periodically exchange state; epidemic dissemination; phi-accrual failure detection.

Trade-off: Eventual membership convergence; bandwidth at scale; not for strong consistency.

Real-world: Amazon Dynamo membership; Cassandra gossip for ring state.

Problem · Consistent view of cluster state

Service Discovery (Consul/etcd)

Solution: Register services; health check; DNS or sidecar proxy resolves healthy instances.

Trade-off: Cache staleness; split-brain during partition; needs HA for registry itself.

Real-world: Kubernetes DNS + Endpoints; Consul service mesh at HashiCorp deployments.

Problem · Partial failures invisible to client

Fallacies of Distributed Computing

Solution: Design assuming network is unreliable, latency is non-zero, bandwidth is finite.

Trade-off: Over-engineering if applied dogmatically; fragile systems if forgotten.

Real-world: Every senior interview probes whether you account for partial failure.

Problem · Unreliable network, need delivery guarantee

At-Most-Once vs At-Least-Once vs Exactly-Once

Solution: Choose delivery semantics per use case; financial = exactly-once or idempotent at-least-once.

Trade-off: Exactly-once end-to-end is a myth without idempotent sinks; stronger guarantees cost throughput and latency.

Real-world: Kafka delivery guarantees documented per producer acks and consumer config.

Problem · Clock skew breaks ordering

Lamport / Vector Clocks

Solution: Logical clocks establish happened-before without synchronized physical clocks.

Trade-off: Vector clock size grows with replicas; not human-readable timestamps.

Real-world: Dynamo paper uses vector clocks for conflict detection.
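
A sketch of vector clock comparison, with clocks represented as plain dictionaries from node ID to counter (a missing entry counts as zero). The `compare` function and its return labels are hypothetical. Two clocks that are neither before nor after each other are concurrent, which is how a conflict is detected.

```python
def compare(a: dict, b: dict) -> str:
    """Classify the causal relationship between two vector clocks."""
    nodes = a.keys() | b.keys()
    a_le_b = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    b_le_a = all(b.get(n, 0) <= a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"      # a happened-before b
    if b_le_a:
        return "after"       # b happened-before a
    return "concurrent"      # neither dominates: conflicting versions
```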

Problem · Split brain during partition

Quorum + Fencing

Solution: Require majority for writes; fence old leader with generation numbers or STONITH.

Trade-off: The minority side of a partition loses availability; operator confusion during failover.

Real-world: MongoDB replica set elections; Redis Sentinel with quorum.

Problem · State machine replication

Replicated State Machine

Solution: All nodes apply same ordered log of commands; deterministic state machine yields same state.

Trade-off: Log compaction needed; snapshot + replay for new nodes; command size limits.

Real-world: Raft replicated state machine pattern in etcd and Consul.

💡 Interview Tip

Never name-drop a pattern without stating the problem it solves and its trade-off. "I'd use CQRS because reads are 100× writes and we can denormalize the feed" beats "we'll use CQRS" every time.