Production Architecture Patterns
Moving from "it works in dev" to "it survives production" means making explicit choices about availability zones, regions, deployment models, and recovery time. This chapter covers the patterns AWS teams actually run — three-tier Multi-AZ stacks, active-passive and active-active multi-region designs, ECS vs EKS microservices, serverless pitfalls that only appear at scale, CI/CD with safe rollouts, and disaster recovery strategies mapped to RTO and RPO. These are the decisions that separate a demo from a system that earns customer trust.
Multi-AZ architecture
An Availability Zone (AZ) is one or more physically separate data centers within a region, with independent power, cooling, and networking. Single-AZ outages are not rare enough to ignore: power, networking, and hardware failures happen. Production workloads should span at least two AZs, and stateful services should use AWS-managed Multi-AZ failover rather than DIY replication.
Three-tier across AZs
The classic production pattern: web tier (stateless), application tier (stateless), data tier (stateful). Web and app tiers run in every AZ; the load balancer distributes traffic; the database runs Multi-AZ with synchronous standby. Each tier lives in private subnets per AZ, with only the load balancer in public subnets.
flowchart TB
subgraph Internet
USERS["Users / API clients"]
end
subgraph Region["AWS Region (e.g. eu-west-1)"]
ALB["Application Load Balancer\n(public subnets, spans AZs)"]
subgraph AZ1["Availability Zone A"]
WEB1["Web tier\nECS tasks / EC2"]
APP1["App tier\nSpring Boot / Node"]
RDS_PRI["RDS Primary\n(Multi-AZ)"]
end
subgraph AZ2["Availability Zone B"]
WEB2["Web tier\nECS tasks / EC2"]
APP2["App tier\nSpring Boot / Node"]
RDS_STBY["RDS Standby\n(synchronous replica)"]
end
subgraph AZ3["Availability Zone C"]
WEB3["Web tier\nECS tasks / EC2"]
APP3["App tier\nSpring Boot / Node"]
end
CACHE["ElastiCache\n(cluster mode, multi-AZ)"]
S3["S3\n(regional, 11 nines)"]
end
USERS --> ALB
ALB --> WEB1 & WEB2 & WEB3
WEB1 --> APP1
WEB2 --> APP2
WEB3 --> APP3
APP1 & APP2 & APP3 --> RDS_PRI
RDS_PRI -.->|sync replication| RDS_STBY
APP1 & APP2 & APP3 --> CACHE
APP1 & APP2 & APP3 --> S3
ALB + ASG + RDS Multi-AZ — the production baseline
| Component | Multi-AZ behavior | Production rule |
|---|---|---|
| ALB | Automatically spans AZs; health checks route around unhealthy targets | Register targets in ≥2 AZs; set healthy_threshold and deregistration delay |
| ASG | Distributes instances across the subnets (AZs) listed in its VPC zone identifier and rebalances after an AZ recovers | Set MinHealthyPercentage for instance refresh during deploys; use capacity rebalancing for Spot |
| RDS Multi-AZ | Synchronous standby in another AZ; automatic failover (~60–120 s) | Enable for all production databases; test failover quarterly |
| ElastiCache | Replication group with automatic failover across AZs | Multi-AZ with at least one replica per shard; not a source of truth |
| EFS | Storage replicated across all AZs in the region automatically | Mount targets in every AZ where compute runs |
Never single-AZ for stateful
Stateless compute can tolerate AZ loss if enough capacity exists elsewhere — ASG replaces instances, ALB drains unhealthy targets. Stateful services cannot be rebuilt from another AZ without data. Running RDS, ElastiCache primary, or self-managed databases in a single AZ means a zone outage equals data unavailability until manual recovery — often hours. AWS-managed Multi-AZ handles failover, DNS updates, and storage replication for you.
- RDS / Aurora — Multi-AZ or Aurora with replicas in separate AZs
- DynamoDB — Multi-AZ by default; no single-AZ option to misconfigure
- OpenSearch — dedicated master nodes + data nodes across ≥3 AZs for production
- MSK (Kafka) — broker placement across AZs; min.insync.replicas ≥ 2
# RDS Multi-AZ instance in private DB subnets
# (master user name is a placeholder; the password is generated and kept in Secrets Manager)
aws rds create-db-instance \
  --db-instance-identifier orders-prod \
  --engine postgres \
  --db-instance-class db.r6g.large \
  --allocated-storage 100 \
  --master-username orders_admin \
  --manage-master-user-password \
  --multi-az \
  --vpc-security-group-ids sg-abc123 \
  --db-subnet-group-name prod-db-subnets
# ASG across three AZs (launch template name is a placeholder)
aws autoscaling create-auto-scaling-group \
  --auto-scaling-group-name order-service-asg \
  --launch-template 'LaunchTemplateName=order-service-lt,Version=$Latest' \
  --min-size 3 --max-size 12 --desired-capacity 6 \
  --vpc-zone-identifier "subnet-az1,subnet-az2,subnet-az3" \
  --health-check-type ELB --health-check-grace-period 120
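resource "aws_db_instance" "orders" {
  identifier                  = "orders-prod"
  engine                      = "postgres"
  instance_class              = "db.r6g.large"
  allocated_storage           = 100
  username                    = "orders_admin" # placeholder
  manage_master_user_password = true           # password kept in Secrets Manager
  multi_az                    = true
  db_subnet_group_name        = aws_db_subnet_group.prod.name
  vpc_security_group_ids      = [aws_security_group.db.id]
}
resource "aws_autoscaling_group" "app" {
  name                      = "order-service-asg"
  vpc_zone_identifier       = aws_subnet.private[*].id
  min_size                  = 3
  max_size                  = 12
  desired_capacity          = 6
  health_check_type         = "ELB"
  health_check_grace_period = 120
  launch_template {
    id      = aws_launch_template.app.id # assumed to be defined elsewhere
    version = "$Latest"
  }
  tag {
    key                 = "Name"
    value               = "order-service"
    propagate_at_launch = true
  }
}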
import * as ec2 from 'aws-cdk-lib/aws-ec2';
import * as rds from 'aws-cdk-lib/aws-rds';
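const vpc = new ec2.Vpc(this, 'ProdVpc', {
  maxAzs: 3,
  natGateways: 3, // one per AZ for resilience
  // The default VPC layout has no isolated subnets, so declare them for the database
  subnetConfiguration: [
    { name: 'public', subnetType: ec2.SubnetType.PUBLIC },
    { name: 'app', subnetType: ec2.SubnetType.PRIVATE_WITH_EGRESS },
    { name: 'db', subnetType: ec2.SubnetType.PRIVATE_ISOLATED },
  ],
});
const db = new rds.DatabaseInstance(this, 'OrdersDb', {
  engine: rds.DatabaseInstanceEngine.postgres({ version: rds.PostgresEngineVersion.VER_16 }),
  vpc,
  multiAz: true,
  vpcSubnets: { subnetType: ec2.SubnetType.PRIVATE_ISOLATED },
});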
A common mistake is running Multi-AZ RDS while placing all application instances in one AZ. When that AZ fails, the ALB has no healthy targets even though the database failover succeeded. The ASG vpc-zone-identifier must list subnets in every AZ, and minimum capacity should be ≥ the number of AZs.
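The capacity rule can be made concrete with a little arithmetic. The sketch below assumes the Auto Scaling group balances instances evenly across zones and that the goal is to survive the loss of exactly one AZ; the function name and the example numbers are illustrative, not AWS defaults.

```typescript
// Sizing sketch: how many instances an ASG should run so that losing one AZ
// still leaves enough capacity for peak load. Assumes even spread across AZs.
function capacityToSurviveAzLoss(peakInstances: number, azCount: number): number {
  if (azCount < 2) {
    throw new Error("need at least two AZs to tolerate an AZ failure");
  }
  // After one AZ fails, the remaining (azCount - 1) zones carry the whole peak.
  const perAz = Math.ceil(peakInstances / (azCount - 1));
  return perAz * azCount;
}

// Peak load needs 6 instances across 3 AZs:
// each AZ runs ceil(6 / 2) = 3 instances, so the group runs 9 in total.
console.log(capacityToSurviveAzLoss(6, 3)); // 9
```

With only two AZs the same peak needs 12 instances, which is one reason three-AZ layouts are usually cheaper for the same resilience target.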
Multi-AZ ≠ read scaling. In a Multi-AZ DB instance deployment the standby is not readable. Readable standbys exist only in Multi-AZ DB cluster deployments and in Aurora, which expose separate reader endpoints. For read scaling use read replicas; for HA use Multi-AZ. Exam questions often conflate the two.
RDS Multi-AZ failover updates the CNAME of the primary endpoint to point at the former standby. DNS TTL is low, but clients with aggressive connection pooling may hold stale connections for minutes. Use connection pools with validation queries (SELECT 1) and retry logic on connection refused.
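The retry half of that advice might look like the following sketch. It is driver-agnostic: `operation` stands in for "borrow a validated connection and run the query", and the attempt count and delays are assumptions chosen for illustration. The `sleep` parameter is injectable so the behavior can be tested without real waiting.

```typescript
// Retry sketch for transient connection errors during a Multi-AZ failover.
// All names and default values here are illustrative assumptions.
async function withRetry<T>(
  operation: () => Promise<T>,
  maxAttempts: number = 5,
  baseDelayMs: number = 200,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await operation();
    } catch (err) {
      lastError = err;
      if (attempt === maxAttempts) break;
      // Exponential backoff: baseDelayMs, then 2x, 4x, ...
      await sleep(baseDelayMs * Math.pow(2, attempt - 1));
    }
  }
  throw lastError;
}
```

A production version would normally retry only on errors known to be transient (connection refused, connection reset) and add jitter so that many clients do not reconnect in lockstep once the new primary is reachable.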
Multi-region design
Multi-AZ protects against data center failure within a region. Multi-region protects against regional outages, satisfies data residency requirements, and reduces latency for global users. The hard part is data — not routing.
Active-passive vs active-active
| Pattern | Traffic flow | Data model | When to use |
|---|---|---|---|
| Active-passive | Primary region serves 100%; secondary on standby | Async replication; manual or automated failover | DR site, regulatory standby, cost-sensitive production |
| Active-active | Both regions serve traffic simultaneously | Multi-master or conflict-aware replication required | Global low-latency, zero-downtime failover, strict (near-zero) RTO/RPO targets |
Active-passive is simpler: Route53 failover routing sends traffic to the secondary only when health checks fail on the primary. Active-active requires solving write conflicts — DynamoDB Global Tables (last-writer-wins), Aurora Global Database (one primary writer per cluster, promoted on failover), or application-level idempotency keys.
Route53 failover and latency routing
- Failover routing — primary/secondary records with health checks (HTTP, HTTPS, or CloudWatch alarm). Secondary is SECONDARY type; only used when primary fails.
- Latency-based routing — Route53 returns the record with lowest latency to the user's resolver. Good for active-active read paths.
- Geolocation routing — route EU users to eu-west-1, US users to us-east-1. Combine with data residency policies.
- Health checks — must test end-to-end (ALB path), not just EC2 ping. Use calculated health checks aggregating multiple child checks.
Global Tables and Aurora Global Database
DynamoDB Global Tables
- Multi-region, multi-active: writes in any region replicate to all (typically <1 s)
- Conflict resolution: last writer wins based on timestamp
- Ideal for session stores, user profiles, idempotent event logs
- Not ideal when strong ordering or transactional cross-item guarantees are required
Aurora Global Database
- One primary cluster accepts writes; up to five secondary regions (<1 s replication lag)
- Managed failover: promote secondary to primary in <1 minute (RTO)
- Secondary clusters can serve read-only traffic, which suits active-passive reads
- PostgreSQL and MySQL compatible; use for relational workloads needing global DR
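The last-writer-wins rule listed for Global Tables is easy to state and easy to underestimate. The sketch below is a conceptual model only, not DynamoDB's internal implementation; the item shape and the region tie-breaker are assumptions made for illustration.

```typescript
interface VersionedItem<T> {
  value: T;
  updatedAtMs: number; // write timestamp in epoch milliseconds
  region: string;      // region that accepted the write
}

// Last-writer-wins: the later timestamp survives. The region name is an
// arbitrary but deterministic tie-breaker chosen for this sketch.
function lastWriterWins<T>(a: VersionedItem<T>, b: VersionedItem<T>): VersionedItem<T> {
  if (a.updatedAtMs !== b.updatedAtMs) {
    return a.updatedAtMs > b.updatedAtMs ? a : b;
  }
  return a.region >= b.region ? a : b;
}

// Two regions update the same profile within a few milliseconds.
const euWrite = { value: { theme: "dark" }, updatedAtMs: 1000, region: "eu-west-1" };
const usWrite = { value: { theme: "light" }, updatedAtMs: 1005, region: "us-east-1" };
console.log(lastWriterWins(euWrite, usWrite).value.theme); // light
```

The earlier write is discarded without any error surfacing to the client, which is acceptable for a profile setting and wrong for a counter or an account balance. That is the practical reason the list above warns against workloads that need strong ordering.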
Data residency
GDPR, HIPAA, and national banking regulations often require data to remain in specific jurisdictions. Architecture implications: deploy a full stack per region (not just replicate), use separate KMS keys per region, restrict cross-region S3 replication with bucket policies, and document data flows for auditors. Route53 geolocation alone does not enforce residency — a determined client can call any regional API endpoint. Enforce at the API Gateway resource policy and IAM condition keys (aws:RequestedRegion).
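As a sketch of the IAM side of that enforcement, the helper below builds a policy document that denies requests to any region outside an allow-list, using the aws:RequestedRegion condition key named above. The function name, statement ID, and region list are illustrative assumptions.

```typescript
// Builds an IAM policy document that denies API calls outside the allowed
// regions. Sid and region values are placeholders for this sketch.
function buildRegionDenyPolicy(allowedRegions: string[]) {
  return {
    Version: "2012-10-17",
    Statement: [
      {
        Sid: "DenyOutsideAllowedRegions",
        Effect: "Deny",
        Action: "*",
        Resource: "*",
        Condition: {
          StringNotEquals: { "aws:RequestedRegion": allowedRegions },
        },
      },
    ],
  };
}

console.log(JSON.stringify(buildRegionDenyPolicy(["eu-west-1", "eu-central-1"]), null, 2));
```

A blanket deny like this also blocks global services whose endpoints live in us-east-1 (IAM, Route 53, CloudFront, and others), so real policies usually exempt those with a NotAction list. Treat the output as a starting point to review, not a policy to attach as-is.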
# Route53 failover record — primary ALB in eu-west-1
aws route53 change-resource-record-sets --hosted-zone-id Z1234567890 \
--change-batch '{
"Changes": [{
"Action": "UPSERT",
"ResourceRecordSet": {
"Name": "api.example.com",
"Type": "A",
"SetIdentifier": "primary-eu",
"Failover": "PRIMARY",
"AliasTarget": {
"HostedZoneId": "Z32O12X0K2O1VO",
"DNSName": "dualstack.my-alb-eu-1234567890.eu-west-1.elb.amazonaws.com",
"EvaluateTargetHealth": true
},
"HealthCheckId": "abc-def-ghi"
}
}]
}'
# DynamoDB Global Table — add replica region
aws dynamodb update-table --table-name Sessions \
--replica-updates '[{"Create": {"RegionName": "us-east-1"}}]'
resource "aws_route53_record" "api_primary" {
zone_id = aws_route53_zone.main.zone_id
name = "api.example.com"
type = "A"
set_identifier = "primary-eu"
failover_routing_policy { type = "PRIMARY" }
health_check_id = aws_route53_health_check.api.id
alias {
name = aws_lb.eu.dns_name
zone_id = aws_lb.eu.zone_id
evaluate_target_health = true
}
}
# The provider region is assumed to be eu-west-1; the table's own region is not listed as a replica
resource "aws_dynamodb_table" "sessions" {
  name             = "Sessions"
  billing_mode     = "PAY_PER_REQUEST"
  hash_key         = "sessionId"
  stream_enabled   = true
  stream_view_type = "NEW_AND_OLD_IMAGES"
  attribute {
    name = "sessionId"
    type = "S"
  }
  replica {
    region_name = "us-east-1"
  }
}
import * as dynamodb from 'aws-cdk-lib/aws-dynamodb';
import * as route53 from 'aws-cdk-lib/aws-route53';
// The stack is assumed to be deployed in eu-west-1; replicationRegions must not include the stack's own region
const table = new dynamodb.Table(this, 'Sessions', {
  partitionKey: { name: 'sessionId', type: dynamodb.AttributeType.STRING },
  billingMode: dynamodb.BillingMode.PAY_PER_REQUEST,
  replicationRegions: ['us-east-1'],
  stream: dynamodb.StreamViewType.NEW_AND_OLD_IMAGES,
});
// Failover routing configured via Route53 CfnRecordSet or separate stack per region
Active-active vs cost: running full capacity in two regions doubles compute and data transfer bills. Active-passive with scaled-down standby (pilot light) cuts cost but increases RTO. Match spend to the business SLA — a B2B admin portal and a consumer payments API have different recovery requirements.
Slack runs regional cells with failover boundaries — a regional outage affects one cell, not the entire product. Duolingo uses DynamoDB Global Tables for user progress so learners keep streaks regardless of which region serves their session. Both invested in data layer design before multi-region routing.
Microservices on AWS
Microservices on AWS means choosing an orchestrator (ECS or EKS), how services find each other, whether you need a service mesh, and how to isolate environments. The orchestrator choice affects operational burden more than application code quality.
ECS vs EKS
| Dimension | ECS (Fargate or EC2) | EKS (Kubernetes) |
|---|---|---|
| Operational overhead | Lower — AWS-managed control plane, simpler mental model | Higher — control plane + node upgrades, CRDs, cluster add-ons |
| Portability | AWS-native; limited multi-cloud | Kubernetes manifests portable across clouds and on-prem |
| Scaling model | Service auto-scaling on CPU/memory/request count | HPA, VPA, Karpenter, Cluster Autoscaler — more knobs |
| Networking | awsvpc mode — each task gets an ENI (Fargate limits apply) | VPC CNI, service networking, Ingress controllers (ALB, NGINX) |
| Best fit | Small-to-mid teams, Spring Boot services, fast time-to-production | Platform teams, polyglot microservices, existing K8s investment |
Default recommendation for most backend teams: ECS on Fargate until you have a concrete reason for Kubernetes (existing Helm charts, service mesh requirements, multi-cloud mandate). EKS shines when a platform team can amortize cluster operations across dozens of services.
Service discovery
- ECS Service Connect — built-in service mesh-lite; services call http://orders:8080 without hardcoded IPs
- AWS Cloud Map — DNS-based or API-based registry; integrates with ECS and EKS
- ALB path routing — external clients and BFF layer; internal east-west traffic should not hairpin through ALB
- EKS CoreDNS — standard K8s service discovery via ClusterIP; headless services for StatefulSets
AWS App Mesh
App Mesh is a managed service mesh built on Envoy sidecars. It provides mutual TLS, traffic shifting, retries, and observability between services without embedding that logic in application code. It fits when you have many inter-service calls and need consistent retry/timeout policies, at the price of sidecar overhead (memory and latency). AWS has announced end of support for App Mesh on September 30, 2026, so new designs should start from the alternatives: ECS Service Connect for simpler cases, or Istio/Linkerd on EKS if you want the CNCF ecosystem.
VPC per environment
Production pattern: separate VPC per environment (dev, staging, prod) or per account in AWS Organizations. Never share a VPC between prod and non-prod — a misconfigured security group in dev should not reach prod RDS. Cross-environment access goes through CI/CD roles and explicit peering/transit gateway — not flat networking.
# ECS Service Connect — enable on service
aws ecs create-service \
--cluster prod-cluster \
--service-name order-service \
--task-definition order-service:42 \
--desired-count 3 \
--launch-type FARGATE \
--network-configuration "awsvpcConfiguration={subnets=[subnet-a,subnet-b],securityGroups=[sg-app]}" \
--service-connect-configuration '{
"enabled": true,
"namespace": "prod.local",
"services": [{
"portName": "http",
"discoveryName": "orders",
"clientAliases": [{ "port": 8080, "dnsName": "orders" }]
}]
}'
resource "aws_service_discovery_private_dns_namespace" "prod" {
  name = "prod.local"
  vpc  = aws_vpc.prod.id
}
resource "aws_ecs_service" "orders" {
  name            = "order-service"
  cluster         = aws_ecs_cluster.prod.id
  task_definition = aws_ecs_task_definition.orders.arn
  desired_count   = 3
  launch_type     = "FARGATE"
  # Fargate tasks use awsvpc networking, so subnets and security groups are required
  network_configuration {
    subnets         = aws_subnet.private[*].id    # assumed to be defined elsewhere
    security_groups = [aws_security_group.app.id] # assumed to be defined elsewhere
  }
  service_connect_configuration {
    enabled   = true
    namespace = aws_service_discovery_private_dns_namespace.prod.arn
    service {
      port_name      = "http"
      discovery_name = "orders"
      client_alias {
        port     = 8080
        dns_name = "orders"
      }
    }
  }
}
import * as ecs from 'aws-cdk-lib/aws-ecs';
import * as ec2 from 'aws-cdk-lib/aws-ec2';
const cluster = new ecs.Cluster(this, 'ProdCluster', { vpc, defaultCloudMapNamespace: { name: 'prod.local' } });
const orderService = new ecs.FargateService(this, 'OrderService', {
cluster,
taskDefinition: orderTaskDef,
serviceConnectConfiguration: {
services: [{
portMappingName: 'http',
dnsName: 'orders',
port: 8080,
}],
},
});
Start with one ECS cluster per environment, not per service. Namespace isolation via Service Connect or separate Cloud Map namespaces. Split clusters only when blast radius or compliance requires it (e.g. PCI services in a dedicated cluster).
EKS control plane costs $0.10/hour per cluster (~$73/month) regardless of workload size. Fargate has no cluster fee but higher per-vCPU pricing than EC2-backed ECS. For 5–10 microservices at steady load, EC2-backed ECS with Savings Plans often wins on cost; Fargate wins on ops simplicity.
Serverless architecture
The canonical AWS serverless stack — API Gateway, Lambda, DynamoDB — scales to zero and charges per request. Production serverless fails on cold starts, concurrency limits, and DynamoDB hot partitions long before it fails on business logic.
API Gateway + Lambda + DynamoDB
Request flow: client → API Gateway (auth, throttling, request validation) → Lambda (business logic) → DynamoDB (state). Add SQS between API Gateway and Lambda when you need buffering; add ElastiCache when read patterns are hot and cacheable. Keep Lambda functions small and single-purpose — one function per route group, not one monolith handler parsing every path.
Step Functions for orchestration
When a workflow spans multiple Lambdas, external APIs, or human approval steps, use Step Functions instead of chaining Lambdas via SNS/SQS spaghetti. Standard workflows for long-running processes (up to one year); Express workflows for high-volume short flows (<5 minutes). Built-in retry, catch, and parallel states replace custom orchestration code.
- Order fulfillment — charge payment → reserve inventory → ship → notify (with compensating transactions on failure)
- Document processing — upload to S3 → Textract → summarize with Bedrock → store in DynamoDB
- Human-in-the-loop — waitForTaskToken for manual approval before provisioning
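To show what the built-in retry and catch handling looks like for the order fulfillment case, here is a sketch of an Amazon States Language definition written as a plain TypeScript object. The Lambda function names, the account ID, and the retry numbers are placeholders; only the ASL field names (Type, Resource, Retry, Catch, Next, End) are the real vocabulary.

```typescript
// Placeholder ARN builder; function names below are hypothetical.
const lambdaArn = (name: string): string =>
  `arn:aws:lambda:eu-west-1:123456789012:function:${name}`;

const orderWorkflow = {
  Comment: "Order fulfillment with retry and compensation (sketch)",
  StartAt: "ChargePayment",
  States: {
    ChargePayment: {
      Type: "Task",
      Resource: lambdaArn("charge-payment"),
      // Retry transient task failures with exponential backoff.
      Retry: [
        { ErrorEquals: ["States.TaskFailed"], IntervalSeconds: 2, MaxAttempts: 3, BackoffRate: 2.0 },
      ],
      Next: "ReserveInventory",
    },
    ReserveInventory: {
      Type: "Task",
      Resource: lambdaArn("reserve-inventory"),
      // Any error here triggers the compensating transaction.
      Catch: [{ ErrorEquals: ["States.ALL"], Next: "RefundPayment" }],
      Next: "Ship",
    },
    Ship: { Type: "Task", Resource: lambdaArn("ship-order"), Next: "Notify" },
    Notify: { Type: "Task", Resource: lambdaArn("notify-customer"), End: true },
    RefundPayment: { Type: "Task", Resource: lambdaArn("refund-payment"), Next: "OrderFailed" },
    OrderFailed: { Type: "Fail", Error: "OrderFailed", Cause: "Inventory reservation failed" },
  },
};

console.log(JSON.stringify(orderWorkflow, null, 2));
```

The JSON this prints is the kind of document you would hand to a state machine definition. Keeping the compensation step (RefundPayment) in the workflow rather than inside a Lambda makes the failure path visible in the execution history.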
Production pitfalls
| Pitfall | Symptom | Mitigation |
|---|---|---|
| Cold starts | P99 latency spikes on first request after idle; JVM/Spring especially bad | Provisioned concurrency; SnapStart (Java); smaller runtimes (Node, Python); keep-warm (last resort) |
| Concurrency limits | 429 TooManyRequestsException during traffic spikes | Request account limit increase; reserved concurrency per critical function; SQS buffer for async |
| Hot partitions | Throttling on one DynamoDB partition despite low overall table usage | Design partition keys for even distribution; use write sharding suffixes; on-demand mode helps burst |
| Timeout cascades | API Gateway's 29 s default integration timeout vs Lambda's 15 min limit: synchronous chains time out at the gateway | Async processing via SQS + Step Functions; return 202 Accepted with polling endpoint |
| Connection exhaustion | RDS from Lambda opens too many connections per concurrent execution | RDS Proxy; DynamoDB instead; limit reserved concurrency to match pool size |
# Provisioned concurrency — eliminates cold starts for critical path
aws lambda put-provisioned-concurrency-config \
--function-name order-api \
--qualifier live \
--provisioned-concurrent-executions 10
# DynamoDB on-demand with PITR for production
aws dynamodb create-table \
--table-name Orders \
--attribute-definitions AttributeName=orderId,AttributeType=S \
--key-schema AttributeName=orderId,KeyType=HASH \
--billing-mode PAY_PER_REQUEST
aws dynamodb update-continuous-backups \
--table-name Orders \
--point-in-time-recovery-specification PointInTimeRecoveryEnabled=true
resource "aws_lambda_provisioned_concurrency_config" "order_api" {
function_name = aws_lambda_function.order_api.function_name
qualifier = aws_lambda_alias.live.name
provisioned_concurrent_executions = 10
}
resource "aws_dynamodb_table" "orders" {
  name         = "Orders"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "orderId"
  # HCL allows only one argument in a single-line block, so this block spans lines
  attribute {
    name = "orderId"
    type = "S"
  }
  point_in_time_recovery {
    enabled = true
  }
}
import * as lambda from 'aws-cdk-lib/aws-lambda';
import * as dynamodb from 'aws-cdk-lib/aws-dynamodb';
import * as apigw from 'aws-cdk-lib/aws-apigateway';
const table = new dynamodb.Table(this, 'Orders', {
partitionKey: { name: 'orderId', type: dynamodb.AttributeType.STRING },
billingMode: dynamodb.BillingMode.PAY_PER_REQUEST,
pointInTimeRecovery: true,
});
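const fn = new lambda.Function(this, 'OrderApi', {
  runtime: lambda.Runtime.NODEJS_20_X,
  handler: 'index.handler',
  code: lambda.Code.fromAsset('lambda'),
});
// Provisioned concurrency applies to a version or alias, not $LATEST,
// so publish an alias and route the API through it
const live = new lambda.Alias(this, 'LiveAlias', {
  aliasName: 'live',
  version: fn.currentVersion,
  provisionedConcurrentExecutions: 10,
});
new apigw.LambdaRestApi(this, 'OrdersApi', { handler: live });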
Anti-pattern: using userId as the sole DynamoDB partition key for a social feed. Celebrity users create hot partitions, and requests for those keys are throttled even while the rest of the table sits idle. Shard writes with userId#${random(0,N)} and scatter-gather the reads, or use a GSI with time-based sort keys for feed queries.
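A minimal sketch of the sharding scheme follows. The shard count of 8 is an arbitrary illustration, and the random source is a parameter only so the behavior can be checked deterministically; neither comes from the text.

```typescript
const SHARD_COUNT = 8; // N in userId#${random(0, N)}; value chosen for illustration

// Write path: spread one hot user's items across SHARD_COUNT partition keys.
function shardedKey(userId: string, random: () => number = Math.random): string {
  const shard = Math.floor(random() * SHARD_COUNT);
  return `${userId}#${shard}`;
}

// Read path (scatter-gather): query every shard key, then merge the results.
function allShardKeys(userId: string): string[] {
  const keys: string[] = [];
  for (let shard = 0; shard < SHARD_COUNT; shard++) {
    keys.push(`${userId}#${shard}`);
  }
  return keys;
}

console.log(shardedKey("celebrity-42"));
console.log(allShardKeys("celebrity-42").length); // 8
```

The cost of the pattern sits on the read side: every feed read becomes SHARD_COUNT queries plus a merge, so the shard count should be no larger than the write rate actually requires.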
API Gateway resource policies and Lambda authorizers are not interchangeable. Use Cognito User Pools or IAM SigV4 for machine clients; JWT authorizers for SPA/mobile. Never expose DynamoDB directly to the internet — always go through Lambda or AppSync with fine-grained resolver auth.
CI/CD on AWS
Production CI/CD pushes artifacts to ECR, deploys to ECS/EKS with zero-downtime strategies, and gates releases with automated tests and manual approvals. The deploy mechanism matters as much as the pipeline — a green build that takes down production is worse than no pipeline at all.
GitHub Actions → ECR → ECS/EKS
The standard flow for containerized services:
- Developer merges PR to main
- GitHub Actions assumes AWS role via OIDC (no long-lived keys)
- Build Docker image, run tests, scan with Trivy or ECR image scanning
- Push to ECR with immutable tag (sha-abc1234)
- Update ECS task definition or EKS deployment manifest with new image digest
- Deploy via aws ecs update-service, CodeDeploy, or ArgoCD sync
# .github/workflows/deploy.yml (excerpt)
jobs:
  deploy:
    runs-on: ubuntu-latest
    permissions:
      id-token: write # required for OIDC role assumption
      contents: read
    steps:
      - uses: actions/checkout@v4
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789012:role/github-actions-deploy
          aws-region: eu-west-1
      - id: login-ecr
        uses: aws-actions/amazon-ecr-login@v2
      - name: Build and push image
        env:
          # Registry URI comes from the login step's output
          ECR_REGISTRY: ${{ steps.login-ecr.outputs.registry }}
        run: |
          docker build -t $ECR_REGISTRY/order-service:${{ github.sha }} .
          docker push $ECR_REGISTRY/order-service:${{ github.sha }}
      - name: Deploy to ECS
        run: |
          aws ecs update-service --cluster prod --service order-service \
            --force-new-deployment \
            --task-definition $(aws ecs register-task-definition ... \
              --query 'taskDefinition.taskDefinitionArn' --output text)
Blue/green on ECS
CodeDeploy for ECS shifts ALB listener traffic from "blue" (current) to "green" (new) task set. Deployment configuration controls traffic shift speed and bake time. Automatic rollback on CloudWatch alarm breach. Requires two target groups and a CodeDeploy application — more setup than rolling deploy, but true zero-downtime with instant rollback.
- Canary: shift a first slice (for example 10%), bake, then move the remainder; linear configurations shift equal increments at fixed intervals
- All-at-once: instant cutover of all traffic once the green task set passes health checks
- Rollback: revert listener rules to the blue target group in seconds
Canary Lambda aliases
Lambda aliases (live, beta) point to function versions. Weighted alias routing sends 5% of traffic to the new version while 95% stays on the current. CloudWatch monitors error rate; CodeDeploy automates weight shifts. Ideal for serverless APIs where ECS-style blue/green is overkill.
# Publish new version
aws lambda publish-version --function-name order-api
# → Version: 47
# Route 10% to v47, 90% stays on v46 via the alias
aws lambda update-alias --function-name order-api --name live \
  --routing-config '{"AdditionalVersionWeights": {"47": 0.1}}'
# Promote to 100% after bake time and clear the weighted routing
aws lambda update-alias --function-name order-api --name live \
  --function-version 47 \
  --routing-config '{"AdditionalVersionWeights": {}}'
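The decision that sits between those commands (keep shifting, promote, or roll back) can be sketched as a pure function. The step list and the 1% error threshold are assumptions for illustration, not CodeDeploy defaults; the returned weight corresponds to the value you would place in AdditionalVersionWeights.

```typescript
interface CanaryStep {
  weight: number; // share of traffic for the new version, 0 to 1
  action: "shift" | "rollback" | "promote";
}

// Decide the next alias weight for the new version from the observed error rate.
function nextCanaryStep(
  currentWeight: number,
  errorRate: number,
  steps: number[] = [0.1, 0.5, 1.0],
  maxErrorRate: number = 0.01,
): CanaryStep {
  if (errorRate > maxErrorRate) {
    // Send all traffic back to the previous version.
    return { weight: 0, action: "rollback" };
  }
  const next = steps.find((w) => w > currentWeight);
  if (next === undefined || next >= 1.0) {
    return { weight: 1.0, action: "promote" };
  }
  return { weight: next, action: "shift" };
}

console.log(nextCanaryStep(0.1, 0.002)); // { weight: 0.5, action: 'shift' }
```

In practice the error rate would come from a CloudWatch alarm or metric query over the bake window, and CodeDeploy can run this loop for you; the sketch only makes the control flow explicit.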
GitOps on EKS
GitOps (Flux or ArgoCD) watches a manifest repo and reconciles cluster state — drift detection reverts manual kubectl edit changes. Pair with Sealed Secrets or External Secrets; never commit plaintext credentials. Image updater bots open PRs when ECR receives new digests.
CodeDeploy vs CodePipeline: CodePipeline orchestrates stages (build, test, deploy); CodeDeploy performs the actual deployment strategy (in-place, blue/green, canary). Lambda aliases with weighted routing are the serverless equivalent of ECS blue/green — know which service handles which concern.
FT.com and many AWS-native teams use GitHub Actions with OIDC to deploy to EKS via ArgoCD — the pipeline builds and pushes; GitOps reconciles. Separating "build credentials" from "deploy credentials" limits blast radius if a workflow is compromised.
Disaster recovery
Disaster recovery is a business decision expressed in two numbers: RTO (how fast you recover) and RPO (how much data you can lose). AWS offers four classic patterns plus AWS Elastic Disaster Recovery (DRS) for block-level replication of entire servers.
RTO and RPO defined
- RTO (Recovery Time Objective) — maximum acceptable downtime after a disaster. RTO = 4 hours means the business tolerates four hours offline.
- RPO (Recovery Point Objective) — maximum acceptable data loss measured in time. RPO = 15 minutes means you may lose up to 15 minutes of writes.
Lower RTO/RPO costs more. A backup-restore strategy with RPO of 24 hours is cheap; active-active with RPO near zero requires continuous replication and conflict resolution infrastructure.
DR strategy comparison
| Strategy | RTO | RPO | Cost | What's running in DR site | AWS services |
|---|---|---|---|---|---|
| Backup & restore | Hours – days | Hours (last backup) | $ | Nothing — restore from backups when needed | AWS Backup, RDS snapshots, S3 versioning, AMIs |
| Pilot light | 1 – 4 hours | Minutes – hours | $$ | Core data layer live; compute scaled to zero or minimal | RDS cross-region read replica, S3 CRR, AMIs in DR region |
| Warm standby | Minutes – 1 hour | Seconds – minutes | $$$ | Scaled-down but functional copy of production | ASG min=1 in DR, Aurora Global, Route53 failover |
| Active-active | Near zero | Near zero – seconds | $$$$ | Full production capacity in multiple regions | Global Tables, Aurora Global, Route53 latency routing, multi-region ALB |
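One way to use the table above in a design review is to pick the cheapest strategy whose typical recovery figures still meet the business targets. The sketch below does that; the numeric thresholds are my rounding of the table's ranges into single cut-offs (for example, treating pilot light as "RTO up to 4 hours, RPO up to 1 hour"), so treat them as assumptions to tune rather than AWS guidance.

```typescript
type DrStrategy = "backup-restore" | "pilot-light" | "warm-standby" | "active-active";

// Returns the cheapest strategy that can plausibly meet the given targets.
// Thresholds are illustrative roundings of the comparison table's ranges.
function pickDrStrategy(rtoMinutes: number, rpoMinutes: number): DrStrategy {
  const HOUR = 60;
  if (rtoMinutes >= 24 * HOUR && rpoMinutes >= 24 * HOUR) return "backup-restore";
  if (rtoMinutes >= 4 * HOUR && rpoMinutes >= 1 * HOUR) return "pilot-light";
  if (rtoMinutes >= 1 * HOUR && rpoMinutes >= 5) return "warm-standby";
  return "active-active";
}

console.log(pickDrStrategy(4 * 60, 60)); // pilot-light
console.log(pickDrStrategy(5, 0));       // active-active
```

Both targets have to be met at once: a system with a relaxed RTO but a near-zero RPO still lands on the expensive end, because continuous replication is what drives the cost.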
Strategy details
- Backup & restore — AWS Backup policies for RDS, EBS, DynamoDB, EFS, S3; cross-account vault copy; quarterly restore drills. RTO dominated by IaC provisioning, not the restore API.
- Pilot light — data plane warm (RDS cross-region replica, S3 CRR, DynamoDB exports); compute dormant until failover; Route53 health-check failover.
- Warm standby — scaled-down but running DR copy (~10% capacity); continuous replication; scale up on failover. Best enterprise balance.
- Active-active — see multi-region; idempotent APIs and conflict-aware stores required.
- AWS DRS — block-level continuous replication (formerly CloudEndure) for EC2/on-prem; failover in minutes without refactoring to managed services. Pay for staging storage + drill/failover instance hours.
# AWS Backup — daily RDS with cross-account copy
aws backup create-backup-plan --backup-plan '{
"BackupPlanName": "prod-daily",
"Rules": [{
"RuleName": "rds-daily-35d",
"TargetBackupVaultName": "prod-vault",
"ScheduleExpression": "cron(0 5 ? * * *)",
"StartWindowMinutes": 60,
"CompletionWindowMinutes": 180,
"Lifecycle": { "DeleteAfterDays": 35 },
"CopyActions": [{
"DestinationBackupVaultArn": "arn:aws:backup:us-east-1:999999999999:backup-vault:dr-vault",
"Lifecycle": { "DeleteAfterDays": 35 }
}]
}]
}'
# RDS cross-region read replica (pilot light data layer)
aws rds create-db-instance-read-replica \
--db-instance-identifier orders-dr-replica \
--source-db-instance-identifier arn:aws:rds:eu-west-1:123456789012:db:orders-prod \
--region us-east-1
resource "aws_backup_plan" "prod" {
name = "prod-daily"
rule {
rule_name = "rds-daily"
target_vault_name = aws_backup_vault.prod.name
schedule = "cron(0 5 ? * * *)"
lifecycle { delete_after = 35 }
copy_action {
destination_vault_arn = aws_backup_vault.dr.arn
lifecycle { delete_after = 35 }
}
}
}
resource "aws_db_instance" "dr_replica" {
provider = aws.dr
identifier = "orders-dr-replica"
replicate_source_db = aws_db_instance.orders.arn
instance_class = "db.r6g.large"
publicly_accessible = false
skip_final_snapshot = true
}
import * as backup from 'aws-cdk-lib/aws-backup';
import * as ec2 from 'aws-cdk-lib/aws-ec2';
import * as rds from 'aws-cdk-lib/aws-rds';
const plan = backup.BackupPlan.dailyMonthly1YearRetention(this, 'ProdBackup');
plan.addSelection('RdsSelection', {
  resources: [backup.BackupResource.fromRdsDatabaseInstance(db)],
});
// Cross-region replica: deploy db stack in DR region with replicateSourceDb
new rds.DatabaseInstanceReadReplica(this, 'DrReplica', {
  sourceDatabaseInstance: db,
  instanceType: ec2.InstanceType.of(ec2.InstanceClass.R6G, ec2.InstanceSize.LARGE),
  vpc, // required: the VPC of the stack that hosts the replica
});
DRS vs refactor: DRS replicates what you have — including technical debt and single-AZ mistakes. Long-term, migrating stateful tiers to Aurora Global, DynamoDB Global Tables, and S3 CRR is cheaper to operate than perpetual block-level replication. Use DRS as a bridge, not a destination.
Tag every resource with dr-tier=1|2|3 and rto-hours in your IaC modules. During architecture reviews, filter by tier to ensure DR investment matches business criticality — not every Lambda needs cross-region failover.
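A review-time check for those tags could be as small as the sketch below. The tag names follow the text (dr-tier, rto-hours); the resource shape and the message wording are assumptions, and in practice the input would come from a resource inventory or a Terraform plan export.

```typescript
interface TaggedResource {
  id: string;
  tags: Record<string, string>;
}

// Flags resources whose DR tags are missing or malformed.
function findDrTagViolations(resources: TaggedResource[]): string[] {
  const violations: string[] = [];
  for (const r of resources) {
    const tier = r.tags["dr-tier"];
    if (tier !== "1" && tier !== "2" && tier !== "3") {
      violations.push(`${r.id}: dr-tier must be 1, 2, or 3`);
    }
    const rtoRaw = r.tags["rto-hours"];
    const rto = rtoRaw === undefined || rtoRaw.trim() === "" ? NaN : Number(rtoRaw);
    if (isNaN(rto) || rto < 0) {
      violations.push(`${r.id}: rto-hours must be a non-negative number`);
    }
  }
  return violations;
}

// Filter by tier during an architecture review.
function resourcesInTier(resources: TaggedResource[], tier: string): TaggedResource[] {
  return resources.filter((r) => r.tags["dr-tier"] === tier);
}
```

Running the check in CI against every IaC change keeps the tags from decaying, which is what makes the per-tier filter trustworthy when the review actually happens.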