Production Architecture Patterns

Moving from "it works in dev" to "it survives production" means making explicit choices about availability zones, regions, deployment models, and recovery time. This chapter covers the patterns AWS teams actually run — three-tier Multi-AZ stacks, active-passive and active-active multi-region designs, ECS vs EKS microservices, serverless pitfalls that only appear at scale, CI/CD with safe rollouts, and disaster recovery strategies mapped to RTO and RPO. These are the decisions that separate a demo from a system that earns customer trust.


Multi-AZ architecture

An Availability Zone (AZ) is one or more isolated data centers within a region, with independent power, cooling, and networking. A single AZ outage is not rare enough to ignore — power, networking, and hardware failures happen. Production workloads should spread across at least two AZs, and stateful services should use AWS-managed Multi-AZ failover instead of DIY replication.

Three-tier across AZs

The classic production pattern: web tier (stateless), application tier (stateless), data tier (stateful). Web and app tiers run in every AZ; the load balancer distributes traffic; the database runs Multi-AZ with synchronous standby. Each tier lives in private subnets per AZ, with only the load balancer in public subnets.

flowchart TB
  subgraph Internet
    USERS["Users / API clients"]
  end

  subgraph Region["AWS Region (e.g. eu-west-1)"]
    ALB["Application Load Balancer\n(public subnets, spans AZs)"]

    subgraph AZ1["Availability Zone A"]
      WEB1["Web tier\nECS tasks / EC2"]
      APP1["App tier\nSpring Boot / Node"]
      RDS_PRI["RDS Primary\n(Multi-AZ)"]
    end

    subgraph AZ2["Availability Zone B"]
      WEB2["Web tier\nECS tasks / EC2"]
      APP2["App tier\nSpring Boot / Node"]
      RDS_STBY["RDS Standby\n(synchronous replica)"]
    end

    subgraph AZ3["Availability Zone C"]
      WEB3["Web tier\nECS tasks / EC2"]
      APP3["App tier\nSpring Boot / Node"]
    end

    CACHE["ElastiCache\n(cluster mode, multi-AZ)"]
    S3["S3\n(regional, 11 nines)"]
  end

  USERS --> ALB
  ALB --> WEB1 & WEB2 & WEB3
  WEB1 --> APP1
  WEB2 --> APP2
  WEB3 --> APP3
  APP1 & APP2 & APP3 --> RDS_PRI
  RDS_PRI -.->|sync replication| RDS_STBY
  APP1 & APP2 & APP3 --> CACHE
  APP1 & APP2 & APP3 --> S3

ALB + ASG + RDS Multi-AZ — the production baseline

Component | Multi-AZ behavior | Production rule
ALB | Automatically spans AZs; health checks route around unhealthy targets | Register targets in ≥2 AZs; set healthy_threshold and deregistration delay
ASG | Balances instances across the subnets (one per AZ) listed in the group | Set a minimum healthy percentage for deploys (instance refresh); use capacity rebalancing for Spot
RDS Multi-AZ | Synchronous standby in another AZ; automatic failover (~60–120 s) | Enable for all production databases; test failover quarterly
ElastiCache | Replication group with automatic failover across AZs | Multi-AZ with at least one replica per shard; not a source of truth
EFS | Storage replicated across all AZs in the region automatically | Mount targets in every AZ where compute runs

Never single-AZ for stateful

Stateless compute can tolerate AZ loss if enough capacity exists elsewhere — ASG replaces instances, ALB drains unhealthy targets. Stateful services cannot be rebuilt from another AZ without data. Running RDS, ElastiCache primary, or self-managed databases in a single AZ means a zone outage equals data unavailability until manual recovery — often hours. AWS-managed Multi-AZ handles failover, DNS updates, and storage replication for you.

  • RDS / Aurora — Multi-AZ or Aurora with replicas in separate AZs
  • DynamoDB — Multi-AZ by default; no single-AZ option to misconfigure
  • OpenSearch — dedicated master nodes + data nodes across ≥3 AZs for production
  • MSK (Kafka) — broker placement across AZs; min.insync.replicas ≥ 2
bash
# RDS Multi-AZ: synchronous standby in a second AZ with automatic failover
# (orders_admin is a placeholder; the password is generated into Secrets Manager)
aws rds create-db-instance \
  --db-instance-identifier orders-prod \
  --engine postgres \
  --db-instance-class db.r6g.large \
  --allocated-storage 100 \
  --master-username orders_admin \
  --manage-master-user-password \
  --multi-az \
  --vpc-security-group-ids sg-abc123 \
  --db-subnet-group-name prod-db-subnets

# ASG across three AZs (the launch template name is a placeholder)
aws autoscaling create-auto-scaling-group \
  --auto-scaling-group-name order-service-asg \
  --launch-template 'LaunchTemplateName=order-service-lt,Version=$Latest' \
  --min-size 3 --max-size 12 --desired-capacity 6 \
  --vpc-zone-identifier "subnet-az1,subnet-az2,subnet-az3" \
  --health-check-type ELB --health-check-grace-period 120
hcl
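resource "aws_db_instance" "orders" {
  identifier        = "orders-prod"
  engine            = "postgres"
  instance_class    = "db.r6g.large"
  allocated_storage = 100
  multi_az          = true

  # Placeholder admin user; the password is generated and kept in Secrets Manager
  username                    = "orders_admin"
  manage_master_user_password = true

  db_subnet_group_name   = aws_db_subnet_group.prod.name
  vpc_security_group_ids = [aws_security_group.db.id]
}

resource "aws_autoscaling_group" "app" {
  name                      = "order-service-asg"
  vpc_zone_identifier       = aws_subnet.private[*].id
  min_size                  = 3
  max_size                  = 12
  desired_capacity          = 6
  health_check_type         = "ELB"
  health_check_grace_period = 120

  # Assumes a launch template resource named "app" is defined elsewhere
  launch_template {
    id      = aws_launch_template.app.id
    version = "$Latest"
  }

  tag {
    key                 = "Name"
    value               = "order-service"
    propagate_at_launch = true
  }
}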
typescript
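import * as ec2 from 'aws-cdk-lib/aws-ec2';
import * as rds from 'aws-cdk-lib/aws-rds';

const vpc = new ec2.Vpc(this, 'ProdVpc', {
  maxAzs: 3,
  natGateways: 3, // one per AZ for resilience
  // Isolated subnets must be declared explicitly; the default layout has none
  subnetConfiguration: [
    { name: 'public', subnetType: ec2.SubnetType.PUBLIC },
    { name: 'app', subnetType: ec2.SubnetType.PRIVATE_WITH_EGRESS },
    { name: 'data', subnetType: ec2.SubnetType.PRIVATE_ISOLATED },
  ],
});

const db = new rds.DatabaseInstance(this, 'OrdersDb', {
  engine: rds.DatabaseInstanceEngine.postgres({ version: rds.PostgresEngineVersion.VER_16 }),
  vpc,
  multiAz: true, // synchronous standby in a second AZ
  vpcSubnets: { subnetType: ec2.SubnetType.PRIVATE_ISOLATED },
});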
⚠️ Pitfall

Running Multi-AZ RDS but placing all application instances in one AZ. When that AZ fails, ALB has no healthy targets even though the database failover succeeded. ASG vpc-zone-identifier must list subnets in every AZ, and minimum capacity should be ≥ number of AZs.
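
The capacity side of this pitfall is simple arithmetic, sketched below under the assumption that load is spread evenly across zones. The function names and the peak figure are illustrative, not part of any AWS API.

```typescript
// Sketch: instances each AZ must run so the group still meets peak demand
// after losing `azFailuresTolerated` zones. Assumes evenly balanced load.
function perAzCapacity(
  peakInstances: number,
  azCount: number,
  azFailuresTolerated = 1,
): number {
  const survivingAzs = azCount - azFailuresTolerated;
  if (survivingAzs < 1) {
    throw new Error("at least one AZ must survive");
  }
  // Surviving zones have to absorb the full peak between them.
  return Math.ceil(peakInstances / survivingAzs);
}

// Total fleet size implied by the per-AZ figure.
function totalCapacity(
  peakInstances: number,
  azCount: number,
  azFailuresTolerated = 1,
): number {
  return perAzCapacity(peakInstances, azCount, azFailuresTolerated) * azCount;
}

console.log(perAzCapacity(6, 3), totalCapacity(6, 3));
console.log(perAzCapacity(6, 2), totalCapacity(6, 2));
```

With a peak of 6 instances, three AZs need 3 per zone (9 total), while two AZs need 6 per zone (12 total). This is one reason a third AZ often lowers the cost of surviving a zone outage.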

🎯 Exam Tip

Multi-AZ ≠ read scaling. In a Multi-AZ DB instance deployment the standby is not readable. Aurora replicas and Multi-AZ DB cluster deployments are the exceptions, since both expose reader endpoints. For read scaling use read replicas; for HA use Multi-AZ. Exam questions often conflate the two.

🔬 Under the Hood

RDS Multi-AZ failover updates the CNAME of the primary endpoint to point at the former standby. DNS TTL is low, but clients with aggressive connection pooling may hold stale connections for minutes. Use connection pools with validation queries (SELECT 1) and retry logic on connection refused.
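
A minimal sketch of that advice follows: validate each connection with SELECT 1 and retry with capped exponential backoff. The Connection interface, the function names, and the default timings are assumptions for illustration; a real driver or pool has its own API, and the wrapped work must be safe to repeat.

```typescript
// Hypothetical minimal connection shape; real database drivers differ.
interface Connection {
  query(sql: string): Promise<unknown>;
}

// Capped exponential backoff schedule in milliseconds.
// Jitter is omitted so the schedule stays deterministic in this sketch.
function backoffDelays(maxAttempts: number, baseMs: number, capMs: number): number[] {
  const delays: number[] = [];
  for (let attempt = 0; attempt < maxAttempts - 1; attempt++) {
    delays.push(Math.min(capMs, baseMs * 2 ** attempt));
  }
  return delays;
}

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Opens a connection, validates it, then runs `work`. Any failure triggers a
// fresh connect after a delay. Assumes `work` is idempotent.
async function withValidatedConnection<T>(
  connect: () => Promise<Connection>,
  work: (conn: Connection) => Promise<T>,
  maxAttempts = 5,
  baseMs = 200,
  capMs = 5000,
): Promise<T> {
  const delays = backoffDelays(maxAttempts, baseMs, capMs);
  let lastError: unknown = new Error("no attempts were made");
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      const conn = await connect();
      await conn.query("SELECT 1"); // validation query catches stale endpoints
      return await work(conn);
    } catch (err) {
      lastError = err;
      if (attempt < delays.length) {
        await sleep(delays[attempt]);
      }
    }
  }
  throw lastError;
}
```

With the defaults, the waits between attempts are 200, 400, 800, and 1600 ms, which is 3 seconds of backoff across five attempts. A failover in the 60–120 s range quoted earlier needs a longer schedule or a higher attempt count.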

Multi-region design

Multi-AZ protects against data center failure within a region. Multi-region protects against regional outages, satisfies data residency requirements, and reduces latency for global users. The hard part is data — not routing.

Active-passive vs active-active

Pattern | Traffic flow | Data model | When to use
Active-passive | Primary region serves 100%; secondary on standby | Async replication; manual or automated failover | DR site, regulatory standby, cost-sensitive production
Active-active | Both regions serve traffic simultaneously | Multi-master or conflict-aware replication required | Global low-latency, zero-downtime failover, strict (near-zero) RTO/RPO targets

Active-passive is simpler: Route53 failover routing sends traffic to the secondary only when health checks fail on the primary. Active-active requires solving write conflicts — DynamoDB Global Tables (last-writer-wins), Aurora Global Database (one primary writer per cluster, promoted on failover), or application-level idempotency keys.

Route53 failover and latency routing

  • Failover routing — primary/secondary records with health checks (HTTP, HTTPS, or CloudWatch alarm). Secondary is SECONDARY type; only used when primary fails.
  • Latency-based routing — Route53 returns the record with lowest latency to the user's resolver. Good for active-active read paths.
  • Geolocation routing — route EU users to eu-west-1, US users to us-east-1. Combine with data residency policies.
  • Health checks — must test end-to-end (ALB path), not just EC2 ping. Use calculated health checks aggregating multiple child checks.

Global Tables and Aurora Global Database

DynamoDB Global Tables:

  • Multi-region, multi-active — writes in any region replicate to all (typically <1 s)
  • Conflict resolution: last writer wins based on timestamp
  • Ideal for session stores, user profiles, idempotent event logs
  • Not ideal when strong ordering or transactional cross-item guarantees are required

Aurora Global Database:

  • One primary cluster accepts writes; up to five secondary regions (<1 s replication lag)
  • Managed failover: promote secondary to primary in <1 minute (RTO)
  • Secondary clusters can serve read-only traffic — good for active-passive reads
  • PostgreSQL and MySQL compatible; use for relational workloads needing global DR
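
To make the last-writer-wins rule for Global Tables concrete, here is a small model of how two concurrent writes to the same item resolve. It is an illustration only: the item shape, the field names, and the region-name tie-break are assumptions, and DynamoDB's internal mechanics are not modeled.

```typescript
// One replicated write to an item, as seen by the conflict resolver.
interface VersionedItem<T> {
  value: T;
  updatedAtMs: number; // write timestamp
  region: string;      // region that accepted the write
}

// Last-writer-wins: the newer timestamp survives. Ties fall back to the
// region name purely so this sketch is deterministic (an assumption).
function resolveLww<T>(a: VersionedItem<T>, b: VersionedItem<T>): VersionedItem<T> {
  if (a.updatedAtMs !== b.updatedAtMs) {
    return a.updatedAtMs > b.updatedAtMs ? a : b;
  }
  return a.region >= b.region ? a : b;
}

// Two regions update the same learner's streak within one millisecond.
const euWrite: VersionedItem<{ streak: number }> = {
  value: { streak: 5 },
  updatedAtMs: 1000,
  region: "eu-west-1",
};
const usWrite: VersionedItem<{ streak: number }> = {
  value: { streak: 3 },
  updatedAtMs: 1001,
  region: "us-east-1",
};

const winner = resolveLww(euWrite, usWrite);
console.log(winner.region, winner.value.streak);
```

The us-east-1 write wins and the eu-west-1 update is silently discarded. That is acceptable for session data but not for counters or balances, which is why the list above steers ordered or transactional workloads elsewhere.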

Data residency

GDPR, HIPAA, and national banking regulations often require data to remain in specific jurisdictions. Architecture implications: deploy a full stack per region (not just replicate), use separate KMS keys per region, restrict cross-region S3 replication with bucket policies, and document data flows for auditors. Route53 geolocation alone does not enforce residency — a determined client can call any regional API endpoint. Enforce at the API Gateway resource policy and IAM condition keys (aws:RequestedRegion).
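
One way to express the aws:RequestedRegion guard is a deny statement built in code, sketched below. The helper name and the region list are illustrative. Note the caveat that a blanket deny like this also blocks global services whose endpoints sit in us-east-1 (IAM and Route53, for example), so real policies usually exempt those with NotAction.

```typescript
// Minimal IAM policy types, just enough for this sketch.
interface PolicyStatement {
  Sid: string;
  Effect: "Allow" | "Deny";
  Action: string | string[];
  Resource: string;
  Condition?: Record<string, Record<string, string[]>>;
}

interface PolicyDocument {
  Version: "2012-10-17";
  Statement: PolicyStatement[];
}

// Builds a policy that denies any request aimed at a region outside the list.
function denyOutsideRegions(allowedRegions: string[]): PolicyDocument {
  if (allowedRegions.length === 0) {
    throw new Error("at least one allowed region is required");
  }
  return {
    Version: "2012-10-17",
    Statement: [
      {
        Sid: "DenyOutsideAllowedRegions",
        Effect: "Deny",
        Action: "*", // simplification: production policies exempt global services
        Resource: "*",
        Condition: {
          StringNotEquals: { "aws:RequestedRegion": allowedRegions },
        },
      },
    ],
  };
}

const euOnly = denyOutsideRegions(["eu-west-1", "eu-central-1"]);
console.log(JSON.stringify(euOnly, null, 2));
```

The printed JSON can be attached as an IAM policy or a service control policy, which moves the residency rule from routing, where a client can bypass it, into authorization.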

bash
# Route53 failover record — primary ALB in eu-west-1
aws route53 change-resource-record-sets --hosted-zone-id Z1234567890 \
  --change-batch '{
    "Changes": [{
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "api.example.com",
        "Type": "A",
        "SetIdentifier": "primary-eu",
        "Failover": "PRIMARY",
        "AliasTarget": {
          "HostedZoneId": "Z32O12X0K2O1VO",
          "DNSName": "dualstack.my-alb-eu-1234567890.eu-west-1.elb.amazonaws.com",
          "EvaluateTargetHealth": true
        },
        "HealthCheckId": "abc-def-ghi"
      }
    }]
  }'

# DynamoDB Global Table — add replica region
aws dynamodb update-table --table-name Sessions \
  --replica-updates '[{"Create": {"RegionName": "us-east-1"}}]'
hcl
resource "aws_route53_record" "api_primary" {
  zone_id        = aws_route53_zone.main.zone_id
  name           = "api.example.com"
  type           = "A"
  set_identifier = "primary-eu"
  failover_routing_policy { type = "PRIMARY" }
  health_check_id = aws_route53_health_check.api.id

  alias {
    name                   = aws_lb.eu.dns_name
    zone_id                = aws_lb.eu.zone_id
    evaluate_target_health = true
  }
}

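resource "aws_dynamodb_table" "sessions" {
  name             = "Sessions"
  billing_mode     = "PAY_PER_REQUEST"
  hash_key         = "sessionId"
  stream_enabled   = true
  stream_view_type = "NEW_AND_OLD_IMAGES"

  # Key attributes must be declared explicitly
  attribute {
    name = "sessionId"
    type = "S"
  }

  # Assumes the provider region is eu-west-1, which already holds the table;
  # list only the additional regions as replicas
  replica { region_name = "us-east-1" }
}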
typescript
import * as dynamodb from 'aws-cdk-lib/aws-dynamodb';
import * as route53 from 'aws-cdk-lib/aws-route53';

const table = new dynamodb.Table(this, 'Sessions', {
  partitionKey: { name: 'sessionId', type: dynamodb.AttributeType.STRING },
  billingMode: dynamodb.BillingMode.PAY_PER_REQUEST,
  // Assumes the stack deploys to eu-west-1; the stack's own region must not be listed
  replicationRegions: ['us-east-1'],
  stream: dynamodb.StreamViewType.NEW_AND_OLD_IMAGES,
});

// Failover routing configured via Route53 CfnRecordSet or separate stack per region
⚖️ Trade-off

Active-active vs cost: running full capacity in two regions doubles compute and data transfer bills. Active-passive with scaled-down standby (pilot light) cuts cost but increases RTO. Match spend to the business SLA — a B2B admin portal and a consumer payments API have different recovery requirements.

📦 Real World

Slack runs regional cells with failover boundaries — a regional outage affects one cell, not the entire product. Duolingo uses DynamoDB Global Tables for user progress so learners keep streaks regardless of which region serves their session. Both invested in data layer design before multi-region routing.

Microservices on AWS

Microservices on AWS means choosing an orchestrator (ECS or EKS), how services find each other, whether you need a service mesh, and how to isolate environments. The orchestrator choice affects operational burden more than application code quality.

ECS vs EKS

Dimension | ECS (Fargate or EC2) | EKS (Kubernetes)
Operational overhead | Lower — AWS-managed control plane, simpler mental model | Higher — control plane + node upgrades, CRDs, cluster add-ons
Portability | AWS-native; limited multi-cloud | Kubernetes manifests portable across clouds and on-prem
Scaling model | Service auto-scaling on CPU/memory/request count | HPA, VPA, Karpenter, Cluster Autoscaler — more knobs
Networking | awsvpc mode — each task gets an ENI (Fargate limits apply) | VPC CNI, service networking, Ingress controllers (ALB, NGINX)
Best fit | Small-to-mid teams, Spring Boot services, fast time-to-production | Platform teams, polyglot microservices, existing K8s investment

Default recommendation for most backend teams: ECS on Fargate until you have a concrete reason for Kubernetes (existing Helm charts, service mesh requirements, multi-cloud mandate). EKS shines when a platform team can amortize cluster operations across dozens of services.

Service discovery

  • ECS Service Connect — built-in service mesh-lite; services call http://orders:8080 without hardcoded IPs
  • AWS Cloud Map — DNS-based or API-based registry; integrates with ECS and EKS
  • ALB path routing — external clients and BFF layer; internal east-west traffic should not hairpin through ALB
  • EKS CoreDNS — standard K8s service discovery via ClusterIP; headless services for StatefulSets

AWS App Mesh

App Mesh is a managed service mesh using Envoy sidecars. It provides mutual TLS, traffic shifting, retries, and observability between services without embedding that logic in application code. Use when you have many inter-service calls and need consistent retry/timeout policies — but accept sidecar overhead (memory + latency). Alternatives: ECS Service Connect for simpler cases; Istio/Linkerd on EKS if you want CNCF ecosystem.

VPC per environment

Production pattern: separate VPC per environment (dev, staging, prod) or per account in AWS Organizations. Never share a VPC between prod and non-prod — a misconfigured security group in dev should not reach prod RDS. Cross-environment access goes through CI/CD roles and explicit peering/transit gateway — not flat networking.

bash
# ECS Service Connect — enable on service
aws ecs create-service \
  --cluster prod-cluster \
  --service-name order-service \
  --task-definition order-service:42 \
  --desired-count 3 \
  --launch-type FARGATE \
  --network-configuration "awsvpcConfiguration={subnets=[subnet-a,subnet-b],securityGroups=[sg-app]}" \
  --service-connect-configuration '{
    "enabled": true,
    "namespace": "prod.local",
    "services": [{
      "portName": "http",
      "discoveryName": "orders",
      "clientAliases": [{ "port": 8080, "dnsName": "orders" }]
    }]
  }'
hcl
resource "aws_service_discovery_private_dns_namespace" "prod" {
  name = "prod.local"
  vpc  = aws_vpc.prod.id
}

resource "aws_ecs_service" "orders" {
  name            = "order-service"
  cluster         = aws_ecs_cluster.prod.id
  task_definition = aws_ecs_task_definition.orders.arn
  desired_count   = 3
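  launch_type     = "FARGATE"

  # Fargate tasks use awsvpc networking, so subnets and security groups are
  # required (the resource names here are placeholders)
  network_configuration {
    subnets         = aws_subnet.private[*].id
    security_groups = [aws_security_group.app.id]
  }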

  service_connect_configuration {
    enabled   = true
    namespace = aws_service_discovery_private_dns_namespace.prod.arn
    service {
      port_name      = "http"
      discovery_name = "orders"
      client_alias {
        port     = 8080
        dns_name = "orders"
      }
    }
  }
}
typescript
import * as ecs from 'aws-cdk-lib/aws-ecs';
import * as ec2 from 'aws-cdk-lib/aws-ec2';

const cluster = new ecs.Cluster(this, 'ProdCluster', { vpc, defaultCloudMapNamespace: { name: 'prod.local' } });

const orderService = new ecs.FargateService(this, 'OrderService', {
  cluster,
  taskDefinition: orderTaskDef,
  serviceConnectConfiguration: {
    services: [{
      portMappingName: 'http',
      dnsName: 'orders',
      port: 8080,
    }],
  },
});
💡 Pro Tip

Start with one ECS cluster per environment, not per service. Namespace isolation via Service Connect or separate Cloud Map namespaces. Split clusters only when blast radius or compliance requires it (e.g. PCI services in a dedicated cluster).

💰 Cost

EKS control plane costs $0.10/hour per cluster (~$73/month) regardless of workload size. Fargate has no cluster fee but higher per-vCPU pricing than EC2-backed ECS. For 5–10 microservices at steady load, EC2-backed ECS with Savings Plans often wins on cost; Fargate wins on ops simplicity.

Serverless architecture

The canonical AWS serverless stack — API Gateway, Lambda, DynamoDB — scales to zero and charges per request. Production serverless fails on cold starts, concurrency limits, and DynamoDB hot partitions long before it fails on business logic.

API Gateway + Lambda + DynamoDB

Request flow: client → API Gateway (auth, throttling, request validation) → Lambda (business logic) → DynamoDB (state). Add SQS between API Gateway and Lambda when you need buffering; add ElastiCache when read patterns are hot and cacheable. Keep Lambda functions small and single-purpose — one function per route group, not one monolith handler parsing every path.

Step Functions for orchestration

When a workflow spans multiple Lambdas, external APIs, or human approval steps, use Step Functions instead of chaining Lambdas via SNS/SQS spaghetti. Standard workflows for long-running processes (up to one year); Express workflows for high-volume short flows (<5 minutes). Built-in retry, catch, and parallel states replace custom orchestration code.

  • Order fulfillment — charge payment → reserve inventory → ship → notify (with compensating transactions on failure)
  • Document processing — upload to S3 → Textract → summarize with Bedrock → store in DynamoDB
  • Human-in-the-loop — waitForTaskToken for manual approval before provisioning
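
The order fulfillment flow above can be sketched as an Amazon States Language definition with a retry on the payment step and catch blocks that route into compensating steps. The Lambda ARNs, state names, and retry settings are placeholders, and danglingTargets is a hypothetical local check, not a Step Functions API.

```typescript
interface CatchRule { ErrorEquals: string[]; Next: string }
interface State {
  Type: string;
  Next?: string;
  End?: boolean;
  Catch?: CatchRule[];
  [key: string]: unknown; // Resource, Retry, Error, Cause, ...
}
interface StateMachine { Comment?: string; StartAt: string; States: Record<string, State> }

// Placeholder ARN prefix; substitute real function ARNs.
const fnArn = (name: string) => `arn:aws:lambda:eu-west-1:123456789012:function:${name}`;

const orderFulfillment: StateMachine = {
  Comment: "Charge, reserve, ship, notify, with compensation on failure",
  StartAt: "ChargePayment",
  States: {
    ChargePayment: {
      Type: "Task",
      Resource: fnArn("charge-payment"),
      Retry: [{ ErrorEquals: ["States.TaskFailed"], IntervalSeconds: 2, MaxAttempts: 3, BackoffRate: 2.0 }],
      Next: "ReserveInventory",
    },
    ReserveInventory: {
      Type: "Task",
      Resource: fnArn("reserve-inventory"),
      Catch: [{ ErrorEquals: ["States.ALL"], Next: "RefundPayment" }],
      Next: "Ship",
    },
    Ship: {
      Type: "Task",
      Resource: fnArn("ship-order"),
      Catch: [{ ErrorEquals: ["States.ALL"], Next: "ReleaseInventory" }],
      Next: "Notify",
    },
    Notify: { Type: "Task", Resource: fnArn("notify-customer"), End: true },
    // Compensating transactions undo earlier steps in reverse order.
    ReleaseInventory: { Type: "Task", Resource: fnArn("release-inventory"), Next: "RefundPayment" },
    RefundPayment: { Type: "Task", Resource: fnArn("refund-payment"), Next: "OrderFailed" },
    OrderFailed: { Type: "Fail", Error: "OrderFailed", Cause: "Fulfillment step failed" },
  },
};

// Returns every StartAt/Next/Catch target that names a state that does not exist.
function danglingTargets(machine: StateMachine): string[] {
  const targets: string[] = [machine.StartAt];
  for (const state of Object.values(machine.States)) {
    if (state.Next) targets.push(state.Next);
    for (const rule of state.Catch ?? []) targets.push(rule.Next);
  }
  return targets.filter((name) => !(name in machine.States));
}

console.log(danglingTargets(orderFulfillment)); // empty when all transitions resolve
console.log(JSON.stringify(orderFulfillment).length > 0);
```

Serializing orderFulfillment with JSON.stringify yields the definition string a state machine expects. Keeping the compensation path in the definition means a failed shipment releases stock and refunds the charge without custom orchestration code.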

Production pitfalls

Pitfall | Symptom | Mitigation
Cold starts | P99 latency spikes on first request after idle; JVM/Spring especially bad | Provisioned concurrency; SnapStart (Java); smaller runtimes (Node, Python); keep-warm (last resort)
Concurrency limits | 429 TooManyRequestsException during traffic spikes | Request account limit increase; reserved concurrency per critical function; SQS buffer for async
Hot partitions | Throttling on one DynamoDB partition despite low overall table usage | Design partition keys for even distribution; use write sharding suffixes; on-demand mode helps burst
Timeout cascades | API Gateway 29 s default integration timeout vs Lambda 15 min; synchronous chains time out | Async processing via SQS + Step Functions; return 202 Accepted with polling endpoint
Connection exhaustion | Each concurrent Lambda execution opens its own RDS connections until the database runs out | RDS Proxy; DynamoDB instead; limit reserved concurrency to match pool size
bash
# Provisioned concurrency — eliminates cold starts for critical path
aws lambda put-provisioned-concurrency-config \
  --function-name order-api \
  --qualifier live \
  --provisioned-concurrent-executions 10

# DynamoDB on-demand with PITR for production
aws dynamodb create-table \
  --table-name Orders \
  --attribute-definitions AttributeName=orderId,AttributeType=S \
  --key-schema AttributeName=orderId,KeyType=HASH \
  --billing-mode PAY_PER_REQUEST

aws dynamodb update-continuous-backups \
  --table-name Orders \
  --point-in-time-recovery-specification PointInTimeRecoveryEnabled=true
hcl
resource "aws_lambda_provisioned_concurrency_config" "order_api" {
  function_name                     = aws_lambda_function.order_api.function_name
  qualifier                         = aws_lambda_alias.live.name
  provisioned_concurrent_executions = 10
}

resource "aws_dynamodb_table" "orders" {
  name         = "Orders"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "orderId"

  # Single-line HCL blocks may hold only one argument, so this must span lines
  attribute {
    name = "orderId"
    type = "S"
  }

  point_in_time_recovery { enabled = true }
}
typescript
import * as lambda from 'aws-cdk-lib/aws-lambda';
import * as dynamodb from 'aws-cdk-lib/aws-dynamodb';
import * as apigw from 'aws-cdk-lib/aws-apigateway';

const table = new dynamodb.Table(this, 'Orders', {
  partitionKey: { name: 'orderId', type: dynamodb.AttributeType.STRING },
  billingMode: dynamodb.BillingMode.PAY_PER_REQUEST,
  pointInTimeRecovery: true,
});

const fn = new lambda.Function(this, 'OrderApi', {
  runtime: lambda.Runtime.NODEJS_20_X,
  handler: 'index.handler',
  code: lambda.Code.fromAsset('lambda'),
});

// Provisioned concurrency applies to a version or alias, never $LATEST,
// so publish an alias and point API Gateway at it
const live = fn.addAlias('live', { provisionedConcurrentExecutions: 10 });

new apigw.LambdaRestApi(this, 'OrdersApi', { handler: live });
⚠️ Pitfall

Using userId as the sole DynamoDB partition key for a social feed. Celebrity users create hot partitions that throttle the entire table. Shard writes with userId#${random(0,N)} and scatter-gather reads, or use GSI with time-based sort keys for feed queries.
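
A minimal sketch of the sharding idea, assuming a fixed shard count N that both writers and readers know. The function names and the value of N are illustrative.

```typescript
const SHARD_COUNT = 10; // N: an assumption; size it to the hottest key's write rate

// Writers append a random shard suffix so one user's writes spread over N partitions.
// `rand` is injectable so the behavior can be tested deterministically.
function shardedWriteKey(
  userId: string,
  shardCount: number = SHARD_COUNT,
  rand: () => number = Math.random,
): string {
  const shard = Math.floor(rand() * shardCount);
  return `${userId}#${shard}`;
}

// Readers scatter-gather: query every shard key and merge the results.
function allShardKeys(userId: string, shardCount: number = SHARD_COUNT): string[] {
  return Array.from({ length: shardCount }, (_, shard) => `${userId}#${shard}`);
}

console.log(shardedWriteKey("celebrity-42"));
console.log(allShardKeys("celebrity-42", 3));
```

The cost moves to the read path: every feed read becomes N queries plus a merge, so N should stay as small as the write rate allows.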

🔒 Security

API Gateway resource policies and Lambda authorizers are not interchangeable. Use Cognito User Pools or IAM SigV4 for machine clients; JWT authorizers for SPA/mobile. Never expose DynamoDB directly to the internet — always go through Lambda or AppSync with fine-grained resolver auth.

CI/CD on AWS

Production CI/CD pushes artifacts to ECR, deploys to ECS/EKS with zero-downtime strategies, and gates releases with automated tests and manual approvals. The deploy mechanism matters as much as the pipeline — a green build that takes down production is worse than no pipeline at all.

GitHub Actions → ECR → ECS/EKS

The standard flow for containerized services:

  1. Developer merges PR to main
  2. GitHub Actions assumes AWS role via OIDC (no long-lived keys)
  3. Build Docker image, run tests, scan with Trivy or ECR image scanning
  4. Push to ECR with immutable tag (sha-abc1234)
  5. Update ECS task definition or EKS deployment manifest with new image digest
  6. Deploy via aws ecs update-service, CodeDeploy, or ArgoCD sync
yaml
# .github/workflows/deploy.yml (excerpt)
jobs:
  deploy:
    runs-on: ubuntu-latest
    permissions:
      id-token: write
      contents: read
    steps:
      - uses: actions/checkout@v4
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789012:role/github-actions-deploy
          aws-region: eu-west-1
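      - uses: aws-actions/amazon-ecr-login@v2
        id: login-ecr
      - env:
          # Registry URL comes from the login step's output
          ECR_REGISTRY: ${{ steps.login-ecr.outputs.registry }}
        run: |
          docker build -t $ECR_REGISTRY/order-service:${{ github.sha }} .
          docker push $ECR_REGISTRY/order-service:${{ github.sha }}
      - run: |
          # register-task-definition arguments are elided; --query extracts the new ARN
          aws ecs update-service --cluster prod --service order-service \
            --force-new-deployment \
            --task-definition $(aws ecs register-task-definition ... \
              --query 'taskDefinition.taskDefinitionArn' --output text)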

Blue/green on ECS

CodeDeploy for ECS shifts ALB listener traffic from "blue" (current) to "green" (new) task set. Deployment configuration controls traffic shift speed and bake time. Automatic rollback on CloudWatch alarm breach. Requires two target groups and a CodeDeploy application — more setup than rolling deploy, but true zero-downtime with instant rollback.

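  • Canary or linear — shift a small slice first (the predefined CodeDeployDefault.ECSCanary10Percent5Minutes sends 10%, waits five minutes, then shifts the rest), or move traffic in equal steps at fixed intervals
  • All-at-once — instant cutover to green once health checks pass; fastest, but with no partial exposure to catch regressions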
  • Rollback — revert listener rules to blue target group in seconds

Canary Lambda aliases

Lambda aliases (live, beta) point to function versions. Weighted alias routing sends 5% of traffic to the new version while 95% stays on the current. CloudWatch monitors error rate; CodeDeploy automates weight shifts. Ideal for serverless APIs where ECS-style blue/green is overkill.

terminal — Lambda weighted alias canary
$ # Publish new version
$ aws lambda publish-version --function-name order-api
→ Version: 47
$ # Route 10% to v47, 90% to v46 via alias
$ aws lambda update-alias --function-name order-api --name live \
  --routing-config '{"AdditionalVersionWeights": {"47": 0.1}}'
$ # Promote to 100% after bake time
$ aws lambda update-alias --function-name order-api --name live --function-version 47
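
To show what the routing config in that session means, the following sketch models how an alias with AdditionalVersionWeights splits invocations. It is a local model with assumed type and function names, not Lambda's implementation.

```typescript
// Local model of an alias: a primary version plus weighted additional versions.
// A Lambda alias can route to at most two versions, so real configs have one entry here.
interface AliasRouting {
  primaryVersion: string;
  additionalVersionWeights: Record<string, number>;
}

// `roll` is a uniform random number in [0, 1); it is passed in so the
// function stays deterministic and testable.
function pickVersion(alias: AliasRouting, roll: number): string {
  let cumulative = 0;
  for (const [version, weight] of Object.entries(alias.additionalVersionWeights)) {
    cumulative += weight;
    if (roll < cumulative) {
      return version;
    }
  }
  return alias.primaryVersion; // remaining weight goes to the primary
}

// State after the first update-alias call: 10% to v47, 90% to v46.
const canary: AliasRouting = { primaryVersion: "46", additionalVersionWeights: { "47": 0.1 } };
// State after promotion: the alias points at v47 with no extra weights.
const promoted: AliasRouting = { primaryVersion: "47", additionalVersionWeights: {} };

console.log(pickVersion(canary, 0.05), pickVersion(canary, 0.5), pickVersion(promoted, 0.05));
```

Because the split is per invocation, a single client can hit both versions during the canary window, so both versions need to be compatible with the same data and request formats.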

GitOps on EKS

GitOps (Flux or ArgoCD) watches a manifest repo and reconciles cluster state — drift detection reverts manual kubectl edit changes. Pair with Sealed Secrets or External Secrets; never commit plaintext credentials. Image updater bots open PRs when ECR receives new digests.

🎯 Exam Tip

CodeDeploy vs CodePipeline: CodePipeline orchestrates stages (build, test, deploy); CodeDeploy performs the actual deployment strategy (in-place, blue/green, canary). Lambda aliases with weighted routing are the serverless equivalent of ECS blue/green — know which service handles which concern.

📦 Real World

FT.com and many AWS-native teams use GitHub Actions with OIDC to deploy to EKS via ArgoCD — the pipeline builds and pushes; GitOps reconciles. Separating "build credentials" from "deploy credentials" limits blast radius if a workflow is compromised.

Disaster recovery

Disaster recovery is a business decision expressed in two numbers: RTO (how fast you recover) and RPO (how much data you can lose). AWS offers four classic patterns plus AWS Elastic Disaster Recovery (DRS) for block-level replication of entire servers.

RTO and RPO defined

  • RTO (Recovery Time Objective) — maximum acceptable downtime after a disaster. RTO = 4 hours means the business tolerates four hours offline.
  • RPO (Recovery Point Objective) — maximum acceptable data loss measured in time. RPO = 15 minutes means you may lose up to 15 minutes of writes.

Lower RTO/RPO costs more. A backup-restore strategy with RPO of 24 hours is cheap; active-active with RPO near zero requires continuous replication and conflict resolution infrastructure.

DR strategy comparison

Strategy | RTO | RPO | Cost | What's running in DR site | AWS services
Backup & restore | Hours – days | Hours (last backup) | $ | Nothing — restore from backups when needed | AWS Backup, RDS snapshots, S3 versioning, AMIs
Pilot light | 1 – 4 hours | Minutes – hours | $$ | Core data layer live; compute scaled to zero or minimal | RDS cross-region read replica, S3 CRR, AMIs in DR region
Warm standby | Minutes – 1 hour | Seconds – minutes | $$$ | Scaled-down but functional copy of production | ASG min=1 in DR, Aurora Global, Route53 failover
Active-active | Near zero | Near zero – seconds | $$$$ | Full production capacity in multiple regions | Global Tables, Aurora Global, Route53 latency routing, an ALB in each region
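
As a rough illustration of reading this comparison, the sketch below picks the cheapest strategy whose worst-case RTO and RPO still meet a target. The numeric bounds are assumptions chosen to approximate the ranges above (for example, 48 hours for "days"), not AWS guarantees.

```typescript
type Strategy = "backup-restore" | "pilot-light" | "warm-standby" | "active-active";

interface StrategyProfile {
  name: Strategy;
  worstRtoMinutes: number; // assumed upper bound of the RTO range
  worstRpoMinutes: number; // assumed upper bound of the RPO range
  costRank: number;        // 1 = cheapest ($), 4 = most expensive ($$$$)
}

// Illustrative upper bounds only; calibrate against your own restore drills.
const PROFILES: StrategyProfile[] = [
  { name: "backup-restore", worstRtoMinutes: 48 * 60, worstRpoMinutes: 24 * 60, costRank: 1 },
  { name: "pilot-light", worstRtoMinutes: 4 * 60, worstRpoMinutes: 60, costRank: 2 },
  { name: "warm-standby", worstRtoMinutes: 60, worstRpoMinutes: 5, costRank: 3 },
  { name: "active-active", worstRtoMinutes: 1, worstRpoMinutes: 1, costRank: 4 },
];

// Cheapest strategy whose worst case fits both targets, or undefined if none does.
function cheapestStrategy(rtoMinutes: number, rpoMinutes: number): Strategy | undefined {
  return PROFILES
    .filter((p) => p.worstRtoMinutes <= rtoMinutes && p.worstRpoMinutes <= rpoMinutes)
    .sort((a, b) => a.costRank - b.costRank)[0]?.name;
}

console.log(cheapestStrategy(4 * 60, 15));
console.log(cheapestStrategy(72 * 60, 24 * 60));
```

Under these assumed bounds, an RTO of 4 hours with an RPO of 15 minutes lands on warm standby: pilot light meets the RTO but not the RPO. This shows how a tight RPO can drive cost even when the RTO is relaxed.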

Strategy details

  • Backup & restore — AWS Backup policies for RDS, EBS, DynamoDB, EFS, S3; cross-account vault copy; quarterly restore drills. RTO dominated by IaC provisioning, not the restore API.
  • Pilot light — data plane warm (RDS cross-region replica, S3 CRR, DynamoDB exports); compute dormant until failover; Route53 health-check failover.
  • Warm standby — scaled-down but running DR copy (~10% capacity); continuous replication; scale up on failover. Best enterprise balance.
  • Active-active — see multi-region; idempotent APIs and conflict-aware stores required.
  • AWS DRS — block-level continuous replication (formerly CloudEndure) for EC2/on-prem; failover in minutes without refactoring to managed services. Pay for staging storage + drill/failover instance hours.
bash
# AWS Backup — daily RDS with cross-account copy
aws backup create-backup-plan --backup-plan '{
  "BackupPlanName": "prod-daily",
  "Rules": [{
    "RuleName": "rds-daily-35d",
    "TargetBackupVaultName": "prod-vault",
    "ScheduleExpression": "cron(0 5 ? * * *)",
    "StartWindowMinutes": 60,
    "CompletionWindowMinutes": 180,
    "Lifecycle": { "DeleteAfterDays": 35 },
    "CopyActions": [{
      "DestinationBackupVaultArn": "arn:aws:backup:us-east-1:999999999999:backup-vault:dr-vault",
      "Lifecycle": { "DeleteAfterDays": 35 }
    }]
  }]
}'

# RDS cross-region read replica (pilot light data layer)
aws rds create-db-instance-read-replica \
  --db-instance-identifier orders-dr-replica \
  --source-db-instance-identifier arn:aws:rds:eu-west-1:123456789012:db:orders-prod \
  --region us-east-1
hcl
resource "aws_backup_plan" "prod" {
  name = "prod-daily"
  rule {
    rule_name         = "rds-daily"
    target_vault_name = aws_backup_vault.prod.name
    schedule          = "cron(0 5 ? * * *)"
    lifecycle { delete_after = 35 }

    copy_action {
      destination_vault_arn = aws_backup_vault.dr.arn
      lifecycle { delete_after = 35 }
    }
  }
}

resource "aws_db_instance" "dr_replica" {
  provider               = aws.dr
  identifier             = "orders-dr-replica"
  replicate_source_db    = aws_db_instance.orders.arn
  instance_class         = "db.r6g.large"
  publicly_accessible    = false
  skip_final_snapshot    = true
}
typescript
import * as backup from 'aws-cdk-lib/aws-backup';
import * as ec2 from 'aws-cdk-lib/aws-ec2';
import * as rds from 'aws-cdk-lib/aws-rds';

const plan = backup.BackupPlan.dailyMonthly1YearRetention(this, 'ProdBackup');
plan.addSelection('RdsSelection', {
  resources: [backup.BackupResource.fromRdsDatabaseInstance(db)],
});

// Cross-region replica: deploy db stack in DR region with replicateSourceDb
new rds.DatabaseInstanceReadReplica(this, 'DrReplica', {
  sourceDatabaseInstance: db,
  instanceType: ec2.InstanceType.of(ec2.InstanceClass.R6G, ec2.InstanceSize.LARGE),
  vpc: drVpc, // required property; drVpc is a placeholder for the DR-region VPC
});
⚖️ Trade-off

DRS vs refactor: DRS replicates what you have — including technical debt and single-AZ mistakes. Long-term, migrating stateful tiers to Aurora Global, DynamoDB Global Tables, and S3 CRR is cheaper to operate than perpetual block-level replication. Use DRS as a bridge, not a destination.

💡 Pro Tip

Tag every resource with dr-tier=1|2|3 and rto-hours in your IaC modules. During architecture reviews, filter by tier to ensure DR investment matches business criticality — not every Lambda needs cross-region failover.