Production Architecture Patterns

Multi-AZ architecture

An Availability Zone (AZ) is one or more isolated data centers within a region, with independent power, networking, and cooling. A single-AZ outage is not rare enough to ignore: power, networking, and hardware failures happen. Production workloads should spread across at least two AZs, and stateful services should use AWS-managed Multi-AZ failover instead of DIY replication.

Three-tier across AZs

The classic production pattern: web tier (stateless), application tier (stateless), data tier (stateful). Web and app tiers run in every AZ; the load balancer distributes traffic; the database runs Multi-AZ with synchronous standby. Each tier lives in private subnets per AZ, with only the load balancer in public subnets.

flowchart TB
  subgraph Internet
    USERS["Users / API clients"]
  end

  subgraph Region["AWS Region (e.g. eu-west-1)"]
    ALB["Application Load Balancer\n(public subnets, spans AZs)"]

    subgraph AZ1["Availability Zone A"]
      WEB1["Web tier\nECS tasks / EC2"]
      APP1["App tier\nSpring Boot / Node"]
      RDS_PRI["RDS Primary\n(Multi-AZ)"]
    end

    subgraph AZ2["Availability Zone B"]
      WEB2["Web tier\nECS tasks / EC2"]
      APP2["App tier\nSpring Boot / Node"]
      RDS_STBY["RDS Standby\n(synchronous replica)"]
    end

    subgraph AZ3["Availability Zone C"]
      WEB3["Web tier\nECS tasks / EC2"]
      APP3["App tier\nSpring Boot / Node"]
    end

    CACHE["ElastiCache\n(cluster mode, multi-AZ)"]
    S3["S3\n(regional, 11 nines)"]
  end

  USERS --> ALB
  ALB --> WEB1 & WEB2 & WEB3
  WEB1 --> APP1
  WEB2 --> APP2
  WEB3 --> APP3
  APP1 & APP2 & APP3 --> RDS_PRI
  RDS_PRI -.->|sync replication| RDS_STBY
  APP1 & APP2 & APP3 --> CACHE
  APP1 & APP2 & APP3 --> S3

ALB + ASG + RDS Multi-AZ — the production baseline

Component	Multi-AZ behavior	Production rule
ALB	Automatically spans AZs; health checks route around unhealthy targets	Register targets in ≥2 AZs; set healthy_threshold and deregistration delay
ASG	Distributes instances across the subnets (one per AZ) listed in the group and rebalances when an AZ becomes uneven	Set a minimum healthy percentage for instance refresh during deploys; use capacity rebalancing for Spot
RDS Multi-AZ	Synchronous standby in another AZ; automatic failover (~60–120 s)	Enable for all production databases; test failover quarterly
ElastiCache	Replication group with automatic failover across AZs	Multi-AZ with at least one replica per shard; not a source of truth
EFS	Regional (Standard) file systems store data redundantly across multiple AZs; One Zone file systems do not	Mount targets in every AZ where compute runs

Never single-AZ for stateful

Stateless compute can tolerate AZ loss if enough capacity exists elsewhere — ASG replaces instances, ALB drains unhealthy targets. Stateful services cannot be rebuilt from another AZ without data. Running RDS, ElastiCache primary, or self-managed databases in a single AZ means a zone outage equals data unavailability until manual recovery — often hours. AWS-managed Multi-AZ handles failover, DNS updates, and storage replication for you.

RDS / Aurora — Multi-AZ or Aurora with replicas in separate AZs
DynamoDB — Multi-AZ by default; no single-AZ option to misconfigure
OpenSearch — dedicated master nodes + data nodes across ≥3 AZs for production
MSK (Kafka) — broker placement across AZs; min.insync.replicas ≥ 2


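# RDS Multi-AZ; note that disabling it later may involve a brief outage
# The master username is a placeholder; --manage-master-user-password keeps the password in Secrets Manager
aws rds create-db-instance \
  --db-instance-identifier orders-prod \
  --engine postgres \
  --db-instance-class db.r6g.large \
  --allocated-storage 100 \
  --master-username orders_admin \
  --manage-master-user-password \
  --multi-az \
  --vpc-security-group-ids sg-abc123 \
  --db-subnet-group-name prod-db-subnets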

# ASG across three AZs (the launch template name is a placeholder)
aws autoscaling create-auto-scaling-group \
  --auto-scaling-group-name order-service-asg \
  --launch-template 'LaunchTemplateName=order-service-lt,Version=$Latest' \
  --min-size 3 --max-size 12 --desired-capacity 6 \
  --vpc-zone-identifier "subnet-az1,subnet-az2,subnet-az3" \
  --health-check-type ELB --health-check-grace-period 120

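resource "aws_db_instance" "orders" {
  identifier        = "orders-prod"
  engine            = "postgres"
  instance_class    = "db.r6g.large"
  allocated_storage = 100
  multi_az          = true

  # Placeholder admin user; the password is generated and kept in Secrets Manager
  username                    = "orders_admin"
  manage_master_user_password = true

  db_subnet_group_name   = aws_db_subnet_group.prod.name
  vpc_security_group_ids = [aws_security_group.db.id]
}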

resource "aws_autoscaling_group" "app" {
  name                = "order-service-asg"
  vpc_zone_identifier = aws_subnet.private[*].id
  min_size            = 3
  max_size            = 12
  desired_capacity    = 6
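  health_check_type   = "ELB"

  # An ASG needs something to launch; aws_launch_template.app is assumed to be defined elsewhere
  launch_template {
    id      = aws_launch_template.app.id
    version = "$Latest"
  }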

  tag {
    key                 = "Name"
    value               = "order-service"
    propagate_at_launch = true
  }
}

import * as ec2 from 'aws-cdk-lib/aws-ec2';
import * as rds from 'aws-cdk-lib/aws-rds';

const vpc = new ec2.Vpc(this, 'ProdVpc', {
  maxAzs: 3,
  natGateways: 3, // one per AZ for resilience
});

const db = new rds.DatabaseInstance(this, 'OrdersDb', {
  engine: rds.DatabaseInstanceEngine.postgres({ version: rds.PostgresEngineVersion.VER_16 }),
  vpc,
  multiAz: true,
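  // The default Vpc above creates only public and private-with-egress subnets,
  // so selecting PRIVATE_ISOLATED would fail at synth without a custom subnetConfiguration
  vpcSubnets: { subnetType: ec2.SubnetType.PRIVATE_WITH_EGRESS },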
});

⚠️ Pitfall

Running Multi-AZ RDS but placing all application instances in one AZ. When that AZ fails, ALB has no healthy targets even though the database failover succeeded. ASG vpc-zone-identifier must list subnets in every AZ, and minimum capacity should be ≥ number of AZs.
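The capacity rule in this pitfall can be checked with simple arithmetic. The sketch below is illustrative only: the function names are hypothetical, and it assumes the ASG keeps instances evenly balanced across its AZs and that the fullest AZ is the one that fails.

```typescript
// Hypothetical helper: instances left after losing the fullest AZ,
// assuming balanced placement across azCount zones.
function survivingCapacity(desiredCapacity: number, azCount: number): number {
  if (azCount < 1) {
    throw new Error("azCount must be at least 1");
  }
  // With balanced placement the fullest AZ holds ceil(desired / azCount) instances.
  const largestAzShare = Math.ceil(desiredCapacity / azCount);
  return desiredCapacity - largestAzShare;
}

// True when the remaining instances can still carry the required load.
function toleratesAzLoss(
  desiredCapacity: number,
  azCount: number,
  requiredForPeak: number
): boolean {
  return survivingCapacity(desiredCapacity, azCount) >= requiredForPeak;
}

// The ASG from the examples above: 6 instances across 3 AZs.
console.log(survivingCapacity(6, 3));  // instances left after one AZ fails
console.log(toleratesAzLoss(6, 1, 1)); // single-AZ placement never survives
```

With six instances in three AZs, four remain after a zone outage, so peak load must fit on four instances or the desired capacity needs headroom.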

🎯 Exam Tip

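Multi-AZ ≠ read scaling. In a Multi-AZ DB instance deployment the standby is not readable. The exceptions are Multi-AZ DB cluster deployments, whose two standbys accept reads, and Aurora, which exposes separate reader endpoints. For read scaling use read replicas; for HA use Multi-AZ. Exam questions often conflate the two.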

🔬 Under the Hood

RDS Multi-AZ failover updates the CNAME of the primary endpoint to point at the former standby. DNS TTL is low, but clients with aggressive connection pooling may hold stale connections for minutes. Use connection pools with validation queries (SELECT 1) and retry logic on connection refused.
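A minimal sketch of the retry policy described above, assuming a Node.js client where socket failures surface as error codes such as ECONNREFUSED. The function names and the base and cap values are hypothetical; a real pool would also validate connections with SELECT 1 before handing them out.

```typescript
// Node.js socket error codes that usually mean the endpoint moved or the
// connection went stale (an assumption about how the driver reports errors).
const RETRYABLE_CODES: string[] = ["ECONNREFUSED", "ECONNRESET", "ETIMEDOUT"];

function isRetryableConnectionError(code: string | undefined): boolean {
  return code !== undefined && RETRYABLE_CODES.indexOf(code) !== -1;
}

// Exponential backoff: baseMs, 2x, 4x, ... capped at capMs. attempt starts at 0.
function backoffDelayMs(attempt: number, baseMs: number, capMs: number): number {
  return Math.min(capMs, baseMs * Math.pow(2, attempt));
}

// Delay schedule for a given number of attempts, e.g. to log or to drive a retry loop.
function backoffSchedule(attempts: number, baseMs: number, capMs: number): number[] {
  const delays: number[] = [];
  for (let attempt = 0; attempt < attempts; attempt++) {
    delays.push(backoffDelayMs(attempt, baseMs, capMs));
  }
  return delays;
}

console.log(backoffSchedule(6, 100, 2000));
```

Capping the delay keeps total retry time bounded, so a failover that takes one to two minutes is covered by several spaced attempts rather than one long sleep.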

Multi-region design

Multi-AZ protects against data center failure within a region. Multi-region protects against regional outages, satisfies data residency requirements, and reduces latency for global users. The hard part is data — not routing.

Active-passive vs active-active

Pattern	Traffic flow	Data model	When to use
Active-passive	Primary region serves 100%; secondary on standby	Async replication; manual or automated failover	DR site, regulatory standby, cost-sensitive production
Active-active	Both regions serve traffic simultaneously	Multi-master or conflict-aware replication required	Global low-latency, zero-downtime failover, strict (near-zero) RTO/RPO targets

Active-passive is simpler: Route53 failover routing sends traffic to the secondary only when health checks fail on the primary. Active-active requires solving write conflicts — DynamoDB Global Tables (last-writer-wins), Aurora Global Database (one primary writer per cluster, promoted on failover), or application-level idempotency keys.
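To make last-writer-wins concrete, here is a small sketch of the semantics. It is not DynamoDB's internal implementation: the VersionedItem shape and the region-name tie-break are assumptions made for illustration.

```typescript
interface VersionedItem {
  key: string;
  value: string;
  updatedAtMs: number; // write timestamp in milliseconds
  region: string;      // region that accepted the write
}

// Last-writer-wins: the newer timestamp wins. Ties break on region name so
// that every replica deterministically picks the same winner (assumed rule).
function resolveLww(a: VersionedItem, b: VersionedItem): VersionedItem {
  if (a.updatedAtMs !== b.updatedAtMs) {
    return a.updatedAtMs > b.updatedAtMs ? a : b;
  }
  return a.region >= b.region ? a : b;
}

// Two regions update the same session within a millisecond of each other.
const euWrite: VersionedItem = { key: "s1", value: "cart=3", updatedAtMs: 1000, region: "eu-west-1" };
const usWrite: VersionedItem = { key: "s1", value: "cart=4", updatedAtMs: 1001, region: "us-east-1" };

console.log(resolveLww(euWrite, usWrite).value);
```

The earlier write is silently discarded, which is why this model suits session stores and profiles but not counters or ledgers.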

Route53 failover and latency routing

Failover routing: primary/secondary records with health checks (HTTP, HTTPS, TCP, or CloudWatch alarm). The record marked SECONDARY is returned only when the primary is unhealthy.
Latency-based routing — Route53 returns the record with lowest latency to the user's resolver. Good for active-active read paths.
Geolocation routing — route EU users to eu-west-1, US users to us-east-1. Combine with data residency policies.
Health checks — must test end-to-end (ALB path), not just EC2 ping. Use calculated health checks aggregating multiple child checks.
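The routing policies above can be modelled as small selection functions. This is a simplified sketch, not Route 53's actual resolver logic. The record shapes and measured latencies are hypothetical, and the "both unhealthy returns primary" branch reflects my understanding of the documented failover behavior.

```typescript
type FailoverRole = "PRIMARY" | "SECONDARY";

interface FailoverRecord {
  role: FailoverRole;
  target: string;
  healthy: boolean;
}

// Failover routing: primary while healthy, otherwise a healthy secondary.
function resolveFailover(records: FailoverRecord[]): string | undefined {
  const primary = records.filter((r) => r.role === "PRIMARY")[0];
  const secondary = records.filter((r) => r.role === "SECONDARY")[0];
  if (primary && primary.healthy) return primary.target;
  if (secondary && secondary.healthy) return secondary.target;
  // Neither is healthy: fall back to the primary record if one exists.
  return primary ? primary.target : undefined;
}

interface LatencyRecord {
  target: string;
  latencyMs: number; // hypothetical measured latency from the resolver
  healthy: boolean;
}

// Latency-based routing: the healthy record with the lowest latency.
function resolveLatency(records: LatencyRecord[]): string | undefined {
  let best: LatencyRecord | undefined = undefined;
  for (let i = 0; i < records.length; i++) {
    const r = records[i];
    if (r.healthy && (best === undefined || r.latencyMs < best.latencyMs)) {
      best = r;
    }
  }
  return best ? best.target : undefined;
}
```

Note that both functions depend entirely on the health flag, which is why the health check has to exercise the real request path.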

Global Tables and Aurora Global Database

DynamoDB Global Tables

Multi-region, multi-active: writes in any region replicate to all (typically <1 s)
Conflict resolution: last writer wins based on timestamp
Ideal for session stores, user profiles, idempotent event logs
Not ideal when strong ordering or transactional cross-item guarantees are required

Aurora Global Database

One primary cluster accepts writes; up to five secondary regions (<1 s replication lag)
Managed failover: promote secondary to primary in <1 minute (RTO)
Secondary clusters can serve read-only traffic — good for active-passive reads
PostgreSQL and MySQL compatible; use for relational workloads needing global DR

Data residency

GDPR, HIPAA, and national banking regulations often require data to remain in specific jurisdictions. Architecture implications: deploy a full stack per region (not just replicate), use separate KMS keys per region, restrict cross-region S3 replication with bucket policies, and document data flows for auditors. Route53 geolocation alone does not enforce residency — a determined client can call any regional API endpoint. Enforce at the API Gateway resource policy and IAM condition keys (aws:RequestedRegion).
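As an illustration of the aws:RequestedRegion guardrail mentioned above, the sketch below builds a deny policy document as a plain object. The allowed-region list and function name are assumptions, and a production policy would normally exempt global services (IAM, Route 53, CloudFront and similar) rather than deny every action.

```typescript
// Builds an IAM-style policy that denies requests to any region outside
// the allowed list. Simplified: no exemptions for global services.
function regionGuardPolicy(allowedRegions: string[]) {
  return {
    Version: "2012-10-17",
    Statement: [
      {
        Sid: "DenyOutsideAllowedRegions",
        Effect: "Deny",
        Action: "*",
        Resource: "*",
        Condition: {
          StringNotEquals: { "aws:RequestedRegion": allowedRegions },
        },
      },
    ],
  };
}

// Hypothetical EU-only residency boundary.
const euOnlyPolicy = regionGuardPolicy(["eu-west-1", "eu-central-1"]);
console.log(JSON.stringify(euOnlyPolicy, null, 2));
```

Attached as a service control policy or permissions boundary, a document like this blocks calls to other regions regardless of which DNS name the client resolved.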


# Route53 failover record — primary ALB in eu-west-1
aws route53 change-resource-record-sets --hosted-zone-id Z1234567890 \
  --change-batch '{
    "Changes": [{
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "api.example.com",
        "Type": "A",
        "SetIdentifier": "primary-eu",
        "Failover": "PRIMARY",
        "AliasTarget": {
          "HostedZoneId": "Z32O12X0K2O1VO",
          "DNSName": "dualstack.my-alb-eu-1234567890.eu-west-1.elb.amazonaws.com",
          "EvaluateTargetHealth": true
        },
        "HealthCheckId": "abc-def-ghi"
      }
    }]
  }'

# DynamoDB Global Table — add replica region
aws dynamodb update-table --table-name Sessions \
  --replica-updates '[{"Create": {"RegionName": "us-east-1"}}]'

resource "aws_route53_record" "api_primary" {
  zone_id        = aws_route53_zone.main.zone_id
  name           = "api.example.com"
  type           = "A"
  set_identifier = "primary-eu"
  failover_routing_policy { type = "PRIMARY" }
  health_check_id = aws_route53_health_check.api.id

  alias {
    name                   = aws_lb.eu.dns_name
    zone_id                = aws_lb.eu.zone_id
    evaluate_target_health = true
  }
}

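resource "aws_dynamodb_table" "sessions" {
  name             = "Sessions"
  billing_mode     = "PAY_PER_REQUEST"
  hash_key         = "sessionId"
  stream_enabled   = true
  stream_view_type = "NEW_AND_OLD_IMAGES"

  # Key attributes must be declared explicitly
  attribute {
    name = "sessionId"
    type = "S"
  }

  # List replicas for other regions only; the table's home region
  # (assumed here to be eu-west-1, the provider region) is implicit
  replica { region_name = "us-east-1" }
}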

import * as dynamodb from 'aws-cdk-lib/aws-dynamodb';
import * as route53 from 'aws-cdk-lib/aws-route53';

const table = new dynamodb.Table(this, 'Sessions', {
  partitionKey: { name: 'sessionId', type: dynamodb.AttributeType.STRING },
  billingMode: dynamodb.BillingMode.PAY_PER_REQUEST,
  replicationRegions: ['us-east-1'], // replica regions only; assumes the stack itself deploys to eu-west-1
  stream: dynamodb.StreamViewType.NEW_AND_OLD_IMAGES,
});

// Failover routing configured via Route53 CfnRecordSet or separate stack per region

⚖️ Trade-off

Active-active vs cost: running full capacity in two regions doubles compute and data transfer bills. Active-passive with scaled-down standby (pilot light) cuts cost but increases RTO. Match spend to the business SLA — a B2B admin portal and a consumer payments API have different recovery requirements.

📦 Real World

Slack runs regional cells with failover boundaries — a regional outage affects one cell, not the entire product. Duolingo uses DynamoDB Global Tables for user progress so learners keep streaks regardless of which region serves their session. Both invested in data layer design before multi-region routing.

Microservices on AWS

Microservices on AWS means choosing an orchestrator (ECS or EKS), how services find each other, whether you need a service mesh, and how to isolate environments. The orchestrator choice affects operational burden more than application code quality.

ECS vs EKS

Dimension	ECS (Fargate or EC2)	EKS (Kubernetes)
Operational overhead	Lower — AWS-managed control plane, simpler mental model	Higher — control plane + node upgrades, CRDs, cluster add-ons
Portability	AWS-native; limited multi-cloud	Kubernetes manifests portable across clouds and on-prem
Scaling model	Service auto-scaling on CPU/memory/request count	HPA, VPA, Karpenter, Cluster Autoscaler — more knobs
Networking	awsvpc mode — each task gets an ENI (Fargate limits apply)	VPC CNI, service networking, Ingress controllers (ALB, NGINX)
Best fit	Small-to-mid teams, Spring Boot services, fast time-to-production	Platform teams, polyglot microservices, existing K8s investment

Default recommendation for most backend teams: ECS on Fargate until you have a concrete reason for Kubernetes (existing Helm charts, service mesh requirements, multi-cloud mandate). EKS shines when a platform team can amortize cluster operations across dozens of services.

Service discovery

ECS Service Connect — built-in service mesh-lite; services call http://orders:8080 without hardcoded IPs
AWS Cloud Map — DNS-based or API-based registry; integrates with ECS and EKS
ALB path routing — external clients and BFF layer; internal east-west traffic should not hairpin through ALB
EKS CoreDNS — standard K8s service discovery via ClusterIP; headless services for StatefulSets

AWS App Mesh

App Mesh is a managed service mesh using Envoy sidecars. It provides mutual TLS, traffic shifting, retries, and observability between services without embedding that logic in application code. It fits when you have many inter-service calls and need consistent retry/timeout policies, at the price of sidecar overhead (memory + latency). Note that AWS has announced end of support for App Mesh on September 30, 2026, so new designs should favor the alternatives: ECS Service Connect for simpler cases; Istio/Linkerd on EKS if you want the CNCF ecosystem.

VPC per environment

Production pattern: separate VPC per environment (dev, staging, prod) or per account in AWS Organizations. Never share a VPC between prod and non-prod — a misconfigured security group in dev should not reach prod RDS. Cross-environment access goes through CI/CD roles and explicit peering/transit gateway — not flat networking.


# ECS Service Connect — enable on service
aws ecs create-service \
  --cluster prod-cluster \
  --service-name order-service \
  --task-definition order-service:42 \
  --desired-count 3 \
  --launch-type FARGATE \
  --network-configuration "awsvpcConfiguration={subnets=[subnet-a,subnet-b],securityGroups=[sg-app]}" \
  --service-connect-configuration '{
    "enabled": true,
    "namespace": "prod.local",
    "services": [{
      "portName": "http",
      "discoveryName": "orders",
      "clientAliases": [{ "port": 8080, "dnsName": "orders" }]
    }]
  }'

resource "aws_service_discovery_private_dns_namespace" "prod" {
  name = "prod.local"
  vpc  = aws_vpc.prod.id
}

resource "aws_ecs_service" "orders" {
  name            = "order-service"
  cluster         = aws_ecs_cluster.prod.id
  task_definition = aws_ecs_task_definition.orders.arn
  desired_count   = 3
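  launch_type     = "FARGATE"

  # Fargate tasks use awsvpc networking, so subnets and security groups are required.
  # The subnet and security group references are placeholders.
  network_configuration {
    subnets         = aws_subnet.private[*].id
    security_groups = [aws_security_group.app.id]
  }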

  service_connect_configuration {
    enabled   = true
    namespace = aws_service_discovery_private_dns_namespace.prod.arn
    service {
      port_name      = "http"
      discovery_name = "orders"
      client_alias {
        port     = 8080
        dns_name = "orders"
      }
    }
  }
}

import * as ecs from 'aws-cdk-lib/aws-ecs';
import * as ec2 from 'aws-cdk-lib/aws-ec2';

const cluster = new ecs.Cluster(this, 'ProdCluster', { vpc, defaultCloudMapNamespace: { name: 'prod.local' } });

const orderService = new ecs.FargateService(this, 'OrderService', {
  cluster,
  taskDefinition: orderTaskDef,
  serviceConnectConfiguration: {
    services: [{
      portMappingName: 'http',
      dnsName: 'orders',
      port: 8080,
    }],
  },
});

💡 Pro Tip

Start with one ECS cluster per environment, not per service. Namespace isolation via Service Connect or separate Cloud Map namespaces. Split clusters only when blast radius or compliance requires it (e.g. PCI services in a dedicated cluster).

💰 Cost

EKS control plane costs $0.10/hour per cluster (~$73/month) regardless of workload size. Fargate has no cluster fee but higher per-vCPU pricing than EC2-backed ECS. For 5–10 microservices at steady load, EC2-backed ECS with Savings Plans often wins on cost; Fargate wins on ops simplicity.
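The monthly figure follows from the hourly rate and an average month of 730 hours (8,760 hours per year divided by 12). A tiny sketch of that arithmetic, with a hypothetical function name and the rate quoted above as the default:

```typescript
// Average hours in a month: 8760 hours per year / 12.
const HOURS_PER_MONTH = 730;

// Control plane fee only; worker nodes and Fargate usage are billed separately.
function eksControlPlaneMonthlyUsd(clusters: number, hourlyRateUsd: number = 0.1): number {
  const raw = clusters * hourlyRateUsd * HOURS_PER_MONTH;
  // Round to cents to hide floating-point noise.
  return Math.round(raw * 100) / 100;
}

console.log(eksControlPlaneMonthlyUsd(1)); // one cluster
console.log(eksControlPlaneMonthlyUsd(3)); // dev + staging + prod
```

The fee scales with cluster count, not workload size, which is one reason to avoid a cluster per service.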

Serverless architecture

The canonical AWS serverless stack — API Gateway, Lambda, DynamoDB — scales to zero and charges per request. Production serverless fails on cold starts, concurrency limits, and DynamoDB hot partitions long before it fails on business logic.

API Gateway + Lambda + DynamoDB

Request flow: client → API Gateway (auth, throttling, request validation) → Lambda (business logic) → DynamoDB (state). Add SQS between API Gateway and Lambda when you need buffering; add ElastiCache when read patterns are hot and cacheable. Keep Lambda functions small and single-purpose — one function per route group, not one monolith handler parsing every path.

Step Functions for orchestration

When a workflow spans multiple Lambdas, external APIs, or human approval steps, use Step Functions instead of chaining Lambdas via SNS/SQS spaghetti. Standard workflows for long-running processes (up to one year); Express workflows for high-volume short flows (<5 minutes). Built-in retry, catch, and parallel states replace custom orchestration code.

Order fulfillment — charge payment → reserve inventory → ship → notify (with compensating transactions on failure)
Document processing — upload to S3 → Textract → summarize with Bedrock → store in DynamoDB
Human-in-the-loop — waitForTaskToken for manual approval before provisioning
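The order fulfillment flow above might look like the following Amazon States Language definition, written here as a TypeScript object for readability. It is a sketch under assumptions: the Lambda ARNs and state names are hypothetical, and only one compensating step (refund on inventory failure) is shown.

```typescript
// Hypothetical ARN prefix for the Lambda functions behind each step.
const FN = "arn:aws:lambda:eu-west-1:123456789012:function:";

const orderFulfillment = {
  Comment: "Charge, reserve, ship, notify, with a refund as compensation",
  StartAt: "ChargePayment",
  States: {
    ChargePayment: {
      Type: "Task",
      Resource: FN + "charge-payment",
      // Built-in retry replaces hand-written retry loops.
      Retry: [{ ErrorEquals: ["States.TaskFailed"], IntervalSeconds: 2, MaxAttempts: 3, BackoffRate: 2.0 }],
      Next: "ReserveInventory",
    },
    ReserveInventory: {
      Type: "Task",
      Resource: FN + "reserve-inventory",
      // On any error, run the compensating transaction.
      Catch: [{ ErrorEquals: ["States.ALL"], Next: "RefundPayment" }],
      Next: "Ship",
    },
    Ship: { Type: "Task", Resource: FN + "ship-order", Next: "Notify" },
    Notify: { Type: "Task", Resource: FN + "notify-customer", End: true },
    RefundPayment: { Type: "Task", Resource: FN + "refund-payment", Next: "OrderFailed" },
    OrderFailed: { Type: "Fail", Error: "OrderFailed", Cause: "Inventory reservation failed; payment refunded" },
  },
};

// JSON form, suitable for passing as a state machine definition.
const definitionJson = JSON.stringify(orderFulfillment, null, 2);
console.log(definitionJson);
```

Keeping the compensation path in the definition makes the failure behavior reviewable in one place instead of being spread across queue consumers.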

Production pitfalls

Pitfall	Symptom	Mitigation
Cold starts	P99 latency spikes on first request after idle; JVM/Spring especially bad	Provisioned concurrency; SnapStart (Java); smaller runtimes (Node, Python); keep-warm (last resort)
Concurrency limits	429 TooManyRequestsException during traffic spikes	Request account limit increase; reserved concurrency per critical function; SQS buffer for async
Hot partitions	Throttling on one DynamoDB partition despite low overall table usage	Design partition keys for even distribution; use write sharding suffixes; on-demand mode helps burst
Timeout cascades	API Gateway's 29 s default integration timeout vs Lambda's 15 min maximum; synchronous chains time out	Async processing via SQS + Step Functions; return 202 Accepted with polling endpoint
Connection exhaustion	RDS from Lambda opens too many connections per concurrent execution	RDS Proxy; DynamoDB instead; limit reserved concurrency to match pool size
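For the connection exhaustion row, the "reserved concurrency to match pool size" rule is plain arithmetic. A minimal sketch, assuming each concurrent execution holds a fixed number of connections; the function name and sample numbers are hypothetical.

```typescript
// Largest reserved concurrency that cannot exhaust the database, given how many
// connections each execution holds and how many are set aside for other clients.
function maxSafeConcurrency(
  dbMaxConnections: number,
  connectionsPerExecution: number,
  reservedForOthers: number
): number {
  const available = dbMaxConnections - reservedForOthers;
  if (available <= 0 || connectionsPerExecution <= 0) {
    return 0;
  }
  return Math.floor(available / connectionsPerExecution);
}

// A database allowing 200 connections, with 20 kept for migrations and admin tools.
console.log(maxSafeConcurrency(200, 1, 20));
```

If the result is lower than the traffic you need to serve, that is the signal to put RDS Proxy in front of the database rather than raise concurrency.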


# Provisioned concurrency: keeps 10 execution environments initialized, avoiding cold starts up to that concurrency
aws lambda put-provisioned-concurrency-config \
  --function-name order-api \
  --qualifier live \
  --provisioned-concurrent-executions 10

# DynamoDB on-demand with PITR for production
aws dynamodb create-table \
  --table-name Orders \
  --attribute-definitions AttributeName=orderId,AttributeType=S \
  --key-schema AttributeName=orderId,KeyType=HASH \
  --billing-mode PAY_PER_REQUEST

aws dynamodb update-continuous-backups \
  --table-name Orders \
  --point-in-time-recovery-specification PointInTimeRecoveryEnabled=true

resource "aws_lambda_provisioned_concurrency_config" "order_api" {
  function_name                     = aws_lambda_function.order_api.function_name
  qualifier                         = aws_lambda_alias.live.name
  provisioned_concurrent_executions = 10
}

resource "aws_dynamodb_table" "orders" {
  name         = "Orders"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "orderId"

  attribute {
    name = "orderId"
    type = "S"
  }

  point_in_time_recovery { enabled = true }
}

import * as lambda from 'aws-cdk-lib/aws-lambda';
import * as dynamodb from 'aws-cdk-lib/aws-dynamodb';
import * as apigw from 'aws-cdk-lib/aws-apigateway';

const table = new dynamodb.Table(this, 'Orders', {
  partitionKey: { name: 'orderId', type: dynamodb.AttributeType.STRING },
  billingMode: dynamodb.BillingMode.PAY_PER_REQUEST,
  pointInTimeRecovery: true,
});

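const fn = new lambda.Function(this, 'OrderApi', {
  runtime: lambda.Runtime.NODEJS_20_X,
  handler: 'index.handler',
  code: lambda.Code.fromAsset('lambda'),
});

// Provisioned concurrency applies to a version or alias, never $LATEST,
// so route API traffic through an alias on the current version.
const live = new lambda.Alias(this, 'LiveAlias', {
  aliasName: 'live',
  version: fn.currentVersion,
  provisionedConcurrentExecutions: 10,
});

new apigw.LambdaRestApi(this, 'OrdersApi', { handler: live });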

⚠️ Pitfall

Using userId as the sole DynamoDB partition key for a social feed. Celebrity users create hot partitions, and requests for those keys are throttled even though the table as a whole has spare capacity. Shard writes with userId#${random(0,N)} and scatter-gather reads, or use a GSI with time-based sort keys for feed queries.
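A minimal sketch of the write-sharding scheme described in this pitfall. The function names are hypothetical, shards are numbered 0 to N-1, and the random source is injectable so the behavior can be tested.

```typescript
// Write path: spread one logical key over shardCount physical partition keys.
function shardedPartitionKey(
  userId: string,
  shardCount: number,
  random: () => number = Math.random
): string {
  const shard = Math.floor(random() * shardCount); // 0 .. shardCount - 1
  return `${userId}#${shard}`;
}

// Read path: scatter-gather must query every shard key and merge the results.
function allShardKeys(userId: string, shardCount: number): string[] {
  const keys: string[] = [];
  for (let shard = 0; shard < shardCount; shard++) {
    keys.push(`${userId}#${shard}`);
  }
  return keys;
}

console.log(shardedPartitionKey("celebrity-42", 8));
console.log(allShardKeys("celebrity-42", 3));
```

The cost moves to the read side: a feed query becomes N queries, so pick the smallest shard count that keeps writes under the per-partition limit.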

🔒 Security

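API Gateway resource policies and Lambda authorizers are not interchangeable: resource policies restrict which principals, source IPs, or VPC endpoints may invoke the API, while authorizers authenticate and authorize individual callers. Use Cognito User Pools or IAM SigV4 for machine clients; JWT authorizers for SPA/mobile. Never expose DynamoDB directly to the internet; always go through Lambda or AppSync with fine-grained resolver auth.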

CI/CD on AWS

Production CI/CD pushes artifacts to ECR, deploys to ECS/EKS with zero-downtime strategies, and gates releases with automated tests and manual approvals. The deploy mechanism matters as much as the pipeline — a green build that takes down production is worse than no pipeline at all.

GitHub Actions → ECR → ECS/EKS

The standard flow for containerized services:

Developer merges PR to main
GitHub Actions assumes AWS role via OIDC (no long-lived keys)
Build Docker image, run tests, scan with Trivy or ECR image scanning
Push to ECR with immutable tag (sha-abc1234)
Update ECS task definition or EKS deployment manifest with new image digest
Deploy via aws ecs update-service, CodeDeploy, or ArgoCD sync

# .github/workflows/deploy.yml (excerpt)
jobs:
  deploy:
    runs-on: ubuntu-latest
    permissions:
      id-token: write
      contents: read
    steps:
      - uses: actions/checkout@v4
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789012:role/github-actions-deploy
          aws-region: eu-west-1
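      - uses: aws-actions/amazon-ecr-login@v2
        id: login-ecr
      - env:
          # Registry URI comes from the login step's output
          ECR_REGISTRY: ${{ steps.login-ecr.outputs.registry }}
        run: |
          docker build -t $ECR_REGISTRY/order-service:${{ github.sha }} .
          docker push $ECR_REGISTRY/order-service:${{ github.sha }}
      - run: |
          # register-task-definition arguments are elided; --query extracts the new ARN
          aws ecs update-service --cluster prod --service order-service \
            --force-new-deployment \
            --task-definition $(aws ecs register-task-definition ... \
              --query 'taskDefinition.taskDefinitionArn' --output text)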

Blue/green on ECS

CodeDeploy for ECS shifts ALB listener traffic from "blue" (current) to "green" (new) task set. Deployment configuration controls traffic shift speed and bake time. Automatic rollback on CloudWatch alarm breach. Requires two target groups and a CodeDeploy application — more setup than rolling deploy, but true zero-downtime with instant rollback.

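Canary: shift a small slice first, then the rest; e.g. CodeDeployDefault.ECSCanary10Percent5Minutes sends 10%, then the remaining 90% five minutes later
Linear: shift equal increments at fixed intervals, e.g. CodeDeployDefault.ECSLinear10PercentEvery1Minutes
All-at-once: instant cutover to the green task set after health checks pass
Rollback: revert the listener to the blue target group in seconds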

Canary Lambda aliases

Lambda aliases (live, beta) point to function versions. Weighted alias routing sends 5% of traffic to the new version while 95% stays on the current. CloudWatch monitors error rate; CodeDeploy automates weight shifts. Ideal for serverless APIs where ECS-style blue/green is overkill.

$ # Publish new version
$ aws lambda publish-version --function-name order-api
→ Version: 47
$ # Route 10% to v47, 90% to v46 via alias
$ aws lambda update-alias --function-name order-api --name live \
  --routing-config '{"AdditionalVersionWeights": {"47": 0.1}}'
$ # Promote to 100% after bake time
$ aws lambda update-alias --function-name order-api --name live --function-version 47
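To show what the routing config above means for individual requests, here is a simplified model of weighted alias selection. It is a sketch of the idea rather than Lambda's implementation: the function name is hypothetical and the random roll is passed in so the outcome is deterministic.

```typescript
// Picks a version for one invocation. additionalWeights mirrors the
// AdditionalVersionWeights map; the primary version receives the remainder.
function pickVersion(
  primaryVersion: string,
  additionalWeights: { [version: string]: number },
  roll: number // uniform random number in [0, 1)
): string {
  let cumulative = 0;
  const versions = Object.keys(additionalWeights);
  for (let i = 0; i < versions.length; i++) {
    cumulative += additionalWeights[versions[i]];
    if (roll < cumulative) {
      return versions[i];
    }
  }
  return primaryVersion;
}

// Alias "live" points at 46 with 10% shifted to 47, as in the commands above.
const weights = { "47": 0.1 };
console.log(pickVersion("46", weights, Math.random()));
```

Because selection is per invocation, a single client can hit both versions during the bake period, so the two versions must be compatible with the same stored data.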

GitOps on EKS

GitOps (Flux or ArgoCD) watches a manifest repo and reconciles cluster state — drift detection reverts manual kubectl edit changes. Pair with Sealed Secrets or External Secrets; never commit plaintext credentials. Image updater bots open PRs when ECR receives new digests.

🎯 Exam Tip

CodeDeploy vs CodePipeline: CodePipeline orchestrates stages (build, test, deploy); CodeDeploy performs the actual deployment strategy (in-place, blue/green, canary). Lambda aliases with weighted routing are the serverless equivalent of ECS blue/green — know which service handles which concern.

📦 Real World

FT.com and many AWS-native teams use GitHub Actions with OIDC to deploy to EKS via ArgoCD — the pipeline builds and pushes; GitOps reconciles. Separating "build credentials" from "deploy credentials" limits blast radius if a workflow is compromised.

Disaster recovery

Disaster recovery is a business decision expressed in two numbers: RTO (how fast you recover) and RPO (how much data you can lose). AWS offers four classic patterns plus AWS Elastic Disaster Recovery (DRS) for block-level replication of entire servers.

RTO and RPO defined

RTO (Recovery Time Objective) — maximum acceptable downtime after a disaster. RTO = 4 hours means the business tolerates four hours offline.
RPO (Recovery Point Objective) — maximum acceptable data loss measured in time. RPO = 15 minutes means you may lose up to 15 minutes of writes.

Lower RTO/RPO costs more. A backup-restore strategy with RPO of 24 hours is cheap; active-active with RPO near zero requires continuous replication and conflict resolution infrastructure.

DR strategy comparison

Strategy	RTO	RPO	Cost	What's running in DR site	AWS services
Backup & restore	Hours – days	Hours (last backup)	$	Nothing — restore from backups when needed	AWS Backup, RDS snapshots, S3 versioning, AMIs
Pilot light	1 – 4 hours	Minutes – hours	$$	Core data layer live; compute scaled to zero or minimal	RDS cross-region read replica, S3 CRR, AMIs in DR region
Warm standby	Minutes – 1 hour	Minutes – seconds	$$$	Scaled-down but functional copy of production	ASG min=1 in DR, Aurora Global, Route53 failover
Active-active	Near zero	Near zero – seconds	$$$$	Full production capacity in multiple regions	Global Tables, Aurora Global, Route53 latency routing, multi-region ALB
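The comparison table can be turned into a simple selection rule: take the cheapest strategy whose worst case still meets the business targets. The numeric bounds below are illustrative assumptions read off the ranges in the table (for example, "hours to days" taken as 48 hours), not AWS guarantees.

```typescript
interface DrStrategy {
  name: string;
  worstCaseRtoHours: number; // assumed upper bound from the table
  worstCaseRpoHours: number; // assumed upper bound from the table
  costTier: number;          // 1 = $, 4 = $$$$
}

// Ordered from cheapest to most expensive.
const DR_STRATEGIES: DrStrategy[] = [
  { name: "Backup & restore", worstCaseRtoHours: 48, worstCaseRpoHours: 24, costTier: 1 },
  { name: "Pilot light", worstCaseRtoHours: 4, worstCaseRpoHours: 1, costTier: 2 },
  { name: "Warm standby", worstCaseRtoHours: 1, worstCaseRpoHours: 0.25, costTier: 3 },
  { name: "Active-active", worstCaseRtoHours: 0.02, worstCaseRpoHours: 0.001, costTier: 4 },
];

// Returns the cheapest strategy meeting both targets, or undefined if none does.
function cheapestStrategy(rtoHours: number, rpoHours: number): string | undefined {
  for (let i = 0; i < DR_STRATEGIES.length; i++) {
    const s = DR_STRATEGIES[i];
    if (s.worstCaseRtoHours <= rtoHours && s.worstCaseRpoHours <= rpoHours) {
      return s.name;
    }
  }
  return undefined;
}

console.log(cheapestStrategy(4, 1));      // an internal tool
console.log(cheapestStrategy(0.5, 0.1));  // a payments API
```

Running targets through a rule like this during architecture reviews makes over-provisioned and under-protected workloads easy to spot.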

Strategy details

Backup & restore — AWS Backup policies for RDS, EBS, DynamoDB, EFS, S3; cross-account vault copy; quarterly restore drills. RTO dominated by IaC provisioning, not the restore API.
Pilot light — data plane warm (RDS cross-region replica, S3 CRR, DynamoDB exports); compute dormant until failover; Route53 health-check failover.
Warm standby — scaled-down but running DR copy (~10% capacity); continuous replication; scale up on failover. Best enterprise balance.
Active-active — see multi-region; idempotent APIs and conflict-aware stores required.
AWS DRS — block-level continuous replication (formerly CloudEndure) for EC2/on-prem; failover in minutes without refactoring to managed services. Pay for staging storage + drill/failover instance hours.


# AWS Backup — daily RDS with cross-account copy
aws backup create-backup-plan --backup-plan '{
  "BackupPlanName": "prod-daily",
  "Rules": [{
    "RuleName": "rds-daily-35d",
    "TargetBackupVaultName": "prod-vault",
    "ScheduleExpression": "cron(0 5 ? * * *)",
    "StartWindowMinutes": 60,
    "CompletionWindowMinutes": 180,
    "Lifecycle": { "DeleteAfterDays": 35 },
    "CopyActions": [{
      "DestinationBackupVaultArn": "arn:aws:backup:us-east-1:999999999999:backup-vault:dr-vault",
      "Lifecycle": { "DeleteAfterDays": 35 }
    }]
  }]
}'

# RDS cross-region read replica (pilot light data layer)
aws rds create-db-instance-read-replica \
  --db-instance-identifier orders-dr-replica \
  --source-db-instance-identifier arn:aws:rds:eu-west-1:123456789012:db:orders-prod \
  --region us-east-1

resource "aws_backup_plan" "prod" {
  name = "prod-daily"
  rule {
    rule_name         = "rds-daily"
    target_vault_name = aws_backup_vault.prod.name
    schedule          = "cron(0 5 ? * * *)"
    lifecycle { delete_after = 35 }

    copy_action {
      destination_vault_arn = aws_backup_vault.dr.arn
      lifecycle { delete_after = 35 }
    }
  }
}

resource "aws_db_instance" "dr_replica" {
  provider               = aws.dr
  identifier             = "orders-dr-replica"
  replicate_source_db    = aws_db_instance.orders.arn
  instance_class         = "db.r6g.large"
  publicly_accessible    = false
  skip_final_snapshot    = true
}

import * as backup from 'aws-cdk-lib/aws-backup';
import * as rds from 'aws-cdk-lib/aws-rds';
import * as ec2 from 'aws-cdk-lib/aws-ec2';

const plan = backup.BackupPlan.dailyMonthly1YearRetention(this, 'ProdBackup');
plan.addSelection('RdsSelection', {
  resources: [backup.BackupResource.fromRdsDatabaseInstance(db)],
});

// Cross-region replica — deploy db stack in DR region with replicateSourceDb
new rds.DatabaseInstanceReadReplica(this, 'DrReplica', {
  sourceDatabaseInstance: db,
  instanceType: ec2.InstanceType.of(ec2.InstanceClass.R6G, ec2.InstanceSize.LARGE),
  vpc, // required prop; for a cross-region replica this would be the DR region's VPC
});

⚖️ Trade-off

DRS vs refactor: DRS replicates what you have — including technical debt and single-AZ mistakes. Long-term, migrating stateful tiers to Aurora Global, DynamoDB Global Tables, and S3 CRR is cheaper to operate than perpetual block-level replication. Use DRS as a bridge, not a destination.

💡 Pro Tip

Tag every resource with dr-tier=1|2|3 and rto-hours in your IaC modules. During architecture reviews, filter by tier to ensure DR investment matches business criticality — not every Lambda needs cross-region failover.