Cheat Sheets
Three dense quick references by persona — daily AWS CLI, platform IaC patterns, and architecture decision matrices.
CLI v2
SAA / DVA / SOA
Developer Cheat Sheet
CLI for daily work, IAM snippets, messaging, Logs Insights, common API errors. Chapters: Compute, Storage, Databases, Messaging.
AWS CLI — profile & region
| Task | Command | Notes |
| --- | --- | --- |
| SSO login | aws sso login --profile dev | Preferred over long-lived keys |
| Set profile | export AWS_PROFILE=dev | Or --profile dev per command |
| Caller identity | aws sts get-caller-identity | Verify account + role ARN |
| Default region | aws configure set region eu-west-1 | Override: --region us-east-1 |
| Assume role | aws sts assume-role --role-arn arn:aws:iam::123:role/Dev --role-session-name local | Export temp creds from JSON |
| JSON query | aws ec2 describe-instances --query 'Reservations[].Instances[].InstanceId' --output text | JMESPath on any API |
| Paginate all | aws s3api list-objects-v2 --bucket b --query 'Contents[].Key' --output text | Add --page-size / --starting-token |
| Dry run | aws ec2 terminate-instances --instance-ids i-abc --dry-run | Validates IAM without action |
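The "Export temp creds from JSON" note on the assume-role row can be sketched as a small helper. This is a minimal illustration, assuming the standard `Credentials` object that `aws sts assume-role` prints; the function name `toExportLines` is hypothetical and not part of any AWS tooling.

```typescript
// Relevant part of the JSON printed by `aws sts assume-role`.
interface AssumeRoleOutput {
  Credentials: {
    AccessKeyId: string;
    SecretAccessKey: string;
    SessionToken: string;
    Expiration: string;
  };
}

// Hypothetical helper: turn the JSON text into shell `export` lines.
function toExportLines(assumeRoleJson: string): string[] {
  const { Credentials: c } = JSON.parse(assumeRoleJson) as AssumeRoleOutput;
  return [
    `export AWS_ACCESS_KEY_ID='${c.AccessKeyId}'`,
    `export AWS_SECRET_ACCESS_KEY='${c.SecretAccessKey}'`,
    `export AWS_SESSION_TOKEN='${c.SessionToken}'`,
  ];
}
```

The three variables are the ones the CLI and SDKs read for temporary credentials; they stop working at the `Expiration` time in the same JSON.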
EC2
Task Command
List instances aws ec2 describe-instances --filters "Name=instance-state-name,Values=running" "Name=tag:Environment,Values=prod"
Launch (exam pattern) aws ec2 run-instances --image-id ami-0abc --instance-type t3.micro --subnet-id subnet-xxx --security-group-ids sg-xxx --iam-instance-profile Name=AppRole --count 1
SSM shell (no SSH key) aws ssm start-session --target i-0123456789abcdef0
User data logs sudo tail -f /var/log/cloud-init-output.log
Stop / start aws ec2 stop-instances --instance-ids i-abc · aws ec2 start-instances --instance-ids i-abc
AMI from instance aws ec2 create-image --instance-id i-abc --name "app-golden-$(date +%F)" --no-reboot
Associate EIP aws ec2 associate-address --instance-id i-abc --allocation-id eipalloc-xxx
Spot interruption curl -s http://169.254.169.254/latest/meta-data/spot/termination-time
S3
Task Command
Sync directory aws s3 sync ./dist s3://my-bucket/app/ --delete --sse AES256
Presigned GET (CLI) aws s3 presign s3://my-bucket/obj.pdf --expires-in 3600
Block public access aws s3api put-public-access-block --bucket my-bucket --public-access-block-configuration BlockPublicAcls=true,IgnorePublicAcls=true,BlockPublicPolicy=true,RestrictPublicBuckets=true
Versioning aws s3api put-bucket-versioning --bucket my-bucket --versioning-configuration Status=Enabled
Lifecycle aws s3api put-bucket-lifecycle-configuration --bucket my-bucket --lifecycle-configuration file://lifecycle.json
Encryption default aws s3api put-bucket-encryption --bucket my-bucket --server-side-encryption-configuration '{"Rules":[{"ApplyServerSideEncryptionByDefault":{"SSEAlgorithm":"aws:kms","KMSMasterKeyID":"alias/my-key"}}]}'
Object ACL / ownership aws s3api put-bucket-ownership-controls --bucket my-bucket --ownership-controls 'Rules=[{ObjectOwnership=BucketOwnerEnforced}]'
Restore Glacier aws s3api restore-object --bucket b --key archive.zip --restore-request 'Days=7,GlacierJobParameters={Tier=Standard}'
ECS
Task Command
List clusters aws ecs list-clusters
Run one-off task aws ecs run-task --cluster prod --task-definition myapp:42 --launch-type FARGATE --network-configuration "awsvpcConfiguration={subnets=[subnet-a],securityGroups=[sg-app],assignPublicIp=DISABLED}"
Force new deployment aws ecs update-service --cluster prod --service web --force-new-deployment
Scale service aws ecs update-service --cluster prod --service web --desired-count 4
Task logs aws logs tail /ecs/web --follow --since 1h
Exec into container aws ecs execute-command --cluster prod --task TASK_ARN --container app --interactive --command "/bin/sh"
Register task def aws ecs register-task-definition --cli-input-json file://taskdef.json
Stop task aws ecs stop-task --cluster prod --task TASK_ARN --reason "debug"
RDS
Task Command
Describe instances aws rds describe-db-instances --query 'DBInstances[?DBInstanceStatus==`available`].[DBInstanceIdentifier,Endpoint.Address]' --output table
Create snapshot aws rds create-db-snapshot --db-instance-identifier prod-db --db-snapshot-identifier prod-db-manual-$(date +%F)
Restore to point in time aws rds restore-db-instance-to-point-in-time --source-db-instance-identifier prod-db --target-db-instance-identifier prod-db-restore --restore-time 2026-06-01T12:00:00Z
Modify instance aws rds modify-db-instance --db-instance-identifier prod-db --allocated-storage 200 --apply-immediately
Failover (Multi-AZ) aws rds reboot-db-instance --db-instance-identifier prod-db --force-failover
Parameter group aws rds modify-db-parameter-group --db-parameter-group-name pg15-custom --parameters ParameterName=max_connections,ParameterValue=500,ApplyMethod=pending-reboot
Read DB secret (Secrets Manager) aws secretsmanager get-secret-value --secret-id prod/db/credentials --query SecretString --output text
Lambda
Task Command
Invoke sync aws lambda invoke --function-name ProcessOrder --cli-binary-format raw-in-base64-out --payload file://event.json out.json && cat out.json
Invoke async aws lambda invoke --function-name ProcessOrder --invocation-type Event --cli-binary-format raw-in-base64-out --payload file://event.json /dev/null
Update code (zip) aws lambda update-function-code --function-name ProcessOrder --zip-file fileb://function.zip
Update config aws lambda update-function-configuration --function-name ProcessOrder --memory-size 1024 --timeout 30 --environment Variables={ENV=prod}
Publish version aws lambda publish-version --function-name ProcessOrder
Alias traffic aws lambda update-alias --function-name ProcessOrder --name live --routing-config AdditionalVersionWeights={"2"=0.1}
Tail logs aws logs tail /aws/lambda/ProcessOrder --follow
Concurrency limit aws lambda put-function-concurrency --function-name ProcessOrder --reserved-concurrent-executions 50
IAM policy snippets
S3 read
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Action": ["s3:GetObject", "s3:ListBucket"],
    "Resource": [
      "arn:aws:s3:::my-app-uploads",
      "arn:aws:s3:::my-app-uploads/*"
    ],
    "Condition": {
      "StringEquals": { "aws:RequestedRegion": "eu-west-1" }
    }
  }]
}
SQS send
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Action": ["sqs:SendMessage", "sqs:GetQueueUrl"],
    "Resource": "arn:aws:sqs:eu-west-1:123456789012:orders.fifo",
    "Condition": {
      "Bool": { "aws:SecureTransport": "true" }
    }
  }]
}
DynamoDB item
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Action": [
      "dynamodb:GetItem", "dynamodb:PutItem",
      "dynamodb:UpdateItem", "dynamodb:Query"
    ],
    "Resource": [
      "arn:aws:dynamodb:eu-west-1:123456789012:table/Orders",
      "arn:aws:dynamodb:eu-west-1:123456789012:table/Orders/index/*"
    ]
  }]
}
KMS decrypt
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Action": ["kms:Decrypt", "kms:GenerateDataKey"],
    "Resource": "arn:aws:kms:eu-west-1:123456789012:key/abcd-1234",
    "Condition": {
      "StringEquals": {
        "kms:ViaService": "s3.eu-west-1.amazonaws.com"
      }
    }
  }]
}
Pattern Key elements
Least privilege S3 Separate ListBucket on bucket ARN vs GetObject on /*
Tag-based access "Condition": { "StringEquals": { "aws:ResourceTag/Environment": "dev" } }
Deny non-TLS aws:SecureTransport = false → Deny
Permission boundary Attach managed policy as boundary; role policies cannot exceed it
Trust policy (ECS task) Principal: { "Service": "ecs-tasks.amazonaws.com" } + sts:AssumeRole
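The "Deny non-TLS" pattern is usually applied as a bucket (resource) policy rather than an identity policy. The sketch below builds such a document as a plain object; the function name `denyNonTlsPolicy` and the `Sid` value are illustrative assumptions, and the bucket name is whatever you pass in.

```typescript
// Hypothetical helper: build an S3 bucket policy that rejects any request
// made without TLS, using the aws:SecureTransport condition key.
function denyNonTlsPolicy(bucket: string) {
  return {
    Version: "2012-10-17",
    Statement: [
      {
        Sid: "DenyInsecureTransport", // illustrative statement ID
        Effect: "Deny",
        Principal: "*",
        Action: "s3:*",
        // Cover both the bucket itself and every object in it.
        Resource: [`arn:aws:s3:::${bucket}`, `arn:aws:s3:::${bucket}/*`],
        Condition: { Bool: { "aws:SecureTransport": "false" } },
      },
    ],
  };
}

// Serialize for use with `aws s3api put-bucket-policy --policy file://...`.
const policyJson: string = JSON.stringify(denyNonTlsPolicy("my-app-uploads"), null, 2);
```

Because an explicit Deny overrides any Allow, this statement can sit alongside the least-privilege Allow statements shown earlier without widening access.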
SQS & SNS
Task Command
Send SQS message aws sqs send-message --queue-url https://sqs.eu-west-1.amazonaws.com/123/orders --message-body '{"orderId":"42"}'
FIFO send aws sqs send-message --queue-url .../orders.fifo --message-body '{}' --message-group-id customer-99 --message-deduplication-id $(uuidgen)
Receive / delete aws sqs receive-message --queue-url URL --max-number-of-messages 10 --wait-time-seconds 20 · aws sqs delete-message --queue-url URL --receipt-handle HANDLE
Purge queue aws sqs purge-queue --queue-url URL
DLQ redrive aws sqs start-message-move-task --source-arn arn:aws:sqs:...:orders-dlq --destination-arn arn:aws:sqs:...:orders
SNS publish aws sns publish --topic-arn arn:aws:sns:eu-west-1:123:alerts --message "CPU high" --subject "Alert"
SNS → SQS sub aws sns subscribe --topic-arn TOPIC --protocol sqs --notification-endpoint arn:aws:sqs:...:worker --attributes RawMessageDelivery=true
Filter policy aws sns subscribe ... --attributes FilterPolicy='{"eventType":["order.created"]}'
CloudWatch Logs Insights
Use case Query
Lambda errors (5 min) fields @timestamp, @message | filter @message like /ERROR/ | sort @timestamp desc | limit 50
Lambda duration p99 filter @type = "REPORT" | stats pct(@duration, 99) as p99, avg(@duration) as avg_ms by bin(5m)
ALB 5xx by target parse @message '"target_status_code":*,' as code | filter code >= 500 | stats count() by target_group_arn
ECS OOM fields @timestamp, @message | filter @message like /OutOfMemoryError/ or @message like /OOMKilled/
API Gateway 429 filter status = 429 | stats count() as throttled by bin(1m)
Trace by request ID fields @timestamp, @message | filter @requestId = "abc-123-def"
# Run query from CLI (last 1 hour)
# Note: -v-1H is BSD/macOS date syntax; on GNU/Linux use $(date -u -d '1 hour ago' +%s)
aws logs start-query \
  --log-group-name /aws/lambda/ProcessOrder \
  --start-time $(date -u -v-1H +%s) \
  --end-time $(date -u +%s) \
  --query-string 'fields @timestamp, @message | filter @message like /ERROR/ | limit 20'
# Poll results (QUERY_ID is the queryId returned by start-query)
aws logs get-query-results --query-id QUERY_ID
Common API errors
| Error | Typical cause | Fix |
| --- | --- | --- |
| AccessDenied | Missing IAM action, SCP deny, or resource policy | Check CloudTrail errorMessage; simulate with IAM Policy Simulator |
| AccessDeniedException (KMS) | Key policy or grant missing for role | Add kms:Decrypt on key; check kms:ViaService |
| ThrottlingException | API TPS exceeded (DynamoDB, Lambda, etc.) | Exponential backoff + jitter; raise limits; use SQS buffer |
| ProvisionedThroughputExceededException | DynamoDB RCU/WCU hot partition | On-demand mode; better partition key; DAX |
| InvalidParameterValueException | Bad subnet/SG combo, env var size, Lambda layer | Validate VPC config; shrink env; check layer compat |
| ResourceNotFoundException | Wrong region, deleted resource, typo in ARN | aws sts get-caller-identity; verify region |
| BucketAlreadyExists | Global S3 namespace collision (name taken by another account) | Pick unique bucket name |
| SignatureDoesNotMatch | Clock skew, wrong secret, region mismatch | Sync NTP; verify cred chain; check SigV4 region |
| ExpiredToken | STS session expired | Refresh SSO / re-assume role |
| ServiceUnavailable | Transient AZ or service issue | Retry with backoff; check Health Dashboard |
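The "exponential backoff + jitter" fix for throttling and transient errors can be sketched as follows. This is a minimal illustration of the "full jitter" variant (random delay between zero and an exponentially growing ceiling); the names `backoffDelayMs` and `withRetry` and the default values are assumptions. The AWS SDKs already retry throttling errors on their own, so a wrapper like this mostly matters for custom clients or for logic layered above the SDK.

```typescript
// Delay before retry number `attempt` (0-based), using full jitter:
// a random value in [0, min(capMs, baseMs * 2^attempt)).
function backoffDelayMs(
  attempt: number,
  baseMs: number = 100,
  capMs: number = 20000,
  rand: () => number = Math.random,
): number {
  const ceiling = Math.min(capMs, baseMs * Math.pow(2, attempt));
  return Math.floor(rand() * ceiling);
}

// Run `op`, retrying retryable failures up to `maxAttempts` total attempts.
async function withRetry<T>(
  op: () => Promise<T>,
  isRetryable: (err: unknown) => boolean,
  maxAttempts: number = 5,
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await op();
    } catch (err) {
      // Give up on the last attempt or on errors that will not succeed later.
      if (attempt + 1 >= maxAttempts || !isRetryable(err)) throw err;
      await new Promise<void>((resolve) => setTimeout(resolve, backoffDelayMs(attempt)));
    }
  }
}
```

A caller would pass an `isRetryable` predicate that returns true for errors such as ThrottlingException or ServiceUnavailable and false for AccessDenied, which retrying cannot fix.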
Java — S3 presigned URL (AWS SDK v2)
import software.amazon.awssdk.services.s3.model.GetObjectRequest;
import software.amazon.awssdk.services.s3.presigner.S3Presigner;
import software.amazon.awssdk.services.s3.presigner.model.GetObjectPresignRequest;
import java.time.Duration;

public class PresignExample {
    public static void main(String[] args) {
        // S3Presigner.create() uses the default credentials and region chain
        // (ECS task role in production, SSO profile locally)
        try (S3Presigner presigner = S3Presigner.create()) {
            GetObjectRequest getReq = GetObjectRequest.builder()
                    .bucket("my-app-uploads")
                    .key("reports/q2.pdf")
                    .build();
            GetObjectPresignRequest presignReq = GetObjectPresignRequest.builder()
                    .signatureDuration(Duration.ofMinutes(15))
                    .getObjectRequest(getReq)
                    .build();
            String url = presigner.presignGetObject(presignReq).url().toString();
            // Return url to the browser; no bucket credentials are exposed
            System.out.println(url);
        }
    }
}
Tip Detail
PUT upload presignPutObject(PutObjectPresignRequest) — set Content-Type in signed request
KMS bucket Presigner role needs s3:GetObject + kms:Decrypt on CMK
Expiry Max 7 days (SigV4); prefer 5–15 min for downloads
Spring Inject S3Presigner bean; never embed access keys in application.yml
VPC CIDR planning
| Tier | Recommended CIDR | Usable IPs / subnet | Notes |
| --- | --- | --- | --- |
| VPC (prod) | 10.0.0.0/16 | 65,536 total | One /16 per account/env; avoid overlap for peering/TGW |
| Public subnet (AZ-a) | 10.0.0.0/20 | ~4,091 | ALB, NAT Gateway, bastion (avoid if possible) |
| Public subnet (AZ-b) | 10.0.16.0/20 | ~4,091 | Multi-AZ ALB requires 2+ public subnets |
| Private app (AZ-a) | 10.0.32.0/20 | ~4,091 | ECS tasks, Lambda (VPC), EC2 app tier |
| Private app (AZ-b) | 10.0.48.0/20 | ~4,091 | Spread workloads evenly across AZs |
| Private data (AZ-a) | 10.0.64.0/24 | ~251 | RDS, ElastiCache; no internet route |
| Private data (AZ-b) | 10.0.65.0/24 | ~251 | RDS subnet group spans both |
| Spare / TGW | 10.0.128.0/17 | n/a | Future expansion, VPN, hybrid |
Rule Why
Reserve 3+ AZs worth of /20 app subnets ECS/EKS pod density grows; /24 fills fast with ENIs
Never use 172.31.0.0/16 default VPC CIDR in prod design Exam trap; default VPC is not HA or segmented
Document RFC1918 overlap before TGW peering Overlapping CIDR = no peering path
Enable VPC Flow Logs to S3 (Parquet) for cost at scale CloudWatch Logs expensive for high-volume VPCs
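The "usable IPs" figures in the CIDR table follow from AWS reserving five addresses in every subnet (network address, VPC router, DNS, one reserved for future use, and broadcast). A minimal sketch of that arithmetic, with the hypothetical helper name `usableIps`:

```typescript
// Usable addresses in an AWS subnet: total addresses minus the 5 AWS reserves.
// AWS allows subnet prefixes from /16 to /28.
function usableIps(prefixLength: number): number {
  if (!Number.isInteger(prefixLength) || prefixLength < 16 || prefixLength > 28) {
    throw new RangeError("AWS subnet prefix must be an integer between 16 and 28");
  }
  return Math.pow(2, 32 - prefixLength) - 5;
}

const appSubnetIps = usableIps(20);  // 4096 - 5 = 4091
const dataSubnetIps = usableIps(24); // 256 - 5 = 251
```

Since each Fargate task or awsvpc-mode pod consumes one of these addresses, the gap between 251 and 4,091 is the practical reason for preferring /20 app subnets.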
Security group patterns — ALB → App → DB
| SG | Inbound | Outbound |
| --- | --- | --- |
| sg-alb | 443 from 0.0.0.0/0 (or CloudFront prefix list) | 8080 → sg-app only |
| sg-app | 8080 from sg-alb | 5432 → sg-db; 443 → 0.0.0.0/0 (AWS APIs via NAT) |
| sg-db | 5432 from sg-app only | None required (stateful return traffic) |
| sg-bastion (if used) | 22/443 from corp IP range | 22 → sg-app (prefer SSM instead) |
Anti-pattern Fix
App SG allows 0.0.0.0/0:8080 Reference ALB SG ID as source
DB SG allows entire VPC CIDR Reference app SG only
Internet-facing ALB placed in private subnets Internet-facing ALB needs public subnets; for private access use an internal ALB reached via VPN/Direct Connect
NACL duplicates SG rules Default NACL allow-all; enforce at SG layer
Terraform resource quick ref
VPC module
module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "~> 5.0"

  name            = "prod"
  cidr            = "10.0.0.0/16"
  azs             = ["eu-west-1a", "eu-west-1b", "eu-west-1c"]
  private_subnets = ["10.0.32.0/20", "10.0.48.0/20", "10.0.64.0/20"]
  public_subnets  = ["10.0.0.0/20", "10.0.16.0/20", "10.0.80.0/20"]

  enable_nat_gateway = true
  single_nat_gateway = false # one NAT per AZ = HA, higher cost
  enable_flow_log    = true  # also needs a destination (CloudWatch log group or S3 ARN)
}
ECS Fargate
resource "aws_ecs_service" "web" {
  name            = "web"
  cluster         = aws_ecs_cluster.prod.id
  task_definition = aws_ecs_task_definition.web.arn
  desired_count   = 2
  launch_type     = "FARGATE"

  network_configuration {
    subnets          = module.vpc.private_subnets
    security_groups  = [aws_security_group.app.id]
    assign_public_ip = false
  }

  load_balancer {
    target_group_arn = aws_lb_target_group.web.arn
    container_name   = "app"
    container_port   = 8080
  }

  lifecycle { ignore_changes = [desired_count] } # allow autoscaling
}
RDS
resource "aws_db_instance" "prod" {
  identifier                  = "prod-db"
  engine                      = "postgres"
  engine_version              = "15"
  instance_class              = "db.r6g.large"
  allocated_storage           = 100
  username                    = "app_admin" # placeholder; required for a new instance
  manage_master_user_password = true        # password generated and stored in Secrets Manager
  storage_encrypted           = true
  kms_key_id                  = aws_kms_key.rds.arn
  db_subnet_group_name        = aws_db_subnet_group.prod.name
  vpc_security_group_ids      = [aws_security_group.db.id]
  multi_az                    = true
  backup_retention_period     = 7
  deletion_protection         = true
  skip_final_snapshot         = false
  final_snapshot_identifier   = "prod-db-final"
}
| Resource | Terraform type | Critical attributes |
| --- | --- | --- |
| ALB | aws_lb + aws_lb_target_group | internal, subnets, health check path |
| IAM role | aws_iam_role + aws_iam_role_policy_attachment | Trust policy JSON in assume_role_policy |
| S3 bucket | aws_s3_bucket + separate encryption/public access resources | AWS provider v4+ splits bucket config |
| Lambda | aws_lambda_function | role, vpc_config, environment |
| SQS | aws_sqs_queue + aws_sqs_queue (DLQ) | redrive_policy, visibility_timeout_seconds |
| Secrets | aws_secretsmanager_secret | Reference from ECS task def secrets block |
CDK construct quick ref (TypeScript)
| Need | CDK construct | Package |
| --- | --- | --- |
| VPC + subnets | ec2.Vpc | aws-cdk-lib/aws-ec2 |
| Fargate service + ALB | ecsPatterns.ApplicationLoadBalancedFargateService | aws-cdk-lib/aws-ecs-patterns |
| Lambda + API | apigw.LambdaRestApi or HttpApi | aws-cdk-lib/aws-apigateway (LambdaRestApi); aws-cdk-lib/aws-apigatewayv2 + aws-cdk-lib/aws-apigatewayv2-integrations (HttpApi) |
| RDS Postgres | rds.DatabaseInstance or ServerlessCluster (Aurora) | aws-cdk-lib/aws-rds |
| SQS queue | sqs.Queue with deadLetterQueue | aws-cdk-lib/aws-sqs |
| Event-driven Lambda | lambda.Function + SqsEventSource | aws-cdk-lib/aws-lambda + aws-cdk-lib/aws-lambda-event-sources |
| Secrets in task | ecs.Secret.fromSecretsManager | aws-cdk-lib/aws-ecs |
| WAF on ALB | wafv2.CfnWebACL + CfnWebACLAssociation | aws-cdk-lib/aws-wafv2 |
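// Minimal Fargate + ALB pattern
// Imports go at the top of the stack file:
import * as ec2 from 'aws-cdk-lib/aws-ec2';
import * as ecs from 'aws-cdk-lib/aws-ecs';
import * as ecsPatterns from 'aws-cdk-lib/aws-ecs-patterns';

// The rest goes inside a Stack constructor, where `this` is the stack:
const vpc = new ec2.Vpc(this, 'Vpc', { maxAzs: 2, natGateways: 1 });
new ecsPatterns.ApplicationLoadBalancedFargateService(this, 'Service', {
  vpc,
  taskImageOptions: {
    image: ecs.ContainerImage.fromRegistry('nginx'),
    containerPort: 80, // the stock nginx image listens on 80; use 8080 for your own app image
    environment: { ENV: 'prod' },
  },
  publicLoadBalancer: true,
  desiredCount: 2,
});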
NAT Gateway cost formula
Component Rate (approx.) Formula
Hourly charge ~$0.045/hr per NAT hours × $0.045 × NAT_count
Data processing ~$0.045/GB GB_processed × $0.045
Monthly baseline (1 NAT) ~$33/mo idle 730 × 0.045 = 32.85
3 NATs (HA per AZ) ~$99/mo idle Before any data transfer
Total NAT cost ≈ (NAT_count × 730 × hourly_rate) + (GB_processed × data_processing_rate)
Cost reduction Trade-off
Single NAT Gateway (single_nat_gateway = true) AZ failure loses outbound for all private subnets
VPC endpoints (S3, DynamoDB gateway — free) Eliminates NAT GB for those services
Interface endpoints (ECR, Secrets Manager, SSM) ~$7/mo/AZ/endpoint; cheaper at high NAT volume
NAT instance (self-managed) Ops burden; not exam default
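The NAT cost formula above can be checked with a few lines of arithmetic. This sketch takes the approximate rates from the table as defaults; they are not authoritative prices, and the function name `natMonthlyCostUsd` is hypothetical.

```typescript
const HOURS_PER_MONTH = 730; // average month length used in the table

// Monthly NAT Gateway cost: fixed hourly charge per gateway plus a per-GB
// data processing charge on all traffic that passes through.
function natMonthlyCostUsd(
  natCount: number,
  gbProcessed: number,
  hourlyRate: number = 0.045, // approximate rate from the table
  perGbRate: number = 0.045,  // approximate rate from the table
): number {
  return natCount * HOURS_PER_MONTH * hourlyRate + gbProcessed * perGbRate;
}

const singleIdle = natMonthlyCostUsd(1, 0);     // 730 × 0.045 ≈ 32.85
const threeWith500Gb = natMonthlyCostUsd(3, 500); // 98.55 + 22.50 ≈ 121.05
```

At these rates the fixed charge dominates until traffic reaches several hundred GB per month, which is why the single-NAT trade-off is attractive for small environments.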
Cost optimization checklist
# Action Tool / service
1 Right-size EC2/RDS with 2 weeks of metrics Compute Optimizer, Cost Explorer
2 Buy Savings Plans / RIs for steady baseline 1-yr compute SP for Fargate + Lambda
3 S3 lifecycle → IA → Glacier → expire Lifecycle rules on versioned buckets
4 Delete idle EIPs, unattached EBS, old snapshots Trusted Advisor, AWS Config rules
5 Enable S3 Intelligent-Tiering for unknown access Monitoring fee vs storage savings
6 Lambda: tune memory (affects CPU + cost) Power Tuning tool
7 CloudWatch log retention 7–30 days prod Never infinite retention
8 Use Graviton (ARM) where supported ~20% cheaper t4g/r6g/m6g
9 Tag everything: Environment, Team, CostCenter Cost allocation tags + budgets
10 Schedule dev/staging off-hours Instance Scheduler, ASG desired=0
Multi-AZ production checklist
| Layer | Requirement | Verify |
| --- | --- | --- |
| VPC | ≥2 AZs, subnets per AZ | aws ec2 describe-subnets --filters Name=vpc-id,Values=vpc-xxx |
| ALB | Subnets in ≥2 AZs | Targets healthy in each AZ |
| ECS/EKS | spread or distinctInstance placement | Tasks/pods across AZs |
| RDS | multi_az = true | Synchronous standby in second AZ |
| NAT | 1 per AZ (strict HA) or 1 shared (cost) | Route tables per private subnet → local NAT |
| Route 53 | Health checks + failover/weighted records | DNS TTL ≤ 60s for failover |
| S3 | Cross-region replication for critical data | CRR rule + RTC if needed |
| Backups | Automated snapshots + cross-region copy | AWS Backup plan |
| Secrets | Replicated secrets (multi-region) | Secrets Manager replication |
| Runbook | AZ failure game day documented | Simulate AZ isolate in staging |
ECS task definition essentials
| Field | Value / pattern | Why |
| --- | --- | --- |
| requiresCompatibilities | ["FARGATE"] | Serverless; no EC2 capacity management |
| networkMode | awsvpc | Required for Fargate; each task gets ENI |
| cpu / memory | Valid Fargate pairs (e.g. 512/1024, 1024/2048) | Invalid combo fails registration |
| taskRoleArn | App runtime permissions (S3, DynamoDB) | SDK uses this, not the execution role |
| executionRoleArn | AmazonECSTaskExecutionRolePolicy + ECR pull | Pull image, write logs, read secrets |
| logConfiguration | awslogs driver → /ecs/service-name | Centralized in CloudWatch |
| healthCheck | ["CMD", "curl", "-f", "http://localhost:8080/actuator/health"] (curl must exist in the image) | Separate from ALB target health; lets dependent containers wait for HEALTHY |
| secrets | valueFrom Secrets Manager ARN | Never plain env for passwords |
| readonlyRootFilesystem | true + tmp volume | Security hardening |
{
  "family": "web",
  "networkMode": "awsvpc",
  "requiresCompatibilities": ["FARGATE"],
  "cpu": "512",
  "memory": "1024",
  "taskRoleArn": "arn:aws:iam::123:role/WebTaskRole",
  "executionRoleArn": "arn:aws:iam::123:role/WebExecutionRole",
  "containerDefinitions": [{
    "name": "app",
    "image": "123.dkr.ecr.eu-west-1.amazonaws.com/web:1.2.3",
    "portMappings": [{ "containerPort": 8080, "protocol": "tcp" }],
    "essential": true,
    "logConfiguration": {
      "logDriver": "awslogs",
      "options": {
        "awslogs-group": "/ecs/web",
        "awslogs-region": "eu-west-1",
        "awslogs-stream-prefix": "app"
      }
    },
    "secrets": [
      { "name": "DB_PASSWORD", "valueFrom": "arn:aws:secretsmanager:eu-west-1:123:secret:prod/db:password::" }
    ]
  }]
}
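Since an invalid cpu/memory pair fails registration, it can help to check the pair before calling register-task-definition. The sketch below encodes the Fargate size table as I understand it from the ECS documentation; treat the exact ranges as an assumption to verify against the current docs. The function names are hypothetical.

```typescript
// Memory options in MiB for a given Fargate CPU value (in CPU units).
function fargateMemoryOptionsMiB(cpu: number): number[] {
  // Inclusive range of whole GiB values, converted to MiB.
  const gibRange = (from: number, to: number, step: number): number[] => {
    const out: number[] = [];
    for (let gib = from; gib <= to; gib += step) out.push(gib * 1024);
    return out;
  };
  switch (cpu) {
    case 256: return [512, 1024, 2048];
    case 512: return gibRange(1, 4, 1);
    case 1024: return gibRange(2, 8, 1);
    case 2048: return gibRange(4, 16, 1);
    case 4096: return gibRange(8, 30, 1);
    case 8192: return gibRange(16, 60, 4);
    case 16384: return gibRange(32, 120, 8);
    default: return []; // not a Fargate CPU size
  }
}

// Task definitions carry cpu and memory as strings, e.g. "512" and "1024".
function isValidFargateSize(cpu: string, memory: string): boolean {
  return fargateMemoryOptionsMiB(Number(cpu)).indexOf(Number(memory)) !== -1;
}
```

For the task definition above, `isValidFargateSize("512", "1024")` returns true, while a pair such as 512 CPU units with 8192 MiB is rejected.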
Compute selection — ECS vs EKS vs Lambda vs EC2
Criterion ECS Fargate EKS Lambda EC2
Ops burden Low High (control plane + nodes) Lowest Highest
Max duration Unlimited Unlimited 15 min Unlimited
Cold start Task pull ~30s Pod schedule ~30s ms–seconds Always warm if running
Scaling Service auto scaling HPA + CA + Karpenter Automatic concurrency ASG manual/custom
Portability Docker only K8s manifests (multi-cloud) Function per trigger Any workload
Best for AWS-native microservices, Java/Spring Existing K8s teams, operators, service mesh Event-driven, API thin layer, cron Legacy, GPU, custom kernel, license-bound
Cost at steady load Medium Medium–high High at sustained TPS Lowest with RIs/SP
Exam default “Containers without managing servers” “Kubernetes on AWS” “Event-driven, no server management” “Full OS control”
Database selection — RDS vs Aurora vs DynamoDB
Criterion RDS Aurora DynamoDB
Model Managed relational (PG, MySQL, etc.) MySQL/PostgreSQL compatible, distributed storage NoSQL key-value / document
Scale up Vertical (instance class) Up to 128 TB storage auto-grow Horizontal (on-demand unlimited)
Read scale Read replicas (async) Up to 15 low-lag replicas DAX for microsecond cache
Multi-AZ Synchronous standby 6 copies across 3 AZs storage layer Multi-AZ by default
Consistency Strong (single instance) Strong primary; eventual on replicas Eventually consistent by default; strongly consistent reads optional
Joins / SQL Full SQL Full SQL No joins; design access patterns first
Best for Standard OLTP, lift-and-shift High-throughput OLTP, global databases Key-value, sessions, gaming, IoT, single-digit ms
Exam trigger “Managed PostgreSQL” “5× throughput”, “global database” “Massive scale”, “millisecond latency”, “serverless”
Messaging selection — SQS vs SNS vs EventBridge vs Kinesis
Criterion SQS SNS EventBridge Kinesis
Pattern Queue (pull, 1 consumer group) Pub/sub fan-out (push) Event bus (routing rules) Stream (ordered shards)
Ordering FIFO queues only No order guarantee (FIFO topic = order) No strict order Per partition key
Retention 1 min – 14 days No retention (transient) Replay archive optional 1–365 days
Throughput Standard: nearly unlimited; FIFO: 300 TPS (3,000 msg/s with batching) Very high publish High; cross-account Shard-limited; auto scaling
Consumers Workers, Lambda pollers HTTP, Lambda, SQS, email, SMS Lambda, Step Functions, SaaS targets Analytics, Lambda, Firehose
Best for Decouple, buffer, DLQ retry Alerts, multi-subscriber notify Event-driven architecture, schema registry Clickstream, logs, real-time analytics
Exam trigger “Decouple microservices” “Fan-out notifications” “Route events from AWS services” “Real-time processing ordered records”
Disaster recovery — RTO / RPO strategies
Strategy RPO RTO AWS pattern Cost
Backup & restore Hours (backup interval) Hours–days AMI/EBS/RDS snapshots → restore in DR region $
Pilot light Minutes–hours Hours Minimal DR: data replicated live (RDS cross-region replica), AMIs copied, app tier off until failover, Route 53 health failover $$
Warm standby Minutes Minutes–hours Scaled-down full stack in DR; ASG min=1 $$$
Active-active Near-zero Near-zero Multi-region ALB + Aurora Global + DynamoDB Global Tables $$$$
S3 CRR + RTC Minutes Minutes Cross-Region Replication; Replication Time Control SLA Per-GB replication
RDS cross-region read replica Seconds–minutes (async) Promote replica ~minutes promote-read-replica in DR region Replica instance cost
Well-Architected Framework — six pillars
Pillar Focus Key AWS services / practices
Operational Excellence Run and monitor systems CloudWatch, Systems Manager, runbooks, IaC, CI/CD
Security Protect data and systems IAM, KMS, GuardDuty, SCPs, encryption, least privilege
Reliability Recover from failure Multi-AZ, auto scaling, backups, chaos testing
Performance Efficiency Use resources efficiently Right-sizing, caching (ElastiCache, CloudFront), Graviton
Cost Optimization Avoid unnecessary spend Cost Explorer, SP/RIs, lifecycle, tagging, delete idle
Sustainability Minimize environmental impact Graviton efficiency, serverless, region selection, density
Review question Pillar
“How do you detect unauthorized API calls?” Security — CloudTrail + GuardDuty
“How do you survive AZ failure?” Reliability — Multi-AZ + health checks
“How do you reduce NAT data charges?” Cost — VPC endpoints
“How do you deploy without downtime?” Operational Excellence — blue/green, CodeDeploy
Exam service limits (memorize)
Service Limit Exam note
Lambda timeout 15 minutes max Long jobs → Step Functions or ECS
Lambda /tmp storage 512 MB – 10,240 MB Ephemeral; not durable
Lambda env vars 4 KB total Large config → SSM/Secrets Manager
Lambda concurrent executions 1,000 default/account/region Request increase; use SQS buffer
API Gateway REST timeout 29 seconds Long poll → async Lambda + webhook
ALB idle timeout Default 60s (max 4000s) WebSockets need higher idle
NLB Layer 4, static IP, ultra-low latency Not HTTP routing — use ALB for path/host
SQS visibility timeout 0 – 12 hours Must exceed Lambda max processing time
SQS message size 256 KB Larger payloads → S3 pointer in body
SNS message size 256 KB Same S3 extended client pattern
EBS volume max 64 TiB (gp3) Snapshot → new volume for resize
RDS backup retention 1–35 days Automated backups for PITR
VPC per region Default 5 (soft limit) Request increase via support
Security groups per ENI 5 (can increase to 16) Prefer SG referencing over CIDR sprawl
IAM policy size 6,144 chars managed; 2,048 inline user Split policies; use permission sets
CloudFront origin Any HTTP origin + S3 Edge caching; signed URLs/cookies
Route 53 health check HTTP/HTTPS/TCP/CloudWatch alarm Failover routing policy
DynamoDB item size 400 KB max Large docs → S3 + pointer attribute
Kinesis shard 1 MB/s or 1,000 records/s ingest Hot shard → better partition key
ECS Fargate task ENI One ENI per task (awsvpc) IP exhaustion → larger subnets
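The "S3 pointer in body" pattern for oversized SQS/SNS messages starts with a size check in bytes, not characters. A minimal sketch, assuming the 256 KB limit from the table above and ignoring message attributes (which also count toward the limit); the names here are hypothetical.

```typescript
const MAX_MESSAGE_BYTES = 256 * 1024; // limit taken from the table above

// Size of a string once encoded as UTF-8, which is what counts on the wire.
function utf8ByteLength(text: string): number {
  return new TextEncoder().encode(text).length;
}

// True when the body is too large to send inline and should instead be
// written to S3, with only a pointer (bucket + key) placed in the message.
function needsS3Pointer(body: string, limitBytes: number = MAX_MESSAGE_BYTES): boolean {
  return utf8ByteLength(body) > limitBytes;
}
```

Measuring bytes matters because multi-byte characters make a string larger on the wire than its character count suggests.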
Networking cost comparison
Traffic path Charged? Typical rate Optimization
Internet → CloudFront → S3 CloudFront egress only ~$0.085/GB (varies by region) OAI/OAC; cache hit ratio
Internet → ALB → EC2 ALB LCU + EC2 egress LCU ~$0.008/hr + data CloudFront in front of static
AZ-a → AZ-b (same region) Yes — cross-AZ ~$0.01/GB each direction Keep chatter in same AZ where possible
Private subnet → internet (via NAT) NAT hourly + $/GB ~$0.045/GB processed VPC endpoints for AWS APIs
S3 gateway endpoint Free $0 Route S3/DynamoDB without NAT
Interface endpoint (ECR, SM) Hourly + data ~$0.01/hr/AZ + $0.01/GB Break-even vs NAT data charges at roughly 200 GB/mo per endpoint per AZ (at these rates)
Region → region (S3 CRR) Replication + storage ~$0.02/GB inter-region CRR only for DR-critical buckets
On-prem → AWS (Direct Connect) Port hours + data 1 Gbps ~$0.30/hr + transfer DX + VPN backup; no NAT for hybrid
VPC peering Same-region data transfer ~$0.01/GB Transitive routing needs TGW
Transit Gateway Attachment hourly + data ~$0.05/hr/attach + $0.02/GB Hub for many VPCs; replaces full mesh peering
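The break-even point between an interface endpoint and NAT data processing can be worked out from the approximate rates in the table. This is a sketch under those assumed rates, with a hypothetical function name; it ignores the NAT hourly charge, which is paid either way if a NAT Gateway stays in place for other traffic.

```typescript
// Monthly GB at which an interface endpoint's fixed cost is paid back by the
// lower per-GB rate compared with sending the same traffic through NAT.
function endpointBreakEvenGb(
  azCount: number = 1,
  endpointHourly: number = 0.01, // per AZ, approximate
  endpointPerGb: number = 0.01,  // approximate
  natPerGb: number = 0.045,      // approximate
  hoursPerMonth: number = 730,
): number {
  const fixedMonthly = endpointHourly * azCount * hoursPerMonth;
  const savingPerGb = natPerGb - endpointPerGb;
  // If the endpoint is not cheaper per GB, it never breaks even.
  if (savingPerGb <= 0) return Infinity;
  return fixedMonthly / savingPerGb;
}

const oneAz = endpointBreakEvenGb(1); // 7.30 / 0.035 ≈ 208.6 GB
const twoAz = endpointBreakEvenGb(2); // 14.60 / 0.035 ≈ 417.1 GB
```

The result scales linearly with the number of AZs and endpoints, so the break-even volume should be computed per service rather than assumed once for the whole VPC.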
Architecture decision checklist
# Question If yes → consider
1 Spiky unpredictable traffic? Lambda + API GW, or Fargate with aggressive scaling
2 Existing Kubernetes investment? EKS (or ECS if team prefers simpler AWS-native)
3 Need global low-latency reads? Aurora Global, DynamoDB Global Tables, CloudFront
4 Strict ordering required? SQS FIFO, Kinesis partition key, or single-thread consumer
5 Audit every API call? CloudTrail org trail → S3 + Athena; GuardDuty enabled
6 PCI/HIPAA data? KMS CMK, encryption in transit, private subnets, no public S3
7 Batch / ETL heavy? Step Functions + Glue, or EMR, not Lambda chains
8 Multi-account governance? Organizations + SCPs + Control Tower + IAM Identity Center
9 Zero-downtime deploys? CodeDeploy blue/green (ECS/ASG), weighted Route 53
10 Cost ceiling for startup? Serverless-first; single NAT; GP3 EBS; 7-day log retention
🎯 Exam Tip
When the question says “decouple” → SQS. “Fan-out” → SNS (+ SQS subs). “Route AWS service events” → EventBridge. “Real-time analytics stream” → Kinesis. “No server management” → Lambda or Fargate. “Lowest ops containers” → ECS Fargate over EKS unless K8s is explicit.