Databases: RDS, Aurora, DynamoDB & ElastiCache

Your Spring Boot monolith or microservice fleet needs durable state somewhere. AWS offers four database families that cover the large majority of backend workloads: RDS for managed relational databases you already know, Aurora when PostgreSQL/MySQL needs cloud-native scale, DynamoDB for key-value access patterns at any throughput, and ElastiCache to keep hot data off the primary database. This chapter explains what AWS manages vs what you still own, how failover actually works, and how to pick the right engine without over-engineering.


RDS deep dive

Amazon RDS is managed relational database hosting — PostgreSQL, MySQL, MariaDB, Oracle, and SQL Server on EC2 you never SSH into. AWS handles patching, backups, and Multi-AZ failover; you still own schema design, query tuning, connection pooling, and migration strategy.

Managed vs you-manage

Layer | AWS manages | You manage
Infrastructure | EC2 host, EBS volumes, hypervisor, AZ placement | Instance class selection, storage type (gp3/io1), IOPS provisioning
Engine | Minor version patches (maintenance window), engine binaries | Major version upgrades, extension installs (PostGIS, pgvector), parameter tuning
Availability | Multi-AZ synchronous replication + automatic DNS failover | Application retry logic, connection pool drain, read replica routing
Data | Automated backups to S3, snapshot lifecycle | Schema migrations (Flyway/Liquibase), PITR window policy, cross-region copy for DR
Security | Encryption at rest (KMS), encryption in transit (TLS) | Security groups, IAM DB auth, Secrets Manager rotation, least-privilege DB users

Compare to self-managed PostgreSQL on EC2: you gain operational hours back but lose root access and some exotic extensions. For a typical Spring Boot + JPA workload, RDS is the default starting point.

Graviton instance classes (db.*)

ARM-based Graviton instances deliver 20–40% better price/performance for PostgreSQL and MySQL compared to equivalent Intel/AMD classes. Use the db.t4g.* burstable family for dev/staging and low-traffic services; db.r6g.* / db.r7g.* for memory-heavy production; db.m6g.* for balanced CPU/memory. The switch is transparent to clients: your JDBC driver talks to the same network endpoint regardless of the instance's CPU architecture, so most Java stacks move without changes. Do confirm that the engine version and extensions you need are offered on the Graviton class you pick.

  • db.t4g.* — burstable Graviton for dev/staging (~$0.065–0.129/hr)
  • db.r6g.large — 2 vCPU / 16 GiB; common production OLTP baseline (~$0.225/hr single-AZ PostgreSQL in us-east-1; confirm on the pricing page)
  • db.r6g.xlarge — 4 vCPU / 32 GiB; heavy JPA + large HikariCP pools (~$0.45/hr on the same basis)

Multi-AZ vs read replicas

These solve different problems and are often confused on exams and in architecture reviews.

  • Multi-AZ (synchronous standby) — a hidden standby in another AZ receives every commit synchronously. On primary failure, RDS promotes the standby and updates the DNS endpoint (~60 seconds). Same endpoint URL; apps reconnect. You pay roughly 2× instance + 2× storage. No read traffic to standby.
  • Read replica (asynchronous) — separate endpoint for read scaling and reporting. Lag is typically sub-second but not zero. Replicas can be promoted to standalone primary (manual failover, minutes). Cross-region replicas exist for DR read access, not automatic failover.
  • Both together — production pattern: Multi-AZ primary for HA + 1–2 read replicas for @Transactional(readOnly = true) query paths or BI tools.
🎯 Exam Tip

Multi-AZ = high availability (same region, automatic failover, no read scaling). Read replica = read scaling + optional DR (separate endpoint, async lag, manual promotion). If the question says "minimize downtime during AZ failure" → Multi-AZ. If it says "offload read traffic" → read replica.

RDS Proxy for Lambda and ECS

Serverless functions and container fleets open many short-lived connections. PostgreSQL caps concurrent connections at max_connections (often a few hundred on smaller instances), and each connection is a separate backend process. RDS Proxy sits between your app and the database, pooling and multiplexing connections.

  • Lambda: each invocation would otherwise open/close a TCP+TLS+auth handshake — Proxy amortizes that cost
  • ECS/Fargate: scale tasks without linear connection growth on the DB
  • IAM authentication: Proxy can enforce IAM DB auth, eliminating static passwords in task definitions
  • Failover awareness: Proxy tracks primary changes during Multi-AZ failover and drains in-flight transactions
🔬 Under the Hood

Multi-AZ RDS uses synchronous storage-level replication to the standby, not PostgreSQL streaming replication that you configure yourself. The standby is not queryable. When failover triggers, AWS promotes the standby and repoints the endpoint's DNS record (CNAME) at it. Existing TCP connections break, so connection pools must detect dead connections and retry (HikariCP connectionTestQuery or keepaliveTime settings matter here), and the JVM must not cache DNS lookups indefinitely.
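As a rough illustration of the retry side, the sketch below computes a capped exponential backoff schedule and checks that it spans the ~60 second failover window mentioned earlier. The function names and the 500 ms / 15 s values are assumptions chosen for the example, not HikariCP or AWS SDK settings.

```typescript
// Sketch: capped exponential backoff for reconnect attempts after a failover.
// All names and numbers here are illustrative assumptions.

/** Delay in ms before each retry: baseMs * 2^attempt, never more than capMs. */
function backoffDelaysMs(maxAttempts: number, baseMs: number, capMs: number): number[] {
  const delays: number[] = [];
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    delays.push(Math.min(capMs, baseMs * Math.pow(2, attempt)));
  }
  return delays;
}

/** Total time the schedule keeps retrying; it should exceed the expected failover time. */
function totalWaitMs(delays: number[]): number {
  return delays.reduce((sum, delay) => sum + delay, 0);
}

// 500, 1000, 2000, 4000, 8000, 15000, 15000, 15000 -> 60500 ms in total
const reconnectDelays = backoffDelaysMs(8, 500, 15000);
const coversFailoverWindow = totalWaitMs(reconnectDelays) >= 60000; // true
```

In practice you would usually add random jitter to each delay so that many tasks do not reconnect at the same instant.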

Parameter groups

DB parameters live in DB parameter groups (postgresql.conf as an API resource). Dynamic params apply immediately; static params need a reboot (e.g. shared_preload_libraries). Create a custom group per environment — never edit the default.

Backups and snapshots

Automated backups (1–35 day retention) enable PITR to any second within the window. Manual snapshots persist until deleted — take one before major upgrades. Cross-region snapshot copy is the standard DR pattern; skip final snapshot only on ephemeral dev databases.

💰 RDS Multi-AZ cost comparison

[Interactive cost comparison: Single-AZ vs Multi-AZ vs HA premium] PostgreSQL on Graviton, gp3 storage, us-east-1 list pricing (approximate). Multi-AZ doubles instance and storage.

💰 Cost

Multi-AZ roughly doubles compute and storage — that's the insurance premium for automatic failover. Read replicas add another full instance (async, readable). RDS Proxy adds ~$0.015/vCPU-hour on top. For dev, use single-AZ + shorter backup retention; never run production OLTP single-AZ to save money.
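The doubling rule can be turned into a quick estimate. This is a minimal sketch: the interface and function names are made up for the example, and the prices passed in at the bottom are placeholders, so substitute current list prices for your region before relying on the result.

```typescript
// Sketch: rough monthly RDS cost, single-AZ vs Multi-AZ.
// Prices are inputs on purpose; the sample values below are placeholders, not quotes.

interface RdsCostInput {
  instanceHourlyUsd: number;  // on-demand price for ONE instance
  storageGb: number;          // allocated storage
  storageGbMonthUsd: number;  // price per GB-month of storage
  hoursPerMonth?: number;     // defaults to 730
}

function monthlyCostUsd(input: RdsCostInput, multiAz: boolean): number {
  const hours = input.hoursPerMonth === undefined ? 730 : input.hoursPerMonth;
  const singleAz = input.instanceHourlyUsd * hours + input.storageGb * input.storageGbMonthUsd;
  // Multi-AZ runs a second instance and a second copy of the storage.
  return multiAz ? singleAz * 2 : singleAz;
}

const sampleInput: RdsCostInput = { instanceHourlyUsd: 0.2, storageGb: 100, storageGbMonthUsd: 0.1 };
const singleAzMonthly = monthlyCostUsd(sampleInput, false); // about 156
const multiAzMonthly = monthlyCostUsd(sampleInput, true);   // about 312
const haPremium = multiAzMonthly - singleAzMonthly;         // about 156
```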

Provision PostgreSQL Multi-AZ (production baseline)

bash
aws rds create-db-instance \
  --db-instance-identifier order-db-prod \
  --engine postgres \
  --engine-version 16.4 \
  --db-instance-class db.r6g.large \
  --allocated-storage 100 \
  --storage-type gp3 \
  --multi-az \
  --master-username dbadmin \
  --manage-master-user-password \
  --master-user-secret-kms-key-id alias/aws/secretsmanager \
  --vpc-security-group-ids sg-0abc123 \
  --db-subnet-group-name private-db-subnets \
  --backup-retention-period 14 \
  --preferred-backup-window "03:00-04:00" \
  --storage-encrypted \
  --enable-performance-insights \
  --performance-insights-retention-period 7 \
  --deletion-protection
hcl
resource "aws_db_instance" "orders" {
  identifier     = "order-db-prod"
  engine         = "postgres"
  engine_version = "16.4"
  instance_class = "db.r6g.large"

  allocated_storage = 100
  storage_type      = "gp3"
  multi_az          = true

  db_subnet_group_name   = aws_db_subnet_group.private.name
  vpc_security_group_ids = [aws_security_group.db.id]

  username                            = "dbadmin"
  manage_master_user_password         = true
  master_user_secret_kms_key_id       = aws_kms_key.secrets.arn

  backup_retention_period = 14
  backup_window           = "03:00-04:00"
  maintenance_window      = "sun:04:00-sun:05:00"

  storage_encrypted               = true
  performance_insights_enabled    = true
  performance_insights_retention_period = 7
  deletion_protection             = true

  parameter_group_name = aws_db_parameter_group.postgres16.name
}

resource "aws_db_parameter_group" "postgres16" {
  family = "postgres16"
  name   = "orders-pg16-prod"

  parameter {
    name  = "log_min_duration_statement"
    value = "1000" # log queries > 1s
  }
}
typescript
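import * as cdk from 'aws-cdk-lib';
import * as ec2 from 'aws-cdk-lib/aws-ec2';
import * as rds from 'aws-cdk-lib/aws-rds';

// Assumes `vpc` (an ec2.IVpc) is defined elsewhere in the same stack.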

const pgParams = new rds.ParameterGroup(this, 'Pg16Params', {
  engine: rds.DatabaseInstanceEngine.postgres({
    version: rds.PostgresEngineVersion.VER_16_4,
  }),
  parameters: { log_min_duration_statement: '1000' },
});

new rds.DatabaseInstance(this, 'OrderDb', {
  engine: rds.DatabaseInstanceEngine.postgres({
    version: rds.PostgresEngineVersion.VER_16_4,
  }),
  instanceType: ec2.InstanceType.of(ec2.InstanceClass.R6G, ec2.InstanceSize.LARGE),
  vpc,
  vpcSubnets: { subnetType: ec2.SubnetType.PRIVATE_WITH_EGRESS },
  multiAz: true,
  allocatedStorage: 100,
  storageType: rds.StorageType.GP3,
  credentials: rds.Credentials.fromGeneratedSecret('dbadmin'),
  parameterGroup: pgParams,
  backupRetention: cdk.Duration.days(14),
  deletionProtection: true,
  storageEncrypted: true,
  enablePerformanceInsights: true,
});
⚠️ Pitfall

Placing RDS in public subnets with publicly_accessible = true to "simplify dev access." Use a bastion host, SSM Session Manager port forwarding, or VPN instead. Even with security group IP restrictions, public RDS endpoints appear in scanners within hours.

Aurora deep dive

Aurora is AWS's cloud-native relational engine — PostgreSQL- and MySQL-compatible wire protocols with a fundamentally different storage architecture. You get up to 15 read replicas sharing one storage volume, fast failover, and optional serverless scaling for spiky workloads.

Shared storage architecture

Unlike RDS where each instance has its own EBS volume, Aurora decouples compute (DB instances) from storage (a distributed, 6-way replicated volume across 3 AZs). Writes go to the storage layer which handles replication — compute nodes are stateless relative to storage.

mermaid
flowchart TB
  subgraph compute["Aurora cluster (compute)"]
    WRITER["Writer instance\n(primary)"]
    R1["Reader replica 1"]
    R2["Reader replica 2"]
    RN["… up to 15 readers"]
  end

  subgraph storage["Shared storage volume (auto 6× replicated across 3 AZs)"]
    SEG1["Storage segment"]
    SEG2["Storage segment"]
    SEG3["Storage segment"]
  end

  APP["Spring Boot / Lambda"] --> WRITER
  APP --> R1
  WRITER --> storage
  R1 --> storage
  R2 --> storage
  RN --> storage

  storage -->|"continuous backup to S3"| S3["S3 (backup)"]
  • Storage auto-scales in 10 GB increments up to 128 TiB — no pre-provisioning
  • 6 copies across 3 AZs; Aurora can tolerate loss of 2 copies without write interruption
  • Storage is billed separately from compute (~$0.10/GB-month for Aurora storage)
  • Backups are continuous and incremental to S3 — no I/O penalty on the writer like snapshot-based systems
🔬 Under the Hood

Aurora replicas read from the same storage volume — replication lag is typically tens of milliseconds, not the seconds you see with binlog/WAL shipping on RDS read replicas. Failover promotes a reader to writer in ~30 seconds by reassigning the cluster endpoint. The storage volume persists unchanged — no data copy step.
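The "6 copies across 3 AZs" layout works as a quorum: AWS's published Aurora design uses a write quorum of 4 of 6 copies and a read quorum of 3 of 6. The sketch below encodes just that arithmetic; the function name is hypothetical and it models nothing beyond the copy count.

```typescript
// Sketch: Aurora storage quorum arithmetic (4/6 to write, 3/6 to read).

const AURORA_COPIES = 6;
const WRITE_QUORUM = 4;
const READ_QUORUM = 3;

function auroraStorageState(lostCopies: number): { canWrite: boolean; canRead: boolean } {
  if (Math.floor(lostCopies) !== lostCopies || lostCopies < 0 || lostCopies > AURORA_COPIES) {
    throw new RangeError("lostCopies must be an integer between 0 and " + AURORA_COPIES);
  }
  const healthy = AURORA_COPIES - lostCopies;
  return { canWrite: healthy >= WRITE_QUORUM, canRead: healthy >= READ_QUORUM };
}

const oneAzDown = auroraStorageState(2);       // { canWrite: true,  canRead: true }
const azPlusOneCopy = auroraStorageState(3);   // { canWrite: false, canRead: true }
```

Losing a whole AZ removes two copies, which is why writes continue through an AZ outage.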

Replicas — up to 15

Aurora clusters have exactly one writer and up to 15 reader instances in the same region:

  • Cluster endpoint — always points to the current writer; use for writes and transactional reads
  • Reader endpoint — load-balances across all readers; use for read-only queries
  • Custom endpoints — route specific app tiers to specific replica subsets (e.g. reporting vs API reads)
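  • Promotion — on writer failure Aurora promotes a reader automatically (failover priority tiers decide which one); you can also trigger a failover manually. Aurora Auto Scaling is separate: it adds or removes readers based on load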

Aurora Serverless v2 (ACU)

Serverless v2 scales compute in ACUs (1 ACU ≈ 2 GiB RAM). Set min/max (e.g. 0.5–16 ACU); Aurora scales in ~15 seconds. Use for dev/test (min 0.5 ACU) and spiky SaaS traffic; skip for steady OLTP where provisioned instances are cheaper.

⚖️ Trade-off

Aurora vs RDS PostgreSQL: Aurora costs ~20% more per vCPU-hour but gives faster failover, lower-lag replicas, and auto-scaling storage. Choose Aurora when you need many low-lag read replicas, sub-minute failover, or Global Database. Stick with RDS for small workloads, exotic PostgreSQL extensions, or when cost predictability at low scale matters more.

Aurora Global Database

One primary region (read/write) + up to 5 secondary regions (read-only, typically <1 second lag). Cross-region replication uses dedicated infrastructure — not PostgreSQL logical replication you manage. Failover to a secondary region (managed or manual) completes in ~1 minute with RPO typically <1 second. Use for multi-region active-passive DR, not active-active writes (writes still go to one primary region).

MySQL vs PostgreSQL compatibility

Engine | Compatibility | Spring Boot | Watch for
Aurora PostgreSQL | PostgreSQL 11–16 (version-specific) | spring.datasource.driver-class-name=org.postgresql.Driver | Some extensions differ; pgvector supported on recent versions
Aurora MySQL | MySQL 5.7 / 8.0 compatible | com.mysql.cj.jdbc.Driver | MySQL-specific SQL in migrations; InnoDB-only (no MyISAM)

Both support IAM database authentication and Performance Insights. Backtrack (undo recent changes without a restore) is available on Aurora MySQL only. JDBC connection strings use the cluster endpoint: jdbc:postgresql://my-cluster.cluster-abc123.eu-west-1.rds.amazonaws.com:5432/orders

DynamoDB deep dive

DynamoDB is a fully managed NoSQL key-value and document store with single-digit millisecond latency at any scale. There is no connection pool, no schema migration tool from AWS, and no SELECT * WHERE — success depends entirely on access pattern design upfront.

Partition key design

Every item lives in exactly one partition, determined by the partition key (HASH). Optionally add a sort key (RANGE) to store multiple items per partition and enable range queries (BETWEEN, begins_with).

  • High cardinality — partition keys should spread load evenly (userId, orderId, tenantId+date)
  • Composite keys — PK=USER#123, SK=ORDER#2024-06-10#456 enables query by prefix
  • Avoid — status flags, boolean fields, or dates alone as partition keys (hot partitions)
  • Write sharding — append a random suffix (userId#shard-{0..9}) for write-heavy monotonic keys, query all shards in parallel
⚠️ Pitfall

Hot partitions — a celebrity product launch, viral event, or poorly chosen partition key (e.g. status=ACTIVE) concentrates all traffic on one partition. A single partition serves at most ~3,000 RCU and 1,000 WCU. DynamoDB splits partitions that stay hot, but a split cannot spread the traffic of one key value, so a single hot key stays throttled. Design keys for even distribution from day one; don't rely on splitting to save you.
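The write-sharding bullet above can be sketched as two small helpers: one picks a shard suffix on write, the other lists every shard key that a read has to query and merge. The names and the shard count of 10 are assumptions for illustration; the random source is injectable so the behaviour can be tested.

```typescript
// Sketch: write sharding for a hot logical key (hypothetical helper names).

const SHARD_COUNT = 10; // assumption: matches the {0..9} suffix range used in the example key

/** Key used on write: spreads one logical key over SHARD_COUNT partition keys. */
function shardedWriteKey(logicalKey: string, random: () => number = Math.random): string {
  const shard = Math.floor(random() * SHARD_COUNT);
  return logicalKey + "#shard-" + shard;
}

/** Keys to query on read: every shard must be fetched (ideally in parallel) and merged. */
function allShardKeys(logicalKey: string): string[] {
  const keys: string[] = [];
  for (let shard = 0; shard < SHARD_COUNT; shard++) {
    keys.push(logicalKey + "#shard-" + shard);
  }
  return keys;
}

const sampleWriteKey = shardedWriteKey("user-42", () => 0.35); // "user-42#shard-3"
const sampleReadKeys = allShardKeys("user-42");                 // 10 keys, shard-0 .. shard-9
```

The trade-off is visible in the second helper: every read fans out to all shards, so sharding only pays off when writes dominate.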

GSI and LSI

Index | When created | Key schema | Consistency
LSI (Local Secondary Index) | At table creation only | Same partition key, different sort key | Strongly consistent reads supported
GSI (Global Secondary Index) | Anytime | Different partition + optional sort key | Eventually consistent only

GSIs have their own throughput capacity in provisioned mode and are billed per request in on-demand mode. Every GSI duplicates its projected attributes, so watch storage and write amplification: a table write that touches an indexed or projected attribute costs an additional write on each affected GSI.

On-demand vs provisioned capacity

  • On-demand — pay per request; no capacity planning; scales instantly; ~5× more expensive at steady high throughput
  • Provisioned — set RCU/WCU (or use auto scaling); cheaper at predictable load; requires monitoring
  • Reserved capacity — 1-year or 3-year commitment for provisioned mode; significant discount for known baselines
💡 Pro Tip

Start on-demand for new services with unknown traffic. After 2–4 weeks, pull CloudWatch ConsumedReadCapacityUnits / ConsumedWriteCapacityUnits metrics and switch to provisioned + auto scaling when the pattern stabilizes. Use on-demand for spiky, unpredictable workloads permanently.
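When you do move to provisioned mode, the starting numbers come from the capacity-unit definitions: one RCU covers one strongly consistent read per second of an item up to 4 KB (or two eventually consistent reads), and one WCU covers one write per second of an item up to 1 KB. A minimal sketch of that arithmetic, with hypothetical function names and made-up traffic figures:

```typescript
// Sketch: translate request rates and item size into RCU/WCU.

function requiredRcu(readsPerSecond: number, itemSizeBytes: number, stronglyConsistent: boolean): number {
  const unitsPerRead = Math.ceil(itemSizeBytes / 4096); // reads are billed in 4 KB steps
  const stronglyConsistentUnits = readsPerSecond * unitsPerRead;
  // Eventually consistent reads cost half as much.
  return Math.ceil(stronglyConsistent ? stronglyConsistentUnits : stronglyConsistentUnits / 2);
}

function requiredWcu(writesPerSecond: number, itemSizeBytes: number): number {
  return Math.ceil(writesPerSecond * Math.ceil(itemSizeBytes / 1024)); // writes are billed in 1 KB steps
}

// Made-up workload: 500 reads/s of 6 KB items, 100 writes/s of 2.5 KB items.
const rcuEventual = requiredRcu(500, 6144, false); // 500
const rcuStrong = requiredRcu(500, 6144, true);    // 1000
const wcuNeeded = requiredWcu(100, 2560);          // 300
```

These are baseline figures only; transactional operations and GSI writes add to them.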

DAX (DynamoDB Accelerator)

In-memory cache cluster co-located with DynamoDB — microsecond reads for eventually consistent GetItem/Query/Scan. Write-through cache: writes go to DynamoDB and DAX together. Use for read-heavy, latency-sensitive paths where ElastiCache would require you to manage cache invalidation logic. Not a replacement for ElastiCache session stores — DAX only accelerates DynamoDB API calls.

DynamoDB Streams

Ordered change log (INSERT, MODIFY, REMOVE), with changes to the same item delivered in the order they happened. Consumers: Lambda (event source mapping), Kinesis Client Library, or custom. Enable NEW_AND_OLD_IMAGES when you need the previous state for audit or CDC to OpenSearch. In single-table designs, Streams often drive denormalization: a Lambda consumer updates duplicated attributes or aggregate items asynchronously (DynamoDB maintains GSI projections itself).
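To make the NEW_AND_OLD_IMAGES case concrete, here is a minimal sketch of the logic a stream consumer might run: compare the old and new image of each record and emit a change when an attribute differs. The local types are a simplified subset of the stream record shape, and the status attribute and function names are assumptions for the example.

```typescript
// Sketch: detect status transitions from DynamoDB Streams records.
// Simplified local types: only the fields this sketch reads.

type AttributeMap = { [attr: string]: { S?: string; N?: string } };

interface StreamRecord {
  eventName: "INSERT" | "MODIFY" | "REMOVE";
  dynamodb: { Keys: AttributeMap; NewImage?: AttributeMap; OldImage?: AttributeMap };
}

interface StatusChange { pk: string; from: string | null; to: string | null; }

function readString(image: AttributeMap | undefined, attr: string): string | null {
  const value = image ? image[attr] : undefined;
  return value && value.S !== undefined ? value.S : null;
}

function statusChanges(records: StreamRecord[]): StatusChange[] {
  const changes: StatusChange[] = [];
  for (const record of records) {
    const from = readString(record.dynamodb.OldImage, "status"); // null on INSERT
    const to = readString(record.dynamodb.NewImage, "status");   // null on REMOVE
    if (from !== to) {
      changes.push({ pk: readString(record.dynamodb.Keys, "PK") || "", from: from, to: to });
    }
  }
  return changes;
}
```

A Lambda handler would call this on the batch it receives and then write the audit entry or search-index update for each returned change.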

Single-table design basics

Store multiple entity types in one table using overloaded keys and GSIs — one Query fetches a user's orders, profile, and preferences without joins:

json
{
  "PK": "USER#usr_abc123",
  "SK": "PROFILE",
  "email": "user@example.com",
  "name": "Alex Chen"
}
{
  "PK": "USER#usr_abc123",
  "SK": "ORDER#2024-06-10#ord_789",
  "total": 149.99,
  "status": "SHIPPED"
}
{
  "PK": "ORDER#ord_789",
  "SK": "METADATA",
  "GSI1PK": "STATUS#SHIPPED",
  "GSI1SK": "2024-06-10#ord_789"
}

Design all access patterns on paper first (AP1: get user by ID, AP2: list orders by status, …), then derive PK/SK/GSI keys. Adding a new access pattern later often requires a new GSI or migration — relational-style schema evolution doesn't apply.
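To see why the overloaded keys work, the sketch below mimics a Query in memory: an exact match on PK, an optional begins_with on SK, and results ordered by sort key. It is an illustration of the access pattern only; the function name is hypothetical and no DynamoDB SDK call is involved.

```typescript
// Sketch: in-memory model of Query(PK = :pk AND begins_with(SK, :prefix)).

interface TableItem { PK: string; SK: string; [attr: string]: string | number; }

const tableItems: TableItem[] = [
  { PK: "USER#usr_abc123", SK: "PROFILE", name: "Alex Chen" },
  { PK: "USER#usr_abc123", SK: "ORDER#2024-06-10#ord_789", total: 149.99, status: "SHIPPED" },
  { PK: "ORDER#ord_789", SK: "METADATA", GSI1PK: "STATUS#SHIPPED", GSI1SK: "2024-06-10#ord_789" },
];

function queryPartition(items: TableItem[], pk: string, skPrefix: string = ""): TableItem[] {
  return items
    .filter(item => item.PK === pk && item.SK.slice(0, skPrefix.length) === skPrefix)
    // Query returns items of one partition ordered by sort key (ascending by default).
    .sort((a, b) => (a.SK < b.SK ? -1 : a.SK > b.SK ? 1 : 0));
}

const everythingForUser = queryPartition(tableItems, "USER#usr_abc123");       // order item, then profile
const ordersForUser = queryPartition(tableItems, "USER#usr_abc123", "ORDER#"); // order item only
```

Because the date is embedded in the sort key, the same prefix trick narrows the result to a month with a prefix such as ORDER#2024-06.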

Table with GSI (orders by status)

bash
aws dynamodb create-table \
  --table-name Orders \
  --billing-mode PAY_PER_REQUEST \
  --attribute-definitions \
    AttributeName=PK,AttributeType=S \
    AttributeName=SK,AttributeType=S \
    AttributeName=GSI1PK,AttributeType=S \
    AttributeName=GSI1SK,AttributeType=S \
  --key-schema \
    AttributeName=PK,KeyType=HASH \
    AttributeName=SK,KeyType=RANGE \
  --global-secondary-indexes '[
    {
      "IndexName": "GSI1-StatusDate",
      "KeySchema": [
        {"AttributeName": "GSI1PK", "KeyType": "HASH"},
        {"AttributeName": "GSI1SK", "KeyType": "RANGE"}
      ],
      "Projection": {"ProjectionType": "ALL"}
    }
  ]' \
  --stream-specification StreamEnabled=true,StreamViewType=NEW_AND_OLD_IMAGES \
  --sse-specification Enabled=true,SSEType=KMS \
  --tags Key=Service,Value=order-api
hcl
resource "aws_dynamodb_table" "orders" {
  name         = "Orders"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "PK"
  range_key    = "SK"

  attribute {
    name = "PK"
    type = "S"
  }
  attribute {
    name = "SK"
    type = "S"
  }
  attribute {
    name = "GSI1PK"
    type = "S"
  }
  attribute {
    name = "GSI1SK"
    type = "S"
  }

  global_secondary_index {
    name            = "GSI1-StatusDate"
    hash_key        = "GSI1PK"
    range_key       = "GSI1SK"
    projection_type = "ALL"
  }

  stream_enabled   = true
  stream_view_type = "NEW_AND_OLD_IMAGES"

  server_side_encryption {
    enabled     = true
    kms_key_arn = aws_kms_key.app.arn
  }

  point_in_time_recovery {
    enabled = true
  }

  tags = { Service = "order-api" }
}
typescript
import * as dynamodb from 'aws-cdk-lib/aws-dynamodb';

const table = new dynamodb.Table(this, 'Orders', {
  partitionKey: { name: 'PK', type: dynamodb.AttributeType.STRING },
  sortKey: { name: 'SK', type: dynamodb.AttributeType.STRING },
  billingMode: dynamodb.BillingMode.PAY_PER_REQUEST,
  encryption: dynamodb.TableEncryption.CUSTOMER_MANAGED,
  encryptionKey: kmsKey,
  stream: dynamodb.StreamViewType.NEW_AND_OLD_IMAGES,
  pointInTimeRecovery: true,
});

table.addGlobalSecondaryIndex({
  indexName: 'GSI1-StatusDate',
  partitionKey: { name: 'GSI1PK', type: dynamodb.AttributeType.STRING },
  sortKey: { name: 'GSI1SK', type: dynamodb.AttributeType.STRING },
  projectionType: dynamodb.ProjectionType.ALL,
});
🔒 Security

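Use fine-grained access control (IAM policy conditions on the dynamodb:LeadingKeys condition key) so a caller can only read or write items whose partition key matches its own identity, for example USER#${cognito-identity.amazonaws.com:sub} when requests use Cognito identity pool credentials. Enable point-in-time recovery (PITR) on production tables; it's cheap insurance against accidental DeleteTable or bad deploy scripts.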

ElastiCache

Managed in-memory cache — Redis or Memcached on EC2 you don't operate. ElastiCache sits in your VPC private subnets and offloads read pressure from RDS/Aurora, stores session state, implements rate limiting, and backs Spring's @Cacheable abstraction.

Redis vs Memcached

Feature | Redis | Memcached
Data structures | Strings, hashes, lists, sets, sorted sets, streams, HyperLogLog | Strings only (key-value)
Persistence | Optional (RDB snapshots, AOF) | None (pure cache)
Replication | Primary + replicas, Multi-AZ failover | No native replication
Cluster mode | Up to 500 nodes per cluster, split between shards and their replicas | Client-side consistent hashing across nodes
Typical use | Sessions, rate limits, pub/sub, leaderboards, Spring Cache | Simple object cache, HTML fragments

Default to Redis for new projects — Memcached only when you need simple horizontal scale of identical cache nodes with no persistence requirements and already have client-side sharding logic.

Cluster mode

Redis cluster mode shards data across multiple node groups (shards), each with a primary and optional replicas. Clients connect to the configuration endpoint, discover which shard owns each hash slot, and route commands themselves; Lettuce (the Spring Data Redis default) can refresh that topology automatically.

  • Disabled cluster mode — single shard, one primary + up to 5 replicas; simpler; limited to one node's RAM
  • Enabled cluster mode — horizontal scale beyond ~100 GiB; required for >250k ops/sec
  • Replica per shard — enable for read scaling and Multi-AZ failover within the replication group
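Slot routing itself is simple arithmetic: Redis Cluster assigns every key to one of 16384 hash slots using CRC16 (the XMODEM variant) of the key, or of the part inside {...} when a hash tag is present. The sketch below reimplements that rule for illustration. The function names are my own and it assumes ASCII keys; real clients such as Lettuce do this for you.

```typescript
// Sketch: Redis Cluster key -> hash slot (CRC16/XMODEM mod 16384).

function crc16Xmodem(bytes: number[]): number {
  let crc = 0;
  for (const byte of bytes) {
    crc ^= (byte & 0xff) << 8;
    for (let bit = 0; bit < 8; bit++) {
      crc = (crc & 0x8000) !== 0 ? ((crc << 1) ^ 0x1021) & 0xffff : (crc << 1) & 0xffff;
    }
  }
  return crc;
}

/** Hash tag rule: hash only the text between the first "{" and the next "}", if non-empty. */
function hashedPart(key: string): string {
  const open = key.indexOf("{");
  if (open === -1) return key;
  const close = key.indexOf("}", open + 1);
  if (close === -1 || close === open + 1) return key;
  return key.slice(open + 1, close);
}

function asciiBytes(text: string): number[] {
  const bytes: number[] = [];
  for (let i = 0; i < text.length; i++) bytes.push(text.charCodeAt(i) & 0xff); // ASCII-only assumption
  return bytes;
}

function keySlot(key: string): number {
  return crc16Xmodem(asciiBytes(hashedPart(key))) % 16384;
}

const fooSlot = keySlot("foo"); // 12182
const sameSlot = keySlot("{user1000}.following") === keySlot("{user1000}.followers"); // true
```

Hash tags matter in practice because multi-key commands and transactions only work when all keys land in the same slot.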

Multi-AZ and failover

Enable Multi-AZ on a replication group: AWS automatically fails over to a replica in another AZ when the primary fails (~30–60 seconds). Automatic failover requires at least one replica per shard. Application clients must support reconnect — Spring Lettuce handles this with spring.data.redis.lettuce.cluster.refresh.adaptive=true.

Common patterns

Session cache

Store Spring Session data in Redis instead of sticky sessions on ALB. Enables rolling deploys and autoscaling without session loss. Set TTL to match session timeout; use Redis hashes for compact storage.

Rate limiting

Token bucket or sliding window counters in Redis sorted sets. Example: limit API calls to 100/minute per userId with ZADD + ZREMRANGEBYSCORE + ZCARD. Works across all ECS tasks — unlike in-memory Guava rate limiters.
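The sorted-set approach can be modelled in memory to show the logic. This is a sketch under stated assumptions: the class name is hypothetical, a plain object stands in for Redis, and each step is annotated with the Redis command it corresponds to.

```typescript
// Sketch: sliding-window rate limiter, in-memory stand-in for a Redis sorted set per user.

class SlidingWindowLimiter {
  // userId -> accepted request timestamps in ms (the sorted-set scores)
  private readonly windows: { [userId: string]: number[] } = {};

  constructor(private readonly limit: number, private readonly windowMs: number) {}

  allow(userId: string, nowMs: number): boolean {
    const cutoff = nowMs - this.windowMs;
    // ZREMRANGEBYSCORE key -inf cutoff : drop entries that have left the window
    const recent = (this.windows[userId] || []).filter(ts => ts > cutoff);
    // ZCARD key : count what is still inside the window
    const allowed = recent.length < this.limit;
    // ZADD key now member : record the request only if it was accepted
    if (allowed) recent.push(nowMs);
    this.windows[userId] = recent;
    return allowed;
  }
}

const limiter = new SlidingWindowLimiter(3, 60000); // 3 calls per minute, for a short demo
const firstThree = [limiter.allow("u1", 0), limiter.allow("u1", 1000), limiter.allow("u1", 2000)]; // all true
const fourth = limiter.allow("u1", 3000);      // false: window already holds 3 requests
const afterExpiry = limiter.allow("u1", 60001); // true: the request at t=0 has left the window
```

Against a real Redis the three commands have to run atomically, for example inside MULTI/EXEC or a Lua script, otherwise concurrent tasks can both pass the count check.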

Spring Cache integration

yaml
spring:
  data:
    redis:
      cluster:
        nodes:
          - orders-cache.abc123.clustercfg.euw1.cache.amazonaws.com:6379
      lettuce:
        cluster:
          refresh:
            adaptive: true
  cache:
    type: redis
    redis:
      time-to-live: 300000  # 5 min — tune per cache name

# application code:
# @Cacheable(value = "products", key = "#sku")
# public Product findBySku(String sku) { ... }
⚠️ Pitfall

Caching entities with JPA lazy-loaded associations — serializing a Product graph to Redis can trigger LazyInitializationException or cache massive object trees. Cache DTOs or explicitly fetch joins; set TTLs; use @CacheEvict on writes.

Provision ElastiCache Redis cluster

bash
aws elasticache create-replication-group \
  --replication-group-id orders-cache \
  --replication-group-description "Order API session + cache" \
  --engine redis \
  --engine-version 7.1 \
  --cache-node-type cache.r7g.large \
  --num-node-groups 2 \
  --replicas-per-node-group 1 \
  --automatic-failover-enabled \
  --multi-az-enabled \
  --cache-subnet-group-name private-cache-subnets \
  --security-group-ids sg-0def456 \
  --at-rest-encryption-enabled \
  --transit-encryption-enabled \
  --auth-token "$(openssl rand -hex 32)"  # hex avoids symbols that AUTH tokens reject; store the value in Secrets Manager
hcl
resource "aws_elasticache_replication_group" "orders" {
  replication_group_id = "orders-cache"
  description          = "Order API session + cache"
  engine               = "redis"
  engine_version       = "7.1"
  node_type            = "cache.r7g.large"

  num_node_groups         = 2
  replicas_per_node_group = 1

  automatic_failover_enabled = true
  multi_az_enabled           = true

  subnet_group_name  = aws_elasticache_subnet_group.private.name
  security_group_ids = [aws_security_group.cache.id]

  at_rest_encryption_enabled = true
  transit_encryption_enabled = true
  auth_token                 = random_password.redis_auth.result

  parameter_group_name = aws_elasticache_parameter_group.redis7.name
}

resource "aws_elasticache_parameter_group" "redis7" {
  family = "redis7"
  name   = "orders-redis7"

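  parameter {
    name  = "maxmemory-policy"
    value = "volatile-lru"
  }

  # Needed for cluster mode (num_node_groups > 1) when using a custom parameter group
  parameter {
    name  = "cluster-enabled"
    value = "yes"
  }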
}
typescript
import * as ec2 from 'aws-cdk-lib/aws-ec2';
import * as elasticache from 'aws-cdk-lib/aws-elasticache';

const subnetGroup = new elasticache.CfnSubnetGroup(this, 'CacheSubnets', {
  description: 'Private cache subnets',
  subnetIds: vpc.privateSubnets.map(s => s.subnetId),
  cacheSubnetGroupName: 'private-cache-subnets',
});

new elasticache.CfnReplicationGroup(this, 'OrdersCache', {
  replicationGroupId: 'orders-cache',
  replicationGroupDescription: 'Order API session + cache',
  engine: 'redis',
  engineVersion: '7.1',
  cacheNodeType: 'cache.r7g.large',
  numNodeGroups: 2,
  replicasPerNodeGroup: 1,
  automaticFailoverEnabled: true,
  multiAzEnabled: true,
  cacheSubnetGroupName: subnetGroup.cacheSubnetGroupName!,
  securityGroupIds: [cacheSg.securityGroupId],
  atRestEncryptionEnabled: true,
  transitEncryptionEnabled: true,
  authToken: redisAuth.secretValue.unsafeUnwrap(),
});
📦 Real World

Airbnb uses Redis extensively for rate limiting and feature flags at the edge of their service mesh. Teams standardize on ElastiCache Redis with cluster mode rather than self-managed Redis on EC2 — patching and failover become AWS's problem during holiday traffic spikes.

Database selection

Picking the wrong database is expensive to undo. Use this decision framework before provisioning — most Spring Boot teams default to RDS PostgreSQL and only reach for DynamoDB or Aurora when a specific requirement forces the choice.

RDS vs Aurora vs DynamoDB decision matrix

Requirement | RDS | Aurora | DynamoDB
SQL + joins + ad-hoc queries | ✅ Best fit | ✅ Best fit | ❌ Design access patterns upfront
Spring Data JPA / Hibernate | ✅ Native | ✅ Native (same JDBC) | ⚠️ Use Spring Data DynamoDB or SDK v2
Multi-AZ automatic failover | ✅ ~60s | ✅ ~30s | ✅ Built-in (multi-AZ by default)
Read scaling (>5 replicas) | ⚠️ Up to 15 async replicas | ✅ Up to 15 low-lag readers | ✅ Scales automatically
Cross-region DR (<1 min RPO) | ⚠️ Cross-region read replica + manual promote | ✅ Global Database | ✅ Global Tables
Predictable low cost at small scale | ✅ Cheapest entry | ⚠️ Premium pricing | ⚠️ On-demand adds up; provisioned needs tuning
Spiky / serverless traffic | ⚠️ Fixed instance size | ✅ Serverless v2 ACU | ✅ On-demand mode
Single-digit ms at millions of ops/sec | ❌ Connection + query overhead | ⚠️ Better but still SQL | ✅ Purpose-built
⚖️ Trade-off

Polyglot persistence vs one database: microservices often justify DynamoDB for high-throughput services and PostgreSQL for reporting. Operating three engines increases runbook surface. Start monolithic on RDS; split data stores only when metrics prove the need (connection exhaustion, query latency, cost at scale).

Spring Boot + RDS/Aurora guidance

Dependencies and configuration

yaml
# build.gradle: implementation 'org.postgresql:postgresql'
#                  implementation 'org.flywaydb:flyway-database-postgresql'

spring:
  datasource:
    url: jdbc:postgresql://${DB_HOST}:5432/orders?sslmode=require
    username: ${DB_USER}
    password: ${DB_PASSWORD}  # from Secrets Manager in prod — never commit
    hikari:
      maximum-pool-size: 20          # tune: (CPU cores * 2) + spindle count rule is wrong for RDS
      minimum-idle: 5
      connection-timeout: 5000
      idle-timeout: 600000
      max-lifetime: 1740000          # 29 min; keep below any idle timeout in the path (RDS Proxy defaults to 30 min)
      connection-test-query: SELECT 1
  jpa:
    hibernate:
      ddl-auto: validate             # Flyway owns schema — never update/create in prod
    properties:
      hibernate:
        jdbc:
          batch_size: 50
        order_inserts: true
        order_updates: true
  flyway:
    enabled: true
    baseline-on-migrate: true

Connection math with RDS Proxy

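If you have 20 ECS tasks each with HikariCP pool size 20, that's 400 connections in steady state and up to 800 while old and new tasks overlap during a rolling deploy. RDS PostgreSQL derives the default max_connections from instance memory (LEAST(DBInstanceClassMemory/9531392, 5000)), roughly 1,600–1,700 on a 16 GiB db.r6g.large, but every connection is a backend process that consumes memory, so performance degrades well before the hard limit. Put RDS Proxy in front: tasks connect to the proxy endpoint; proxy maintains a smaller pool to the database (~50–100 connections). Set HikariCP maximum-pool-size per task lower (5–10) when using Proxy.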
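The arithmetic is worth scripting so it can be rerun whenever task counts or pool sizes change. A minimal sketch follows; the names, the 80% headroom, and the deploy overlap factor are assumptions, and the connection limit should come from your own instance (SHOW max_connections) rather than from the sample value used here.

```typescript
// Sketch: peak connection demand of a task fleet vs the database limit.

interface FleetConfig {
  tasks: number;
  poolSizePerTask: number;     // HikariCP maximum-pool-size
  deployOverlapFactor: number; // 1 = steady state, 2 = old and new tasks running side by side
}

function peakConnections(fleet: FleetConfig): number {
  return Math.ceil(fleet.tasks * fleet.poolSizePerTask * fleet.deployOverlapFactor);
}

/** True when peak demand stays within a safety margin of the database limit. */
function fitsWithin(fleet: FleetConfig, maxConnections: number, headroom: number = 0.8): boolean {
  return peakConnections(fleet) <= Math.floor(maxConnections * headroom);
}

const assumedDbLimit = 500; // sample value only
const steadyState = peakConnections({ tasks: 20, poolSizePerTask: 20, deployOverlapFactor: 1 });             // 400
const duringDeploy = peakConnections({ tasks: 20, poolSizePerTask: 20, deployOverlapFactor: 2 });            // 800
const bigPoolsFit = fitsWithin({ tasks: 20, poolSizePerTask: 20, deployOverlapFactor: 2 }, assumedDbLimit);  // false
const smallPoolsFit = fitsWithin({ tasks: 20, poolSizePerTask: 8, deployOverlapFactor: 2 }, assumedDbLimit); // true
```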

Read replica routing

Spring doesn't route reads automatically. Options:

  • Manual — separate @Bean DataSource for reader endpoint; use in @Transactional(readOnly=true) routing aspect
  • Aurora reader endpoint — JDBC URL with -ro cluster endpoint for read-only repository methods
  • Avoid — sending reads that must see a just-committed write to a replica (replication lag returns stale data); writes cannot go to replicas at all because they are read-only
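The routing decision behind these options is small enough to state as code. The sketch below is language-neutral logic written in TypeScript; in a Spring application the same rule would typically live in an AbstractRoutingDataSource lookup or a routing aspect. The type and field names are hypothetical.

```typescript
// Sketch: pick the writer or reader endpoint for a unit of work.

type DbEndpoint = "writer" | "reader";

interface QueryContext {
  readOnly: boolean;           // e.g. derived from @Transactional(readOnly = true)
  wroteInThisRequest: boolean; // the request already committed a write it may need to read back
}

function chooseEndpoint(ctx: QueryContext): DbEndpoint {
  if (!ctx.readOnly) return "writer";          // replicas reject writes
  if (ctx.wroteInThisRequest) return "writer"; // read-your-own-writes: avoid replica lag
  return "reader";
}

const reportQuery = chooseEndpoint({ readOnly: true, wroteInThisRequest: false });    // "reader"
const readAfterWrite = chooseEndpoint({ readOnly: true, wroteInThisRequest: true });  // "writer"
```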

Observability checklist

  • Enable Performance Insights on RDS/Aurora — top SQL, wait events, DB load
  • CloudWatch alarms: CPUUtilization > 80%, FreeableMemory low, DatabaseConnections near max
  • Log slow queries via parameter group (log_min_duration_statement)
  • Trace JDBC with X-Ray or OpenTelemetry JDBC instrumentation
🎯 Exam Tip

When the scenario describes "unpredictable traffic" + "NoSQL" + "single-digit millisecond" → DynamoDB on-demand. When it says "PostgreSQL compatibility" + "fast failover" + "15 read replicas" → Aurora. When it says "existing MySQL app" + "minimal migration" → RDS MySQL with Multi-AZ. ElastiCache is never the primary database — it's always a cache layer.

💡 Pro Tip

Run load tests against RDS before production cutover with the same HikariCP pool settings and connection count you'll use in ECS. Connection storms during deploys (old tasks not drained, new tasks starting) are a common cause of FATAL: too many connections — use ECS deployment circuit breaker and graceful shutdown hooks that close HikariCP on SIGTERM.