Databases: RDS, Aurora, DynamoDB & ElastiCache

RDS deep dive

Amazon RDS is managed relational database hosting — PostgreSQL, MySQL, MariaDB, Oracle, and SQL Server on EC2 you never SSH into. AWS handles patching, backups, and Multi-AZ failover; you still own schema design, query tuning, connection pooling, and migration strategy.

Managed vs you-manage

Layer	AWS manages	You manage
Infrastructure	EC2 host, EBS volumes, hypervisor, AZ placement	Instance class selection, storage type (gp3/io1), IOPS provisioning
Engine	Minor version patches (maintenance window), engine binaries	Major version upgrades, extension installs (PostGIS, pgvector), parameter tuning
Availability	Multi-AZ synchronous replication + automatic DNS failover	Application retry logic, connection pool drain, read replica routing
Data	Automated backups to S3, snapshot lifecycle	Schema migrations (Flyway/Liquibase), PITR window policy, cross-region copy for DR
Security	Encryption at rest (KMS), encryption in transit (TLS)	Security groups, IAM DB auth, Secrets Manager rotation, least-privilege DB users

Compare to self-managed PostgreSQL on EC2: you gain operational hours back but lose root access and some exotic extensions. For a typical Spring Boot + JPA workload, RDS is the default starting point.

Graviton instance classes (db.*)

ARM-based Graviton instances deliver 20–40% better price/performance for PostgreSQL and MySQL compared to equivalent Intel/AMD classes. Use the db.t4g.* burstable family for dev/staging and low-traffic services; db.r6g.* / db.r7g.* for memory-heavy production; db.m6g.* for balanced CPU/memory. Always verify your JDBC driver and any native extensions support ARM — most Java stacks do without changes.

db.t4g.* — burstable Graviton for dev/staging (~$0.065–0.129/hr)
db.r6g.large — 2 vCPU / 16 GiB; common production OLTP baseline (~$0.137/hr)
db.r6g.xlarge — 4 vCPU / 32 GiB; heavy JPA + large HikariCP pools (~$0.274/hr)

Multi-AZ vs read replicas

These solve different problems and are often confused on exams and in architecture reviews.

Multi-AZ (synchronous standby) — a hidden standby in another AZ receives every commit synchronously. On primary failure, RDS promotes the standby and updates the DNS endpoint (~60 seconds). Same endpoint URL; apps reconnect. You pay roughly 2× instance + 2× storage. No read traffic to standby.
Read replica (asynchronous) — separate endpoint for read scaling and reporting. Lag is typically sub-second but not zero. Replicas can be promoted to standalone primary (manual failover, minutes). Cross-region replicas exist for DR read access, not automatic failover.
Both together — production pattern: Multi-AZ primary for HA + 1–2 read replicas for @Transactional(readOnly = true) query paths or BI tools.

🎯 Exam Tip

Multi-AZ = high availability (same region, automatic failover, no read scaling). Read replica = read scaling + optional DR (separate endpoint, async lag, manual promotion). If the question says "minimize downtime during AZ failure" → Multi-AZ. If it says "offload read traffic" → read replica.

RDS Proxy for Lambda and ECS

Serverless functions and container fleets open many short-lived connections. PostgreSQL caps concurrent connections at max_connections (on RDS the default scales with instance memory, often a few hundred on smaller instances), and each connection is a separate backend process. RDS Proxy sits between your app and the database, pooling and multiplexing connections.

Lambda: each invocation would otherwise open/close a TCP+TLS+auth handshake — Proxy amortizes that cost
ECS/Fargate: scale tasks without linear connection growth on the DB
IAM authentication: Proxy can enforce IAM DB auth, eliminating static passwords in task definitions
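Failover awareness: Proxy tracks primary changes during Multi-AZ failover, keeps idle client connections open, and routes new requests to the new primary; transactions in flight at the moment of failover can still fail and must be retried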

🔬 Under the Hood

Multi-AZ RDS uses synchronous storage-level replication to the standby — not PostgreSQL streaming replication you configure yourself. The standby is not queryable. When failover triggers, AWS renames network interfaces and updates the CNAME for your endpoint. Existing TCP connections break; connection pools must detect and retry (HikariCP connectionTestQuery or keepaliveTime settings matter here).
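
Because failover breaks existing TCP connections, callers usually wrap database calls in bounded retries. The following is a minimal sketch of a capped exponential backoff schedule with full jitter. The function name, base delay, and cap are illustrative assumptions, not RDS or HikariCP settings.

```typescript
// Hypothetical helper: how long to wait before each retry attempt.
// The random source is injectable so the schedule can be checked deterministically.
function backoffDelaysMs(
  attempts: number,
  baseMs: number,
  capMs: number,
  random: () => number = Math.random,
): number[] {
  const delays: number[] = [];
  for (let attempt = 0; attempt < attempts; attempt++) {
    // Exponential ceiling: base * 2^attempt, never above the cap.
    const ceiling = Math.min(capMs, baseMs * Math.pow(2, attempt));
    // Full jitter: pick a point in [0, ceiling) so clients do not reconnect in lockstep.
    delays.push(Math.floor(random() * ceiling));
  }
  return delays;
}

// With random() pinned to 0.5, each delay is half of its ceiling.
const failoverSchedule = backoffDelaysMs(6, 200, 5000, () => 0.5);
// failoverSchedule is [100, 200, 400, 800, 1600, 2500]
```

Six attempts on this schedule span a few seconds in total, so a real retry budget for a roughly 60-second failover would need more attempts or a higher cap.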

Parameter groups

DB parameters live in DB parameter groups (postgresql.conf as an API resource). Dynamic params apply immediately; static params need a reboot (e.g. shared_preload_libraries). Create a custom group per environment — never edit the default.

Backups and snapshots

Automated backups (1–35 day retention) enable PITR to any second within the window. Manual snapshots persist until deleted — take one before major upgrades. Cross-region snapshot copy is the standard DR pattern; skip final snapshot only on ephemeral dev databases.

💰 RDS Multi-AZ cost comparison

PostgreSQL on Graviton, gp3 storage — us-east-1 list pricing (approximate). Multi-AZ doubles instance and storage.

💰 Cost

Multi-AZ roughly doubles compute and storage — that's the insurance premium for automatic failover. Read replicas add another full instance (async, readable). RDS Proxy adds ~$0.015/vCPU-hour on top. For dev, use single-AZ + shorter backup retention; never run production OLTP single-AZ to save money.
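
To make the "roughly doubles" rule concrete, here is a small sketch that estimates monthly cost for Single-AZ versus Multi-AZ. The instance rate is the approximate db.r6g.large figure quoted earlier; the gp3 storage rate of $0.115/GB-month is an assumption for illustration, so check current pricing before relying on the numbers.

```typescript
// Hours in an average month (8,760 hours per year / 12).
const HOURS_PER_MONTH = 730;

interface RdsCostInput {
  instanceHourlyUsd: number;    // Single-AZ hourly rate for the instance class
  storageGb: number;            // allocated gp3 storage
  storageUsdPerGbMonth: number; // Single-AZ storage rate (assumed value, verify)
}

function monthlyRdsCost(input: RdsCostInput) {
  const singleAz =
    input.instanceHourlyUsd * HOURS_PER_MONTH +
    input.storageGb * input.storageUsdPerGbMonth;
  // Approximation from the text: Multi-AZ doubles both instance and storage.
  const multiAz = singleAz * 2;
  return { singleAz, multiAz, haPremium: multiAz - singleAz };
}

const estimate = monthlyRdsCost({
  instanceHourlyUsd: 0.137,
  storageGb: 100,
  storageUsdPerGbMonth: 0.115,
});
// estimate.singleAz ≈ 111.51 and estimate.multiAz ≈ 223.02 (USD per month)
```

The sketch leaves out backup storage beyond the free allocation, provisioned IOPS, data transfer, and RDS Proxy, all of which are billed separately.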

Provision PostgreSQL Multi-AZ (production baseline)

aws rds create-db-instance \
  --db-instance-identifier order-db-prod \
  --engine postgres \
  --engine-version 16.4 \
  --db-instance-class db.r6g.large \
  --allocated-storage 100 \
  --storage-type gp3 \
  --multi-az \
  --master-username dbadmin \
  --manage-master-user-password \
  --master-user-secret-kms-key-id alias/aws/secretsmanager \
  --vpc-security-group-ids sg-0abc123 \
  --db-subnet-group-name private-db-subnets \
  --backup-retention-period 14 \
  --preferred-backup-window "03:00-04:00" \
  --storage-encrypted \
  --enable-performance-insights \
  --performance-insights-retention-period 7 \
  --deletion-protection

resource "aws_db_instance" "orders" {
  identifier     = "order-db-prod"
  engine         = "postgres"
  engine_version = "16.4"
  instance_class = "db.r6g.large"

  allocated_storage = 100
  storage_type      = "gp3"
  multi_az          = true

  db_subnet_group_name   = aws_db_subnet_group.private.name
  vpc_security_group_ids = [aws_security_group.db.id]

  username                            = "dbadmin"
  manage_master_user_password         = true
  master_user_secret_kms_key_id       = aws_kms_key.secrets.arn

  backup_retention_period = 14
  backup_window           = "03:00-04:00"
  maintenance_window      = "sun:04:00-sun:05:00"

  storage_encrypted               = true
  performance_insights_enabled    = true
  performance_insights_retention_period = 7
  deletion_protection             = true

  parameter_group_name = aws_db_parameter_group.postgres16.name
}

resource "aws_db_parameter_group" "postgres16" {
  family = "postgres16"
  name   = "orders-pg16-prod"

  parameter {
    name  = "log_min_duration_statement"
    value = "1000" # log queries > 1s
  }
}

import * as cdk from 'aws-cdk-lib';
import * as ec2 from 'aws-cdk-lib/aws-ec2';
import * as rds from 'aws-cdk-lib/aws-rds';

const pgParams = new rds.ParameterGroup(this, 'Pg16Params', {
  engine: rds.DatabaseInstanceEngine.postgres({
    version: rds.PostgresEngineVersion.VER_16_4,
  }),
  parameters: { log_min_duration_statement: '1000' },
});

new rds.DatabaseInstance(this, 'OrderDb', {
  engine: rds.DatabaseInstanceEngine.postgres({
    version: rds.PostgresEngineVersion.VER_16_4,
  }),
  instanceType: ec2.InstanceType.of(ec2.InstanceClass.R6G, ec2.InstanceSize.LARGE),
  vpc,
  vpcSubnets: { subnetType: ec2.SubnetType.PRIVATE_WITH_EGRESS },
  multiAz: true,
  allocatedStorage: 100,
  storageType: rds.StorageType.GP3,
  credentials: rds.Credentials.fromGeneratedSecret('dbadmin'),
  parameterGroup: pgParams,
  backupRetention: cdk.Duration.days(14),
  deletionProtection: true,
  storageEncrypted: true,
  enablePerformanceInsights: true,
});

⚠️ Pitfall

Placing RDS in public subnets with publicly_accessible = true to "simplify dev access." Use a bastion host, SSM Session Manager port forwarding, or VPN instead. Even with security group IP restrictions, public RDS endpoints appear in scanners within hours.

Aurora deep dive

Aurora is AWS's cloud-native relational engine — PostgreSQL- and MySQL-compatible wire protocols with a fundamentally different storage architecture. You get up to 15 read replicas sharing one storage volume, fast failover, and optional serverless scaling for spiky workloads.

Shared storage architecture

Unlike RDS where each instance has its own EBS volume, Aurora decouples compute (DB instances) from storage (a distributed, 6-way replicated volume across 3 AZs). Writes go to the storage layer which handles replication — compute nodes are stateless relative to storage.

flowchart TB
  subgraph compute["Aurora cluster (compute)"]
    WRITER["Writer instance\n(primary)"]
    R1["Reader replica 1"]
    R2["Reader replica 2"]
    RN["… up to 15 readers"]
  end

  subgraph storage["Shared storage volume (auto 6× replicated across 3 AZs)"]
    SEG1["Storage segment"]
    SEG2["Storage segment"]
    SEG3["Storage segment"]
  end

  APP["Spring Boot / Lambda"] --> WRITER
  APP --> R1
  WRITER --> storage
  R1 --> storage
  R2 --> storage
  RN --> storage

  storage -->|"continuous backup to S3"| S3["S3 (backup)"]

Storage auto-scales in 10 GB increments up to 128 TiB — no pre-provisioning
6 copies across 3 AZs; Aurora can tolerate loss of 2 copies without write interruption
Storage is billed separately from compute (~$0.10/GB-month for Aurora storage)
Backups are continuous and incremental to S3 — no I/O penalty on the writer like snapshot-based systems
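
The fault-tolerance claim above follows from Aurora's quorum model: writes need 4 of the 6 copies and reads need 3 of 6. A minimal sketch of that arithmetic (the function and constant names are illustrative):

```typescript
const TOTAL_COPIES = 6; // 2 copies in each of 3 AZs
const WRITE_QUORUM = 4; // writes need 4 of 6 copies
const READ_QUORUM = 3;  // reads need 3 of 6 copies

function auroraStorageAvailability(copiesLost: number) {
  const healthy = TOTAL_COPIES - copiesLost;
  return {
    canWrite: healthy >= WRITE_QUORUM,
    canRead: healthy >= READ_QUORUM,
  };
}

// Losing a whole AZ removes 2 copies: the volume stays writable.
const oneAzDown = auroraStorageAvailability(2);
// Losing an AZ plus one more copy removes 3: reads continue, writes stall.
const azPlusOne = auroraStorageAvailability(3);
```

This models only the storage layer; whether the cluster as a whole stays available also depends on having a healthy compute instance to promote.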

🔬 Under the Hood

Aurora replicas read from the same storage volume — replication lag is typically tens of milliseconds, not the seconds you see with binlog/WAL shipping on RDS read replicas. Failover promotes a reader to writer in ~30 seconds by reassigning the cluster endpoint. The storage volume persists unchanged — no data copy step.

Replicas — up to 15

Aurora clusters have exactly one writer and up to 15 reader instances in the same region:

Cluster endpoint — always points to the current writer; use for writes and transactional reads
Reader endpoint — load-balances across all readers; use for read-only queries
Custom endpoints — route specific app tiers to specific replica subsets (e.g. reporting vs API reads)
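Scaling and promotion — Aurora Auto Scaling adds or removes readers based on load; on writer failure Aurora promotes a reader automatically (ordered by promotion tier), and you can also trigger a failover manually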

Aurora Serverless v2 (ACU)

Serverless v2 scales compute in ACUs (1 ACU ≈ 2 GiB RAM). Set min/max (e.g. 0.5–16 ACU); Aurora scales in ~15 seconds. Use for dev/test (min 0.5 ACU) and spiky SaaS traffic; skip for steady OLTP where provisioned instances are cheaper.

⚖️ Trade-off

Aurora vs RDS PostgreSQL: Aurora costs ~20% more per vCPU-hour but gives faster failover, lower-lag replicas, and auto-scaling storage. Choose Aurora when you need many low-lag read replicas, sub-minute failover, or Global Database. Stick with RDS for small workloads, exotic PostgreSQL extensions, or when cost predictability at low scale matters more.

Aurora Global Database

One primary region (read/write) + up to 5 secondary regions (read-only, typically <1 second lag). Cross-region replication uses dedicated infrastructure — not PostgreSQL logical replication you manage. Failover to a secondary region (managed or manual) completes in ~1 minute with RPO typically <1 second. Use for multi-region active-passive DR, not active-active writes (writes still go to one primary region).

MySQL vs PostgreSQL compatibility

Engine	Compatibility	Spring Boot	Watch for
Aurora PostgreSQL	PostgreSQL 11–16 (version-specific)	spring.datasource.driver-class-name=org.postgresql.Driver	Some extensions differ; pgvector supported on recent versions
Aurora MySQL	MySQL 5.7 / 8.0 compatible	com.mysql.cj.jdbc.Driver	MySQL-specific SQL in migrations; InnoDB-only (no MyISAM)

Both support IAM database authentication and Performance Insights. Backtrack (rewinding the cluster to a recent point in time without a restore) is available on Aurora MySQL only. JDBC connection strings use the cluster endpoint: jdbc:postgresql://my-cluster.cluster-abc123.eu-west-1.rds.amazonaws.com:5432/orders

DynamoDB deep dive

DynamoDB is a fully managed NoSQL key-value and document store with single-digit millisecond latency at any scale. There is no connection pool, no schema migration tool from AWS, and no SELECT * WHERE — success depends entirely on access pattern design upfront.

Partition key design

Every item lives in exactly one partition, determined by the partition key (HASH). Optionally add a sort key (RANGE) to store multiple items per partition and enable range queries (BETWEEN, begins_with).

High cardinality — partition keys should spread load evenly (userId, orderId, tenantId+date)
Composite keys — PK=USER#123, SK=ORDER#2024-06-10#456 enables query by prefix
Avoid — status flags, boolean fields, or dates alone as partition keys (hot partitions)
Write sharding — append a random suffix (userId#shard-{0..9}) for write-heavy monotonic keys, query all shards in parallel
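
The write-sharding idea in the last bullet can be sketched as two helpers, one for the write path and one for the read fan-out. The function names and the fixed shard count of 10 are assumptions chosen to match the userId#shard-{0..9} example.

```typescript
const SHARD_COUNT = 10; // matches the {0..9} suffix range in the example key

// Write path: pick a random shard so writes for one logical key spread across partitions.
function shardedWriteKey(baseKey: string, random: () => number = Math.random): string {
  const shard = Math.floor(random() * SHARD_COUNT);
  return `${baseKey}#shard-${shard}`;
}

// Read path: every shard has to be queried (typically in parallel) and the results merged.
function allShardKeys(baseKey: string): string[] {
  return Array.from({ length: SHARD_COUNT }, (_, shard) => `${baseKey}#shard-${shard}`);
}

const writeKey = shardedWriteKey('usr_abc123'); // e.g. "usr_abc123#shard-7"
const readKeys = allShardKeys('usr_abc123');    // 10 keys, shard-0 through shard-9
```

The trade-off is visible in the read path: every read now costs SHARD_COUNT queries, so shard only keys that are genuinely write-hot.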

⚠️ Pitfall

Hot partitions — a celebrity product launch, viral event, or poorly chosen partition key (e.g. status=ACTIVE) concentrates all traffic on one partition. Each partition serves at most about 3,000 RCU and 1,000 WCU. DynamoDB's adaptive capacity can split a hot partition to isolate busy items, but a single key can never exceed those per-partition limits, so excess requests are throttled. Design keys for even distribution from day one; don't rely on splitting to save you.

GSI and LSI

Index	When created	Key schema	Consistency
LSI (Local Secondary Index)	At table creation only	Same partition key, different sort key	Strongly consistent reads supported
GSI (Global Secondary Index)	Anytime	Different partition + optional sort key	Eventually consistent only

GSIs have their own throughput capacity (provisioned mode) or share on-demand billing. Every GSI duplicates projected attributes — watch storage and write amplification (one write to table = one write per GSI).
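
A rough way to see the write amplification is to count write capacity units for an insert. This sketch assumes standard (non-transactional) writes and that every GSI projects ALL attributes, so index entries are about the size of the base item; the function names are illustrative.

```typescript
// Standard write: 1 WCU per 1 KB of item size, rounded up to the next whole KB.
function wcuForWrite(itemSizeBytes: number): number {
  return Math.max(1, Math.ceil(itemSizeBytes / 1024));
}

// Inserting a new item that appears in every GSI: one table write plus one write per index.
// Updates that change an index key attribute cost more (delete + put in that index).
function wcuForInsert(itemSizeBytes: number, gsiCount: number): number {
  return wcuForWrite(itemSizeBytes) * (1 + gsiCount);
}

const smallItemOneGsi = wcuForInsert(400, 1);  // 2 WCU
const largeItemTwoGsi = wcuForInsert(2500, 2); // 9 WCU
```

Narrower projections (KEYS_ONLY or INCLUDE) shrink the index-side cost, which is one reason to avoid projecting ALL by default on wide items.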

On-demand vs provisioned capacity

On-demand — pay per request; no capacity planning; scales instantly; ~5× more expensive at steady high throughput
Provisioned — set RCU/WCU (or use auto scaling); cheaper at predictable load; requires monitoring
Reserved capacity — 1- or 3-year commitment for provisioned mode; significant discount for known baselines

💡 Pro Tip

Start on-demand for new services with unknown traffic. After 2–4 weeks, pull CloudWatch ConsumedReadCapacityUnits / ConsumedWriteCapacityUnits metrics and switch to provisioned + auto scaling when the pattern stabilizes. Use on-demand for spiky, unpredictable workloads permanently.
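
One way to turn those CloudWatch metrics into a starting provisioned value is sketched below. It assumes each data point is a Sum over a fixed period, so dividing by the period length gives units per second. The function name and the 70% target utilization are illustrative choices.

```typescript
// Each input value is a CloudWatch Sum for one period, for example
// ConsumedWriteCapacityUnits over 60 seconds.
function suggestProvisionedUnits(
  consumedSums: number[],
  periodSeconds: number,
  targetUtilization: number, // e.g. 0.7 to run at 70% of provisioned capacity
): number {
  if (consumedSums.length === 0) return 1;
  const peakPerSecond = Math.max(...consumedSums) / periodSeconds;
  // Provision enough that the observed peak sits at the target utilization.
  return Math.max(1, Math.ceil(peakPerSecond / targetUtilization));
}

// Three one-minute samples peaking at 12,000 units/minute (200 units/second).
const suggestedWcu = suggestProvisionedUnits([6000, 12000, 9000], 60, 0.7); // 286
```

Per-minute sums hide sub-minute bursts, so treat the result as a floor for auto scaling's minimum rather than a final number.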

DAX (DynamoDB Accelerator)

In-memory cache cluster that runs in your VPC in front of DynamoDB — microsecond reads for eventually consistent GetItem/Query/Scan (strongly consistent reads pass through to DynamoDB uncached). Write-through cache: writes go to DynamoDB and DAX together. Use for read-heavy, latency-sensitive paths where ElastiCache would require you to manage cache invalidation logic. Not a replacement for ElastiCache session stores — DAX only accelerates DynamoDB API calls.

DynamoDB Streams

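Ordered change log (INSERT, MODIFY, REMOVE); changes to a given item appear in the order they were made, and records are retained for 24 hours. Consumers: Lambda (event source mapping), Kinesis Client Library, or custom. Enable NEW_AND_OLD_IMAGES when you need the previous state for audit or CDC to OpenSearch. Streams are the backbone of denormalization in single-table designs: a Lambda consumer can maintain aggregate or duplicated items asynchronously (GSIs themselves are kept in sync by DynamoDB, not by your code).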

Single-table design basics

Store multiple entity types in one table using overloaded keys and GSIs — one Query fetches a user's orders, profile, and preferences without joins:

{
  "PK": "USER#usr_abc123",
  "SK": "PROFILE",
  "email": "alex.chen@example.com",
  "name": "Alex Chen"
}
{
  "PK": "USER#usr_abc123",
  "SK": "ORDER#2024-06-10#ord_789",
  "total": 149.99,
  "status": "SHIPPED"
}
{
  "PK": "ORDER#ord_789",
  "SK": "METADATA",
  "GSI1PK": "STATUS#SHIPPED",
  "GSI1SK": "2024-06-10#ord_789"
}

Design all access patterns on paper first (AP1: get user by ID, AP2: list orders by status, …), then derive PK/SK/GSI keys. Adding a new access pattern later often requires a new GSI or migration — relational-style schema evolution doesn't apply.
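
To show how the overloaded keys answer different access patterns, here is an in-memory stand-in for a DynamoDB Query with the key condition PK = :pk AND begins_with(SK, :prefix). It does not call DynamoDB; the Item type and queryByPrefix function are hypothetical and exist only to illustrate the key design.

```typescript
interface Item { PK: string; SK: string; [attr: string]: unknown }

// Models a Query: exact match on the partition key, optional prefix match on the sort key.
function queryByPrefix(items: Item[], pk: string, skPrefix = ''): Item[] {
  return items
    .filter((item) => item.PK === pk && item.SK.startsWith(skPrefix))
    // DynamoDB returns results ordered by sort key within a partition.
    .sort((a, b) => (a.SK < b.SK ? -1 : a.SK > b.SK ? 1 : 0));
}

const sampleItems: Item[] = [
  { PK: 'USER#usr_abc123', SK: 'PROFILE', name: 'Alex Chen' },
  { PK: 'USER#usr_abc123', SK: 'ORDER#2024-06-10#ord_789', total: 149.99 },
  { PK: 'ORDER#ord_789', SK: 'METADATA' },
];

// AP: everything about one user (orders sort before the profile because "O" < "P").
const everything = queryByPrefix(sampleItems, 'USER#usr_abc123');
// AP: only that user's orders.
const ordersOnly = queryByPrefix(sampleItems, 'USER#usr_abc123', 'ORDER#');
```

Because the date is embedded in the sort key, the same prefix trick narrows to a month or day (for example the prefix ORDER#2024-06) without a new index.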

Table with GSI (orders by status)

aws dynamodb create-table \
  --table-name Orders \
  --billing-mode PAY_PER_REQUEST \
  --attribute-definitions \
    AttributeName=PK,AttributeType=S \
    AttributeName=SK,AttributeType=S \
    AttributeName=GSI1PK,AttributeType=S \
    AttributeName=GSI1SK,AttributeType=S \
  --key-schema \
    AttributeName=PK,KeyType=HASH \
    AttributeName=SK,KeyType=RANGE \
  --global-secondary-indexes '[
    {
      "IndexName": "GSI1-StatusDate",
      "KeySchema": [
        {"AttributeName": "GSI1PK", "KeyType": "HASH"},
        {"AttributeName": "GSI1SK", "KeyType": "RANGE"}
      ],
      "Projection": {"ProjectionType": "ALL"}
    }
  ]' \
  --stream-specification StreamEnabled=true,StreamViewType=NEW_AND_OLD_IMAGES \
  --sse-specification Enabled=true,SSEType=KMS \
  --tags Key=Service,Value=order-api

resource "aws_dynamodb_table" "orders" {
  name         = "Orders"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "PK"
  range_key    = "SK"

  attribute {
    name = "PK"
    type = "S"
  }
  attribute {
    name = "SK"
    type = "S"
  }
  attribute {
    name = "GSI1PK"
    type = "S"
  }
  attribute {
    name = "GSI1SK"
    type = "S"
  }

  global_secondary_index {
    name            = "GSI1-StatusDate"
    hash_key        = "GSI1PK"
    range_key       = "GSI1SK"
    projection_type = "ALL"
  }

  stream_enabled   = true
  stream_view_type = "NEW_AND_OLD_IMAGES"

  server_side_encryption {
    enabled     = true
    kms_key_arn = aws_kms_key.app.arn
  }

  point_in_time_recovery {
    enabled = true
  }

  tags = { Service = "order-api" }
}

import * as dynamodb from 'aws-cdk-lib/aws-dynamodb';

const table = new dynamodb.Table(this, 'Orders', {
  partitionKey: { name: 'PK', type: dynamodb.AttributeType.STRING },
  sortKey: { name: 'SK', type: dynamodb.AttributeType.STRING },
  billingMode: dynamodb.BillingMode.PAY_PER_REQUEST,
  encryption: dynamodb.TableEncryption.CUSTOMER_MANAGED,
  encryptionKey: kmsKey,
  stream: dynamodb.StreamViewType.NEW_AND_OLD_IMAGES,
  pointInTimeRecovery: true,
});

table.addGlobalSecondaryIndex({
  indexName: 'GSI1-StatusDate',
  partitionKey: { name: 'GSI1PK', type: dynamodb.AttributeType.STRING },
  sortKey: { name: 'GSI1SK', type: dynamodb.AttributeType.STRING },
  projectionType: dynamodb.ProjectionType.ALL,
});

🔒 Security

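Use fine-grained access control (the dynamodb:LeadingKeys IAM condition key) so a caller can only read or write items whose partition key matches its own identity, for example USER#${cognito-identity.amazonaws.com:sub} for Cognito-federated users. Enable point-in-time recovery (PITR) on production tables: it's cheap insurance against bad deploy scripts and accidental writes or deletes. Table deletion protection guards against an accidental DeleteTable.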

ElastiCache

Managed in-memory cache — Redis or Memcached on EC2 you don't operate. ElastiCache sits in your VPC private subnets and offloads read pressure from RDS/Aurora, stores session state, implements rate limiting, and backs Spring's @Cacheable abstraction.

Redis vs Memcached

Feature	Redis	Memcached
Data structures	Strings, hashes, lists, sets, sorted sets, streams, HyperLogLog	Strings only (key-value)
Persistence	Optional (RDB snapshots, AOF)	None — pure cache
Replication	Primary + replicas, Multi-AZ failover	No native replication
Cluster mode	Up to 500 nodes per cluster (default quota 90; for example 500 shards with no replicas, or 83 shards with 5 replicas each)	Client-side consistent hashing across nodes
Typical use	Sessions, rate limits, pub/sub, leaderboards, Spring Cache	Simple object cache, HTML fragments

Default to Redis for new projects — Memcached only when you need simple horizontal scale of identical cache nodes with no persistence requirements and already have client-side sharding logic.

Cluster mode

Redis cluster mode shards data across multiple node groups (shards), each with a primary and optional replicas. The configuration endpoint exposes the cluster topology; cluster-aware clients like Lettuce (Spring Data Redis default) discover it automatically and route each key to the shard that owns its hash slot.

Disabled cluster mode — single shard, one primary + up to 5 replicas; simpler; limited to one node's RAM
Enabled cluster mode — horizontal scale beyond ~100 GiB; required for >250k ops/sec
Replica per shard — enable for read scaling and Multi-AZ failover within the replication group

Multi-AZ and failover

Enable Multi-AZ on a replication group: AWS automatically fails over to a replica in another AZ when the primary fails (~30–60 seconds). Automatic failover requires at least one replica per shard. Application clients must support reconnect — Spring Lettuce handles this with spring.data.redis.lettuce.cluster.refresh.adaptive=true.

Common patterns

Session cache

Store Spring Session data in Redis instead of sticky sessions on ALB. Enables rolling deploys and autoscaling without session loss. Set TTL to match session timeout; use Redis hashes for compact storage.

Rate limiting

Token bucket or sliding window counters in Redis sorted sets. Example: limit API calls to 100/minute per userId with ZADD + ZREMRANGEBYSCORE + ZCARD. Works across all ECS tasks — unlike in-memory Guava rate limiters.
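
The sliding-window logic can be sketched in memory before wiring it to Redis. In this model one array of timestamps per user stands in for one sorted set, and the comments map each step to the Redis command named above. The class name is hypothetical, and a real Redis version would need to run the three steps atomically (for example in a MULTI block or Lua script).

```typescript
class SlidingWindowLimiter {
  // userId -> timestamps (ms) of admitted requests, standing in for a sorted set.
  private readonly hits = new Map<string, number[]>();

  constructor(private readonly limit: number, private readonly windowMs: number) {}

  allow(userId: string, nowMs: number): boolean {
    const windowStart = nowMs - this.windowMs;
    // ZREMRANGEBYSCORE: drop timestamps that have fallen out of the window.
    const recent = (this.hits.get(userId) ?? []).filter((t) => t > windowStart);
    // ZCARD: count what is left; reject when the window is already full.
    const allowed = recent.length < this.limit;
    // ZADD: record this request only if it was admitted.
    if (allowed) recent.push(nowMs);
    this.hits.set(userId, recent);
    return allowed;
  }
}

// 100 requests per minute per user, as in the example above.
const apiLimiter = new SlidingWindowLimiter(100, 60_000);
const firstCallAllowed = apiLimiter.allow('usr_abc123', Date.now()); // true
```

Passing the clock in as an argument keeps the limiter testable; with Redis, the server's time or the caller's timestamp would play the same role as the score.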

Spring Cache integration

spring:
  data:
    redis:
      cluster:
        nodes:
          - orders-cache.abc123.clustercfg.euw1.cache.amazonaws.com:6379
      lettuce:
        cluster:
          refresh:
            adaptive: true
  cache:
    type: redis
    redis:
      time-to-live: 300000  # 5 min — tune per cache name

# application code:
# @Cacheable(value = "products", key = "#sku")
# public Product findBySku(String sku) { ... }

⚠️ Pitfall

Caching entities with JPA lazy-loaded associations — serializing a Product graph to Redis can trigger LazyInitializationException or cache massive object trees. Cache DTOs or explicitly fetch joins; set TTLs; use @CacheEvict on writes.

Provision ElastiCache Redis cluster

aws elasticache create-replication-group \
  --replication-group-id orders-cache \
  --replication-group-description "Order API session + cache" \
  --engine redis \
  --engine-version 7.1 \
  --cache-node-type cache.r7g.large \
  --num-node-groups 2 \
  --replicas-per-node-group 1 \
  --automatic-failover-enabled \
  --multi-az-enabled \
  --cache-subnet-group-name private-cache-subnets \
  --security-group-ids sg-0def456 \
  --at-rest-encryption-enabled \
  --transit-encryption-enabled \
  --auth-token "$(openssl rand -hex 32)"

resource "aws_elasticache_replication_group" "orders" {
  replication_group_id = "orders-cache"
  description          = "Order API session + cache"
  engine               = "redis"
  engine_version       = "7.1"
  node_type            = "cache.r7g.large"

  num_node_groups         = 2
  replicas_per_node_group = 1

  automatic_failover_enabled = true
  multi_az_enabled           = true

  subnet_group_name  = aws_elasticache_subnet_group.private.name
  security_group_ids = [aws_security_group.cache.id]

  at_rest_encryption_enabled = true
  transit_encryption_enabled = true
  auth_token                 = random_password.redis_auth.result

  parameter_group_name = aws_elasticache_parameter_group.redis7.name
}

resource "aws_elasticache_parameter_group" "redis7" {
  family = "redis7"
  name   = "orders-redis7"

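  # Required for a sharded (cluster mode enabled) replication group
  parameter {
    name  = "cluster-enabled"
    value = "yes"
  }

  parameter {
    name  = "maxmemory-policy"
    value = "volatile-lru"
  }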
}

import * as ec2 from 'aws-cdk-lib/aws-ec2';
import * as elasticache from 'aws-cdk-lib/aws-elasticache';

const subnetGroup = new elasticache.CfnSubnetGroup(this, 'CacheSubnets', {
  description: 'Private cache subnets',
  subnetIds: vpc.privateSubnets.map(s => s.subnetId),
  cacheSubnetGroupName: 'private-cache-subnets',
});

new elasticache.CfnReplicationGroup(this, 'OrdersCache', {
  replicationGroupId: 'orders-cache',
  replicationGroupDescription: 'Order API session + cache',
  engine: 'redis',
  engineVersion: '7.1',
  cacheNodeType: 'cache.r7g.large',
  numNodeGroups: 2,
  replicasPerNodeGroup: 1,
  automaticFailoverEnabled: true,
  multiAzEnabled: true,
  cacheSubnetGroupName: subnetGroup.ref, // Ref resolves to the group name and adds the CloudFormation dependency
  securityGroupIds: [cacheSg.securityGroupId],
  atRestEncryptionEnabled: true,
  transitEncryptionEnabled: true,
  authToken: redisAuth.secretValue.unsafeUnwrap(),
});

📦 Real World

Airbnb uses Redis extensively for rate limiting and feature flags at the edge of their service mesh. Teams standardize on ElastiCache Redis with cluster mode rather than self-managed Redis on EC2 — patching and failover become AWS's problem during holiday traffic spikes.

Database selection

Picking the wrong database is expensive to undo. Use this decision framework before provisioning — most Spring Boot teams default to RDS PostgreSQL and only reach for DynamoDB or Aurora when a specific requirement forces the choice.

RDS vs Aurora vs DynamoDB decision matrix

Requirement	RDS	Aurora	DynamoDB
SQL + joins + ad-hoc queries	✅ Best fit	✅ Best fit	❌ Design access patterns upfront
Spring Data JPA / Hibernate	✅ Native	✅ Native (same JDBC)	⚠️ Use Spring Data DynamoDB or SDK v2
Multi-AZ automatic failover	✅ ~60s	✅ ~30s	✅ Built-in (multi-AZ by default)
Read scaling (>5 replicas)	⚠️ Up to 15 async replicas	✅ Up to 15 low-lag readers	✅ Scales automatically
Cross-region DR (<1 min RPO)	⚠️ Cross-region read replica + manual promote	✅ Global Database	✅ Global Tables
Predictable low cost at small scale	✅ Cheapest entry	⚠️ Premium pricing	⚠️ On-demand adds up; provisioned needs tuning
Spiky / serverless traffic	⚠️ Fixed instance size	✅ Serverless v2 ACU	✅ On-demand mode
Single-digit ms at millions of ops/sec	❌ Connection + query overhead	⚠️ Better but still SQL	✅ Purpose-built

⚖️ Trade-off

Polyglot persistence vs one database: microservices often justify DynamoDB for high-throughput services and PostgreSQL for reporting. Operating three engines increases runbook surface. Start monolithic on RDS; split data stores only when metrics prove the need (connection exhaustion, query latency, cost at scale).

Spring Boot + RDS/Aurora guidance

Dependencies and configuration

# build.gradle: implementation 'org.postgresql:postgresql'
#                  implementation 'org.flywaydb:flyway-database-postgresql'

spring:
  datasource:
    url: jdbc:postgresql://${DB_HOST}:5432/orders?sslmode=require
    username: ${DB_USER}
    password: ${DB_PASSWORD}  # from Secrets Manager in prod — never commit
    hikari:
      maximum-pool-size: 20          # size from load tests; the "(cores * 2) + spindles" rule of thumb does not map to RDS
      minimum-idle: 5
      connection-timeout: 5000
      idle-timeout: 600000
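      max-lifetime: 1500000          # 25 min; keep below any idle timeout in the path (RDS Proxy idle client timeout defaults to 30 min)
      keepalive-time: 300000         # ping idle connections every 5 min; JDBC4 drivers such as PostgreSQL need no connection-test-query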
  jpa:
    hibernate:
      ddl-auto: validate             # Flyway owns schema — never update/create in prod
    properties:
      hibernate:
        jdbc:
          batch_size: 50
        order_inserts: true
        order_updates: true
  flyway:
    enabled: true
    baseline-on-migrate: true

Connection math with RDS Proxy

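If you have 20 ECS tasks each with HikariCP pool size 20, that's 400 connections at steady state, and up to double that while old and new tasks overlap during a deploy. Default max_connections on RDS PostgreSQL is LEAST(DBInstanceClassMemory/9531392, 5000), which works out to well over a thousand on a 16 GiB db.r6g.large, but every connection is a backend process that costs memory, so run SHOW max_connections and leave headroom rather than planning to the limit. Put RDS Proxy in front: tasks connect to the proxy endpoint; proxy maintains a smaller pool to the database (~50–100 connections). Set HikariCP maximum-pool-size per task lower (5–10) when using Proxy.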
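
The arithmetic above is easy to script as a pre-deploy sanity check. In this sketch the deploy-overlap factor, the reserved-connection count, and the max_connections value of 800 are illustrative assumptions; read the real limit from the database with SHOW max_connections.

```typescript
interface PoolPlan {
  tasks: number;           // steady-state ECS task count
  poolSizePerTask: number; // HikariCP maximum-pool-size
  deployOverlap: number;   // 2 = old and new tasks both fully connected during a rolling deploy
  reserved: number;        // connections held back for migrations, admin, monitoring
}

function connectionBudget(plan: PoolPlan, maxConnections: number) {
  const steady = plan.tasks * plan.poolSizePerTask;
  const peak = steady * plan.deployOverlap;
  const available = maxConnections - plan.reserved;
  return { steady, peak, available, fits: peak <= available };
}

// 20 tasks x 20 connections straight to the database: 400 steady, 800 at peak.
const directToDb = connectionBudget(
  { tasks: 20, poolSizePerTask: 20, deployOverlap: 2, reserved: 20 }, 800);
// directToDb.fits is false

// Same fleet with pools of 8: 160 steady, 320 at peak.
const smallerPools = connectionBudget(
  { tasks: 20, poolSizePerTask: 8, deployOverlap: 2, reserved: 20 }, 800);
// smallerPools.fits is true
```

With RDS Proxy in front, the task-side totals land on the proxy rather than on the database, so the second plan describes client connections that the proxy multiplexes onto its own smaller pool.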

Read replica routing

Spring doesn't route reads automatically. Options:

Manual — separate @Bean DataSource for reader endpoint; use in @Transactional(readOnly=true) routing aspect
Aurora reader endpoint — JDBC URL pointing at the cluster-ro- reader endpoint for read-only repository methods
Avoid — sending writes or read-after-write queries to replicas (replicas are read-only, and replication lag returns stale data)
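
The routing decision itself is small enough to sketch independently of Spring. This is a hypothetical policy, not a Spring or AWS API: read-only work goes to the reader endpoint unless the session wrote so recently that replica lag could hide its own write.

```typescript
type Endpoint = 'writer' | 'reader';

interface QueryContext {
  readOnly: boolean;        // e.g. derived from @Transactional(readOnly = true)
  msSinceLastWrite: number; // time since this session last wrote
}

function chooseEndpoint(ctx: QueryContext, lagBudgetMs: number): Endpoint {
  // Anything that may write always goes to the writer.
  if (!ctx.readOnly) return 'writer';
  // Read-your-own-writes: stay on the writer until replicas have had time to catch up.
  return ctx.msSinceLastWrite < lagBudgetMs ? 'writer' : 'reader';
}

const reportQuery = chooseEndpoint({ readOnly: true, msSinceLastWrite: 60_000 }, 1_000);   // 'reader'
const justAfterSave = chooseEndpoint({ readOnly: true, msSinceLastWrite: 200 }, 1_000);    // 'writer'
```

The lag budget is a judgment call: set it from observed replica lag (ReplicaLag on RDS, AuroraReplicaLag on Aurora) plus a safety margin.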

Observability checklist

Enable Performance Insights on RDS/Aurora — top SQL, wait events, DB load
CloudWatch alarms: CPUUtilization > 80%, FreeableMemory low, DatabaseConnections near max
Log slow queries via parameter group (log_min_duration_statement)
Trace JDBC with X-Ray or OpenTelemetry JDBC instrumentation

🎯 Exam Tip

When the scenario describes "unpredictable traffic" + "NoSQL" + "single-digit millisecond" → DynamoDB on-demand. When it says "PostgreSQL compatibility" + "fast failover" + "15 read replicas" → Aurora. When it says "existing MySQL app" + "minimal migration" → RDS MySQL with Multi-AZ. ElastiCache is never the primary database — it's always a cache layer.

💡 Pro Tip

Run load tests against RDS before production cutover with the same HikariCP pool settings and connection count you'll use in ECS. Connection storms during deploys (old tasks not drained, new tasks starting) are a common cause of FATAL: too many connections — use ECS deployment circuit breaker and graceful shutdown hooks that close HikariCP on SIGTERM.