Databases: RDS, Aurora, DynamoDB & ElastiCache
Your Spring Boot monolith or microservice fleet needs durable state somewhere. AWS offers four database families that cover most backend workloads: RDS for managed relational databases you already know, Aurora when PostgreSQL/MySQL needs cloud-native scale, DynamoDB for key-value access patterns at any throughput, and ElastiCache to keep hot data off the primary database. This chapter explains what AWS manages versus what you still own, how failover actually works, and how to pick the right engine without over-engineering.
RDS deep dive
Amazon RDS is managed relational database hosting — PostgreSQL, MySQL, MariaDB, Oracle, and SQL Server on EC2 you never SSH into. AWS handles patching, backups, and Multi-AZ failover; you still own schema design, query tuning, connection pooling, and migration strategy.
Managed vs you-manage
| Layer | AWS manages | You manage |
|---|---|---|
| Infrastructure | EC2 host, EBS volumes, hypervisor, AZ placement | Instance class selection, storage type (gp3/io1), IOPS provisioning |
| Engine | Minor version patches (maintenance window), engine binaries | Major version upgrades, extension installs (PostGIS, pgvector), parameter tuning |
| Availability | Multi-AZ synchronous replication + automatic DNS failover | Application retry logic, connection pool drain, read replica routing |
| Data | Automated backups to S3, snapshot lifecycle | Schema migrations (Flyway/Liquibase), PITR window policy, cross-region copy for DR |
| Security | Encryption at rest (KMS), encryption in transit (TLS) | Security groups, IAM DB auth, Secrets Manager rotation, least-privilege DB users |
Compare to self-managed PostgreSQL on EC2: you gain operational hours back but lose root access and some exotic extensions. For a typical Spring Boot + JPA workload, RDS is the default starting point.
Graviton instance classes (db.*)
ARM-based Graviton instances deliver 20–40% better price/performance for PostgreSQL and MySQL compared to equivalent Intel/AMD classes. Use the db.t4g.* burstable family for dev/staging and low-traffic services; db.r6g.* / db.r7g.* for memory-heavy production; db.m6g.* for balanced CPU/memory. Always verify your JDBC driver and any native extensions support ARM — most Java stacks do without changes.
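- db.t4g.medium / db.t4g.large: burstable Graviton for dev/staging (~$0.065–0.129/hr)
- db.r6g.large: 2 vCPU / 16 GiB; common production OLTP baseline (~$0.225/hr)
- db.r6g.xlarge: 4 vCPU / 32 GiB; heavy JPA + large HikariCP pools (~$0.45/hr)
Prices are approximate single-AZ PostgreSQL on-demand list prices for us-east-1; check the current pricing page before budgeting.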
Multi-AZ vs read replicas
These solve different problems and are often confused on exams and in architecture reviews.
- Multi-AZ (synchronous standby): a hidden standby in another AZ receives every commit synchronously. On primary failure, RDS promotes the standby and repoints the DNS endpoint, typically within 60–120 seconds. The endpoint URL stays the same; apps reconnect. You pay roughly 2× instance + 2× storage. The standby serves no read traffic.
- Read replica (asynchronous) — separate endpoint for read scaling and reporting. Lag is typically sub-second but not zero. Replicas can be promoted to standalone primary (manual failover, minutes). Cross-region replicas exist for DR read access, not automatic failover.
- Both together — production pattern: Multi-AZ primary for HA + 1–2 read replicas for @Transactional(readOnly = true) query paths or BI tools.
Multi-AZ = high availability (same region, automatic failover, no read scaling). Read replica = read scaling + optional DR (separate endpoint, async lag, manual promotion). If the question says "minimize downtime during AZ failure" → Multi-AZ. If it says "offload read traffic" → read replica.
RDS Proxy for Lambda and ECS
Serverless functions and container fleets open many short-lived connections. PostgreSQL caps concurrent connections at max_connections (often a few hundred on smaller instances), and every open connection costs server memory. RDS Proxy sits between your app and the database, pooling and multiplexing connections.
- Lambda: each invocation would otherwise open/close a TCP+TLS+auth handshake — Proxy amortizes that cost
- ECS/Fargate: scale tasks without linear connection growth on the DB
- IAM authentication: Proxy can enforce IAM DB auth, eliminating static passwords in task definitions
- Failover awareness: Proxy tracks primary changes during Multi-AZ failover and drains in-flight transactions
Multi-AZ RDS uses synchronous storage-level replication to the standby, not PostgreSQL streaming replication you configure yourself. The standby is not queryable. When failover triggers, AWS repoints the DNS record (CNAME) for your endpoint at the promoted standby. Existing TCP connections break; connection pools must detect dead connections and retry (HikariCP connectionTestQuery or keepaliveTime settings matter here, as does a short JVM DNS cache TTL so the new address is picked up quickly).
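To make the "detect and retry" step concrete, here is a minimal sketch of a capped exponential backoff loop around a database call. It is not how HikariCP or any JDBC driver retries internally; `backoffDelayMs`, `runWithRetry`, and the caller-supplied `wait` callback are hypothetical names, and the loop is synchronous only to keep the example small.

```typescript
// Sketch only: capped exponential backoff for retrying a call after a failover.
// All names are illustrative assumptions, not part of any AWS or HikariCP API.

// Delay before retry number `attempt` (0-based): base * 2^attempt, capped at capMs.
function backoffDelayMs(attempt: number, baseMs: number, capMs: number): number {
  return Math.min(capMs, baseMs * Math.pow(2, attempt));
}

// Runs `op` up to `maxAttempts` times. `wait` is supplied by the caller
// (for example a sleep) so the policy can be tested without real timers.
function runWithRetry<T>(
  op: () => T,
  maxAttempts: number,
  wait: (delayMs: number) => void,
  baseMs: number = 200,
  capMs: number = 5000
): T {
  let lastError: unknown = new Error("maxAttempts must be at least 1");
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return op();
    } catch (err) {
      lastError = err;
      // No point waiting after the final attempt.
      if (attempt < maxAttempts - 1) {
        wait(backoffDelayMs(attempt, baseMs, capMs));
      }
    }
  }
  throw lastError;
}
```

With the defaults above, five attempts wait 200, 400, 800, and 1,600 ms between tries, which stays well inside a typical failover window while avoiding a tight reconnect loop. Production code would usually add random jitter so that many tasks do not retry in lockstep.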
Parameter groups
DB parameters live in DB parameter groups (postgresql.conf as an API resource). Dynamic params apply immediately; static params need a reboot (e.g. shared_preload_libraries). Create a custom group per environment — never edit the default.
Backups and snapshots
Automated backups (1–35 day retention) enable PITR to any second within the window, up to the latest restorable time (usually within the last five minutes). Manual snapshots persist until deleted; take one before major upgrades. Cross-region snapshot copy is the standard DR pattern; skip the final snapshot only on ephemeral dev databases.
💰 RDS Multi-AZ cost comparison
PostgreSQL on Graviton, gp3 storage — us-east-1 list pricing (approximate). Multi-AZ doubles instance and storage.
Multi-AZ roughly doubles compute and storage — that's the insurance premium for automatic failover. Read replicas add another full instance (async, readable). RDS Proxy adds ~$0.015/vCPU-hour on top. For dev, use single-AZ + shorter backup retention; never run production OLTP single-AZ to save money.
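The doubling rule can be turned into a quick estimate. The sketch below is a rough calculator, not AWS's billing logic: the `RdsCostInput` shape and `monthlyRdsCostUsd` are hypothetical, the 730 hours per month is a common approximation, and it ignores backup storage, I/O, data transfer, and RDS Proxy.

```typescript
// Sketch only: rough monthly RDS estimate. Prices are inputs you look up yourself.
interface RdsCostInput {
  instanceHourlyUsd: number;    // single-AZ on-demand price for the chosen class
  storageGb: number;            // allocated storage
  storageUsdPerGbMonth: number; // single-AZ storage price
  multiAz: boolean;             // true doubles instance and storage
  readReplicas: number;         // each replica is another full instance + storage
}

const HOURS_PER_MONTH = 730;

function monthlyRdsCostUsd(c: RdsCostInput): number {
  const oneInstance =
    c.instanceHourlyUsd * HOURS_PER_MONTH + c.storageGb * c.storageUsdPerGbMonth;
  const primary = (c.multiAz ? 2 : 1) * oneInstance;
  const replicas = c.readReplicas * oneInstance;
  return primary + replicas;
}
```

For example, with an assumed $0.20/hr instance and 100 GB at $0.10/GB-month, single-AZ comes to about $156/month, Multi-AZ to about $312, and Multi-AZ plus one read replica to about $468.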
Provision PostgreSQL Multi-AZ (production baseline)
AWS CLI:
aws rds create-db-instance \
  --db-instance-identifier order-db-prod \
  --engine postgres \
  --engine-version 16.4 \
  --db-instance-class db.r6g.large \
  --allocated-storage 100 \
  --storage-type gp3 \
  --multi-az \
  --master-username dbadmin \
  --manage-master-user-password \
  --master-user-secret-kms-key-id alias/aws/secretsmanager \
  --vpc-security-group-ids sg-0abc123 \
  --db-subnet-group-name private-db-subnets \
  --backup-retention-period 14 \
  --preferred-backup-window "03:00-04:00" \
  --storage-encrypted \
  --enable-performance-insights \
  --performance-insights-retention-period 7 \
  --deletion-protection
Terraform:
resource "aws_db_instance" "orders" {
  identifier                            = "order-db-prod"
  engine                                = "postgres"
  engine_version                        = "16.4"
  instance_class                        = "db.r6g.large"
  allocated_storage                     = 100
  storage_type                          = "gp3"
  multi_az                              = true
  db_subnet_group_name                  = aws_db_subnet_group.private.name
  vpc_security_group_ids                = [aws_security_group.db.id]
  username                              = "dbadmin"
  manage_master_user_password           = true
  master_user_secret_kms_key_id         = aws_kms_key.secrets.arn
  backup_retention_period               = 14
  backup_window                         = "03:00-04:00"
  maintenance_window                    = "sun:04:00-sun:05:00"
  storage_encrypted                     = true
  performance_insights_enabled          = true
  performance_insights_retention_period = 7
  deletion_protection                   = true
  parameter_group_name                  = aws_db_parameter_group.postgres16.name
}

resource "aws_db_parameter_group" "postgres16" {
  family = "postgres16"
  name   = "orders-pg16-prod"

  parameter {
    name  = "log_min_duration_statement"
    value = "1000" # log queries > 1s
  }
}
AWS CDK (TypeScript):
import * as cdk from 'aws-cdk-lib'; // needed for cdk.Duration below
import * as ec2 from 'aws-cdk-lib/aws-ec2';
import * as rds from 'aws-cdk-lib/aws-rds';

// Assumes this runs inside a Stack where `vpc` (an ec2.IVpc) is already defined.
const pgParams = new rds.ParameterGroup(this, 'Pg16Params', {
  engine: rds.DatabaseInstanceEngine.postgres({
    version: rds.PostgresEngineVersion.VER_16_4,
  }),
  parameters: { log_min_duration_statement: '1000' }, // log queries > 1s
});

new rds.DatabaseInstance(this, 'OrderDb', {
  engine: rds.DatabaseInstanceEngine.postgres({
    version: rds.PostgresEngineVersion.VER_16_4,
  }),
  instanceType: ec2.InstanceType.of(ec2.InstanceClass.R6G, ec2.InstanceSize.LARGE),
  vpc,
  vpcSubnets: { subnetType: ec2.SubnetType.PRIVATE_WITH_EGRESS },
  multiAz: true,
  allocatedStorage: 100,
  storageType: rds.StorageType.GP3,
  credentials: rds.Credentials.fromGeneratedSecret('dbadmin'),
  parameterGroup: pgParams,
  backupRetention: cdk.Duration.days(14),
  deletionProtection: true,
  storageEncrypted: true,
  enablePerformanceInsights: true,
});
Anti-pattern: placing RDS in public subnets with publicly_accessible = true to "simplify dev access." Use a bastion host, SSM Session Manager port forwarding, or VPN instead. Even with security group IP restrictions, public RDS endpoints appear in scanners within hours.
Aurora deep dive
Aurora is AWS's cloud-native relational engine — PostgreSQL- and MySQL-compatible wire protocols with a fundamentally different storage architecture. You get up to 15 read replicas sharing one storage volume, fast failover, and optional serverless scaling for spiky workloads.
Shared storage architecture
Unlike RDS where each instance has its own EBS volume, Aurora decouples compute (DB instances) from storage (a distributed, 6-way replicated volume across 3 AZs). Writes go to the storage layer which handles replication — compute nodes are stateless relative to storage.
flowchart TB
  subgraph compute["Aurora cluster (compute)"]
    WRITER["Writer instance\n(primary)"]
    R1["Reader replica 1"]
    R2["Reader replica 2"]
    RN["… up to 15 readers"]
  end
  subgraph storage["Shared storage volume (auto 6× replicated across 3 AZs)"]
    SEG1["Storage segment"]
    SEG2["Storage segment"]
    SEG3["Storage segment"]
  end
  APP["Spring Boot / Lambda"] --> WRITER
  APP --> R1
  WRITER --> storage
  R1 --> storage
  R2 --> storage
  RN --> storage
  storage -->|"continuous backup to S3"| S3["S3 (backup)"]
- Storage auto-scales in 10 GB increments up to 128 TiB — no pre-provisioning
- 6 copies across 3 AZs; Aurora can tolerate loss of 2 copies without write interruption
- Storage is billed separately from compute (~$0.10/GB-month for Aurora storage)
- Backups are continuous and incremental to S3 — no I/O penalty on the writer like snapshot-based systems
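The "tolerates loss of 2 copies" figure follows from quorum arithmetic. The sketch below assumes the quorum sizes from Aurora's published design (writes need 4 of 6 copies, reads need 3 of 6); `storageAvailability` is a hypothetical helper for reasoning about failures, not an AWS API.

```typescript
// Sketch only: quorum arithmetic for a 6-copy volume, assuming a 4/6 write
// quorum and a 3/6 read quorum as described in Aurora's published design.
const TOTAL_COPIES = 6;
const WRITE_QUORUM = 4;
const READ_QUORUM = 3;

function storageAvailability(lostCopies: number): { canWrite: boolean; canRead: boolean } {
  const healthy = TOTAL_COPIES - lostCopies;
  return {
    canWrite: healthy >= WRITE_QUORUM, // still writable with up to 2 copies lost
    canRead: healthy >= READ_QUORUM,   // still readable with up to 3 copies lost
  };
}
```

Losing a whole AZ removes two copies, which leaves four and keeps the volume writable. Losing an AZ plus one more copy leaves three, so reads continue while writes wait for repair.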
Aurora replicas read from the same storage volume — replication lag is typically tens of milliseconds, not the seconds you see with binlog/WAL shipping on RDS read replicas. Failover promotes a reader to writer in ~30 seconds by reassigning the cluster endpoint. The storage volume persists unchanged — no data copy step.
Replicas — up to 15
Aurora clusters have exactly one writer and up to 15 reader instances in the same region:
- Cluster endpoint — always points to the current writer; use for writes and transactional reads
- Reader endpoint — load-balances across all readers; use for read-only queries
- Custom endpoints — route specific app tiers to specific replica subsets (e.g. reporting vs API reads)
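- Promotion: when the writer fails, Aurora automatically promotes a reader, chosen by failover priority tier; you can also trigger a failover manually. Aurora Auto Scaling is a separate feature that adds or removes readers based on load.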
Aurora Serverless v2 (ACU)
Serverless v2 scales compute in ACUs (1 ACU ≈ 2 GiB RAM). Set min/max (e.g. 0.5–16 ACU); Aurora scales in ~15 seconds. Use for dev/test (min 0.5 ACU) and spiky SaaS traffic; skip for steady OLTP where provisioned instances are cheaper.
Aurora vs RDS PostgreSQL: Aurora costs ~20% more per vCPU-hour but gives faster failover, low-lag replicas on shared storage, and auto-scaling storage. Choose Aurora when you need many low-lag read replicas, sub-minute failover, or Global Database. Stick with RDS for small workloads, exotic PostgreSQL extensions, or when cost predictability at low scale matters more.
Aurora Global Database
One primary region (read/write) + up to 5 secondary regions (read-only, typically <1 second lag). Cross-region replication uses dedicated infrastructure — not PostgreSQL logical replication you manage. Failover to a secondary region (managed or manual) completes in ~1 minute with RPO typically <1 second. Use for multi-region active-passive DR, not active-active writes (writes still go to one primary region).
MySQL vs PostgreSQL compatibility
| Engine | Compatibility | Spring Boot | Watch for |
|---|---|---|---|
| Aurora PostgreSQL | PostgreSQL 11–16 (version-specific) | spring.datasource.driver-class-name=org.postgresql.Driver | Some extensions differ; pgvector supported on recent versions |
| Aurora MySQL | MySQL 5.7 / 8.0 compatible | com.mysql.cj.jdbc.Driver | MySQL-specific SQL in migrations; InnoDB-only (no MyISAM) |
Both support IAM database authentication and Performance Insights. Backtrack (rewinding recent changes without a restore) is available on Aurora MySQL only. JDBC connection strings use the cluster endpoint: jdbc:postgresql://my-cluster.cluster-abc123.eu-west-1.rds.amazonaws.com:5432/orders
DynamoDB deep dive
DynamoDB is a fully managed NoSQL key-value and document store with single-digit millisecond latency at any scale. There is no connection pool, no schema migration tool from AWS, and no SELECT * WHERE — success depends entirely on access pattern design upfront.
Partition key design
Every item lives in exactly one partition, determined by the partition key (HASH). Optionally add a sort key (RANGE) to store multiple items per partition and enable range queries (BETWEEN, begins_with).
- High cardinality — partition keys should spread load evenly (userId, orderId, tenantId+date)
- Composite keys — PK=USER#123, SK=ORDER#2024-06-10#456 enables query by prefix
- Avoid — status flags, boolean fields, or dates alone as partition keys (hot partitions)
- Write sharding — append a random suffix (userId#shard-{0..9}) for write-heavy monotonic keys, query all shards in parallel
Hot partitions: a celebrity product launch, viral event, or poorly chosen partition key (e.g. status=ACTIVE) concentrates all traffic on one partition. Each partition serves at most about 3,000 RCU and 1,000 WCU per second. DynamoDB splits partitions automatically under sustained load, but a split cannot help when the traffic targets a single partition key value. Design keys for even distribution from day one; don't rely on splitting to save you.
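Write sharding can be sketched in a few lines. This is one possible variation under stated assumptions: it derives the shard from a hash of some per-item value (so a given item can be found again) rather than a purely random suffix, and `shardOf`, `shardedKey`, and `allShardKeys` are hypothetical helper names.

```typescript
// Sketch only: spread one hot logical key across N physical partition keys.

// Deterministic shard number from a per-item value such as an order ID.
// A random shard also works when you never need to locate a single item.
function shardOf(spreadValue: string, shardCount: number): number {
  let hash = 0;
  for (let i = 0; i < spreadValue.length; i++) {
    hash = (hash * 31 + spreadValue.charCodeAt(i)) % 1000003;
  }
  return hash % shardCount;
}

// Physical partition key, following the "#shard-N" suffix convention.
function shardedKey(baseKey: string, shard: number): string {
  return baseKey + "#shard-" + shard;
}

// Readers must query every shard (ideally in parallel) and merge the results.
function allShardKeys(baseKey: string, shardCount: number): string[] {
  const keys: string[] = [];
  for (let s = 0; s < shardCount; s++) {
    keys.push(shardedKey(baseKey, s));
  }
  return keys;
}
```

A writer would store an item under `shardedKey("EVENT#launch", shardOf(orderId, 10))`, while a reader issues one Query per entry of `allShardKeys("EVENT#launch", 10)`. The trade-off is explicit: writes spread across ten partitions, reads fan out tenfold.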
GSI and LSI
| Index | When created | Key schema | Consistency |
|---|---|---|---|
| LSI (Local Secondary Index) | At table creation only | Same partition key, different sort key | Strongly consistent reads supported |
| GSI (Global Secondary Index) | Anytime | Different partition + optional sort key | Eventually consistent only |
GSIs have their own throughput capacity (provisioned mode) or are billed per request alongside the table (on-demand mode). Every GSI duplicates projected attributes, so watch storage and write amplification: each table write that touches an indexed or projected attribute also costs a write on every affected GSI.
On-demand vs provisioned capacity
- On-demand — pay per request; no capacity planning; scales instantly; ~5× more expensive at steady high throughput
- Provisioned — set RCU/WCU (or use auto scaling); cheaper at predictable load; requires monitoring
- Reserved capacity: 1- or 3-year commitment for provisioned mode; significant discount for known baselines
Start on-demand for new services with unknown traffic. After 2–4 weeks, pull CloudWatch ConsumedReadCapacityUnits / ConsumedWriteCapacityUnits metrics and switch to provisioned + auto scaling when the pattern stabilizes. Use on-demand for spiky, unpredictable workloads permanently.
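When sizing provisioned mode, the capacity-unit arithmetic is worth writing down. The sketch below uses DynamoDB's documented unit sizes (one RCU covers one strongly consistent read per second of up to 4 KB, or two eventually consistent reads; one WCU covers one write per second of up to 1 KB); the function names are hypothetical and transactional requests, which cost double, are left out.

```typescript
// Sketch only: provisioned capacity needed for a steady request rate.

function readCapacityUnits(
  readsPerSecond: number,
  itemSizeKb: number,
  stronglyConsistent: boolean
): number {
  const unitsPerRead = Math.ceil(itemSizeKb / 4); // reads are billed in 4 KB steps
  const total = readsPerSecond * unitsPerRead;
  // Eventually consistent reads cost half as much.
  return stronglyConsistent ? total : Math.ceil(total / 2);
}

function writeCapacityUnits(writesPerSecond: number, itemSizeKb: number): number {
  return writesPerSecond * Math.ceil(itemSizeKb); // writes are billed in 1 KB steps
}
```

For instance, 100 strongly consistent reads per second of 6 KB items need 200 RCU (100 if eventual consistency is acceptable), and 50 writes per second of 2.5 KB items need 150 WCU. Item size matters as much as request rate.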
DAX (DynamoDB Accelerator)
In-memory cache cluster co-located with DynamoDB — microsecond reads for eventually consistent GetItem/Query/Scan. Write-through cache: writes go to DynamoDB and DAX together. Use for read-heavy, latency-sensitive paths where ElastiCache would require you to manage cache invalidation logic. Not a replacement for ElastiCache session stores — DAX only accelerates DynamoDB API calls.
DynamoDB Streams
Ordered change log per item (INSERT, MODIFY, REMOVE), with records retained for 24 hours. Consumers: Lambda (event source mapping), Kinesis Client Library, or custom. Enable NEW_AND_OLD_IMAGES when you need the previous state for audit or CDC to OpenSearch. Streams are the backbone of denormalization in single-table design: a Lambda consumer updates duplicated attributes or aggregate items asynchronously. GSI projections themselves are maintained by DynamoDB, not by your code.
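A stream consumer is mostly a mapping from change events to downstream actions. The sketch below shows that mapping for a search-index CDC case. The record shape is a deliberately simplified assumption (real stream records carry DynamoDB attribute-value maps under `dynamodb.NewImage` and `dynamodb.OldImage`), and `toIndexActions` is a hypothetical name.

```typescript
// Sketch only: translate simplified stream records into search-index actions.
type StreamEventName = "INSERT" | "MODIFY" | "REMOVE";

interface SimplifiedStreamRecord {
  eventName: StreamEventName;
  key: string;                        // flattened primary key, e.g. "ORDER#ord_789"
  newImage?: Record<string, unknown>; // present for INSERT and MODIFY
  oldImage?: Record<string, unknown>; // present for MODIFY and REMOVE
}

type IndexAction =
  | { action: "upsert"; id: string; doc: Record<string, unknown> }
  | { action: "delete"; id: string };

function toIndexActions(records: SimplifiedStreamRecord[]): IndexAction[] {
  const actions: IndexAction[] = [];
  for (const record of records) {
    if (record.eventName === "REMOVE") {
      actions.push({ action: "delete", id: record.key });
    } else if (record.newImage !== undefined) {
      // INSERT and MODIFY both become an upsert of the latest state.
      actions.push({ action: "upsert", id: record.key, doc: record.newImage });
    }
  }
  return actions;
}
```

Because upserts and deletes are idempotent, this shape tolerates the occasional redelivery that batch-based consumers such as Lambda event source mappings can produce on retry.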
Single-table design basics
Store multiple entity types in one table using overloaded keys and GSIs — one Query fetches a user's orders, profile, and preferences without joins:
{
  "PK": "USER#usr_abc123",
  "SK": "PROFILE",
  "email": "alex@example.com",
  "name": "Alex Chen"
}
{
  "PK": "USER#usr_abc123",
  "SK": "ORDER#2024-06-10#ord_789",
  "total": 149.99,
  "status": "SHIPPED"
}
{
  "PK": "ORDER#ord_789",
  "SK": "METADATA",
  "GSI1PK": "STATUS#SHIPPED",
  "GSI1SK": "2024-06-10#ord_789"
}
Design all access patterns on paper first (AP1: get user by ID, AP2: list orders by status, …), then derive PK/SK/GSI keys. Adding a new access pattern later often requires a new GSI or migration — relational-style schema evolution doesn't apply.
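To see how overloaded keys answer an access pattern, the sketch below models the table as an in-memory array and mimics a Query with a begins_with condition on the sort key. It is an illustration of the key design only, not the DynamoDB SDK; `userPk`, `orderSk`, and `queryBySkPrefix` are hypothetical helpers.

```typescript
// Sketch only: in-memory stand-in for Query(PK = :pk AND begins_with(SK, :prefix)).
interface Item {
  PK: string;
  SK: string;
  [attr: string]: unknown;
}

function userPk(userId: string): string {
  return "USER#" + userId;
}

function orderSk(isoDate: string, orderId: string): string {
  return "ORDER#" + isoDate + "#" + orderId;
}

// DynamoDB returns items of one partition ordered by sort key; the sort mimics that.
function queryBySkPrefix(table: Item[], pk: string, skPrefix: string): Item[] {
  return table
    .filter((item) => item.PK === pk && item.SK.indexOf(skPrefix) === 0)
    .sort((a, b) => (a.SK < b.SK ? -1 : a.SK > b.SK ? 1 : 0));
}

const sampleTable: Item[] = [
  { PK: userPk("usr_abc123"), SK: "PROFILE", name: "Alex Chen" },
  { PK: userPk("usr_abc123"), SK: orderSk("2024-06-10", "ord_789"), total: 149.99, status: "SHIPPED" },
  { PK: "ORDER#ord_789", SK: "METADATA", GSI1PK: "STATUS#SHIPPED", GSI1SK: "2024-06-10#ord_789" },
];
```

Querying `sampleTable` for `userPk("usr_abc123")` with the prefix `"ORDER#"` returns only that user's orders, and an empty prefix returns the orders followed by the profile. Putting the ISO date first in the order sort key is what makes date-range queries possible later.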
Table with GSI (orders by status)
AWS CLI:
aws dynamodb create-table \
  --table-name Orders \
  --billing-mode PAY_PER_REQUEST \
  --attribute-definitions \
    AttributeName=PK,AttributeType=S \
    AttributeName=SK,AttributeType=S \
    AttributeName=GSI1PK,AttributeType=S \
    AttributeName=GSI1SK,AttributeType=S \
  --key-schema \
    AttributeName=PK,KeyType=HASH \
    AttributeName=SK,KeyType=RANGE \
  --global-secondary-indexes '[
    {
      "IndexName": "GSI1-StatusDate",
      "KeySchema": [
        {"AttributeName": "GSI1PK", "KeyType": "HASH"},
        {"AttributeName": "GSI1SK", "KeyType": "RANGE"}
      ],
      "Projection": {"ProjectionType": "ALL"}
    }
  ]' \
  --stream-specification StreamEnabled=true,StreamViewType=NEW_AND_OLD_IMAGES \
  --sse-specification Enabled=true,SSEType=KMS \
  --tags Key=Service,Value=order-api
Terraform:
resource "aws_dynamodb_table" "orders" {
  name         = "Orders"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "PK"
  range_key    = "SK"

  attribute {
    name = "PK"
    type = "S"
  }
  attribute {
    name = "SK"
    type = "S"
  }
  attribute {
    name = "GSI1PK"
    type = "S"
  }
  attribute {
    name = "GSI1SK"
    type = "S"
  }

  global_secondary_index {
    name            = "GSI1-StatusDate"
    hash_key        = "GSI1PK"
    range_key       = "GSI1SK"
    projection_type = "ALL"
  }

  stream_enabled   = true
  stream_view_type = "NEW_AND_OLD_IMAGES"

  server_side_encryption {
    enabled     = true
    kms_key_arn = aws_kms_key.app.arn
  }

  point_in_time_recovery {
    enabled = true
  }

  tags = { Service = "order-api" }
}
AWS CDK (TypeScript):
import * as dynamodb from 'aws-cdk-lib/aws-dynamodb';

// Assumes `kmsKey` (a kms.IKey) is defined elsewhere in the stack.
const table = new dynamodb.Table(this, 'Orders', {
  partitionKey: { name: 'PK', type: dynamodb.AttributeType.STRING },
  sortKey: { name: 'SK', type: dynamodb.AttributeType.STRING },
  billingMode: dynamodb.BillingMode.PAY_PER_REQUEST,
  encryption: dynamodb.TableEncryption.CUSTOMER_MANAGED,
  encryptionKey: kmsKey,
  stream: dynamodb.StreamViewType.NEW_AND_OLD_IMAGES,
  pointInTimeRecovery: true,
});

table.addGlobalSecondaryIndex({
  indexName: 'GSI1-StatusDate',
  partitionKey: { name: 'GSI1PK', type: dynamodb.AttributeType.STRING },
  sortKey: { name: 'GSI1SK', type: dynamodb.AttributeType.STRING },
  projectionType: dynamodb.ProjectionType.ALL,
});
Use fine-grained access control (the dynamodb:LeadingKeys IAM condition key) so a caller can only read or write items whose partition key matches its own identity, for example USER#${cognito-identity.amazonaws.com:sub}. Enable point-in-time recovery (PITR) on production tables; it's cheap insurance against accidental DeleteTable or bad deploy scripts.
ElastiCache
Managed in-memory cache — Redis or Memcached on EC2 you don't operate. ElastiCache sits in your VPC private subnets and offloads read pressure from RDS/Aurora, stores session state, implements rate limiting, and backs Spring's @Cacheable abstraction.
Redis vs Memcached
| Feature | Redis | Memcached |
|---|---|---|
| Data structures | Strings, hashes, lists, sets, sorted sets, streams, HyperLogLog | Strings only (key-value) |
| Persistence | Optional (RDB snapshots, AOF) | None — pure cache |
| Replication | Primary + replicas, Multi-AZ failover | No native replication |
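| Cluster mode | 90 nodes per cluster by default, raisable to 500 on Redis 5.0.6+ (e.g. 83 shards with 5 replicas each, or 500 shards with no replicas) | Client-side consistent hashing across nodes |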
| Typical use | Sessions, rate limits, pub/sub, leaderboards, Spring Cache | Simple object cache, HTML fragments |
Default to Redis for new projects — Memcached only when you need simple horizontal scale of identical cache nodes with no persistence requirements and already have client-side sharding logic.
Cluster mode
Redis cluster mode shards data across multiple node groups (shards), each with a primary and optional replicas. Clients connect through the configuration endpoint to fetch the slot map; cluster-aware clients like Lettuce (the Spring Data Redis default) discover the topology automatically and route each key to the right shard.
- Disabled cluster mode — single shard, one primary + up to 5 replicas; simpler; limited to one node's RAM
- Enabled cluster mode — horizontal scale beyond ~100 GiB; required for >250k ops/sec
- Replica per shard — enable for read scaling and Multi-AZ failover within the replication group
Multi-AZ and failover
Enable Multi-AZ on a replication group: AWS automatically fails over to a replica in another AZ when the primary fails (~30–60 seconds). Automatic failover requires at least one replica per shard. Application clients must support reconnect — Spring Lettuce handles this with spring.data.redis.lettuce.cluster.refresh.adaptive=true.
Common patterns
Session cache
Store Spring Session data in Redis instead of sticky sessions on ALB. Enables rolling deploys and autoscaling without session loss. Set TTL to match session timeout; use Redis hashes for compact storage.
Rate limiting
Sliding-window counters fit Redis sorted sets; token buckets are usually kept in a hash or a small Lua script. Example: limit API calls to 100/minute per userId with ZADD + ZREMRANGEBYSCORE + ZCARD, run atomically in MULTI/EXEC or a Lua script. Works across all ECS tasks, unlike in-memory Guava rate limiters.
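The sorted-set approach can be sketched without a Redis server. The code below keeps each user's request timestamps in a plain array and marks which Redis command each step stands in for. It is an in-memory model under stated assumptions, not a Redis client: `WindowStore` and `allowRequest` are hypothetical, and real Redis code would run the three steps atomically.

```typescript
// Sketch only: sliding-window log limiter, modelling a Redis sorted set per user
// as an array of request timestamps (score = member = timestamp in ms).
type WindowStore = Record<string, number[]>;

function allowRequest(
  store: WindowStore,
  userId: string,
  nowMs: number,
  limit: number,
  windowMs: number
): boolean {
  const cutoff = nowMs - windowMs;
  // ZREMRANGEBYSCORE key -inf cutoff : drop entries that left the window.
  const kept = (store[userId] || []).filter((t) => t > cutoff);
  // ZCARD key : how many requests remain inside the window.
  const allowed = kept.length < limit;
  if (allowed) {
    // ZADD key now now : record this request.
    kept.push(nowMs);
  }
  store[userId] = kept;
  return allowed;
}
```

With a limit of 2 per 1,000 ms, requests at 0 ms and 100 ms pass, one at 200 ms is rejected, and one at 1,001 ms passes again because the entry from 0 ms has aged out. In Redis you would also set an expiry on the key so idle users do not leave sets behind.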
Spring Cache integration
spring:
  data:
    redis:
      cluster:
        nodes:
          - orders-cache.abc123.clustercfg.euw1.cache.amazonaws.com:6379
      lettuce:
        cluster:
          refresh:
            adaptive: true
  cache:
    type: redis
    redis:
      time-to-live: 300000 # 5 min, tune per cache name
# application code:
# @Cacheable(value = "products", key = "#sku")
# public Product findBySku(String sku) { ... }
Anti-pattern: caching entities with JPA lazy-loaded associations. Serializing a Product graph to Redis can trigger LazyInitializationException or cache massive object trees. Cache DTOs or explicitly fetch joins; set TTLs; use @CacheEvict on writes.
Provision ElastiCache Redis cluster
AWS CLI:
# Hex output only: auth tokens accept very few special characters, and base64 can
# emit "/", "+" and "=". Store the token (e.g. in Secrets Manager) before use.
AUTH_TOKEN="$(openssl rand -hex 32)"
aws elasticache create-replication-group \
  --replication-group-id orders-cache \
  --replication-group-description "Order API session + cache" \
  --engine redis \
  --engine-version 7.1 \
  --cache-node-type cache.r7g.large \
  --num-node-groups 2 \
  --replicas-per-node-group 1 \
  --automatic-failover-enabled \
  --multi-az-enabled \
  --cache-subnet-group-name private-cache-subnets \
  --security-group-ids sg-0def456 \
  --at-rest-encryption-enabled \
  --transit-encryption-enabled \
  --auth-token "$AUTH_TOKEN"
Terraform:
resource "aws_elasticache_replication_group" "orders" {
  replication_group_id       = "orders-cache"
  description                = "Order API session + cache"
  engine                     = "redis"
  engine_version             = "7.1"
  node_type                  = "cache.r7g.large"
  num_node_groups            = 2
  replicas_per_node_group    = 1
  automatic_failover_enabled = true
  multi_az_enabled           = true
  subnet_group_name          = aws_elasticache_subnet_group.private.name
  security_group_ids         = [aws_security_group.cache.id]
  at_rest_encryption_enabled = true
  transit_encryption_enabled = true
  auth_token                 = random_password.redis_auth.result
  parameter_group_name       = aws_elasticache_parameter_group.redis7.name
}

resource "aws_elasticache_parameter_group" "redis7" {
  family = "redis7"
  name   = "orders-redis7"

  # A custom group must turn cluster mode on when num_node_groups > 1.
  parameter {
    name  = "cluster-enabled"
    value = "yes"
  }
  parameter {
    name  = "maxmemory-policy"
    value = "volatile-lru"
  }
}
AWS CDK (TypeScript):
import * as ec2 from 'aws-cdk-lib/aws-ec2';
import * as elasticache from 'aws-cdk-lib/aws-elasticache';

// Assumes `vpc`, `cacheSg` (security group) and `redisAuth` (a Secrets Manager
// secret) are defined elsewhere in the stack.
const subnetGroup = new elasticache.CfnSubnetGroup(this, 'CacheSubnets', {
  description: 'Private cache subnets',
  subnetIds: vpc.privateSubnets.map(s => s.subnetId),
  cacheSubnetGroupName: 'private-cache-subnets',
});

new elasticache.CfnReplicationGroup(this, 'OrdersCache', {
  replicationGroupId: 'orders-cache',
  replicationGroupDescription: 'Order API session + cache',
  engine: 'redis',
  engineVersion: '7.1',
  cacheNodeType: 'cache.r7g.large',
  numNodeGroups: 2,
  replicasPerNodeGroup: 1,
  automaticFailoverEnabled: true,
  multiAzEnabled: true,
  cacheSubnetGroupName: subnetGroup.cacheSubnetGroupName!,
  securityGroupIds: [cacheSg.securityGroupId],
  atRestEncryptionEnabled: true,
  transitEncryptionEnabled: true,
  authToken: redisAuth.secretValue.unsafeUnwrap(),
});
Airbnb uses Redis extensively for rate limiting and feature flags at the edge of their service mesh. Teams standardize on ElastiCache Redis with cluster mode rather than self-managed Redis on EC2 — patching and failover become AWS's problem during holiday traffic spikes.
Database selection
Picking the wrong database is expensive to undo. Use this decision framework before provisioning — most Spring Boot teams default to RDS PostgreSQL and only reach for DynamoDB or Aurora when a specific requirement forces the choice.
RDS vs Aurora vs DynamoDB decision matrix
| Requirement | RDS | Aurora | DynamoDB |
|---|---|---|---|
| SQL + joins + ad-hoc queries | ✅ Best fit | ✅ Best fit | ❌ Design access patterns upfront |
| Spring Data JPA / Hibernate | ✅ Native | ✅ Native (same JDBC) | ⚠️ Use Spring Data DynamoDB or SDK v2 |
| Multi-AZ automatic failover | ✅ ~60s | ✅ ~30s | ✅ Built-in (multi-AZ by default) |
| Read scaling (>5 replicas) | ⚠️ Up to 15 async replicas | ✅ Up to 15 low-lag readers | ✅ Scales automatically |
| Cross-region DR (<1 min RPO) | ⚠️ Cross-region read replica + manual promote | ✅ Global Database | ✅ Global Tables |
| Predictable low cost at small scale | ✅ Cheapest entry | ⚠️ Premium pricing | ⚠️ On-demand adds up; provisioned needs tuning |
| Spiky / serverless traffic | ⚠️ Fixed instance size | ✅ Serverless v2 ACU | ✅ On-demand mode |
| Single-digit ms at millions of ops/sec | ❌ Connection + query overhead | ⚠️ Better but still SQL | ✅ Purpose-built |
Polyglot persistence vs one database: microservices often justify DynamoDB for high-throughput services and PostgreSQL for reporting. Operating three engines increases runbook surface. Start monolithic on RDS; split data stores only when metrics prove the need (connection exhaustion, query latency, cost at scale).
Spring Boot + RDS/Aurora guidance
Dependencies and configuration
# build.gradle: implementation 'org.postgresql:postgresql'
#               implementation 'org.flywaydb:flyway-database-postgresql'
spring:
  datasource:
    url: jdbc:postgresql://${DB_HOST}:5432/orders?sslmode=require
    username: ${DB_USER}
    password: ${DB_PASSWORD} # from Secrets Manager in prod, never commit
    hikari:
      maximum-pool-size: 20 # size from measured load; the "(CPU cores * 2) + spindle count" rule does not transfer to RDS
      minimum-idle: 5
      connection-timeout: 5000
      idle-timeout: 600000
      max-lifetime: 1800000 # 30 min; keep below any connection limit imposed by the database, RDS Proxy, or network
      connection-test-query: SELECT 1 # optional: with a JDBC4 driver HikariCP validates via Connection.isValid() and recommends leaving this unset
  jpa:
    hibernate:
      ddl-auto: validate # Flyway owns schema; never update/create in prod
    properties:
      hibernate:
        jdbc:
          batch_size: 50
        order_inserts: true
        order_updates: true
  flyway:
    enabled: true
    baseline-on-migrate: true
Connection math with RDS Proxy
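If you have 20 ECS tasks each with HikariCP pool size 20, that's 400 connections, and the count can briefly double during a rolling deploy while old and new tasks overlap. A db.r6g.large (16 GiB) defaults to a max_connections of roughly 1,700 (LEAST(DBInstanceClassMemory/9531392, 5000)), but every PostgreSQL backend consumes memory, so running anywhere near that limit hurts. Put RDS Proxy in front: tasks connect to the proxy endpoint; proxy maintains a smaller pool to the database (~50–100 connections). Set HikariCP maximum-pool-size per task lower (5–10) when using Proxy.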
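The connection arithmetic is easy to script as a pre-deploy check. The sketch below assumes the default RDS PostgreSQL formula for max_connections, LEAST(DBInstanceClassMemory/9531392, 5000), and feeds it nominal RAM, so real limits come out somewhat lower because DBInstanceClassMemory excludes memory reserved for the OS and RDS processes. All function names are hypothetical.

```typescript
// Sketch only: compare app-side pool demand with the database's connection limit.

// Connections the fleet can open at steady state.
function totalAppConnections(tasks: number, poolSizePerTask: number): number {
  return tasks * poolSizePerTask;
}

// Peak during a rolling deploy. maxPercent mirrors the ECS deployment setting
// maximumPercent (200 means old and new tasks can fully overlap).
function peakDeployConnections(tasks: number, poolSizePerTask: number, maxPercent: number): number {
  return Math.ceil((tasks * maxPercent) / 100) * poolSizePerTask;
}

// Default max_connections for RDS PostgreSQL, approximated from nominal memory.
function defaultMaxConnections(instanceMemoryBytes: number): number {
  return Math.min(Math.floor(instanceMemoryBytes / 9531392), 5000);
}

const GIB = 1024 * 1024 * 1024;
```

For the example above, `totalAppConnections(20, 20)` is 400, `peakDeployConnections(20, 20, 200)` is 800, and `defaultMaxConnections(16 * GIB)` is 1,802 as an upper bound. Dropping the pool to 10 per task halves both app-side numbers.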
Read replica routing
Spring doesn't route reads automatically. Options:
- Manual — separate @Bean DataSource for reader endpoint; use in @Transactional(readOnly=true) routing aspect
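- Aurora reader endpoint: point a read-only DataSource at the cluster-ro endpoint (e.g. my-cluster.cluster-ro-abc123.eu-west-1.rds.amazonaws.com) for read-only repository methods
- Avoid: sending writes to a replica (they fail, because replicas are read-only) or reading from a replica immediately after a write (replication lag returns stale data)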
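The routing decision itself is small. The sketch below shows one possible rule, assuming the application tracks how long ago the current user last wrote: writes and reads that must see a recent write go to the writer, everything else goes to the reader. `ClusterEndpoints`, `chooseEndpoint`, and the lag budget are hypothetical; in a Spring application similar logic would typically sit behind an AbstractRoutingDataSource.

```typescript
// Sketch only: pick an endpoint per operation with a read-your-writes guard.
interface ClusterEndpoints {
  writer: string; // cluster (writer) endpoint
  reader: string; // reader endpoint
}

function chooseEndpoint(
  endpoints: ClusterEndpoints,
  readOnly: boolean,
  msSinceLastWrite: number,
  lagBudgetMs: number
): string {
  if (!readOnly) {
    return endpoints.writer; // replicas reject writes
  }
  if (msSinceLastWrite < lagBudgetMs) {
    return endpoints.writer; // a replica might not have the caller's write yet
  }
  return endpoints.reader;
}
```

The lag budget is a judgment call: set it above the replica lag you observe in CloudWatch, and accept that a larger budget sends more reads to the writer.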
Observability checklist
- Enable Performance Insights on RDS/Aurora — top SQL, wait events, DB load
- CloudWatch alarms: CPUUtilization > 80%, FreeableMemory low, DatabaseConnections near max
- Log slow queries via parameter group (log_min_duration_statement)
- Trace JDBC with X-Ray or OpenTelemetry JDBC instrumentation
When the scenario describes "unpredictable traffic" + "NoSQL" + "single-digit millisecond" → DynamoDB on-demand. When it says "PostgreSQL compatibility" + "fast failover" + "15 read replicas" → Aurora. When it says "existing MySQL app" + "minimal migration" → RDS MySQL with Multi-AZ. ElastiCache is never the primary database — it's always a cache layer.
Run load tests against RDS before production cutover with the same HikariCP pool settings and connection count you'll use in ECS. Connection storms during deploys (old tasks not drained, new tasks starting) are a common cause of FATAL: too many connections — use ECS deployment circuit breaker and graceful shutdown hooks that close HikariCP on SIGTERM.