Infrastructure as Code: CDK, Terraform & CloudFormation
ClickOps does not scale past one environment. Production AWS estates are defined in version-controlled templates — CloudFormation is the native deployment engine; Terraform is the multi-cloud lingua franca with remote state; CDK lets you express infrastructure in TypeScript or Java and synthesize to CloudFormation. This chapter covers how each tool models change, how to avoid state corruption and stack drift, and how platform teams wire landing zones, GitOps, and policy-as-code into every pull request.
CloudFormation
CloudFormation is AWS's native IaC engine. You declare desired state in YAML or JSON; CFN creates, updates, and deletes resources in dependency order, rolls back on failure, and records every change in stack events. CDK deploys through CFN (it synthesizes CloudFormation templates); Terraform does not, and calls AWS service APIs directly through the AWS provider.
Stacks — the unit of deployment
A stack is a named collection of resources defined by one template. Stack name is unique per account/region. CFN tracks dependencies: it creates IAM roles before Lambda functions that reference them, deletes in reverse order on DeleteStack. Stack status flows through CREATE_IN_PROGRESS → CREATE_COMPLETE or ROLLBACK_COMPLETE on failure.
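The ordering rule above (dependencies first on create, reverse on delete) is a topological sort. The sketch below illustrates the idea only; it is not CloudFormation's implementation, and the graph shape and resource names are hypothetical.

```typescript
// Hypothetical resource graph: logical ID -> logical IDs it depends on.
type DependencyGraph = Record<string, string[]>;

// Depth-first topological sort: a resource is emitted only after its dependencies.
function creationOrder(graph: DependencyGraph): string[] {
  const order: string[] = [];
  const state: Record<string, "visiting" | "done"> = {};

  const visit = (id: string): void => {
    if (state[id] === "done") return;
    if (state[id] === "visiting") {
      throw new Error(`Circular dependency at ${id}`);
    }
    state[id] = "visiting";
    for (const dep of graph[id] ?? []) visit(dep);
    state[id] = "done";
    order.push(id);
  };

  for (const id of Object.keys(graph)) visit(id);
  return order;
}

// Deletion walks the creation order backwards.
function deletionOrder(graph: DependencyGraph): string[] {
  return creationOrder(graph).reverse();
}

const exampleGraph: DependencyGraph = {
  OrderFunction: ["OrderFunctionRole"],
  OrderFunctionRole: [],
  OrderApi: ["OrderFunction"],
};

console.log(creationOrder(exampleGraph)); // role, then function, then API
console.log(deletionOrder(exampleGraph)); // API, then function, then role
```

A cycle (A depends on B, B depends on A) throws here; CloudFormation similarly rejects templates with circular dependencies.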
| Concept | What it does | Production use |
|---|---|---|
| Change sets | Preview what an update will do before executing | Mandatory for prod — review resource replacements (destructive) in PR comments |
| Nested stacks | Parent stack embeds child templates via AWS::CloudFormation::Stack | Split VPC / app / data layers; reuse network template across accounts |
| Stack sets | Deploy same template to multiple accounts/regions from management account | Org-wide Config rules, GuardDuty enablement, baseline IAM roles |
| Drift detection | Compares live resources to template — flags manual console changes | Weekly drift scan; alert when security groups or S3 policies diverge |
| Stack policy | JSON document protecting resources from update/delete during deploy | Protect production RDS during risky app stack updates |
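Drift detection, as described in the table, compares what the template declares with what is live. A minimal sketch of that comparison follows, assuming flat property maps; the property names are hypothetical and real drift results are richer. As with CloudFormation drift detection, only properties the template sets explicitly are compared.

```typescript
type PropertyValue = string | number | boolean;
type Properties = Record<string, PropertyValue>;

interface PropertyDrift {
  property: string;
  expected: PropertyValue;
  actual: PropertyValue | undefined;
}

// Report every declared property whose live value differs from the template.
function detectDrift(declared: Properties, live: Properties): PropertyDrift[] {
  const drifts: PropertyDrift[] = [];
  for (const property of Object.keys(declared)) {
    const expected = declared[property];
    const actual = live[property];
    if (expected !== undefined && expected !== actual) {
      drifts.push({ property, expected, actual });
    }
  }
  return drifts;
}

// Hypothetical bucket: someone suspended versioning in the console.
const declaredBucket: Properties = { VersioningStatus: "Enabled", ObjectLockEnabled: false };
const liveBucket: Properties = { VersioningStatus: "Suspended", ObjectLockEnabled: false, Owner: "ops" };

// Reports VersioningStatus only; Owner is not declared in the template, so it is ignored.
console.log(detectDrift(declaredBucket, liveBucket));
```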
Parameters, conditions, and outputs
Parameters make templates reusable: environment name, instance type, VPC CIDR. Use AllowedValues to constrain input and NoEcho to mask sensitive values in console and API output; for real secrets, prefer dynamic references to Secrets Manager or SSM Parameter Store over passing them as parameters. Conditions gate resource creation, e.g. create a NAT Gateway only when the CreateNatGateway parameter is true. Outputs export values (VPC ID, ALB DNS) for cross-stack references via Fn::ImportValue or SSM Parameter Store (preferred for decoupling).
flowchart LR
  TPL["Template\n(YAML/JSON)"]
  CS["Change set\n(preview)"]
  REV["PR review +\npolicy scan"]
  UPD["Execute change set"]
  EVT["Stack events\n+ drift check"]
  TPL --> CS --> REV --> UPD --> EVT
Deploy a VPC stack with public and private subnets
# CloudFormation (YAML), excerpt: VPC, one public subnet, prod-only NAT EIP
AWSTemplateFormatVersion: '2010-09-09'
Description: App VPC, 2 AZs, public + private subnets, single NAT
Parameters:
  Environment:
    Type: String
    AllowedValues: [dev, staging, prod]
  VpcCidr:
    Type: String
    Default: 10.0.0.0/16
Conditions:
  IsProd: !Equals [!Ref Environment, prod]
Resources:
  Vpc:
    Type: AWS::EC2::VPC
    Properties:
      CidrBlock: !Ref VpcCidr
      EnableDnsHostnames: true
      EnableDnsSupport: true
      Tags:
        - Key: Name
          Value: !Sub '${Environment}-app-vpc'
  PublicSubnetA:
    Type: AWS::EC2::Subnet
    Properties:
      VpcId: !Ref Vpc
      # Must fall inside VpcCidr; derive it with Fn::Cidr if VpcCidr is overridden
      CidrBlock: 10.0.0.0/24
      AvailabilityZone: !Select [0, !GetAZs '']
      MapPublicIpOnLaunch: true
  NatEip:
    Type: AWS::EC2::EIP
    Condition: IsProd
    Properties:
      Domain: vpc
Outputs:
  VpcId:
    Value: !Ref Vpc
    Export:
      Name: !Sub '${Environment}-VpcId'
# Terraform (HCL): the same VPC via the community VPC module
data "aws_availability_zones" "available" {
  state = "available"
}

module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "~> 5.0"

  name            = "${var.environment}-app-vpc"
  cidr            = var.vpc_cidr
  azs             = slice(data.aws_availability_zones.available.names, 0, 2)
  public_subnets  = ["10.0.0.0/24", "10.0.1.0/24"]
  private_subnets = ["10.0.10.0/24", "10.0.11.0/24"]

  enable_nat_gateway = var.environment == "prod"
  single_nat_gateway = true

  tags = { Environment = var.environment }
}
// CDK (TypeScript): the same VPC as an L2 construct
import { Stack, StackProps } from 'aws-cdk-lib';
import * as ec2 from 'aws-cdk-lib/aws-ec2';
import { Construct } from 'constructs';

export class AppVpcStack extends Stack {
  // Exposed so other stacks in the app can place resources in this VPC
  public readonly vpc: ec2.Vpc;

  constructor(scope: Construct, id: string, props: StackProps & { envName: string }) {
    super(scope, id, props);
    this.vpc = new ec2.Vpc(this, 'Vpc', {
      maxAzs: 2,
      natGateways: props.envName === 'prod' ? 1 : 0,
      subnetConfiguration: [
        { name: 'Public', subnetType: ec2.SubnetType.PUBLIC, cidrMask: 24 },
        { name: 'Private', subnetType: ec2.SubnetType.PRIVATE_WITH_EGRESS, cidrMask: 24 },
      ],
    });
  }
}
Change sets and drift in practice
$ aws cloudformation create-change-set \
    --stack-name prod-app-vpc --change-set-name cs-2024-06-10 \
    --template-body file://vpc.yaml \
    --parameters ParameterKey=Environment,ParameterValue=prod
$ aws cloudformation describe-change-set \
    --stack-name prod-app-vpc --change-set-name cs-2024-06-10 \
    --query 'Changes[*].ResourceChange.{Action:Action,LogicalId:LogicalResourceId,Replacement:Replacement}'
→ Replacement: True on Subnet = destructive; investigate before execute
$ aws cloudformation detect-stack-drift --stack-name prod-app-vpc
$ aws cloudformation describe-stack-resource-drifts --stack-name prod-app-vpc
CloudFormation tracks the state of each resource in the stack. Updates that change immutable properties (a subnet's CidrBlock, an RDS instance's DBInstanceIdentifier) trigger replacement: CFN creates the new resource, repoints dependents, then deletes the old one during cleanup. DeletionPolicy and UpdateReplacePolicy control whether the old resource is deleted, retained, or snapshotted. Nested stacks propagate failures to the parent, so a child rollback rolls back the parent too.
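Change sets preview updates; nothing changes until you execute them. Drift detection finds manual changes but does not auto-remediate: you either update the template to match or revert the resource. Resources created outside the stack are adopted with a resource import (a change set of type IMPORT). Stack sets with self-managed permissions need an administration role in the administrator account and an execution role that trusts it in every target account; service-managed permissions through AWS Organizations create these roles for you.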
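Reviewing a change set mostly means finding the entries that remove or replace a resource. The sketch below filters the Changes array returned by describe-change-set, modelling only a subset of the ResourceChange fields; the sample data is hypothetical. It treats Replacement: Conditional as destructive, which is a conservative assumption.

```typescript
// Subset of the ResourceChange shape returned by `aws cloudformation describe-change-set`.
interface ResourceChange {
  Action: "Add" | "Modify" | "Remove" | "Import" | "Dynamic";
  LogicalResourceId: string;
  ResourceType: string;
  Replacement?: "True" | "False" | "Conditional";
}

interface Change {
  ResourceChange: ResourceChange;
}

// A change is destructive if it removes a resource, or modifies it in a way
// that replaces (or may replace) the physical resource.
function destructiveChanges(changes: Change[]): ResourceChange[] {
  return changes
    .map((c) => c.ResourceChange)
    .filter(
      (rc) =>
        rc.Action === "Remove" ||
        (rc.Action === "Modify" && (rc.Replacement === "True" || rc.Replacement === "Conditional")),
    );
}

// Hypothetical change set for the VPC stack.
const sampleChanges: Change[] = [
  { ResourceChange: { Action: "Modify", LogicalResourceId: "PublicSubnetA", ResourceType: "AWS::EC2::Subnet", Replacement: "True" } },
  { ResourceChange: { Action: "Modify", LogicalResourceId: "Vpc", ResourceType: "AWS::EC2::VPC", Replacement: "False" } },
  { ResourceChange: { Action: "Add", LogicalResourceId: "NatEip", ResourceType: "AWS::EC2::EIP" } },
];

for (const rc of destructiveChanges(sampleChanges)) {
  console.log(`REVIEW: ${rc.Action} ${rc.LogicalResourceId} (${rc.ResourceType})`);
}
```

A CI job could fail the PR check, or require an extra approval, whenever this list is non-empty for a production stack.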
Editing resources in the console "just once": drift accumulates until a later UpdateStack fails or silently overwrites the manual change. Treat the console as read-only in managed environments. If you must hotfix, port the change back into the template the same day.
Terraform
Terraform (and its OpenTofu fork) describes infrastructure in HCL and applies changes through the AWS provider, which calls AWS APIs directly with no CloudFormation intermediary. State is Terraform's record of what it created; corrupt or conflicting state is among the most common causes of Terraform incidents in production. For teams, remote state with locking (classically S3 plus a DynamoDB lock table) is non-negotiable.
Remote state — S3 + DynamoDB locking
Local terraform.tfstate on a laptop is a single point of failure. Store state in an encrypted S3 bucket with versioning enabled; use a DynamoDB table for state locking so two CI jobs cannot apply simultaneously. Enable SSE-KMS on the bucket; restrict access to the deploy role only.
# backend.tf: bootstrap the bucket and lock table manually once, then migrate state
terraform {
  backend "s3" {
    bucket         = "my-org-terraform-state"
    key            = "platform/vpc/prod/terraform.tfstate"
    region         = "eu-west-1"
    encrypt        = true
    dynamodb_table = "terraform-state-lock" # Terraform 1.10+ can instead lock in S3 with use_lockfile = true
    kms_key_id     = "arn:aws:kms:eu-west-1:123456789012:key/abc-123"
  }
}
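Locking works because acquiring the lock is a conditional write: it succeeds only if no lock item exists for that state key. The in-memory model below illustrates that behaviour. It is not Terraform's backend code, and the class and method names are hypothetical.

```typescript
interface LockRecord {
  owner: string;
  acquiredAt: number;
}

class StateLockTable {
  private readonly locks: { [stateKey: string]: LockRecord | undefined } = {};

  // Mirrors a conditional put: succeeds only when no lock exists for the key.
  acquire(stateKey: string, owner: string): boolean {
    if (this.locks[stateKey] !== undefined) return false;
    this.locks[stateKey] = { owner, acquiredAt: Date.now() };
    return true;
  }

  // Only the current holder may release the lock.
  release(stateKey: string, owner: string): boolean {
    const lock = this.locks[stateKey];
    if (lock === undefined || lock.owner !== owner) return false;
    delete this.locks[stateKey];
    return true;
  }
}

const lockTable = new StateLockTable();
const vpcStateKey = "platform/vpc/prod/terraform.tfstate";

console.log(lockTable.acquire(vpcStateKey, "ci-job-1")); // true: lock was free
console.log(lockTable.acquire(vpcStateKey, "ci-job-2")); // false: job 1 holds it
console.log(lockTable.release(vpcStateKey, "ci-job-1")); // true: holder releases
console.log(lockTable.acquire(vpcStateKey, "ci-job-2")); // true: lock is free again
```

The second CI job fails fast instead of writing a competing state file, which is the behaviour the lock table buys you.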
| Feature | Purpose | Guidance |
|---|---|---|
| Modules | Reusable packages of resources with inputs/outputs | Pin module versions (version = "~> 5.0"); publish internal modules to a private registry |
| Workspaces | Multiple state files in same backend key prefix | dev / staging / prod — or prefer separate state keys per env for stronger isolation |
| Import | Adopt existing AWS resources into Terraform state | terraform import aws_s3_bucket.artifacts my-bucket — write matching HCL first |
| Plan / apply | Plan = dry-run diff; apply = execute changes | Always plan in CI; apply only from protected branch with approval |
| Data sources | Read existing resources without managing them | Look up VPC, subnets, ACM certs created by networking team |
Plan/apply workflow
- terraform init — download providers, configure backend
- terraform fmt -check and validate in CI
- terraform plan -out=plan.tfplan — save binary plan artifact
- Human or policy bot reviews plan output in PR
- terraform apply plan.tfplan — apply exact planned changes only
- Post-apply: run smoke tests; refresh AWS Config / Security Hub compliance
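The review step in the workflow above can be partly automated by reading the JSON form of the saved plan (terraform show -json plan.tfplan). This sketch models only the resource_changes[].change.actions field of that output; the gate rule and sample addresses are assumptions for illustration.

```typescript
interface ResourceChangeEntry {
  address: string;
  change: { actions: string[] };
}

interface PlanJson {
  resource_changes?: ResourceChangeEntry[];
}

interface PlanSummary {
  create: string[];
  update: string[];
  replace: string[];
  destroy: string[];
}

// Bucket each planned resource by what the apply would do to it.
function summarizePlan(plan: PlanJson): PlanSummary {
  const summary: PlanSummary = { create: [], update: [], replace: [], destroy: [] };
  for (const rc of plan.resource_changes ?? []) {
    const actions = rc.change.actions;
    const creates = actions.indexOf("create") !== -1;
    const deletes = actions.indexOf("delete") !== -1;
    if (creates && deletes) summary.replace.push(rc.address); // delete+create in either order
    else if (deletes) summary.destroy.push(rc.address);
    else if (creates) summary.create.push(rc.address);
    else if (actions.indexOf("update") !== -1) summary.update.push(rc.address);
    // "no-op" and "read" entries need no review
  }
  return summary;
}

// Assumed gate: anything that destroys or replaces a resource needs a human.
function needsHumanApproval(summary: PlanSummary): boolean {
  return summary.replace.length > 0 || summary.destroy.length > 0;
}

// Hypothetical plan excerpt.
const samplePlan: PlanJson = {
  resource_changes: [
    { address: "aws_ecs_service.order_service", change: { actions: ["update"] } },
    { address: "aws_db_instance.orders", change: { actions: ["delete", "create"] } },
    { address: "aws_ecs_cluster.orders", change: { actions: ["no-op"] } },
  ],
};

const planSummary = summarizePlan(samplePlan);
console.log(planSummary);
console.log("needs approval:", needsHumanApproval(planSummary));
```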
ECS Fargate service: the same stack in CloudFormation, Terraform, and CDK
# CloudFormation (YAML), Resources section excerpt
OrderServiceCluster:
  Type: AWS::ECS::Cluster
  Properties:
    ClusterName: !Sub '${Environment}-orders'
OrderService:
  Type: AWS::ECS::Service
  Properties:
    Cluster: !Ref OrderServiceCluster
    DesiredCount: !If [IsProd, 3, 1]
    LaunchType: FARGATE
    NetworkConfiguration:
      AwsvpcConfiguration:
        AssignPublicIp: DISABLED
        Subnets: !Split [',', !ImportValue PrivateSubnetIds]
        SecurityGroups: [!Ref AppSecurityGroup]
    TaskDefinition: !Ref OrderTaskDefinition
# Terraform (HCL)
resource "aws_ecs_cluster" "orders" {
  name = "${var.environment}-orders"
}

resource "aws_ecs_service" "order_service" {
  name            = "order-service"
  cluster         = aws_ecs_cluster.orders.id
  task_definition = aws_ecs_task_definition.app.arn
  desired_count   = var.environment == "prod" ? 3 : 1
  launch_type     = "FARGATE"

  network_configuration {
    subnets          = var.private_subnet_ids
    security_groups  = [aws_security_group.app.id]
    assign_public_ip = false
  }

  lifecycle {
    ignore_changes = [desired_count] # let autoscaling own replica count
  }
}
// CDK (TypeScript), inside a Stack constructor; envName, vpc and taskDefinition are defined earlier
const cluster = new ecs.Cluster(this, 'OrdersCluster', {
  clusterName: `${envName}-orders`,
  vpc,
});

new ecs.FargateService(this, 'OrderService', {
  cluster,
  taskDefinition,
  desiredCount: envName === 'prod' ? 3 : 1,
  assignPublicIp: false,
  vpcSubnets: { subnetType: ec2.SubnetType.PRIVATE_WITH_EGRESS },
});
Two engineers running terraform apply without locking — state file corruption, duplicate resources, or orphaned AWS objects. Never disable DynamoDB locking. If state is corrupted, restore from S3 versioning before attempting manual state surgery with terraform state rm.
Split state by blast radius: networking/, data/, app-order-service/ as separate state keys. Use terraform_remote_state data sources to pass VPC IDs — a bad app apply cannot destroy the VPC.
Terraform itself is free (BSL license — evaluate OpenTofu for strict OSS policies). State storage costs pennies: S3 + DynamoDB on-demand for a platform team is typically < $5/month. The expensive mistake is uncontrolled applies creating duplicate NAT Gateways or oversized RDS instances — enforce plan review.
AWS CDK
CDK lets you define infrastructure in general-purpose languages — TypeScript and Java are the most common in enterprise teams. CDK synthesizes to CloudFormation; cdk deploy uploads assets (Lambda zip, Docker image) to S3/ECR and creates the CFN stack. You get IDE autocomplete, unit tests, and reusable constructs instead of YAML indentation archaeology.
Constructs — L1, L2, L3
| Level | What it maps to | Example |
|---|---|---|
| L1 (Cfn*) | 1:1 CloudFormation resource — full property surface, no defaults | new s3.CfnBucket(this, 'Raw') |
| L2 | AWS construct library — sensible defaults, helper methods | new s3.Bucket(this, 'Artifacts', { encryption: s3.BucketEncryption.S3_MANAGED }) |
| L3 (patterns) | Opinionated multi-resource patterns | ecsPatterns.ApplicationLoadBalancedFargateService, ecsPatterns.QueueProcessingFargateService |
Synthesis and deployment
cdk synth compiles your app to CloudFormation templates in cdk.out/ — commit nothing from this folder. cdk diff compares deployed stack to synthesized template (like Terraform plan). cdk deploy runs CFN create/update. Use aspects to apply org-wide rules — e.g. enforce S3 encryption on every Bucket construct via IAspect.
Testing CDK apps
- Snapshot tests — Template.fromStack(stack).toJSON() compared to golden file
- Fine-grained assertions — template.hasResourceProperties('AWS::S3::Bucket', { ... })
- Integration tests — deploy to ephemeral account via CI; run smoke tests; destroy
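Fine-grained assertions come down to inspecting the synthesized template as plain JSON. The hand-rolled check below shows that idea without using the aws-cdk-lib assertions module; the template is a hypothetical, trimmed synth output, and in a real project Template.fromStack(stack) would supply it.

```typescript
interface CfnResource {
  Type: string;
  Properties?: Record<string, unknown>;
}

interface CfnTemplate {
  Resources: Record<string, CfnResource>;
}

// Return the logical IDs of S3 buckets that declare no BucketEncryption property.
function bucketsWithoutEncryption(template: CfnTemplate): string[] {
  return Object.keys(template.Resources).filter((logicalId) => {
    const resource = template.Resources[logicalId];
    return (
      resource !== undefined &&
      resource.Type === "AWS::S3::Bucket" &&
      resource.Properties?.["BucketEncryption"] === undefined
    );
  });
}

// Hypothetical synthesized template with one compliant and one non-compliant bucket.
const synthesized: CfnTemplate = {
  Resources: {
    ArtifactsBucket: {
      Type: "AWS::S3::Bucket",
      Properties: {
        BucketEncryption: {
          ServerSideEncryptionConfiguration: [
            { ServerSideEncryptionByDefault: { SSEAlgorithm: "AES256" } },
          ],
        },
      },
    },
    ScratchBucket: { Type: "AWS::S3::Bucket" },
    OrdersCluster: { Type: "AWS::ECS::Cluster" },
  },
};

console.log(bucketsWithoutEncryption(synthesized)); // lists ScratchBucket only
```

A unit test would assert that this list is empty, the same rule an aspect could enforce at synth time.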
ECS stack in TypeScript and Java
// CDK (TypeScript)
import { App, Stack, Tags } from 'aws-cdk-lib';
import * as ec2 from 'aws-cdk-lib/aws-ec2';
import * as ecs from 'aws-cdk-lib/aws-ecs';
import * as ecsPatterns from 'aws-cdk-lib/aws-ecs-patterns';

const app = new App();
// fromLookup needs a concrete account and region on the stack
const stack = new Stack(app, 'OrderServiceStack', { env: { account: '123456789012', region: 'eu-west-1' } });
const vpc = ec2.Vpc.fromLookup(stack, 'Vpc', { vpcName: 'prod-app-vpc' });

new ecsPatterns.ApplicationLoadBalancedFargateService(stack, 'OrderService', {
  vpc,
  cpu: 512,
  memoryLimitMiB: 1024,
  desiredCount: 3,
  publicLoadBalancer: false,
  // :latest kept for brevity; pin an immutable tag or digest in real deployments
  taskImageOptions: { image: ecs.ContainerImage.fromRegistry('123456789012.dkr.ecr.eu-west-1.amazonaws.com/orders:latest') },
});

Tags.of(stack).add('Team', 'payments');
app.synth();
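// CDK (Java)
import software.amazon.awscdk.*;
import software.amazon.awscdk.services.ec2.IVpc;
import software.amazon.awscdk.services.ec2.Vpc;
import software.amazon.awscdk.services.ec2.VpcLookupOptions;
import software.amazon.awscdk.services.ecs.ContainerImage;
import software.amazon.awscdk.services.ecs.patterns.ApplicationLoadBalancedFargateService;
import software.amazon.awscdk.services.ecs.patterns.ApplicationLoadBalancedTaskImageOptions;

public class OrderServiceApp {
  public static void main(String[] args) {
    App app = new App();
    Stack stack = Stack.Builder.create(app, "OrderServiceStack")
        .env(Environment.builder().account("123456789012").region("eu-west-1").build())
        .build();
    // fromLookup returns the IVpc interface, not a concrete Vpc
    IVpc vpc = Vpc.fromLookup(stack, "Vpc", VpcLookupOptions.builder().vpcName("prod-app-vpc").build());
    ApplicationLoadBalancedFargateService.Builder.create(stack, "OrderService")
        .vpc(vpc).cpu(512).memoryLimitMiB(1024).desiredCount(3).publicLoadBalancer(false)
        // Same image as the TypeScript version; the pattern needs an image or a task definition
        .taskImageOptions(ApplicationLoadBalancedTaskImageOptions.builder()
            .image(ContainerImage.fromRegistry("123456789012.dkr.ecr.eu-west-1.amazonaws.com/orders:latest"))
            .build())
        .build();
    Tags.of(stack).add("Team", "payments");
    app.synth();
  }
}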
Unit tests use Template.fromStack(stack).hasResourceProperties(...) to assert synthesized CloudFormation — run cdk diff before deploy and cdk deploy --require-approval never only in CI after Checkov passes.
CDK vs raw CloudFormation: CDK adds a build step, construct version churn, and CloudFormation resource limits still apply after synthesis. Worth it when you have 5+ engineers sharing constructs and writing tests. For a single S3 bucket, a YAML template is faster than a TypeScript project.
Amazon dogfoods CDK internally; many AWS Solutions Constructs are L3 patterns extracted from production. Enterprises with Java backend teams often standardize on CDK Java so infra code sits in the same repo and review culture as Spring services — one PR touches app and platform.
CDK vs Terraform
Both are popular; they optimize for different constraints. CDK is AWS-native and developer-ergonomic; Terraform is cloud-agnostic with mature module ecosystem and explicit state management. Many orgs use both: Terraform for networking and shared services, CDK for application teams shipping fast on ECS/Lambda.
| Dimension | AWS CDK | Terraform |
|---|---|---|
| Language | TypeScript, Java, Python, Go, C# | HCL (DSL) — logic via modules and for_each |
| Engine | CloudFormation (synth → deploy) | Provider APIs directly (AWS, GCP, Azure, Datadog, Cloudflare…) |
| State | CFN manages stack state in AWS | Explicit tfstate — you own backup and locking |
| Drift | CFN drift detection per stack | terraform plan refreshes state and diffs it against real resources (plan -refresh-only isolates drift); no built-in AWS drift UI |
| Reuse | Constructs, L3 patterns, internal construct libraries | Modules (Terraform Registry, private registry) |
| Testing | Unit tests on synthesized template (assertions library) | terraform plan in CI; tools like Terratest for integration |
| Multi-cloud | AWS-focused (CDK for Terraform exists but niche) | First-class — one workflow for AWS + SaaS providers |
| Day-0 speed | Fast for devs who already write TypeScript/Java | Fast for ops teams with existing HCL modules |
| Lock-in | CFN resource model + CDK construct versions | HCL portable; provider version pins |
When to pick each
- Choose CDK — AWS-only shop, app teams own infra, heavy use of Lambda/ECS/API Gateway, want IDE support and construct tests, already standardized on TypeScript or Java
- Choose Terraform — multi-cloud or multi-SaaS (Datadog, PagerDuty, Cloudflare), platform team owns shared modules, need fine-grained state separation, large existing HCL investment
- Choose CloudFormation YAML — minimal tooling, compliance requires auditable static templates, Service Catalog products, or teams forbidden from build-time synthesis
- Use both deliberately — platform Terraform provisions VPC, RDS, IAM baselines; product CDK stacks deploy into that VPC via SSM parameters or Vpc.fromLookup
Single tool mandate vs pragmatism: forcing one IaC tool company-wide slows teams that already excel with another. Standardize on patterns (tagging, naming, OIDC deploy roles, policy checks) rather than one syntax. The expensive problem is untagged, unreviewed ClickOps — not whether the template is HCL or TypeScript.
CDK always synthesizes to CloudFormation; it is not an alternative engine. Terraform state lives wherever you configure the backend (an S3 bucket you own, HCP Terraform, and so on); AWS does not manage it for you the way CloudFormation manages stack state. For an org-wide multi-account baseline, think CloudFormation StackSets or Control Tower, not a single Terraform root module without state isolation.
Landing zones
A landing zone is a pre-configured multi-account AWS environment — not a single template, but a governed foundation: account structure, baseline networking, logging, IAM, and guardrails applied before application teams deploy workloads. AWS Control Tower automates much of this for Organizations.
Control Tower — opinionated landing zone
Control Tower is set up from the management account. It creates a Security OU holding the log archive and audit accounts, and you add OUs such as Sandbox and Workloads for everything else. It enables guardrails (called controls in current AWS documentation), implemented as SCPs and Config rules, such as disallowing public S3 buckets or restricting regions. New accounts enrolled in an OU inherit that OU's guardrails.
flowchart TB
  MGMT["Management account\nControl Tower"]
  LOG["Log archive account\nCentral S3 + CloudTrail"]
  AUD["Audit account\nSecurity Hub aggregator"]
  OU_SEC["Security OU"]
  OU_DEV["Sandbox OU"]
  OU_PROD["Workloads OU\nprod / nonprod"]
  MGMT --> LOG
  MGMT --> AUD
  MGMT --> OU_SEC
  MGMT --> OU_DEV
  MGMT --> OU_PROD
  OU_PROD --> ACC1["App account A"]
  OU_PROD --> ACC2["App account B"]
Account vending
Account Factory (the Service Catalog product built into Control Tower) or Account Factory for Terraform (AFT) lets teams request new accounts with pre-wired VPC, IAM roles, and guardrails, with no manual account creation in the console. Account vending pipelines (Terraform or CFN) create the account via the Organizations API, move it to the right OU so it inherits that OU's SCPs, and bootstrap a standard VPC or an empty network for the app team.
SCPs and tag enforcement
| Guardrail type | Mechanism | Example |
|---|---|---|
| Preventive | SCP denies API calls org-wide | Deny ec2:RunInstances without tag CostCenter |
| Detective | AWS Config rule → non-compliant | Required tags Environment, Owner on all EC2 |
| Proactive | CloudFormation hooks evaluate resources before provisioning | Block stack create if template lacks mandatory tags |
| Tag policies | Organizations tag policies standardize key casing and allowed values | Environment must be dev, staging, or prod |
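Example SCP: require cost allocation tags on create
Keys listed under one condition operator are ANDed, so a single Null block naming both tags would deny only requests missing both. Each required tag therefore gets its own statement. The Resource is scoped to instances and DB instances because other resources touched by ec2:RunInstances (subnets, AMIs) never carry request tags.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyCreateWithoutCostCenter",
      "Effect": "Deny",
      "Action": ["ec2:RunInstances", "rds:CreateDBInstance"],
      "Resource": ["arn:aws:ec2:*:*:instance/*", "arn:aws:rds:*:*:db:*"],
      "Condition": { "Null": { "aws:RequestTag/CostCenter": "true" } }
    },
    {
      "Sid": "DenyCreateWithoutEnvironment",
      "Effect": "Deny",
      "Action": ["ec2:RunInstances", "rds:CreateDBInstance"],
      "Resource": ["arn:aws:ec2:*:*:instance/*", "arn:aws:rds:*:*:db:*"],
      "Condition": { "Null": { "aws:RequestTag/Environment": "true" } }
    }
  ]
}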
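When writing tag-enforcement SCPs it helps to reason about how Null conditions combine: keys inside one condition block are ANDed, while a request is denied if any Deny statement matches. The toy evaluator below models just that rule. It is not the IAM policy engine, and its types are hypothetical.

```typescript
type RequestTags = Record<string, string>;

// One Null condition block: "true" means the key must be absent, "false" means present.
type NullCondition = Record<string, "true" | "false">;

// Every key in a single condition block must match (logical AND).
function nullConditionMatches(condition: NullCondition, tags: RequestTags): boolean {
  return Object.keys(condition).every((key) => {
    const tagName = key.replace("aws:RequestTag/", "");
    const isAbsent = tags[tagName] === undefined;
    return condition[key] === (isAbsent ? "true" : "false");
  });
}

// A request is denied when any Deny statement's condition matches (logical OR).
function isDenied(denyStatements: NullCondition[], tags: RequestTags): boolean {
  return denyStatements.some((condition) => nullConditionMatches(condition, tags));
}

// Both keys in one block: denies only when BOTH tags are missing.
const combinedBlock: NullCondition[] = [
  { "aws:RequestTag/CostCenter": "true", "aws:RequestTag/Environment": "true" },
];

// One statement per tag: denies when EITHER tag is missing.
const perTagStatements: NullCondition[] = [
  { "aws:RequestTag/CostCenter": "true" },
  { "aws:RequestTag/Environment": "true" },
];

const onlyCostCenter: RequestTags = { CostCenter: "1234" };

console.log(isDenied(combinedBlock, onlyCostCenter)); // false: request slips through
console.log(isDenied(perTagStatements, onlyCostCenter)); // true: Environment is missing
```

If the intent is "every listed tag is mandatory", the per-tag form is the one that expresses it.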
SCPs are org guardrails: they never grant permissions, they only cap what IAM policies in member accounts can allow, and they do not apply to the management account. Apply deny-public-S3 and deny-unapproved-regions SCPs at the organization root. Keep a Sandbox OU with looser guardrails for experimentation, but never put production data in sandbox accounts.
Large banks run Account Factory for Terraform (AFT) — new accounts arrive with centralized CloudTrail, Config, GuardDuty, and a spoke VPC peered to a shared services hub. Application teams receive an account with guardrails already enforced; their first commit is app IaC, not "enable logging."
IaC patterns
Defining infrastructure in Git is step one. Production maturity means every change flows through PR review, automated policy checks, plan artifacts, and optional GitOps reconciliation — the same rigor as application code.
GitOps for infrastructure
Git is the source of truth. A merge to main triggers CI to plan/apply (push model); PR automation such as Atlantis or Spacelift runs plan and apply from the pull request itself; a controller like Flux TF Controller continuously reconciles drift (pull model). For AWS-native teams, CDK Pipelines or CodePipeline synth → deploy stages provide a managed GitOps-like flow entirely inside AWS.
PR review checklist for IaC
- Does the plan show resource replacement (destructive) on prod resources?
- Are new IAM policies least-privilege? Any "*" actions or resources?
- Security groups: no 0.0.0.0/0 on port 22/3389?
- Encryption enabled? KMS keys referenced correctly?
- Tags present for CostCenter, Environment, Owner?
- State/backend unchanged or intentionally migrated?
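Several checklist items can be turned into automated checks. As one example, the sketch below flags ingress rules that expose SSH or RDP to the whole internet. The IngressRule shape is a hypothetical, simplified view of a security group rule, not a provider or CloudFormation schema.

```typescript
interface IngressRule {
  cidr: string;
  fromPort: number;
  toPort: number;
}

const ADMIN_PORTS = [22, 3389];
const ANYWHERE = ["0.0.0.0/0", "::/0"];

// Flag rules open to the world whose port range covers an admin port.
function openAdminPorts(rules: IngressRule[]): IngressRule[] {
  return rules.filter(
    (rule) =>
      ANYWHERE.indexOf(rule.cidr) !== -1 &&
      ADMIN_PORTS.some((port) => rule.fromPort <= port && port <= rule.toPort),
  );
}

// Hypothetical rules extracted from a plan or template.
const proposedRules: IngressRule[] = [
  { cidr: "0.0.0.0/0", fromPort: 443, toPort: 443 }, // public HTTPS: fine
  { cidr: "0.0.0.0/0", fromPort: 0, toPort: 1024 }, // range covers port 22
  { cidr: "10.0.0.0/16", fromPort: 22, toPort: 22 }, // SSH from inside the VPC only
];

console.log(openAdminPorts(proposedRules)); // only the 0-1024 rule
```

Checking port ranges rather than exact ports matters: a rule for 0-1024 exposes SSH just as much as a rule for port 22.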
Policy-as-code — Checkov and cfn-nag
| Tool | Input | Strength |
|---|---|---|
| Checkov | Terraform, CloudFormation, CDK synth output, Kubernetes | Broad coverage; hundreds of built-in policies; custom YAML policies |
| cfn-nag | CloudFormation JSON/YAML (and CDK after synth) | Deep CFN-specific rules (IAM wildcards, open SG ingress, missing encryption) |
| tfsec / trivy | Terraform HCL | Fast TF-focused scans; integrates with GitHub Advanced Security |
| OPA / Conftest | JSON plan output (custom) | Write your own Rego rules for org-specific standards |
$ cdk synth -q -o cdk.out
$ checkov -d cdk.out --framework cloudformation --quiet --compact
→ CKV_AWS_20: S3 bucket public access: FAILED
$ cfn_nag_scan --input-path cdk.out/OrderServiceStack.template.json
→ W12: IAM policy should not allow * resource
$ terraform plan -out=plan.tfplan
$ terraform show -json plan.tfplan > plan.json
$ checkov -f plan.json --framework terraform_plan
End-to-end deploy pattern — VPC + ECS via GitOps
# .github/workflows/cfn-deploy.yml (excerpt)
      - name: Validate and scan
        run: |
          aws cloudformation validate-template --template-body file://stacks/ecs.yaml
          cfn_nag_scan --input-path stacks/ecs.yaml
      - name: Deploy change set
        env:
          STACK: ${{ inputs.environment }}-orders
          CHANGE_SET: cs-${{ github.sha }}
        run: |
          # CREATE is for a new stack; switch to UPDATE once the stack exists
          aws cloudformation create-change-set --change-set-type CREATE \
            --stack-name "$STACK" --change-set-name "$CHANGE_SET" \
            --template-body file://stacks/ecs.yaml
          aws cloudformation wait change-set-create-complete \
            --stack-name "$STACK" --change-set-name "$CHANGE_SET"
          aws cloudformation execute-change-set \
            --stack-name "$STACK" --change-set-name "$CHANGE_SET"
# atlantis.yaml: PR-driven Terraform
version: 3
projects:
  - name: order-service-prod
    dir: terraform/apps/order-service
    workspace: prod
    autoplan:
      when_modified: ["*.tf", "../modules/**/*.tf"]
    workflow: custom
# Repo-level workflows need allow_custom_workflows: true in the server-side repo config
workflows:
  custom:
    plan:
      steps:
        - init
        - plan
        - run: checkov -d . --framework terraform
    apply:
      steps: [apply]
// CDK Pipelines: self-mutating pipeline in AWS
import { CodePipeline, CodePipelineSource, ShellStep } from 'aws-cdk-lib/pipelines';

// Inside a Stack constructor; OrderServiceStage is a Stage subclass defined elsewhere
const pipeline = new CodePipeline(this, 'Pipeline', {
  pipelineName: 'OrderService',
  synth: new ShellStep('Synth', {
    input: CodePipelineSource.gitHub('myorg/infra', 'main'),
    commands: ['npm ci', 'npx cdk synth'],
  }),
});

pipeline.addStage(new OrderServiceStage(this, 'Prod', { envName: 'prod' }));
- Plan artifacts: store plan.tfplan or the change-set ID with the PR so auditors can prove that what was reviewed is what was applied.
- Shift-left scan: run Checkov/cfn-nag on every PR to catch open SGs before they reach prod.
- OIDC deploy roles: GitHub/GitLab assumes an IAM role, so no long-lived AWS_ACCESS_KEY_ID sits in secrets.
- Drift remediation: schedule terraform plan or CFN drift detection; alert, don't auto-apply surprise fixes.
- Immutable artifacts: version container images and Lambda bundles in ECR/S3; IaC references digests, not :latest.
- Separate blast radius: network state ≠ app state ≠ data state; limit what one apply can destroy.
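The immutable-artifacts rule is easy to check mechanically. The sketch below tests whether an image reference is pinned by digest; the repository names and the digest value are placeholders, not real artifacts.

```typescript
// True when the reference ends in @sha256:<64 hex characters>.
function isPinnedByDigest(imageRef: string): boolean {
  return /@sha256:[0-9a-f]{64}$/.test(imageRef);
}

// Return every reference that is tag-based (or untagged) and therefore mutable.
function unpinnedImages(imageRefs: string[]): string[] {
  return imageRefs.filter((ref) => !isPinnedByDigest(ref));
}

// Placeholder references for illustration only.
const imageRefs: string[] = [
  "123456789012.dkr.ecr.eu-west-1.amazonaws.com/orders:latest",
  "123456789012.dkr.ecr.eu-west-1.amazonaws.com/orders:1.4.2",
  "123456789012.dkr.ecr.eu-west-1.amazonaws.com/orders@sha256:0123456789abcdef0123456789abcdef0123456789abcdef0123456789abcdef",
];

console.log(unpinnedImages(imageRefs)); // the :latest and :1.4.2 references
```

Version tags are flagged too, since a tag can be re-pointed unless the repository enforces tag immutability; a team that enables that setting might relax the check to reject only :latest.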
Auto-applying Terraform on every merge to main without approval gates — one bad count = 0 or refactored module source deletes production databases. Require human approval for prod applies; use plan-only on PRs and apply from a protected environment.
Post CDK synth and Terraform plan output as PR comments (Infracost for cost delta, too). Reviewers approve the diff, not the HCL/TypeScript source — same mental model as reviewing a generated OpenAPI client.
When the scenario asks for consistent multi-account baseline with guardrails, answer AWS Organizations + SCPs + Control Tower (or Landing Zone Accelerator). When it asks for preview before deploy, CloudFormation → change sets; Terraform → plan. Policy-as-code tools are detective at PR time, not a substitute for SCPs.