Infrastructure as Code
ClickOps does not scale and cannot pass audit. This chapter covers IaC principles, Terraform state and modules, security scanning with Checkov and tfsec, Terragrunt for DRY environments, Ansible configuration, Pulumi for typed infrastructure, and GitOps-driven apply pipelines.
Infrastructure as Code Principles
Infrastructure as Code (IaC) treats datacenter resources like application code: versioned, reviewed, tested, and applied by automation—not clicked into existence in a console. The goal is reproducible environments, auditable change history, and drift detection.
Core principles
| Principle | Meaning | Anti-pattern |
|---|---|---|
| Declarative | Describe desired end state | Imperative shell scripts with 200 SSH commands |
| Idempotent | Second apply changes nothing if state matches | Scripts that fail on re-run |
| Immutable | Replace, don't patch servers | SSH + apt upgrade production fleet |
| Git as source of truth | PR review before apply | Terraform apply from laptop |
| Least privilege | CI role applies; humans read | Admin creds in ~/.aws/credentials |
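The declarative and idempotent rows can be illustrated with a toy reconciler. This is a minimal sketch rather than how any real tool is implemented; the resource names and the `plan`/`apply` helpers are hypothetical.

```python
from typing import Dict, List, Tuple

Resource = Dict[str, str]


def plan(desired: Dict[str, Resource], actual: Dict[str, Resource]) -> List[Tuple[str, str]]:
    """Diff desired state against actual state and return (action, name) pairs."""
    actions = []
    for name, spec in sorted(desired.items()):
        if name not in actual:
            actions.append(("create", name))
        elif actual[name] != spec:
            actions.append(("update", name))
    for name in sorted(actual):
        if name not in desired:
            actions.append(("delete", name))
    return actions


def apply(desired: Dict[str, Resource], actual: Dict[str, Resource]) -> List[Tuple[str, str]]:
    """Make actual match desired and return the actions that were carried out."""
    actions = plan(desired, actual)
    for action, name in actions:
        if action == "delete":
            del actual[name]
        else:
            actual[name] = dict(desired[name])
    return actions


# Hypothetical resources: one drifted, one missing, one no longer declared.
desired = {"bucket-logs": {"versioning": "on"}, "queue-jobs": {"fifo": "true"}}
actual = {"bucket-logs": {"versioning": "off"}, "old-topic": {}}

first = apply(desired, actual)   # converges actual onto desired
second = apply(desired, actual)  # nothing left to do
```

The caller only describes the end state; the second `apply` returns an empty list because state already matches, which is the property the "Idempotent" row asks for.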
IaC in the delivery lifecycle
IaC runs in three places: platform foundation (VPC, EKS, IAM), cluster add-ons (ingress, monitoring), and application dependencies (RDS, S3, SQS). Each layer has different change velocity and blast radius—split state files accordingly.
flowchart TB
  subgraph foundation["Foundation state (quarterly)"]
    VPC["VPC / subnets"]
    EKS["EKS / node groups"]
    IAM["IAM roles"]
  end
  subgraph platform["Platform state (weekly)"]
    ING["Ingress / DNS"]
    MON["Monitoring stack"]
    POL["Policy engines"]
  end
  subgraph app["App state (daily)"]
    RDS["RDS / caches"]
    S3["S3 buckets"]
    SQS["Queues"]
  end
  Git["Git PR"] --> CI["CI plan + scan"]
  CI --> Apply["Terraform apply"]
  Apply --> foundation
  Apply --> platform
  Apply --> app
Every manual console change is technical debt with an audit trail gap. If it is not in git, it does not exist for disaster recovery, compliance evidence, or onboarding.
Full declarative coverage takes quarters. Pragmatic teams declare new resources in IaC immediately and schedule import of legacy resources—never the reverse.
Terraform & State Management
Terraform's workflow—init → plan → apply—maps HCL declarations to provider APIs. State is Terraform's memory of real-world IDs; corrupt or leaked state is a production incident.
Project layout
Mature repos separate modules/ (reusable building blocks) from environments/ (root modules per env). Each root module has its own backend key.
# Root module calling a reusable module
module "eks_node_group" {
  source          = "../../modules/eks-node-group"
  cluster_name    = var.cluster_name
  node_group_name = "${var.cluster_name}-${var.pool_name}"
  subnet_ids      = var.private_subnet_ids
  instance_types  = var.instance_types
  desired_size    = var.desired_size
  min_size        = var.min_size
  max_size        = var.max_size
  labels = {
    nodepool = var.pool_name
  }
  taints = var.taints
  tags = merge(var.common_tags, {
    ManagedBy = "terraform"
  })
}

# Backend and provider configuration for the production EKS root module
terraform {
  required_version = ">= 1.6.0"
  backend "s3" {
    bucket         = "acme-terraform-state"
    key            = "production/eks/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "acme-terraform-locks"
  }
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}

provider "aws" {
  region = var.aws_region
  default_tags {
    tags = {
      Environment = var.environment
      Repository  = "github.com/acme/platform-iac"
    }
  }
}
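The rule that each root module owns its own backend key can be checked mechanically. The sketch below assumes keys follow the `production/eks/terraform.tfstate` shape used above; the directory names and both helper functions are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def backend_key(environment: str, stack: str) -> str:
    """Build a state key shaped like production/eks/terraform.tfstate."""
    return f"{environment}/{stack}/terraform.tfstate"


def colliding_keys(root_modules: Dict[str, str]) -> List[str]:
    """Return backend keys claimed by more than one root module directory."""
    counts = Counter(root_modules.values())
    return sorted(key for key, n in counts.items() if n > 1)


# Hypothetical repo scan result: root module directory -> configured backend key.
root_modules = {
    "environments/production/eks": backend_key("production", "eks"),
    "environments/production/rds": backend_key("production", "rds"),
    # Copy-paste mistake: staging reuses the production key.
    "environments/staging/eks": backend_key("production", "eks"),
}

collisions = colliding_keys(root_modules)
```

Two root modules writing to one key would overwrite each other's state, so a check like this is a candidate for a pre-merge lint.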
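# GitHub Actions: plan on pull requests and main, apply on main
name: Terraform
on:
  pull_request:
    paths: ['terraform/**']
  push:
    branches: [main]
    paths: ['terraform/**']
permissions:
  id-token: write   # needed for OIDC role assumption
  contents: read
jobs:
  plan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789012:role/github-terraform-plan
          aws-region: us-east-1
      - uses: hashicorp/setup-terraform@v3
      - run: terraform init && terraform plan -no-color -out=plan.tfplan
        working-directory: terraform/environments/production
      - uses: actions/upload-artifact@v4
        with:
          name: tfplan
          path: terraform/environments/production/plan.tfplan
  apply:
    if: github.ref == 'refs/heads/main'
    needs: plan
    runs-on: ubuntu-latest
    environment: production-iac
    steps:
      # Every job starts on a fresh runner: check out the code, authenticate,
      # and fetch the saved plan before applying it.
      - uses: actions/checkout@v4
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          # Assumed role name for a write-capable apply role; substitute your own.
          role-to-assume: arn:aws:iam::123456789012:role/github-terraform-apply
          aws-region: us-east-1
      - uses: hashicorp/setup-terraform@v3
      - uses: actions/download-artifact@v4
        with:
          name: tfplan
          path: terraform/environments/production
      - run: terraform init -input=false && terraform apply -auto-approve plan.tfplan
        working-directory: terraform/environments/production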
# GitLab CI: plan per environment, manual apply on main
include:
  - template: Terraform/Base.gitlab-ci.yml
.terraform-plan:
  stage: build
  script:
    - cd terraform/environments/$CI_ENVIRONMENT_NAME
    - terraform init -input=false
    - terraform plan -out=plan.tfplan
  artifacts:
    paths:
      - terraform/environments/$CI_ENVIRONMENT_NAME/plan.tfplan
terraform-plan-production:
  extends: .terraform-plan
  environment: production
terraform-apply-production:
  stage: deploy
  environment: production
  script:
    - cd terraform/environments/production
    # The .terraform directory is not part of the plan artifact, so initialise again.
    - terraform init -input=false
    - terraform apply -auto-approve plan.tfplan
  needs: [terraform-plan-production]
  rules:
    - if: $CI_COMMIT_BRANCH == "main"
      when: manual
Terraform builds a dependency graph from resource references. plan walks the graph in parallel where safe, computing a diff against state. Provider plugins translate each resource change into cloud API calls.
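The graph walk described above can be sketched with Python's standard `graphlib`. This is an illustration of dependency ordering, not Terraform's actual scheduler; the resource addresses and the `apply_waves` helper are hypothetical.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set


def apply_waves(depends_on: Dict[str, Set[str]]) -> List[List[str]]:
    """Group resources into waves; every member of a wave has all dependencies met."""
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()  # raises graphlib.CycleError if the references form a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)       # these could be processed in parallel
        sorter.done(*ready)       # unblock whatever depended on them
    return waves


# Hypothetical resources: each key maps to the addresses it references.
depends_on = {
    "aws_vpc.main": set(),
    "aws_subnet.private": {"aws_vpc.main"},
    "aws_security_group.nodes": {"aws_vpc.main"},
    "aws_eks_node_group.pool": {"aws_subnet.private", "aws_security_group.nodes"},
}

waves = apply_waves(depends_on)
```

The subnet and security group land in the same wave because neither references the other, which is what "in parallel where safe" means in practice.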
State operations & blast radius
| Operation | Risk | Mitigation |
|---|---|---|
| terraform import | Wrong ID corrupts state | Import in dev stack first; snapshot state before import |
| moved blocks | Wrong from/to address plans a destroy/create instead of a rename | Use Terraform 1.1+ moved syntax in the same PR as the rename; confirm the plan reports a move, not a replace |
| terraform state rm | Orphaned cloud resource | Pair with console verification; document in PR |
| Workspace per env | State collision if shared backend key | Separate backend keys per environment root module |
Module versioning
Pin module sources to semver git refs—not ref=main. source = "git::https://github.com/acme/tf-modules.git//eks?ref=v2.4.1" lets consumers upgrade deliberately after reading module CHANGELOG.
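A pinning rule like this is easy to lint. The sketch below is one possible check, assuming module sources have already been collected as strings; the regular expressions and helper names are illustrative, not part of any Terraform tooling.

```python
import re
from typing import Optional

# Accepts tags such as v2.4.1 or 2.4.1; branch names like main do not match.
SEMVER_REF = re.compile(r"^v?\d+\.\d+\.\d+$")


def git_ref(source: str) -> Optional[str]:
    """Extract the ref query parameter from a git module source, or None if absent."""
    match = re.search(r"[?&]ref=([^&]+)", source)
    return match.group(1) if match else None


def is_pinned(source: str) -> bool:
    """True only when the source carries a ref that looks like a semver tag."""
    ref = git_ref(source)
    return ref is not None and SEMVER_REF.match(ref) is not None


pinned = "git::https://github.com/acme/tf-modules.git//eks?ref=v2.4.1"
floating = "git::https://github.com/acme/tf-modules.git//eks?ref=main"
unpinned = "git::https://github.com/acme/tf-modules.git//eks"
```

Running `is_pinned` over every module source in a PR gives a cheap gate against floating refs.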
# Rename a resource address without destroying the instance
moved {
  from = aws_instance.web
  to   = aws_instance.web_v2
}

resource "aws_instance" "web_v2" {
  ami           = var.ami_id
  instance_type = "m6i.large"
  subnet_id     = var.subnet_id
  tags = {
    Name = "web-${var.environment}"
  }
}
Terraform Security & Scanning
IaC misconfiguration is a leading cloud breach vector—public S3 buckets, open security groups, overly broad IAM. Scan before apply with Checkov, tfsec, or KICS; enforce with OPA policy in CI.
Common misconfigurations
| Finding | Severity | Fix |
|---|---|---|
| S3 acl = "public-read" | CRITICAL | Block public access; use bucket policy with IAM principal |
| Security group 0.0.0.0/0 on port 22 | HIGH | Bastion or SSM Session Manager only |
| IAM wildcard action on wildcard resource | HIGH | Scoped policy per workload IRSA role |
| Unencrypted RDS instance | MEDIUM | storage_encrypted = true + KMS CMK |
# Private, KMS-encrypted artifact bucket
resource "aws_s3_bucket" "artifacts" {
  bucket = var.bucket_name
}

resource "aws_s3_bucket_public_access_block" "artifacts" {
  bucket                  = aws_s3_bucket.artifacts.id
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}

resource "aws_s3_bucket_server_side_encryption_configuration" "artifacts" {
  bucket = aws_s3_bucket.artifacts.id
  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm     = "aws:kms"
      kms_master_key_id = var.kms_key_arn
    }
    bucket_key_enabled = true
  }
}
# GitHub Actions: Checkov and tfsec on every pull request
name: IaC scan
on: [pull_request]
jobs:
  checkov:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: bridgecrewio/checkov-action@v12
        with:
          directory: terraform/
          framework: terraform
          soft_fail: false
      - uses: aquasecurity/tfsec-action@<version>   # pin to a released tag
        with:
          working_directory: terraform/
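# GitLab CI: Checkov gate
iac-scan:
  stage: test
  image:
    name: bridgecrew/checkov:latest
    entrypoint: [""]
  script:
    # Checkov exits non-zero on findings by default; --soft-fail is a switch
    # that takes no value, so leave it off to keep the gate blocking.
    - checkov -d terraform/ --framework terraform
  rules:
    - changes:
        - terraform/**/*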
Run scanners on planned JSON (terraform show -json plan.tfplan) for accurate context—static HCL scan misses computed values.
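A custom check over the plan JSON might look like the sketch below. It assumes the `resource_changes[].change.after` layout that `terraform show -json` emits and the `ingress` attributes of `aws_security_group`; the sample plan and the `open_ssh_findings` helper are made up for illustration, and a real gate would lean on Checkov or Conftest rather than hand-rolled rules.

```python
import json
from typing import Any, Dict, List


def open_ssh_findings(plan: Dict[str, Any]) -> List[str]:
    """Return addresses of planned security groups exposing port 22 to 0.0.0.0/0."""
    findings = []
    for rc in plan.get("resource_changes", []):
        if rc.get("type") != "aws_security_group":
            continue
        after = rc.get("change", {}).get("after") or {}  # after is null on delete
        for rule in after.get("ingress") or []:
            low = rule.get("from_port") or 0
            high = rule.get("to_port") or 0
            if low <= 22 <= high and "0.0.0.0/0" in (rule.get("cidr_blocks") or []):
                findings.append(rc["address"])
                break  # one finding per resource is enough
    return findings


# Hypothetical excerpt of plan JSON with one open and one internal rule.
sample_plan = json.loads("""
{"resource_changes": [
  {"address": "aws_security_group.bastion", "type": "aws_security_group",
   "change": {"actions": ["create"], "after": {"ingress": [
     {"from_port": 22, "to_port": 22, "protocol": "tcp", "cidr_blocks": ["0.0.0.0/0"]}]}}},
  {"address": "aws_security_group.internal", "type": "aws_security_group",
   "change": {"actions": ["create"], "after": {"ingress": [
     {"from_port": 22, "to_port": 22, "protocol": "tcp", "cidr_blocks": ["10.0.0.0/8"]}]}}}
]}
""")

findings = open_ssh_findings(sample_plan)
```

Because the input is the resolved plan, a CIDR passed in through a variable is visible here even though a static HCL scan would only see `var.allowed_cidrs`.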
Scanner comparison
| Tool | Strength | CI integration |
|---|---|---|
| Checkov | Broad multi-framework (TF, K8s, Helm) | GitHub Action, GitLab, SARIF |
| tfsec | Fast Terraform-only rules | CLI, pre-commit |
| KICS | 1000+ queries across IaC | CI gate, IDE plugins |
| OPA/Conftest | Custom org policy in Rego | Plan JSON + HCL |
IAM policy as code
Generate least-privilege IAM from Terraform module outputs—never hand-write "*" policies. IRSA roles for EKS should scope to single S3 prefix and single KMS key.
# Least-privilege policy document for the payments API
data "aws_iam_policy_document" "payments_api" {
  statement {
    sid    = "ReadArtifacts"
    effect = "Allow"
    actions = [
      "s3:GetObject",
      "s3:ListBucket"
    ]
    resources = [
      aws_s3_bucket.artifacts.arn,
      "${aws_s3_bucket.artifacts.arn}/payments/*"
    ]
  }
  statement {
    sid       = "DecryptWithKms"
    effect    = "Allow"
    actions   = ["kms:Decrypt"]
    resources = [aws_kms_key.artifacts.arn]
  }
}

# IRSA role: only the payments-api service account in the production namespace may assume it
resource "aws_iam_role" "payments_api" {
  name = "payments-api-${var.environment}"
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect = "Allow"
      Principal = {
        Federated = aws_iam_openid_connect_provider.eks.arn
      }
      Action = "sts:AssumeRoleWithWebIdentity"
      Condition = {
        StringEquals = {
          # Key and value stay on one line: a newline after "=" ends the object item.
          "${replace(aws_iam_openid_connect_provider.eks.url, "https://", "")}:sub" = "system:serviceaccount:production:payments-api"
        }
      }
    }]
  })
}
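The "no wildcard on wildcard" rule can be enforced on rendered policy JSON. This is a rough sketch under the assumption that policies are available as parsed dictionaries; the sample policies, bucket ARN, and helper names are hypothetical.

```python
from typing import Any, Dict, List


def as_list(value: Any) -> List[Any]:
    """IAM JSON allows a string or a list for Action and Resource; normalise to a list."""
    return value if isinstance(value, list) else [value]


def wildcard_statements(policy: Dict[str, Any]) -> List[str]:
    """Return Sids of Allow statements pairing a wildcard action with a wildcard resource."""
    flagged = []
    for stmt in as_list(policy.get("Statement", [])):
        if stmt.get("Effect") != "Allow":
            continue
        actions = as_list(stmt.get("Action", []))
        resources = as_list(stmt.get("Resource", []))
        broad_action = any(a == "*" or a.endswith(":*") for a in actions)
        if broad_action and "*" in resources:
            flagged.append(stmt.get("Sid", "<no sid>"))
    return flagged


# Hypothetical policies: one scoped like the payments example, one hand-written and broad.
scoped = {"Statement": [{
    "Sid": "ReadArtifacts", "Effect": "Allow",
    "Action": ["s3:GetObject", "s3:ListBucket"],
    "Resource": ["arn:aws:s3:::acme-artifacts", "arn:aws:s3:::acme-artifacts/payments/*"],
}]}
broad = {"Statement": [
    {"Sid": "Everything", "Effect": "Allow", "Action": "*", "Resource": "*"},
    {"Sid": "AllS3", "Effect": "Allow", "Action": ["s3:*"], "Resource": ["*"]},
]}
```

A service-level wildcard such as `s3:*` is treated as broad here; whether that threshold is right for your organisation is a policy decision, not something the source prescribes.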
Terragrunt & DRY Environments
Terragrunt wraps Terraform with DRY backends, remote state generation, and explicit dependencies between stacks—reducing copy-paste across dev/staging/production folders.
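# root.hcl: shared by every stack through include
locals {
  # path_relative_to_include() is e.g. "production/vpc" for production/vpc/terragrunt.hcl,
  # so its first segment is the environment. get_terragrunt_dir() would point at the
  # child stack directory and yield "vpc" instead.
  environment = split("/", path_relative_to_include())[0]
  common_tags = {
    Environment = local.environment
    ManagedBy   = "terragrunt"
  }
}

remote_state {
  backend = "s3"
  generate = {
    path      = "backend.tf"
    if_exists = "overwrite_terragrunt"
  }
  config = {
    bucket         = "acme-terraform-state"
    # The relative path already starts with the environment name.
    key            = "${path_relative_to_include()}/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "acme-terraform-locks"
  }
}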
# Stack-level terragrunt.hcl for a stack that consumes the VPC outputs
# (for example production/eks/terragrunt.hcl; a stack cannot depend on itself)
include "root" {
  path = find_in_parent_folders("root.hcl")
}

terraform {
  source = "../../../modules//eks"
}

dependency "network" {
  config_path = "../vpc"
  # Mocks let plan and validate run before the VPC stack has been applied.
  mock_outputs = {
    private_subnet_ids = ["subnet-mock-a", "subnet-mock-b"]
  }
  mock_outputs_allowed_terraform_commands = ["plan", "validate"]
}

inputs = {
  vpc_cidr           = "10.20.0.0/16"
  private_subnet_ids = dependency.network.outputs.private_subnet_ids
  cluster_name       = "acme-${basename(get_terragrunt_dir())}"
}
Use terragrunt run-all plan in CI to validate the entire dependency graph—not isolated modules that assume outputs exist.
Terragrunt's dependency blocks encode stack ordering in code—replace wiki diagrams that say "apply VPC before EKS" with machine-verifiable graphs.
Directory layout example
platform-iac/
├── root.hcl
├── _envcommon/
│   ├── eks.hcl
│   └── vpc.hcl
├── production/
│   ├── vpc/terragrunt.hcl
│   ├── eks/terragrunt.hcl
│   └── rds/terragrunt.hcl
└── staging/
    ├── vpc/terragrunt.hcl
    └── eks/terragrunt.hcl
CI run-all
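In CI, terragrunt run-all plan --terragrunt-non-interactive from production/ validates the full stack. Terraform's own -parallelism default of 10 applies inside each unit, but in Terragrunt releases that use the --terragrunt-* flag names, run-all does not cap how many units run at once. Set --terragrunt-parallelism explicitly to avoid API rate limits on large estates.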
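# GitHub Actions: plan every production stack on pull requests
name: Terragrunt plan all
on:
  pull_request:
    paths: ['production/**', 'staging/**']
permissions:
  id-token: write   # needed for OIDC role assumption
  contents: read
jobs:
  plan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789012:role/terraform-plan
          aws-region: us-east-1
      # Assumes terraform and terragrunt are already installed on the runner.
      - run: |
          cd production
          terragrunt run-all plan --terragrunt-non-interactive -no-color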
# GitLab CI equivalent
terragrunt-plan:
  stage: build
  script:
    - cd production
    - terragrunt run-all plan --terragrunt-non-interactive -no-color
  rules:
    - changes:
        - production/**/*
        - staging/**/*
Ansible & Configuration Management
Ansible complements Terraform: Terraform provisions; Ansible configures—packages, sysctl, systemd units, certificate rotation on VMs or bare metal when immutable image baking is not yet available.
---
- name: Harden EKS bastion hosts
  hosts: bastions
  become: true
  vars:
    ssh_allowed_cidrs:
      - 10.0.0.0/8
  roles:
    - role: devsecops.ssh_hardening
    - role: devsecops.auditd
  tasks:
    - name: Ensure fail2ban running
      ansible.builtin.service:
        name: fail2ban
        state: started
        enabled: true
    - name: Restrict SSH users
      ansible.builtin.lineinfile:
        path: /etc/ssh/sshd_config
        regexp: '^AllowUsers'
        line: 'AllowUsers deploy'
      notify: Restart sshd
  handlers:
    - name: Restart sshd
      ansible.builtin.service:
        name: sshd
        state: restarted
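# GitHub Actions: lint and dry-run the playbooks
name: Ansible lint and check
on:
  pull_request:
    paths: ['ansible/**']
jobs:
  ansible:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install ansible ansible-lint
      - run: ansible-lint ansible/playbooks/
      # --check only exercises hosts in the "bastions" group; without an inventory
      # that defines it, the play matches no hosts and passes vacuously.
      - run: ansible-playbook ansible/playbooks/harden.yml --check --diff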
# GitLab CI equivalent
ansible-lint:
  stage: test
  image: cytopia/ansible-lint:latest
  script:
    - ansible-lint ansible/playbooks/
  rules:
    - changes:
        - ansible/**/*
Ansible on live servers is imperative drift correction. Prefer golden AMIs or Ignition for Kubernetes nodes; reserve Ansible for brownfield migration windows.
Pulumi & General-Purpose Languages
Pulumi lets you define infrastructure in TypeScript, Python, Go, or C#—sharing types, unit tests, and IDE refactoring with application code. State and engine concepts mirror Terraform.
import * as aws from "@pulumi/aws";
import * as eks from "@pulumi/eks";
import * as pulumi from "@pulumi/pulumi";

const config = new pulumi.Config();
const clusterName = config.require("clusterName");

const cluster = new eks.Cluster(clusterName, {
  version: "1.29",
  instanceType: "m6i.large",
  desiredCapacity: 3,
  minSize: 2,
  maxSize: 6,
  tags: {
    Environment: pulumi.getStack(),
    ManagedBy: "pulumi",
  },
});

export const kubeconfig = cluster.kubeconfig;
export const clusterArn = cluster.core.cluster.arn;
When to choose Pulumi
- Platform team already ships TypeScript libraries for tagging, naming, compliance
- Complex conditionals are unreadable in HCL count/for_each nests
- You want native unit tests in CI (pulumi.runtime.setMocks with an ordinary test runner such as Mocha or Jest)
Pulumi vs Terraform decision matrix
| Factor | Terraform | Pulumi |
|---|---|---|
| Ecosystem | Largest module registry | Growing; wrap TF providers |
| Language | HCL (DSL) | TS/Python/Go/C# |
| Policy | Sentinel / OPA on plan | CrossGuard in-language |
| State | S3 + DynamoDB standard | Pulumi Cloud or self-hosted |
Pulumi CrossGuard policies run at preview time—similar to OPA but authored in the same language as infrastructure. One toolchain for teams that live in IDE refactoring tools.
GitOps for Infrastructure
GitOps for infrastructure means the merged commit triggers plan/apply—not a human running Terraform locally. Atlantis, Terraform Cloud, and VCS-driven workflows close the loop: PR shows plan diff, merge applies, drift detection alerts on console edits.
sequenceDiagram
  participant Eng as Engineer
  participant Git as Git PR
  participant CI as CI / Atlantis
  participant TF as Terraform
  participant Cloud as Cloud API
  Eng->>Git: HCL change + module bump
  Git->>CI: webhook
  CI->>TF: plan
  TF-->>Git: plan comment on PR
  Eng->>Git: approve + merge
  CI->>TF: apply
  TF->>Cloud: reconcile resources
  Cloud-->>CI: apply success + state upload
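# Atlantis server-side repo config
repos:
  - id: github.com/acme/platform-iac
    apply_requirements: [approved, mergeable]
    workflow: terragrunt
    allowed_overrides: [workflow]
    pre_workflow_hooks:
      - run: terragrunt hclfmt --check
workflows:
  terragrunt:
    plan:
      steps:
        # Atlantis plans one project directory at a time and expects that project's
        # plan at $PLANFILE, so plan the single stack; run-all would make every
        # stack write to the same file. Terragrunt runs init itself when needed.
        - run: terragrunt plan -input=false -no-color -out=$PLANFILE
    apply:
      steps:
        - run: terragrunt apply -input=false -no-color $PLANFILE
# Atlantis runs on self-hosted runner with IRSA
# PR comments: atlantis plan / atlantis apply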
# GitHub Actions placeholder: the Atlantis server reacts to the PR comment webhook
name: Atlantis trigger
on:
  issue_comment:
    types: [created]
jobs:
  atlantis:
    if: contains(github.event.comment.body, 'atlantis')
    runs-on: [self-hosted, atlantis]
    steps:
      - run: echo "Atlantis server handles plan/apply via webhook"
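# GitLab CI: queue a Terraform Cloud run on main
trigger-tfc-run:
  stage: deploy
  script:
    - |
      # The runs API requires a workspace relationship. TFC_WORKSPACE_ID is an assumed
      # CI/CD variable holding the ws-... ID. Double quotes let the shell expand
      # $CI_COMMIT_SHA, which single quotes would pass through literally.
      curl -sf -X POST "https://app.terraform.io/api/v2/runs" \
        -H "Authorization: Bearer $TF_API_TOKEN" \
        -H "Content-Type: application/vnd.api+json" \
        -d "{\"data\":{\"type\":\"runs\",\"attributes\":{\"message\":\"GitLab $CI_COMMIT_SHA\"},\"relationships\":{\"workspace\":{\"data\":{\"type\":\"workspaces\",\"id\":\"$TFC_WORKSPACE_ID\"}}}}}"
  rules:
    - if: $CI_COMMIT_BRANCH == "main"
      changes:
        - terraform/**/*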
Drift detection
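Schedule terraform plan -detailed-exitcode on cron. Exit code 0 means no changes, 1 means the plan itself failed, and 2 means the plan succeeded with pending changes. On an unchanged commit, exit code 2 is drift: someone edited the console, or a failed apply left partial state. Alert platform on-call; never auto-apply drift fixes without human review.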
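A scheduled job has to tell "drift" apart from "plan failed", because both are non-zero exits. The wrapper below is a minimal sketch; the function names and status strings are hypothetical, and it assumes `terraform` is on the PATH with backend credentials already configured.

```python
import subprocess


def classify_plan_exit(code: int) -> str:
    """Map a terraform plan -detailed-exitcode status to a drift verdict."""
    if code == 0:
        return "in-sync"   # plan succeeded, no changes
    if code == 2:
        return "drift"     # plan succeeded, changes pending
    return "error"         # plan itself failed (exit code 1 or anything unexpected)


def detect_drift(workdir: str) -> str:
    """Run a read-only plan in workdir and classify the result."""
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false", "-no-color"],
        cwd=workdir,
        capture_output=True,
        text=True,
    )
    return classify_plan_exit(result.returncode)
```

A cron job could call `detect_drift("terraform/environments/production")` and page on "drift" while treating "error" as a separate, lower-urgency alert about the check itself.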
Operational checklist
- Plan artifact stored per PR for audit.
- Checkov/tfsec gate before merge.
- Apply only from CI with OIDC—no local admin applies.
- State bucket versioning + MFA delete protection.
- Break-glass console access logged and reconciled into IaC within 48h.
Auto-apply on merge without scan gates lets a typo delete production databases. Minimum bar: plan artifact, Checkov pass, peer approval, then apply in protected environment.
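The minimum bar can be written down as an explicit gate. The sketch below encodes the four conditions named above and adds one assumption of its own: a separate acknowledgement whenever the plan JSON contains a delete, since replacements also show up as a delete action. The `ChangeRequest` fields and the sample plan are hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


def planned_deletes(plan: Dict[str, Any]) -> List[str]:
    """Addresses whose planned actions include delete (covers destroy and replace)."""
    return [
        rc["address"]
        for rc in plan.get("resource_changes", [])
        if "delete" in rc.get("change", {}).get("actions", [])
    ]


@dataclass
class ChangeRequest:
    plan_artifact_stored: bool
    scan_passed: bool
    peer_approved: bool
    protected_environment: bool
    deletes_acknowledged: bool = False  # assumed extra sign-off, not from the source


def blockers(change: ChangeRequest, plan: Dict[str, Any]) -> List[str]:
    """Return the reasons apply must not run; an empty list means the gate is open."""
    reasons = []
    if not change.plan_artifact_stored:
        reasons.append("plan artifact missing")
    if not change.scan_passed:
        reasons.append("security scan failed")
    if not change.peer_approved:
        reasons.append("no peer approval")
    if not change.protected_environment:
        reasons.append("apply job is not in a protected environment")
    deletes = planned_deletes(plan)
    if deletes and not change.deletes_acknowledged:
        reasons.append("unacknowledged deletes: " + ", ".join(deletes))
    return reasons


# Hypothetical plan in which a typo forces the database to be replaced.
plan = {"resource_changes": [
    {"address": "aws_db_instance.orders", "change": {"actions": ["delete", "create"]}},
    {"address": "aws_s3_bucket.artifacts", "change": {"actions": ["no-op"]}},
]}
ready = ChangeRequest(True, True, True, True)
```

Even with every checkbox ticked, `blockers(ready, plan)` still reports the database, which is the failure mode the paragraph above warns about.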
GitOps is not only Kubernetes manifests—foundational IaC belongs in the same review culture. Platform SREs and security architects review HCL like application PRs.
Crossplane & Kubernetes-native IaC
Crossplane extends Kubernetes with CRDs for cloud resources (RDSInstance, Bucket) that controllers reconcile like any Deployment. Platform teams publish Compositions (golden paths); app teams request infrastructure by creating a claim against a composite resource, without writing raw Terraform.
| Approach | Best when | Trade-off |
|---|---|---|
| Terraform + Atlantis | Mature platform team, multi-cloud | Separate toolchain from K8s GitOps |
| Terraform Cloud | Want SaaS state + RBAC | Vendor lock-in, per-resource pricing |
| Crossplane | K8s-native platform, self-service | Steeper K8s expertise required |
| Pulumi | Typed infra in app languages | Smaller module ecosystem |
Interview preparation
- Explain state locking—why DynamoDB + S3 together
- Describe blast radius of terraform apply on shared VPC module
- Compare drift detection vs GitOps reconciliation loop
- Walk through securing CI OIDC role for Terraform apply
When asked "how do you manage infrastructure?", lead with git PR workflow → plan in CI → security scan → approved apply → drift detection cron—not "we use Terraform" alone.