Infrastructure as Code

ClickOps does not scale and cannot pass audit. This chapter covers IaC principles, Terraform state and modules, security scanning with Checkov and tfsec, Terragrunt for DRY environments, Ansible configuration, Pulumi for typed infrastructure, and GitOps-driven apply pipelines.

Infrastructure as Code Principles

Infrastructure as Code (IaC) treats datacenter resources like application code: versioned, reviewed, tested, and applied by automation—not clicked into existence in a console. The goal is reproducible environments, auditable change history, and drift detection.

Core principles

| Principle | Meaning | Anti-pattern |
|---|---|---|
| Declarative | Describe desired end state | Imperative shell scripts with 200 SSH commands |
| Idempotent | Second apply changes nothing if state matches | Scripts that fail on re-run |
| Immutable | Replace, don't patch servers | SSH + apt upgrade production fleet |
| Git as source of truth | PR review before apply | Terraform apply from laptop |
| Least privilege | CI role applies; humans read | Admin creds in ~/.aws/credentials |
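
The declarative and idempotent rows can be made concrete with a toy reconciler. This is a minimal sketch under stated assumptions, not how Terraform is implemented: `InfraState`, `planChanges`, and `applyChanges` are hypothetical names, and resources are reduced to flat string attributes.

```typescript
// Toy model: each resource is a name mapped to flat string attributes.
type ResourceAttributes = Record<string, string>;
type InfraState = Record<string, ResourceAttributes>;

interface PlannedAction {
  action: "create" | "update" | "delete";
  name: string;
}

// Two attribute maps match when they have the same keys and values.
function sameAttributes(a: ResourceAttributes, b: ResourceAttributes): boolean {
  const aKeys = Object.keys(a);
  if (aKeys.length !== Object.keys(b).length) return false;
  return aKeys.every((key) => a[key] === b[key]);
}

// Declarative: the caller states the desired end state; the diff is derived.
function planChanges(desired: InfraState, actual: InfraState): PlannedAction[] {
  const changes: PlannedAction[] = [];
  for (const name of Object.keys(desired).sort()) {
    const want = desired[name];
    const have = actual[name];
    if (want === undefined) continue;
    if (have === undefined) {
      changes.push({ action: "create", name: name });
    } else if (!sameAttributes(want, have)) {
      changes.push({ action: "update", name: name });
    }
  }
  for (const name of Object.keys(actual).sort()) {
    if (desired[name] === undefined) {
      changes.push({ action: "delete", name: name });
    }
  }
  return changes;
}

// Returns the new actual state; the inputs are not mutated.
function applyChanges(
  desired: InfraState,
  actual: InfraState,
  changes: PlannedAction[]
): InfraState {
  const next: InfraState = {};
  for (const name of Object.keys(actual)) {
    const attrs = actual[name];
    if (attrs !== undefined) next[name] = attrs;
  }
  for (const change of changes) {
    const attrs = desired[change.name];
    if (change.action === "delete" || attrs === undefined) {
      delete next[change.name];
    } else {
      next[change.name] = attrs;
    }
  }
  return next;
}
```

Idempotency here means that planning again after `applyChanges` yields an empty list: once actual state matches desired state, a second apply has nothing to do.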

IaC in the delivery lifecycle

IaC runs in three places: platform foundation (VPC, EKS, IAM), cluster add-ons (ingress, monitoring), and application dependencies (RDS, S3, SQS). Each layer has different change velocity and blast radius—split state files accordingly.

flowchart TB
  subgraph foundation["Foundation state (quarterly)"]
    VPC["VPC / subnets"]
    EKS["EKS / node groups"]
    IAM["IAM roles"]
  end
  subgraph platform["Platform state (weekly)"]
    ING["Ingress / DNS"]
    MON["Monitoring stack"]
    POL["Policy engines"]
  end
  subgraph app["App state (daily)"]
    RDS["RDS / caches"]
    S3["S3 buckets"]
    SQS["Queues"]
  end
  Git["Git PR"] --> CI["CI plan + scan"]
  CI --> Apply["Terraform apply"]
  Apply --> foundation
  Apply --> platform
  Apply --> app
🏗️ IaC

Every manual console change is technical debt with an audit trail gap. If it is not in git, it does not exist for disaster recovery, compliance evidence, or onboarding.

⚖️ Trade-off

Full declarative coverage takes quarters. Pragmatic teams declare new resources in IaC immediately and schedule import of legacy resources—never the reverse.

Terraform & State Management

Terraform's workflow—init → plan → apply—maps HCL declarations to provider APIs. State is Terraform's memory of real-world IDs; corrupt or leaked state is a production incident.

Project layout

Mature repos separate modules/ (reusable building blocks) from environments/ (root modules per env). Each root module has its own backend key.

hcl — EKS node group module
module "eks_node_group" {
  source = "../../modules/eks-node-group"

  cluster_name    = var.cluster_name
  node_group_name = "${var.cluster_name}-${var.pool_name}"
  subnet_ids      = var.private_subnet_ids
  instance_types  = var.instance_types
  desired_size    = var.desired_size
  min_size        = var.min_size
  max_size        = var.max_size

  labels = {
    nodepool = var.pool_name
  }

  taints = var.taints

  tags = merge(var.common_tags, {
    ManagedBy = "terraform"
  })
}
hcl — remote state backend
terraform {
  required_version = ">= 1.6.0"

  backend "s3" {
    bucket         = "acme-terraform-state"
    key            = "production/eks/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "acme-terraform-locks"
  }

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}

provider "aws" {
  region = var.aws_region

  default_tags {
    tags = {
      Environment = var.environment
      Repository  = "github.com/acme/platform-iac"
    }
  }
}
.github/workflows/terraform.yml
name: Terraform
on:
  pull_request:
    paths: ['terraform/**']
  push:
    branches: [main]
    paths: ['terraform/**']
permissions:
  id-token: write
  contents: read
jobs:
  plan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789012:role/github-terraform-plan
          aws-region: us-east-1
      - uses: hashicorp/setup-terraform@v3
      - run: terraform init && terraform plan -no-color -out=plan.tfplan
        working-directory: terraform/environments/production
      - uses: actions/upload-artifact@v4
        with:
          name: tfplan
          path: terraform/environments/production/plan.tfplan
  apply:
    if: github.ref == 'refs/heads/main'
    needs: plan
    runs-on: ubuntu-latest
    environment: production-iac
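    steps:
      - uses: actions/checkout@v4
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          # Assumed apply role name; only the plan role is given elsewhere
          role-to-assume: arn:aws:iam::123456789012:role/github-terraform-apply
          aws-region: us-east-1
      - uses: hashicorp/setup-terraform@v3
      # Fetch the exact plan file produced by the plan job
      - uses: actions/download-artifact@v4
        with:
          name: tfplan
          path: terraform/environments/production
      - run: terraform init -input=false && terraform apply -auto-approve plan.tfplan
        working-directory: terraform/environments/production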
.gitlab-ci.yml
include:
  - template: Terraform/Base.gitlab-ci.yml

.terraform-plan:
  stage: build
  script:
    - cd terraform/environments/$CI_ENVIRONMENT_NAME
    - terraform init -input=false
    - terraform plan -out=plan.tfplan
  artifacts:
    paths:
      - terraform/environments/$CI_ENVIRONMENT_NAME/plan.tfplan

terraform-plan-production:
  extends: .terraform-plan
  environment: production

terraform-apply-production:
  stage: deploy
  environment: production
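  script:
    - cd terraform/environments/production
    # Each job starts from a clean checkout, so providers must be initialised again
    - terraform init -input=false
    - terraform apply -auto-approve plan.tfplan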
  needs: [terraform-plan-production]
  rules:
    - if: $CI_COMMIT_BRANCH == "main"
      when: manual
🔬 Under the Hood

Terraform builds a dependency graph from resource references. plan walks the graph in parallel where safe, computing a diff against state. Provider plugins translate each resource change into cloud API calls.
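
The graph walk can be sketched with a small topological sort. This is an illustration only, assuming a hypothetical `planWaves` helper and made-up resource addresses; it groups resources into waves whose members have no unresolved dependencies and could therefore be processed concurrently. It is not Terraform's actual scheduler.

```typescript
// Each key is a resource address; the value lists the addresses it references.
type DependencyGraph = Record<string, string[]>;

// Group resources into waves: every member of a wave depends only on
// resources from earlier waves.
function planWaves(graph: DependencyGraph): string[][] {
  const nodes = Object.keys(graph);
  const unresolved: Record<string, string[]> = {};
  for (const node of nodes) {
    const deps = graph[node] || [];
    for (const dep of deps) {
      if (nodes.indexOf(dep) === -1) {
        throw new Error(node + " references unknown resource " + dep);
      }
    }
    unresolved[node] = deps.slice();
  }

  const waves: string[][] = [];
  let pending = nodes.slice();
  while (pending.length > 0) {
    // Ready means no dependency is still waiting to be processed.
    const ready = pending
      .filter((node) => (unresolved[node] || []).length === 0)
      .sort();
    if (ready.length === 0) {
      throw new Error("dependency cycle among: " + pending.sort().join(", "));
    }
    waves.push(ready);
    pending = pending.filter((node) => ready.indexOf(node) === -1);
    for (const node of pending) {
      unresolved[node] = (unresolved[node] || []).filter(
        (dep) => ready.indexOf(dep) === -1
      );
    }
  }
  return waves;
}

// Hypothetical addresses for illustration.
const exampleGraph: DependencyGraph = {
  "aws_vpc.main": [],
  "aws_iam_role.cluster": [],
  "aws_subnet.private_a": ["aws_vpc.main"],
  "aws_subnet.private_b": ["aws_vpc.main"],
  "aws_eks_cluster.this": [
    "aws_iam_role.cluster",
    "aws_subnet.private_a",
    "aws_subnet.private_b",
  ],
};
const exampleWaves = planWaves(exampleGraph);
```

For `exampleGraph`, the VPC and IAM role land in the first wave, both subnets in the second, and the cluster in the third.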

State operations & blast radius

| Operation | Risk | Mitigation |
|---|---|---|
| terraform import | Wrong ID corrupts state | Import in dev stack first; snapshot state before import |
| moved blocks | Renaming without one plans destroy/create | Use Terraform 1.1+ moved syntax in same PR as rename |
| terraform state rm | Orphaned cloud resource | Pair with console verification; document in PR |
| Workspace per env | State collision if shared backend key | Separate backend keys per environment root module |

Module versioning

Pin module sources to semver git refs—not ref=main. source = "git::https://github.com/acme/tf-modules.git//eks?ref=v2.4.1" lets consumers upgrade deliberately after reading module CHANGELOG.
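
A pinning rule like this is easy to enforce in CI. The following is a minimal sketch, assuming hypothetical helpers `moduleRef` and `isSemverPinned` that inspect the `?ref=` query on a git module source; it deliberately rejects branch names and pre-release suffixes.

```typescript
// Extract the ?ref= value from a Terraform git module source, if present.
function moduleRef(source: string): string | null {
  const match = /[?&]ref=([^&]+)/.exec(source);
  if (match === null) return null;
  const ref = match[1];
  return ref === undefined ? null : ref;
}

// True when the ref looks like a plain semver tag such as v2.4.1 or 2.4.1.
// Branch names (main) and a missing ref both fail the check.
function isSemverPinned(source: string): boolean {
  const ref = moduleRef(source);
  return ref !== null && /^v?\d+\.\d+\.\d+$/.test(ref);
}
```

A lint step could collect every `source` string from the repository's module blocks and fail the PR when `isSemverPinned` returns false for any of them.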

hcl — moved block (refactor without recreate)
moved {
  from = aws_instance.web
  to   = aws_instance.web_v2
}

resource "aws_instance" "web_v2" {
  ami           = var.ami_id
  instance_type = "m6i.large"
  subnet_id     = var.subnet_id

  tags = {
    Name = "web-${var.environment}"
  }
}

Terraform Security & Scanning

IaC misconfiguration is a leading cloud breach vector—public S3 buckets, open security groups, overly broad IAM. Scan before apply with Checkov, tfsec, or KICS; enforce with OPA policy in CI.

Common misconfigurations

| Finding | Severity | Fix |
|---|---|---|
| S3 acl = "public-read" | CRITICAL | Block public access; use bucket policy with IAM principal |
| Security group 0.0.0.0/0 on port 22 | HIGH | Bastion or SSM Session Manager only |
| IAM wildcard action on wildcard resource | HIGH | Scoped policy per workload IRSA role |
| Unencrypted RDS instance | MEDIUM | storage_encrypted = true + KMS CMK |
hcl — secure S3 module
resource "aws_s3_bucket" "artifacts" {
  bucket = var.bucket_name
}

resource "aws_s3_bucket_public_access_block" "artifacts" {
  bucket = aws_s3_bucket.artifacts.id

  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}

resource "aws_s3_bucket_server_side_encryption_configuration" "artifacts" {
  bucket = aws_s3_bucket.artifacts.id

  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm     = "aws:kms"
      kms_master_key_id = var.kms_key_arn
    }
    bucket_key_enabled = true
  }
}
.github/workflows/tfsec.yml
name: IaC scan
on: [pull_request]
jobs:
  checkov:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: bridgecrewio/checkov-action@v12
        with:
          directory: terraform/
          framework: terraform
          soft_fail: false
      # Version tag assumed; pin to the release you have vetted
      - uses: aquasecurity/tfsec-action@v1.0.3
        with:
          working_directory: terraform/
.gitlab-ci.yml
iac-scan:
  stage: test
  image:
    name: bridgecrew/checkov:latest
    entrypoint: [""]
  script:
    # Failing on findings is the default; --soft-fail is a switch and takes no value
    - checkov -d terraform/ --framework terraform
  rules:
    - changes:
        - terraform/**/*
🔒 Security

Run scanners on planned JSON (terraform show -json plan.tfplan) for accurate context—static HCL scan misses computed values.
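
To show what scanning the plan JSON looks like, here is a minimal sketch that flags security groups opening SSH to the internet. It assumes a simplified subset of the `terraform show -json` output (`resource_changes[].change.after`) and only inspects inline `ingress` blocks on `aws_security_group`; `findOpenSsh` and the interface names are hypothetical, and real scanners such as Checkov cover far more cases.

```typescript
// Simplified subset of the plan JSON; only the fields this check reads.
interface IngressRule {
  protocol?: string;
  from_port: number;
  to_port: number;
  cidr_blocks?: string[] | null;
}

interface ResourceChange {
  address: string;
  type: string;
  change: {
    actions: string[];
    // after is null when the resource is being deleted
    after: { ingress?: IngressRule[] | null } | null;
  };
}

interface PlanJson {
  resource_changes?: ResourceChange[];
}

// A rule covers port 22 if its range includes it or it allows all protocols.
function coversSsh(rule: IngressRule): boolean {
  return rule.protocol === "-1" || (rule.from_port <= 22 && rule.to_port >= 22);
}

// Return addresses of security groups whose planned state exposes SSH to 0.0.0.0/0.
function findOpenSsh(plan: PlanJson): string[] {
  const findings: string[] = [];
  for (const rc of plan.resource_changes || []) {
    if (rc.type !== "aws_security_group" || rc.change.after === null) continue;
    const rules = rc.change.after.ingress || [];
    const open = rules.some(
      (rule) =>
        coversSsh(rule) && (rule.cidr_blocks || []).indexOf("0.0.0.0/0") !== -1
    );
    if (open) findings.push(rc.address);
  }
  return findings;
}
```

In CI, the JSON printed by `terraform show -json plan.tfplan` would be parsed and passed to `findOpenSsh`; a non-empty result fails the job.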

Scanner comparison

| Tool | Strength | CI integration |
|---|---|---|
| Checkov | Broad multi-framework (TF, K8s, Helm) | GitHub Action, GitLab, SARIF |
| tfsec | Fast Terraform-only rules | CLI, pre-commit |
| KICS | 1000+ queries across IaC | CI gate, IDE plugins |
| OPA/Conftest | Custom org policy in Rego | Plan JSON + HCL |

IAM policy as code

Generate least-privilege IAM from Terraform module outputs—never hand-write "*" policies. IRSA roles for EKS should scope to single S3 prefix and single KMS key.

hcl — IRSA role for workload
data "aws_iam_policy_document" "payments_api" {
  statement {
    sid    = "ReadArtifacts"
    effect = "Allow"
    actions = [
      "s3:GetObject",
      "s3:ListBucket"
    ]
    resources = [
      aws_s3_bucket.artifacts.arn,
      "${aws_s3_bucket.artifacts.arn}/payments/*"
    ]
  }

  statement {
    sid    = "DecryptWithKms"
    effect = "Allow"
    actions = ["kms:Decrypt"]
    resources = [aws_kms_key.artifacts.arn]
  }
}

resource "aws_iam_role" "payments_api" {
  name = "payments-api-${var.environment}"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect = "Allow"
      Principal = {
        Federated = aws_iam_openid_connect_provider.eks.arn
      }
      Action = "sts:AssumeRoleWithWebIdentity"
      Condition = {
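        StringEquals = {
          "${replace(aws_iam_openid_connect_provider.eks.url, "https://", "")}:sub" = "system:serviceaccount:production:payments-api"
        }
      }
    }]
  })
}

# Attach the scoped policy document to the role
resource "aws_iam_role_policy" "payments_api" {
  name   = "payments-api-access"
  role   = aws_iam_role.payments_api.id
  policy = data.aws_iam_policy_document.payments_api.json
}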
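
The "wildcard action on wildcard resource" finding can also be checked mechanically on rendered policy JSON. The sketch below is illustrative: `findFullWildcards` and the interface names are hypothetical, and it only flags the literal `"*"` on both sides of an Allow statement, not partial wildcards such as `s3:*`.

```typescript
type OneOrMany = string | string[];

// Subset of the IAM policy grammar that this check needs.
interface PolicyStatement {
  Effect: "Allow" | "Deny";
  Action?: OneOrMany;
  Resource?: OneOrMany;
}

interface PolicyDocument {
  Version?: string;
  Statement: PolicyStatement | PolicyStatement[];
}

// IAM allows a single string or a list; normalise to a list.
function asList(value: OneOrMany | undefined): string[] {
  if (value === undefined) return [];
  return typeof value === "string" ? [value] : value;
}

// Return the indexes of Allow statements that grant "*" on "*".
function findFullWildcards(doc: PolicyDocument): number[] {
  const statements = Array.isArray(doc.Statement) ? doc.Statement : [doc.Statement];
  const flagged: number[] = [];
  statements.forEach((stmt, index) => {
    const anyAction = asList(stmt.Action).indexOf("*") !== -1;
    const anyResource = asList(stmt.Resource).indexOf("*") !== -1;
    if (stmt.Effect === "Allow" && anyAction && anyResource) {
      flagged.push(index);
    }
  });
  return flagged;
}
```

A policy shaped like the `payments_api` document above, with named actions on a single bucket and prefix, produces no findings.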

Terragrunt & DRY Environments

Terragrunt wraps Terraform with DRY backends, remote state generation, and explicit dependencies between stacks—reducing copy-paste across dev/staging/production folders.

hcl — terragrunt.hcl root
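locals {
  # In an included root, get_terragrunt_dir() points at the child unit
  # (e.g. production/vpc), so read the environment from the first segment
  # of the child's path relative to this file.
  environment = split("/", path_relative_to_include())[0]
  common_tags = {
    Environment = local.environment
    ManagedBy   = "terragrunt"
  }
}

remote_state {
  backend = "s3"
  generate = {
    path      = "backend.tf"
    if_exists = "overwrite_terragrunt"
  }
  config = {
    bucket         = "acme-terraform-state"
    # path_relative_to_include() already starts with the environment folder
    key            = "${path_relative_to_include()}/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "acme-terraform-locks"
  }
}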
hcl — environment leaf + dependency
include "root" {
  path = find_in_parent_folders("root.hcl")
}

terraform {
  # EKS module; its subnets come from the VPC unit declared as a dependency
  source = "../../../modules//eks"
}

dependency "network" {
  config_path = "../vpc"

  mock_outputs = {
    private_subnet_ids = ["subnet-mock-a", "subnet-mock-b"]
  }
  mock_outputs_allowed_terraform_commands = ["plan", "validate"]
}

inputs = {
  vpc_cidr           = "10.20.0.0/16"
  private_subnet_ids = dependency.network.outputs.private_subnet_ids
  # dirname() steps up from production/eks to production, giving "acme-production"
  cluster_name       = "acme-${basename(dirname(get_terragrunt_dir()))}"
}
💡 Pro Tip

Use terragrunt run-all plan in CI to validate the entire dependency graph—not isolated modules that assume outputs exist.

🏗️ IaC

Terragrunt's dependency blocks encode stack ordering in code—replace wiki diagrams that say "apply VPC before EKS" with machine-verifiable graphs.

Directory layout example

platform-iac/
├── root.hcl
├── _envcommon/
│   ├── eks.hcl
│   └── vpc.hcl
├── production/
│   ├── vpc/terragrunt.hcl
│   ├── eks/terragrunt.hcl
│   └── rds/terragrunt.hcl
└── staging/
    ├── vpc/terragrunt.hcl
    └── eks/terragrunt.hcl

CI run-all

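In CI, terragrunt run-all plan --terragrunt-non-interactive from production/ validates the full stack. Terragrunt does not cap how many units run concurrently unless you pass --terragrunt-parallelism (the default of 10 belongs to Terraform's own -parallelism flag, which applies inside each unit), so set an explicit limit to avoid API rate limits on large estates.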

.github/workflows/terragrunt.yml
name: Terragrunt plan all
on:
  pull_request:
    paths: ['production/**', 'staging/**']
permissions:
  id-token: write   # needed for the OIDC role assumption below
  contents: read
jobs:
  plan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789012:role/terraform-plan
          aws-region: us-east-1
      - run: |
          cd production
          terragrunt run-all plan --terragrunt-non-interactive -no-color
.gitlab-ci.yml
terragrunt-plan:
  stage: build
  script:
    - cd production
    - terragrunt run-all plan --terragrunt-non-interactive -no-color
  rules:
    - changes:
        - production/**/*
        - staging/**/*

Ansible & Configuration Management

Ansible complements Terraform: Terraform provisions; Ansible configures—packages, sysctl, systemd units, certificate rotation on VMs or bare metal when immutable image baking is not yet available.

yaml — Ansible hardening playbook
---
- name: Harden EKS bastion hosts
  hosts: bastions
  become: true
  vars:
    ssh_allowed_cidrs:
      - 10.0.0.0/8
  roles:
    - role: devsecops.ssh_hardening
    - role: devsecops.auditd
  tasks:
    - name: Ensure fail2ban running
      ansible.builtin.service:
        name: fail2ban
        state: started
        enabled: true
    - name: Restrict SSH users
      ansible.builtin.lineinfile:
        path: /etc/ssh/sshd_config
        regexp: '^AllowUsers'
        line: 'AllowUsers deploy'
      notify: Restart sshd
  handlers:
    - name: Restart sshd
      ansible.builtin.service:
        name: sshd
        state: restarted
.github/workflows/ansible.yml
name: Ansible lint and check
on:
  pull_request:
    paths: ['ansible/**']
jobs:
  ansible:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install ansible ansible-lint
      - run: ansible-lint ansible/playbooks/
      - run: ansible-playbook ansible/playbooks/harden.yml --check --diff
.gitlab-ci.yml
ansible-lint:
  stage: test
  image: cytopia/ansible-lint:latest
  script:
    - ansible-lint ansible/playbooks/
  rules:
    - changes:
        - ansible/**/*
⚖️ Trade-off

Ansible on live servers is imperative drift correction. Prefer golden AMIs or Ignition for Kubernetes nodes; reserve Ansible for brownfield migration windows.

Pulumi & General-Purpose Languages

Pulumi lets you define infrastructure in TypeScript, Python, Go, or C#—sharing types, unit tests, and IDE refactoring with application code. State and engine concepts mirror Terraform.

typescript — Pulumi EKS (excerpt)
import * as aws from "@pulumi/aws";
import * as eks from "@pulumi/eks";
import * as pulumi from "@pulumi/pulumi";

const config = new pulumi.Config();
const clusterName = config.require("clusterName");

const cluster = new eks.Cluster(clusterName, {
  version: "1.29",
  instanceType: "m6i.large",
  desiredCapacity: 3,
  minSize: 2,
  maxSize: 6,
  tags: {
    Environment: pulumi.getStack(),
    ManagedBy: "pulumi",
  },
});

export const kubeconfig = cluster.kubeconfig;
export const clusterArn = cluster.core.cluster.arn;

When to choose Pulumi

  • Platform team already ships TypeScript libraries for tagging, naming, compliance
  • Complex conditionals are unreadable in HCL count/for_each nests
  • You want native unit tests in CI (for example, pulumi.runtime.setMocks with Mocha or Jest)
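
As an illustration of the first bullet, a shared tagging and naming helper might look like the sketch below. Everything here is hypothetical (`standardTags`, `resourceName`, the `Team` tag, and the naming pattern are assumptions, not an existing library); the point is that plain typed functions can be unit-tested without any cloud access.

```typescript
type DeployEnvironment = "dev" | "staging" | "production";

interface TagInput {
  environment: DeployEnvironment;
  team: string;
  repository: string;
}

// Build the tag set every resource is expected to carry.
function standardTags(
  input: TagInput,
  extra: Record<string, string> = {}
): Record<string, string> {
  const tags: Record<string, string> = {};
  for (const key of Object.keys(extra)) {
    const value = extra[key];
    if (value !== undefined) tags[key] = value;
  }
  // Mandatory tags are written last so callers cannot override them.
  tags["Environment"] = input.environment;
  tags["Team"] = input.team;
  tags["Repository"] = input.repository;
  tags["ManagedBy"] = "pulumi";
  return tags;
}

// Compose a name such as "acme-production-payments-api" and reject anything
// that is not lowercase alphanumerics separated by single hyphens.
function resourceName(
  org: string,
  environment: DeployEnvironment,
  workload: string
): string {
  const composed = [org, environment, workload].join("-");
  if (!/^[a-z0-9]+(-[a-z0-9]+)*$/.test(composed)) {
    throw new Error("invalid resource name: " + composed);
  }
  return composed;
}
```

In a Pulumi program the result would be passed as the `tags` argument of a resource, in the same position as the literal tag object in the EKS excerpt above.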

Pulumi vs Terraform decision matrix

| Factor | Terraform | Pulumi |
|---|---|---|
| Ecosystem | Largest module registry | Growing; wrap TF providers |
| Language | HCL (DSL) | TS/Python/Go/C# |
| Policy | Sentinel / OPA on plan | CrossGuard in-language |
| State | S3 + DynamoDB standard | Pulumi Cloud or self-hosted |
🏗️ IaC

Pulumi CrossGuard policies run at preview time—similar to OPA but authored in the same language as infrastructure. One toolchain for teams that live in IDE refactoring tools.

GitOps for Infrastructure

GitOps for infrastructure means the merged commit triggers plan/apply—not a human running Terraform locally. Atlantis, Terraform Cloud, and VCS-driven workflows close the loop: PR shows plan diff, merge applies, drift detection alerts on console edits.

sequenceDiagram
  participant Eng as Engineer
  participant Git as Git PR
  participant CI as CI / Atlantis
  participant TF as Terraform
  participant Cloud as Cloud API
  Eng->>Git: HCL change + module bump
  Git->>CI: webhook
  CI->>TF: plan
  TF-->>Git: plan comment on PR
  Eng->>Git: approve + merge
  CI->>TF: apply
  TF->>Cloud: reconcile resources
  Cloud-->>CI: apply success + state upload
yaml — Atlantis repo config
repos:
  - id: github.com/acme/platform-iac
    apply_requirements: [approved, mergeable]
    workflow: terragrunt
    allowed_overrides: [workflow]
    pre_workflow_hooks:
      # Flag spelling matches the --terragrunt-* style used elsewhere in this chapter
      - run: terragrunt hclfmt --terragrunt-check

workflows:
  terragrunt:
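    plan:
      steps:
        # Atlantis plans one project directory at a time and $PLANFILE is a
        # single path, so plan the unit itself; Terragrunt runs init as needed
        - run: terragrunt plan -input=false -no-color -out=$PLANFILE
    apply:
      steps:
        - run: terragrunt apply -input=false -no-color $PLANFILE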
.github/workflows/atlantis.yml
# Atlantis runs on self-hosted runner with IRSA
# PR comments: atlantis plan / atlantis apply
name: Atlantis trigger
on:
  issue_comment:
    types: [created]
jobs:
  atlantis:
    if: contains(github.event.comment.body, 'atlantis')
    runs-on: [self-hosted, atlantis]
    steps:
      - run: echo "Atlantis server handles plan/apply via webhook"
.gitlab-ci.yml
trigger-tfc-run:
  stage: deploy
  script:
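    - |
      # TFC_WORKSPACE_ID is an assumed CI/CD variable; the runs API needs a workspace.
      # Double quotes let the shell expand the variables inside the payload.
      curl -sf -X POST "https://app.terraform.io/api/v2/runs" \
        -H "Authorization: Bearer $TF_API_TOKEN" \
        -H "Content-Type: application/vnd.api+json" \
        -d "{\"data\":{\"type\":\"runs\",\"attributes\":{\"message\":\"GitLab $CI_COMMIT_SHA\"},\"relationships\":{\"workspace\":{\"data\":{\"type\":\"workspaces\",\"id\":\"$TFC_WORKSPACE_ID\"}}}}}"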
  rules:
    - if: $CI_COMMIT_BRANCH == "main"
      changes:
        - terraform/**/*

Drift detection

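Schedule terraform plan -detailed-exitcode on a cron. Exit code 0 means no changes, 1 means the plan itself failed, and 2 means the plan contains changes. On an unchanged main branch, exit code 2 indicates drift: someone edited a resource in the console, or a failed apply left partial state. Alert the platform on-call; never auto-apply drift fixes without human review.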
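
The exit-code handling for the scheduled job can be sketched as a small pure function. `interpretPlanExitCode` and `DriftDecision` are hypothetical names, and alerting on a failed plan (exit code 1) is an assumption added here so that a broken drift check does not go unnoticed.

```typescript
type DriftStatus = "in-sync" | "drift" | "error";

interface DriftDecision {
  status: DriftStatus;
  alert: boolean;
  // Always false: drift is reviewed by a human, never applied automatically.
  autoApply: false;
}

// Map the exit code of `terraform plan -detailed-exitcode` to an action.
function interpretPlanExitCode(code: number): DriftDecision {
  switch (code) {
    case 0:
      // Plan succeeded with an empty diff.
      return { status: "in-sync", alert: false, autoApply: false };
    case 2:
      // Plan succeeded and found changes: real infrastructure has drifted.
      return { status: "drift", alert: true, autoApply: false };
    default:
      // 1 (or anything unexpected): the plan itself failed.
      return { status: "error", alert: true, autoApply: false };
  }
}
```

The cron wrapper would run the plan, pass the process exit code to this function, and notify on-call whenever `alert` is true.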

Operational checklist

  1. Plan artifact stored per PR for audit.
  2. Checkov/tfsec gate before merge.
  3. Apply only from CI with OIDC—no local admin applies.
  4. State bucket versioning + MFA delete protection.
  5. Break-glass console access logged and reconciled into IaC within 48h.
⚠️ Pitfall

Auto-apply on merge without scan gates lets a typo delete production databases. Minimum bar: plan artifact, Checkov pass, peer approval, then apply in protected environment.
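
One extra gate that addresses this pitfall directly is to block plans that delete or replace stateful resources. The sketch below assumes the `resource_changes[].change.actions` field of the `terraform show -json` output; `destructiveChanges` and the protected type list are hypothetical and would be tuned per organisation.

```typescript
// Only the plan JSON fields this gate reads.
interface PlannedChange {
  address: string;
  type: string;
  change: { actions: string[] };
}

interface PlanSummary {
  resource_changes?: PlannedChange[];
}

// Hypothetical list of stateful types that must never be destroyed unreviewed.
const PROTECTED_TYPES: string[] = [
  "aws_db_instance",
  "aws_rds_cluster",
  "aws_s3_bucket",
];

// Return addresses of protected resources the plan would delete. A replacement
// appears as both "delete" and "create" in actions, so it is caught as well.
function destructiveChanges(
  plan: PlanSummary,
  protectedTypes: string[] = PROTECTED_TYPES
): string[] {
  return (plan.resource_changes || [])
    .filter(
      (rc) =>
        protectedTypes.indexOf(rc.type) !== -1 &&
        rc.change.actions.indexOf("delete") !== -1
    )
    .map((rc) => rc.address);
}
```

A CI step could fail the pipeline when the returned list is non-empty unless the PR carries an explicit override approved by the platform team.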

🏗️ IaC

GitOps is not only Kubernetes manifests—foundational IaC belongs in the same review culture. Platform SREs and security architects review HCL like application PRs.

Crossplane & Kubernetes-native IaC

Crossplane extends Kubernetes with CRDs for cloud resources (RDSInstance, Bucket) that controllers reconcile like any Deployment. Platform teams publish Compositions (golden paths); app teams request infrastructure by creating claims against those compositions, without writing raw Terraform.

| Approach | Best when | Trade-off |
|---|---|---|
| Terraform + Atlantis | Mature platform team, multi-cloud | Separate toolchain from K8s GitOps |
| Terraform Cloud | Want SaaS state + RBAC | Vendor lock-in, per-resource pricing |
| Crossplane | K8s-native platform, self-service | Steeper K8s expertise required |
| Pulumi | Typed infra in app languages | Smaller module ecosystem |

Interview preparation

  • Explain state locking—why DynamoDB + S3 together
  • Describe blast radius of terraform apply on shared VPC module
  • Compare drift detection vs GitOps reconciliation loop
  • Walk through securing CI OIDC role for Terraform apply
🎯 Interview Tip

When asked "how do you manage infrastructure?", lead with git PR workflow → plan in CI → security scan → approved apply → drift detection cron—not "we use Terraform" alone.