Policy as Code
Spreadsheets cannot keep pace with Kubernetes. This chapter explains why policy belongs in git, how OPA and Rego work, Gatekeeper and Kyverno admission policies, Conftest CI gates, compliance evidence automation, and tying policy metrics to DORA performance.
Why Policy as Code?
Manual compliance checklists do not survive 200 deploys per week. Policy as Code encodes organizational guardrails as executable rules—evaluated on every PR, Terraform plan, and Kubernetes admission request—before non-compliant resources reach production.
Policy evaluation points
| Stage | Engine | Blocks |
|---|---|---|
| CI — Terraform plan | OPA / Conftest / Sentinel | Public S3, wildcard IAM |
| CI — Kubernetes manifest | Conftest / Kyverno CLI | Missing limits, latest tag |
| Admission — K8s API | Kyverno / Gatekeeper | Non-compliant live deploy |
| Runtime — service mesh | OPA Envoy plugin | Unauthorized east-west call |
Policies should be owned by platform security, versioned in git, tested with fixture inputs, and exempted via documented break-glass annotations—not Slack DMs to disable gates.
flowchart LR
    Dev["Developer PR"] --> CI["CI policy scan"]
    CI -->|pass| Merge["Merge"]
    CI -->|fail| Fix["Fix or request exception"]
    Merge --> GitOps["GitOps sync"]
    GitOps --> Admit["K8s admission policy"]
    Admit -->|pass| Live["Production workload"]
    Admit -->|fail| Reject["Admission request rejected"]
Policy without metrics is blind compliance. Pair deny rules with dashboards: count of blocked deployments, top violating teams, mean time to policy fix.
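As a rough illustration of the three dashboard numbers named above, the following Python sketch summarizes a list of violation events. The event shape, field names, and sample data are all hypothetical; real events would come from CI logs or admission webhook audit logs.

```python
from collections import Counter
from datetime import datetime
from statistics import mean

# Hypothetical violation events; every field name here is illustrative.
EVENTS = [
    {"team": "payments", "policy": "require-limits",
     "blocked_at": "2024-05-01T10:00:00", "fixed_at": "2024-05-01T10:30:00"},
    {"team": "payments", "policy": "disallow-latest",
     "blocked_at": "2024-05-02T09:00:00", "fixed_at": "2024-05-02T11:00:00"},
    {"team": "search", "policy": "require-limits",
     "blocked_at": "2024-05-03T14:00:00", "fixed_at": "2024-05-03T14:15:00"},
]


def minutes_to_fix(event):
    """Minutes between the block and the compliant re-submission."""
    blocked = datetime.fromisoformat(event["blocked_at"])
    fixed = datetime.fromisoformat(event["fixed_at"])
    return (fixed - blocked).total_seconds() / 60


def summarize(events):
    """Return blocked count, teams ranked by violations, and mean minutes to fix."""
    by_team = Counter(e["team"] for e in events)
    return {
        "blocked": len(events),
        "top_teams": by_team.most_common(),
        "mean_minutes_to_fix": mean(minutes_to_fix(e) for e in events),
    }


print(summarize(EVENTS))
```

The same three numbers can feed whatever dashboard tool is already in use; the point is that each deny rule should produce a countable event.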
Policy maturity model
| Level | Behavior | Example |
|---|---|---|
| 0 — Ad hoc | Manual checklist at release | Spreadsheet SOC2 controls |
| 1 — CI warn | Scanners report, do not block | Checkov soft-fail |
| 2 — CI enforce | PR blocked on violation | Conftest deny on plan |
| 3 — Admission enforce | Cluster rejects non-compliant | Kyverno Enforce mode |
| 4 — Continuous audit | Background scan + metrics | Gatekeeper audit + PolicyReport |
Policy ownership RACI
- Platform security — authors ClusterPolicy / ConstraintTemplate
- App teams — fix violations in their namespaces
- SRE — operates admission webhook HA and latency SLOs
- Compliance — maps policies to framework controls
Open Policy Agent (OPA)
OPA is a general-purpose policy engine using Rego—a declarative query language over JSON. One OPA deployment evaluates Terraform plans, Kubernetes admission reviews, API authorization, and microservice sidecar policies.
# policy/terraform/s3.rego
package terraform.s3

import future.keywords.in

# Deny buckets whose ACL grants public read access.
deny[msg] {
    some resource in input.resource_changes
    resource.type == "aws_s3_bucket"
    resource.change.after.acl == "public-read"
    msg := sprintf("S3 bucket %s must not be public-read", [resource.address])
}

# Deny public access blocks that leave public ACLs unblocked.
deny[msg] {
    some resource in input.resource_changes
    resource.type == "aws_s3_bucket_public_access_block"
    not resource.change.after.block_public_acls
    msg := sprintf("S3 %s must block public ACLs", [resource.address])
}

# policy/kubernetes/deployments.rego
package kubernetes.deployments

# Deny containers without a CPU limit.
deny[msg] {
    input.kind == "Deployment"
    container := input.spec.template.spec.containers[_]
    not container.resources.limits.cpu
    msg := sprintf("container %s missing CPU limit", [container.name])
}

# Deny containers without a memory limit.
deny[msg] {
    input.kind == "Deployment"
    container := input.spec.template.spec.containers[_]
    not container.resources.limits.memory
    msg := sprintf("container %s missing memory limit", [container.name])
}
# GitHub Actions
name: OPA Conftest
on: [pull_request]
jobs:
  conftest:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: go install github.com/open-policy-agent/conftest@latest
      - name: Scan K8s manifests
        run: conftest test -p policy/kubernetes deploy/ --all-namespaces
      - name: Scan Terraform plan
        # --all-namespaces is needed because the policy lives in package
        # terraform.s3; by default Conftest only evaluates package main.
        run: |
          terraform show -json plan.tfplan > plan.json
          conftest test -p policy/terraform plan.json --all-namespaces

# GitLab CI
conftest-k8s:
  stage: test
  image: openpolicyagent/conftest:latest
  script:
    - conftest test -p policy/kubernetes deploy/ --all-namespaces
  rules:
    - changes:
        - deploy/**/*
        - policy/**/*
conftest-tf:
  stage: test
  script:
    - terraform show -json plan.tfplan > plan.json
    - conftest test -p policy/terraform plan.json --all-namespaces
  needs: [terraform-plan]
OPA evaluates the input document against rules that build a deny set. Conftest fails the check when any deny rule matches, so a non-empty set means the change is rejected. Run the policy unit tests with opa test policy/ -v.
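To make the deny-set idea concrete, here is a Python analogue of the two terraform.s3 rules. It is not OPA and not how Conftest is implemented; it only mirrors the logic so the decision step is easy to follow. The sample plan is made up.

```python
def deny(plan):
    """Collect violation messages the way the terraform.s3 deny rules do."""
    messages = set()
    for rc in plan.get("resource_changes", []):
        after = (rc.get("change") or {}).get("after") or {}
        if rc.get("type") == "aws_s3_bucket" and after.get("acl") == "public-read":
            messages.add(f"S3 bucket {rc['address']} must not be public-read")
        # Like Rego's `not`, this fires when the field is false or missing.
        if (rc.get("type") == "aws_s3_bucket_public_access_block"
                and not after.get("block_public_acls")):
            messages.add(f"S3 {rc['address']} must block public ACLs")
    return messages


# Made-up plan fragment in the shape of `terraform show -json` output.
PLAN = {
    "resource_changes": [
        {"address": "aws_s3_bucket.logs", "type": "aws_s3_bucket",
         "change": {"after": {"acl": "public-read"}}},
        {"address": "aws_s3_bucket.data", "type": "aws_s3_bucket",
         "change": {"after": {"acl": "private"}}},
    ]
}

violations = deny(PLAN)
decision = "reject" if violations else "allow"
print(decision, sorted(violations))
```

Each rule only ever adds to the set, which is why independent deny rules compose without any ordering concerns.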
OPA deployment patterns
| Pattern | Use case | Latency |
|---|---|---|
| Sidecar | Envoy ext_authz per pod | Low — local |
| Centralized service | API authorization | Medium — network hop |
| Embedded OPA library | K8s admission via Gatekeeper | Low (in-process) |
| CI-only (Conftest) | Shift-left without runtime OPA | N/A in prod |
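For the centralized-service pattern, a client typically sends the input document to OPA's Data API and inspects the result. The sketch below builds such a request in Python without sending it. The host, port, and policy path are assumptions for illustration; the request shape (POST to /v1/data/<path> with an "input" wrapper) follows OPA's REST API.

```python
import json
import urllib.request


def build_opa_request(base_url, policy_path, input_doc):
    """Build a POST to OPA's Data API for the given rule path."""
    url = f"{base_url.rstrip('/')}/v1/data/{policy_path.strip('/')}"
    body = json.dumps({"input": input_doc}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def is_denied(opa_response):
    """OPA returns the rule value under "result"; non-empty deny means reject."""
    return len(opa_response.get("result", [])) > 0


# Assumed local OPA address and rule path, for illustration only.
req = build_opa_request(
    "http://localhost:8181",
    "kubernetes/deployments/deny",
    {"kind": "Deployment"},
)
print(req.get_method(), req.full_url)
# Sending it needs a running OPA, e.g. urllib.request.urlopen(req, timeout=2)
```

The network hop in this pattern is what the latency column refers to; the sidecar pattern uses the same API against localhost.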
Testing Rego
Every policy package needs *_test.rego with table-driven cases. Run opa test -v policy/ in CI on every policy PR—Rego bugs are logic bugs, not syntax typos.
package kubernetes.deployments

# Fixtures use a local name other than `input`, because shadowing the
# reserved input document is rejected in OPA strict mode.
test_deny_missing_cpu {
    manifest := {
        "kind": "Deployment",
        "spec": {
            "template": {
                "spec": {
                    "containers": [{
                        "name": "api",
                        "resources": {"limits": {"memory": "512Mi"}}
                    }]
                }
            }
        }
    }
    count(deny) == 1 with input as manifest
}

test_allow_with_limits {
    manifest := {
        "kind": "Deployment",
        "spec": {
            "template": {
                "spec": {
                    "containers": [{
                        "name": "api",
                        "resources": {
                            "limits": {"cpu": "500m", "memory": "512Mi"}
                        }
                    }]
                }
            }
        }
    }
    count(deny) == 0 with input as manifest
}
OPA Gatekeeper on Kubernetes
Gatekeeper is Kubernetes-native OPA: ConstraintTemplate CRDs define the Rego, and Constraint CRs parameterize and enable it. Its validating admission webhook rejects non-compliant resources at the API server.
apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata:
  name: k8srequiredlabels
spec:
  crd:
    spec:
      names:
        kind: K8sRequiredLabels
      validation:
        openAPIV3Schema:
          type: object
          properties:
            labels:
              type: array
              items:
                type: string
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package k8srequiredlabels

        violation[{"msg": msg}] {
          required := input.parameters.labels
          # A set comprehension is empty (not undefined) when the object has
          # no labels at all, so unlabeled objects are still reported.
          provided := {label | input.review.object.metadata.labels[label]}
          missing := required[_]
          not provided[missing]
          msg := sprintf("missing required label: %v", [missing])
        }
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sRequiredLabels
metadata:
  name: platform-mandatory-labels
spec:
  match:
    kinds:
      - apiGroups: ["apps"]
        kinds: ["Deployment", "StatefulSet"]
    namespaces:
      - production
  parameters:
    labels:
      - app.kubernetes.io/name
      - app.kubernetes.io/version
      - owner.team
# Kubernetes
$ kubectl get constraints
$ kubectl get k8srequiredlabels.constraints.gatekeeper.sh platform-mandatory-labels -o yaml
→ status.totalViolations: 3
$ kubectl logs -n gatekeeper-system deploy/gatekeeper-audit

# OpenShift
$ oc get constraints
$ oc describe k8srequiredlabels platform-mandatory-labels
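The required-labels check is easy to get subtly wrong when an object carries no labels at all. This Python analogue of the constraint logic (not Gatekeeper code, just the same comparison) treats a missing labels map as empty so every required label is reported. The sample objects are invented.

```python
REQUIRED = ["app.kubernetes.io/name", "app.kubernetes.io/version", "owner.team"]


def missing_labels(review_object, required):
    """Return required labels absent from the object, in the required order."""
    # Treat a missing metadata or labels map as empty rather than skipping.
    provided = (review_object.get("metadata") or {}).get("labels") or {}
    return [label for label in required if label not in provided]


partially_labeled = {
    "kind": "Deployment",
    "metadata": {"name": "api", "labels": {"app.kubernetes.io/name": "api"}},
}
unlabeled = {"kind": "Deployment", "metadata": {"name": "legacy"}}

for obj in (partially_labeled, unlabeled):
    for label in missing_labels(obj, REQUIRED):
        print(f"{obj['metadata']['name']}: missing required label: {label}")
```

A fixture for the unlabeled case is worth adding to the policy's own tests, since it is the case most likely to pass by accident.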
Gatekeeper vs Kyverno
| Aspect | Gatekeeper | Kyverno |
|---|---|---|
| Policy language | Rego | YAML patterns + CEL |
| Mutate | Limited via mutation CRDs | Native mutate rules |
| Image verify | Via external data | Built-in verifyImages |
| Learning curve | Steeper | K8s-native |
Gatekeeper Rego is powerful but steep for app teams. Platform owns templates; product teams only supply Constraint parameters—or migrate simple rules to Kyverno YAML.
Audit vs enforce
Start new constraints with enforcementAction: dryrun so violations are logged without blocking deploys. The Gatekeeper audit controller writes violations to the constraint status; export them to your SIEM before flipping to deny.
| Mode (enforcementAction) | Effect | When |
|---|---|---|
| dryrun | Record violation in status and logs, allow the request | Policy rollout week 1–2 |
| warn | Allow the request, return a warning to the client | Fix window, testing template changes |
| deny | Reject the admission request | After fix window + comms |
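Exporting audit results to a SIEM usually means flattening each constraint's status into one record per violation. The Python sketch below does that for a constraint object as returned by kubectl get ... -o json. The output field names are a hypothetical schema, and the sample constraint is invented; the status.violations layout reflects what the Gatekeeper audit controller writes.

```python
import json


def violations_to_jsonl(constraint):
    """Flatten a constraint's status.violations into JSON lines."""
    meta = constraint.get("metadata", {})
    status = constraint.get("status", {})
    lines = []
    for v in status.get("violations", []):
        record = {
            "constraint": meta.get("name"),
            "constraint_kind": constraint.get("kind"),
            "audit_timestamp": status.get("auditTimestamp"),
            "resource": f"{v.get('kind')}/{v.get('namespace')}/{v.get('name')}",
            "enforcement_action": v.get("enforcementAction"),
            "message": v.get("message"),
        }
        lines.append(json.dumps(record, sort_keys=True))
    return lines


# Invented sample in the shape of a Gatekeeper constraint with audit results.
SAMPLE = {
    "kind": "K8sRequiredLabels",
    "metadata": {"name": "platform-mandatory-labels"},
    "status": {
        "auditTimestamp": "2024-05-01T00:00:00Z",
        "totalViolations": 1,
        "violations": [{
            "enforcementAction": "dryrun",
            "kind": "Deployment",
            "name": "legacy-batch",
            "namespace": "production",
            "message": "missing required label: owner.team",
        }],
    },
}

for line in violations_to_jsonl(SAMPLE):
    print(line)
```

One record per violation keeps SIEM queries simple, for example counting violations per namespace before switching a constraint to deny.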
Kyverno Policies
Kyverno policies are Kubernetes resources—no Rego required. Validate, mutate, generate, and cleanup rules use YAML patterns familiar to platform engineers.
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-image-digest
spec:
  validationFailureAction: Enforce
  background: true
  rules:
    - name: disallow-latest-tag
      match:
        any:
          - resources:
              kinds:
                - Pod
      validate:
        message: "Using 'latest' tag is not allowed."
        pattern:
          spec:
            containers:
              - image: "!*:latest"
    - name: require-digest
      match:
        any:
          - resources:
              kinds:
                - Pod
      validate:
        message: "Images must use digest pinning (@sha256:...)."
        pattern:
          spec:
            containers:
              - image: "*@sha256:*"
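The two image patterns above rely on wildcard matching with an optional leading ! for negation. A small Python approximation shows which image references each rule would accept. It uses fnmatch as a stand-in for Kyverno's matcher, which is an assumption that holds for simple * patterns like these but is not Kyverno's implementation.

```python
from fnmatch import fnmatchcase


def matches(pattern, value):
    """Approximate a Kyverno string pattern: '*' wildcard, leading '!' negates."""
    if pattern.startswith("!"):
        return not fnmatchcase(value, pattern[1:])
    return fnmatchcase(value, pattern)


def check_image(image):
    """Return the failure messages the two rules would raise for one image."""
    failures = []
    if not matches("!*:latest", image):
        failures.append("Using 'latest' tag is not allowed.")
    if not matches("*@sha256:*", image):
        failures.append("Images must use digest pinning (@sha256:...).")
    return failures


for ref in ("nginx:latest", "nginx:1.25", "nginx@sha256:abc123"):
    print(ref, "->", check_image(ref) or "ok")
```

Note that a tagged but unpinned image such as nginx:1.25 passes the first rule and fails only the second, which is why the policy carries both.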
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: verify-image-signatures
spec:
  validationFailureAction: Enforce
  webhookTimeoutSeconds: 30
  rules:
    - name: verify-signature
      match:
        any:
          - resources:
              kinds:
                - Pod
      verifyImages:
        - imageReferences:
            - "registry.example.com/*"
          attestors:
            - count: 1
              entries:
                - keys:
                    publicKeys: |-
                      -----BEGIN PUBLIC KEY-----
                      MFkwEwYHKoZIzj0CAQYIKoZIzj0DAQcDQgAE...
                      -----END PUBLIC KEY-----
# GitHub Actions
name: Kyverno policy test
on:
  pull_request:
    paths: ['policy/kyverno/**', 'deploy/**']
jobs:
  kyverno:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: kyverno/action-install-cli@<release-tag>  # pin to a published release tag
      - run: kyverno apply policy/kyverno/ --resource deploy/ --policy-report
      - run: kyverno test policy/kyverno/tests/

# GitLab CI
kyverno-test:
  stage: test
  image: ghcr.io/kyverno/kyverno-cli:v1.12.0
  script:
    - kyverno apply policy/kyverno/ --resource deploy/ --policy-report
    - kyverno test policy/kyverno/tests/
  rules:
    - changes:
        - policy/kyverno/**/*
        - deploy/**/*
Run kyverno apply in CI against rendered manifests (Kustomize/Helm output)—catch violations before Argo CD sync surfaces opaque admission errors.
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: add-default-resources
spec:
  rules:
    - name: default-requests-limits
      match:
        any:
          - resources:
              kinds: [Deployment]
      mutate:
        patchStrategicMerge:
          spec:
            template:
              spec:
                containers:
                  - (name): "*"
                    resources:
                      requests:
                        +(cpu): "100m"
                        +(memory): "128Mi"
                      limits:
                        +(memory): "512Mi"
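The +(key) anchors in the mutate rule mean "add this field only if it is absent". In Python terms that is dict.setdefault applied per container. The sketch below mirrors the effect on a made-up Deployment; it illustrates the semantics and is not Kyverno's patch engine.

```python
import copy

# Defaults taken from the add-default-resources policy above.
DEFAULTS = {
    "requests": {"cpu": "100m", "memory": "128Mi"},
    "limits": {"memory": "512Mi"},
}


def apply_defaults(deployment):
    """Return a copy with defaults added only where a field is missing."""
    patched = copy.deepcopy(deployment)
    for container in patched["spec"]["template"]["spec"]["containers"]:
        resources = container.setdefault("resources", {})
        for section, values in DEFAULTS.items():
            target = resources.setdefault(section, {})
            for key, value in values.items():
                # setdefault keeps any value the team already declared.
                target.setdefault(key, value)
    return patched


# Made-up Deployment: one container sets its own CPU request, one sets nothing.
DEPLOYMENT = {"spec": {"template": {"spec": {"containers": [
    {"name": "api", "resources": {"requests": {"cpu": "250m"}}},
    {"name": "sidecar"},
]}}}}

for c in apply_defaults(DEPLOYMENT)["spec"]["template"]["spec"]["containers"]:
    print(c["name"], c["resources"])
```

Because existing values win, the mutation is safe to apply cluster-wide: teams that already declare resources see no change.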
Policy exceptions
A Kyverno PolicyException grants a waiver for break-glass deploys. Require a security approval annotation and an expiry date on each one so the waiver stays time-bound. Permanent exceptions belong in a policy revision, not in the PolicyException CRD.
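One way to keep exceptions honest is a periodic check that flags any exception lacking approval or past its expiry. The Python sketch below assumes a team convention of two annotations; both annotation keys are hypothetical and not part of Kyverno itself.

```python
from datetime import datetime, timezone

# Hypothetical annotation keys; a team would choose its own convention.
APPROVAL_KEY = "example.com/security-approved-by"
EXPIRY_KEY = "example.com/expires-at"  # RFC 3339 timestamp in UTC


def exception_is_valid(exception, now):
    """True only if the exception names an approver and has not expired."""
    annotations = (exception.get("metadata") or {}).get("annotations") or {}
    approver = annotations.get(APPROVAL_KEY)
    expires = annotations.get(EXPIRY_KEY)
    if not approver or not expires:
        return False
    # fromisoformat on older Python versions does not accept a trailing "Z".
    expiry = datetime.fromisoformat(expires.replace("Z", "+00:00"))
    return now < expiry


def make_exception(approver=None, expires=None):
    """Build a minimal exception-shaped dict for the demo below."""
    annotations = {}
    if approver:
        annotations[APPROVAL_KEY] = approver
    if expires:
        annotations[EXPIRY_KEY] = expires
    return {"kind": "PolicyException", "metadata": {"annotations": annotations}}


NOW = datetime(2024, 6, 1, tzinfo=timezone.utc)
print(exception_is_valid(make_exception("sec-team", "2024-06-30T00:00:00Z"), NOW))
print(exception_is_valid(make_exception("sec-team", "2024-05-01T00:00:00Z"), NOW))
```

A nightly job could run this over all exceptions and open a ticket, or delete the resource, when it returns false.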
Performance considerations
- verifyImages calls registry—set reasonable webhookTimeoutSeconds
- Background scans run async—do not rely on them for deploy-time safety
- Keep match rules narrow—cluster-wide Pod matches are expensive at scale
Conftest & CI Policy Gates
Conftest wraps OPA for CI-friendly testing of JSON/YAML/HCL plans. Write policies once in policy/; test with fixture files; fail builds before merge.
Policy repository layout
policy/
├── kubernetes/
│ ├── deployments.rego
│ └── tests/
│ └── missing_limits_test.rego
├── terraform/
│ ├── s3.rego
│ └── tests/
└── docker/
└── dockerfile.rego
Testing policies
Each Rego package should have *_test.rego with positive and negative cases. CI runs conftest test and opa test on every policy change—policies are code and deserve the same review rigor as application logic.
package main

# Deny containers that reference the mutable :latest tag.
deny[msg] {
    input.kind == "Deployment"
    container := input.spec.template.spec.containers[_]
    endswith(container.image, ":latest")
    msg := sprintf("container %v uses :latest tag", [container.name])
}
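# GitHub Actions
name: Conftest SARIF
on: [pull_request]
jobs:
  policy:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      security-events: write  # required to upload SARIF to code scanning
    steps:
      - uses: actions/checkout@v4
      - run: go install github.com/open-policy-agent/conftest@latest
      - name: Test policies
        run: conftest test -p policy/kubernetes deploy/ --all-namespaces -o sarif > conftest.sarif
      # Upload even when conftest exits non-zero, so failures reach code scanning.
      - uses: github/codeql-action/upload-sarif@v3
        if: always()
        with:
          sarif_file: conftest.sarif

# GitLab CI
conftest:
  stage: test
  image: openpolicyagent/conftest:latest
  script:
    - conftest test -p policy/kubernetes deploy/ --all-namespaces -o junit > conftest.xml
  artifacts:
    when: always  # keep the report when the policy check fails
    reports:
      junit: conftest.xml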
Upload Conftest SARIF to GitHub code scanning, or JUnit output to GitLab test reports, so security champions see policy failures alongside SAST/SCA findings in the PR or merge request view.
Compliance Frameworks & Evidence
SOC 2, PCI-DSS, HIPAA, and ISO 27001 ask for demonstrable controls—not screenshots. Policy-as-code generates continuous evidence: every denied public bucket, every enforced digest pin, every audit log entry is a compliance datapoint.
Control mapping example
| Framework control | Automated policy | Evidence source |
|---|---|---|
| SOC 2 CC6.1 — logical access | Kyverno: require IRSA annotation | PolicyReport CRD export |
| PCI 1.2 — network segmentation | Terraform OPA: private subnets only | Conftest plan output in CI artifacts |
| PCI 3.4 — encryption at rest | Checkov CKV_AWS_19 | Checkov JUnit report |
| HIPAA — audit logging | Terraform: CloudTrail required | Terraform state + AWS Config |
apiVersion: wgpolicyk8s.io/v1alpha2
kind: PolicyReport
metadata:
  name: ns-production-audit
  namespace: production
results:
  - policy: require-image-digest
    rule: require-digest
    result: pass
    scored: true
    # The v1alpha2 schema lists affected objects under `resources`.
    resources:
      - kind: Deployment
        name: payments-api
        namespace: production
  - policy: require-image-digest
    rule: disallow-latest-tag
    result: fail
    message: "container sidecar uses :latest"
    resources:
      - kind: Deployment
        name: legacy-batch
        namespace: production
Evidence automation pipeline
- CI exports SARIF/JUnit from Conftest, Checkov, Kyverno CLI.
- Nightly job aggregates PolicyReport CRDs cluster-wide.
- Object-lock S3 bucket stores immutable monthly evidence bundles.
- GRC tool ingests via API—auditor queries, not spreadsheet hunts.
Auditors trust automated continuous evidence over annual manual sampling. Export PolicyReport and Conftest artifacts to your GRC tool or S3 evidence bucket with object lock.
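The nightly aggregation step can be sketched as a function that rolls PolicyReport results up per policy and attaches a SHA-256 digest, so the stored bundle can later be checked for tampering. The bundle layout and field names are assumptions for illustration; the input is the JSON list returned by kubectl get policyreport -A -o json.

```python
import hashlib
import json
from collections import Counter


def build_evidence_bundle(policy_reports, period):
    """Summarize results per (policy, outcome) and hash the canonical payload."""
    counts = Counter()
    for report in policy_reports.get("items", []):
        for result in report.get("results") or []:
            key = (result.get("policy") or "unknown", result.get("result") or "unknown")
            counts[key] += 1
    summary = [
        {"policy": policy, "result": outcome, "count": n}
        for (policy, outcome), n in sorted(counts.items())
    ]
    # sort_keys gives a stable serialization, so the digest is reproducible.
    payload = json.dumps({"period": period, "summary": summary}, sort_keys=True)
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    return {"payload": payload, "sha256": digest}


# Invented sample: two reports, one of them with no results yet.
REPORTS = {"items": [
    {"results": [
        {"policy": "require-image-digest", "result": "pass"},
        {"policy": "require-image-digest", "result": "fail"},
    ]},
    {"results": None},
]}

bundle = build_evidence_bundle(REPORTS, "2024-05")
print(bundle["sha256"])
print(bundle["payload"])
```

Writing the payload and its digest as separate objects in the object-lock bucket lets an auditor recompute the hash independently.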
Framework quick reference
| Framework | Policy focus | Automation priority |
|---|---|---|
| SOC 2 Type II | Change management, access control | Git PR + admission logs |
| PCI-DSS 4.0 | Network segmentation, encryption | Terraform OPA + Kyverno NetworkPolicy |
| HIPAA | Audit trails, minimum necessary | CloudTrail + RBAC policies |
| ISO 27001 | Risk treatment, asset inventory | SBOM + policy reports |
| NIST 800-53 | Configuration management | Conftest + drift detection |
Evidence retention
- CI artifacts: SARIF/JUnit 90 days minimum
- PolicyReport exports: monthly snapshots to WORM storage
- Admission webhook audit logs: 1 year hot, 7 years cold
- Exception tickets linked to PolicyException CRD metadata
DORA Metrics & Policy Automation
Elite DORA performers deploy frequently with low change failure rate because automation catches mistakes early—including policy violations. Policy gates add seconds to pipeline; they subtract hours from MTTR and audit prep.
How policy automation moves DORA metrics
| DORA metric | Policy automation effect |
|---|---|
| Deployment frequency | Fast CI policy feedback enables trunk-based merges without fear |
| Lead time for changes | Shift-left deny in PR vs 2am admission failure in prod |
| Change failure rate | Blocks misconfigurations that cause outages (open SG, no limits) |
| MTTR | Known-good policy baseline + Git revert beats manual firefighting |
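Two of the metrics in the table reduce to simple arithmetic over deployment records: change failure rate is failed deploys divided by total deploys, and MTTR is the mean restore time of the failed ones. The Python sketch below uses an invented record shape and sample data to show the calculation.

```python
from statistics import mean


def change_failure_rate(deploys):
    """Fraction of deployments that caused a failure in production."""
    if not deploys:
        return 0.0
    failed = sum(1 for d in deploys if d["failed"])
    return failed / len(deploys)


def mean_time_to_restore(deploys):
    """Mean minutes to restore service, over failed deployments only."""
    durations = [d["restore_minutes"] for d in deploys if d["failed"]]
    return mean(durations) if durations else 0.0


# Invented sample: 10 deploys, 2 of which failed.
DEPLOYS = [{"failed": False, "restore_minutes": 0} for _ in range(8)] + [
    {"failed": True, "restore_minutes": 30},
    {"failed": True, "restore_minutes": 90},
]

print(f"CFR: {change_failure_rate(DEPLOYS):.0%}")
print(f"MTTR: {mean_time_to_restore(DEPLOYS)} minutes")
```

Tracking these before and after a policy gate goes live is the simplest way to show whether the gate is paying for itself.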
# GitHub Actions
name: Policy metrics export
on:
  schedule:
    - cron: '0 * * * *'
jobs:
  export:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Collect Kyverno policy reports
        # Assumes the runner already has kubeconfig credentials for the cluster.
        run: |
          kubectl get policyreport -A -o json | jq '[.items[].results[]? | select(.result=="fail")] | length' > fails.txt
      - name: Push metric
        # Each $(...) sits outside the single quotes so the shell expands it.
        run: |
          curl -X POST "https://api.datadoghq.com/api/v1/series" \
            -H "Content-Type: application/json" \
            -H "DD-API-KEY: ${{ secrets.DD_API_KEY }}" \
            -d '{"series":[{"metric":"kyverno.policy.violations","points":[['"$(date +%s)"','"$(cat fails.txt)"']],"type":"gauge"}]}'

# GitLab CI
policy-violation-count:
  stage: report
  script:
    - kubectl get policyreport -A -o json > reports.json
    - FAILS=$(jq '[.items[].results[]? | select(.result=="fail")] | length' reports.json)
    - echo "KYVERNO_VIOLATIONS=$FAILS" >> report.env
  artifacts:
    reports:
      dotenv: report.env
When presenting this work, explain how you reduced change failure rate: introduced Conftest on PRs, then Kyverno Enforce in the cluster, then exported PolicyReport data for compliance. Quantify the result, for example: "CFR dropped from 28% to 9% over two quarters."
Over-strict policies block legitimate emergencies. Run new policies with validationFailureAction: Audit as a shadow mode first, then switch to Enforce after the fix window, backed by a documented exception process.
Policy gate latency budget
Admission webhooks add to API server latency. Target <100ms p99 for validate rules; verifyImages may need 5–30s timeout—run heavy verification in CI and keep admission to digest/signature checks only.
| Gate location | Typical added latency | Blocks |
|---|---|---|
| Pre-commit | 1–5s | Secrets, format |
| PR Conftest | 10–30s | IaC + manifest policy |
| Kyverno validate | 50–200ms | Labels, limits, tags |
| verifyImages | 1–10s | Unsigned images |
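Checking a webhook against the latency target comes down to computing a percentile over observed durations. The Python sketch below uses the nearest-rank method on invented samples; in practice the samples would come from API server or webhook metrics, and the 100 ms budget is the target stated above.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def within_budget(samples_ms, budget_ms=100, pct=99):
    """True when the chosen percentile is under the latency budget."""
    return percentile(samples_ms, pct) < budget_ms


# Invented webhook latencies in milliseconds: mostly fast, two slow outliers.
SAMPLES_MS = [20] * 98 + [80, 250]

print("p99:", percentile(SAMPLES_MS, 99), "ms")
print("within budget:", within_budget(SAMPLES_MS))
```

A single slow outlier does not break a p99 target, but a slow 2% does, which is the usual signature of a registry call such as verifyImages on the admission path.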
Building a policy platform roadmap
- Month 1: Conftest on PR for K8s manifests (warn mode)
- Month 2: Checkov on Terraform PR (block critical)
- Month 3: Kyverno audit for prod namespace violations
- Month 4: Enforce digest pin + disallow latest
- Month 5: verifyImages Cosign for production
- Month 6: Export PolicyReport to compliance dashboard
Teams that jump straight to Enforce on day one get bypass culture—developers kubectl apply from laptops with cluster-admin. Shadow mode builds trust; enforce mode earns it.