Environment Management

From local laptops to production regions—how to design environment topology, spin up ephemeral review apps on every PR, promote immutable artifacts through gated pipelines, and manage configuration without secrets in git.


Environment strategy

Environments are not copies of production—they are risk containers with different data policies, access controls, and promotion gates. The goal is predictable promotion: the same artifact tested in staging is what runs in production, with only configuration overlays changing.

Core principles

Parity where it matters — staging mirrors prod topology (ingress, DB engine, message broker) even if scaled down
Immutability across promotion — promote signed image digests, not rebuilt tags
Least privilege per env — prod credentials never exist in dev CI variables
Explicit data classification — each environment declares what data classes it may hold
Cost-aware sprawl control — ephemeral envs auto-destroy; long-lived envs have owners and budgets

Environment matrix

Environment	Purpose	Typical workloads	Config source	Data policy
local	Developer laptop	Feature dev, unit tests	docker-compose, .env.local	None — never prod data
dev	Shared integration	API contract tests, smoke	K8s namespace dev-*	Synthetic fixtures only
staging	Pre-prod mirror	E2E, perf, security scans	Helm values-staging.yaml	Anonymized prod subset
uat	Business acceptance	Stakeholder sign-off	Same chart, uat overlay	Masked PII, time-limited
prod	Customer traffic	Live workloads	Helm values-prod.yaml + sealed secrets	Full — encrypted at rest
dr	Disaster recovery	Failover validation	Cross-region replica	Prod backup restore
sandbox	Experimentation	Spikes, PoCs	Ephemeral namespace TTL 7d	No customer data
perf	Load testing	Soak, spike, chaos	Scaled-down prod topology	Generated traffic data
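
The data-policy column can be checked in code before a dataset is loaded into an environment. The following is a minimal Python sketch; the data class names (`synthetic`, `anonymized`, `masked_pii`, `full`) and the `may_hold` helper are hypothetical and only loosely derived from the table, so a real mapping would come from your own compliance framework.

```python
# Hypothetical data classes allowed per environment, loosely derived from the matrix above.
ALLOWED_DATA = {
    "local": {"synthetic"},
    "dev": {"synthetic"},
    "staging": {"synthetic", "anonymized"},
    "uat": {"synthetic", "masked_pii"},
    "prod": {"synthetic", "anonymized", "masked_pii", "full"},
    "dr": {"full"},
    "sandbox": {"synthetic"},
    "perf": {"synthetic"},
}


def may_hold(env: str, data_class: str) -> bool:
    """Return True if the environment is declared to hold this data class.

    Unknown environments are denied by default.
    """
    return data_class in ALLOWED_DATA.get(env, set())
```

A data-loading job could call `may_hold("staging", "full")` and refuse to proceed when it returns `False`.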

Naming conventions

Pattern	Example	Rationale
`{service}-{env}`	`payments-staging`	Clear blast-radius boundaries in logs and metrics
`{env}/{service}`	`staging/payments`	GitOps folder structure maps 1:1 to ArgoCD apps
`pr-{number}`	`pr-1842`	Ephemeral review apps tied to merge request lifecycle
`{region}-{env}`	`us-east-prod`	Multi-region deployments with regional config overlays
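
The patterns above can be generated and validated programmatically. A minimal Python sketch, assuming the names end up as Kubernetes namespaces or DNS labels (lowercase alphanumerics and hyphens, at most 63 characters); the helper names `env_name` and `preview_name` are hypothetical.

```python
import re

# RFC 1123 label: lowercase alphanumerics and '-', starting and ending with an alphanumeric.
_DNS_LABEL = re.compile(r"[a-z0-9]([-a-z0-9]*[a-z0-9])?")


def _validate(name: str) -> str:
    """Raise ValueError unless the name is usable as a namespace or DNS label."""
    if len(name) > 63 or not _DNS_LABEL.fullmatch(name):
        raise ValueError(f"not a valid DNS label: {name!r}")
    return name


def env_name(service: str, env: str) -> str:
    """Build a `{service}-{env}` name such as `payments-staging`."""
    return _validate(f"{service}-{env}".lower())


def preview_name(pr_number: int) -> str:
    """Build a `pr-{number}` name for an ephemeral review app."""
    if pr_number <= 0:
        raise ValueError("PR number must be positive")
    return _validate(f"pr-{pr_number}")
```

Validating at name-construction time surfaces bad service names in CI rather than as a rejected `kubectl` or `helm` call later in the pipeline.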

flowchart LR
  subgraph dev["Development"]
    L[local] --> D[dev]
  end
  subgraph preprod["Pre-production"]
    D --> S[staging]
    S --> U[UAT]
  end
  subgraph live["Live"]
    U --> P[prod]
    P --> DR[dr standby]
  end
  PR[review app] -.->|merge| D
  style P fill:#7c3aed,color:#fff
  style PR fill:#38bdf8,color:#000

🌍 Real World

Shopify runs thousands of ephemeral preview environments per day. Each maps to a PR branch; DNS and TLS are automated. Destroy on merge keeps cloud spend predictable while giving reviewers a live URL.

⚠️ Pitfall

Using production database snapshots in staging without masking creates compliance violations and lets a staging bug corrupt prod-adjacent data. Use synthetic data generators or field-level masking pipelines instead.

⚖️ Trade-off

Full prod parity in staging can roughly double infrastructure cost. Compromise: match critical dependencies (auth provider, payment sandbox, feature flag service) and accept smaller replica counts.

Anti-patterns to avoid

Anti-pattern	Why it fails	Fix
Snowflake prod	Staging never catches prod-only config	GitOps: same Helm chart, env-specific values only
Rebuild on deploy	Different bytecode than tested artifact	Promote immutable digest from registry
Shared staging DB	Teams block each other; flaky E2E	Namespace-per-team or ephemeral DB per review app
Manual env provisioning	Weeks to onboard; drift from IaC	Terraform/Atlantis or Crossplane self-service
Prod secrets in .env files	Leak via git, Slack, screenshots	Vault/Secrets Manager + OIDC in CI

Decision checklist

How many environments does your compliance framework require? (SOC 2 audits often expect staging and prod to be separated)
Can you afford full prod parity or only critical-path parity?
Who owns each long-lived environment? (RACI: platform provisions, app team configures)
What is your RTO/RPO for DR—does DR need warm standby or cold restore?
Do you need regional environments (EU data residency) or single-region with regional overlays?

Ephemeral review apps

Review apps (preview environments) deploy every pull request to an isolated URL so reviewers test real behavior—not screenshots. They are short-lived, namespaced, and destroyed on merge or TTL expiry.

Lifecycle

PR opened — CI builds artifact, deploys to pr-{number} namespace
DNS/TLS — wildcard cert + ingress rule: pr-1842.preview.example.com
Dependencies — spin ephemeral Postgres (CloudNativePG) or wire to shared staging services
Comment bot — post preview URL on PR for designer/QA access
PR updated — redeploy same namespace, rolling update
PR merged/closed — teardown namespace, delete DNS, purge secrets
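
The lifecycle above amounts to a mapping from pull request events to actions. A minimal Python sketch, assuming GitHub's `pull_request` action names (`opened`, `synchronize`, `reopened`, `closed`); the function name and the action strings it returns are hypothetical placeholders for real deploy steps.

```python
def plan_preview_actions(action: str, pr_number: int) -> list[str]:
    """Map a pull request event to the lifecycle steps listed above."""
    ns = f"pr-{pr_number}"
    host = f"{ns}.preview.example.com"
    if action in ("opened", "reopened"):
        return [f"deploy {ns}", f"route {host}", "comment preview URL"]
    if action == "synchronize":
        # New commits reuse the same namespace; only a rolling update is needed.
        return [f"redeploy {ns}"]
    if action == "closed":
        return [f"teardown {ns}", f"delete DNS {host}", "purge secrets"]
    # Events such as "labeled" or "edited" do not affect the preview.
    return []
```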

sequenceDiagram
  participant Dev as Developer
  participant CI as CI Pipeline
  participant K8s as Kubernetes
  participant Bot as PR Bot
  Dev->>CI: Push branch / open PR
  CI->>CI: Build + scan image
  CI->>K8s: helm upgrade --install pr-1842
  K8s-->>CI: Ingress URL ready
  CI->>Bot: Post preview link
  Bot-->>Dev: Comment on PR
  Dev->>K8s: QA on preview URL
  Dev->>CI: Merge PR
  CI->>K8s: helm uninstall pr-1842

GitHub Actions — review app deploy

name: Review App
on:
  pull_request:
    types: [opened, synchronize, reopened, closed]
permissions:
  contents: read
  id-token: write
  pull-requests: write
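env:
  PREVIEW_NS: pr-${{ github.event.pull_request.number }}
  PREVIEW_HOST: pr-${{ github.event.pull_request.number }}.preview.example.com
  # Full ECR repository URI used by the build step (same repository as the Helm values)
  ECR_REPO: 123456789012.dkr.ecr.us-east-1.amazonaws.com/api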
jobs:
  deploy:
    if: github.event.action != 'closed'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789012:role/gha-preview
          aws-region: us-east-1
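      - name: Build and push
        run: |
          # ECR_REPO must hold the full repository URI; log in to its registry host first
          aws ecr get-login-password --region us-east-1 \
            | docker login --username AWS --password-stdin "${ECR_REPO%%/*}"
          docker build -t "$ECR_REPO:${{ github.sha }}" .
          docker push "$ECR_REPO:${{ github.sha }}"
      - name: Deploy preview
        run: |
          # Assumes kubeconfig for the preview cluster is already configured
          # (for example via `aws eks update-kubeconfig`)
          helm upgrade --install "$PREVIEW_NS" ./chart \
            --namespace "$PREVIEW_NS" --create-namespace \
            --set image.tag=${{ github.sha }} \
            --set ingress.host="$PREVIEW_HOST" \
            --wait --timeout 5m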
      - name: Comment preview URL
        uses: actions/github-script@v7
        with:
          script: |
            await github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: '🚀 Preview: https://${{ env.PREVIEW_HOST }}'
            })
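  teardown:
    if: github.event.action == 'closed'
    runs-on: ubuntu-latest
    steps:
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789012:role/gha-preview
          aws-region: us-east-1
      # Assumes kubeconfig for the preview cluster is configured at this point
      - run: helm uninstall "$PREVIEW_NS" -n "$PREVIEW_NS" || true
      - run: kubectl delete namespace "$PREVIEW_NS" --wait=false --ignore-not-found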

GitLab CI — review app with environments

review:deploy:
  stage: deploy
  rules:
    - if: $CI_PIPELINE_SOURCE == "merge_request_event"
  environment:
    name: review/$CI_MERGE_REQUEST_IID
    url: https://mr-$CI_MERGE_REQUEST_IID.preview.example.com
    on_stop: review:stop
    auto_stop_in: 3 days
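  script:
    - docker login -u "$CI_REGISTRY_USER" -p "$CI_REGISTRY_PASSWORD" "$CI_REGISTRY"
    - docker build -t $CI_REGISTRY_IMAGE:$CI_COMMIT_SHA .
    - docker push $CI_REGISTRY_IMAGE:$CI_COMMIT_SHA
    - helm upgrade --install mr-$CI_MERGE_REQUEST_IID ./chart
        --set image.tag=$CI_COMMIT_SHA
        --set ingress.host=mr-$CI_MERGE_REQUEST_IID.preview.example.com
        --namespace mr-$CI_MERGE_REQUEST_IID --create-namespace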

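# Triggered manually, by auto_stop_in, or by GitLab through on_stop
# when the merge request is merged or its branch is deleted
review:stop:
  stage: deploy
  variables:
    GIT_STRATEGY: none   # the source branch may already be deleted when this runs
  rules:
    - if: $CI_PIPELINE_SOURCE == "merge_request_event"
      when: manual
      allow_failure: true   # keep the pipeline from blocking on the manual stop job
  environment:
    name: review/$CI_MERGE_REQUEST_IID
    action: stop
  script:
    - helm uninstall mr-$CI_MERGE_REQUEST_IID -n mr-$CI_MERGE_REQUEST_IID
    - kubectl delete namespace mr-$CI_MERGE_REQUEST_IID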

Cost and resource controls

Control	Implementation	Impact
TTL auto-destroy	GitLab `auto_stop_in`, CronJob namespace sweeper	Prevents zombie previews billing overnight
Resource quotas	`ResourceQuota` per preview namespace	Caps CPU/memory per PR
LimitRange	Default container requests/limits	Stops runaway Java heaps in previews
Shared dependencies	Point previews at staging Redis/Postgres	Faster spin-up; isolation trade-off
Scale-to-zero	Knative, KEDA, or a scheduled scale-down off-hours (plain HPA needs a feature gate for minReplicas=0)	Save compute on idle previews
Max concurrent previews	Queue deploys when > N active	Protects cluster during PR storms
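
The TTL auto-destroy row can be implemented as a small sweeper that a CronJob runs periodically. A minimal Python sketch of the selection logic only, assuming the caller has already fetched namespace names and creation timestamps from the cluster; the function name `expired_previews` is hypothetical and the actual deletion call is left out.

```python
from datetime import datetime, timedelta, timezone


def expired_previews(
    namespaces: dict[str, datetime], ttl: timedelta, now: datetime
) -> list[str]:
    """Return preview namespaces (pr-*) older than the TTL, oldest first."""
    stale = [
        (created, name)
        for name, created in namespaces.items()
        # Only pr-* namespaces are candidates; long-lived environments are never swept.
        if name.startswith("pr-") and now - created > ttl
    ]
    return [name for _, name in sorted(stale)]


now = datetime(2024, 1, 10, tzinfo=timezone.utc)
namespaces = {
    "pr-1": now - timedelta(days=5),
    "pr-2": now - timedelta(days=1),
    "pr-3": now - timedelta(days=4),
    "staging": now - timedelta(days=100),
}
to_delete = expired_previews(namespaces, ttl=timedelta(days=3), now=now)
```

Passing `now` explicitly keeps the function deterministic and easy to test; the sweeper entry point would supply `datetime.now(timezone.utc)`.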

💡 Tip

Use ArgoCD ApplicationSet with pull request generator to declaratively manage preview apps. The generator watches PR labels and creates/deletes Applications automatically—no custom teardown scripts.

🔒 Security

Preview URLs are often public or semi-public. Require SSO (OAuth2 proxy), IP allowlists for internal tools, or magic-link auth. Never expose admin panels on preview without authentication.

Database strategies for previews

Strategy	Pros	Cons
Ephemeral DB per PR	Full isolation, schema migration tests	Slow spin-up (30–90s), storage cost
Shared staging DB + schema prefix	Fast, cheap	Cross-PR pollution risk
DB branch (Neon, PlanetScale)	Instant copy, git-like workflow	Vendor lock-in, network egress
Seed fixtures only	Predictable test data	Misses migration edge cases

Promotion pipelines

Promotion moves a verified immutable artifact through environments—not source code branches. The staging deploy and prod deploy reference the same image digest; only configuration overlays and approval gates differ.
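
A promotion gate can check this property directly by comparing the digest part of the two image references. A minimal Python sketch, assuming references of the form `repository@sha256:<hex>`; the helper names are hypothetical.

```python
from typing import Optional


def image_digest(image_ref: str) -> Optional[str]:
    """Return the `sha256:...` part of an image reference, or None if it is tag-only."""
    _, sep, digest = image_ref.partition("@")
    return digest if sep and digest.startswith("sha256:") else None


def same_artifact(staging_ref: str, prod_ref: str) -> bool:
    """True only when both references are digest-pinned and the digests match."""
    staging_digest = image_digest(staging_ref)
    prod_digest = image_digest(prod_ref)
    return staging_digest is not None and staging_digest == prod_digest
```

Tag-only references are rejected on purpose: a tag can be moved after staging tests have passed, so it proves nothing about which bytes are deployed. The repository part may differ, as it does when an image is copied to a separate prod registry.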

Promotion flow

flowchart TD
  A[Build + test + scan] --> B[Push to registry]
  B --> C{Scan pass?}
  C -->|no| X[Block pipeline]
  C -->|yes| D[Sign with Cosign]
  D --> E[Deploy staging]
  E --> F{E2E + perf OK?}
  F -->|no| X
  F -->|yes| G[Manual/auto approval]
  G --> H[Deploy prod same digest]
  H --> I[Deployment marker in APM]

Registry promotion models

Model	How it works	Best for
Tag promotion	Retag `sha-abc` → `staging` → `prod`	Simple; same registry repo
Repo promotion	Copy image to `prod-registry` (crane copy)	Air-gapped prod, separate accounts
Digest-only GitOps	Git commit updates digest in `values-prod.yaml`	ArgoCD/Flux audit trail
Artifact hub	Harbor replication rules between projects	Multi-region, geo-fenced prod
Binary promotion	JFrog/Xray promote after scan gate	Enterprise artifact governance

GitHub Actions — promote to production

name: Promote to Production
on:
  workflow_dispatch:
    inputs:
      image_digest:
        description: 'sha256 digest verified in staging'
        required: true
      release_notes:
        required: true
jobs:
  verify-staging:
    runs-on: ubuntu-latest
    steps:
      - name: Confirm digest deployed in staging
        env:
          IMAGE_DIGEST: ${{ inputs.image_digest }}
        run: |
          # Assumes kubeconfig for the staging cluster is already configured on the runner
          DEPLOYED=$(kubectl -n staging get deploy api -o jsonpath='{.spec.template.spec.containers[0].image}')
          echo "Staging runs: $DEPLOYED"
          [[ "$DEPLOYED" == *"$IMAGE_DIGEST"* ]] || exit 1
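  promote:
    needs: verify-staging
    runs-on: ubuntu-latest
    environment: production
    permissions:
      contents: write   # required to push the promotion commit
    steps:
      - uses: actions/checkout@v4
      - name: Update prod manifest
        env:
          IMAGE_DIGEST: ${{ inputs.image_digest }}
        run: |
          yq -i '.image.digest = strenv(IMAGE_DIGEST)' deploy/overlays/prod/values.yaml
          git config user.name "github-actions[bot]"
          git config user.email "41898282+github-actions[bot]@users.noreply.github.com"
          git commit -am "promote: $IMAGE_DIGEST"
          git push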
      - name: Wait for ArgoCD sync
        run: argocd app wait api-prod --health --timeout 600

GitLab — environment-scoped deploy with protected environments

stages: [build, staging, prod]

build:
  stage: build
  script:
    - docker build -t $CI_REGISTRY_IMAGE:$CI_COMMIT_SHA .
    - docker push $CI_REGISTRY_IMAGE:$CI_COMMIT_SHA
    - cosign sign --yes $CI_REGISTRY_IMAGE:$CI_COMMIT_SHA
  rules:
    - if: $CI_COMMIT_BRANCH == $CI_DEFAULT_BRANCH

deploy:staging:
  stage: staging
  environment:
    name: staging
    url: https://staging.example.com
  script:
    - helm upgrade api ./chart --set image.tag=$CI_COMMIT_SHA -n staging
  needs: [build]
  rules:
    - if: $CI_COMMIT_BRANCH == $CI_DEFAULT_BRANCH

deploy:production:
  stage: prod
  environment:
    name: production   # configure as a protected environment to restrict who can run this job
    url: https://api.example.com
  script:
    - helm upgrade api ./chart --set image.tag=$CI_COMMIT_SHA -n production
  needs: [deploy:staging]
  rules:
    - if: $CI_COMMIT_BRANCH == $CI_DEFAULT_BRANCH
      when: manual

Approval gates and compliance

Gate	Tooling	Audit artifact
Security scan pass	Trivy, Grype SARIF in PR	SARIF report + waiver ticket
Change advisory board	ServiceNow/Jira change request	CHG ticket linked in deploy PR
Four-eyes approval	GitHub environment reviewers	Actions deployment log
Policy admission	Kyverno verifyImages	ClusterPolicy violation events
Rollback readiness	Previous digest pinned in Git	Git revert commit

🎯 Interview Tip

Explain promotion as moving trust, not code: build once, sign, test in staging with production-like config, then deploy the exact digest to prod. Mention Cosign verify in admission controller and GitOps for audit trail.

🔬 Under the Hood

Retagging in a registry does not copy layers—it updates a pointer. Digest sha256:abc... is content-addressable: same bytes, same digest, anywhere. This is why digest promotion beats branch-based deploys.
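
The pointer-versus-content distinction can be shown with a toy model. The Python sketch below is not a registry client: the `blobs` and `tags` dictionaries and the manifest bytes are illustrative stand-ins, and only the `sha256:<hex>` digest format mirrors what registries use.

```python
import hashlib


def content_digest(blob: bytes) -> str:
    """Digest in the `sha256:<hex>` form used by container registries."""
    return "sha256:" + hashlib.sha256(blob).hexdigest()


# A toy registry: blobs are stored by digest, tags are just mutable pointers.
blobs: dict[str, bytes] = {}
tags: dict[str, str] = {}

manifest = b"example image manifest bytes"
digest = content_digest(manifest)
blobs[digest] = manifest
tags["sha-abc"] = digest

# "Promoting" by retag adds a pointer; no bytes are copied or changed.
tags["prod"] = tags["sha-abc"]
```

After the retag there is still exactly one stored blob, and both tags resolve to the same digest. Changing a single byte of the manifest would produce a different digest, which is what makes a digest safe to promote.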

Rollback strategy

Git revert: revert the promotion commit; ArgoCD syncs the previous digest (usually the fastest path for GitOps)
Helm rollback: helm rollback api 42 restores release revision 42; omit the revision number to return to the previous release
Flag flip: disable the feature in LaunchDarkly while the digest rollback proceeds
Database: forward-only migrations need a compatibility window in which both the old and new app versions work against the schema; never auto-rollback schema

Configuration per environment

Twelve-Factor App rule III: store config in the environment. In practice that means non-secret config in Git (Helm values, Kustomize overlays) and secrets in a vault injected at deploy time—never committed, never baked into images.

Configuration layers

Layer	Storage	Examples	Rotation
Build-time	CI variables (non-secret)	Node version, registry URL	Per pipeline
App config	ConfigMap / values.yaml	Log level, feature toggles, timeouts	Git PR
Secrets	Vault, AWS SM, Sealed Secrets	DB password, API keys	Automated rotation
Runtime discovery	K8s downward API, service DNS	Pod name, namespace	N/A
Identity	OIDC / IRSA / Workload Identity	AWS role ARN bound to SA	Short-lived tokens

Helm values hierarchy

# values.yaml (chart defaults shared by every environment)
replicaCount: 2
image:
  repository: 123456789012.dkr.ecr.us-east-1.amazonaws.com/api
  tag: ""  # set by CI
  pullPolicy: IfNotPresent
resources:
  requests:
    cpu: 100m
    memory: 256Mi
  limits:
    cpu: 500m
    memory: 512Mi
env:
  LOG_LEVEL: info
  OTEL_EXPORTER_OTLP_ENDPOINT: ""  # overlay per env

# values-staging.yaml (staging overlay)
replicaCount: 1
env:
  LOG_LEVEL: debug
  OTEL_EXPORTER_OTLP_ENDPOINT: https://otel-staging.internal:4317
  PAYMENT_API_URL: https://sandbox.payments.example.com
ingress:
  host: api-staging.example.com
  tls:
    secretName: api-staging-tls

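# values-prod.yaml (production overlay)
replicaCount: 6
resources:
  requests:
    cpu: 500m
    memory: 1Gi
  limits:            # illustrative values; the inherited 512Mi limit would sit below the 1Gi request
    cpu: "1"
    memory: 2Gi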
env:
  LOG_LEVEL: warn
  OTEL_EXPORTER_OTLP_ENDPOINT: https://otel-prod.internal:4317
  PAYMENT_API_URL: https://api.payments.example.com
ingress:
  host: api.example.com
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
podDisruptionBudget:
  minAvailable: 4

Kustomize overlay structure

deploy/
├── base/
│   ├── kustomization.yaml
│   ├── deployment.yaml
│   ├── service.yaml
│   └── configmap.yaml
└── overlays/
    ├── staging/
    │   ├── kustomization.yaml    # resources: ../../base + patches
    │   ├── replica-patch.yaml
    │   └── ingress-staging.yaml
    └── production/
        ├── kustomization.yaml
        ├── hpa.yaml
        └── ingress-prod.yaml

External Secrets Operator pattern

apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: api-db-credentials
  namespace: production
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: aws-secrets-manager
    kind: ClusterSecretStore
  target:
    name: api-db-credentials
    creationPolicy: Owner
  data:
    - secretKey: DATABASE_URL
      remoteRef:
        key: prod/api/database-url
    - secretKey: API_KEY
      remoteRef:
        key: prod/api/payment-key
        property: current

Config drift detection

Tool	What it compares	When it runs
ArgoCD diff	Git desired vs cluster live	Continuous sync
Terraform plan	State vs cloud API	PR + scheduled
kubectl diff	Local manifest vs cluster	Pre-apply CI step
Conftest/OPA	Manifest vs policy	Admission + CI
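
All of these tools perform some form of structured comparison between a desired state and an observed state. A minimal Python sketch of that idea on nested dictionaries; `find_drift` is a hypothetical helper and does not reproduce the output format of any tool in the table.

```python
def find_drift(desired: dict, live: dict, prefix: str = "") -> list[str]:
    """List dotted paths where the live state differs from the desired state."""
    drift = []
    for key in sorted(set(desired) | set(live)):
        path = f"{prefix}.{key}" if prefix else str(key)
        if key not in live:
            drift.append(f"{path}: missing in live")
        elif key not in desired:
            drift.append(f"{path}: unexpected in live")
        elif isinstance(desired[key], dict) and isinstance(live[key], dict):
            # Recurse so the report points at the exact nested field.
            drift.extend(find_drift(desired[key], live[key], path))
        elif desired[key] != live[key]:
            drift.append(f"{path}: {desired[key]!r} != {live[key]!r}")
    return drift


desired = {"replicas": 6, "env": {"LOG_LEVEL": "warn"}}
live = {"replicas": 5, "env": {"LOG_LEVEL": "warn", "DEBUG": "1"}}
report = find_drift(desired, live)
```

In this example the report flags the manually added `env.DEBUG` variable and the changed replica count, which are the kinds of out-of-band edits GitOps tooling is meant to surface.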

⚙️ Config

Use configMapGenerator in Kustomize for env-specific non-secret files. Hash suffixes trigger rolling updates when config changes—unlike static ConfigMaps that require manual pod restart.
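
The hash-suffix mechanism can be illustrated with a short Python sketch. It assumes a simplified scheme (SHA-256 over the sorted key/value pairs, truncated to 10 characters) rather than Kustomize's actual hashing algorithm; `hashed_name` is a hypothetical helper.

```python
import hashlib
import json


def hashed_name(name: str, data: dict[str, str]) -> str:
    """Append a short content hash so any change in data yields a new object name."""
    canonical = json.dumps(data, sort_keys=True).encode()
    return f"{name}-{hashlib.sha256(canonical).hexdigest()[:10]}"


before = hashed_name("api-config", {"LOG_LEVEL": "info"})
after = hashed_name("api-config", {"LOG_LEVEL": "debug"})
```

Because `before` and `after` differ, a Deployment that references the ConfigMap by its generated name sees a changed pod template and rolls its pods, while identical content always maps back to the same name.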

⚠️ Pitfall

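Setting ENV=production inside the container image (for example with a Dockerfile ENV instruction) bakes the environment into the artifact, so the image tested in staging is no longer the one that runs in prod. Build one image; inject environment-specific values at deploy time via Helm --set or Kustomize overlays.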

Feature flags vs environment config

Environment config — connection strings, replica counts, ingress hosts (changes rarely, reviewed in Git)
Feature flags — user-facing behavior toggles (changes frequently, owned by product, no redeploy)
Use LaunchDarkly/Unleash/Flagsmith for flags; never conflate with Helm values unless the flag is deploy-gating (e.g. enable canary)

GitOps repository layout

Organize environment config so ArgoCD ApplicationSets can discover overlays without custom glue code.

gitops/
├── apps/
│   ├── api/
│   │   ├── base/                 # shared Deployment, Service, SA
│   │   └── overlays/
│   │       ├── dev/
│   │       ├── staging/
│   │       ├── prod-us-east/
│   │       └── prod-eu-west/
│   └── worker/
├── clusters/
│   ├── staging-eks/              # cluster-specific bootstrap
│   └── prod-eks/
├── policies/                     # Kyverno / OPA bundles
└── environments/
    ├── staging.yaml              # ArgoCD ApplicationSet generator input
    └── production.yaml

Multi-region and data residency

Requirement	Pattern	Config implication
EU GDPR	prod-eu-west overlay only	EU Vault path; no US log shipping
Active-active	Global load balancer + regional deploys	Separate DB per region; async replication
Active-passive DR	DR cluster warm standby	Promote DR overlay on failover runbook
Latency	Deploy closest to users	CDN for static; API in 2+ regions

Environment access control (RBAC)

Role	dev	staging	prod
Developer	read/write pods, logs	read-only + port-forward	no direct kubectl
On-call SRE	full	full	deploy + rollback, no secret read
Security	read all	read all + policy edit	read + break-glass audit
CI service account	deploy to pr-* ns	deploy on main merge	deploy via GitOps only

🌍 Real World

Spotify's deployment model uses a small set of long-lived environments plus dynamic staging for high-risk changes. Platform teams expose a self-service API: teams request an environment tier, Atlantis provisions namespace + quotas, and ArgoCD wires the Application—all without a ticket to ops.

Cost optimization checklist

Auto-stop preview environments 72h after their last deploy (GitLab auto_stop_in or a CronJob sweeper)
Scale staging to zero replicas nights/weekends if no scheduled tests
Right-size staging nodes—do not run prod-sized instances for 10% traffic
Use spot/preemptible nodes for dev and preview tiers
Tag all cloud resources with env, team, cost-center for chargeback
Review monthly: count orphaned namespaces, unused PVCs, stale preview DNS records
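
The tagging item is easy to enforce automatically. A minimal Python sketch, assuming resource tags have already been fetched as a plain dictionary; `REQUIRED_TAGS` follows the checklist above and `missing_tags` is a hypothetical helper.

```python
REQUIRED_TAGS = ("env", "team", "cost-center")


def missing_tags(resource_tags: dict[str, str]) -> list[str]:
    """Return the required tags that are absent or empty on a resource."""
    return [tag for tag in REQUIRED_TAGS if not resource_tags.get(tag)]
```

A CI policy step or a scheduled audit could fail any resource for which `missing_tags` returns a non-empty list.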

ArgoCD ApplicationSet — preview generator

apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: api-previews
  namespace: argocd
spec:
  generators:
    - pullRequest:
        github:
          owner: org
          repo: api
          labels: [preview]
        requeueAfterSeconds: 60
  template:
    metadata:
      name: 'api-pr-{{number}}'
    spec:
      project: previews
      source:
        repoURL: https://github.com/org/api.git
        targetRevision: '{{head_sha}}'
        path: chart   # Helm chart directory, so the helm parameters below apply
        helm:
          parameters:
            - name: ingress.host
              value: 'pr-{{number}}.preview.example.com'
      destination:
        server: https://kubernetes.default.svc
        namespace: 'pr-{{number}}'
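      syncPolicy:
        automated:
          prune: true
          selfHeal: true
        syncOptions:
          - CreateNamespace=true   # the pr-{{number}} namespace does not exist before the first sync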