Environment Management

From local laptops to production regions—how to design environment topology, spin up ephemeral review apps on every PR, promote immutable artifacts through gated pipelines, and manage configuration without secrets in git.


Environment strategy

Environments are not copies of production—they are risk containers with different data policies, access controls, and promotion gates. The goal is predictable promotion: the same artifact tested in staging is what runs in production, with only configuration overlays changing.

Core principles

  • Parity where it matters — staging mirrors prod topology (ingress, DB engine, message broker) even if scaled down
  • Immutability across promotion — promote signed image digests, not rebuilt tags
  • Least privilege per env — prod credentials never exist in dev CI variables
  • Explicit data classification — each environment declares what data classes it may hold
  • Cost-aware sprawl control — ephemeral envs auto-destroy; long-lived envs have owners and budgets

Environment matrix

Environment | Purpose | Typical workloads | Config source | Data policy
--- | --- | --- | --- | ---
local | Developer laptop | Feature dev, unit tests | docker-compose, .env.local | None; never prod data
dev | Shared integration | API contract tests, smoke | K8s namespace dev-* | Synthetic fixtures only
staging | Pre-prod mirror | E2E, perf, security scans | Helm values-staging.yaml | Anonymized prod subset
uat | Business acceptance | Stakeholder sign-off | Same chart, uat overlay | Masked PII, time-limited
prod | Customer traffic | Live workloads | Helm values-prod.yaml + sealed secrets | Full; encrypted at rest
dr | Disaster recovery | Failover validation | Cross-region replica | Prod backup restore
sandbox | Experimentation | Spikes, PoCs | Ephemeral namespace, TTL 7d | No customer data
perf | Load testing | Soak, spike, chaos | Scaled-down prod topology | Generated traffic data

Naming conventions

Pattern | Example | Rationale
--- | --- | ---
{service}-{env} | payments-staging | Clear blast-radius boundaries in logs and metrics
{env}/{service} | staging/payments | GitOps folder structure maps 1:1 to ArgoCD apps
pr-{number} | pr-1842 | Ephemeral review apps tied to merge request lifecycle
{region}-{env} | us-east-prod | Multi-region deployments with regional config overlays

flowchart LR
  subgraph dev["Development"]
    L[local] --> D[dev]
  end
  subgraph preprod["Pre-production"]
    D --> S[staging]
    S --> U[UAT]
  end
  subgraph live["Live"]
    U --> P[prod]
    P --> DR[dr standby]
  end
  PR[review app] -.->|merge| D
  style P fill:#7c3aed,color:#fff
  style PR fill:#38bdf8,color:#000
🌍 Real World

Shopify runs thousands of ephemeral preview environments per day. Each maps to a PR branch; DNS and TLS are automated. Destroy on merge keeps cloud spend predictable while giving reviewers a live URL.

⚠️ Pitfall

Using production database snapshots in staging without masking creates compliance violations and lets a staging bug corrupt prod-adjacent data. Use synthetic data generators or field-level masking pipelines instead.

⚖️ Trade-off

Full prod parity in staging can roughly double infrastructure cost. Compromise: match critical dependencies (auth provider, payment sandbox, feature flag service) and accept smaller replica counts.

Anti-patterns to avoid

Anti-pattern | Why it fails | Fix
--- | --- | ---
Snowflake prod | Staging never catches prod-only config | GitOps: same Helm chart, env-specific values only
Rebuild on deploy | Different bytecode than tested artifact | Promote immutable digest from registry
Shared staging DB | Teams block each other; flaky E2E | Namespace-per-team or ephemeral DB per review app
Manual env provisioning | Weeks to onboard; drift from IaC | Terraform/Atlantis or Crossplane self-service
Prod secrets in .env files | Leak via git, Slack, screenshots | Vault/Secrets Manager + OIDC in CI

Decision checklist

  1. How many environments does your compliance framework require? (SOC2 often needs staging + prod separation)
  2. Can you afford full prod parity or only critical-path parity?
  3. Who owns each long-lived environment? (RACI: platform provisions, app team configures)
  4. What is your RTO/RPO for DR—does DR need warm standby or cold restore?
  5. Do you need regional environments (EU data residency) or single-region with regional overlays?

Ephemeral review apps

Review apps (preview environments) deploy every pull request to an isolated URL so reviewers test real behavior—not screenshots. They are short-lived, namespaced, and destroyed on merge or TTL expiry.

Lifecycle

  1. PR opened — CI builds artifact, deploys to pr-{number} namespace
  2. DNS/TLS — wildcard cert + ingress rule: pr-1842.preview.example.com
  3. Dependencies — spin ephemeral Postgres (CloudNativePG) or wire to shared staging services
  4. Comment bot — post preview URL on PR for designer/QA access
  5. PR updated — redeploy same namespace, rolling update
  6. PR merged/closed — teardown namespace, delete DNS, purge secrets
sequenceDiagram
  participant Dev as Developer
  participant CI as CI Pipeline
  participant K8s as Kubernetes
  participant Bot as PR Bot
  Dev->>CI: Push branch / open PR
  CI->>CI: Build + scan image
  CI->>K8s: Helm upgrade --install pr-1842
  K8s-->>CI: Ingress URL ready
  CI->>Bot: Post preview link
  Bot-->>Dev: Comment on PR
  Dev->>K8s: QA on preview URL
  Dev->>CI: Merge PR
  CI->>K8s: helm uninstall pr-1842

GitHub Actions — review app deploy

.github/workflows/review-app.yml
name: Review App
on:
  pull_request:
    types: [opened, synchronize, reopened, closed]
permissions:
  contents: read
  id-token: write
  pull-requests: write
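env:
  PREVIEW_NS: pr-${{ github.event.pull_request.number }}
  PREVIEW_HOST: pr-${{ github.event.pull_request.number }}.preview.example.com
  # Registry path used by the build step; matches image.repository in values.yaml
  ECR_REPO: 123456789012.dkr.ecr.us-east-1.amazonaws.com/api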
jobs:
  deploy:
    if: github.event.action != 'closed'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789012:role/gha-preview
          aws-region: us-east-1
      - name: Log in to Amazon ECR
        uses: aws-actions/amazon-ecr-login@v2
      - name: Build and push
        run: |
          docker build -t "$ECR_REPO:${{ github.sha }}" .
          docker push "$ECR_REPO:${{ github.sha }}"
      - name: Deploy preview
        # Assumes the runner already has a kubeconfig for the preview cluster
        run: |
          helm upgrade --install "$PREVIEW_NS" ./chart \
            --namespace "$PREVIEW_NS" --create-namespace \
            --set image.tag=${{ github.sha }} \
            --set ingress.host="$PREVIEW_HOST" \
            --wait --timeout 5m
      - name: Comment preview URL
        uses: actions/github-script@v7
        with:
          script: |
            // Post the preview link on the pull request
            await github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: '🚀 Preview: https://${{ env.PREVIEW_HOST }}'
            })
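  teardown:
    if: github.event.action == 'closed'
    runs-on: ubuntu-latest
    steps:
      # Teardown needs the same cloud credentials (and kubeconfig) as deploy
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789012:role/gha-preview
          aws-region: us-east-1
      - run: helm uninstall "$PREVIEW_NS" -n "$PREVIEW_NS" || true
      - run: kubectl delete namespace "$PREVIEW_NS" --wait=false --ignore-not-found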

GitLab CI — review app with environments

.gitlab-ci.yml (review)
review:deploy:
  stage: deploy
  rules:
    - if: $CI_PIPELINE_SOURCE == "merge_request_event"
  environment:
    name: review/$CI_MERGE_REQUEST_IID
    url: https://mr-$CI_MERGE_REQUEST_IID.preview.example.com
    on_stop: review:stop
    auto_stop_in: 3 days
  script:
    - docker build -t $CI_REGISTRY_IMAGE:$CI_COMMIT_SHA .
    - docker push $CI_REGISTRY_IMAGE:$CI_COMMIT_SHA
    - helm upgrade --install mr-$CI_MERGE_REQUEST_IID ./chart
        --set image.tag=$CI_COMMIT_SHA
        --namespace mr-$CI_MERGE_REQUEST_IID --create-namespace

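# GitLab triggers this on_stop job when the MR is merged or closed,
# or when auto_stop_in expires; it can also be run manually.
review:stop:
  stage: deploy
  variables:
    GIT_STRATEGY: none   # the source branch may already be deleted
  rules:
    - if: $CI_PIPELINE_SOURCE == "merge_request_event"
      when: manual
  environment:
    name: review/$CI_MERGE_REQUEST_IID
    action: stop
  script:
    - helm uninstall mr-$CI_MERGE_REQUEST_IID -n mr-$CI_MERGE_REQUEST_IID
    - kubectl delete namespace mr-$CI_MERGE_REQUEST_IID --ignore-not-found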

Cost and resource controls

Control | Implementation | Impact
--- | --- | ---
TTL auto-destroy | GitLab auto_stop_in, CronJob namespace sweeper | Prevents zombie previews billing overnight
Resource quotas | ResourceQuota per preview namespace | Caps CPU/memory per PR
LimitRange | Default container requests/limits | Stops runaway Java heaps in previews
Shared dependencies | Point previews at staging Redis/Postgres | Faster spin-up; isolation trade-off
Scale-to-zero | Knative, or HPA minReplicas=0 off-hours (requires the HPAScaleToZero feature gate) | Saves compute on idle previews
Max concurrent previews | Queue deploys when > N active | Protects cluster during PR storms

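The ResourceQuota and LimitRange rows above could be sketched as follows for a single preview namespace. The object names and every size here are illustrative assumptions, not recommendations; in practice these manifests would likely be templated per PR by the chart or the GitOps tool.

```yaml
# Hypothetical caps for one preview namespace (pr-1842); all sizes are placeholders
apiVersion: v1
kind: ResourceQuota
metadata:
  name: preview-quota
  namespace: pr-1842
spec:
  hard:
    requests.cpu: "2"
    requests.memory: 4Gi
    limits.cpu: "4"
    limits.memory: 8Gi
    pods: "20"
---
# Defaults applied to containers that declare no requests/limits of their own
apiVersion: v1
kind: LimitRange
metadata:
  name: preview-defaults
  namespace: pr-1842
spec:
  limits:
    - type: Container
      defaultRequest:
        cpu: 100m
        memory: 128Mi
      default:
        cpu: 500m
        memory: 512Mi
```

The two objects work as a pair: once a quota covers requests and limits, pods that omit them are rejected, so the LimitRange defaults keep unannotated workloads deployable while still counting against the cap.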
💡 Tip

Use ArgoCD ApplicationSet with pull request generator to declaratively manage preview apps. The generator watches PR labels and creates/deletes Applications automatically—no custom teardown scripts.

🔒 Security

Preview URLs are often public or semi-public. Require SSO (OAuth2 proxy), IP allowlists for internal tools, or magic-link auth. Never expose admin panels on preview without authentication.
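As a rough sketch of the SSO option, the Ingress below delegates authentication to an OAuth2 proxy. It assumes the ingress-nginx controller and an oauth2-proxy instance reachable under /oauth2 on the same host; the Ingress name, the backend Service name `api`, and its port are hypothetical.

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: api-preview            # hypothetical name
  namespace: pr-1842
  annotations:
    # ingress-nginx sends a subrequest here; a non-2xx answer blocks the request
    nginx.ingress.kubernetes.io/auth-url: "https://$host/oauth2/auth"
    # unauthenticated users are redirected here to sign in
    nginx.ingress.kubernetes.io/auth-signin: "https://$host/oauth2/start?rd=$escaped_request_uri"
spec:
  ingressClassName: nginx
  rules:
    - host: pr-1842.preview.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: api      # assumed Service name
                port:
                  number: 80
```

Other ingress controllers expose the same idea through different annotations or CRDs, so treat the annotation keys as specific to ingress-nginx.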

Database strategies for previews

Strategy | Pros | Cons
--- | --- | ---
Ephemeral DB per PR | Full isolation, schema migration tests | Slow spin-up (30–90s), storage cost
Shared staging DB + schema prefix | Fast, cheap | Cross-PR pollution risk
DB branch (Neon, PlanetScale) | Instant copy, git-like workflow | Vendor lock-in, network egress
Seed fixtures only | Predictable test data | Misses migration edge cases
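For the "Ephemeral DB per PR" row, a minimal CloudNativePG cluster might look like the sketch below. It assumes the CloudNativePG operator is already installed in the cluster; the cluster name, database name, and storage size are placeholders.

```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: preview-db             # hypothetical name
  namespace: pr-1842
spec:
  instances: 1                 # a single instance is usually enough for a preview
  storage:
    size: 1Gi
  bootstrap:
    initdb:
      database: app            # assumed database and owner names
      owner: app
```

Because the Cluster lives in the preview namespace, deleting the namespace on teardown removes the database and its volume along with the app. The operator also generates a connection Secret and a read-write Service for the cluster, which the chart can reference instead of a hand-written DATABASE_URL.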

Promotion pipelines

Promotion moves a verified immutable artifact through environments—not source code branches. The staging deploy and prod deploy reference the same image digest; only configuration overlays and approval gates differ.

Promotion flow

flowchart TD
  A[Build + test + scan] --> B[Push to registry]
  B --> C{Scan pass?}
  C -->|no| X[Block pipeline]
  C -->|yes| D[Sign with Cosign]
  D --> E[Deploy staging]
  E --> F{E2E + perf OK?}
  F -->|no| X
  F -->|yes| G[Manual/auto approval]
  G --> H[Deploy prod same digest]
  H --> I[Deployment marker in APM]

Registry promotion models

Model | How it works | Best for
--- | --- | ---
Tag promotion | Retag sha-abc → staging → prod | Simple; same registry repo
Repo promotion | Copy image to prod-registry (crane copy) | Air-gapped prod, separate accounts
Digest-only GitOps | Git commit updates digest in values-prod.yaml | ArgoCD/Flux audit trail
Artifact hub | Harbor replication rules between projects | Multi-region, geo-fenced prod
Binary promotion | JFrog/Xray promote after scan gate | Enterprise artifact governance

GitHub Actions — promote to production

.github/workflows/promote.yml
name: Promote to Production
on:
  workflow_dispatch:
    inputs:
      image_digest:
        description: 'sha256 digest verified in staging'
        required: true
      release_notes:
        required: true
jobs:
  verify-staging:
    runs-on: ubuntu-latest
    steps:
      - name: Confirm digest deployed in staging
        env:
          # Pass the input through env rather than interpolating it into the script
          IMAGE_DIGEST: ${{ inputs.image_digest }}
        run: |
          DEPLOYED=$(kubectl -n staging get deploy api -o jsonpath='{.spec.template.spec.containers[0].image}')
          echo "Staging runs: $DEPLOYED"
          [[ "$DEPLOYED" == *"$IMAGE_DIGEST"* ]] || exit 1
  promote:
    needs: verify-staging
    runs-on: ubuntu-latest
    environment: production
    steps:
      - uses: actions/checkout@v4
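      - name: Update prod manifest
        # The push below requires this job to have permissions: contents: write
        env:
          IMAGE_DIGEST: ${{ inputs.image_digest }}
        run: |
          yq -i '.image.digest = strenv(IMAGE_DIGEST)' deploy/overlays/prod/values.yaml
          git config user.name "github-actions[bot]"
          git config user.email "41898282+github-actions[bot]@users.noreply.github.com"
          git commit -am "promote: $IMAGE_DIGEST"
          git push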
      - name: Wait for ArgoCD sync
        run: argocd app wait api-prod --health --timeout 600

GitLab — environment-scoped deploy with protected environments

.gitlab-ci.yml (promote)
stages: [build, staging, prod]

build:
  stage: build
  script:
    - docker build -t $CI_REGISTRY_IMAGE:$CI_COMMIT_SHA .
    - docker push $CI_REGISTRY_IMAGE:$CI_COMMIT_SHA
    - cosign sign --yes $CI_REGISTRY_IMAGE:$CI_COMMIT_SHA
  rules:
    - if: $CI_COMMIT_BRANCH == $CI_DEFAULT_BRANCH

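deploy:staging:
  stage: staging
  environment:
    name: staging
    url: https://staging.example.com
  script:
    - helm upgrade api ./chart --set image.tag=$CI_COMMIT_SHA -n staging
  needs: [build]
  rules:
    # Same rule as build, so the needs: dependency always exists
    - if: $CI_COMMIT_BRANCH == $CI_DEFAULT_BRANCH

deploy:production:
  stage: prod
  environment:
    name: production   # mark as a protected environment so only approvers can run it
    url: https://api.example.com
  script:
    - helm upgrade api ./chart --set image.tag=$CI_COMMIT_SHA -n production
  needs: [deploy:staging]
  rules:
    - if: $CI_COMMIT_BRANCH == $CI_DEFAULT_BRANCH
      when: manual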

Approval gates and compliance

Gate | Tooling | Audit artifact
--- | --- | ---
Security scan pass | Trivy, Grype SARIF in PR | SARIF report + waiver ticket
Change advisory board | ServiceNow/Jira change request | CHG ticket linked in deploy PR
Four-eyes approval | GitHub environment reviewers | Actions deployment log
Policy admission | Kyverno verifyImages | ClusterPolicy violation events
Rollback readiness | Previous digest pinned in Git | Git revert commit

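The policy admission gate could be sketched as a Kyverno ClusterPolicy like the one below. This assumes key-based Cosign signing and the kyverno.io/v1 schema; the policy name is hypothetical, the public key is a placeholder, and field names shift somewhat between Kyverno releases, so check it against the version you run.

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: verify-api-image-signature     # hypothetical name
spec:
  validationFailureAction: Enforce     # reject, rather than only audit, unsigned images
  webhookTimeoutSeconds: 30
  rules:
    - name: require-cosign-signature
      match:
        any:
          - resources:
              kinds: [Pod]
              namespaces: [production]
      verifyImages:
        - imageReferences:
            - "123456789012.dkr.ecr.us-east-1.amazonaws.com/api*"
          attestors:
            - entries:
                - keys:
                    publicKeys: |-
                      -----BEGIN PUBLIC KEY-----
                      <cosign public key goes here>
                      -----END PUBLIC KEY-----
```

With a policy of this shape, a pod in production whose image lacks a valid signature is refused at admission, which is where the "ClusterPolicy violation events" audit artifact comes from.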
🎯 Interview Tip

Explain promotion as moving trust, not code: build once, sign, test in staging with production-like config, then deploy the exact digest to prod. Mention Cosign verify in admission controller and GitOps for audit trail.

🔬 Under the Hood

Retagging in a registry does not copy layers—it updates a pointer. Digest sha256:abc... is content-addressable: same bytes, same digest, anywhere. This is why digest promotion beats branch-based deploys.
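One way a chart can consume a promoted digest is to reference the image by repository@digest instead of repository:tag. The fragment below is a sketch of a hypothetical templates/deployment.yaml; it assumes the chart exposes image.repository and image.digest values, in line with the .image.digest key the promote workflow writes.

```yaml
# templates/deployment.yaml (fragment, hypothetical chart layout)
spec:
  template:
    spec:
      containers:
        - name: api
          # repository@sha256:... pins exact content; a tag can be moved after testing
          image: "{{ .Values.image.repository }}@{{ .Values.image.digest }}"
```

With this in place, promotion becomes a one-line change to image.digest in the environment's values file, and the Git history records exactly which bytes ran where.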

Rollback strategy

  • Git revert: revert the promotion commit; ArgoCD syncs the previous digest (fastest for GitOps)
  • Helm rollback: helm rollback api 42 restores the release to revision 42
  • Flag flip: disable the feature in LaunchDarkly while the digest rollback proceeds
  • Database: forward-only migrations require a compatibility window; never auto-rollback schema

Configuration per environment

Twelve-Factor App rule III: store config in the environment. In practice that means non-secret config in Git (Helm values, Kustomize overlays) and secrets in a vault injected at deploy time—never committed, never baked into images.

Configuration layers

Layer | Storage | Examples | Rotation
--- | --- | --- | ---
Build-time | CI variables (non-secret) | Node version, registry URL | Per pipeline
App config | ConfigMap / values.yaml | Log level, feature toggles, timeouts | Git PR
Secrets | Vault, AWS SM, Sealed Secrets | DB password, API keys | Automated rotation
Runtime discovery | K8s downward API, service DNS | Pod name, namespace | N/A
Identity | OIDC / IRSA / Workload Identity | AWS role ARN bound to SA | Short-lived tokens
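To show how these layers meet in a running pod, here is a minimal Deployment sketch. The ConfigMap name api-config, the service account name, and the labels are assumptions; api-db-credentials matches the Secret that the External Secrets example in this section creates.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      serviceAccountName: api          # identity layer: SA bound to a cloud role
      containers:
        - name: api
          image: 123456789012.dkr.ecr.us-east-1.amazonaws.com/api   # tag or digest set by CI
          envFrom:
            - configMapRef:
                name: api-config            # app config layer (non-secret, from Git)
            - secretRef:
                name: api-db-credentials    # secrets layer (synced from the vault)
          env:
            - name: POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name  # runtime discovery via the downward API
```

The image itself carries none of these values, which is what allows the same artifact to run unchanged in every environment.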

Helm values hierarchy

values.yaml (base)
replicaCount: 2
image:
  repository: 123456789012.dkr.ecr.us-east-1.amazonaws.com/api
  tag: ""  # set by CI
  pullPolicy: IfNotPresent
resources:
  requests:
    cpu: 100m
    memory: 256Mi
  limits:
    cpu: 500m
    memory: 512Mi
env:
  LOG_LEVEL: info
  OTEL_EXPORTER_OTLP_ENDPOINT: ""  # overlay per env
values-staging.yaml
replicaCount: 1
env:
  LOG_LEVEL: debug
  OTEL_EXPORTER_OTLP_ENDPOINT: https://otel-staging.internal:4317
  PAYMENT_API_URL: https://sandbox.payments.example.com
ingress:
  host: api-staging.example.com
  tls:
    secretName: api-staging-tls
values-prod.yaml
replicaCount: 6
resources:
  requests:
    cpu: 500m
    memory: 1Gi
env:
  LOG_LEVEL: warn
  OTEL_EXPORTER_OTLP_ENDPOINT: https://otel-prod.internal:4317
  PAYMENT_API_URL: https://api.payments.example.com
ingress:
  host: api.example.com
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
podDisruptionBudget:
  minAvailable: 4

Kustomize overlay structure

deploy/
├── base/
│   ├── kustomization.yaml
│   ├── deployment.yaml
│   ├── service.yaml
│   └── configmap.yaml
└── overlays/
    ├── staging/
    │   ├── kustomization.yaml    # resources: ../../base + patches
    │   ├── replica-patch.yaml
    │   └── ingress-staging.yaml
    └── production/
        ├── kustomization.yaml
        ├── hpa.yaml
        └── ingress-prod.yaml
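The staging overlay's kustomization.yaml in this tree might look roughly like the sketch below. The file names come from the tree; the namespace and the image name are assumptions, and the digest is deliberately left as a placeholder.

```yaml
# deploy/overlays/staging/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: staging
resources:
  - ../../base                 # shared Deployment, Service, ConfigMap
  - ingress-staging.yaml       # staging-only resource
patches:
  - path: replica-patch.yaml   # overrides replicas on the base Deployment
images:
  - name: 123456789012.dkr.ecr.us-east-1.amazonaws.com/api
    digest: "sha256:<digest promoted by CI>"   # placeholder, not a real digest
```

Running `kubectl kustomize deploy/overlays/staging` renders the merged manifests locally, which makes overlay changes reviewable before anything reaches the cluster.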

External Secrets Operator pattern

external-secret.yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: api-db-credentials
  namespace: production
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: aws-secrets-manager
    kind: ClusterSecretStore
  target:
    name: api-db-credentials
    creationPolicy: Owner
  data:
    - secretKey: DATABASE_URL
      remoteRef:
        key: prod/api/database-url
    - secretKey: API_KEY
      remoteRef:
        key: prod/api/payment-key
        property: current

Config drift detection

Tool | What it compares | When it runs
--- | --- | ---
ArgoCD diff | Git desired vs cluster live | Continuous sync
Terraform plan | State vs cloud API | PR + scheduled
kubectl diff | Local manifest vs cluster | Pre-apply CI step
Conftest/OPA | Manifest vs policy | Admission + CI

⚙️ Config

Use configMapGenerator in Kustomize for env-specific non-secret files. Hash suffixes trigger rolling updates when config changes—unlike static ConfigMaps that require manual pod restart.
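A minimal sketch of that generator, assuming a ConfigMap named api-config and two illustrative literals:

```yaml
# kustomization.yaml (fragment)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base
configMapGenerator:
  - name: api-config           # emitted as api-config-<content hash>
    literals:
      - LOG_LEVEL=debug
      - REQUEST_TIMEOUT_SECONDS=30
```

Kustomize rewrites references to api-config in the Deployment to the hashed name, so changing a literal changes the pod template and rolls the pods without a manual restart.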

⚠️ Pitfall

Environment variables named ENV=production inside the container image bake environment into the artifact. Build one image; inject env at deploy via Helm --set or Kustomize overlays.

Feature flags vs environment config

  • Environment config — connection strings, replica counts, ingress hosts (changes rarely, reviewed in Git)
  • Feature flags — user-facing behavior toggles (changes frequently, owned by product, no redeploy)
  • Use LaunchDarkly/Unleash/Flagsmith for flags; never conflate with Helm values unless the flag is deploy-gating (e.g. enable canary)

GitOps repository layout

Organize environment config so ArgoCD ApplicationSets can discover overlays without custom glue code.

gitops/
├── apps/
│   ├── api/
│   │   ├── base/                 # shared Deployment, Service, SA
│   │   └── overlays/
│   │       ├── dev/
│   │       ├── staging/
│   │       ├── prod-us-east/
│   │       └── prod-eu-west/
│   └── worker/
├── clusters/
│   ├── staging-eks/              # cluster-specific bootstrap
│   └── prod-eks/
├── policies/                     # Kyverno / OPA bundles
└── environments/
    ├── staging.yaml              # ArgoCD ApplicationSet generator input
    └── production.yaml
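With this layout, a Git directory generator is one way to let an ApplicationSet discover overlays. The sketch below assumes a repository at https://github.com/org/gitops.git (a hypothetical URL) and, for brevity, deploys every overlay to the local cluster; a real setup would map each overlay to its own cluster.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: api-overlays             # hypothetical name
  namespace: argocd
spec:
  generators:
    - git:
        repoURL: https://github.com/org/gitops.git   # assumed repo URL
        revision: HEAD
        directories:
          - path: apps/api/overlays/*     # one Application per overlay folder
  template:
    metadata:
      name: 'api-{{path.basename}}'       # e.g. api-staging, api-prod-us-east
    spec:
      project: default
      source:
        repoURL: https://github.com/org/gitops.git
        targetRevision: HEAD
        path: '{{path}}'
      destination:
        server: https://kubernetes.default.svc       # simplification: single cluster
        namespace: 'api-{{path.basename}}'
      syncPolicy:
        automated:
          prune: true
```

Adding a new overlay folder then creates its Application on the next reconcile, and deleting the folder prunes it.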

Multi-region and data residency

Requirement | Pattern | Config implication
--- | --- | ---
EU GDPR | prod-eu-west overlay only | EU Vault path; no US log shipping
Active-active | Global load balancer + regional deploys | Separate DB per region; async replication
Active-passive DR | DR cluster warm standby | Promote DR overlay on failover runbook
Latency | Deploy closest to users | CDN for static; API in 2+ regions

Environment access control (RBAC)

Role | dev | staging | prod
--- | --- | --- | ---
Developer | read/write pods, logs | read-only + port-forward | no direct kubectl
On-call SRE | full | full | deploy + rollback, no secret read
Security | read all | read all + policy edit | read + break-glass audit
CI service account | deploy to pr-* ns | deploy on main merge | deploy via GitOps only

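The Developer column for staging (read-only plus port-forward) could be expressed with a namespaced Role and RoleBinding along these lines. The role name and the `developers` group are hypothetical; the group would come from whatever identity provider backs cluster authentication.

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: developer-readonly
  namespace: staging
rules:
  - apiGroups: [""]
    resources: ["pods", "pods/log", "services", "configmaps"]
    verbs: ["get", "list", "watch"]
  - apiGroups: ["apps"]
    resources: ["deployments", "replicasets"]
    verbs: ["get", "list", "watch"]
  - apiGroups: [""]
    resources: ["pods/portforward"]
    verbs: ["create"]          # port-forward is a create on this subresource
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: developer-readonly
  namespace: staging
subjects:
  - kind: Group
    name: developers           # hypothetical group name
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: developer-readonly
  apiGroup: rbac.authorization.k8s.io
```

Secrets are intentionally absent from the resource list, so developers can inspect workloads in staging without being able to read credentials.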
🌍 Real World

Spotify's deployment model uses a small set of long-lived environments plus dynamic staging for high-risk changes. Platform teams expose a self-service API: teams request an environment tier, Atlantis provisions namespace + quotas, and ArgoCD wires the Application—all without a ticket to ops.

Cost optimization checklist

  1. Auto-stop preview environments 72h after their last deploy (GitLab auto_stop_in) or after 72h idle (CronJob sweeper)
  2. Scale staging to zero replicas nights/weekends if no scheduled tests
  3. Right-size staging nodes—do not run prod-sized instances for 10% traffic
  4. Use spot/preemptible nodes for dev and preview tiers
  5. Tag all cloud resources with env, team, cost-center for chargeback
  6. Review monthly: count orphaned namespaces, unused PVCs, stale preview DNS records

ArgoCD ApplicationSet — preview generator

applicationset-previews.yaml
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: api-previews
  namespace: argocd
spec:
  generators:
    - pullRequest:
        github:
          owner: org
          repo: api
          labels: [preview]
        requeueAfterSeconds: 60
  template:
    metadata:
      name: 'api-pr-{{number}}'
    spec:
      project: previews
      source:
        repoURL: https://github.com/org/api.git
        targetRevision: '{{head_sha}}'
        path: deploy/overlays/preview
        helm:
          parameters:
            - name: ingress.host
              value: 'pr-{{number}}.preview.example.com'
      destination:
        server: https://kubernetes.default.svc
        namespace: 'pr-{{number}}'
      syncPolicy:
        automated:
          prune: true
          selfHeal: true