Environment Management
From local laptops to production regions—how to design environment topology, spin up ephemeral review apps on every PR, promote immutable artifacts through gated pipelines, and manage configuration without secrets in git.
Environment strategy
Environments are not copies of production—they are risk containers with different data policies, access controls, and promotion gates. The goal is predictable promotion: the same artifact tested in staging is what runs in production, with only configuration overlays changing.
Core principles
- Parity where it matters — staging mirrors prod topology (ingress, DB engine, message broker) even if scaled down
- Immutability across promotion — promote signed image digests, not rebuilt tags
- Least privilege per env — prod credentials never exist in dev CI variables
- Explicit data classification — each environment declares what data classes it may hold
- Cost-aware sprawl control — ephemeral envs auto-destroy; long-lived envs have owners and budgets
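One way to make ownership and data classification explicit is to record them as labels on the namespace, where policy tools and cost reports can read them. A minimal sketch; the `example.com/` label keys, the team name, and the classification value are hypothetical, not an established convention:

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: payments-staging
  labels:
    example.com/env: staging
    example.com/owner: payments-team       # hypothetical owning team
    example.com/data-class: anonymized     # hypothetical data classification value
```

Labels like these give an admission policy or a monthly cost review something concrete to query, rather than relying on a wiki page.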
Environment matrix
| Environment | Purpose | Typical workloads | Config source | Data policy |
|---|---|---|---|---|
| local | Developer laptop | Feature dev, unit tests | docker-compose, .env.local | None — never prod data |
| dev | Shared integration | API contract tests, smoke | K8s namespace dev-* | Synthetic fixtures only |
| staging | Pre-prod mirror | E2E, perf, security scans | Helm values-staging.yaml | Anonymized prod subset |
| uat | Business acceptance | Stakeholder sign-off | Same chart, uat overlay | Masked PII, time-limited |
| prod | Customer traffic | Live workloads | Helm values-prod.yaml + sealed secrets | Full — encrypted at rest |
| dr | Disaster recovery | Failover validation | Cross-region replica | Prod backup restore |
| sandbox | Experimentation | Spikes, PoCs | Ephemeral namespace TTL 7d | No customer data |
| perf | Load testing | Soak, spike, chaos | Scaled-down prod topology | Generated traffic data |
Naming conventions
| Pattern | Example | Rationale |
|---|---|---|
| {service}-{env} | payments-staging | Clear blast-radius boundaries in logs and metrics |
| {env}/{service} | staging/payments | GitOps folder structure maps 1:1 to ArgoCD apps |
| pr-{number} | pr-1842 | Ephemeral review apps tied to merge request lifecycle |
| {region}-{env} | us-east-prod | Multi-region deployments with regional config overlays |
flowchart LR
subgraph dev["Development"]
L[local] --> D[dev]
end
subgraph preprod["Pre-production"]
D --> S[staging]
S --> U[UAT]
end
subgraph live["Live"]
U --> P[prod]
P --> DR[dr standby]
end
PR[review app] -.->|merge| D
style P fill:#7c3aed,color:#fff
style PR fill:#38bdf8,color:#000
Shopify runs thousands of ephemeral preview environments per day. Each maps to a PR branch; DNS and TLS are automated. Destroy on merge keeps cloud spend predictable while giving reviewers a live URL.
Using production database snapshots in staging without masking creates compliance violations and lets a staging bug corrupt prod-adjacent data. Use synthetic data generators or field-level masking pipelines instead.
Full prod parity in staging doubles infrastructure cost. Compromise: match critical dependencies (auth provider, payment sandbox, feature flag service) and accept smaller replica counts.
Anti-patterns to avoid
| Anti-pattern | Why it fails | Fix |
|---|---|---|
| Snowflake prod | Staging never catches prod-only config | GitOps: same Helm chart, env-specific values only |
| Rebuild on deploy | Different bytecode than tested artifact | Promote immutable digest from registry |
| Shared staging DB | Teams block each other; flaky E2E | Namespace-per-team or ephemeral DB per review app |
| Manual env provisioning | Weeks to onboard; drift from IaC | Terraform/Atlantis or Crossplane self-service |
| Prod secrets in .env files | Leak via git, Slack, screenshots | Vault/Secrets Manager + OIDC in CI |
Decision checklist
- How many environments does your compliance framework require? (SOC2 often needs staging + prod separation)
- Can you afford full prod parity or only critical-path parity?
- Who owns each long-lived environment? (RACI: platform provisions, app team configures)
- What is your RTO/RPO for DR—does DR need warm standby or cold restore?
- Do you need regional environments (EU data residency) or single-region with regional overlays?
Ephemeral review apps
Review apps (preview environments) deploy every pull request to an isolated URL so reviewers test real behavior—not screenshots. They are short-lived, namespaced, and destroyed on merge or TTL expiry.
Lifecycle
- PR opened: CI builds the artifact and deploys to the pr-{number} namespace
- DNS/TLS: wildcard cert + ingress rule, e.g. pr-1842.preview.example.com
- Dependencies: spin up an ephemeral Postgres (CloudNativePG) or wire to shared staging services
- Comment bot: post the preview URL on the PR for designer/QA access
- PR updated: redeploy into the same namespace with a rolling update
- PR merged/closed: tear down the namespace, delete DNS, purge secrets
sequenceDiagram
participant Dev as Developer
participant CI as CI Pipeline
participant K8s as Kubernetes
participant Bot as PR Bot
Dev->>CI: Push branch / open PR
CI->>CI: Build + scan image
CI->>K8s: Helm upgrade --install pr-1842
K8s-->>CI: Ingress URL ready
CI->>Bot: Post preview link
Bot-->>Dev: Comment on PR
Dev->>K8s: QA on preview URL
Dev->>CI: Merge PR
CI->>K8s: helm uninstall pr-1842
GitHub Actions — review app deploy
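name: Review App
on:
  pull_request:
    types: [opened, synchronize, reopened, closed]
permissions:
  contents: read
  id-token: write        # OIDC federation to AWS, no long-lived keys
  pull-requests: write   # lets the workflow comment on the PR
env:
  PREVIEW_NS: pr-${{ github.event.pull_request.number }}
  PREVIEW_HOST: pr-${{ github.event.pull_request.number }}.preview.example.com
  # Assumption: the same ECR repository used in the Helm values later in this guide
  ECR_REPO: 123456789012.dkr.ecr.us-east-1.amazonaws.com/api
jobs:
  deploy:
    if: github.event.action != 'closed'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789012:role/gha-preview
          aws-region: us-east-1
      - uses: aws-actions/amazon-ecr-login@v2   # docker login to ECR before pushing
      # Assumes a kubeconfig for the preview cluster is configured at this point
      # (for example with `aws eks update-kubeconfig`); the cluster name is not given here.
      - name: Build and push
        run: |
          docker build -t "$ECR_REPO:${{ github.sha }}" .
          docker push "$ECR_REPO:${{ github.sha }}"
      - name: Deploy preview
        run: |
          helm upgrade --install "$PREVIEW_NS" ./chart \
            --namespace "$PREVIEW_NS" --create-namespace \
            --set image.tag=${{ github.sha }} \
            --set ingress.host="$PREVIEW_HOST" \
            --wait --timeout 5m
      - name: Comment preview URL
        uses: actions/github-script@v7
        with:
          script: |
            await github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: '🚀 Preview: https://${{ env.PREVIEW_HOST }}'
            })
  teardown:
    if: github.event.action == 'closed'
    runs-on: ubuntu-latest
    steps:
      # Needs the same cluster access as the deploy job before helm/kubectl can run
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789012:role/gha-preview
          aws-region: us-east-1
      - run: helm uninstall "$PREVIEW_NS" -n "$PREVIEW_NS" || true
      - run: kubectl delete namespace "$PREVIEW_NS" --wait=false --ignore-not-found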
GitLab CI — review app with environments
review:deploy:
  stage: deploy
  rules:
    - if: $CI_PIPELINE_SOURCE == "merge_request_event"
  environment:
    name: review/$CI_MERGE_REQUEST_IID
    url: https://mr-$CI_MERGE_REQUEST_IID.preview.example.com
    on_stop: review:stop
    auto_stop_in: 3 days
  script:
    - docker build -t $CI_REGISTRY_IMAGE:$CI_COMMIT_SHA .
    - docker push $CI_REGISTRY_IMAGE:$CI_COMMIT_SHA
    - >
      helm upgrade --install mr-$CI_MERGE_REQUEST_IID ./chart
      --set image.tag=$CI_COMMIT_SHA
      --namespace mr-$CI_MERGE_REQUEST_IID --create-namespace
review:stop:
  stage: deploy
  variables:
    GIT_STRATEGY: none   # the source branch may already be deleted when this job runs
  rules:
    # GitLab runs this job through on_stop when the merge request is merged or closed,
    # or when auto_stop_in expires; it can also be started manually.
    - if: $CI_PIPELINE_SOURCE == "merge_request_event"
      when: manual
      allow_failure: true   # keeps the manual job from blocking the pipeline
  environment:
    name: review/$CI_MERGE_REQUEST_IID
    action: stop
  script:
    - helm uninstall mr-$CI_MERGE_REQUEST_IID -n mr-$CI_MERGE_REQUEST_IID
    - kubectl delete namespace mr-$CI_MERGE_REQUEST_IID
Cost and resource controls
| Control | Implementation | Impact |
|---|---|---|
| TTL auto-destroy | GitLab auto_stop_in, CronJob namespace sweeper | Prevents zombie previews billing overnight |
| Resource quotas | ResourceQuota per preview namespace | Caps CPU/memory per PR |
| LimitRange | Default container requests/limits | Stops runaway Java heaps in previews |
| Shared dependencies | Point previews at staging Redis/Postgres | Faster spin-up; isolation trade-off |
| Scale-to-zero | Knative or KEDA off-hours (plain HPA minReplicas=0 requires the alpha HPAScaleToZero feature gate) | Save compute on idle previews |
| Max concurrent previews | Queue deploys when > N active | Protects cluster during PR storms |
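The quota and default-limit rows above could be applied to each preview namespace with a pair of manifests like the following. This is a sketch only; the numbers and the object names are assumptions to be tuned per cluster:

```yaml
# Caps the total resources one preview namespace may request
apiVersion: v1
kind: ResourceQuota
metadata:
  name: preview-quota
  namespace: pr-1842
spec:
  hard:
    requests.cpu: "2"
    requests.memory: 4Gi
    limits.cpu: "4"
    limits.memory: 8Gi
    pods: "20"
---
# Supplies defaults for containers that do not declare requests/limits
apiVersion: v1
kind: LimitRange
metadata:
  name: preview-defaults
  namespace: pr-1842
spec:
  limits:
    - type: Container
      defaultRequest:
        cpu: 100m
        memory: 128Mi
      default:
        cpu: 500m
        memory: 512Mi
```

Pairing the two matters: once a ResourceQuota covers CPU or memory, pods without requests and limits are rejected unless a LimitRange fills in defaults.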
Use ArgoCD ApplicationSet with pull request generator to declaratively manage preview apps. The generator watches PR labels and creates/deletes Applications automatically—no custom teardown scripts.
Preview URLs are often public or semi-public. Require SSO (OAuth2 proxy), IP allowlists for internal tools, or magic-link auth. Never expose admin panels on preview without authentication.
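With ingress-nginx, SSO in front of a preview is commonly wired through its external-auth annotations. A hedged sketch, assuming an oauth2-proxy instance is already running at the hypothetical host `oauth2.preview.example.com` and that the backing Service is named `api` on port 80:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: api
  namespace: pr-1842
  annotations:
    # Every request is checked against oauth2-proxy before reaching the preview
    nginx.ingress.kubernetes.io/auth-url: "https://oauth2.preview.example.com/oauth2/auth"
    # Unauthenticated users are redirected to sign in, then sent back to the preview URL
    nginx.ingress.kubernetes.io/auth-signin: "https://oauth2.preview.example.com/oauth2/start?rd=https://$host$escaped_request_uri"
spec:
  ingressClassName: nginx
  rules:
    - host: pr-1842.preview.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: api          # hypothetical Service name
                port:
                  number: 80       # hypothetical port
```

For a shared proxy to cover every `pr-*` host, its cookie domain and allowed redirect domains would need to include `.preview.example.com`; that configuration lives on the oauth2-proxy side and is not shown here.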
Database strategies for previews
| Strategy | Pros | Cons |
|---|---|---|
| Ephemeral DB per PR | Full isolation, schema migration tests | Slow spin-up (30–90s), storage cost |
| Shared staging DB + schema prefix | Fast, cheap | Cross-PR pollution risk |
| DB branch (Neon, PlanetScale) | Instant copy, git-like workflow | Vendor lock-in, network egress |
| Seed fixtures only | Predictable test data | Misses migration edge cases |
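For the ephemeral-DB-per-PR row, a CloudNativePG cluster scoped to the preview namespace might look like this. A minimal sketch, assuming the CloudNativePG operator is installed; the cluster name, database name, and storage size are illustrative:

```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: preview-db
  namespace: pr-1842
spec:
  instances: 1        # a single instance; previews do not need HA
  storage:
    size: 1Gi         # small volume, deleted with the namespace
  bootstrap:
    initdb:
      database: app   # hypothetical application database
      owner: app
```

Because the cluster lives inside the preview namespace, deleting the namespace on merge also removes the database and its volume claim, so no separate teardown step is needed.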
Promotion pipelines
Promotion moves a verified immutable artifact through environments—not source code branches. The staging deploy and prod deploy reference the same image digest; only configuration overlays and approval gates differ.
Promotion flow
flowchart TD
A[Build + test + scan] --> B[Push to registry]
B --> C{Scan pass?}
C -->|no| X[Block pipeline]
C -->|yes| D[Sign with Cosign]
D --> E[Deploy staging]
E --> F{E2E + perf OK?}
F -->|no| X
F -->|yes| G[Manual/auto approval]
G --> H[Deploy prod same digest]
H --> I[Deployment marker in APM]
Registry promotion models
| Model | How it works | Best for |
|---|---|---|
| Tag promotion | Retag sha-abc → staging → prod | Simple; same registry repo |
| Repo promotion | Copy image to prod-registry (crane copy) | Air-gapped prod, separate accounts |
| Digest-only GitOps | Git commit updates digest in values-prod.yaml | ArgoCD/Flux audit trail |
| Artifact hub | Harbor replication rules between projects | Multi-region, geo-fenced prod |
| Binary promotion | JFrog/Xray promote after scan gate | Enterprise artifact governance |
GitHub Actions — promote to production
name: Promote to Production
on:
  workflow_dispatch:
    inputs:
      image_digest:
        description: 'sha256 digest verified in staging'
        required: true
      release_notes:
        required: true
permissions:
  contents: write   # the promote job pushes a commit
env:
  # Pass the input through an env var instead of interpolating it into shell code
  IMAGE_DIGEST: ${{ inputs.image_digest }}
jobs:
  verify-staging:
    runs-on: ubuntu-latest
    steps:
      # Assumes cluster credentials (kubeconfig) are configured before this step
      - name: Confirm digest deployed in staging
        run: |
          DEPLOYED=$(kubectl -n staging get deploy api -o jsonpath='{.spec.template.spec.containers[0].image}')
          echo "Staging runs: $DEPLOYED"
          [[ "$DEPLOYED" == *"$IMAGE_DIGEST"* ]] || exit 1
  promote:
    needs: verify-staging
    runs-on: ubuntu-latest
    environment: production   # required reviewers on this environment act as the approval gate
    steps:
      - uses: actions/checkout@v4
      - name: Update prod manifest
        run: |
          yq -i '.image.digest = strenv(IMAGE_DIGEST)' deploy/overlays/prod/values.yaml
          git config user.name "github-actions[bot]"
          git config user.email "41898282+github-actions[bot]@users.noreply.github.com"
          git commit -am "promote: $IMAGE_DIGEST"
          git push
      - name: Wait for ArgoCD sync
        run: argocd app wait api-prod --health --timeout 600
GitLab — environment-scoped deploy with protected environments
stages: [build, staging, prod]
build:
  stage: build
  script:
    - docker build -t $CI_REGISTRY_IMAGE:$CI_COMMIT_SHA .
    - docker push $CI_REGISTRY_IMAGE:$CI_COMMIT_SHA
    - cosign sign --yes $CI_REGISTRY_IMAGE:$CI_COMMIT_SHA
  rules:
    - if: $CI_COMMIT_BRANCH == $CI_DEFAULT_BRANCH
deploy:staging:
  stage: staging
  environment:
    name: staging
    url: https://staging.example.com
  script:
    - helm upgrade api ./chart --set image.tag=$CI_COMMIT_SHA -n staging
  needs: [build]
  rules:
    # Same rule as build, so this job never runs in a pipeline where build is absent
    - if: $CI_COMMIT_BRANCH == $CI_DEFAULT_BRANCH
deploy:production:
  stage: prod
  environment:
    name: production   # mark this environment as protected to restrict who can run the job
    url: https://api.example.com
  script:
    - helm upgrade api ./chart --set image.tag=$CI_COMMIT_SHA -n production
  needs: ["deploy:staging"]
  rules:
    - if: $CI_COMMIT_BRANCH == $CI_DEFAULT_BRANCH
      when: manual
Approval gates and compliance
| Gate | Tooling | Audit artifact |
|---|---|---|
| Security scan pass | Trivy, Grype SARIF in PR | SARIF report + waiver ticket |
| Change advisory board | ServiceNow/Jira change request | CHG ticket linked in deploy PR |
| Four-eyes approval | GitHub environment reviewers | Actions deployment log |
| Policy admission | Kyverno verifyImages | ClusterPolicy violation events |
| Rollback readiness | Previous digest pinned in Git | Git revert commit |
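The policy-admission gate in the table could be expressed as a Kyverno ClusterPolicy along these lines. This is a sketch under assumptions: the images are signed with a Cosign key pair (the public key below is a placeholder), the policy name is hypothetical, and field names should be checked against the Kyverno version in use, since newer releases move the failure action to rule level:

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: verify-api-image-signature     # hypothetical policy name
spec:
  validationFailureAction: Enforce     # reject, rather than only audit, unsigned images
  webhookTimeoutSeconds: 30
  rules:
    - name: require-cosign-signature
      match:
        any:
          - resources:
              kinds: [Pod]
              namespaces: [production]
      verifyImages:
        - imageReferences:
            - "123456789012.dkr.ecr.us-east-1.amazonaws.com/api*"
          attestors:
            - entries:
                - keys:
                    publicKeys: |-
                      -----BEGIN PUBLIC KEY-----
                      <placeholder: Cosign public key>
                      -----END PUBLIC KEY-----
```

A pipeline that signs keylessly, without a long-lived key, would replace the `keys` entry with a keyless attestor naming the expected issuer and subject.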
Explain promotion as moving trust, not code: build once, sign, test in staging with production-like config, then deploy the exact digest to prod. Mention Cosign verify in admission controller and GitOps for audit trail.
Retagging in a registry does not copy layers—it updates a pointer. Digest sha256:abc... is content-addressable: same bytes, same digest, anywhere. This is why digest promotion beats branch-based deploys.
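For a digest written into the values file to take effect, the chart has to render it into the image reference. One possible shape for the Deployment template, assuming the chart exposes `image.repository`, `image.tag`, `image.pullPolicy`, and an optional `image.digest` value (the container name is illustrative):

```yaml
# templates/deployment.yaml (excerpt)
containers:
  - name: api
    {{- if .Values.image.digest }}
    # Pinned by content: repository@sha256:...
    image: "{{ .Values.image.repository }}@{{ .Values.image.digest }}"
    {{- else }}
    # Fallback for previews and dev, where a mutable tag is acceptable
    image: "{{ .Values.image.repository }}:{{ .Values.image.tag }}"
    {{- end }}
    imagePullPolicy: {{ .Values.image.pullPolicy }}
```

With this in place, staging and production can both set `image.digest` to the same value, and the tag becomes a convenience for environments that are not part of the promotion chain.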
Rollback strategy
- Git revert: revert the promotion commit; ArgoCD syncs the previous digest (fastest for GitOps)
- Helm rollback: helm rollback api 42 restores a previous release revision
- Flag flip: disable the feature in LaunchDarkly while the digest rollback proceeds
- Database: forward-only migrations require a compatibility window; never auto-rollback schema
Configuration per environment
Twelve-Factor App rule III: store config in the environment. In practice that means non-secret config in Git (Helm values, Kustomize overlays) and secrets in a vault injected at deploy time—never committed, never baked into images.
Configuration layers
| Layer | Storage | Examples | Rotation |
|---|---|---|---|
| Build-time | CI variables (non-secret) | Node version, registry URL | Per pipeline |
| App config | ConfigMap / values.yaml | Log level, feature toggles, timeouts | Git PR |
| Secrets | Vault, AWS SM, Sealed Secrets | DB password, API keys | Automated rotation |
| Runtime discovery | K8s downward API, service DNS | Pod name, namespace | N/A |
| Identity | OIDC / IRSA / Workload Identity | AWS role ARN bound to SA | Short-lived tokens |
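The identity layer in the last row is typically just an annotation on the workload's ServiceAccount. A minimal IRSA sketch for EKS; the role name `api-prod` is hypothetical and the IAM role's trust policy (not shown) must allow this ServiceAccount:

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: api
  namespace: production
  annotations:
    # Pods using this ServiceAccount receive short-lived credentials for this role
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/api-prod   # hypothetical role
```

Since each environment binds its own role, a staging pod cannot assume the production role even if it runs the same image.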
Helm values hierarchy
# values.yaml (chart defaults)
replicaCount: 2
image:
  repository: 123456789012.dkr.ecr.us-east-1.amazonaws.com/api
  tag: ""   # set by CI
  pullPolicy: IfNotPresent
resources:
  requests:
    cpu: 100m
    memory: 256Mi
  limits:
    cpu: 500m
    memory: 512Mi
env:
  LOG_LEVEL: info
  OTEL_EXPORTER_OTLP_ENDPOINT: ""   # overlay per env

# values-staging.yaml (staging overlay)
replicaCount: 1
env:
  LOG_LEVEL: debug
  OTEL_EXPORTER_OTLP_ENDPOINT: https://otel-staging.internal:4317
  PAYMENT_API_URL: https://sandbox.payments.example.com
ingress:
  host: api-staging.example.com
  tls:
    secretName: api-staging-tls

# values-prod.yaml (production overlay)
replicaCount: 6
resources:
  requests:
    cpu: 500m
    memory: 1Gi
env:
  LOG_LEVEL: warn
  OTEL_EXPORTER_OTLP_ENDPOINT: https://otel-prod.internal:4317
  PAYMENT_API_URL: https://api.payments.example.com
ingress:
  host: api.example.com
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
podDisruptionBudget:
  minAvailable: 4
Kustomize overlay structure
deploy/
├── base/
│   ├── kustomization.yaml
│   ├── deployment.yaml
│   ├── service.yaml
│   └── configmap.yaml
└── overlays/
    ├── staging/
    │   ├── kustomization.yaml   # resources: ../../base + patches
    │   ├── replica-patch.yaml
    │   └── ingress-staging.yaml
    └── production/
        ├── kustomization.yaml
        ├── hpa.yaml
        └── ingress-prod.yaml
External Secrets Operator pattern
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: api-db-credentials
  namespace: production
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: aws-secrets-manager
    kind: ClusterSecretStore
  target:
    name: api-db-credentials
    creationPolicy: Owner
  data:
    - secretKey: DATABASE_URL
      remoteRef:
        key: prod/api/database-url
    - secretKey: API_KEY
      remoteRef:
        key: prod/api/payment-key
        property: current
Config drift detection
| Tool | What it compares | When it runs |
|---|---|---|
| ArgoCD diff | Git desired vs cluster live | Continuous sync |
| Terraform plan | State vs cloud API | PR + scheduled |
| kubectl diff | Local manifest vs cluster | Pre-apply CI step |
| Conftest/OPA | Manifest vs policy | Admission + CI |
Use configMapGenerator in Kustomize for env-specific non-secret files. Hash suffixes trigger rolling updates when config changes—unlike static ConfigMaps that require manual pod restart.
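A staging overlay using configMapGenerator might look like the following. A sketch only: the ConfigMap name `api-config` and the `app.properties` file are hypothetical, and the patch file name is taken from the overlay layout in this section:

```yaml
# deploy/overlays/staging/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: staging
resources:
  - ../../base
configMapGenerator:
  - name: api-config            # emitted as api-config-<content hash>
    literals:
      - LOG_LEVEL=debug
    files:
      - app.properties          # hypothetical env-specific config file
patches:
  - path: replica-patch.yaml
```

Kustomize rewrites references to `api-config` in the Deployment to the hashed name, so any change to the literals or the file alters the pod template and triggers a rollout.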
Environment variables named ENV=production inside the container image bake environment into the artifact. Build one image; inject env at deploy via Helm --set or Kustomize overlays.
Feature flags vs environment config
- Environment config — connection strings, replica counts, ingress hosts (changes rarely, reviewed in Git)
- Feature flags — user-facing behavior toggles (changes frequently, owned by product, no redeploy)
- Use LaunchDarkly/Unleash/Flagsmith for flags; never conflate with Helm values unless the flag is deploy-gating (e.g. enable canary)
GitOps repository layout
Organize environment config so ArgoCD ApplicationSets can discover overlays without custom glue code.
gitops/
├── apps/
│   ├── api/
│   │   ├── base/               # shared Deployment, Service, SA
│   │   └── overlays/
│   │       ├── dev/
│   │       ├── staging/
│   │       ├── prod-us-east/
│   │       └── prod-eu-west/
│   └── worker/
├── clusters/
│   ├── staging-eks/            # cluster-specific bootstrap
│   └── prod-eks/
├── policies/                   # Kyverno / OPA bundles
└── environments/
    ├── staging.yaml            # ArgoCD ApplicationSet generator input
    └── production.yaml
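One way an ApplicationSet could discover the overlays in this layout is the Git directory generator, which creates one Application per matching folder. A hedged sketch; the repository URL and ApplicationSet name are hypothetical, and it deploys everything to the local cluster for simplicity, whereas a real setup would route prod overlays to their own clusters:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: apps-by-overlay
  namespace: argocd
spec:
  generators:
    - git:
        repoURL: https://github.com/org/gitops.git   # hypothetical repo
        revision: HEAD
        directories:
          - path: apps/*/overlays/*
  template:
    metadata:
      # For apps/api/overlays/staging: path[1] is "api", path.basename is "staging"
      name: '{{path[1]}}-{{path.basename}}'
    spec:
      project: default
      source:
        repoURL: https://github.com/org/gitops.git
        targetRevision: HEAD
        path: '{{path}}'
      destination:
        server: https://kubernetes.default.svc
        namespace: '{{path[1]}}-{{path.basename}}'
      syncPolicy:
        syncOptions:
          - CreateNamespace=true
```

Adding a new overlay folder is then enough to get a new Application; no per-environment Application manifest has to be written by hand.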
Multi-region and data residency
| Requirement | Pattern | Config implication |
|---|---|---|
| EU GDPR | prod-eu-west overlay only | EU Vault path; no US log shipping |
| Active-active | Global load balancer + regional deploys | Separate DB per region; async replication |
| Active-passive DR | DR cluster warm standby | Promote DR overlay on failover runbook |
| Latency | Deploy closest to users | CDN for static; API in 2+ regions |
Environment access control (RBAC)
| Role | dev | staging | prod |
|---|---|---|---|
| Developer | read/write pods, logs | read-only + port-forward | no direct kubectl |
| On-call SRE | full | full | deploy + rollback, no secret read |
| Security | read all | read all + policy edit | read + break-glass audit |
| CI service account | deploy to pr-* ns | deploy on main merge | deploy via GitOps only |
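The Developer/staging cell (read-only plus port-forward) maps onto a namespaced Role and RoleBinding. A minimal sketch, assuming developers arrive from the identity provider as a group named `developers` (hypothetical):

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: developer-readonly
  namespace: staging
rules:
  # Read-only access to workloads and logs; Secrets are deliberately not listed
  - apiGroups: [""]
    resources: ["pods", "pods/log", "services", "configmaps"]
    verbs: ["get", "list", "watch"]
  - apiGroups: ["apps"]
    resources: ["deployments", "replicasets"]
    verbs: ["get", "list", "watch"]
  # Port-forwarding is a create on the portforward subresource
  - apiGroups: [""]
    resources: ["pods/portforward"]
    verbs: ["create"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: developer-readonly
  namespace: staging
subjects:
  - kind: Group
    name: developers                 # hypothetical IdP group
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: developer-readonly
  apiGroup: rbac.authorization.k8s.io
```

Using a Role rather than a ClusterRole keeps the grant scoped to staging, so the production column of the matrix can stay at "no direct kubectl" simply by not creating a binding there.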
Spotify's deployment model uses a small set of long-lived environments plus dynamic staging for high-risk changes. Platform teams expose a self-service API: teams request an environment tier, Atlantis provisions namespace + quotas, and ArgoCD wires the Application—all without a ticket to ops.
Cost optimization checklist
- Auto-stop preview environments after 72h idle (GitLab auto_stop_in or CronJob)
- Scale staging to zero replicas nights/weekends if no scheduled tests
- Right-size staging nodes; do not run prod-sized instances for 10% traffic
- Use spot/preemptible nodes for dev and preview tiers
- Tag all cloud resources with env, team, cost-center for chargeback
- Review monthly: count orphaned namespaces, unused PVCs, stale preview DNS records
ArgoCD ApplicationSet — preview generator
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: api-previews
  namespace: argocd
spec:
  generators:
    - pullRequest:
        github:
          owner: org
          repo: api
          labels: [preview]        # only PRs carrying this label get a preview
        requeueAfterSeconds: 60    # poll interval for PR changes
  template:
    metadata:
      name: 'api-pr-{{number}}'
    spec:
      project: previews
      source:
        repoURL: https://github.com/org/api.git
        targetRevision: '{{head_sha}}'
        path: deploy/overlays/preview
        helm:
          parameters:
            - name: ingress.host
              value: 'pr-{{number}}.preview.example.com'
      destination:
        server: https://kubernetes.default.svc
        namespace: 'pr-{{number}}'
      syncPolicy:
        automated:
          prune: true
          selfHeal: true
        syncOptions:
          - CreateNamespace=true   # the per-PR namespace does not exist beforehand