Deployment Strategies
Shipping software is not flipping a switch—it is a sequence of reversible decisions. This chapter covers Kubernetes rollout primitives, feature flags, progressive delivery with metric gates, safe database migrations, release promotion across environments, and multi-layer rollback playbooks that keep change failure rate low.
Deployment Strategies
A deployment strategy answers one question: how do we replace running software with a new version without taking the service down, and without betting the business on a single atomic switch? Kubernetes-native primitives (RollingUpdate, Recreate) cover the basics; production teams layer blue/green, canary, and GitOps promotion on top.
Strategy comparison
| Strategy | Downtime | Rollback speed | Resource cost | Best for |
|---|---|---|---|---|
| Rolling update | None (if probes correct) | Minutes (roll back image tag) | Low (at most maxSurge extra pods) | Stateless APIs, default K8s path |
| Recreate | Yes — all pods terminated first | Fast redeploy | Lowest | Dev/staging, jobs that cannot run two versions |
| Blue/green | None at switch | Instant (flip Service selector) | 2× capacity during cutover | Regulated releases, schema-compatible versions |
| Canary | None | Traffic shift or image revert | Partial duplicate stack | High-traffic services with metric gates |
| Feature flags | None | Toggle off (seconds) | SDK + flag service | Long-lived branches, A/B experiments |
Kubernetes RollingUpdate mechanics
The Deployment controller owns ReplicaSets: one active, with older ones scaled to zero and kept as rollback history. spec.strategy.type: RollingUpdate with maxUnavailable and maxSurge controls how aggressively pods are replaced. The kubelet does not know about rollouts; it only starts and stops containers for the pods bound to its node, and those pods are created and deleted by the ReplicaSet controller.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payments-api
  namespace: production
  labels:
    app: payments-api
    track: stable
spec:
  replicas: 6
  revisionHistoryLimit: 10
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
      maxSurge: 2
  selector:
    matchLabels:
      app: payments-api
  template:
    metadata:
      labels:
        app: payments-api
        version: "2.8.1"
    spec:
      terminationGracePeriodSeconds: 45
      containers:
        - name: api
          image: registry.example.com/payments-api@sha256:abc123...
          readinessProbe:
            httpGet:
              path: /readyz
              port: 8080
            periodSeconds: 5
            failureThreshold: 3
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh", "-c", "sleep 15"]
sequenceDiagram
participant Dev as Developer
participant CI as CI pipeline
participant Reg as Container registry
participant API as kube-apiserver
participant RS as ReplicaSet controller
participant KL as kubelet
Dev->>CI: merge to main
CI->>Reg: push image digest
CI->>API: kubectl set image / GitOps sync
API->>RS: new ReplicaSet created
RS->>KL: create pod v2 (surge)
KL-->>RS: readiness OK
RS->>KL: terminate pod v1
RS-->>Dev: rollout status complete
The Deployment controller compares desired replicas from the new ReplicaSet template against observed ready replicas. If readiness probes lie—returning 200 before the app can serve traffic—RollingUpdate will drain healthy v1 pods while v2 pods accept traffic they cannot handle. Readiness gates rollout safety.
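Percentage values resolve against the replica count: maxUnavailable rounds down and maxSurge rounds up. Setting maxUnavailable: 50% on a two-replica Deployment therefore lets the controller take one of the two pods out of service, so a bad image push leaves the rollout stalled on a single serving pod at half capacity; maxUnavailable: 100% would allow both pods to terminate at once. Keep surge and unavailable values conservative, and add a PodDisruptionBudget so that a node drain during the rollout cannot evict the remaining pods. The Deployment controller itself does not consult PodDisruptionBudgets; they only limit voluntary evictions.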
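As a rough check on how these two settings bound a rollout, the sketch below resolves maxUnavailable and maxSurge against a replica count, assuming the documented rounding (maxUnavailable percentages round down, maxSurge percentages round up). The function names are illustrative and are not part of any Kubernetes client library.

```python
def resolve(value, replicas, round_up):
    """Resolve an absolute count or a percentage string such as "25%"."""
    if isinstance(value, str) and value.endswith("%"):
        percent = int(value[:-1])
        scaled = replicas * percent
        # Integer arithmetic avoids floating-point rounding surprises.
        return -(-scaled // 100) if round_up else scaled // 100
    return int(value)


def rolling_bounds(replicas, max_unavailable="25%", max_surge="25%"):
    """Return the availability floor and pod-count ceiling during a rollout."""
    unavailable = resolve(max_unavailable, replicas, round_up=False)
    surge = resolve(max_surge, replicas, round_up=True)
    if unavailable == 0 and surge == 0:
        # The rollout could never progress; assume one pod may be unavailable.
        unavailable = 1
    return {
        "min_available": replicas - unavailable,
        "max_total": replicas + surge,
    }


if __name__ == "__main__":
    print(rolling_bounds(6, 1, 2))
    print(rolling_bounds(2, "50%", "25%"))
```

For the payments-api manifest above (6 replicas, maxUnavailable: 1, maxSurge: 2) this gives a floor of 5 available pods and a ceiling of 8 pods in total. For two replicas at 50% the floor is a single pod.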
Blue/green on Kubernetes
Run two Deployments (payments-api-blue and payments-api-green), each labelling its pods with a color, behind one Service. Cutover is a label selector patch or an Ingress weight change. Rollback is flipping back; the old stack stays warm until you decommission it.
# GitHub Actions workflow
name: Blue-green deploy
on:
  workflow_dispatch:
    inputs:
      color:
        description: Target stack (blue|green)
        required: true
jobs:
  deploy:
    runs-on: ubuntu-latest
    environment: production
    steps:
      - uses: actions/checkout@v4
      - name: Deploy target color
        run: |
          # -n production added so the image update targets the same namespace as the status check
          kubectl set image deployment/payments-api-${{ inputs.color }} -n production \
            api=registry.example.com/payments-api:${{ github.sha }}
          kubectl rollout status deployment/payments-api-${{ inputs.color }} -n production
      - name: Switch Service selector
        run: |
          kubectl patch service payments-api -n production -p \
            '{"spec":{"selector":{"color":"${{ inputs.color }}"}}}'
# GitLab CI jobs
deploy-blue:
  stage: deploy
  environment: production
  script:
    - kubectl set image deployment/payments-api-blue -n production api=$CI_REGISTRY_IMAGE:$CI_COMMIT_SHA
    - kubectl rollout status deployment/payments-api-blue -n production
  when: manual
switch-to-blue:
  stage: deploy
  environment: production
  script:
    - kubectl patch service payments-api -n production -p '{"spec":{"selector":{"color":"blue"}}}'
  needs: [deploy-blue]
  when: manual
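Because the target color is free-text pipeline input that ends up inside the Service selector, a typo would point the Service at no pods at all. A minimal sketch of validating the color and building the patch body before calling kubectl is shown here; the helper names are hypothetical, and the patch shape simply mirrors the JSON used in the pipelines above.

```python
import json

COLORS = ("blue", "green")


def idle_color(live_color):
    """Return the stack that is not currently receiving traffic."""
    if live_color not in COLORS:
        raise ValueError(f"unknown color: {live_color!r}")
    return "green" if live_color == "blue" else "blue"


def selector_patch(color):
    """Build the Service patch body that routes traffic to one color."""
    if color not in COLORS:
        raise ValueError(f"unknown color: {color!r}")
    return json.dumps({"spec": {"selector": {"color": color}}})


if __name__ == "__main__":
    # Cut over from blue to green; rollback is the same call with "blue".
    print(selector_patch(idle_color("blue")))
```

The same function serves cutover and rollback, which keeps the two paths symmetric and equally well exercised.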
Feature Flags & Decoupled Release
Deploying code and releasing features are different events. Feature flags let you ship dark code to production, enable it for internal testers, then ramp percentage—without a second deployment. Flags complement canary traffic splitting when behavior is gated in application logic.
Flag types
| Type | Purpose | Rollback |
|---|---|---|
| Release toggle | Hide incomplete features until ready | Disable flag — no redeploy |
| Ops kill switch | Disable expensive code path under load | Instant — SRE playbook |
| Experiment | A/B test conversion metrics | Revert cohort assignment |
| Permission | Entitlement / plan gating | Config change in flag service |
Architecture
The SDK in the application polls or streams flag state from LaunchDarkly, Unleash, Flagsmith, or a ConfigMap-backed open-source server. Evaluate flags server-side for security-sensitive behavior; client-side flags leak intent in JavaScript bundles.
apiVersion: v1
kind: ConfigMap
metadata:
  name: unleash-features
  namespace: platform
data:
  new-checkout-flow.json: |
    {
      "name": "new-checkout-flow",
      "description": "Stripe Elements v2 checkout",
      "type": "release",
      "enabled": false,
      "strategies": [
        {
          "name": "flexibleRollout",
          "parameters": { "rollout": "25", "stickiness": "userId" }
        }
      ]
    }
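To illustrate what a percentage rollout with userId stickiness means in practice, here is a minimal sketch of sticky bucketing: each user is hashed into a bucket from 1 to 100, and the flag is on when the bucket falls within the rollout percentage. It uses SHA-256 purely for illustration and is not the algorithm of Unleash or any other flag service; the function names are assumptions.

```python
import hashlib


def bucket(flag_name, user_id):
    """Map a (flag, user) pair to a stable bucket in the range 1..100."""
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100 + 1


def is_enabled(flag_name, user_id, rollout_percent, enabled=True):
    """Evaluate a percentage rollout; a disabled flag is off for everyone."""
    if not enabled:
        return False
    return bucket(flag_name, user_id) <= rollout_percent


if __name__ == "__main__":
    users = [f"user-{i}" for i in range(10_000)]
    on = sum(is_enabled("new-checkout-flow", u, 25) for u in users)
    print(f"{on / len(users):.1%} of users see the new checkout flow")
```

Because the bucket depends only on the flag name and user ID, a user who is enabled at 25% stays enabled when the rollout is raised to 50%, which is the point of stickiness.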
Flags vs canary traffic
Canary routes HTTP requests to different pod versions—whole binary changes. Feature flags route logic inside one binary. Use both: canary validates infra and latency; flags validate product behavior per user cohort.
# GitHub Actions workflow
name: Deploy with flag default-off
on:
  push:
    branches: [main]
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Build and push
        run: |
          docker build -t registry.example.com/checkout:${{ github.sha }} .
          docker push registry.example.com/checkout:${{ github.sha }}
      - name: Sync deployment
        run: kubectl set image deployment/checkout checkout=registry.example.com/checkout:${{ github.sha }}
      - name: Ensure flag disabled
        env:
          # UNLEASH_URL is assumed to be defined as a repository or organization variable
          UNLEASH_URL: ${{ vars.UNLEASH_URL }}
          UNLEASH_TOKEN: ${{ secrets.UNLEASH_ADMIN_TOKEN }}
        run: |
          curl -sf -X POST "$UNLEASH_URL/api/admin/projects/default/features/new-checkout-flow/environments/production/off" \
            -H "Authorization: $UNLEASH_TOKEN"
# GitLab CI job
deploy-checkout:
  stage: deploy
  script:
    - docker build -t $CI_REGISTRY_IMAGE:$CI_COMMIT_SHA .
    - docker push $CI_REGISTRY_IMAGE:$CI_COMMIT_SHA
    - kubectl set image deployment/checkout checkout=$CI_REGISTRY_IMAGE:$CI_COMMIT_SHA
    - |
      curl -sf -X POST "$UNLEASH_URL/api/admin/projects/default/features/new-checkout-flow/environments/production/off" \
        -H "Authorization: $UNLEASH_ADMIN_TOKEN"
  environment: production
Every flag needs an owner and expiry date. Permanent flags become undeletable conditional branches—exactly the complexity trunk-based development tries to eliminate.
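One way to enforce the owner-and-expiry rule is a small check that runs in CI against a flag inventory. The sketch below assumes a hypothetical inventory format (a list of dicts with name, owner, and an ISO-8601 expires date); real flag services expose this metadata differently.

```python
from datetime import date


def flags_needing_attention(flags, today):
    """Return (flag name, reason) pairs for flags that break the hygiene rule."""
    problems = []
    for flag in flags:
        if not flag.get("owner"):
            problems.append((flag["name"], "no owner"))
        expires = flag.get("expires")
        if expires is None:
            problems.append((flag["name"], "no expiry date"))
        elif date.fromisoformat(expires) < today:
            problems.append((flag["name"], "expired"))
    return problems


if __name__ == "__main__":
    inventory = [
        {"name": "new-checkout-flow", "owner": "payments", "expires": "2024-06-30"},
        {"name": "legacy-search", "owner": "", "expires": None},
    ]
    for name, reason in flags_needing_attention(inventory, date(2024, 7, 1)):
        print(f"{name}: {reason}")
```

Failing the pipeline on a non-empty result turns flag cleanup from a good intention into a gate.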
Managed flag services add cost and network dependency. ConfigMap-backed flags are free but lack real-time analytics and sticky cohort targeting at scale.
Progressive Delivery & Metric Gates
Progressive delivery extends continuous deployment with automated promotion gates—error rate, latency p99, business KPIs—before increasing traffic share. Argo Rollouts, Flagger, and service mesh traffic splitting implement this on Kubernetes.
Argo Rollouts canary
Argo Rollouts replaces Deployment for fine-grained steps: 10% → 30% → 60% → 100%, pausing for AnalysisRun success between steps. Failed analysis triggers automatic rollback.
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: payments-api
  namespace: production
spec:
  replicas: 10
  strategy:
    canary:
      steps:
        - setWeight: 10
        - pause: { duration: 5m }
        - analysis:
            templates:
              - templateName: success-rate
        - setWeight: 30
        - pause: { duration: 5m }
        - setWeight: 60
        - pause: { duration: 10m }
        - setWeight: 100
      trafficRouting:
        istio:
          virtualService:
            name: payments-api
            routes:
              - primary
  selector:
    matchLabels:
      app: payments-api
  template:
    metadata:
      labels:
        app: payments-api
    spec:
      containers:
        - name: api
          image: registry.example.com/payments-api:2.9.0
---
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
  namespace: production
spec:
  metrics:
    - name: error-rate
      interval: 1m
      count: 5
      successCondition: result[0] < 0.01
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            sum(rate(http_requests_total{app="payments-api",status=~"5.."}[2m]))
            / sum(rate(http_requests_total{app="payments-api"}[2m]))
flowchart LR
CI["CI builds digest"] --> Sync["GitOps sync"]
Sync --> R10["10% canary"]
R10 --> A1{"Analysis OK?"}
A1 -->|yes| R30["30% traffic"]
A1 -->|no| RB["Auto rollback"]
R30 --> A2{"Analysis OK?"}
A2 -->|yes| R100["100% promote"]
A2 -->|no| RB
# GitHub Actions workflow
name: Trigger progressive rollout
on:
  workflow_run:
    workflows: [CI]
    types: [completed]
jobs:
  promote:
    if: ${{ github.event.workflow_run.conclusion == 'success' }}
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Update image in GitOps repo
        run: |
          yq -i '.spec.template.spec.containers[0].image = "registry.example.com/api:${{ github.event.workflow_run.head_sha }}"' \
            deploy/overlays/production/rollout.yaml
          # Placeholder bot identity; the original address was lost to email obfuscation
          git config user.name "ci-bot"
          git config user.email "ci-bot@example.com"
          git commit -am "promote ${{ github.event.workflow_run.head_sha }}"
          git push
# GitLab CI job
promote-production:
  stage: deploy
  image: bitnami/git:latest
  script:
    - git clone $GITOPS_REPO gitops && cd gitops
    # strenv() reads the variables as plain strings; env() would parse them as YAML
    - yq -i '.spec.template.spec.containers[0].image = strenv(CI_REGISTRY_IMAGE) + ":" + strenv(CI_COMMIT_SHA)' deploy/production/rollout.yaml
    - git commit -am "promote $CI_COMMIT_SHA"
    - git push origin main
  rules:
    - if: $CI_COMMIT_BRANCH == "main"
  environment: production
Progressive delivery does not replace policy admission. Images must pass Cosign verification and Kyverno digest-pin policies before the Rollout controller creates canary pods. Security gates apply at deploy time; metric gates apply at traffic-shift time.
Teams that skip analysis templates and manually promote after a 5-minute pause are doing timed canary, not progressive delivery. Wire at least error rate and saturation metrics—or you are guessing.
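The logic of an error-rate gate can be sketched in a few lines: take a series of measurements, count how many violate the threshold, and decide whether to promote or roll back. This is a simplified model written for illustration, not the Argo Rollouts implementation. The failure_limit parameter and the decision to treat a zero-traffic window as inconclusive, rather than as a pass or a failure, are assumptions of this sketch.

```python
import math


def error_rate(errors_per_second, requests_per_second):
    """Ratio of 5xx responses to all responses; NaN when there is no traffic."""
    if requests_per_second == 0:
        return math.nan
    return errors_per_second / requests_per_second


def evaluate_gate(measurements, threshold=0.01, failure_limit=0):
    """Return "promote", "rollback", or "inconclusive" for a canary step."""
    failures = 0
    for value in measurements:
        if math.isnan(value):
            # No traffic reached the canary, so the metric proves nothing.
            return "inconclusive"
        if not value < threshold:
            failures += 1
            if failures > failure_limit:
                return "rollback"
    return "promote"


if __name__ == "__main__":
    samples = [error_rate(e, 200.0) for e in (0.2, 0.4, 3.0, 0.1, 0.2)]
    print(evaluate_gate(samples))
```

The zero-traffic case is worth deciding explicitly: a canary that receives no requests has an undefined error rate, and a gate that silently passes on it validates nothing.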
Database Migrations in Deploy Pipelines
Schema changes are the hardest part of zero-downtime deploys. Application rollouts assume expand/contract compatibility: new code must run against old schema, old code against new schema—during the overlap window.
Expand/contract pattern
- Expand — add nullable column or new table; deploy code that writes both old and new
- Migrate data — backfill job; dual-read validation
- Contract — remove old column after all code reads new path
Never drop a column in the same release that stops reading it. Production overlap can last hours during slow rollouts.
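The expand and migrate steps above can be tried end to end against an in-memory SQLite database. The table and column names (customers, full_name, display_name) are hypothetical, and SQLite stands in for the production database only to keep the sketch self-contained.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, full_name TEXT NOT NULL)"
)


def old_code_insert(name):
    """Old release: knows nothing about the new column."""
    conn.execute("INSERT INTO customers (full_name) VALUES (?)", (name,))


def new_code_insert(name):
    """New release: dual-writes the old and the new column."""
    conn.execute(
        "INSERT INTO customers (full_name, display_name) VALUES (?, ?)",
        (name, name),
    )


old_code_insert("Ada Lovelace")

# Expand: the new column is nullable, so old code keeps working unchanged.
conn.execute("ALTER TABLE customers ADD COLUMN display_name TEXT")

# Overlap window: both releases write to the same table.
old_code_insert("Grace Hopper")
new_code_insert("Alan Turing")

# Migrate data: backfill rows that old code wrote without the new column.
conn.execute(
    "UPDATE customers SET display_name = full_name WHERE display_name IS NULL"
)

# Dual-read validation: both columns must agree before any contract step.
mismatches = conn.execute(
    "SELECT COUNT(*) FROM customers WHERE display_name IS NOT full_name"
).fetchone()[0]
total = conn.execute("SELECT COUNT(*) FROM customers").fetchone()[0]
print(f"{total} rows, {mismatches} mismatches")

# Contract (dropping full_name) belongs in a later release, once no running
# version reads the old column.
```

If old pods are still writing after the backfill runs, new NULLs appear, so the backfill and the validation query are meant to be repeated until the old release is fully drained.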
Migration execution models
| Model | When | Risk |
|---|---|---|
| Pre-deploy Job | Before new pods start | Blocks deploy if migration fails — safe default |
| Init container | Per pod startup | Race on concurrent schema lock — use advisory locks |
| App startup | ORM auto-migrate | Forbidden in production — no audit trail |
| Manual DBA window | Large index builds | Requires maintenance window communication |
apiVersion: batch/v1
kind: Job
metadata:
  name: flyway-migrate-2-9-0
  namespace: production
  annotations:
    argocd.argoproj.io/hook: PreSync
    argocd.argoproj.io/hook-delete-policy: HookSucceeded
spec:
  backoffLimit: 0
  template:
    spec:
      restartPolicy: Never
      serviceAccountName: migrate-runner
      containers:
        - name: flyway
          image: flyway/flyway:10
          args:
            - migrate
            - -url=jdbc:postgresql://payments-db:5432/payments
            - -user=$(DB_USER)
            - -password=$(DB_PASSWORD)
            - -locations=filesystem:/sql
          envFrom:
            - secretRef:
                name: payments-db-credentials
          volumeMounts:
            - name: sql
              mountPath: /sql
      volumes:
        - name: sql
          configMap:
            name: flyway-v2-9-0
# GitHub Actions workflow
name: Migrate then deploy
on:
  push:
    branches: [main]
jobs:
  migrate:
    runs-on: ubuntu-latest
    environment: production
    steps:
      - uses: actions/checkout@v4
      - name: Run Flyway
        run: |
          kubectl delete job flyway-migrate-${{ github.sha }} -n production --ignore-not-found
          kubectl apply -f k8s/jobs/flyway-${{ github.sha }}.yaml
          kubectl wait --for=condition=complete job/flyway-migrate-${{ github.sha }} -n production --timeout=600s
  deploy:
    needs: migrate
    runs-on: ubuntu-latest
    steps:
      - name: Sync application
        run: kubectl rollout restart deployment/payments-api -n production
# GitLab CI jobs
db-migrate:
  stage: deploy
  environment: production
  script:
    - kubectl apply -f k8s/jobs/flyway-$CI_COMMIT_SHA.yaml
    - kubectl wait --for=condition=complete job/flyway-migrate-$CI_COMMIT_SHA -n production --timeout=600s
  rules:
    - if: $CI_COMMIT_BRANCH == "main"
app-deploy:
  stage: deploy
  needs: [db-migrate]
  script:
    - kubectl rollout restart deployment/payments-api -n production
  environment: production
Long-running migrations inside PreSync hooks block Argo CD sync—and every subsequent deploy. Split additive migrations (fast) from backfills (async Job) from destructive drops (separate release).
Release Management & Environments
Release management connects version semantics, change approval, environment promotion, and audit trails. GitOps makes the desired state in git the release artifact; tags and changelogs remain the human contract with stakeholders.
Promotion pipeline
Typical flow: dev (every merge) → staging (nightly or RC tag) → production (manual approval or automated after soak). Each hop is a Kustomize overlay or Helm values change—not a rebuilt image.
flowchart LR
subgraph git["Git repository"]
Base["base/"]
Dev["overlays/dev"]
Stg["overlays/staging"]
Prd["overlays/production"]
end
CI["CI builds immutable digest"] --> Reg["Registry"]
Reg --> Dev
Dev -->|"auto sync"| DevCls["dev cluster"]
Stg -->|"RC tag"| StgCls["staging cluster"]
Prd -->|"approval + soak"| PrdCls["production cluster"]
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: production
resources:
  - ../../base
images:
  - name: registry.example.com/payments-api
    newName: registry.example.com/payments-api
    digest: sha256:abc123def456...
commonAnnotations:
  release.sharpbyte.dev/version: "2.9.0"
  release.sharpbyte.dev/changelog: "https://github.com/org/payments/releases/tag/v2.9.0"
patches:
  - patch: |-
      - op: replace
        path: /spec/replicas
        value: 12
    target:
      kind: Deployment
      name: payments-api
# GitHub Actions workflow
name: Release
on:
  push:
    tags: ['v*']
permissions:
  contents: write
jobs:
  release:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: softprops/action-gh-release@v2
        with:
          generate_release_notes: true
      # needs.build.outputs.digest assumes a build job (not shown) that exports the
      # image digest and is listed under this job's needs
      - name: Promote digest to staging
        run: |
          cd gitops/overlays/staging
          yq -i '.images[0].digest = "${{ needs.build.outputs.digest }}"' kustomization.yaml
          git commit -am "staging: ${{ github.ref_name }}"
          git push
# GitLab CI jobs
create-release:
  stage: release
  image: registry.gitlab.com/gitlab-org/release-cli:latest
  rules:
    - if: $CI_COMMIT_TAG
  script:
    # The release keyword below creates the release; calling release-cli create
    # here as well would attempt to create the same release twice
    - echo "Creating release $CI_COMMIT_TAG"
  release:
    name: "$CI_COMMIT_TAG"
    tag_name: $CI_COMMIT_TAG
    description: "Release $CI_COMMIT_TAG"
promote-staging:
  stage: deploy
  rules:
    - if: $CI_COMMIT_TAG
  script:
    - yq -i '.images[0].digest = env(IMAGE_DIGEST)' gitops/overlays/staging/kustomization.yaml
    - git commit -am "staging $CI_COMMIT_TAG" && git push
Pin digests in production overlays, not floating tags. Tags are mutable; a digest is content-addressed, so it identifies exactly which image ran and is what provenance attestations (SLSA, for example) are recorded against.
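A pre-merge check on the production overlay can reject any image reference that is not digest-pinned. The following sketch assumes references of the form name@sha256:<64 hex characters>; the function names are illustrative, and in a cluster the same rule would normally be enforced by an admission policy as well.

```python
import re

# A reference counts as pinned only if it ends in a full sha256 digest.
DIGEST_REF = re.compile(r"^[^@\s]+@sha256:[0-9a-f]{64}$")


def is_digest_pinned(image):
    """True when the image reference carries an immutable sha256 digest."""
    return bool(DIGEST_REF.match(image))


def unpinned_images(images):
    """Return the references that still rely on a mutable tag."""
    return [image for image in images if not is_digest_pinned(image)]


if __name__ == "__main__":
    refs = [
        "registry.example.com/payments-api@sha256:" + "ab" * 32,
        "registry.example.com/payments-api:2.9.0",
    ]
    for ref in unpinned_images(refs):
        print(f"not pinned: {ref}")
```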
Rollback Strategies
Rollback means restoring known-good behavior within the time your MTTR objective allows. Kubernetes keeps ReplicaSet history, GitOps keeps git history, and feature flags give instant logic rollback. Use the layer that matches the failure mode.
Rollback layers
| Layer | Command / action | Speed | Scope |
|---|---|---|---|
| Feature flag | Disable in flag console | Seconds | Logic only |
| K8s rollout undo | kubectl rollout undo | 1–5 min | Workload image |
| GitOps revert | git revert + sync | 2–10 min | Full manifest set |
| Traffic shift | Route 100% to stable color | Seconds | Ingress/mesh |
| DB rollback | Forward-fix migration | Hours | Schema — avoid reverse migrations |
$ kubectl rollout history deployment/payments-api -n production
$ kubectl rollout undo deployment/payments-api -n production --to-revision=42
$ kubectl rollout status deployment/payments-api -n production
→ deployment "payments-api" successfully rolled out

$ oc rollout history deployment/payments-api -n production
$ oc rollout undo deployment/payments-api -n production
$ oc rollout undo deployment/payments-api -n production --to-revision=42
# GitHub Actions workflow
name: Emergency GitOps rollback
on:
  workflow_dispatch:
    inputs:
      revert_sha:
        description: SHA of the bad promotion commit to revert
        required: true
jobs:
  rollback:
    runs-on: ubuntu-latest
    environment: production-emergency
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0
      - name: Revert GitOps commit
        run: |
          git revert --no-edit ${{ inputs.revert_sha }}
          git push origin main
      - name: Wait for Argo sync
        run: argocd app wait payments-api --health --timeout 600
# GitLab CI job
emergency-rollback:
  stage: deploy
  when: manual
  environment: production-emergency
  variables:
    REVERT_SHA: ""
  script:
    - git revert --no-edit $REVERT_SHA
    - git push origin main
    - argocd app wait payments-api --health --timeout 600
When asked "how do you deploy safely?", structure the answer as: strategy (rolling/canary) → gates (tests, scans, policies) → observability (metrics during progressive delivery) → rollback (multi-layer, practiced in game days). Connect it to DORA change failure rate and MTTR.
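kubectl rollout undo restores the previous ReplicaSet's pod template, but it does not revert database schema or data changes. If the schema has already moved past what the old code can read, the undo brings back pods that fail. Rollback runbooks must state which layers are safe to roll back together.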
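One way a runbook can encode "which layers are safe together" is to record, for each application release, the range of schema versions it can run against, and to check that range before an undo. The sketch below is a hypothetical model; the Release type and the integer schema versions are assumptions and do not correspond to any kubectl or Flyway feature.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Release:
    version: str
    min_schema: int  # oldest schema version this release can run against
    max_schema: int  # newest schema version this release can run against


def safe_to_undo(target, current_schema):
    """True when the rollback target can run against the schema in place."""
    return target.min_schema <= current_schema <= target.max_schema


if __name__ == "__main__":
    previous = Release("2.8.1", min_schema=41, max_schema=42)
    # Schema 42 is the expand step: the old code still works against it.
    print(safe_to_undo(previous, current_schema=42))
    # Schema 43 is the contract step: the old column is gone, so forward-fix.
    print(safe_to_undo(previous, current_schema=43))
```

When the check fails, the runbook should point to the feature-flag or forward-fix path instead of the workload undo.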