Deployment Strategies

A deployment strategy answers one question: how do we replace running software with a new version without taking the service down, and without betting the business on a single atomic switch? Kubernetes-native primitives (RollingUpdate, Recreate) cover the basics; production teams layer blue/green, canary, and GitOps promotion on top.

Strategy comparison

Strategy	Downtime	Rollback speed	Resource cost	Best for
Rolling update	None (if probes correct)	Minutes (roll back image tag)	Low — one extra surge pod	Stateless APIs, default K8s path
Recreate	Yes — all pods terminated first	Fast redeploy	Lowest	Dev/staging, jobs that cannot run two versions
Blue/green	None at switch	Instant (flip Service selector)	2× capacity during cutover	Regulated releases, schema-compatible versions
Canary	None	Traffic shift or image revert	Partial duplicate stack	High-traffic services with metric gates
Feature flags	None	Toggle off (seconds)	SDK + flag service	Long-lived branches, A/B experiments

Kubernetes RollingUpdate mechanics

The Deployment controller owns ReplicaSets: one active, with older ones scaled to zero and kept as rollback history (up to revisionHistoryLimit). With spec.strategy.type: RollingUpdate, maxUnavailable and maxSurge control how aggressively pods are replaced. The kubelet does not know about rollouts; it only starts and stops the containers of pods that the ReplicaSet controller creates or deletes and that the scheduler has assigned to its node.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: payments-api
  namespace: production
  labels:
    app: payments-api
    track: stable
spec:
  replicas: 6
  revisionHistoryLimit: 10
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
      maxSurge: 2
  selector:
    matchLabels:
      app: payments-api
  template:
    metadata:
      labels:
        app: payments-api
        version: "2.8.1"
    spec:
      terminationGracePeriodSeconds: 45
      containers:
        - name: api
          image: registry.example.com/payments-api@sha256:abc123...
          readinessProbe:
            httpGet:
              path: /readyz
              port: 8080
            periodSeconds: 5
            failureThreshold: 3
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh", "-c", "sleep 15"]
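
The same bounds can also be written as percentages. The fragment below is a sketch of how that arithmetic works out, assuming the same six replicas as the manifest above; Kubernetes rounds maxUnavailable down and maxSurge up when it converts a percentage to a pod count.

```yaml
# Fragment of a Deployment spec (not a complete manifest).
spec:
  replicas: 6
  strategy:
    type: RollingUpdate
    rollingUpdate:
      # 25% of 6 = 1.5, rounded down to 1:
      # at least 6 - 1 = 5 pods stay available throughout the rollout.
      maxUnavailable: 25%
      # 25% of 6 = 1.5, rounded up to 2:
      # at most 6 + 2 = 8 pods (old + new) exist at any moment.
      maxSurge: 25%
```

With these values the percentage form behaves the same as maxUnavailable: 1 and maxSurge: 2, but it scales with the replica count if the Deployment grows.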

sequenceDiagram
  participant Dev as Developer
  participant CI as CI pipeline
  participant Reg as Container registry
  participant API as kube-apiserver
  participant RS as ReplicaSet controller
  participant KL as kubelet
  Dev->>CI: merge to main
  CI->>Reg: push image digest
  CI->>API: kubectl set image / GitOps sync
  API->>RS: new ReplicaSet created
  RS->>KL: create pod v2 (surge)
  KL-->>RS: readiness OK
  RS->>KL: terminate pod v1
  RS-->>Dev: rollout status complete

🔬 Under the Hood

The Deployment controller compares desired replicas from the new ReplicaSet template against observed ready replicas. If readiness probes lie—returning 200 before the app can serve traffic—RollingUpdate will drain healthy v1 pods while v2 pods accept traffic they cannot handle. Readiness gates rollout safety.
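
One way to limit the damage from an optimistic probe is to require new pods to stay Ready for a while before they count as available, and to bound how long a stalled rollout may hang. The fragment below is a hedged sketch; the numbers are illustrative assumptions, not values from the manifest above.

```yaml
# Fragment of a Deployment spec; values are illustrative assumptions.
spec:
  # A new pod must stay Ready for 20s without a container crashing before it
  # counts as available, so the controller waits before removing more v1 pods.
  minReadySeconds: 20
  # If the rollout makes no progress for 5 minutes, the Deployment reports
  # Progressing=False (reason ProgressDeadlineExceeded). It does not roll back on its own.
  progressDeadlineSeconds: 300
  template:
    spec:
      containers:
        - name: api
          readinessProbe:
            httpGet:
              path: /readyz      # should verify real dependencies, not just return 200
              port: 8080
            periodSeconds: 5
            failureThreshold: 3
```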

⚠️ Pitfall

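Percentages round down for maxUnavailable and up for maxSurge, so maxUnavailable: 50% on a two-replica Deployment lets the rollout take one of the two pods out of service. A bad image push then leaves a single v1 pod carrying all traffic until the rollout is fixed or undone; at 100%, both pods can terminate at once. Keep surge/unavailable conservative, and add a PodDisruptionBudget for minimum availability during node drains: a PDB limits voluntary evictions, not the Deployment controller's own rolling update.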
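
A PodDisruptionBudget for the six-replica payments-api Deployment shown earlier might look like the sketch below. A PDB governs voluntary evictions such as node drains, while the rollout itself is bounded by maxUnavailable and maxSurge. The object name and the threshold are assumptions.

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: payments-api-pdb      # hypothetical name
  namespace: production
spec:
  # Evictions (for example from kubectl drain) are refused if they would
  # leave fewer than 5 of the matching pods available.
  minAvailable: 5
  selector:
    matchLabels:
      app: payments-api
```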

Blue/green on Kubernetes

Run two Deployments—payments-api-blue and payments-api-green—behind one Service. Cutover is a label selector patch or Ingress weight change. Rollback is flipping back; old stack stays warm until you decommission it.
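
As a rough sketch of the objects involved, the Service selects on a color label and each Deployment stamps that label on its pods. The Service port, replica count, and image tag below are assumptions for illustration.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: payments-api
  namespace: production
spec:
  selector:
    app: payments-api
    color: blue          # flip to "green" to cut over; flip back to roll back
  ports:
    - port: 80           # assumed Service port
      targetPort: 8080   # container port used by the readiness probe earlier
---
# The green stack; payments-api-blue is identical apart from name and color.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payments-api-green
  namespace: production
spec:
  replicas: 6            # assumed; both stacks run at full size during cutover
  selector:
    matchLabels:
      app: payments-api
      color: green
  template:
    metadata:
      labels:
        app: payments-api
        color: green
    spec:
      containers:
        - name: api
          image: registry.example.com/payments-api:2.9.0   # illustrative tag
```

Because the Service matches on both labels, the idle stack receives no traffic until the selector changes, which is what makes the flip close to instant.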

name: Blue-green deploy
on:
  workflow_dispatch:
    inputs:
      color:
        description: Target stack (blue|green)
        required: true
jobs:
  deploy:
    runs-on: ubuntu-latest
    environment: production
    steps:
      - uses: actions/checkout@v4
      - name: Deploy target color
        run: |
          # Update the idle stack in the production namespace, then block until its pods are Ready
          kubectl set image deployment/payments-api-${{ inputs.color }} -n production \
            api=registry.example.com/payments-api:${{ github.sha }}
          kubectl rollout status deployment/payments-api-${{ inputs.color }} -n production
      - name: Switch Service selector
        run: |
          kubectl patch service payments-api -n production -p \
            '{"spec":{"selector":{"color":"${{ inputs.color }}"}}}'

deploy-blue:
  stage: deploy
  environment: production
  script:
    - kubectl set image deployment/payments-api-blue -n production api=$CI_REGISTRY_IMAGE:$CI_COMMIT_SHA
    - kubectl rollout status deployment/payments-api-blue -n production
  when: manual

switch-to-blue:
  stage: deploy
  environment: production
  script:
    - kubectl patch service payments-api -n production -p '{"spec":{"selector":{"color":"blue"}}}'
  needs: [deploy-blue]
  when: manual

Feature Flags & Decoupled Release

Deploying code and releasing features are different events. Feature flags let you ship dark code to production, enable it for internal testers, then ramp percentage—without a second deployment. Flags complement canary traffic splitting when behavior is gated in application logic.

Flag types

Type	Purpose	Rollback
Release toggle	Hide incomplete features until ready	Disable flag — no redeploy
Ops kill switch	Disable expensive code path under load	Instant — SRE playbook
Experiment	A/B test conversion metrics	Revert cohort assignment
Permission	Entitlement / plan gating	Config change in flag service

Architecture

An SDK in the app polls or streams flag state from LaunchDarkly, Unleash, Flagsmith, or a ConfigMap-backed open-source server. Evaluate flags server-side for security-sensitive behavior; client-side flags leak intent in JavaScript bundles.

apiVersion: v1
kind: ConfigMap
metadata:
  name: unleash-features
  namespace: platform
data:
  new-checkout-flow.json: |
    {
      "name": "new-checkout-flow",
      "description": "Stripe Elements v2 checkout",
      "type": "release",
      "enabled": false,
      "strategies": [
        {
          "name": "flexibleRollout",
          "parameters": { "rollout": "25", "stickiness": "userId" }
        }
      ]
    }

Flags vs canary traffic

Canary routes HTTP requests to different pod versions—whole binary changes. Feature flags route logic inside one binary. Use both: canary validates infra and latency; flags validate product behavior per user cohort.

name: Deploy with flag default-off
on:
  push:
    branches: [main]
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Build and push
        run: |
          docker build -t registry.example.com/checkout:${{ github.sha }} .
          docker push registry.example.com/checkout:${{ github.sha }}
      - name: Ensure flag disabled
        env:
          UNLEASH_URL: ${{ vars.UNLEASH_URL }}  # assumption: base URL kept as a repository variable
          UNLEASH_TOKEN: ${{ secrets.UNLEASH_ADMIN_TOKEN }}
        run: |
          # Force the flag off before the new image starts serving traffic
          curl -sf -X POST "$UNLEASH_URL/api/admin/projects/default/features/new-checkout-flow/environments/production/off" \
            -H "Authorization: $UNLEASH_TOKEN"
      - name: Sync deployment
        run: kubectl set image deployment/checkout checkout=registry.example.com/checkout:${{ github.sha }}

deploy-checkout:
  stage: deploy
  script:
    - docker build -t $CI_REGISTRY_IMAGE:$CI_COMMIT_SHA .
    - docker push $CI_REGISTRY_IMAGE:$CI_COMMIT_SHA
    # Force the flag off before the new image starts serving traffic
    - |
      curl -sf -X POST "$UNLEASH_URL/api/admin/projects/default/features/new-checkout-flow/environments/production/off" \
        -H "Authorization: $UNLEASH_ADMIN_TOKEN"
    - kubectl set image deployment/checkout checkout=$CI_REGISTRY_IMAGE:$CI_COMMIT_SHA
  environment: production

💡 Pro Tip

Every flag needs an owner and expiry date. Permanent flags become undeletable conditional branches—exactly the complexity trunk-based development tries to eliminate.

⚖️ Trade-off

Managed flag services add cost and network dependency. ConfigMap-backed flags are free but lack real-time analytics and sticky cohort targeting at scale.

Progressive Delivery & Metric Gates

Progressive delivery extends continuous deployment with automated promotion gates—error rate, latency p99, business KPIs—before increasing traffic share. Argo Rollouts, Flagger, and service mesh traffic splitting implement this on Kubernetes.

Argo Rollouts canary

Argo Rollouts replaces Deployment for fine-grained steps: 10% → 30% → 60% → 100%, pausing for AnalysisRun success between steps. Failed analysis triggers automatic rollback.

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: payments-api
  namespace: production
spec:
  replicas: 10
  strategy:
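    canary:
      # Istio host-level traffic splitting needs a Service per track; these names are assumptions
      canaryService: payments-api-canary
      stableService: payments-api-stable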
      steps:
        - setWeight: 10
        - pause: { duration: 5m }
        - analysis:
            templates:
              - templateName: success-rate
        - setWeight: 30
        - pause: { duration: 5m }
        - setWeight: 60
        - pause: { duration: 10m }
        - setWeight: 100
      trafficRouting:
        istio:
          virtualService:
            name: payments-api
            routes:
              - primary
  selector:
    matchLabels:
      app: payments-api
  template:
    metadata:
      labels:
        app: payments-api
    spec:
      containers:
        - name: api
          image: registry.example.com/payments-api:2.9.0
---
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
  namespace: production
spec:
  metrics:
    - name: error-rate
      interval: 1m
      count: 5
      successCondition: result[0] < 0.01
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            sum(rate(http_requests_total{app="payments-api",status=~"5.."}[2m]))
            / sum(rate(http_requests_total{app="payments-api"}[2m]))
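
The Rollout above points at an Istio VirtualService named payments-api with a route called primary. A minimal sketch of that object follows; Argo Rollouts rewrites the two weights at each setWeight step. The host name and the two Service names are assumptions, and the Service names would need to match the Rollout's canaryService and stableService fields.

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: payments-api            # matches trafficRouting.istio.virtualService.name
  namespace: production
spec:
  hosts:
    - payments-api              # assumed in-mesh host name
  http:
    - name: primary             # matches the route listed in the Rollout
      route:
        - destination:
            host: payments-api-stable   # hypothetical stable Service
          weight: 100
        - destination:
            host: payments-api-canary   # hypothetical canary Service
          weight: 0
```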

flowchart LR
  CI["CI builds digest"] --> Sync["GitOps sync"]
  Sync --> R10["10% canary"]
  R10 --> A1{"Analysis OK?"}
  A1 -->|yes| R30["30% traffic"]
  A1 -->|no| RB["Auto rollback"]
  R30 --> A2{"Analysis OK?"}
  A2 -->|yes| R100["100% promote"]
  A2 -->|no| RB

name: Trigger progressive rollout
on:
  workflow_run:
    workflows: [CI]
    types: [completed]
jobs:
  promote:
    if: ${{ github.event.workflow_run.conclusion == 'success' }}
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Update image in GitOps repo
        run: |
          yq -i '.spec.template.spec.containers[0].image = "registry.example.com/api:${{ github.event.workflow_run.head_sha }}"' \
            deploy/overlays/production/rollout.yaml
          git config user.name "github-actions[bot]"  # assumed committer name; git needs both name and email
          git config user.email "[email protected]"
          git commit -am "promote ${{ github.event.workflow_run.head_sha }}"
          git push

promote-production:
  stage: deploy
  image: bitnami/git:latest
  script:
    - git clone $GITOPS_REPO gitops && cd gitops
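    # strenv() keeps the values as plain strings; env() would parse them as YAML
    - yq -i '.spec.template.spec.containers[0].image = strenv(CI_REGISTRY_IMAGE) + ":" + strenv(CI_COMMIT_SHA)' deploy/production/rollout.yaml
    - git config user.name "$GITLAB_USER_NAME" && git config user.email "$GITLAB_USER_EMAIL"
    - git commit -am "promote $CI_COMMIT_SHA"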
    - git push origin main
  rules:
    - if: $CI_COMMIT_BRANCH == "main"
  environment: production

🛡️ Security gate

Progressive delivery does not replace policy admission. Images must pass Cosign verification and Kyverno digest-pin policies before the Rollout controller creates canary pods. Security gates apply at deploy time; metric gates apply at traffic-shift time.

🌍 Real World

Teams that skip analysis templates and manually promote after a 5-minute pause are doing timed canary, not progressive delivery. Wire at least error rate and saturation metrics—or you are guessing.
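
A second gate can sit next to the error-rate template from earlier. The sketch below checks p99 latency; the histogram metric name, the 0.5 s threshold, and the template name are assumptions, and a saturation metric could be wired the same way.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: latency-p99             # hypothetical template name
  namespace: production
spec:
  metrics:
    - name: latency-p99
      interval: 1m
      count: 5
      failureLimit: 1           # tolerate one failed measurement before the metric fails
      successCondition: result[0] < 0.5   # seconds; threshold is an assumption
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            histogram_quantile(0.99,
              sum(rate(http_request_duration_seconds_bucket{app="payments-api"}[2m])) by (le))
```

Listing both templates under an analysis step makes promotion depend on every metric passing, not on a timer.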

Database Migrations in Deploy Pipelines

Schema changes are the hardest part of zero-downtime deploys. Application rollouts assume expand/contract compatibility: new code must run against old schema, old code against new schema—during the overlap window.

Expand/contract pattern

Expand — add nullable column or new table; deploy code that writes both old and new
Migrate data — backfill job; dual-read validation
Contract — remove old column after all code reads new path

Never drop a column in the same release that stops reading it. Production overlap can last hours during slow rollouts.

Migration execution models

Model	When	Risk
Pre-deploy Job	Before new pods start	Blocks deploy if migration fails — safe default
Init container	Per pod startup	Race on concurrent schema lock — use advisory locks
App startup	ORM auto-migrate	Forbidden in production — no audit trail
Manual DBA window	Large index builds	Requires maintenance window communication

apiVersion: batch/v1
kind: Job
metadata:
  name: flyway-migrate-2-9-0
  namespace: production
  annotations:
    argocd.argoproj.io/hook: PreSync
    argocd.argoproj.io/hook-delete-policy: HookSucceeded
spec:
  backoffLimit: 0
  template:
    spec:
      restartPolicy: Never
      serviceAccountName: migrate-runner
      containers:
        - name: flyway
          image: flyway/flyway:10
          args:
            - migrate
            - -url=jdbc:postgresql://payments-db:5432/payments
            - -user=$(DB_USER)
            - -password=$(DB_PASSWORD)
            - -locations=filesystem:/sql
          envFrom:
            - secretRef:
                name: payments-db-credentials
          volumeMounts:
            - name: sql
              mountPath: /sql
      volumes:
        - name: sql
          configMap:
            name: flyway-v2-9-0
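
The Job above mounts a ConfigMap named flyway-v2-9-0 at /sql. A sketch of what an expand-phase migration in that ConfigMap could contain is shown below; the table and column names are hypothetical, and the file name follows Flyway's V<version>__<description>.sql convention.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: flyway-v2-9-0
  namespace: production
data:
  V2_9_0__add_payer_email.sql: |
    -- Expand step only: additive and nullable, so code that predates
    -- the column keeps working during the rollout overlap.
    -- Table and column names are hypothetical.
    ALTER TABLE payments ADD COLUMN payer_email TEXT NULL;
```

The backfill and the eventual drop of the old column would ship as later, separate migrations, in line with the expand/contract sequence.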

name: Migrate then deploy
on:
  push:
    branches: [main]
jobs:
  migrate:
    runs-on: ubuntu-latest
    environment: production
    steps:
      - uses: actions/checkout@v4
      - name: Run Flyway
        run: |
          kubectl delete job flyway-migrate-${{ github.sha }} -n production --ignore-not-found
          kubectl apply -f k8s/jobs/flyway-${{ github.sha }}.yaml
          kubectl wait --for=condition=complete job/flyway-migrate-${{ github.sha }} -n production --timeout=600s
  deploy:
    needs: migrate
    runs-on: ubuntu-latest
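    steps:
      - name: Roll out new image
        run: |
          # rollout restart only recycles pods on the current image; set the new image explicitly
          kubectl set image deployment/payments-api -n production \
            api=registry.example.com/payments-api:${{ github.sha }}
          kubectl rollout status deployment/payments-api -n production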

db-migrate:
  stage: deploy
  environment: production
  script:
    - kubectl apply -f k8s/jobs/flyway-$CI_COMMIT_SHA.yaml
    - kubectl wait --for=condition=complete job/flyway-migrate-$CI_COMMIT_SHA -n production --timeout=600s
  rules:
    - if: $CI_COMMIT_BRANCH == "main"

app-deploy:
  stage: deploy
  needs: [db-migrate]
  script:
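    # rollout restart only recycles pods on the current image; set the new image explicitly
    - kubectl set image deployment/payments-api -n production api=$CI_REGISTRY_IMAGE:$CI_COMMIT_SHA
    - kubectl rollout status deployment/payments-api -n production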
  environment: production

⚠️ Pitfall

Long-running migrations inside PreSync hooks block Argo CD sync—and every subsequent deploy. Split additive migrations (fast) from backfills (async Job) from destructive drops (separate release).

Release Management & Environments

Release management connects version semantics, change approval, environment promotion, and audit trails. GitOps makes the desired state in git the release artifact; tags and changelogs remain the human contract with stakeholders.

Promotion pipeline

Typical flow: dev (every merge) → staging (nightly or RC tag) → production (manual approval or automated after soak). Each hop is a Kustomize overlay or Helm values change—not a rebuilt image.

flowchart LR
  subgraph git["Git repository"]
    Base["base/"]
    Dev["overlays/dev"]
    Stg["overlays/staging"]
    Prd["overlays/production"]
  end
  CI["CI builds immutable digest"] --> Reg["Registry"]
  Reg --> Dev
  Dev -->|"auto sync"| DevCls["dev cluster"]
  Stg -->|"RC tag"| StgCls["staging cluster"]
  Prd -->|"approval + soak"| PrdCls["production cluster"]

apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: production
resources:
  - ../../base
images:
  - name: registry.example.com/payments-api
    newName: registry.example.com/payments-api
    digest: sha256:abc123def456...
commonAnnotations:
  release.sharpbyte.dev/version: "2.9.0"
  release.sharpbyte.dev/changelog: "https://github.com/org/payments/releases/tag/v2.9.0"
patches:
  - patch: |-
      - op: replace
        path: /spec/replicas
        value: 12
    target:
      kind: Deployment
      name: payments-api
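
To tie the overlay to a cluster, a GitOps controller has to watch that path. The sketch below assumes Argo CD (which the rollback examples in this document also use); the repository URL and the argocd namespace are assumptions.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: payments-api
  namespace: argocd                      # assumed Argo CD install namespace
spec:
  project: default
  source:
    repoURL: https://github.com/org/gitops.git   # hypothetical GitOps repository
    targetRevision: main
    path: overlays/production
  destination:
    server: https://kubernetes.default.svc       # the cluster Argo CD runs in
    namespace: production
  syncPolicy:
    automated:
      prune: true       # delete resources that were removed from git
      selfHeal: true    # revert manual drift back to the git state
```

With automated sync, the approval step lives in the review of the commit that changes the production overlay, since merging that commit is what triggers the rollout.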

name: Release
on:
  push:
    tags: ['v*']
permissions:
  contents: write
jobs:
  release:
    needs: build  # assumes a build job (not shown) that exports the image digest as an output
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: softprops/action-gh-release@v2
        with:
          generate_release_notes: true
      - name: Promote digest to staging
        run: |
          cd gitops/overlays/staging
          yq -i '.images[0].digest = "${{ needs.build.outputs.digest }}"' kustomization.yaml
          git commit -am "staging: ${{ github.ref_name }}"
          git push

create-release:
  stage: release
  image: registry.gitlab.com/gitlab-org/release-cli:latest
  rules:
    - if: $CI_COMMIT_TAG
  script:
    # The release: keyword below creates the release; calling release-cli here too would create it twice
    - echo "Creating release $CI_COMMIT_TAG"
  release:
    tag_name: $CI_COMMIT_TAG
    name: $CI_COMMIT_TAG
    description: "Release $CI_COMMIT_TAG"

promote-staging:
  stage: deploy
  rules:
    - if: $CI_COMMIT_TAG
  script:
    - yq -i '.images[0].digest = env(IMAGE_DIGEST)' gitops/overlays/staging/kustomization.yaml
    - git commit -am "staging $CI_COMMIT_TAG" && git push

💡 Pro Tip

Pin digest in production overlays, not floating tags. Tags are mutable; digests are the SLSA-provable identity of what ran.

Rollback Strategies

Rollback means restoring known-good behavior within the time your MTTR SLO allows. Kubernetes keeps ReplicaSet history; GitOps keeps git history; feature flags give instant logic rollback. Use the right layer for the failure mode.

Rollback layers

Layer	Command / action	Speed	Scope
Feature flag	Disable in flag console	Seconds	Logic only
K8s rollout undo	kubectl rollout undo	1–5 min	Workload image
GitOps revert	git revert + sync	2–10 min	Full manifest set
Traffic shift	Route 100% to stable color	Seconds	Ingress/mesh
DB rollback	Forward-fix migration	Hours	Schema — avoid reverse migrations

$ kubectl rollout history deployment/payments-api -n production
$ kubectl rollout undo deployment/payments-api -n production --to-revision=42
$ kubectl rollout status deployment/payments-api -n production
→ deployment "payments-api" successfully rolled out

$ oc rollout history deployment/payments-api -n production
$ oc rollout undo deployment/payments-api -n production
$ oc rollout undo deployment/payments-api -n production --to-revision=42

name: Emergency GitOps rollback
on:
  workflow_dispatch:
    inputs:
      revert_sha:
        description: SHA of the bad commit to revert (git revert undoes this commit)
        required: true
jobs:
  rollback:
    runs-on: ubuntu-latest
    environment: production-emergency
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0
      - name: Revert GitOps commit
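        run: |
          # Identity for the revert commit; the values are placeholders
          git config user.name "github-actions[bot]"
          git config user.email "github-actions[bot]@users.noreply.github.com"
          git revert --no-edit ${{ inputs.revert_sha }}
          git push origin main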
      - name: Wait for Argo sync
        run: argocd app wait payments-api --health --timeout 600

emergency-rollback:
  stage: deploy
  when: manual
  environment: production-emergency
  variables:
    REVERT_SHA: ""
  script:
    - git config user.name "$GITLAB_USER_NAME" && git config user.email "$GITLAB_USER_EMAIL"
    - git revert --no-edit $REVERT_SHA
    - git push origin main
    - argocd app wait payments-api --health --timeout 600

🎯 Interview angle

When asked "how do you deploy safely?", structure the answer as: strategy (rolling/canary) → gates (tests, scans, policies) → observability (metrics during progressive delivery) → rollback (multi-layer, practiced in game days). Connect it to DORA change failure rate and MTTR.

⚖️ Trade-off

kubectl rollout undo restores old ReplicaSet but not incompatible database state. Rollback runbooks must state which layers are safe together.