We shipped a bug that crashed the payment service. Every user was affected, and revenue stopped for 45 minutes. We needed a better way.

I implemented canary deployment. Now we test new versions with 5% of traffic first. Last bug? Caught in 2 minutes, affected only 5% of users, rolled back automatically.

The Problem

Blue-green deployment issues:

  • All-or-nothing switch
  • Bug affects 100% of users
  • No gradual testing
  • Rollback affects everyone

We needed gradual rollout.

Canary Concept

Release process:

  1. Deploy new version (canary)
  2. Route 5% traffic to canary
  3. Monitor metrics
  4. Gradually increase to 10%, 25%, 50%, 100%
  5. Or rollback if issues detected

Like a canary in a coal mine, it's an early warning system.
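
The loop above can be sketched in a few lines of Python; `check_health` here is a hypothetical stand-in for whatever metric check gates each step:

```python
# Sketch of the canary control loop described above (not production code).
# check_health stands in for a real metrics query (e.g. against Prometheus).

def run_canary(check_health, steps=(5, 10, 25, 50, 100)):
    """Shift traffic step by step; roll back on the first failed check."""
    for weight in steps:
        # In a real system this step would update the load balancer or mesh.
        if not check_health(weight):
            return ("rolled_back", weight)  # step 5: rollback on issues
    return ("promoted", 100)                # step 4 completed: full traffic

print(run_canary(lambda w: True))     # ('promoted', 100)
print(run_canary(lambda w: w < 25))   # ('rolled_back', 25)
```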

Kubernetes Native Approach

Two deployments with different replicas:

Stable (v1.0):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app-stable
spec:
  replicas: 19  # 95% of traffic
  selector:
    matchLabels:
      app: web-app
      version: stable
  template:
    metadata:
      labels:
        app: web-app
        version: stable
    spec:
      containers:
      - name: web-app
        image: web-app:1.0.0

Canary (v2.0):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app-canary
spec:
  replicas: 1  # 5% of traffic
  selector:
    matchLabels:
      app: web-app
      version: canary
  template:
    metadata:
      labels:
        app: web-app
        version: canary
    spec:
      containers:
      - name: web-app
        image: web-app:2.0.0

Service (load balances across both):

apiVersion: v1
kind: Service
metadata:
  name: web-app
spec:
  selector:
    app: web-app  # Matches both stable and canary
  ports:
  - port: 80
    targetPort: 3000

Traffic split: 19:1 ≈ 95%:5% (the Service balances requests roughly evenly across ready pods, so the split tracks replica counts)
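
With plain Kubernetes the split is only as fine as your pod count. A quick sketch (the helper name is made up) of how replica counts map to a target canary percentage:

```python
def replica_split(total_replicas, canary_percent):
    """Return (stable, canary) replica counts approximating the target split."""
    canary = max(1, round(total_replicas * canary_percent / 100))
    return total_replicas - canary, canary

print(replica_split(20, 5))   # (19, 1) -> the 95%:5% split above
print(replica_split(10, 5))   # (9, 1)  -> coarser: actually a 90%:10% split
```

With only 10 pods you can't get closer than 10%, which is one reason to reach for a service mesh when you need precise weights.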

Istio for Advanced Canary

Install Istio:

curl -L https://istio.io/downloadIstio | sh -
cd istio-*  # directory name matches the downloaded version
./bin/istioctl install --set profile=demo -y  # older releases used: kubectl apply -f install/kubernetes/istio-demo.yaml

Enable sidecar injection:

kubectl label namespace default istio-injection=enabled

VirtualService for traffic splitting:

apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: web-app
spec:
  hosts:
  - web-app
  http:
  - match:
    - headers:
        canary:
          exact: "true"
    route:
    - destination:
        host: web-app
        subset: canary
  - route:
    - destination:
        host: web-app
        subset: stable
      weight: 95
    - destination:
        host: web-app
        subset: canary
      weight: 5

DestinationRule:

apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: web-app
spec:
  host: web-app
  subsets:
  - name: stable
    labels:
      version: stable
  - name: canary
    labels:
      version: canary

Gradual Rollout Script

Automated canary rollout:

#!/bin/bash

set -e

CANARY_WEIGHTS=(5 10 25 50 100)
WAIT_TIME=300  # 5 minutes between steps

for weight in "${CANARY_WEIGHTS[@]}"; do
    echo "Routing $weight% traffic to canary..."
    
    # Update VirtualService (note: a merge patch replaces the whole http list,
    # so any header-match rule defined earlier is dropped)
    kubectl patch virtualservice web-app --type merge -p "
    spec:
      http:
      - route:
        - destination:
            host: web-app
            subset: stable
          weight: $((100 - weight))
        - destination:
            host: web-app
            subset: canary
          weight: $weight
    "
    
    # Wait and monitor
    echo "Monitoring for $WAIT_TIME seconds..."
    sleep $WAIT_TIME
    
    # Check error rate (curl -G + --data-urlencode keeps curl from mangling the PromQL braces)
    ERROR_RATE=$(curl -sG http://prometheus:9090/api/v1/query \
        --data-urlencode 'query=rate(http_requests_total{status=~"5..",version="canary"}[5m])' \
        | jq -r '.data.result[0].value[1] // "0"')
    
    if (( $(echo "$ERROR_RATE > 0.01" | bc -l) )); then
        echo "Error rate too high: $ERROR_RATE"
        echo "Rolling back..."
        kubectl patch virtualservice web-app --type merge -p "
        spec:
          http:
          - route:
            - destination:
                host: web-app
                subset: stable
              weight: 100
        "
        exit 1
    fi
    
    echo "Canary at $weight% looks good"
done

echo "Canary rollout complete!"

Monitoring Canary

Prometheus queries:

Error rate:

rate(http_requests_total{status=~"5..", version="canary"}[5m])
/
rate(http_requests_total{version="canary"}[5m])

Latency p95:

histogram_quantile(0.95, 
  rate(http_request_duration_seconds_bucket{version="canary"}[5m])
)

Compare canary vs stable (use error ratios - the canary serves far less traffic, so raw 5xx rates aren't comparable):

(
  rate(http_requests_total{status=~"5..", version="canary"}[5m])
  /
  rate(http_requests_total{version="canary"}[5m])
)
/
(
  rate(http_requests_total{status=~"5..", version="stable"}[5m])
  /
  rate(http_requests_total{version="stable"}[5m])
)
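
To see why comparing by ratio matters, plug in some made-up numbers: at a 95/5 split the stable version serves roughly 19x the traffic, so its raw 5xx volume can exceed the canary's even when the canary is much less healthy:

```python
# Illustrative counts over a 5-minute window (made-up numbers).
stable_total, stable_errors = 19000, 95   # 0.5% error ratio
canary_total, canary_errors = 1000, 20    # 2.0% error ratio

# By raw 5xx volume, stable looks worse...
print(stable_errors > canary_errors)   # True

# ...but the error *ratio* shows the canary is 4x less healthy.
stable_ratio = stable_errors / stable_total
canary_ratio = canary_errors / canary_total
print(round(canary_ratio / stable_ratio, 6))   # 4.0
```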

Automated Rollback

Prometheus alert:

groups:
- name: canary
  rules:
  - alert: CanaryHighErrorRate
    expr: |
      rate(http_requests_total{status=~"5..", version="canary"}[5m])
      /
      rate(http_requests_total{version="canary"}[5m])
      > 0.01
    for: 2m
    annotations:
      summary: "Canary error rate too high"
      description: "Canary version has {{ $value }} error rate"

Alertmanager webhook triggers rollback:

from flask import Flask, request
import subprocess

app = Flask(__name__)

@app.route('/webhook', methods=['POST'])
def rollback():
    alert = request.json
    
    # Act only on firing canary alerts (Alertmanager also sends "resolved" notifications)
    firing = [
        a for a in alert.get('alerts', [])
        if a['labels'].get('alertname') == 'CanaryHighErrorRate' and a.get('status') == 'firing'
    ]
    
    if firing:
        print("Rolling back canary deployment...")
        
        # Route 100% of traffic back to stable
        subprocess.run([
            'kubectl', 'patch', 'virtualservice', 'web-app',
            '--type', 'merge',
            '-p', '{"spec":{"http":[{"route":[{"destination":{"host":"web-app","subset":"stable"},"weight":100}]}]}}'
        ], check=True)
        
        return "Rollback triggered", 200
    
    return "OK", 200

if __name__ == '__main__':
    app.run(port=5000)
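
To exercise the handler locally, POST a payload shaped like Alertmanager's webhook format (the label values here are made up):

```python
import json

# Minimal Alertmanager-style webhook payload for the alert defined above.
payload = {
    "status": "firing",
    "alerts": [
        {
            "status": "firing",
            "labels": {"alertname": "CanaryHighErrorRate", "severity": "critical"},
            "annotations": {"summary": "Canary error rate too high"},
        }
    ],
}

# The same lookup the webhook handler performs:
alertname = payload["alerts"][0]["labels"]["alertname"]
print(alertname)   # CanaryHighErrorRate

body = json.dumps(payload)  # what curl would send as the POST body
# e.g.: curl -X POST -H 'Content-Type: application/json' \
#            -d @payload.json http://localhost:5000/webhook
```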

Header-Based Routing

Test canary with specific header:

apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: web-app
spec:
  hosts:
  - web-app
  http:
  - match:
    - headers:
        x-canary:
          exact: "true"
    route:
    - destination:
        host: web-app
        subset: canary
  - route:
    - destination:
        host: web-app
        subset: stable

Test:

curl -H "x-canary: true" http://web-app/

User-Based Canary

Route specific users to canary:

apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: web-app
spec:
  hosts:
  - web-app
  http:
  - match:
    - headers:
        user-id:
          regex: "^(1|2|3|4|5)$"  # Users 1-5
    route:
    - destination:
        host: web-app
        subset: canary
  - route:
    - destination:
        host: web-app
        subset: stable

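Istio's `regex` header match is anchored to the full header value, so the pattern routes exactly users 1-5; Python's `re.fullmatch` mirrors that behavior for a quick sanity check:

```python
import re

# The same pattern as the VirtualService header match above.
pattern = re.compile(r"^(1|2|3|4|5)$")

print(bool(pattern.fullmatch("3")))    # True  -> user 3 goes to canary
print(bool(pattern.fullmatch("15")))   # False -> user 15 stays on stable
print(bool(pattern.fullmatch("51")))   # False -> so does user 51
```
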
Flagger for Automated Canary

Install Flagger:

kubectl apply -k github.com/fluxcd/flagger//kustomize/istio

Canary resource:

apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: web-app
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app
  service:
    port: 80
  analysis:
    interval: 1m
    threshold: 5
    maxWeight: 50
    stepWeight: 5
    metrics:
    - name: request-success-rate
      thresholdRange:
        min: 99
      interval: 1m
    - name: request-duration
      thresholdRange:
        max: 500
      interval: 1m

Flagger automatically:

  1. Deploys canary
  2. Increases traffic gradually
  3. Monitors metrics
  4. Rolls back if metrics fail
  5. Promotes if successful

Real-World Example

Our payment service canary:

apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: payment-service
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payment-service
  service:
    port: 8080
  analysis:
    interval: 2m
    threshold: 3
    maxWeight: 50
    stepWeight: 10
    metrics:
    - name: request-success-rate
      thresholdRange:
        min: 99.5
      interval: 1m
    - name: request-duration
      thresholdRange:
        max: 1000
      interval: 1m
    webhooks:
    - name: load-test
      url: http://flagger-loadtester/
      timeout: 5s
      metadata:
        cmd: "hey -z 1m -q 10 -c 2 http://payment-service:8080/"

Rollout process:

  1. 0% → 10% (2 min)
  2. 10% → 20% (2 min)
  3. 20% → 30% (2 min)
  4. 30% → 40% (2 min)
  5. 40% → 50% (2 min)
  6. 50% → 100% (promote)

Total: 10 minutes for full rollout
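
The 10-minute figure falls straight out of the `analysis` settings; a hypothetical helper makes the arithmetic explicit:

```python
def rollout_minutes(max_weight, step_weight, interval_min):
    """Minutes of analysis before Flagger reaches maxWeight and can promote."""
    steps = max_weight // step_weight   # 50 // 10 = 5 traffic increments
    return steps * interval_min

print(rollout_minutes(50, 10, 2))   # 10 -> this payment-service config
print(rollout_minutes(50, 5, 1))    # 10 -> the earlier web-app config
```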

Canary vs Blue-Green

Feature         Canary              Blue-Green
Rollout         Gradual             Instant
Risk            Low                 Medium
Rollback        Instant             Instant
Complexity      High                Medium
Resource cost   1.05x               2x
Testing         Production traffic  Pre-switch testing

Use canary for:

  • High-risk changes
  • Gradual validation
  • A/B testing

Use blue-green for:

  • Database migrations
  • Quick rollback needed
  • All-or-nothing changes

Results

Before (blue-green):

  • All users affected by bugs
  • 45-minute outage from bad deploy
  • No gradual testing

After (canary):

  • Only 5% affected initially
  • 2-minute detection time
  • Automatic rollback
  • Zero major outages in 6 months

Lessons Learned

  1. Start small - 5% is enough to catch issues
  2. Monitor closely - Automated metrics are essential
  3. Automate rollback - Don’t rely on manual intervention
  4. Test with real traffic - Staging doesn’t catch everything
  5. Be patient - Gradual rollout takes time, but it's worth it

Conclusion

Canary deployment reduces risk by gradually testing new versions with real traffic. It's essential for high-availability services.

Key takeaways:

  1. Gradual traffic shift (5% → 100%)
  2. Monitor error rate and latency
  3. Automated rollback on issues
  4. Use Istio or Flagger for automation
  5. Balance speed vs safety

Deploy with confidence. Canary releases catch bugs before they become disasters.