Canary Deployment: Testing in Production with Confidence
We deployed a bug that crashed the payment service. All users affected. Revenue stopped for 45 minutes. We needed a better way.
I implemented canary deployment. Now we test new versions with 5% of traffic first. Last bug? Caught in 2 minutes, affected only 5% of users, rolled back automatically.
The Problem
Blue-green deployment issues:
- All-or-nothing switch
- Bug affects 100% of users
- No gradual testing
- Rollback affects everyone
We needed gradual rollout.
Canary Concept
Release process:
- Deploy new version (canary)
- Route 5% traffic to canary
- Monitor metrics
- Gradually increase to 10%, 25%, 50%, 100%
- Or roll back if issues are detected
Like a canary in a coal mine - early warning system.
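The release process above boils down to a small loop. Here is a minimal sketch of it in Python, assuming a hypothetical `is_healthy` callback that stands in for whatever metric check you run at each stage (it is not part of any real tool):

```python
from typing import Callable, Sequence


def run_canary(stages: Sequence[int], is_healthy: Callable[[int], bool]) -> int:
    """Walk through the traffic stages and return the final canary weight.

    A return value of 0 means the rollout was rolled back.
    """
    for weight in stages:
        # In a real rollout, traffic would be shifted to `weight` percent here
        # and metrics observed for a while before calling the health check.
        if not is_healthy(weight):
            return 0  # roll back: all traffic returns to the stable version
    return stages[-1] if stages else 0


if __name__ == "__main__":
    stages = [5, 10, 25, 50, 100]
    print(run_canary(stages, lambda w: True))    # healthy at every stage
    print(run_canary(stages, lambda w: w < 25))  # problem shows up at 25%
```

The point of the structure is that a failure at any stage stops the loop, so a bad release never reaches the later, larger stages.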
Kubernetes Native Approach
Two Deployments with different replica counts behind a single Service:
Stable (v1.0):
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app-stable
spec:
  replicas: 19  # 95% of traffic
  selector:
    matchLabels:
      app: web-app
      version: stable
  template:
    metadata:
      labels:
        app: web-app
        version: stable
    spec:
      containers:
        - name: web-app
          image: web-app:1.0.0
Canary (v2.0):
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app-canary
spec:
  replicas: 1  # 5% of traffic
  selector:
    matchLabels:
      app: web-app
      version: canary
  template:
    metadata:
      labels:
        app: web-app
        version: canary
    spec:
      containers:
        - name: web-app
          image: web-app:2.0.0
Service (load balances across both):
apiVersion: v1
kind: Service
metadata:
  name: web-app
spec:
  selector:
    app: web-app  # Matches both stable and canary
  ports:
    - port: 80
      targetPort: 3000
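Traffic split: 19 stable pods to 1 canary pod, so roughly 95% to 5%. The Service balances across pods rather than enforcing a ratio, so the split is approximate.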
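With this approach the traffic share is tied to pod counts, so the achievable percentages depend on how many replicas you run. A small sketch of the arithmetic, using a hypothetical `replica_split` helper and assuming traffic is spread evenly across pods:

```python
from typing import Tuple


def replica_split(total_replicas: int, canary_percent: float) -> Tuple[int, int]:
    """Return (stable, canary) replica counts that approximate canary_percent."""
    if total_replicas < 2:
        raise ValueError("need at least 2 replicas to run stable and canary side by side")
    canary = round(total_replicas * canary_percent / 100)
    # Keep at least one pod on each side so both versions receive traffic.
    canary = max(1, min(canary, total_replicas - 1))
    return total_replicas - canary, canary


def actual_percent(stable: int, canary: int) -> float:
    """Canary share of traffic, assuming every pod gets an equal share."""
    return 100 * canary / (stable + canary)


if __name__ == "__main__":
    for total in (20, 4):
        stable, canary = replica_split(total, 5)
        print(total, stable, canary, actual_percent(stable, canary))
```

With 20 replicas a 5% canary is exactly one pod. With only 4 replicas the smallest possible canary is one pod out of four, which is 25% of traffic. That granularity limit is the main reason to move to a service mesh for finer control.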
Istio for Advanced Canary
Install Istio:
# Pin the version so the directory name below matches what was downloaded
curl -L https://istio.io/downloadIstio | ISTIO_VERSION=1.0.0 sh -
cd istio-1.0.0
kubectl apply -f install/kubernetes/istio-demo.yaml
Enable sidecar injection:
kubectl label namespace default istio-injection=enabled
VirtualService for traffic splitting:
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: web-app
spec:
  hosts:
    - web-app
  http:
    - match:
        - headers:
            canary:
              exact: "true"
      route:
        - destination:
            host: web-app
            subset: canary
    - route:
        - destination:
            host: web-app
            subset: stable
          weight: 95
        - destination:
            host: web-app
            subset: canary
          weight: 5
DestinationRule:
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: web-app
spec:
  host: web-app
  subsets:
    - name: stable
      labels:
        version: stable
    - name: canary
      labels:
        version: canary
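Shifting traffic later means rewriting the two weights in the VirtualService. A minimal sketch of building that patch in Python, assuming the `web-app` host and the `stable`/`canary` subsets defined above; the helper names `weighted_routes` and `merge_patch` are made up for illustration:

```python
import json


def weighted_routes(canary_weight: int, host: str = "web-app") -> list:
    """Build the weighted route list for a given canary percentage."""
    if not 0 <= canary_weight <= 100:
        raise ValueError("canary_weight must be between 0 and 100")
    return [
        {"destination": {"host": host, "subset": "stable"}, "weight": 100 - canary_weight},
        {"destination": {"host": host, "subset": "canary"}, "weight": canary_weight},
    ]


def merge_patch(canary_weight: int) -> str:
    """Return a JSON body suitable for `kubectl patch --type merge -p`."""
    return json.dumps({"spec": {"http": [{"route": weighted_routes(canary_weight)}]}})


if __name__ == "__main__":
    print(merge_patch(25))
```

One thing to keep in mind: a merge patch replaces the whole `http` list, so a patch like this also drops any header-match rule that was in the VirtualService before.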
Gradual Rollout Script
Automated canary rollout:
#!/bin/bash
set -e

CANARY_WEIGHTS=(5 10 25 50 100)
WAIT_TIME=300  # 5 minutes between steps
MAX_ERROR_RATIO=0.01
# Share of canary requests that returned 5xx over the last 5 minutes
QUERY='sum(rate(http_requests_total{status=~"5..",version="canary"}[5m])) / sum(rate(http_requests_total{version="canary"}[5m]))'

# Replace the VirtualService routes with a stable/canary split.
# A merge patch replaces the whole http list, including any header-match rules.
set_canary_weight() {
  kubectl patch virtualservice web-app --type merge -p "
spec:
  http:
    - route:
        - destination:
            host: web-app
            subset: stable
          weight: $((100 - $1))
        - destination:
            host: web-app
            subset: canary
          weight: $1
"
}

for weight in "${CANARY_WEIGHTS[@]}"; do
  echo "Routing $weight% traffic to canary..."
  set_canary_weight "$weight"

  # Wait and monitor
  echo "Monitoring for $WAIT_TIME seconds..."
  sleep "$WAIT_TIME"

  # Check error ratio; an empty result (no matching series) counts as 0
  ERROR_RATE=$(curl -s -G http://prometheus:9090/api/v1/query --data-urlencode "query=$QUERY" \
    | jq -r '.data.result[0].value[1] // "0"')
  ERROR_RATE=${ERROR_RATE:-1}  # no answer from Prometheus: fail safe and roll back

  if (( $(echo "$ERROR_RATE > $MAX_ERROR_RATIO" | bc -l) )); then
    echo "Error rate too high: $ERROR_RATE"
    echo "Rolling back..."
    set_canary_weight 0
    exit 1
  fi

  echo "Canary at $weight% looks good"
done

echo "Canary rollout complete!"
Monitoring Canary
Prometheus queries:
Error rate:
sum(rate(http_requests_total{status=~"5..", version="canary"}[5m]))
/
sum(rate(http_requests_total{version="canary"}[5m]))
Latency p95:
histogram_quantile(0.95,
  sum by (le) (rate(http_request_duration_seconds_bucket{version="canary"}[5m]))
)
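Compare canary vs stable (ratio of the two error rates; above 1 means the canary is worse):
(
  sum(rate(http_requests_total{status=~"5..", version="canary"}[5m]))
  /
  sum(rate(http_requests_total{version="canary"}[5m]))
)
/
(
  sum(rate(http_requests_total{status=~"5..", version="stable"}[5m]))
  /
  sum(rate(http_requests_total{version="stable"}[5m]))
)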
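To use these queries from a script, you can call the Prometheus HTTP API and read the first sample from the response. A minimal sketch using only the standard library, assuming the `http://prometheus:9090` address from the rollout script; `first_sample` and `instant_query` are illustrative names, not part of any client library:

```python
import json
import urllib.parse
import urllib.request
from typing import Optional

PROMETHEUS_URL = "http://prometheus:9090"  # assumed address, as in the rollout script


def first_sample(payload: dict) -> Optional[float]:
    """Extract the first sample of an instant-query response.

    Returns None when the query failed or matched no series, so the caller
    can decide whether "no data" should count as healthy or not.
    """
    if payload.get("status") != "success":
        return None
    result = payload.get("data", {}).get("result", [])
    if not result:
        return None
    # Each sample is [timestamp, "value-as-string"]; "NaN" parses to float nan.
    return float(result[0]["value"][1])


def instant_query(promql: str, base_url: str = PROMETHEUS_URL) -> Optional[float]:
    """Run an instant query against /api/v1/query and return the first sample."""
    url = f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": promql})
    with urllib.request.urlopen(url, timeout=10) as resp:
        return first_sample(json.load(resp))
```

Keeping the parsing separate from the network call makes the "no data" case explicit. A canary with no traffic yet returns an empty result, which is easy to mistake for a healthy one.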
Automated Rollback
Prometheus alert:
groups:
  - name: canary
    rules:
      - alert: CanaryHighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5..", version="canary"}[5m]))
          /
          sum(rate(http_requests_total{version="canary"}[5m]))
          > 0.01
        for: 2m
        annotations:
          summary: "Canary error rate too high"
          description: "Canary version has {{ $value }} error rate"
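The `for: 2m` clause means the alert only fires once the expression has stayed true across consecutive evaluations for two minutes, so a single bad scrape does not trigger a rollback. A simplified model of that behaviour, assuming a 30-second evaluation interval (an assumption; the interval is not stated in this setup):

```python
from typing import Sequence


def alert_fires(values: Sequence[float], threshold: float = 0.01,
                eval_interval_s: int = 30, for_s: int = 120) -> bool:
    """Return True if `values` stays above `threshold` for at least `for_s` seconds.

    `values` holds one error-ratio reading per rule evaluation, oldest first.
    """
    held_s = 0      # how long the condition has been continuously true
    active = False  # whether the previous evaluation was above the threshold
    for value in values:
        if value > threshold:
            held_s = held_s + eval_interval_s if active else 0
            active = True
            if held_s >= for_s:
                return True
        else:
            # Any evaluation below the threshold resets the pending timer.
            active = False
            held_s = 0
    return False


if __name__ == "__main__":
    print(alert_fires([0.02] * 5))                      # sustained for 2 minutes
    print(alert_fires([0.02, 0.02, 0.0, 0.02, 0.02]))   # interrupted, timer resets
```

The trade-off is detection time: a longer `for` filters out noise but leaves the canary serving errors for longer before the rollback starts.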
Alertmanager webhook triggers rollback:
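import subprocess

from flask import Flask, request

app = Flask(__name__)

# Merge patch that sends 100% of traffic back to the stable subset
ROLLBACK_PATCH = '{"spec":{"http":[{"route":[{"destination":{"host":"web-app","subset":"stable"},"weight":100}]}]}}'


@app.route('/webhook', methods=['POST'])
def rollback():
    payload = request.get_json(silent=True) or {}
    # Alertmanager batches alerts, so check every entry, not just the first
    for alert in payload.get('alerts', []):
        labels = alert.get('labels', {})
        if alert.get('status') == 'firing' and labels.get('alertname') == 'CanaryHighErrorRate':
            print("Rolling back canary deployment...")
            # check=True turns a failed kubectl call into an HTTP 500
            subprocess.run(
                ['kubectl', 'patch', 'virtualservice', 'web-app',
                 '--type', 'merge', '-p', ROLLBACK_PATCH],
                check=True,
            )
            return "Rollback triggered", 200
    return "OK", 200


if __name__ == '__main__':
    # Listen on all interfaces so Alertmanager can reach the service
    app.run(host='0.0.0.0', port=5000)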
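The rollback decision is easier to test if it lives in a pure function, separate from Flask and `kubectl`. A sketch under that assumption; `should_rollback` is a hypothetical helper, and the payload shape follows the Alertmanager webhook format (a top-level `alerts` list whose entries carry `status` and `labels`):

```python
ROLLBACK_ALERT = "CanaryHighErrorRate"


def should_rollback(payload: dict) -> bool:
    """Return True if any firing alert in the webhook payload is the canary alert."""
    for alert in (payload or {}).get("alerts", []):
        labels = alert.get("labels", {})
        if alert.get("status") == "firing" and labels.get("alertname") == ROLLBACK_ALERT:
            return True
    return False


if __name__ == "__main__":
    sample = {
        "status": "firing",
        "alerts": [
            {"status": "resolved", "labels": {"alertname": "DiskAlmostFull"}},
            {"status": "firing", "labels": {"alertname": "CanaryHighErrorRate"}},
        ],
    }
    print(should_rollback(sample))
```

Checking `status` matters because Alertmanager can also send a notification when an alert resolves, and a resolved canary alert should not trigger another rollback.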
Header-Based Routing
Test canary with specific header:
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: web-app
spec:
  hosts:
    - web-app
  http:
    - match:
        - headers:
            x-canary:
              exact: "true"
      route:
        - destination:
            host: web-app
            subset: canary
    - route:
        - destination:
            host: web-app
            subset: stable
Test:
curl -H "x-canary: true" http://web-app/
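The `http` rules are evaluated top to bottom and the first match wins, which is why the catch-all route to stable goes last. A small Python model of that ordering, mirroring the two rules above; it only handles exact header matches and is not how Istio itself is implemented:

```python
from typing import Dict, List, Tuple

# (required headers, destination subset), in the same order as the VirtualService
RULES: List[Tuple[Dict[str, str], str]] = [
    ({"x-canary": "true"}, "canary"),
    ({}, "stable"),  # no match conditions: catch-all
]


def pick_subset(headers: Dict[str, str], rules: List[Tuple[Dict[str, str], str]] = RULES) -> str:
    """Return the subset of the first rule whose headers all match exactly."""
    # HTTP header names are case-insensitive, so compare them in lowercase.
    normalized = {name.lower(): value for name, value in headers.items()}
    for match, subset in rules:
        if all(normalized.get(name) == value for name, value in match.items()):
            return subset
    raise LookupError("no rule matched")


if __name__ == "__main__":
    print(pick_subset({"X-Canary": "true"}))
    print(pick_subset({"Accept": "text/html"}))
```

If the catch-all rule were listed first, it would match every request and the header rule would never be reached.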
User-Based Canary
Route specific users to canary:
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: web-app
spec:
  hosts:
    - web-app
  http:
    - match:
        - headers:
            user-id:
              regex: "^(1|2|3|4|5)$"  # Users 1-5
      route:
        - destination:
            host: web-app
            subset: canary
    - route:
        - destination:
            host: web-app
            subset: stable
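Writing the regex by hand gets error-prone once the user list grows. A minimal sketch that generates the same anchored pattern from a list of IDs, assuming integer user IDs sent in the `user-id` header; `user_id_regex` is a hypothetical helper:

```python
import re
from typing import Iterable


def user_id_regex(user_ids: Iterable[int]) -> str:
    """Build an anchored alternation that matches exactly the given user IDs."""
    alternatives = "|".join(re.escape(str(uid)) for uid in sorted(set(user_ids)))
    if not alternatives:
        raise ValueError("at least one user ID is required")
    # The ^ and $ anchors stop "1" from also matching "13" or "21".
    return f"^({alternatives})$"


if __name__ == "__main__":
    pattern = user_id_regex([1, 2, 3, 4, 5])
    print(pattern)
    print(bool(re.fullmatch(pattern, "3")), bool(re.fullmatch(pattern, "13")))
```

The anchors are the important detail: without them, a pattern for users 1 to 5 would also send users 10, 15 and 21 to the canary.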
Flagger for Automated Canary
Install Flagger:
# Flagger now lives under the fluxcd GitHub organization
kubectl apply -k github.com/fluxcd/flagger//kustomize/istio
Canary resource:
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: web-app
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app
  service:
    port: 80
  analysis:
    interval: 1m
    threshold: 5
    maxWeight: 50
    stepWeight: 5
    metrics:
      - name: request-success-rate
        thresholdRange:
          min: 99
        interval: 1m
      - name: request-duration
        thresholdRange:
          max: 500
        interval: 1m
Flagger automatically:
- Deploys canary
- Increases traffic gradually
- Monitors metrics
- Rolls back if metrics fail
- Promotes if successful
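To get a feel for how `stepWeight`, `maxWeight` and `threshold` interact, here is a deliberately simplified model of the analysis loop. It is a sketch, not Flagger's actual controller logic: it assumes one pass/fail check per interval, a cumulative failure counter, and promotion on the first passing check after `maxWeight` is reached.

```python
from typing import Iterable, List, Tuple


def simulate_analysis(checks: Iterable[bool], step_weight: int = 5,
                      max_weight: int = 50, threshold: int = 5) -> Tuple[str, List[int]]:
    """Replay a sequence of check results and return (outcome, canary weights seen)."""
    weight = 0
    failed = 0
    history: List[int] = []
    for passed in checks:
        if not passed:
            failed += 1
            if failed >= threshold:
                return "rolled back", history
            continue  # hold the current weight and try again next interval
        if weight >= max_weight:
            return "promoted", history
        weight = min(weight + step_weight, max_weight)
        history.append(weight)
    return "in progress", history


if __name__ == "__main__":
    print(simulate_analysis([True] * 11))
    print(simulate_analysis([True] + [False] * 5))
```

With the values from the resource above, a clean run steps through ten weights before promotion, while five failed checks roll the canary back at whatever weight it had reached.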
Real-World Example
Our payment service canary:
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: payment-service
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payment-service
  service:
    port: 8080
  analysis:
    interval: 2m
    threshold: 3
    maxWeight: 50
    stepWeight: 10
    metrics:
      - name: request-success-rate
        thresholdRange:
          min: 99.5
        interval: 1m
      - name: request-duration
        thresholdRange:
          max: 1000
        interval: 1m
    webhooks:
      - name: load-test
        url: http://flagger-loadtester/
        timeout: 5s
        metadata:
          cmd: "hey -z 1m -q 10 -c 2 http://payment-service:8080/"
Rollout process:
- 0% → 10% (2 min)
- 10% → 20% (2 min)
- 20% → 30% (2 min)
- 30% → 40% (2 min)
- 40% → 50% (2 min)
- 50% → 100% (promote)
Total: 10 minutes for full rollout
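The 10-minute figure follows from the analysis settings: the interval multiplied by the number of steps (`maxWeight / stepWeight`). The same settings also bound how long a failing canary stays up, which is the interval multiplied by `threshold`. A short sketch of that arithmetic with illustrative function names:

```python
from typing import List


def stage_weights(step_weight: int, max_weight: int) -> List[int]:
    """Canary weights visited on the way to max_weight."""
    return list(range(step_weight, max_weight + 1, step_weight))


def min_promotion_minutes(interval_min: float, step_weight: int, max_weight: int) -> float:
    """Minimum time to promotion: interval * (maxWeight / stepWeight)."""
    return interval_min * (max_weight / step_weight)


def rollback_minutes(interval_min: float, threshold: int) -> float:
    """Time until rollback when every check fails: interval * threshold."""
    return interval_min * threshold


if __name__ == "__main__":
    # Values from the payment-service Canary above
    print(stage_weights(10, 50))
    print(min_promotion_minutes(2, 10, 50))
    print(rollback_minutes(2, 3))
```

For the payment service this gives five stages, 10 minutes to promotion, and up to 6 minutes before a consistently failing canary is rolled back. Lowering `interval` or `threshold` shortens that window at the cost of noisier decisions.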
Canary vs Blue-Green
| Feature | Canary | Blue-Green |
|---|---|---|
| Rollout | Gradual | Instant |
| Risk | Low | Medium |
| Rollback | Instant | Instant |
| Complexity | High | Medium |
| Resource Cost | 1.05x | 2x |
| Testing | Production traffic | Pre-switch testing |
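The resource cost row comes from simple pod arithmetic. A sketch assuming the replica counts from the Kubernetes-native example (19 stable pods plus 1 canary) and a blue-green setup that runs two full copies:

```python
def canary_cost_factor(stable_replicas: int, canary_replicas: int) -> float:
    """Total pods during a canary rollout, relative to running stable alone."""
    return (stable_replicas + canary_replicas) / stable_replicas


def blue_green_cost_factor() -> float:
    """Blue-green keeps two full environments running side by side."""
    return 2.0


if __name__ == "__main__":
    print(round(canary_cost_factor(19, 1), 2))
    print(blue_green_cost_factor())
```

The canary factor grows if you add canary pods instead of converting stable ones as traffic shifts, so the 1.05x figure applies to the initial 5% stage.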
Use canary for:
- High-risk changes
- Gradual validation
- A/B testing
Use blue-green for:
- Database migrations
- Quick rollback needed
- All-or-nothing changes
Results
Before (blue-green):
- All users affected by bugs
- 45-minute outage from bad deploy
- No gradual testing
After (canary):
- Only 5% affected initially
- 2-minute detection time
- Automatic rollback
- Zero major outages in 6 months
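One way to put the before and after numbers side by side is to multiply the share of affected traffic by the time it was affected. This "exposure" figure is an illustrative metric, not something measured in the incidents themselves, and it assumes traffic is uniform over time:

```python
def exposure(traffic_percent: int, minutes: int) -> int:
    """Exposure in percent-minutes: share of traffic hit times how long it lasted."""
    return traffic_percent * minutes


if __name__ == "__main__":
    before = exposure(100, 45)  # every user, 45-minute outage
    after = exposure(5, 2)      # 5% of users, caught in 2 minutes
    print(before, after, before // after)
```

By this rough measure the canary incident had 450 times less exposure than the original outage: 4500 percent-minutes versus 10.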
Lessons Learned
- Start small - 5% is enough to catch issues
- Monitor closely - Automated metrics are essential
- Automate rollback - Don’t rely on manual intervention
- Test with real traffic - Staging doesn’t catch everything
- Be patient - Gradual rollout takes time, but it is worth it
Conclusion
Canary deployment reduces risk by exposing a new version to real traffic gradually. It is essential for high-availability services.
Key takeaways:
- Gradual traffic shift (5% → 100%)
- Monitor error rate and latency
- Automated rollback on issues
- Use Istio or Flagger for automation
- Balance speed vs safety
Deploy with confidence. Canary releases catch bugs before they become disasters.