Blue-Green Deployment: Zero-Downtime Releases
Our deployments caused 2-3 minutes of downtime. Users saw errors during deploys. We deployed only during low-traffic hours (2 AM).
I implemented blue-green deployment. Now we deploy at any time, with zero downtime and instant rollback. We even deployed during Black Friday with zero issues.
The Old Way
Rolling update in Kubernetes:
```yaml
spec:
  replicas: 10
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
      maxSurge: 1
```
Problems:
- Mixed versions running simultaneously
- Database migrations tricky
- Rollback is slow (it requires another full rolling update)
- Can’t test production traffic before full rollout
Blue-Green Concept
Two identical environments:
- Blue: Current production (v1.0)
- Green: New version (v2.0)
Process:
- Deploy v2.0 to green environment
- Test green environment
- Switch traffic from blue to green
- Keep blue for quick rollback
Kubernetes Implementation
Use labels and services:
Blue deployment (current):
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app-blue
spec:
  replicas: 10
  selector:
    matchLabels:
      app: web-app
      version: blue
  template:
    metadata:
      labels:
        app: web-app
        version: blue
    spec:
      containers:
      - name: web-app
        image: web-app:1.0.0
        ports:
        - containerPort: 3000
```
Green deployment (new):
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app-green
spec:
  replicas: 10
  selector:
    matchLabels:
      app: web-app
      version: green
  template:
    metadata:
      labels:
        app: web-app
        version: green
    spec:
      containers:
      - name: web-app
        image: web-app:2.0.0
        ports:
        - containerPort: 3000
```
Service (routes traffic):
```yaml
apiVersion: v1
kind: Service
metadata:
  name: web-app
spec:
  selector:
    app: web-app
    version: blue  # Points to blue initially
  ports:
  - port: 80
    targetPort: 3000
```
Deployment Process
Step 1: Deploy green
```bash
kubectl apply -f deployment-green.yaml
```
Wait for pods to be ready:
```bash
kubectl wait --for=condition=available --timeout=300s deployment/web-app-green
```
Step 2: Test green internally
```bash
# Port-forward to a green pod
kubectl port-forward deployment/web-app-green 8080:3000

# Test
curl http://localhost:8080/health
```
Or create temporary service:
```yaml
apiVersion: v1
kind: Service
metadata:
  name: web-app-green-test
spec:
  selector:
    app: web-app
    version: green
  ports:
  - port: 80
    targetPort: 3000
```
Test via web-app-green-test service.
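For example, the smoke checks against the temporary service might look like this; it runs a throwaway curl pod inside the cluster (the pod name is arbitrary, and /health and /api/users are the endpoints used later in this post):

```bash
# One-off curl pod that hits the green-only test Service from inside the cluster
kubectl run green-smoke --rm -i --restart=Never --image=curlimages/curl --command -- \
  sh -c 'curl -fsS http://web-app-green-test/health && curl -fsS http://web-app-green-test/api/users'
```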
Step 3: Switch traffic
Update service selector:
```bash
kubectl patch service web-app -p '{"spec":{"selector":{"version":"green"}}}'
```
Traffic now goes to green!
Step 4: Monitor
Watch metrics, logs, errors:
```bash
kubectl logs -f deployment/web-app-green
```
Step 5: Delete blue (after confidence)
```bash
kubectl delete deployment web-app-blue
```
Instant Rollback
If green has issues, switch back to blue:
```bash
kubectl patch service web-app -p '{"spec":{"selector":{"version":"blue"}}}'
```
Traffic back to blue in seconds!
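Because the switch is just a selector patch, rollback is worth wrapping in a tiny helper so nobody has to type the JSON by hand under pressure. A minimal sketch, built from the same kubectl commands used above (the script name and output messages are mine):

```bash
#!/bin/bash
# rollback.sh - flip the web-app Service to whichever colour is NOT currently live
set -e

CURRENT=$(kubectl get service web-app -o jsonpath='{.spec.selector.version}')
if [ "$CURRENT" = "green" ]; then
  TARGET="blue"
else
  TARGET="green"
fi

kubectl patch service web-app -p "{\"spec\":{\"selector\":{\"version\":\"$TARGET\"}}}"
echo "Traffic switched from $CURRENT to $TARGET"
```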
Automated Script
I created a deployment script:
```bash
#!/bin/bash
# blue-green-deploy.sh
set -e

NEW_VERSION=$1
CURRENT_COLOR=$(kubectl get service web-app -o jsonpath='{.spec.selector.version}')

if [ "$CURRENT_COLOR" = "blue" ]; then
  NEW_COLOR="green"
else
  NEW_COLOR="blue"
fi

echo "Current: $CURRENT_COLOR"
echo "Deploying to: $NEW_COLOR"

# Update the idle deployment with the new image
kubectl set image deployment/web-app-$NEW_COLOR web-app=web-app:$NEW_VERSION

# Wait for rollout
kubectl rollout status deployment/web-app-$NEW_COLOR

# Run smoke tests against the per-colour test Service (e.g. web-app-green-test above)
echo "Running smoke tests..."
kubectl run smoke-test --rm -i --restart=Never --image=curlimages/curl -- \
  curl -f http://web-app-$NEW_COLOR-test/health || exit 1

# Switch traffic
echo "Switching traffic to $NEW_COLOR..."
kubectl patch service web-app -p "{\"spec\":{\"selector\":{\"version\":\"$NEW_COLOR\"}}}"

echo "Deployment complete!"
echo "To roll back: kubectl patch service web-app -p '{\"spec\":{\"selector\":{\"version\":\"$CURRENT_COLOR\"}}}'"
```
Usage:
```bash
./blue-green-deploy.sh 2.0.0
```
Database Migrations
Challenge: Database changes during deployment.
Strategy 1: Backward-compatible migrations
```sql
-- v1.0 schema
CREATE TABLE users (
    id INT PRIMARY KEY,
    name VARCHAR(100)
);

-- v2.0 migration (add column with default)
ALTER TABLE users ADD COLUMN email VARCHAR(255) DEFAULT '';

-- v2.0 code uses the email column
-- v1.0 code ignores the email column (backward compatible)
```
Deploy process:
- Run migration
- Deploy v2.0 to green
- Switch traffic
- Delete blue
Strategy 2: Multi-phase deployment
- Phase 1: Add column (both versions work)
- Phase 2: Deploy v2.0 (uses new column)
- Phase 3: Remove old code (see the sketch below)
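A sketch of what those phases can look like in SQL, reusing the users table from Strategy 1 (the column dropped in Phase 3 is a made-up example):

```sql
-- Phase 1: additive, backward-compatible change; blue (v1.0) keeps working
ALTER TABLE users ADD COLUMN email VARCHAR(255) DEFAULT '';

-- Phase 2: deploy v2.0 to green and switch traffic;
-- only the new code reads and writes the email column

-- Phase 3: after blue is deleted and v1.x can never run again,
-- drop anything only the old code needed (hypothetical column)
ALTER TABLE users DROP COLUMN legacy_contact_info;
```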
Canary Testing
Test with small percentage of traffic:
Use Istio or Nginx Ingress for traffic splitting:
```yaml
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: web-app
spec:
  hosts:
  - web-app
  http:
  - match:
    - headers:
        canary:
          exact: "true"
    route:
    - destination:
        host: web-app
        subset: green
  - route:
    - destination:
        host: web-app
        subset: blue
      weight: 90
    - destination:
        host: web-app
        subset: green
      weight: 10
```
10% traffic to green, 90% to blue.
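One thing the VirtualService above assumes: the blue and green subsets have to be defined in a DestinationRule that maps each subset to the version labels on the Deployments. A minimal sketch:

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: web-app
spec:
  host: web-app
  subsets:
  - name: blue
    labels:
      version: blue
  - name: green
    labels:
      version: green
```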
Monitoring During Switch
Watch key metrics:
```bash
# Error count in green logs
watch -n 1 'kubectl logs deployment/web-app-green | grep ERROR | wc -l'

# Resource usage (CPU/memory) of green pods
watch -n 1 'kubectl top pods -l version=green'
```
Prometheus queries:
```promql
# Error rate
rate(http_requests_total{status=~"5..", version="green"}[1m])

# Latency
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket{version="green"}[1m]))
```
Cost Consideration
Blue-green requires 2x resources during deployment.
Optimization: Scale down blue after switch
```bash
# After successful switch
kubectl scale deployment web-app-blue --replicas=2

# Keep 2 pods for quick rollback
# Delete after 24 hours if no issues
```
Real-World Example
Our production deployment:
```bash
# 1. Deploy green
kubectl apply -f deployment-green.yaml

# 2. Wait for ready
kubectl wait --for=condition=available deployment/web-app-green

# 3. Smoke test (from a pod inside the cluster, or via port-forward)
curl -f http://web-app-green-test/health
curl -f http://web-app-green-test/api/users

# 4. Switch 10% traffic (canary)
#    (using Istio VirtualService)

# 5. Monitor for 10 minutes
#    Check Grafana dashboards, error rates

# 6. Switch 100% traffic
kubectl patch service web-app -p '{"spec":{"selector":{"version":"green"}}}'

# 7. Monitor for 1 hour

# 8. Scale down blue
kubectl scale deployment web-app-blue --replicas=2

# 9. Delete blue after 24 hours
kubectl delete deployment web-app-blue
```
Rollback Story
Deployed v2.5 during business hours:
- 14:00 - Deploy to green
- 14:05 - Switch traffic
- 14:07 - Error rate spike! (bug in payment processing)
- 14:08 - Rollback to blue
- 14:09 - Error rate normal
Total impact: 2 minutes, 50 failed payments.
Without blue-green: rolling back would have taken 15+ minutes.
Comparison with Other Strategies
| Strategy | Downtime | Rollback Time | Resource Cost |
|---|---|---|---|
| Recreate | 2-3 min | 5-10 min | 1x |
| Rolling Update | 0 | 5-10 min | 1.1x |
| Blue-Green | 0 | < 1 min | 2x |
| Canary | 0 | < 1 min | 1.1-2x |
Blue-green: Best for critical services where instant rollback is essential.
Lessons Learned
- Test green thoroughly - Don’t rush the switch
- Monitor closely - Watch metrics during switch
- Keep blue running - For at least 1 hour after switch
- Automate - Manual switches are error-prone
- Database migrations - Plan carefully
Results
Before:
- 2-3 min downtime per deploy
- Deploys only at 2 AM
- Rollback takes 15 min
- 2-3 deploys per week
After:
- Zero downtime
- Deploy anytime
- Rollback in < 1 min
- 10-15 deploys per week
Conclusion
Blue-green deployment enables zero-downtime releases and instant rollback. Essential for high-availability services.
Key takeaways:
- Two identical environments (blue and green)
- Switch traffic instantly
- Rollback in seconds
- Requires 2x resources temporarily
- Plan database migrations carefully
If downtime is unacceptable, implement blue-green deployment. Your users will never know you deployed.