Our deployments caused 2-3 minutes of downtime. Users saw errors during deploys. We deployed only during low-traffic hours (2 AM).

I implemented blue-green deployment. Now we deploy anytime, with zero downtime and instant rollback. We even deployed during Black Friday with zero issues.

The Old Way

Rolling update in Kubernetes:

spec:
  replicas: 10
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1  # at most one pod offline at a time
      maxSurge: 1        # at most one extra pod during the rollout

Problems:

  • Mixed versions running simultaneously
  • Database migrations tricky
  • Rollback is slow (it's just another rolling update)
  • Can’t test production traffic before full rollout

Blue-Green Concept

Two identical environments:

  • Blue: Current production (v1.0)
  • Green: New version (v2.0)

Process:

  1. Deploy v2.0 to green environment
  2. Test green environment
  3. Switch traffic from blue to green
  4. Keep blue for quick rollback

Kubernetes Implementation

Use labels and services:

Blue deployment (current):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app-blue
spec:
  replicas: 10
  selector:
    matchLabels:
      app: web-app
      version: blue
  template:
    metadata:
      labels:
        app: web-app
        version: blue
    spec:
      containers:
      - name: web-app
        image: web-app:1.0.0
        ports:
        - containerPort: 3000

Green deployment (new):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app-green
spec:
  replicas: 10
  selector:
    matchLabels:
      app: web-app
      version: green
  template:
    metadata:
      labels:
        app: web-app
        version: green
    spec:
      containers:
      - name: web-app
        image: web-app:2.0.0
        ports:
        - containerPort: 3000
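
For the wait step and the traffic switch to be safe, both Deployments should declare a readiness probe so the Service only routes to pods that are actually healthy. A minimal sketch, reusing the /health endpoint from the smoke tests, to add under each container spec:

        readinessProbe:
          httpGet:
            path: /health
            port: 3000
          initialDelaySeconds: 5
          periodSeconds: 10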

Service (routes traffic):

apiVersion: v1
kind: Service
metadata:
  name: web-app
spec:
  selector:
    app: web-app
    version: blue  # Points to blue initially
  ports:
  - port: 80
    targetPort: 3000
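
You can check which color the Service currently targets at any time:

kubectl get service web-app -o jsonpath='{.spec.selector.version}'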

Deployment Process

Step 1: Deploy green

kubectl apply -f deployment-green.yaml

Wait for pods to be ready:

kubectl wait --for=condition=available --timeout=300s deployment/web-app-green

Step 2: Test green internally

# Port-forward to green pod
kubectl port-forward deployment/web-app-green 8080:3000

# Test
curl http://localhost:8080/health

Or create temporary service:

apiVersion: v1
kind: Service
metadata:
  name: web-app-green-test
spec:
  selector:
    app: web-app
    version: green
  ports:
  - port: 80
    targetPort: 3000

Test via the web-app-green-test service.
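
For an in-cluster check without port-forwarding, curl it from a throwaway pod (the same pattern as the smoke test in the script below):

kubectl run green-test --rm -i --restart=Never --image=curlimages/curl -- \
    curl -f http://web-app-green-test/health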

Step 3: Switch traffic

Update service selector:

kubectl patch service web-app -p '{"spec":{"selector":{"version":"green"}}}'

Traffic now goes to green!
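
To verify the switch took effect, check which pod IPs now back the Service:

# Should now list the green pods
kubectl get endpoints web-app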

Step 4: Monitor

Watch metrics, logs, errors:

kubectl logs -f deployment/web-app-green

Step 5: Delete blue (once you're confident)

kubectl delete deployment web-app-blue

Instant Rollback

If green has issues, switch back to blue:

kubectl patch service web-app -p '{"spec":{"selector":{"version":"blue"}}}'

Traffic back to blue in seconds!

Automated Script

I created a deployment script:

#!/bin/bash
# blue-green-deploy.sh

set -e

NEW_VERSION=$1
if [ -z "$NEW_VERSION" ]; then
    echo "Usage: $0 <version>"
    exit 1
fi

CURRENT_COLOR=$(kubectl get service web-app -o jsonpath='{.spec.selector.version}')

if [ "$CURRENT_COLOR" = "blue" ]; then
    NEW_COLOR="green"
else
    NEW_COLOR="blue"
fi

echo "Current: $CURRENT_COLOR"
echo "Deploying to: $NEW_COLOR"

# Update deployment with new image
kubectl set image deployment/web-app-$NEW_COLOR web-app=web-app:$NEW_VERSION

# Wait for rollout
kubectl rollout status deployment/web-app-$NEW_COLOR

# Run smoke tests against the per-color test Service (e.g. web-app-green-test)
echo "Running smoke tests..."
kubectl run smoke-test --rm -i --restart=Never --image=curlimages/curl -- \
    curl -f http://web-app-$NEW_COLOR-test/health || exit 1

# Switch traffic
echo "Switching traffic to $NEW_COLOR..."
kubectl patch service web-app -p "{\"spec\":{\"selector\":{\"version\":\"$NEW_COLOR\"}}}"

echo "Deployment complete!"
echo "To rollback: kubectl patch service web-app -p '{\"spec\":{\"selector\":{\"version\":\"$CURRENT_COLOR\"}}}'"

Usage:

./blue-green-deploy.sh 2.0.0

Database Migrations

Challenge: Database changes during deployment.

Strategy 1: Backward-compatible migrations

-- v1.0 schema
CREATE TABLE users (
    id INT PRIMARY KEY,
    name VARCHAR(100)
);

-- v2.0 migration (add column with default)
ALTER TABLE users ADD COLUMN email VARCHAR(255) DEFAULT '';

-- v2.0 code uses email column
-- v1.0 code ignores email column (backward compatible)

Deploy process:

  1. Run the migration (see the Job sketch after this list)
  2. Deploy v2.0 to green
  3. Switch traffic
  4. Delete blue
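
One way to run step 1 in-cluster is a one-off Job built from the new image. A sketch, assuming the image ships a migration command (the command here is hypothetical):

apiVersion: batch/v1
kind: Job
metadata:
  name: migrate-v2
spec:
  backoffLimit: 0
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: migrate
        image: web-app:2.0.0
        command: ["npm", "run", "migrate"]  # hypothetical; use your migration tool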

Strategy 2: Multi-phase deployment

Phase 1: Add column (both versions work)
Phase 2: Deploy v2.0 (uses new column)
Phase 3: Remove old code
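
As a sketch, here is what the phases look like for a hypothetical column rename (name to full_name):

-- Phase 1: expand - add the new column; v1.0 ignores it
ALTER TABLE users ADD COLUMN full_name VARCHAR(100);

-- Phase 2: deploy v2.0, which reads/writes full_name; backfill old rows
UPDATE users SET full_name = name WHERE full_name IS NULL;

-- Phase 3: contract - drop the old column only after v1.0 is fully retired
ALTER TABLE users DROP COLUMN name;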

Canary Testing

Test with small percentage of traffic:

Use Istio or Nginx Ingress for traffic splitting:

apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: web-app
spec:
  hosts:
  - web-app
  http:
  - match:
    - headers:
        canary:
          exact: "true"
    route:
    - destination:
        host: web-app
        subset: green
  - route:
    - destination:
        host: web-app
        subset: blue
      weight: 90
    - destination:
        host: web-app
        subset: green
      weight: 10

10% traffic to green, 90% to blue.
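
The blue and green subsets referenced above are not defined in the VirtualService itself; Istio expects a DestinationRule that maps each subset to the version label:

apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: web-app
spec:
  host: web-app
  subsets:
  - name: blue
    labels:
      version: blue
  - name: green
    labels:
      version: green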

Monitoring During Switch

Watch key metrics:

# Errors in logs (a running count, not a true rate)
watch -n 1 'kubectl logs deployment/web-app-green | grep -c ERROR'

# Resource usage (kubectl top shows CPU/memory, not request rate)
watch -n 1 'kubectl top pods -l version=green'

Prometheus queries:

# Error rate
rate(http_requests_total{status=~"5..", version="green"}[1m])

# Latency
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket{version="green"}[1m]))
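
To avoid staring at dashboards during the switch, the first query can drive an alert. A minimal sketch, assuming the same metric labels (the 0.05 req/s threshold is an example, not a recommendation):

groups:
- name: blue-green
  rules:
  - alert: GreenErrorRateHigh
    expr: rate(http_requests_total{status=~"5..", version="green"}[1m]) > 0.05
    for: 2m
    annotations:
      summary: Green is serving 5xx responses above the expected rate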

Cost Consideration

Blue-green requires 2x resources during deployment.

Optimization: Scale down blue after the switch

# After successful switch
kubectl scale deployment web-app-blue --replicas=2

# Keep 2 pods for quick rollback
# Delete after 24 hours if no issues

Real-World Example

Our production deployment:

# 1. Deploy green
kubectl apply -f deployment-green.yaml

# 2. Wait for ready
kubectl wait --for=condition=available deployment/web-app-green

# 3. Smoke test
curl -f http://web-app-green-test/health
curl -f http://web-app-green-test/api/users

# 4. Switch 10% traffic (canary)
# (using Istio VirtualService)

# 5. Monitor for 10 minutes
# Check Grafana dashboards, error rates

# 6. Switch 100% traffic
kubectl patch service web-app -p '{"spec":{"selector":{"version":"green"}}}'

# 7. Monitor for 1 hour

# 8. Scale down blue
kubectl scale deployment web-app-blue --replicas=2

# 9. Delete blue after 24 hours
kubectl delete deployment web-app-blue

Rollback Story

Deployed v2.5 during business hours:

14:00 - Deploy to green
14:05 - Switch traffic
14:07 - Error rate spike! (bug in payment processing)
14:08 - Rollback to blue
14:09 - Error rate normal

Total impact: 2 minutes, 50 failed payments.

Without blue-green, rollback would have taken 15+ minutes.

Comparison with Other Strategies

Strategy         Downtime   Rollback Time   Resource Cost
Recreate         2-3 min    5-10 min        1x
Rolling Update   0          5-10 min        1.1x
Blue-Green       0          < 1 min         2x
Canary           0          < 1 min         1.1-2x

Blue-green: Best for critical services where instant rollback is essential.

Lessons Learned

  1. Test green thoroughly - Don’t rush the switch
  2. Monitor closely - Watch metrics during switch
  3. Keep blue running - For at least 1 hour after switch
  4. Automate - Manual switches are error-prone
  5. Database migrations - Plan carefully

Results

Before:

  • 2-3 min downtime per deploy
  • Deploys only at 2 AM
  • Rollback takes 15 min
  • 2-3 deploys per week

After:

  • Zero downtime
  • Deploy anytime
  • Rollback in < 1 min
  • 10-15 deploys per week

Conclusion

Blue-green deployment enables zero-downtime releases and instant rollback. Essential for high-availability services.

Key takeaways:

  1. Two identical environments (blue and green)
  2. Switch traffic instantly
  3. Rollback in seconds
  4. Requires 2x resources temporarily
  5. Plan database migrations carefully

If downtime is unacceptable, implement blue-green deployment. Your users will never know you deployed.