Zero-Downtime Database Migrations: Strategies for Production Systems
We needed to add a column to our users table. 50 million rows. ALTER TABLE would lock the table for 2 hours. Unacceptable.
So I learned zero-downtime migration strategies and added the column while the site stayed up. Users didn’t notice. Here’s how.
The Problem
Traditional migration:
ALTER TABLE users ADD COLUMN email VARCHAR(255);
On a large table:
- Locks table for hours
- All writes blocked
- Application errors
- Angry users
We needed something better.
Backward-Compatible Changes
Rule: Old code must work with new schema.
Bad (breaking):
-- Rename column (breaks old code)
ALTER TABLE users RENAME COLUMN name TO full_name;
Good (compatible):
-- Add new column
ALTER TABLE users ADD COLUMN full_name VARCHAR(255);
-- Copy data
UPDATE users SET full_name = name WHERE full_name IS NULL;
-- Later, once no code reads the old column, drop it
ALTER TABLE users DROP COLUMN name;
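While both columns exist, readers can tolerate half-migrated rows by coalescing. A minimal sketch, assuming a psycopg2-style cursor and the column names above (the helper name is made up for illustration):
import the helper into whatever code reads display names during the transition.
def get_display_name(cur, user_id):
    # Prefer the new column, fall back to the old one while the copy runs
    cur.execute(
        "SELECT COALESCE(full_name, name) FROM users WHERE id = %s",
        (user_id,),
    )
    row = cur.fetchone()
    return row[0] if row else None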
Multi-Phase Migration
Phase 1: Add new column
ALTER TABLE users ADD COLUMN email VARCHAR(255);
Deploy code that writes to both columns.
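A minimal sketch of what "writes to both columns" looks like in application code, assuming a psycopg2-style cursor and the column names from this example (the helper name is illustrative):
def set_user_email(cur, user_id, email):
    # Keep the old and new columns in lockstep until backfill and cutover are done
    cur.execute(
        "UPDATE users SET old_email = %s, email = %s WHERE id = %s",
        (email, email, user_id),
    )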
Phase 2: Backfill data
UPDATE users SET email = old_email WHERE email IS NULL;
Phase 3: Make required
ALTER TABLE users ALTER COLUMN email SET NOT NULL;
Phase 4: Drop old column
ALTER TABLE users DROP COLUMN old_email;
Online Schema Change
Use pt-online-schema-change (Percona Toolkit):
pt-online-schema-change \
--alter "ADD COLUMN email VARCHAR(255)" \
D=production,t=users \
--execute
How it works:
- Creates new table with changes
- Copies data in chunks
- Keeps tables in sync with triggers
- Swaps tables atomically
Zero downtime!
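One way to script it is to always run --dry-run (which creates and alters the new table but copies nothing) before --execute. A sketch, with the DSN and wrapper function made up for illustration:
import subprocess

def online_alter(dsn, alter_sql):
    base = ["pt-online-schema-change", "--alter", alter_sql, dsn]
    # Dry run: builds the new table definition, no triggers, no data copy
    subprocess.run(base + ["--dry-run"], check=True)
    # Real run: copy in chunks, sync via triggers, swap atomically
    subprocess.run(base + ["--execute"], check=True)

online_alter("D=production,t=users", "ADD COLUMN email VARCHAR(255)")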
PostgreSQL Concurrent Index
Bad (locks table):
CREATE INDEX idx_email ON users(email);
Good (no lock):
CREATE INDEX CONCURRENTLY idx_email ON users(email);
Takes longer but doesn’t block writes!
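One caveat: CREATE INDEX CONCURRENTLY cannot run inside a transaction block, and many migration tools open one by default. A minimal sketch using psycopg2 in autocommit mode (connection string is a placeholder):
import psycopg2

conn = psycopg2.connect("dbname=production")
conn.autocommit = True  # CONCURRENTLY is not allowed inside a transaction
with conn.cursor() as cur:
    cur.execute(
        "CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_email ON users(email)"
    )
conn.close()
If the build fails partway, PostgreSQL leaves behind an INVALID index; drop it and re-run.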
Adding NOT NULL Column
Bad (on PostgreSQL versions before 11, this rewrites the whole table under an exclusive lock):
ALTER TABLE users ADD COLUMN status VARCHAR(20) NOT NULL DEFAULT 'active';
Good (multi-step):
Step 1: Add nullable column
ALTER TABLE users ADD COLUMN status VARCHAR(20);
Step 2: Set default for new rows
ALTER TABLE users ALTER COLUMN status SET DEFAULT 'active';
Step 3: Backfill existing rows (batch this on a large table; see the sketch after Step 4)
UPDATE users SET status = 'active' WHERE status IS NULL;
Step 4: Add NOT NULL constraint
ALTER TABLE users ALTER COLUMN status SET NOT NULL;
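On a 50-million-row table, even Step 3 deserves batching so each UPDATE stays short. A sketch of the batched version, assuming a psycopg2-style connection in autocommit mode (batch size and sleep are arbitrary knobs):
import time

def backfill_status(conn, batch_size=10000):
    with conn.cursor() as cur:
        while True:
            # Touch at most batch_size rows per statement so locks stay short
            cur.execute(
                "UPDATE users SET status = 'active' "
                "WHERE id IN (SELECT id FROM users WHERE status IS NULL LIMIT %s)",
                (batch_size,),
            )
            if cur.rowcount == 0:
                break
            time.sleep(0.1)  # give other writers and replication some room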
Dual-Write Pattern
Migrating to a new table:
Phase 1: Create new table
CREATE TABLE users_v2 (
id BIGSERIAL PRIMARY KEY,
email VARCHAR(255) NOT NULL,
full_name VARCHAR(255),
created_at TIMESTAMP DEFAULT NOW()
);
Phase 2: Dual-write code
def create_user(email, name):
    # Write to old table first (still the source of truth during the migration)
    old_user = db.execute(
        "INSERT INTO users (email, name) VALUES (%s, %s) RETURNING id",
        (email, name)
    )
    # Write to new table with the same id; the Phase 3 backfill
    # (ON CONFLICT DO NOTHING) picks up any rows this insert misses
    db.execute(
        "INSERT INTO users_v2 (id, email, full_name) VALUES (%s, %s, %s)",
        (old_user.id, email, name)
    )
    return old_user
Phase 3: Backfill data
import time

def backfill_users(batch_size=1000):
    last_id = 0
    while True:
        users = db.execute(
            "SELECT * FROM users WHERE id > %s ORDER BY id LIMIT %s",
            (last_id, batch_size)
        )
        if not users:
            break
        for user in users:
            db.execute(
                "INSERT INTO users_v2 (id, email, full_name) VALUES (%s, %s, %s) "
                "ON CONFLICT DO NOTHING",
                (user.id, user.email, user.name)
            )
        last_id = users[-1].id
        time.sleep(0.1)  # don’t overload the database
Phase 4: Switch reads
def get_user(user_id):
    # Read from new table
    return db.execute(
        "SELECT * FROM users_v2 WHERE id = %s",
        (user_id,)
    )
Phase 5: Stop dual-write
def create_user(email, name):
    # Only write to new table
    return db.execute(
        "INSERT INTO users_v2 (email, full_name) VALUES (%s, %s) RETURNING id",
        (email, name)
    )
Phase 6: Drop old table
-- Run both statements in one transaction so the swap is atomic,
-- and ship code that targets the final name (users) alongside it
BEGIN;
DROP TABLE users;
ALTER TABLE users_v2 RENAME TO users;
COMMIT;
Data Validation
Verify the migration before switching reads:
def validate_migration():
    # Count rows
    old_count = db.execute("SELECT COUNT(*) FROM users").fetchone()[0]
    new_count = db.execute("SELECT COUNT(*) FROM users_v2").fetchone()[0]
    assert old_count == new_count, f"Row count mismatch: {old_count} vs {new_count}"

    # Sample data
    sample = db.execute("SELECT * FROM users ORDER BY RANDOM() LIMIT 1000")
    for row in sample:
        new_row = db.execute("SELECT * FROM users_v2 WHERE id = %s", (row.id,)).fetchone()
        assert row.email == new_row.email
        assert row.name == new_row.full_name

    print("Validation passed!")
Rollback Plan
Always have rollback:
# Rollback Phase 4 (switch reads back to the old table)
def get_user(user_id):
    return db.execute("SELECT * FROM users WHERE id = %s", (user_id,))

# Rollback Phase 3 (backfill): just stop the backfill process

# Rollback Phase 2 (stop dual-write)
def create_user(email, name):
    return db.execute(
        "INSERT INTO users (email, name) VALUES (%s, %s) RETURNING id",
        (email, name)
    )

There is no cheap rollback for Phase 6, so only drop the old table after everything else has run clean for a while.
Monitoring Migration
Track progress:
-- Check backfill progress
SELECT
(SELECT COUNT(*) FROM users_v2) * 100.0 / (SELECT COUNT(*) FROM users) AS percent_complete;
-- Check lag
SELECT
MAX(id) AS max_old_id,
(SELECT MAX(id) FROM users_v2) AS max_new_id,
MAX(id) - (SELECT MAX(id) FROM users_v2) AS lag
FROM users;
Prometheus metrics:
from prometheus_client import Gauge

migration_progress = Gauge('migration_progress_percent', 'Migration progress')
migration_lag = Gauge('migration_lag_rows', 'Migration lag in rows')

def update_metrics():
    old_count = db.execute("SELECT COUNT(*) FROM users").fetchone()[0]
    new_count = db.execute("SELECT COUNT(*) FROM users_v2").fetchone()[0]
    migration_progress.set(new_count * 100.0 / old_count)
    migration_lag.set(old_count - new_count)
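To actually expose those gauges, the exporter needs an HTTP endpoint and something refreshing them. A minimal sketch using prometheus_client's start_http_server, reusing update_metrics from above (port and interval are arbitrary):
import time
from prometheus_client import start_http_server

if __name__ == "__main__":
    start_http_server(8000)  # serves /metrics for Prometheus to scrape
    while True:
        update_metrics()
        time.sleep(30)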
Real-World Example
Migrating the users table (50M rows, 500 GB):
Timeline:
- Day 1: Deploy dual-write code
- Day 2-7: Backfill data (1M rows/hour)
- Day 8: Validate data
- Day 9: Switch reads to new table
- Day 10: Monitor for issues
- Day 11: Stop dual-write
- Day 12: Drop old table
Downtime: 0 seconds
Common Pitfalls
1. Forgetting indexes:
-- Add indexes BEFORE switching reads
CREATE INDEX CONCURRENTLY idx_users_v2_email ON users_v2(email);
2. Not testing rollback:
# Test rollback in staging
./deploy.sh phase4
./rollback.sh phase4
./deploy.sh phase4 # Should work again
3. Skipping validation:
# Always validate before proceeding
validate_migration()
Results
Before:
- 2-hour downtime for schema changes
- Risky deployments
- Angry users
After:
- Zero downtime migrations
- Gradual, safe rollout
- Happy users
Lessons Learned
- Always backward-compatible: old code must keep working
- Multi-phase approach: don’t rush
- Validate thoroughly before switching
- Have a rollback plan, and test it
- Monitor progress so you always know where you are
Conclusion
Zero-downtime database migrations are possible with proper planning and execution.
Key takeaways:
- Backward-compatible changes
- Multi-phase deployment
- Dual-write pattern
- Thorough validation
- Tested rollback plan
Never take downtime for database migrations again.