Zero-Downtime Database Migrations: Strategies for Production Systems
We needed to add a column to our users table. 50 million rows. ALTER TABLE would lock the table for 2 hours. Unacceptable.
So I learned zero-downtime migration strategies and added the column while the site stayed up. Users didn’t notice. Here’s how.
The Problem
Traditional migration:
ALTER TABLE users ADD COLUMN email VARCHAR(255);
On a large table:
- Locks table for hours
- All writes blocked
- Application errors
- Angry users
We needed something better.
Backward-Compatible Changes
Rule: Old code must work with new schema.
Bad (breaking):
-- Rename column (breaks old code)
ALTER TABLE users RENAME COLUMN name TO full_name;
Good (compatible):
-- Add new column
ALTER TABLE users ADD COLUMN full_name VARCHAR(255);
-- Copy data
UPDATE users SET full_name = name WHERE full_name IS NULL;
-- Later, once no code reads the old column, drop it
ALTER TABLE users DROP COLUMN name;
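While both columns exist, readers can tolerate half-migrated rows by coalescing. A minimal sketch, assuming a psycopg2-style cursor and the column names above (the helper name is made up for illustration):
import the helper into whatever code reads display names during the transition.
def get_display_name(cur, user_id):
    # Prefer the new column, fall back to the old one while the copy runs
    cur.execute(
        "SELECT COALESCE(full_name, name) FROM users WHERE id = %s",
        (user_id,),
    )
    row = cur.fetchone()
    return row[0] if row else None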
Multi-Phase Migration
Phase 1: Add new column
ALTER TABLE users ADD COLUMN email VARCHAR(255);
Deploy code that writes to both columns.
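A minimal sketch of what "writes to both columns" looks like in application code, assuming a psycopg2-style cursor and the column names from this example (the helper name is illustrative):
def set_user_email(cur, user_id, email):
    # Keep the old and new columns in lockstep until backfill and cutover are done
    cur.execute(
        "UPDATE users SET old_email = %s, email = %s WHERE id = %s",
        (email, email, user_id),
    )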
Phase 2: Backfill data
UPDATE users SET email = old_email WHERE email IS NULL;
Phase 3: Make required
ALTER TABLE users ALTER COLUMN email SET NOT NULL;
Phase 4: Drop old column
ALTER TABLE users DROP COLUMN old_email;
Online Schema Change
Use pt-online-schema-change (Percona Toolkit):
pt-online-schema-change \
--alter "ADD COLUMN email VARCHAR(255)" \
D=production,t=users \
--execute
How it works:
- Creates new table with changes
- Copies data in chunks
- Keeps tables in sync with triggers
- Swaps tables atomically
Zero downtime!
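One way to script it is to always run --dry-run (which creates and alters the new table but copies nothing) before --execute. A sketch, with the DSN and wrapper function made up for illustration:
import subprocess

def online_alter(dsn, alter_sql):
    base = ["pt-online-schema-change", "--alter", alter_sql, dsn]
    # Dry run: builds the new table definition, no triggers, no data copy
    subprocess.run(base + ["--dry-run"], check=True)
    # Real run: copy in chunks, sync via triggers, swap atomically
    subprocess.run(base + ["--execute"], check=True)

online_alter("D=production,t=users", "ADD COLUMN email VARCHAR(255)")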
PostgreSQL Concurrent Index
Bad (locks table):
CREATE INDEX idx_email ON users(email);
Good (no lock):
CREATE INDEX CONCURRENTLY idx_email ON users(email);
Takes longer but doesn’t block writes!
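One caveat: CREATE INDEX CONCURRENTLY cannot run inside a transaction block, and many migration tools open one by default. A minimal sketch using psycopg2 in autocommit mode (connection string is a placeholder):
import psycopg2

conn = psycopg2.connect("dbname=production")
conn.autocommit = True  # CONCURRENTLY is not allowed inside a transaction
with conn.cursor() as cur:
    cur.execute(
        "CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_email ON users(email)"
    )
conn.close()
If the build fails partway, PostgreSQL leaves behind an INVALID index; drop it and re-run.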
Adding NOT NULL Column
Bad (on PostgreSQL versions before 11, this rewrites the whole table under an exclusive lock):
ALTER TABLE users ADD COLUMN status VARCHAR(20) NOT NULL DEFAULT 'active';
Good (multi-step):
Step 1: Add nullable column
ALTER TABLE users ADD COLUMN status VARCHAR(20);
Step 2: Set default for new rows
ALTER TABLE users ALTER COLUMN status SET DEFAULT 'active';
Step 3: Backfill existing rows (batch this on a large table; see the sketch after Step 4)
UPDATE users SET status = 'active' WHERE status IS NULL;
Step 4: Add NOT NULL constraint
ALTER TABLE users ALTER COLUMN status SET NOT NULL;
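On a 50-million-row table, even Step 3 deserves batching so each UPDATE stays short. A sketch of the batched version, assuming a psycopg2-style connection in autocommit mode (batch size and sleep are arbitrary knobs):
import time

def backfill_status(conn, batch_size=10000):
    with conn.cursor() as cur:
        while True:
            # Touch at most batch_size rows per statement so locks stay short
            cur.execute(
                "UPDATE users SET status = 'active' "
                "WHERE id IN (SELECT id FROM users WHERE status IS NULL LIMIT %s)",
                (batch_size,),
            )
            if cur.rowcount == 0:
                break
            time.sleep(0.1)  # give other writers and replication some room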
Dual-Write Pattern
Migrating to a new table:
Phase 1: Create new table
CREATE TABLE users_v2 (
id BIGSERIAL PRIMARY KEY,
email VARCHAR(255) NOT NULL,
full_name VARCHAR(255),
created_at TIMESTAMP DEFAULT NOW()
);
Phase 2: Dual-write code
def create_user(email, name):
    # Write to old table first (still the source of truth during the migration)
    old_user = db.execute(
        "INSERT INTO users (email, name) VALUES (%s, %s) RETURNING id",
        (email, name)
    )
    # Write to new table with the same id; the Phase 3 backfill
    # (ON CONFLICT DO NOTHING) picks up any rows this insert misses
    db.execute(
        "INSERT INTO users_v2 (id, email, full_name) VALUES (%s, %s, %s)",
        (old_user.id, email, name)
    )
    return old_user
Phase 3: Backfill data
import time

def backfill_users(batch_size=1000):
    last_id = 0
    while True:
        users = db.execute(
            "SELECT * FROM users WHERE id > %s ORDER BY id LIMIT %s",
            (last_id, batch_size)
        )
        if not users:
            break
        for user in users:
            db.execute(
                "INSERT INTO users_v2 (id, email, full_name) VALUES (%s, %s, %s) "
                "ON CONFLICT DO NOTHING",
                (user.id, user.email, user.name)
            )
        last_id = users[-1].id
        time.sleep(0.1)  # don’t overload the database
Phase 4: Switch reads
def get_user(user_id):
    # Read from new table
    return db.execute(
        "SELECT * FROM users_v2 WHERE id = %s",
        (user_id,)
    )
Phase 5: Stop dual-write
def create_user(email, name):
    # Only write to new table
    return db.execute(
        "INSERT INTO users_v2 (email, full_name) VALUES (%s, %s) RETURNING id",
        (email, name)
    )
Phase 6: Drop old table
-- Run both statements in one transaction so the swap is atomic,
-- and ship code that targets the final name (users) alongside it
BEGIN;
DROP TABLE users;
ALTER TABLE users_v2 RENAME TO users;
COMMIT;
Data Validation
Verify the migration before switching reads:
def validate_migration():
    # Count rows
    old_count = db.execute("SELECT COUNT(*) FROM users").fetchone()[0]
    new_count = db.execute("SELECT COUNT(*) FROM users_v2").fetchone()[0]
    assert old_count == new_count, f"Row count mismatch: {old_count} vs {new_count}"

    # Sample data
    sample = db.execute("SELECT * FROM users ORDER BY RANDOM() LIMIT 1000")
    for row in sample:
        new_row = db.execute("SELECT * FROM users_v2 WHERE id = %s", (row.id,)).fetchone()
        assert row.email == new_row.email
        assert row.name == new_row.full_name

    print("Validation passed!")
Rollback Plan
Always have rollback:
# Rollback Phase 4 (switch reads back to the old table)
def get_user(user_id):
    return db.execute("SELECT * FROM users WHERE id = %s", (user_id,))

# Rollback Phase 3 (backfill): just stop the backfill process

# Rollback Phase 2 (stop dual-write)
def create_user(email, name):
    return db.execute(
        "INSERT INTO users (email, name) VALUES (%s, %s) RETURNING id",
        (email, name)
    )

There is no cheap rollback for Phase 6, so only drop the old table after everything else has run clean for a while.
Monitoring Migration
Track progress:
-- Check backfill progress
SELECT
(SELECT COUNT(*) FROM users_v2) * 100.0 / (SELECT COUNT(*) FROM users) AS percent_complete;
-- Check lag
SELECT
MAX(id) AS max_old_id,
(SELECT MAX(id) FROM users_v2) AS max_new_id,
MAX(id) - (SELECT MAX(id) FROM users_v2) AS lag
FROM users;
Prometheus metrics:
from prometheus_client import Gauge

migration_progress = Gauge('migration_progress_percent', 'Migration progress')
migration_lag = Gauge('migration_lag_rows', 'Migration lag in rows')

def update_metrics():
    old_count = db.execute("SELECT COUNT(*) FROM users").fetchone()[0]
    new_count = db.execute("SELECT COUNT(*) FROM users_v2").fetchone()[0]
    migration_progress.set(new_count * 100.0 / old_count)
    migration_lag.set(old_count - new_count)
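To actually expose those gauges, the exporter needs an HTTP endpoint and something refreshing them. A minimal sketch using prometheus_client's start_http_server, reusing update_metrics from above (port and interval are arbitrary):
import time
from prometheus_client import start_http_server

if __name__ == "__main__":
    start_http_server(8000)  # serves /metrics for Prometheus to scrape
    while True:
        update_metrics()
        time.sleep(30)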
Real-World Example
Migrating the users table (50M rows, 500 GB):
Timeline:
- Day 1: Deploy dual-write code
- Day 2-7: Backfill data (1M rows/hour)
- Day 8: Validate data
- Day 9: Switch reads to new table
- Day 10: Monitor for issues
- Day 11: Stop dual-write
- Day 12: Drop old table
Downtime: 0 seconds
Common Pitfalls
1. Forgetting indexes:
-- Add indexes BEFORE switching reads
CREATE INDEX CONCURRENTLY idx_users_v2_email ON users_v2(email);
2. Not testing rollback:
# Test rollback in staging
./deploy.sh phase4
./rollback.sh phase4
./deploy.sh phase4 # Should work again
3. Skipping validation:
# Always validate before proceeding
validate_migration()
Results
Before:
- 2-hour downtime for schema changes
- Risky deployments
- Angry users
After:
- Zero downtime migrations
- Gradual, safe rollout
- Happy users
Lessons Learned
- Always backward-compatible: old code must keep working
- Multi-phase approach: don’t rush
- Validate thoroughly before switching
- Have a rollback plan, and test it
- Monitor progress so you always know where you are
Conclusion
Zero-downtime database migrations are possible with proper planning and execution.
Key takeaways:
- Backward-compatible changes
- Multi-phase deployment
- Dual-write pattern
- Thorough validation
- Tested rollback plan
Never take downtime for database migrations again.