PostgreSQL Backup and Recovery: Lessons from a 3AM Data Loss Incident
3 AM. Phone rings. “Database is corrupted. We lost 6 hours of data.” My heart sank. We had backups, but they were 24 hours old.
I rebuilt our backup strategy from scratch. Now we have point-in-time recovery, automated backups every hour, and tested restore procedures. We can recover to any second in the last 30 days.
The Disaster
What happened:
- Disk failure on primary database
- Last backup: 24 hours old
- Lost: 6 hours of customer orders
- Impact: $50K in lost revenue
- Recovery time: 8 hours
We learned the hard way.
Backup Types
Logical Backup (pg_dump):
- SQL statements to recreate data
- Portable across versions
- Slower for large databases
Physical Backup (pg_basebackup):
- Copy of data directory
- Faster for large databases
- Same PostgreSQL version required
Logical Backup with pg_dump
Basic dump:
pg_dump -U postgres -d production > backup.sql
Compressed:
pg_dump -U postgres -d production | gzip > backup.sql.gz
Custom format (recommended):
pg_dump -U postgres -d production -F c -f backup.dump
Benefits:
- Parallel restore
- Selective restore
- Better compression
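Those benefits come via pg_restore. A quick sketch (database and table names are placeholders):
# Parallel restore with 4 workers
pg_restore -U postgres -d production -j 4 backup.dump
# Restore only the orders table
pg_restore -U postgres -d production -t orders backup.dump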
Automated Backup Script
backup.sh:
#!/bin/bash
set -euo pipefail

BACKUP_DIR="/var/backups/postgresql"
DATE=$(date +%Y%m%d_%H%M%S)
DB_NAME="production"
RETENTION_DAYS=30
BACKUP_FILE="$BACKUP_DIR/${DB_NAME}_${DATE}.dump"

# Create backup; alert immediately if the dump itself fails
if ! pg_dump -U postgres -d "$DB_NAME" -F c -f "$BACKUP_FILE"; then
    echo "Backup failed!" | mail -s "PostgreSQL Backup Failed" admin@example.com
    exit 1
fi

# Compress (custom format is already compressed; gzip squeezes out a bit more)
gzip "$BACKUP_FILE"

# Upload to S3
aws s3 cp "$BACKUP_FILE.gz" s3://backups/postgresql/

# Delete old local backups
find "$BACKUP_DIR" -name "*.dump.gz" -mtime +"$RETENTION_DAYS" -delete

echo "Backup successful: ${DB_NAME}_${DATE}.dump.gz"
Cron job:
# Hourly backups
0 * * * * /usr/local/bin/backup.sh
# Daily full backup at 2 AM
0 2 * * * /usr/local/bin/backup_full.sh
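The hourly job and the 2 AM full backup can collide; wrapping both in flock keeps two dumps from running at once (a sketch; the lock file path is arbitrary):
0 * * * * flock -n /var/lock/pg_backup.lock /usr/local/bin/backup.sh
0 2 * * * flock -n /var/lock/pg_backup.lock /usr/local/bin/backup_full.sh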
Physical Backup with pg_basebackup
Create base backup:
pg_basebackup -U postgres -D /var/backups/base -F tar -z -P
Options:
- -D: output directory
- -F tar: TAR format
- -z: compress
- -P: show progress
Point-in-Time Recovery (PITR)
Enable WAL archiving:
postgresql.conf:
wal_level = replica
archive_mode = on
archive_command = 'test ! -f /var/lib/postgresql/wal_archive/%f && cp %p /var/lib/postgresql/wal_archive/%f'
max_wal_senders = 3
Restart PostgreSQL:
sudo systemctl restart postgresql
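Before trusting the archive, confirm segments are actually being shipped. Forcing a WAL switch and checking pg_stat_archiver does it:
psql -U postgres -c "SELECT pg_switch_wal();"
psql -U postgres -c "SELECT archived_count, last_archived_wal, failed_count FROM pg_stat_archiver;"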
Creating PITR Backup
- Start base backup:
pg_basebackup -U postgres -D /var/backups/pitr_base -F tar -z -X fetch
- WAL files are archived automatically to /var/lib/postgresql/wal_archive/
Restoring from PITR
Scenario: Recover to 2018-03-28 10:30:00
- Stop PostgreSQL:
sudo systemctl stop postgresql
- Move current data:
mv /var/lib/postgresql/12/main /var/lib/postgresql/12/main.old
- Extract base backup (ownership and permissions must be correct or PostgreSQL refuses to start):
mkdir /var/lib/postgresql/12/main
tar -xzf /var/backups/pitr_base/base.tar.gz -C /var/lib/postgresql/12/main
chown -R postgres:postgres /var/lib/postgresql/12/main
chmod 700 /var/lib/postgresql/12/main
- Configure recovery. PostgreSQL 12 removed recovery.conf, so add these lines to postgresql.conf (pre-12: put them in recovery.conf):
restore_command = 'cp /var/lib/postgresql/wal_archive/%f %p'
recovery_target_time = '2018-03-28 10:30:00'
recovery_target_action = 'promote'
- Signal recovery mode (12+ only):
touch /var/lib/postgresql/12/main/recovery.signal
- Start PostgreSQL:
sudo systemctl start postgresql
Database recovers to exact point in time!
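Once the target is reached, verify the server has promoted and is out of recovery:
psql -U postgres -c "SELECT pg_is_in_recovery();"
A result of 'f' means recovery finished and the database accepts writes.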
Continuous Archiving to S3
Archive WAL to S3:
postgresql.conf:
archive_command = 'aws s3 cp %p s3://backups/postgresql/wal/%f'
Restore from S3 (same placement as above: postgresql.conf plus recovery.signal on 12+, recovery.conf before):
restore_command = 'aws s3 cp s3://backups/postgresql/wal/%f %p'
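A quick sanity check that segments are landing in the bucket:
aws s3 ls s3://backups/postgresql/wal/ | tail -5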
Backup Verification
Test restore monthly:
#!/bin/bash
set -euo pipefail

# Restore into a fresh scratch database
createdb -U postgres test_restore
pg_restore -U postgres -d test_restore backup.dump

# Run validation queries
psql -U postgres -d test_restore -c "SELECT COUNT(*) FROM users;"
psql -U postgres -d test_restore -c "SELECT COUNT(*) FROM orders;"

# Compare with production (-tA strips headers and whitespace)
PROD_COUNT=$(psql -U postgres -d production -tA -c "SELECT COUNT(*) FROM users;")
TEST_COUNT=$(psql -U postgres -d test_restore -tA -c "SELECT COUNT(*) FROM users;")
if [ "$PROD_COUNT" -eq "$TEST_COUNT" ]; then
    echo "Backup verification successful"
else
    echo "Backup verification failed!" | mail -s "Backup Verification Failed" admin@example.com
fi

# Cleanup
dropdb -U postgres test_restore
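We trigger it from cron on the first of each month (the script path is a placeholder):
0 3 1 * * /usr/local/bin/verify_backup.sh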
Monitoring Backup Status
Check last backup:
#!/bin/bash
LAST_BACKUP=$(ls -t /var/backups/postgresql/*.dump.gz | head -1)
# GNU stat; on BSD/macOS use 'stat -f %m' instead
BACKUP_AGE=$(( ($(date +%s) - $(stat -c %Y "$LAST_BACKUP")) / 3600 ))
if [ "$BACKUP_AGE" -gt 2 ]; then
    echo "WARNING: Last backup is $BACKUP_AGE hours old"
    # Send alert
fi
Prometheus metrics:
from prometheus_client import Gauge, start_http_server
import os
import time

backup_age = Gauge('postgresql_backup_age_seconds', 'Age of last backup')

def update_backup_age():
    backup_dir = '/var/backups/postgresql'
    files = [os.path.join(backup_dir, f) for f in os.listdir(backup_dir)
             if f.endswith('.dump.gz')]
    if not files:
        return  # no backups yet; leave the gauge at its last value
    latest = max(files, key=os.path.getmtime)
    backup_age.set(time.time() - os.path.getmtime(latest))

if __name__ == '__main__':
    start_http_server(9187)  # expose /metrics for Prometheus to scrape
    while True:
        update_backup_age()
        time.sleep(60)
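An alert rule can then fire on the exposed metric. A sketch of a Prometheus rule, with the threshold matching the 2-hour check above:
groups:
  - name: backups
    rules:
      - alert: PostgreSQLBackupTooOld
        expr: postgresql_backup_age_seconds > 7200
        for: 15m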
Incremental Backups with pgBackRest
Install:
sudo apt-get install pgbackrest
Configure /etc/pgbackrest.conf:
[global]
repo1-path=/var/lib/pgbackrest
repo1-retention-full=2
repo1-retention-diff=4
[production]
pg1-path=/var/lib/postgresql/12/main
Point archive_command at pgBackRest in postgresql.conf, create the stanza once, then take a full backup:
archive_command = 'pgbackrest --stanza=production archive-push %p'
pgbackrest --stanza=production stanza-create
pgbackrest --stanza=production backup --type=full
Incremental backup:
pgbackrest --stanza=production backup --type=incr
Restore:
pgbackrest --stanza=production restore
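pgBackRest can also drive point-in-time restores directly, replacing the manual recovery steps above (a sketch reusing the earlier target time):
sudo systemctl stop postgresql
pgbackrest --stanza=production --type=time "--target=2018-03-28 10:30:00" restore
sudo systemctl start postgresql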
Replication for High Availability
Primary server postgresql.conf:
wal_level = replica
max_wal_senders = 3
wal_keep_segments = 64
Standby server setup:
# Stop standby
sudo systemctl stop postgresql
# Remove data directory
rm -rf /var/lib/postgresql/12/main
# Clone from primary; -R writes standby.signal and primary_conninfo automatically
# (PostgreSQL 12 dropped recovery.conf, standby_mode, and trigger_file)
pg_basebackup -h primary-server -D /var/lib/postgresql/12/main -U replication -P -R
# Start standby
sudo systemctl start postgresql
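Verify replication is flowing from both sides:
# On the primary: one row per connected standby
psql -U postgres -c "SELECT client_addr, state, replay_lsn FROM pg_stat_replication;"
# On the standby: returns 't' while it follows the primary
psql -U postgres -c "SELECT pg_is_in_recovery();"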
Our Production Setup
Backup Strategy:
- Continuous WAL archiving to S3
- Hourly pg_dump (compressed, custom format)
- Daily pg_basebackup at 2 AM
- Weekly full backup to offsite storage
- Monthly restore test
Retention:
- WAL archives: 30 days
- Hourly dumps: 7 days
- Daily backups: 30 days
- Weekly backups: 90 days
Monitoring:
- Backup age alerts (> 2 hours)
- WAL archive lag
- Disk space on backup server
- Monthly restore test results
Disaster Recovery Procedure
1. Assess damage:
   - What data is lost?
   - What time to recover to?
2. Choose recovery method:
   - PITR for specific time
   - Latest backup for full restore
3. Execute recovery:
   - Stop application
   - Restore database
   - Verify data integrity
4. Resume operations:
   - Start application
   - Monitor for issues
   - Communicate with users
5. Post-mortem:
   - What went wrong?
   - How to prevent?
   - Update procedures
Recovery Time Objectives
Our targets:
| Scenario | RTO | RPO |
|---|---|---|
| Disk failure | 30 min | 0 (replica) |
| Data corruption | 1 hour | 1 hour (PITR) |
| Accidental DELETE | 15 min | 1 hour (PITR) |
| Complete disaster | 4 hours | 24 hours |
RTO: Recovery Time Objective
RPO: Recovery Point Objective
Cost Analysis
Monthly costs:
- S3 storage (500GB): $12
- Backup server: $50
- Offsite storage: $20
- Total: $82/month
Cost of data loss: $50K (one incident)
One avoided incident pays for roughly 610 months of backups, about 50x the annual cost.
Results
Before:
- 24-hour old backups
- No PITR
- 8-hour recovery time
- Lost $50K in one incident
After:
- Hourly backups
- PITR to any second
- 30-minute recovery time
- Zero data loss in 18 months
Lessons Learned
- Test restores regularly - Backups are useless if you can’t restore
- Automate everything - Manual backups will be forgotten
- Monitor backup age - Alert if backup is too old
- Use PITR - Point-in-time recovery is essential
- Document procedures - 3 AM is not the time to figure it out
Conclusion
Backups are insurance. You hope you never need them, but when you do, they’re priceless.
Key takeaways:
- Multiple backup types (logical + physical)
- Continuous WAL archiving for PITR
- Automated, tested backups
- Monitor backup health
- Document and practice recovery
Don’t learn this lesson the hard way. Implement proper backups today.