PostgreSQL Backup and Recovery: Lessons from a 3AM Data Loss Incident
3 AM. Phone rings. “Database is corrupted. We lost 6 hours of data.” My heart sank. We had backups, but they were 24 hours old.
I rebuilt our backup strategy from scratch. Now we have point-in-time recovery, automated backups every hour, and tested restore procedures. We can recover to any second in the last 30 days.
The Disaster
What happened:
- Disk failure on primary database
- Last backup: 24 hours old
- Lost: 6 hours of customer orders
- Impact: $50K in lost revenue
- Recovery time: 8 hours
We learned the hard way.
Backup Types
Logical Backup (pg_dump):
- SQL statements to recreate data
- Portable across versions
- Slower for large databases
Physical Backup (pg_basebackup):
- Copy of data directory
- Faster for large databases
- Same PostgreSQL version required
Logical Backup with pg_dump
Basic dump:
pg_dump -U postgres -d production > backup.sql
Compressed:
pg_dump -U postgres -d production | gzip > backup.sql.gz
Custom format (recommended):
pg_dump -U postgres -d production -F c -f backup.dump
Benefits:
- Parallel restore
- Selective restore
- Better compression
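Those benefits come via pg_restore. A quick sketch (database and table names are placeholders):
# Parallel restore with 4 workers
pg_restore -U postgres -d production -j 4 backup.dump
# Restore only the orders table
pg_restore -U postgres -d production -t orders backup.dump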
Automated Backup Script
backup.sh:
#!/bin/bash
set -euo pipefail

BACKUP_DIR="/var/backups/postgresql"
DATE=$(date +%Y%m%d_%H%M%S)
DB_NAME="production"
RETENTION_DAYS=30
BACKUP_FILE="$BACKUP_DIR/${DB_NAME}_${DATE}.dump"

# Create backup; alert immediately if the dump itself fails
if ! pg_dump -U postgres -d "$DB_NAME" -F c -f "$BACKUP_FILE"; then
    echo "Backup failed!" | mail -s "PostgreSQL Backup Failed" admin@example.com
    exit 1
fi

# Compress (custom format is already compressed; gzip squeezes out a bit more)
gzip "$BACKUP_FILE"

# Upload to S3
aws s3 cp "$BACKUP_FILE.gz" s3://backups/postgresql/

# Delete old local backups
find "$BACKUP_DIR" -name "*.dump.gz" -mtime +"$RETENTION_DAYS" -delete

echo "Backup successful: ${DB_NAME}_${DATE}.dump.gz"
Cron job:
# Hourly backups
0 * * * * /usr/local/bin/backup.sh
# Daily full backup at 2 AM
0 2 * * * /usr/local/bin/backup_full.sh
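The hourly job and the 2 AM full backup can collide; wrapping both in flock keeps two dumps from running at once (a sketch; the lock file path is arbitrary):
0 * * * * flock -n /var/lock/pg_backup.lock /usr/local/bin/backup.sh
0 2 * * * flock -n /var/lock/pg_backup.lock /usr/local/bin/backup_full.sh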
Physical Backup with pg_basebackup
Create base backup:
pg_basebackup -U postgres -D /var/backups/base -F tar -z -P
Options:
- -D: output directory
- -F tar: TAR format
- -z: compress
- -P: show progress
Point-in-Time Recovery (PITR)
Enable WAL archiving:
postgresql.conf:
wal_level = replica
archive_mode = on
archive_command = 'test ! -f /var/lib/postgresql/wal_archive/%f && cp %p /var/lib/postgresql/wal_archive/%f'
max_wal_senders = 3
Restart PostgreSQL:
sudo systemctl restart postgresql
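Before trusting the archive, confirm segments are actually being shipped. Forcing a WAL switch and checking pg_stat_archiver does it:
psql -U postgres -c "SELECT pg_switch_wal();"
psql -U postgres -c "SELECT archived_count, last_archived_wal, failed_count FROM pg_stat_archiver;"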
Creating PITR Backup
- Start base backup:
pg_basebackup -U postgres -D /var/backups/pitr_base -F tar -z -X fetch
- WAL files are archived automatically to /var/lib/postgresql/wal_archive/
Restoring from PITR
Scenario: Recover to 2018-03-28 10:30:00
- Stop PostgreSQL:
sudo systemctl stop postgresql
- Move current data:
mv /var/lib/postgresql/12/main /var/lib/postgresql/12/main.old
- Extract base backup (ownership and permissions must be correct or PostgreSQL refuses to start):
mkdir /var/lib/postgresql/12/main
tar -xzf /var/backups/pitr_base/base.tar.gz -C /var/lib/postgresql/12/main
chown -R postgres:postgres /var/lib/postgresql/12/main
chmod 700 /var/lib/postgresql/12/main
- Configure recovery. PostgreSQL 12 removed recovery.conf, so add these lines to postgresql.conf (pre-12: put them in recovery.conf):
restore_command = 'cp /var/lib/postgresql/wal_archive/%f %p'
recovery_target_time = '2018-03-28 10:30:00'
recovery_target_action = 'promote'
- Signal recovery mode (12+ only):
touch /var/lib/postgresql/12/main/recovery.signal
- Start PostgreSQL:
sudo systemctl start postgresql
Database recovers to exact point in time!
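Once the target is reached, verify the server has promoted and is out of recovery:
psql -U postgres -c "SELECT pg_is_in_recovery();"
A result of 'f' means recovery finished and the database accepts writes.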
Continuous Archiving to S3
Archive WAL to S3:
postgresql.conf:
archive_command = 'aws s3 cp %p s3://backups/postgresql/wal/%f'
Restore from S3 (same placement as above: postgresql.conf plus recovery.signal on 12+, recovery.conf before):
restore_command = 'aws s3 cp s3://backups/postgresql/wal/%f %p'
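A quick sanity check that segments are landing in the bucket:
aws s3 ls s3://backups/postgresql/wal/ | tail -5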
Backup Verification
Test restore monthly:
#!/bin/bash
set -euo pipefail

# Restore into a fresh scratch database
createdb -U postgres test_restore
pg_restore -U postgres -d test_restore backup.dump

# Run validation queries
psql -U postgres -d test_restore -c "SELECT COUNT(*) FROM users;"
psql -U postgres -d test_restore -c "SELECT COUNT(*) FROM orders;"

# Compare with production (-tA strips headers and whitespace)
PROD_COUNT=$(psql -U postgres -d production -tA -c "SELECT COUNT(*) FROM users;")
TEST_COUNT=$(psql -U postgres -d test_restore -tA -c "SELECT COUNT(*) FROM users;")
if [ "$PROD_COUNT" -eq "$TEST_COUNT" ]; then
    echo "Backup verification successful"
else
    echo "Backup verification failed!" | mail -s "Backup Verification Failed" admin@example.com
fi

# Cleanup
dropdb -U postgres test_restore
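We trigger it from cron on the first of each month (the script path is a placeholder):
0 3 1 * * /usr/local/bin/verify_backup.sh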
Monitoring Backup Status
Check last backup:
#!/bin/bash
LAST_BACKUP=$(ls -t /var/backups/postgresql/*.dump.gz | head -1)
# GNU stat; on BSD/macOS use 'stat -f %m' instead
BACKUP_AGE=$(( ($(date +%s) - $(stat -c %Y "$LAST_BACKUP")) / 3600 ))
if [ "$BACKUP_AGE" -gt 2 ]; then
    echo "WARNING: Last backup is $BACKUP_AGE hours old"
    # Send alert
fi
Prometheus metrics:
from prometheus_client import Gauge, start_http_server
import os
import time

backup_age = Gauge('postgresql_backup_age_seconds', 'Age of last backup')

def update_backup_age():
    backup_dir = '/var/backups/postgresql'
    files = [os.path.join(backup_dir, f) for f in os.listdir(backup_dir)
             if f.endswith('.dump.gz')]
    if not files:
        return  # no backups yet; leave the gauge at its last value
    latest = max(files, key=os.path.getmtime)
    backup_age.set(time.time() - os.path.getmtime(latest))

if __name__ == '__main__':
    start_http_server(9187)  # expose /metrics for Prometheus to scrape
    while True:
        update_backup_age()
        time.sleep(60)
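An alert rule can then fire on the exposed metric. A sketch of a Prometheus rule, with the threshold matching the 2-hour check above:
groups:
  - name: backups
    rules:
      - alert: PostgreSQLBackupTooOld
        expr: postgresql_backup_age_seconds > 7200
        for: 15m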
Incremental Backups with pgBackRest
Install:
sudo apt-get install pgbackrest
Configure /etc/pgbackrest.conf:
[global]
repo1-path=/var/lib/pgbackrest
repo1-retention-full=2
repo1-retention-diff=4
[production]
pg1-path=/var/lib/postgresql/12/main
Point archive_command at pgBackRest in postgresql.conf, create the stanza once, then take a full backup:
archive_command = 'pgbackrest --stanza=production archive-push %p'
pgbackrest --stanza=production stanza-create
pgbackrest --stanza=production backup --type=full
Incremental backup:
pgbackrest --stanza=production backup --type=incr
Restore:
pgbackrest --stanza=production restore
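pgBackRest can also drive point-in-time restores directly, replacing the manual recovery steps above (a sketch reusing the earlier target time):
sudo systemctl stop postgresql
pgbackrest --stanza=production --type=time "--target=2018-03-28 10:30:00" restore
sudo systemctl start postgresql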
Replication for High Availability
Primary server postgresql.conf:
wal_level = replica
max_wal_senders = 3
wal_keep_segments = 64
Standby server setup:
# Stop standby
sudo systemctl stop postgresql
# Remove data directory
rm -rf /var/lib/postgresql/12/main
# Clone from primary; -R writes standby.signal and primary_conninfo automatically
# (PostgreSQL 12 dropped recovery.conf, standby_mode, and trigger_file)
pg_basebackup -h primary-server -D /var/lib/postgresql/12/main -U replication -P -R
# Start standby
sudo systemctl start postgresql
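Verify replication is flowing from both sides:
# On the primary: one row per connected standby
psql -U postgres -c "SELECT client_addr, state, replay_lsn FROM pg_stat_replication;"
# On the standby: returns 't' while it follows the primary
psql -U postgres -c "SELECT pg_is_in_recovery();"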
Our Production Setup
Backup Strategy:
- Continuous WAL archiving to S3
- Hourly pg_dump (compressed, custom format)
- Daily pg_basebackup at 2 AM
- Weekly full backup to offsite storage
- Monthly restore test
Retention:
- WAL archives: 30 days
- Hourly dumps: 7 days
- Daily backups: 30 days
- Weekly backups: 90 days
Monitoring:
- Backup age alerts (> 2 hours)
- WAL archive lag
- Disk space on backup server
- Monthly restore test results
Disaster Recovery Procedure
1. Assess damage:
   - What data is lost?
   - What time to recover to?
2. Choose recovery method:
   - PITR for specific time
   - Latest backup for full restore
3. Execute recovery:
   - Stop application
   - Restore database
   - Verify data integrity
4. Resume operations:
   - Start application
   - Monitor for issues
   - Communicate with users
5. Post-mortem:
   - What went wrong?
   - How to prevent?
   - Update procedures
Recovery Time Objectives
Our targets:
| Scenario | RTO | RPO |
|---|---|---|
| Disk failure | 30 min | 0 (replica) |
| Data corruption | 1 hour | 1 hour (PITR) |
| Accidental DELETE | 15 min | 1 hour (PITR) |
| Complete disaster | 4 hours | 24 hours |
RTO: Recovery Time Objective
RPO: Recovery Point Objective
Cost Analysis
Monthly costs:
- S3 storage (500GB): $12
- Backup server: $50
- Offsite storage: $20
- Total: $82/month
Cost of data loss: $50K (one incident)
One avoided incident pays for roughly 610 months of backups, about 50x the annual cost.
Results
Before:
- 24-hour old backups
- No PITR
- 8-hour recovery time
- Lost $50K in one incident
After:
- Hourly backups
- PITR to any second
- 30-minute recovery time
- Zero data loss in 18 months
Lessons Learned
- Test restores regularly - Backups are useless if you can’t restore
- Automate everything - Manual backups will be forgotten
- Monitor backup age - Alert if backup is too old
- Use PITR - Point-in-time recovery is essential
- Document procedures - 3 AM is not the time to figure it out
Conclusion
Backups are insurance. You hope you never need them, but when you do, they’re priceless.
Key takeaways:
- Multiple backup types (logical + physical)
- Continuous WAL archiving for PITR
- Automated, tested backups
- Monitor backup health
- Document and practice recovery
Don’t learn this lesson the hard way. Implement proper backups today.