3 AM. Phone rings. “Database is corrupted. We lost 6 hours of data.” My heart sank. We had backups, but they were 24 hours old.

I rebuilt our backup strategy from scratch. Now we have point-in-time recovery, automated backups every hour, and tested restore procedures. We can recover to any second in the last 30 days.

The Disaster

What happened:

  • Disk failure on primary database
  • Last backup: 24 hours old
  • Lost: 6 hours of customer orders
  • Impact: $50K in lost revenue
  • Recovery time: 8 hours

We learned the hard way.

Backup Types

Logical Backup (pg_dump):

  • SQL statements to recreate data
  • Portable across versions
  • Slower for large databases

Physical Backup (pg_basebackup):

  • Copy of data directory
  • Faster for large databases
  • Same PostgreSQL version required

Logical Backup with pg_dump

Basic dump:

pg_dump -U postgres -d production > backup.sql

Compressed:

pg_dump -U postgres -d production | gzip > backup.sql.gz

Custom format (recommended):

pg_dump -U postgres -d production -F c -f backup.dump

Benefits:

  • Parallel restore
  • Selective restore
  • Better compression
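The custom format pays off at restore time. A few pg_restore invocations against a running server (the table name is illustrative):

```shell
# Restore with 4 parallel workers (custom format only)
pg_restore -U postgres -d production -j 4 backup.dump

# Restore just one table (illustrative table name)
pg_restore -U postgres -d production -t orders backup.dump

# List the archive's contents without touching the database
pg_restore -l backup.dump
```

The `-l` listing is handy before a selective restore: it shows exactly which schemas, tables, and indexes the dump contains.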

Automated Backup Script

backup.sh:

#!/bin/bash

BACKUP_DIR="/var/backups/postgresql"
DATE=$(date +%Y%m%d_%H%M%S)
DB_NAME="production"
RETENTION_DAYS=30
BACKUP_FILE="$BACKUP_DIR/${DB_NAME}_${DATE}.dump"

# Create backup; check pg_dump's exit status immediately,
# before later commands overwrite $?
if ! pg_dump -U postgres -d "$DB_NAME" -F c -f "$BACKUP_FILE"; then
    echo "Backup failed!" | mail -s "PostgreSQL Backup Failed" admin@example.com
    exit 1
fi

# Compress (custom format is already compressed; gzip shaves off a little more)
gzip "$BACKUP_FILE"

# Upload to S3
aws s3 cp "${BACKUP_FILE}.gz" s3://backups/postgresql/

# Delete old local backups
find "$BACKUP_DIR" -name "*.dump.gz" -mtime +"$RETENTION_DAYS" -delete

echo "Backup successful: ${DB_NAME}_${DATE}.dump.gz"

Cron job:

# Hourly backups
0 * * * * /usr/local/bin/backup.sh

# Daily full backup at 2 AM
0 2 * * * /usr/local/bin/backup_full.sh

Physical Backup with pg_basebackup

Create base backup:

pg_basebackup -U postgres -D /var/backups/base -F tar -z -P

Options:

  • -D: Output directory
  • -F tar: TAR format
  • -z: Compress
  • -P: Show progress

Point-in-Time Recovery (PITR)

Enable WAL archiving:

postgresql.conf:

wal_level = replica
archive_mode = on
archive_command = 'test ! -f /var/lib/postgresql/wal_archive/%f && cp %p /var/lib/postgresql/wal_archive/%f'
max_wal_senders = 3

Restart PostgreSQL:

sudo systemctl restart postgresql

Creating PITR Backup

  1. Start base backup:
pg_basebackup -U postgres -D /var/backups/pitr_base -F tar -z -X fetch
  2. WAL files are automatically archived to /var/lib/postgresql/wal_archive/
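Before relying on PITR, it's worth confirming the archive_command semantics behave as expected: it must copy a segment once and fail if the file already exists. A quick local simulation, using scratch paths rather than the real cluster:

```shell
# Scratch directories stand in for the cluster's pg_wal and the WAL archive
mkdir -p /tmp/wal_demo/pg_wal /tmp/wal_demo/archive
echo "segment-contents" > /tmp/wal_demo/pg_wal/000000010000000000000001

# Same shape as the configured archive_command: $1 plays %p, $2 plays %f
archive() {
  test ! -f /tmp/wal_demo/archive/"$2" && cp "$1" /tmp/wal_demo/archive/"$2"
}

# First call copies the segment
archive /tmp/wal_demo/pg_wal/000000010000000000000001 000000010000000000000001 \
  && echo "archived"

# Second call fails, so PostgreSQL would retry instead of silently overwriting
archive /tmp/wal_demo/pg_wal/000000010000000000000001 000000010000000000000001 \
  || echo "refused to overwrite"

rm -rf /tmp/wal_demo
```

The refuse-to-overwrite check matters: if archive_command succeeds when the target already exists, a misconfigured second cluster can corrupt your archive.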

Restoring from PITR

Scenario: Recover to 2018-03-28 10:30:00

  1. Stop PostgreSQL:
sudo systemctl stop postgresql
  2. Move the current data directory aside:
mv /var/lib/postgresql/12/main /var/lib/postgresql/12/main.old
  3. Extract the base backup:
mkdir /var/lib/postgresql/12/main
tar -xzf /var/backups/pitr_base/base.tar.gz -C /var/lib/postgresql/12/main
chown -R postgres:postgres /var/lib/postgresql/12/main
chmod 700 /var/lib/postgresql/12/main
  4. Configure recovery. On PostgreSQL 12+ these settings go in postgresql.conf and an empty recovery.signal file triggers recovery; on version 11 and earlier they go in recovery.conf:
restore_command = 'cp /var/lib/postgresql/wal_archive/%f %p'
recovery_target_time = '2018-03-28 10:30:00'
recovery_target_action = 'promote'

sudo -u postgres touch /var/lib/postgresql/12/main/recovery.signal
  5. Start PostgreSQL:
sudo systemctl start postgresql

PostgreSQL replays WAL up to the target time, and the database recovers to that exact point.

Continuous Archiving to S3

Archive WAL to S3:

postgresql.conf:

archive_command = 'aws s3 cp %p s3://backups/postgresql/wal/%f'

Restore from S3:

recovery.conf:

restore_command = 'aws s3 cp s3://backups/postgresql/wal/%f %p'

Backup Verification

Test restore monthly:

#!/bin/bash

# Restore into a scratch database (it must exist before pg_restore runs)
createdb -U postgres test_restore
pg_restore -U postgres -d test_restore backup.dump

# Run validation queries
psql -U postgres -d test_restore -c "SELECT COUNT(*) FROM users;"
psql -U postgres -d test_restore -c "SELECT COUNT(*) FROM orders;"

# Compare row counts with production (-t -A: tuples only, no padding)
PROD_COUNT=$(psql -U postgres -d production -t -A -c "SELECT COUNT(*) FROM users;")
TEST_COUNT=$(psql -U postgres -d test_restore -t -A -c "SELECT COUNT(*) FROM users;")

if [ "$PROD_COUNT" -eq "$TEST_COUNT" ]; then
    echo "Backup verification successful"
else
    echo "Backup verification failed!" | mail -s "Backup Verification Failed" admin@example.com
fi

# Cleanup
dropdb -U postgres test_restore

Monitoring Backup Status

Check last backup:

#!/bin/bash

LAST_BACKUP=$(ls -t /var/backups/postgresql/*.dump.gz | head -1)
# GNU stat (Linux); on macOS/BSD use: stat -f %m
BACKUP_AGE=$(( ($(date +%s) - $(stat -c %Y "$LAST_BACKUP")) / 3600 ))

if [ "$BACKUP_AGE" -gt 2 ]; then
    echo "WARNING: Last backup is $BACKUP_AGE hours old"
    # Send alert
fi

Prometheus metrics:

from prometheus_client import Gauge
import glob
import os
import time

backup_age = Gauge('postgresql_backup_age_seconds', 'Age of last backup')

def update_backup_age():
    # Only consider actual backup artifacts, not stray files in the directory
    files = glob.glob('/var/backups/postgresql/*.dump.gz')
    if not files:
        return
    latest = max(files, key=os.path.getmtime)
    backup_age.set(time.time() - os.path.getmtime(latest))

Incremental Backups with pgBackRest

Install:

sudo apt-get install pgbackrest

Configure /etc/pgbackrest.conf:

[global]
repo1-path=/var/lib/pgbackrest
repo1-retention-full=2
repo1-retention-diff=4

[production]
pg1-path=/var/lib/postgresql/12/main

Create the stanza once, then take a full backup:

pgbackrest --stanza=production stanza-create
pgbackrest --stanza=production backup --type=full

Incremental backup:

pgbackrest --stanza=production backup --type=incr

Restore:

pgbackrest --stanza=production restore
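pgBackRest can also drive point-in-time recovery directly, which pairs with the PITR section above (the timestamp here is illustrative):

```shell
# Restore, then replay WAL only up to the given time.
# --delta reuses unchanged files in the data directory, speeding up large restores.
pgbackrest --stanza=production --delta \
    --type=time "--target=2018-03-28 10:30:00" restore
```

This replaces the manual base-backup extraction and restore_command wiring with one command.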

Replication for High Availability

Primary server postgresql.conf:

wal_level = replica
max_wal_senders = 3
wal_keep_segments = 64

Standby server setup:

# Stop standby
sudo systemctl stop postgresql

# Remove data directory
rm -rf /var/lib/postgresql/12/main

# Clone from primary. -R writes the replication settings for you:
# standby.signal plus primary_conninfo on PostgreSQL 12+,
# recovery.conf on older versions
pg_basebackup -h primary-server -D /var/lib/postgresql/12/main -U replication -P -R

# Start standby
sudo systemctl start postgresql
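When the primary dies, failover is a single promotion on the standby. `pg_ctl promote` works across supported versions; on PostgreSQL 12+ the same thing is available from SQL as `pg_promote()`:

```shell
# Promote the standby to primary
sudo -u postgres pg_ctl promote -D /var/lib/postgresql/12/main

# Or, from a superuser session on the standby:
# psql -U postgres -c "SELECT pg_promote();"
```

After promotion, point the application at the new primary and rebuild the old one as a standby before trusting it again.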

Our Production Setup

Backup Strategy:

  1. Continuous WAL archiving to S3
  2. Hourly pg_dump (compressed, custom format)
  3. Daily pg_basebackup at 2 AM
  4. Weekly full backup to offsite storage
  5. Monthly restore test
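The schedule above maps onto a crontab along these lines (the offsite and restore-test script names are illustrative; only backup.sh and backup_full.sh appear earlier in this post):

```shell
# Hourly pg_dump (compressed, custom format)
0 * * * * /usr/local/bin/backup.sh

# Daily pg_basebackup at 2 AM
0 2 * * * /usr/local/bin/backup_full.sh

# Weekly offsite copy, Sundays at 3 AM (illustrative script)
0 3 * * 0 /usr/local/bin/backup_offsite.sh

# Monthly restore test, first of the month at 4 AM (illustrative script)
0 4 1 * * /usr/local/bin/restore_test.sh
```

WAL archiving isn't scheduled here because it's continuous: PostgreSQL runs archive_command itself as each segment fills.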

Retention:

  • WAL archives: 30 days
  • Hourly dumps: 7 days
  • Daily backups: 30 days
  • Weekly backups: 90 days

Monitoring:

  • Backup age alerts (> 2 hours)
  • WAL archive lag
  • Disk space on backup server
  • Monthly restore test results

Disaster Recovery Procedure

  1. Assess damage:

    • What data is lost?
    • What time to recover to?
  2. Choose recovery method:

    • PITR for specific time
    • Latest backup for full restore
  3. Execute recovery:

    • Stop application
    • Restore database
    • Verify data integrity
  4. Resume operations:

    • Start application
    • Monitor for issues
    • Communicate with users
  5. Post-mortem:

    • What went wrong?
    • How to prevent?
    • Update procedures

Recovery Time Objectives

Our targets:

Scenario             RTO       RPO
Disk failure         30 min    0 (replica)
Data corruption      1 hour    1 hour (PITR)
Accidental DELETE    15 min    1 hour (PITR)
Complete disaster    4 hours   24 hours

RTO: Recovery Time Objective
RPO: Recovery Point Objective

Cost Analysis

Monthly costs:

  • S3 storage (500GB): $12
  • Backup server: $50
  • Offsite storage: $20
  • Total: $82/month

Cost of data loss: $50K (one incident)

ROI: one avoided incident pays for roughly 50 years of backups, about a 51x return on the first year's $984 spend.

Results

Before:

  • 24-hour old backups
  • No PITR
  • 8-hour recovery time
  • Lost $50K in one incident

After:

  • Hourly backups
  • PITR to any second
  • 30-minute recovery time
  • Zero data loss in 18 months

Lessons Learned

  1. Test restores regularly - Backups are useless if you can’t restore
  2. Automate everything - Manual backups will be forgotten
  3. Monitor backup age - Alert if backup is too old
  4. Use PITR - Point-in-time recovery is essential
  5. Document procedures - 3 AM is not the time to figure it out

Conclusion

Backups are insurance. You hope you never need them, but when you do, they’re priceless.

Key takeaways:

  1. Multiple backup types (logical + physical)
  2. Continuous WAL archiving for PITR
  3. Automated, tested backups
  4. Monitor backup health
  5. Document and practice recovery

Don’t learn this lesson the hard way. Implement proper backups today.