Our MySQL server was the single point of failure. When it went down, everything stopped. Read queries were also overwhelming the single server.

I set up master-slave replication with 2 read replicas. Now we have redundancy and 3x read capacity. Last master failure? Promoted slave in 2 minutes.

The Problem

Single MySQL server:

  • All reads and writes
  • No redundancy
  • Slow during peak hours
  • Downtime = complete outage

We needed:

  • Read scaling (multiple read replicas)
  • High availability (failover capability)
  • Backup source (replicas for backups)

Master-Slave Architecture

Master (writes)
├── Slave 1 (reads)
├── Slave 2 (reads)
└── Slave 3 (backup)
  • Master handles all writes
  • Slaves replicate from master
  • Slaves handle read queries
  • Can promote slave to master if master fails

Setting Up Master

MySQL 5.7 on the master server.

Edit /etc/mysql/my.cnf:

[mysqld]
server-id = 1
log_bin = /var/log/mysql/mysql-bin.log
binlog_do_db = production_db
binlog_format = ROW
expire_logs_days = 7
max_binlog_size = 100M

Restart MySQL:

sudo systemctl restart mysql

Create replication user:

CREATE USER 'repl'@'%' IDENTIFIED BY 'strong_password';
GRANT REPLICATION SLAVE ON *.* TO 'repl'@'%';
FLUSH PRIVILEGES;

Get master status:

SHOW MASTER STATUS;

Output:

+------------------+----------+--------------+------------------+
| File             | Position | Binlog_Do_DB | Binlog_Ignore_DB |
+------------------+----------+--------------+------------------+
| mysql-bin.000001 |      154 | production_db|                  |
+------------------+----------+--------------+------------------+

Note the File and Position!
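
If you script the setup, you can also capture these coordinates programmatically instead of copying them by hand. A minimal sketch with pymysql (the hostname and credentials are placeholders for your own):

# get_master_status.py - sketch only; assumes a user allowed to run SHOW MASTER STATUS
import pymysql

conn = pymysql.connect(
    host='master.example.com',
    user='root',
    password='***'
)

with conn.cursor(pymysql.cursors.DictCursor) as cursor:
    cursor.execute("SHOW MASTER STATUS")
    status = cursor.fetchone()

# These are the values to plug into CHANGE MASTER TO on each slave
print(status['File'], status['Position'])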

Setting Up Slave

On slave server, edit /etc/mysql/my.cnf:

[mysqld]
server-id = 2
relay-log = /var/log/mysql/mysql-relay-bin
log_bin = /var/log/mysql/mysql-bin.log
binlog_format = ROW
read_only = 1

The server-id must be unique across the whole topology, so give the master and every slave a different value.

Restart MySQL:

sudo systemctl restart mysql

Initial Data Copy

Stop writes on the master (briefly) by taking a global read lock. While the lock is held, re-run SHOW MASTER STATUS and use the File and Position it reports, since the coordinates noted earlier may have moved:

FLUSH TABLES WITH READ LOCK;

Dump database:

mysqldump -u root -p production_db > production_db.sql

Unlock tables:

UNLOCK TABLES;

Copy dump to slave:

scp production_db.sql slave1:/tmp/

Import on slave:

mysql -u root -p production_db < /tmp/production_db.sql

Configure Replication

On slave:

CHANGE MASTER TO
  MASTER_HOST='master.example.com',
  MASTER_USER='repl',
  MASTER_PASSWORD='strong_password',
  MASTER_LOG_FILE='mysql-bin.000001',
  MASTER_LOG_POS=154;

Use File and Position from master status!

Start replication:

START SLAVE;

Check status:

SHOW SLAVE STATUS\G

Look for:

Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Seconds_Behind_Master: 0

Both thread states should be “Yes” and Seconds_Behind_Master should stay near zero.

Setting Up Multiple Slaves

Repeat the slave configuration, data import, and CHANGE MASTER TO steps for slave2 and slave3, changing only the server-id:

Slave 2:

server-id = 3

Slave 3 (backup only):

server-id = 4

Application Configuration

Split reads and writes:

# database.py
import random

import pymysql

# Master connection (writes)
master = pymysql.connect(
    host='master.example.com',
    user='app',
    password='***',
    database='production_db'
)

# Slave connections (reads)
slaves = [
    pymysql.connect(host='slave1.example.com', ...),
    pymysql.connect(host='slave2.example.com', ...)
]

def get_read_connection():
    # Pick a read replica at random (simple load balancing)
    return random.choice(slaves)

def get_write_connection():
    return master

# Usage
def get_user(user_id):
    conn = get_read_connection()
    cursor = conn.cursor()
    cursor.execute("SELECT * FROM users WHERE id = %s", (user_id,))
    return cursor.fetchone()

def create_user(email, name):
    conn = get_write_connection()
    cursor = conn.cursor()
    cursor.execute("INSERT INTO users (email, name) VALUES (%s, %s)", (email, name))
    conn.commit()
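
One caveat with this split: slaves can lag, so a read issued right after a write may not see that write yet. If that matters for a code path, send the read to the master, or skip replicas that are too far behind. A rough sketch of lag-aware replica selection (the threshold and helper names are mine, and the app user needs the REPLICATION CLIENT privilege to run SHOW SLAVE STATUS):

import random
import pymysql

MAX_LAG_SECONDS = 60  # assumption: up to a minute of staleness is acceptable

def replica_lag(conn):
    # Returns Seconds_Behind_Master for one replica, or None if unknown
    with conn.cursor(pymysql.cursors.DictCursor) as cursor:
        cursor.execute("SHOW SLAVE STATUS")
        row = cursor.fetchone()
    return row['Seconds_Behind_Master'] if row else None

def get_read_connection_lag_aware(slaves, master):
    # Prefer replicas under the lag threshold; fall back to the master
    healthy = []
    for conn in slaves:
        lag = replica_lag(conn)
        if lag is not None and lag <= MAX_LAG_SECONDS:
            healthy.append(conn)
    return random.choice(healthy) if healthy else master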

Monitoring Replication

Check replication lag:

SHOW SLAVE STATUS\G

Key metrics:

Seconds_Behind_Master: 0     # Replication lag in seconds
Slave_IO_Running: Yes        # IO thread running
Slave_SQL_Running: Yes       # SQL thread running
Last_Error:                  # Any errors

Automated Monitoring

Created monitoring script:

#!/bin/bash
# check-replication.sh

SLAVE_STATUS=$(mysql -e "SHOW SLAVE STATUS\G")

IO_RUNNING=$(echo "$SLAVE_STATUS" | grep "Slave_IO_Running:" | awk '{print $2}')
SQL_RUNNING=$(echo "$SLAVE_STATUS" | grep "Slave_SQL_Running:" | awk '{print $2}')
LAG=$(echo "$SLAVE_STATUS" | grep "Seconds_Behind_Master:" | awk '{print $2}')

if [ "$IO_RUNNING" != "Yes" ] || [ "$SQL_RUNNING" != "Yes" ]; then
    echo "CRITICAL: Replication stopped!"
    # Send alert
    exit 2
fi

if [ "$LAG" -gt 60 ]; then
    echo "WARNING: Replication lag is ${LAG}s"
    # Send warning
    exit 1
fi

echo "OK: Replication healthy, lag ${LAG}s"
exit 0

Run via cron every minute:

* * * * * /usr/local/bin/check-replication.sh

Handling Replication Errors

Common error: duplicate key

SHOW SLAVE STATUS\G

Output:

Last_Error: Error 'Duplicate entry '123' for key 'PRIMARY'' on query...

Skip one event:

STOP SLAVE;
SET GLOBAL SQL_SLAVE_SKIP_COUNTER = 1;
START SLAVE;

Warning: only skip an event if you understand why it happened; skipping papers over the conflict and can leave the slave silently out of sync with the master.

Failover to Slave

Master failed! Promote slave1 to master:

1. Stop replication on slave1:

STOP SLAVE;
RESET SLAVE ALL;

2. Make slave1 writable:

SET GLOBAL read_only = 0;

3. Point other slaves to new master:

On slave2 and slave3, using the File and Position reported by SHOW MASTER STATUS on slave1, the new master:

STOP SLAVE;

CHANGE MASTER TO
  MASTER_HOST='slave1.example.com',
  MASTER_USER='repl',
  MASTER_PASSWORD='strong_password',
  MASTER_LOG_FILE='mysql-bin.000005',
  MASTER_LOG_POS=789;

START SLAVE;

4. Update application config:

Point writes to slave1 (new master).
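
These steps are easy to fumble during an outage, so it is worth scripting them ahead of time. A rough sketch of the promotion and re-pointing (hostnames, credentials, and coordinates are placeholders; a real runbook also has to fence off the failed master and update the application config):

# promote_slave.py - sketch only, not a full failover tool
import pymysql

def promote(host, user='root', password='***'):
    # Turn a replica into a writable master
    conn = pymysql.connect(host=host, user=user, password=password, autocommit=True)
    with conn.cursor() as cursor:
        cursor.execute("STOP SLAVE")
        cursor.execute("RESET SLAVE ALL")
        cursor.execute("SET GLOBAL read_only = 0")
    conn.close()

def repoint(host, new_master, log_file, log_pos, user='root', password='***'):
    # Point a remaining slave at the newly promoted master
    conn = pymysql.connect(host=host, user=user, password=password, autocommit=True)
    with conn.cursor() as cursor:
        cursor.execute("STOP SLAVE")
        cursor.execute(
            "CHANGE MASTER TO MASTER_HOST=%s, MASTER_USER='repl', "
            "MASTER_PASSWORD='strong_password', MASTER_LOG_FILE=%s, MASTER_LOG_POS=%s",
            (new_master, log_file, log_pos))
        cursor.execute("START SLAVE")
    conn.close()

promote('slave1.example.com')
for slave in ('slave2.example.com', 'slave3.example.com'):
    repoint(slave, 'slave1.example.com', 'mysql-bin.000005', 789)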

Semi-Synchronous Replication

Ensure at least one slave has received each transaction before the commit returns to the client:

On master:

INSTALL PLUGIN rpl_semi_sync_master SONAME 'semisync_master.so';
SET GLOBAL rpl_semi_sync_master_enabled = 1;
SET GLOBAL rpl_semi_sync_master_timeout = 1000; -- 1 second

On slaves:

INSTALL PLUGIN rpl_semi_sync_slave SONAME 'semisync_slave.so';
SET GLOBAL rpl_semi_sync_slave_enabled = 1;

Restart replication on slaves:

STOP SLAVE IO_THREAD;
START SLAVE IO_THREAD;

Now master waits for at least one slave to acknowledge before commit.
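
To confirm semi-sync is actually in effect (it silently falls back to asynchronous replication after the timeout if no slave acknowledges), check the plugin's status variables. A quick sketch, with placeholder credentials:

import pymysql

conn = pymysql.connect(host='master.example.com', user='root', password='***')
with conn.cursor() as cursor:
    # Rpl_semi_sync_master_status is ON while the master is running
    # semi-synchronously and OFF if it has fallen back to async
    cursor.execute("SHOW STATUS LIKE 'Rpl_semi_sync_master%'")
    for name, value in cursor.fetchall():
        print(name, value)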

GTID Replication

Global Transaction Identifiers (MySQL 5.6+):

Master config:

gtid_mode = ON
enforce_gtid_consistency = ON

Slave config:

gtid_mode = ON
enforce_gtid_consistency = ON

Configure replication:

CHANGE MASTER TO
  MASTER_HOST='master.example.com',
  MASTER_USER='repl',
  MASTER_PASSWORD='strong_password',
  MASTER_AUTO_POSITION=1;

No need to specify log file and position!
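
With GTIDs you can also sanity-check replication by comparing executed GTID sets instead of binlog coordinates; on a healthy slave the executed set converges to the master's. A small sketch (connection details are placeholders):

import pymysql

def gtid_executed(host, user='root', password='***'):
    conn = pymysql.connect(host=host, user=user, password=password)
    with conn.cursor() as cursor:
        cursor.execute("SELECT @@GLOBAL.gtid_executed")
        return cursor.fetchone()[0]

print(gtid_executed('master.example.com'))
print(gtid_executed('slave1.example.com'))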

Backup from Slave

Use slave3 for backups (doesn’t impact master):

#!/bin/bash
# backup-mysql.sh

DATE=$(date +%Y%m%d)
BACKUP_DIR=/backups/mysql

mysqldump -h slave3.example.com -u backup -p production_db | gzip > $BACKUP_DIR/backup-$DATE.sql.gz

# Keep last 7 days
find $BACKUP_DIR -name "backup-*.sql.gz" -mtime +7 -delete
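
A backup that is never checked is a gamble, so it pays to verify at least that the latest dump exists, is non-empty, and is recent. A minimal sketch (the paths match the script above; the age threshold is my own):

import glob
import os
import time

BACKUP_DIR = '/backups/mysql'
MAX_AGE_HOURS = 26  # assumption: daily backups, with a little slack

backups = sorted(glob.glob(os.path.join(BACKUP_DIR, 'backup-*.sql.gz')))
assert backups, "no backups found"

latest = backups[-1]
age_hours = (time.time() - os.path.getmtime(latest)) / 3600
assert os.path.getsize(latest) > 0, f"{latest} is empty"
assert age_hours < MAX_AGE_HOURS, f"{latest} is {age_hours:.0f}h old"
print(f"OK: {latest}")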

Monitoring with Prometheus

Expose MySQL metrics:

# Install mysqld_exporter
wget https://github.com/prometheus/mysqld_exporter/releases/download/v0.10.0/mysqld_exporter-0.10.0.linux-amd64.tar.gz
tar xvfz mysqld_exporter-0.10.0.linux-amd64.tar.gz

# Create MySQL user
mysql -e "CREATE USER 'exporter'@'localhost' IDENTIFIED BY '***';"
mysql -e "GRANT PROCESS, REPLICATION CLIENT, SELECT ON *.* TO 'exporter'@'localhost';"

# Run exporter
./mysqld_exporter --config.my-cnf=/etc/mysql/exporter.cnf

Prometheus queries:

# Replication lag
mysql_slave_status_seconds_behind_master

# Replication running
mysql_slave_status_slave_io_running
mysql_slave_status_slave_sql_running
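
If you want a quick check outside Prometheus, the exporter's metrics endpoint is plain text and easy to parse directly. A rough sketch, assuming the exporter's default port 9104:

import urllib.request

# Fetch the exporter's plain-text metrics page (port 9104 is the default)
url = 'http://slave1.example.com:9104/metrics'
body = urllib.request.urlopen(url).read().decode()

for line in body.splitlines():
    if line.startswith('mysql_slave_status_seconds_behind_master'):
        lag = float(line.split()[-1])
        print(f"replication lag: {lag:.0f}s")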

Results

Before:

  • Single server
  • No redundancy
  • Slow reads during peak
  • Downtime = outage

After:

  • 1 master + 2 read slaves
  • High availability
  • 3x read capacity
  • 2-minute failover time

Lessons Learned

  1. Test failover - Practice promoting slaves
  2. Monitor lag - Alert if lag > 60s
  3. Use semi-sync - Prevent data loss
  4. Backup from slave - Don’t impact master
  5. Automate monitoring - Don’t rely on manual checks

Conclusion

MySQL replication provides read scaling and high availability. Essential for production systems.

Key takeaways:

  1. Master for writes, slaves for reads
  2. Monitor replication lag constantly
  3. Practice failover procedures
  4. Use semi-synchronous replication
  5. Take backups from slaves

Replication isn’t perfect (slaves are eventually consistent, so reads can be slightly stale), but it’s a huge improvement over a single server.