MySQL Master-Slave Replication: High Availability Setup
Our MySQL server was the single point of failure. When it went down, everything stopped. Read queries were also overwhelming the single server.
I set up master-slave replication with 2 read replicas. Now we have redundancy and 3x read capacity. Last master failure? Promoted slave in 2 minutes.
The Problem
Single MySQL server:
- All reads and writes
- No redundancy
- Slow during peak hours
- Downtime = complete outage
We needed:
- Read scaling (multiple read replicas)
- High availability (failover capability)
- Backup source (replicas for backups)
Master-Slave Architecture
Master (writes)
├── Slave 1 (reads)
├── Slave 2 (reads)
└── Slave 3 (backup)
- Master handles all writes
- Slaves replicate from master
- Slaves handle read queries
- Can promote slave to master if master fails
Setting Up Master
MySQL 5.7 on the master server.
Edit /etc/mysql/my.cnf:
[mysqld]
server-id = 1
log_bin = /var/log/mysql/mysql-bin.log
binlog_do_db = production_db
binlog_format = ROW
expire_logs_days = 7
max_binlog_size = 100M
Restart MySQL:
sudo systemctl restart mysql
Create replication user:
CREATE USER 'repl'@'%' IDENTIFIED BY 'strong_password';
GRANT REPLICATION SLAVE ON *.* TO 'repl'@'%';
FLUSH PRIVILEGES;
Get master status:
SHOW MASTER STATUS;
Output:
+------------------+----------+--------------+------------------+
| File             | Position | Binlog_Do_DB | Binlog_Ignore_DB |
+------------------+----------+--------------+------------------+
| mysql-bin.000001 |      154 | production_db|                  |
+------------------+----------+--------------+------------------+
Note the File and Position!
Setting Up Slave
On slave server, edit /etc/mysql/my.cnf:
[mysqld]
server-id = 2
relay-log = /var/log/mysql/mysql-relay-bin
log_bin = /var/log/mysql/mysql-bin.log
binlog_format = ROW
read_only = 1
server-id must be unique across the whole topology - the master and every slave.
Restart MySQL:
sudo systemctl restart mysql
Initial Data Copy
Stop writes on the master (briefly):
FLUSH TABLES WITH READ LOCK;
Keep this session open - the lock is released the moment the session disconnects. In a second terminal, dump the database:
mysqldump -u root -p production_db > production_db.sql
(Alternatively, mysqldump --single-transaction --master-data=2 takes a consistent InnoDB snapshot and records the binlog coordinates in the dump, without a global read lock.)
Back in the first session, unlock tables:
UNLOCK TABLES;
Copy dump to slave:
scp production_db.sql slave1:/tmp/
Import on slave:
mysql -u root -p production_db < /tmp/production_db.sql
Configure Replication
On slave:
CHANGE MASTER TO
MASTER_HOST='master.example.com',
MASTER_USER='repl',
MASTER_PASSWORD='strong_password',
MASTER_LOG_FILE='mysql-bin.000001',
MASTER_LOG_POS=154;
Use File and Position from master status!
Start replication:
START SLAVE;
Check status:
SHOW SLAVE STATUS\G
Look for:
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Seconds_Behind_Master: 0
Both should be “Yes”, lag should be low.
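These checks are easy to script. A minimal sketch in Python, assuming the status row has been read into a dict (e.g. via pymysql's DictCursor; field names as in MySQL 5.7's SHOW SLAVE STATUS):

```python
# Classify replication health from a SHOW SLAVE STATUS row (as a dict).
# Field names match MySQL 5.7's SHOW SLAVE STATUS output.

def replication_health(status, max_lag=60):
    """Return 'critical', 'warning', or 'ok' for a slave status row."""
    if status.get("Slave_IO_Running") != "Yes" or status.get("Slave_SQL_Running") != "Yes":
        return "critical"
    lag = status.get("Seconds_Behind_Master")
    if lag is None:  # reported as NULL while replication is broken or starting up
        return "critical"
    if lag > max_lag:
        return "warning"
    return "ok"

# Example rows:
healthy = {"Slave_IO_Running": "Yes", "Slave_SQL_Running": "Yes", "Seconds_Behind_Master": 0}
lagging = {"Slave_IO_Running": "Yes", "Slave_SQL_Running": "Yes", "Seconds_Behind_Master": 120}
broken  = {"Slave_IO_Running": "No",  "Slave_SQL_Running": "Yes", "Seconds_Behind_Master": None}

print(replication_health(healthy))  # ok
print(replication_health(lagging))  # warning
print(replication_health(broken))   # critical
```

Treating a NULL lag as critical matters: Seconds_Behind_Master is NULL exactly when replication is not running, so a naive numeric comparison would miss the worst case.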
Setting Up Multiple Slaves
Repeat for slave2 and slave3:
Slave 2:
server-id = 3
Slave 3 (backup only):
server-id = 4
Application Configuration
Split reads and writes:
# database.py
import random

import pymysql

# Master connection (writes)
master = pymysql.connect(
    host='master.example.com',
    user='app',
    password='***',
    database='production_db'
)

# Slave connections (reads)
slaves = [
    pymysql.connect(host='slave1.example.com', ...),
    pymysql.connect(host='slave2.example.com', ...)
]

def get_read_connection():
    # Pick a random slave to spread read load
    return random.choice(slaves)

def get_write_connection():
    return master

# Usage
def get_user(user_id):
    conn = get_read_connection()
    cursor = conn.cursor()
    cursor.execute("SELECT * FROM users WHERE id = %s", (user_id,))
    return cursor.fetchone()

def create_user(email, name):
    conn = get_write_connection()
    cursor = conn.cursor()
    cursor.execute("INSERT INTO users (email, name) VALUES (%s, %s)", (email, name))
    conn.commit()
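The simple selector above has no failure handling: if a slave is down, reads should fall back to another slave or, as a last resort, to the master. A sketch of that idea, with connection factories injected so the routing logic stays testable (hostnames and the stand-in factories are placeholders, not part of the original setup):

```python
import random

class ReadRouter:
    """Pick a read connection from healthy slaves, falling back to the master.

    `slave_factories` and `master_factory` are zero-argument callables that
    return an open connection (e.g. functools.partial(pymysql.connect, ...)).
    """

    def __init__(self, slave_factories, master_factory):
        self.slave_factories = list(slave_factories)
        self.master_factory = master_factory

    def get_read_connection(self):
        candidates = self.slave_factories[:]
        random.shuffle(candidates)       # spread load across slaves
        for factory in candidates:
            try:
                return factory()
            except Exception:            # slave unreachable: try the next one
                continue
        return self.master_factory()     # last resort: read from the master


# Demo with stand-in factories instead of real pymysql connections:
def down():
    raise ConnectionError("slave unreachable")

router = ReadRouter([down, down], master_factory=lambda: "master-conn")
print(router.get_read_connection())  # master-conn: both slaves are down
```

In production you would also want to re-check failed slaves periodically instead of retrying them on every read, but the fallback ordering is the essential part.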
Monitoring Replication
Check replication lag:
SHOW SLAVE STATUS\G
Key metrics:
Seconds_Behind_Master: 0   # Replication lag in seconds
Slave_IO_Running: Yes      # IO thread running
Slave_SQL_Running: Yes     # SQL thread running
Last_Error:                # Any errors
Automated Monitoring
Created monitoring script:
#!/bin/bash
# check-replication.sh
SLAVE_STATUS=$(mysql -e "SHOW SLAVE STATUS\G")
IO_RUNNING=$(echo "$SLAVE_STATUS" | grep "Slave_IO_Running:" | awk '{print $2}')
SQL_RUNNING=$(echo "$SLAVE_STATUS" | grep "Slave_SQL_Running:" | awk '{print $2}')
LAG=$(echo "$SLAVE_STATUS" | grep "Seconds_Behind_Master:" | awk '{print $2}')

if [ "$IO_RUNNING" != "Yes" ] || [ "$SQL_RUNNING" != "Yes" ]; then
    echo "CRITICAL: Replication stopped!"
    # Send alert
    exit 2
fi

# Seconds_Behind_Master is NULL while replication is broken or starting up,
# which would break the numeric comparison below
if [ "$LAG" = "NULL" ]; then
    echo "CRITICAL: Replication lag unknown (NULL)"
    exit 2
fi

if [ "$LAG" -gt 60 ]; then
    echo "WARNING: Replication lag is ${LAG}s"
    # Send warning
    exit 1
fi

echo "OK: Replication healthy, lag ${LAG}s"
exit 0
Run via cron every minute:
* * * * * /usr/local/bin/check-replication.sh
Handling Replication Errors
Common error: duplicate key
SHOW SLAVE STATUS\G
Output:
Last_Error: Error 'Duplicate entry '123' for key 'PRIMARY'' on query...
Skip one event:
STOP SLAVE;
SET GLOBAL SQL_SLAVE_SKIP_COUNTER = 1;
START SLAVE;
Warning: Only skip if you understand the impact!
Failover to Slave
Master failed! Promote slave1 to master:
1. Stop replication on slave1:
STOP SLAVE;
RESET SLAVE ALL;
2. Make slave1 writable:
SET GLOBAL read_only = 0;
3. Point other slaves to the new master (get the File and Position from SHOW MASTER STATUS on slave1):
On slave2 and slave3:
STOP SLAVE;
CHANGE MASTER TO
MASTER_HOST='slave1.example.com',
MASTER_USER='repl',
MASTER_PASSWORD='strong_password',
MASTER_LOG_FILE='mysql-bin.000005',
MASTER_LOG_POS=789;
START SLAVE;
4. Update application config:
Point writes to slave1 (new master).
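Steps 1-3 are easy to fumble during an outage, so it helps to generate the repoint statement from one place. A minimal sketch that builds the CHANGE MASTER TO statement for the remaining slaves (hostnames and coordinates are placeholders; with GTID replication, MASTER_AUTO_POSITION=1 replaces the file/position pair):

```python
def repoint_statement(new_master, repl_user, repl_password,
                      log_file=None, log_pos=None, auto_position=False):
    """Build the CHANGE MASTER TO statement to run on each remaining slave."""
    clauses = [
        f"MASTER_HOST='{new_master}'",
        f"MASTER_USER='{repl_user}'",
        f"MASTER_PASSWORD='{repl_password}'",
    ]
    if auto_position:        # GTID mode: no binlog coordinates needed
        clauses.append("MASTER_AUTO_POSITION=1")
    else:                    # classic mode: coordinates from the new master
        clauses.append(f"MASTER_LOG_FILE='{log_file}'")
        clauses.append(f"MASTER_LOG_POS={log_pos}")
    return "CHANGE MASTER TO\n  " + ",\n  ".join(clauses) + ";"

print(repoint_statement("slave1.example.com", "repl", "strong_password",
                        log_file="mysql-bin.000005", log_pos=789))
```

The same helper covers both the positional failover shown above and the GTID setup described later, which is one reason GTID failover is less error-prone: there are no coordinates to copy by hand.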
Semi-Synchronous Replication
Ensure at least one slave has received each transaction before the commit completes:
On master:
INSTALL PLUGIN rpl_semi_sync_master SONAME 'semisync_master.so';
SET GLOBAL rpl_semi_sync_master_enabled = 1;
SET GLOBAL rpl_semi_sync_master_timeout = 1000; -- 1 second
On slaves:
INSTALL PLUGIN rpl_semi_sync_slave SONAME 'semisync_slave.so';
SET GLOBAL rpl_semi_sync_slave_enabled = 1;
Restart replication on slaves:
STOP SLAVE IO_THREAD;
START SLAVE IO_THREAD;
Now the master waits for at least one slave to acknowledge each transaction before the commit returns to the client. If no slave responds within the timeout, the master falls back to asynchronous replication rather than blocking writes.
GTID Replication
Global Transaction Identifiers (MySQL 5.6+):
Master config:
gtid_mode = ON
enforce_gtid_consistency = ON
Slave config:
gtid_mode = ON
enforce_gtid_consistency = ON
Configure replication:
CHANGE MASTER TO
MASTER_HOST='master.example.com',
MASTER_USER='repl',
MASTER_PASSWORD='strong_password',
MASTER_AUTO_POSITION=1;
No need to specify log file and position!
Backup from Slave
Use slave3 for backups (doesn't impact the master):
#!/bin/bash
# backup-mysql.sh
DATE=$(date +%Y%m%d)
BACKUP_DIR=/backups/mysql

# Credentials come from an option file (~/.my.cnf or --defaults-extra-file);
# a bare -p would prompt for a password and hang under cron.
mysqldump -h slave3.example.com -u backup production_db | gzip > "$BACKUP_DIR/backup-$DATE.sql.gz"

# Keep last 7 days
find "$BACKUP_DIR" -name "backup-*.sql.gz" -mtime +7 -delete
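A backup job that silently stops producing files is worse than no backup job, so it's worth checking that a recent dump actually exists. A small freshness check, assuming the same /backups/mysql layout as the script above (the path and the 25-hour threshold are illustrative):

```python
import glob
import os
import time

def latest_backup_age_hours(backup_dir, pattern="backup-*.sql.gz"):
    """Return the age in hours of the newest backup file, or None if none exist."""
    files = glob.glob(os.path.join(backup_dir, pattern))
    if not files:
        return None
    newest = max(os.path.getmtime(f) for f in files)
    return (time.time() - newest) / 3600.0

# Alert if the newest dump is older than ~25 hours (a daily job plus slack):
age = latest_backup_age_hours("/backups/mysql")
if age is None or age > 25:
    print("CRITICAL: no recent MySQL backup found")
else:
    print(f"OK: newest backup is {age:.1f}h old")
```

This only proves a file landed; periodically restoring a dump into a scratch instance is the real test that the backup is usable.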
Monitoring with Prometheus
Expose MySQL metrics:
# Install mysqld_exporter
wget https://github.com/prometheus/mysqld_exporter/releases/download/v0.10.0/mysqld_exporter-0.10.0.linux-amd64.tar.gz
tar xvfz mysqld_exporter-0.10.0.linux-amd64.tar.gz
# Create MySQL user
mysql -e "CREATE USER 'exporter'@'localhost' IDENTIFIED BY '***';"
mysql -e "GRANT PROCESS, REPLICATION CLIENT, SELECT ON *.* TO 'exporter'@'localhost';"
# Run exporter
./mysqld_exporter --config.my-cnf=/etc/mysql/exporter.cnf
Prometheus queries:
# Replication lag
mysql_slave_status_seconds_behind_master
# Replication running
mysql_slave_status_slave_io_running
mysql_slave_status_slave_sql_running
Results
Before:
- Single server
- No redundancy
- Slow reads during peak
- Downtime = outage
After:
- 1 master + 2 read slaves
- High availability
- 3x read capacity
- 2-minute failover time
Lessons Learned
- Test failover - Practice promoting slaves
- Monitor lag - Alert if lag > 60s
- Use semi-sync - Prevent data loss
- Backup from slave - Don’t impact master
- Automate monitoring - Don’t rely on manual checks
Conclusion
MySQL replication provides read scaling and high availability. Essential for production systems.
Key takeaways:
- Master for writes, slaves for reads
- Monitor replication lag constantly
- Practice failover procedures
- Use semi-synchronous replication
- Take backups from slaves
Replication isn’t perfect (eventual consistency), but it’s a huge improvement over single server.