Here's how to failover from primary to secondary master.
This was written following the My Opera case, and we use MySQL, but should be fairly generic.
Disable monitoring checks
- Pause any Pingdom checks that are running for the affected systems
- Set downtime or disable notifications in Nagios for the affected systems
Record log file and position
Assuming your secondary master (DB2) is idle, then now you have to record the log file and position by issuing a SHOW MASTER STATUS
command:
mysql> SHOW MASTER STATUS G
*************************** 1. row ***************************
File: mysql-bin.000024 <- MASTER LOG FILE
Position: 91074774 <- MASTER LOG POSITION
Binlog_Do_DB:
Binlog_Ignore_DB:
1 row in set (0.00 sec)
Write them down somewhere.
If you need to perform any kind of write/alter query on this host, then you have to issue the show master status command again, because position will change.
Also try repeating this command. You should see that the log file and position do not change between different runs.
Enable maintenance mode
Now is the time to enable your maintenance or downtime mode for the site or service. That will of course depend on your deployment tools.
Stop backend servers
Your backend/application servers might need to stay up and running. For example, in case of the Auth service, we want this, because we're going to serve static responses (html, xml, etc…) to the clients instead of just letting the connections hang.
In other cases, it's fine to just shut down the backends. You may want to do this for 2 reasons:
- to make sure that nobody is accidentally hitting your master database, from your internal network or otherwise
- because doing so should close all the connections to your master database. This is actually depending on the
wait_timeout
variable in the mysql server. The connections won't go away untilwait_timeout
seconds have passed. This is the normal behaviour, so don't panic if you still see connections after you shut down the backends.
Switch to the new master now
This depends on how you actually perform the switch. I can imagine at least 2 ways to do this:
- by instructing LVS to direct all connections to the secondary master
- take over the IP address either manually or using keepalived
On My Opera, we use keepalived
with a private group between the two master database servers, so it's just a matter of:
- stopping
keepalived
on the primary master database - starting
keepalived
on the secondary master database
There is a quick and dirty bash script that allows to verify who's the master and makes the switch.
#!/bin/sh DB1=pri-master-hostname DB2=sec-master-hostname function toggle_keepalive() { host=$1 if [[ `ssh $host pidof keepalived` == "" ]]; then ssh $host /etc/init.d/keepalived start if [[ `ssh $host pidof keepalived` == "" ]]; then echo '*** KEEPALIVE START FAILED ***' echo 'Aborting the master failover procedure' exit fi else ssh $host /etc/init.d/keepalived stop if [[ `ssh $host pidof keepalived` != "" ]]; then echo '*** KEEPALIVE STOP FAILED ***' echo 'Aborting the master failover procedure' exit fi fi } echo "Master Database failover" echo # Find out who's the primary master now, and swap them if [[ `ssh $DB1 pidof keepalived` == "" ]]; then PRIMARY=$DB2 SECONDARY=$DB1 else PRIMARY=$DB1 SECONDARY=$DB2 fi echo Primary is $PRIMARY echo Secondary is $SECONDARY # Shutdown primary first, then enable secondary toggle_keepalive $PRIMARY toggle_keepalive $SECONDARY
As soon as you do that, the secondary master will be promoted to primary master.
Since they are assumed to be already replicating from each other, nothing will change for them. It will however for all the slaves that were replicating from the primary master. We'll see what to do about that later.
Restart backend servers
Now it's the right time to restart the backend servers, and check that they correctly connect to the new primary master.
On My Opera, we're using a virtual address, w-mlb
(write-mysql-load-balancer), to refer to the active primary master database. We use this name in the configuration files everywhere.
This means that we don't have to change anything in the backend servers configuration. We just restart them, and they will connect to the new primary master, due to the IP takeover step described above.
Turn off maintenance mode
If the backends are working correctly, they're connecting to the new master db, it's time to remove the maintenance page, so do that.
We're enabling and disabling maintenance mode by enabling and disabling a virtual host configuration in our frontends and reloading or restarting the frontend httpd servers.
From now on, your application is hopefully up and running and receiving client requests, so your downtime window is over.
Check replication lag
The database slaves at this point are still replicating from the former primary master database (DB1).
But DB1 now is not receiving any traffic (queries) anymore, so it's basically idle, and it should be. Any queries happening on DB1 now mean that something is seriously wrong. There might be lingering connections, but no activity.
Then it's important that all the slaves show no replication lag, so issuing a SHOW SLAVE STATUS
command should show zero seconds behind.
mysql> SHOW SLAVE STATUS G *************************** 1. row *************************** Slave_IO_State: Waiting for master to send event Master_Host: <DB1-ip> Master_Port: <port> Connect_Retry: 60 Master_Log_File: mysql-bin.000025 Read_Master_Log_Pos: 13691126 ... Seconds_Behind_Master: 0 1 row in set (0.00 sec)
It's important that Seconds Behind Master
is zero.
If it's not, it means that the slave needs more time to fully replicate all the past traffic that had been going on on the former primary master, DB1.
Remember that the primary master is now DB2, while DB1 is the secondary master.
Change master on the slaves
Now you can perform the CHANGE MASTER TO command on all the slaves.
Now you have to bring back the notes about MASTER LOG FILE and MASTER LOG POSITION.
First, stop the slave replication.
mysql> STOP SLAVE;
Then the exact command to issue, if nothing else about your replication changed, is:
mysql> CHANGE MASTER TO MASTER_HOST='<DB2-ip>', MASTER_LOG_FILE='<master_log_file>', MASTER_LOG_POSITION='<master_log_position>';
Then restart the slave replication:
mysql> START SLAVE; mysql> SHOW SLAVE STATUS G
The following SHOW SLAVE STATUS G
command should show the replication running, and, depending on how long it took you to change master since the new master took over the ip, the number of seconds of replication lag.
This number should rapidly go down towards zero.
If it's not, then you might have a problem. Go hide now or take the first flight to Australia or something.
We wrote a switch-master
Perl script that proved to be very effective and useful. Example:
./switch-master --host <your_slave> --new-master <new_master_ip> --log-file <master_log_file> --log-pos <master_log_position>
This script performs a lot of sanity checks. Before switching master, it checks that replication lag is zero. If it's not, waits a bit and checks again, etc…
It's made to try to prevent disaster from striking. Very useful and quick to use.
Enable monitoring checks
Now verify that everything looks fine, replication lag is zero, your backends are working correctly, try to use your site a bit.
If everything's fine, enable or unpause the monitoring checks.
You have made it!