{"id":504,"date":"2010-04-16T14:01:50","date_gmt":"2010-04-16T13:01:50","guid":{"rendered":"http:\/\/www.streppone.it\/cosimo\/blog\/2010\/04\/primary-to-secondary-master-failover\/"},"modified":"2010-04-16T14:01:50","modified_gmt":"2010-04-16T13:01:50","slug":"primary-to-secondary-master-failover","status":"publish","type":"post","link":"https:\/\/www.streppone.it\/cosimo\/blog\/2010\/04\/primary-to-secondary-master-failover\/","title":{"rendered":"Primary to secondary master failover"},"content":{"rendered":"<p>Here&#39;s how to failover from primary to secondary master.<br \/>\nThis was written following the My Opera case, and we use MySQL, but should be fairly generic.<\/p>\n<h3>Disable monitoring checks<\/h3>\n<ul>\n<li>Pause any Pingdom checks that are running for the affected systems\n<\/li>\n<li>Set downtime or disable notifications in Nagios for the affected systems<\/li>\n<\/ul>\n<h3>Record log file and position<\/h3>\n<p>Assuming your secondary master (DB2) is idle, then now you have to record the log file and position by issuing a <code>SHOW MASTER STATUS<\/code> command:<\/p>\n<pre><code>\r\nmysql&gt; SHOW MASTER STATUS G\r\n*************************** 1. row ***************************\r\n            File: mysql-bin.000024    &lt;- MASTER LOG FILE\r\n        Position: 91074774            &lt;- MASTER LOG POSITION\r\n    Binlog_Do_DB:\r\nBinlog_Ignore_DB:\r\n\r\n1 row in set (0.00 sec)\r\n<\/code><\/pre>\n<p>Write them down somewhere.<br \/>\nIf you need to perform any kind of write\/alter query on this host, then you have to issue the show master status command again, because position will change.<\/p>\n<p>Also try repeating this command. You should see <b>that the log file and position do not change between different runs<\/b>.<\/p>\n<h3>Enable maintenance mode<\/h3>\n<p>Now is the time to enable your maintenance or downtime mode for the site or service. That will of course depend on your deployment tools.<\/p>\n<h3>Stop backend servers<\/h3>\n<p>Your backend\/application servers might need to stay up and running. For example, in case of the Auth service, we want this, because we&#39;re going to serve static responses (html, xml, etc&#8230;) to the clients instead of just letting the connections hang.<\/p>\n<p>In other cases, it&#39;s fine to just shut down the backends. You may want to do this for 2 reasons:<\/p>\n<ul>\n<li>to make sure that nobody is accidentally hitting your master database, from your internal network or otherwise\n<\/li>\n<li>because doing so should close all the connections to your master database. This is actually depending on the <code>wait_timeout<\/code> variable in the mysql server. The connections won&#39;t go away until <code>wait_timeout<\/code> seconds have passed. This is the normal behaviour, so don&#39;t panic if you still see connections after you shut down the backends.<\/li>\n<\/ul>\n<h3>Switch to the new master now<\/h3>\n<p>This depends on how you actually perform the switch. I can imagine at least 2 ways to do this:<\/p>\n<ul>\n<li>by instructing LVS to direct all connections to the secondary master\n<\/li>\n<li>take over the IP address either manually or using keepalived<\/li>\n<\/ul>\n<p>On My Opera, we use <code>keepalived<\/code> with a private group between the two master database servers, so it&#39;s just a matter of:<\/p>\n<ul>\n<li>stopping <code>keepalived<\/code> on the primary master database\n<\/li>\n<li>starting <code>keepalived<\/code> on the secondary master database<\/li>\n<\/ul>\n<p>There is a quick and dirty bash script that allows to verify who&#39;s the master and makes the switch.<\/p>\n<pre>\r\n#!\/bin\/sh\r\n\r\nDB1=pri-master-hostname\r\nDB2=sec-master-hostname\r\n\r\nfunction toggle_keepalive() {\r\n        host=$1\r\n        if [[ `ssh $host pidof keepalived` == &quot;&quot; ]]; then\r\n                ssh $host \/etc\/init.d\/keepalived start\r\n                if [[ `ssh $host pidof keepalived` == &quot;&quot; ]]; then\r\n                        echo &#39;*** KEEPALIVE START FAILED ***&#39;\r\n                        echo &#39;Aborting the master failover procedure&#39;\r\n                        exit\r\n                fi\r\n        else\r\n                ssh $host \/etc\/init.d\/keepalived stop\r\n                if [[ `ssh $host pidof keepalived` != &quot;&quot; ]]; then\r\n                        echo &#39;*** KEEPALIVE STOP FAILED ***&#39;\r\n                        echo &#39;Aborting the master failover procedure&#39;\r\n                        exit\r\n                fi\r\n        fi\r\n}\r\n\r\necho &quot;Master Database failover&quot;\r\necho\r\n\r\n# Find out who&#39;s the primary master now, and swap them\r\nif [[ `ssh $DB1 pidof keepalived` == &quot;&quot; ]]; then\r\n        PRIMARY=$DB2\r\n        SECONDARY=$DB1\r\nelse\r\n        PRIMARY=$DB1\r\n        SECONDARY=$DB2\r\nfi\r\n\r\necho Primary is $PRIMARY\r\necho Secondary is $SECONDARY\r\n\r\n# Shutdown primary first, then enable secondary\r\ntoggle_keepalive $PRIMARY\r\ntoggle_keepalive $SECONDARY\r\n<\/pre>\n<p>As soon as you do that, the secondary master will be promoted to primary master.<br \/>\nSince they are assumed to be already replicating from each other, nothing will change for them. It will however for all the slaves that were replicating from the primary master. We&#39;ll see what to do about that later.<\/p>\n<h3>Restart backend servers<\/h3>\n<p>Now it&#39;s the right time to restart the backend servers, and check that they correctly connect to the new primary master.<\/p>\n<p>On My Opera, we&#39;re using a virtual address, <code>w-mlb<\/code> (write-mysql-load-balancer), to refer to the active primary master database. We use this name in the configuration files everywhere.<\/p>\n<p>This means that we don&#39;t have to change anything in the backend servers configuration. We just restart them, and they will connect to the new primary master, due to the IP takeover step described above.<\/p>\n<h3>Turn off maintenance mode<\/h3>\n<p>If the backends are working correctly, they&#39;re connecting to the new master db, it&#39;s time to remove the maintenance page, so do that.<\/p>\n<p>We&#39;re enabling and disabling maintenance mode by enabling and disabling a virtual host configuration in our frontends and reloading or restarting the frontend httpd servers.<\/p>\n<p>From now on, your application is hopefully up and running and receiving client requests, so your downtime window is over.<\/p>\n<h3>Check replication lag<\/h3>\n<p>The database slaves at this point are still replicating from the former primary master database (DB1).<\/p>\n<p>But DB1 now is not receiving any traffic (queries) anymore, so it&#39;s basically idle, and <b>it should be<\/b>. Any queries happening on DB1 now mean that <b>something is seriously wrong<\/b>. There might be lingering connections, but no activity.<\/p>\n<p>Then it&#39;s important that all the slaves show <b>no replication lag<\/b>, so issuing a <code>SHOW SLAVE STATUS<\/code> command should show zero seconds behind.<\/p>\n<pre>\r\nmysql&gt; SHOW SLAVE STATUS G\r\n*************************** 1. row ***************************\r\n             Slave_IO_State: Waiting for master to send event\r\n                Master_Host: &lt;DB1-ip&gt;\r\n                Master_Port: &lt;port&gt;\r\n              Connect_Retry: 60\r\n            Master_Log_File: mysql-bin.000025\r\n        Read_Master_Log_Pos: 13691126\r\n...\r\n      Seconds_Behind_Master: 0\r\n\r\n1 row in set (0.00 sec)\r\n<\/pre>\n<p>It&#39;s important that <b><code>Seconds Behind Master<\/code> is zero<\/b>.<br \/>\nIf it&#39;s not, it means that the slave needs more time to fully replicate all the past traffic that had been going on on the former primary master, DB1.<\/p>\n<p>Remember that the primary master is now DB2, while DB1 is the secondary master.<\/p>\n<h3>Change master on the slaves<\/h3>\n<p>Now you can perform the <a href=\"http:\/\/dev.mysql.com\/doc\/refman\/5.0\/en\/change-master-to.html\" rel=\"nofollow\">CHANGE MASTER TO<\/a> command on all the slaves.<\/p>\n<p>Now you have to bring back the notes about <b>MASTER LOG FILE<\/b> and <b>MASTER LOG POSITION<\/b>.<\/p>\n<p>First, stop the slave replication.<\/p>\n<pre>\r\nmysql&gt; STOP SLAVE;\r\n<\/pre>\n<p>Then the exact command to issue, if nothing else about your replication changed, is:<\/p>\n<pre>\r\nmysql&gt; CHANGE MASTER TO MASTER_HOST=&#39;&lt;DB2-ip&gt;&#39;, MASTER_LOG_FILE=&#39;&lt;master_log_file&gt;&#39;, MASTER_LOG_POSITION=&#39;&lt;master_log_position&gt;&#39;;\r\n<\/pre>\n<p>Then restart the slave replication:<\/p>\n<pre>\r\nmysql&gt; START SLAVE;\r\nmysql&gt; SHOW SLAVE STATUS G\r\n<\/pre>\n<p>The following <code>SHOW SLAVE STATUS G<\/code> command should show the replication running, and, depending on how long it took you to change master since the new master took over the ip, the number of seconds of replication lag.<\/p>\n<p>This number <b>should rapidly go down towards zero<\/b>.<\/p>\n<p>If it&#39;s not, then you might have a problem. Go hide now or take the first flight to Australia or something.<\/p>\n<p>We wrote a <code>switch-master<\/code> Perl script that proved to be very effective and useful. Example:<\/p>\n<pre>\r\n.\/switch-master --host &lt;your_slave&gt; --new-master &lt;new_master_ip&gt; --log-file &lt;master_log_file&gt; --log-pos &lt;master_log_position&gt;\r\n<\/pre>\n<p>This script performs a lot of sanity checks. Before switching master, it checks that replication lag is zero. If it&#39;s not, waits a bit and checks again, etc&#8230;<\/p>\n<p>It&#39;s made to try to prevent disaster from striking. Very useful and quick to use.<\/p>\n<h3>Enable monitoring checks<\/h3>\n<p>Now verify that everything looks fine, replication lag is zero, your backends are working correctly, try to use your site a bit.<\/p>\n<p>If everything&#39;s fine, enable or unpause the monitoring checks.<br \/>\nYou have made it!<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Here&#39;s how to failover from primary to secondary master. This was written following the My Opera case, and we use MySQL, but should be fairly generic. Disable monitoring checks Pause any Pingdom checks that are running for the affected systems Set downtime or disable notifications in Nagios for the affected systems Record log file and [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[58,466,467,468,76],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v22.9 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Primary to secondary master failover - Random hacking<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/www.streppone.it\/cosimo\/blog\/2010\/04\/primary-to-secondary-master-failover\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Primary to secondary master failover - Random hacking\" \/>\n<meta property=\"og:description\" content=\"Here&#039;s how to failover from primary to secondary master. This was written following the My Opera case, and we use MySQL, but should be fairly generic. Disable monitoring checks Pause any Pingdom checks that are running for the affected systems Set downtime or disable notifications in Nagios for the affected systems Record log file and [&hellip;]\" \/>\n<meta property=\"og:url\" content=\"https:\/\/www.streppone.it\/cosimo\/blog\/2010\/04\/primary-to-secondary-master-failover\/\" \/>\n<meta property=\"og:site_name\" content=\"Random hacking\" \/>\n<meta property=\"article:published_time\" content=\"2010-04-16T13:01:50+00:00\" \/>\n<meta name=\"author\" content=\"cosimo\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"cosimo\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"6 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/www.streppone.it\/cosimo\/blog\/2010\/04\/primary-to-secondary-master-failover\/#article\",\"isPartOf\":{\"@id\":\"https:\/\/www.streppone.it\/cosimo\/blog\/2010\/04\/primary-to-secondary-master-failover\/\"},\"author\":{\"name\":\"cosimo\",\"@id\":\"https:\/\/www.streppone.it\/cosimo\/blog\/#\/schema\/person\/c443bedbf6ecf99550d6395620801df1\"},\"headline\":\"Primary to secondary master failover\",\"datePublished\":\"2010-04-16T13:01:50+00:00\",\"dateModified\":\"2010-04-16T13:01:50+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/www.streppone.it\/cosimo\/blog\/2010\/04\/primary-to-secondary-master-failover\/\"},\"wordCount\":966,\"commentCount\":0,\"publisher\":{\"@id\":\"https:\/\/www.streppone.it\/cosimo\/blog\/#\/schema\/person\/c443bedbf6ecf99550d6395620801df1\"},\"keywords\":[\"database\",\"failover\",\"master-master\",\"replication\",\"servers\"],\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\/\/www.streppone.it\/cosimo\/blog\/2010\/04\/primary-to-secondary-master-failover\/#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/www.streppone.it\/cosimo\/blog\/2010\/04\/primary-to-secondary-master-failover\/\",\"url\":\"https:\/\/www.streppone.it\/cosimo\/blog\/2010\/04\/primary-to-secondary-master-failover\/\",\"name\":\"Primary to secondary master failover - Random hacking\",\"isPartOf\":{\"@id\":\"https:\/\/www.streppone.it\/cosimo\/blog\/#website\"},\"datePublished\":\"2010-04-16T13:01:50+00:00\",\"dateModified\":\"2010-04-16T13:01:50+00:00\",\"breadcrumb\":{\"@id\":\"https:\/\/www.streppone.it\/cosimo\/blog\/2010\/04\/primary-to-secondary-master-failover\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/www.streppone.it\/cosimo\/blog\/2010\/04\/primary-to-secondary-master-failover\/\"]}]},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/www.streppone.it\/cosimo\/blog\/2010\/04\/primary-to-secondary-master-failover\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/www.streppone.it\/cosimo\/blog\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Primary to secondary master failover\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/www.streppone.it\/cosimo\/blog\/#website\",\"url\":\"https:\/\/www.streppone.it\/cosimo\/blog\/\",\"name\":\"Random hacking\",\"description\":\"Assume nothing. Code defensively. Keep it simple, stupid!\",\"publisher\":{\"@id\":\"https:\/\/www.streppone.it\/cosimo\/blog\/#\/schema\/person\/c443bedbf6ecf99550d6395620801df1\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/www.streppone.it\/cosimo\/blog\/?s={search_term_string}\"},\"query-input\":\"required name=search_term_string\"}],\"inLanguage\":\"en-US\"},{\"@type\":[\"Person\",\"Organization\"],\"@id\":\"https:\/\/www.streppone.it\/cosimo\/blog\/#\/schema\/person\/c443bedbf6ecf99550d6395620801df1\",\"name\":\"cosimo\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/www.streppone.it\/cosimo\/blog\/#\/schema\/person\/image\/\",\"url\":\"https:\/\/secure.gravatar.com\/avatar\/cb1d938720df45a2720724aae99e3bfc?s=96&r=g\",\"contentUrl\":\"https:\/\/secure.gravatar.com\/avatar\/cb1d938720df45a2720724aae99e3bfc?s=96&r=g\",\"caption\":\"cosimo\"},\"logo\":{\"@id\":\"https:\/\/www.streppone.it\/cosimo\/blog\/#\/schema\/person\/image\/\"},\"url\":\"https:\/\/www.streppone.it\/cosimo\/blog\/author\/cosimo\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"Primary to secondary master failover - Random hacking","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/www.streppone.it\/cosimo\/blog\/2010\/04\/primary-to-secondary-master-failover\/","og_locale":"en_US","og_type":"article","og_title":"Primary to secondary master failover - Random hacking","og_description":"Here&#39;s how to failover from primary to secondary master. This was written following the My Opera case, and we use MySQL, but should be fairly generic. Disable monitoring checks Pause any Pingdom checks that are running for the affected systems Set downtime or disable notifications in Nagios for the affected systems Record log file and [&hellip;]","og_url":"https:\/\/www.streppone.it\/cosimo\/blog\/2010\/04\/primary-to-secondary-master-failover\/","og_site_name":"Random hacking","article_published_time":"2010-04-16T13:01:50+00:00","author":"cosimo","twitter_card":"summary_large_image","twitter_misc":{"Written by":"cosimo","Est. reading time":"6 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/www.streppone.it\/cosimo\/blog\/2010\/04\/primary-to-secondary-master-failover\/#article","isPartOf":{"@id":"https:\/\/www.streppone.it\/cosimo\/blog\/2010\/04\/primary-to-secondary-master-failover\/"},"author":{"name":"cosimo","@id":"https:\/\/www.streppone.it\/cosimo\/blog\/#\/schema\/person\/c443bedbf6ecf99550d6395620801df1"},"headline":"Primary to secondary master failover","datePublished":"2010-04-16T13:01:50+00:00","dateModified":"2010-04-16T13:01:50+00:00","mainEntityOfPage":{"@id":"https:\/\/www.streppone.it\/cosimo\/blog\/2010\/04\/primary-to-secondary-master-failover\/"},"wordCount":966,"commentCount":0,"publisher":{"@id":"https:\/\/www.streppone.it\/cosimo\/blog\/#\/schema\/person\/c443bedbf6ecf99550d6395620801df1"},"keywords":["database","failover","master-master","replication","servers"],"inLanguage":"en-US","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/www.streppone.it\/cosimo\/blog\/2010\/04\/primary-to-secondary-master-failover\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/www.streppone.it\/cosimo\/blog\/2010\/04\/primary-to-secondary-master-failover\/","url":"https:\/\/www.streppone.it\/cosimo\/blog\/2010\/04\/primary-to-secondary-master-failover\/","name":"Primary to secondary master failover - Random hacking","isPartOf":{"@id":"https:\/\/www.streppone.it\/cosimo\/blog\/#website"},"datePublished":"2010-04-16T13:01:50+00:00","dateModified":"2010-04-16T13:01:50+00:00","breadcrumb":{"@id":"https:\/\/www.streppone.it\/cosimo\/blog\/2010\/04\/primary-to-secondary-master-failover\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/www.streppone.it\/cosimo\/blog\/2010\/04\/primary-to-secondary-master-failover\/"]}]},{"@type":"BreadcrumbList","@id":"https:\/\/www.streppone.it\/cosimo\/blog\/2010\/04\/primary-to-secondary-master-failover\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/www.streppone.it\/cosimo\/blog\/"},{"@type":"ListItem","position":2,"name":"Primary to secondary master failover"}]},{"@type":"WebSite","@id":"https:\/\/www.streppone.it\/cosimo\/blog\/#website","url":"https:\/\/www.streppone.it\/cosimo\/blog\/","name":"Random hacking","description":"Assume nothing. Code defensively. Keep it simple, stupid!","publisher":{"@id":"https:\/\/www.streppone.it\/cosimo\/blog\/#\/schema\/person\/c443bedbf6ecf99550d6395620801df1"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/www.streppone.it\/cosimo\/blog\/?s={search_term_string}"},"query-input":"required name=search_term_string"}],"inLanguage":"en-US"},{"@type":["Person","Organization"],"@id":"https:\/\/www.streppone.it\/cosimo\/blog\/#\/schema\/person\/c443bedbf6ecf99550d6395620801df1","name":"cosimo","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/www.streppone.it\/cosimo\/blog\/#\/schema\/person\/image\/","url":"https:\/\/secure.gravatar.com\/avatar\/cb1d938720df45a2720724aae99e3bfc?s=96&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/cb1d938720df45a2720724aae99e3bfc?s=96&r=g","caption":"cosimo"},"logo":{"@id":"https:\/\/www.streppone.it\/cosimo\/blog\/#\/schema\/person\/image\/"},"url":"https:\/\/www.streppone.it\/cosimo\/blog\/author\/cosimo\/"}]}},"_links":{"self":[{"href":"https:\/\/www.streppone.it\/cosimo\/blog\/wp-json\/wp\/v2\/posts\/504"}],"collection":[{"href":"https:\/\/www.streppone.it\/cosimo\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.streppone.it\/cosimo\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.streppone.it\/cosimo\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.streppone.it\/cosimo\/blog\/wp-json\/wp\/v2\/comments?post=504"}],"version-history":[{"count":0,"href":"https:\/\/www.streppone.it\/cosimo\/blog\/wp-json\/wp\/v2\/posts\/504\/revisions"}],"wp:attachment":[{"href":"https:\/\/www.streppone.it\/cosimo\/blog\/wp-json\/wp\/v2\/media?parent=504"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.streppone.it\/cosimo\/blog\/wp-json\/wp\/v2\/categories?post=504"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.streppone.it\/cosimo\/blog\/wp-json\/wp\/v2\/tags?post=504"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}