Monthly Archives: April 2010

Pimp my Debian

Have you ever reinstalled your workstation and found out that your Perl scripts need a shit load of modules that you don't have anymore? Or maybe on a server?

I have. In such cases you have to:

  • run your script,
  • find out which module is missing,
  • figure out if there's a Debian package for it,
  • install the Debian package,
  • GOTO 10

Today I was so annoyed and lazy that I decided to put an end to this madness. So I wrote pimp-my-debian. It's an innocent script that you can run as follows:

$ pimp-my-debian --command 'perl ./myscript'

It will keep running your command (perl ./myscript), reading its output, and if that output contains something like Can't locate Foo/Bar.pm in @INC, or Base class package "Foo::Bar" is empty, it will try to figure out a suitable Debian package, install it, and retry your command.
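
For the curious, the whole thing boils down to a retry loop. Here's a minimal shell sketch of the idea (this is not the actual script, and the module-to-package mapping is just the naive libfoo-bar-perl naming convention):

#!/bin/bash
# Rough sketch of the pimp-my-debian idea, not the real script.
# Usage: ./sketch perl ./myscript
cmd="$*"
until output=$($cmd 2>&1); do
    # Extract "Foo/Bar" from "Can't locate Foo/Bar.pm in @INC ..."
    module=$(echo "$output" | sed -n "s/.*Can't locate \(.*\)\.pm in @INC.*/\1/p" | head -1)
    [ -z "$module" ] && break                       # not a missing-module error: give up
    package="lib$(echo "$module" | tr '/A-Z' '-a-z')-perl"
    echo "Missing module $module, trying package $package"
    sudo apt-get install -y "$package" || break     # no such package: give up
done

The real pimp-my-debian also handles the empty base class error mentioned above.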

It worked so well that I so want to use it again… :-)
Get pimp-my-debian here and have fun!

Another quick update to Ubiquity for Opera

I just pushed a small update to my Ubiquity for Opera User JavaScript. Two tiny changes:

  • Now the ESC key hides the Ubiquity window
  • Fixed the text selection and focus when you reopen the Ubiquity window after having typed a command.

Thanks to Martin Šrank for contributing the ubiq_focus() fix.

To download the latest version, go to http://github.com/cosimo/ubiquity-opera/. There's also a minified version there.

MySQL alter table in a transaction

I had quite a painful experience yesterday, I would say…

We were about to run live database migrations, when I decided to try them first on a staging database server, to get a more precise idea of how long they would take. So I connected to the MySQL test system:

mysql> use myopera
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed
mysql> BEGIN WORK;
Query OK, 0 rows affected (0.00 sec)

mysql> ALTER TABLE blah ADD COLUMN xxx ...
...

The ALTER operation completed in a minute or so. Then I wanted to revert it, so I typed:

mysql> ROLLBACK;

So the transaction was rolled back. Just to double check, I verified that the table had been left in its original state and, to my surprise, it hadn't. Yes, the table is an InnoDB table, and yes, I had put the ALTER statement inside a transaction.

Even after a ROLLBACK the table is left altered… Isn't that surprising? For me, it was. Of course, it is documented MySQL behavior: ALTER TABLE is a DDL statement, and DDL statements cause an implicit commit, so there is nothing left to roll back. It's also filed as a bug/feature request.
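
If you want to reproduce it, a quick way is something like this (assuming a throwaway "test" schema and a client that can log in without prompting):

# Reproduce the implicit commit caused by ALTER TABLE
mysql test <<'SQL'
CREATE TABLE rollback_me (id INT) ENGINE=InnoDB;
BEGIN;
ALTER TABLE rollback_me ADD COLUMN extra INT;  -- DDL: implicitly commits the transaction
ROLLBACK;                                      -- nothing left to roll back
SHOW COLUMNS FROM rollback_me;                 -- the "extra" column is still there
DROP TABLE rollback_me;
SQL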

Handling a file-server workload with varnish – part 2

I previously wrote about tuning Varnish for a file-server workload. I'd like to continue by detailing what we had to change compared to the Varnish defaults to achieve good and stable performance.

These were the main topics I had mentioned:

  • Threads-related parameters
  • Hash bucket size
  • Session-related parameters
  • Grace config and health checking

Threads-related parameters

One thing is certain: you'd better bump up the minimum number of threads. When Varnish starts up, it creates "n" threads. If more threads are needed, it can create as many as "m" threads, but no more.

"n" is given by thread_pool_min * thread_pools. The defaults are 2 thread pools and a thread_pool_min of 200, so Varnish will create 400 threads at startup. We found that we need at least 6,000 threads, sometimes peaking at 8,000. In that case, it's better to start directly with 7,000-8,000 threads. We set:

  • thread_pools = 8, since we have an 8-core machine
  • thread_pool_min = 800

With these settings, at startup Varnish will create 6,400 threads and keep them running all the time.

We also set a related parameter, thread_pool_add_delay, to 2 ms instead of the default (20 ms, I believe). This allows Varnish to create a lot of threads more quickly when it starts up, or when more threads are needed. Using the default 20 ms value could slow down the thread creation process and prevent Varnish from serving requests quickly enough.
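
These are regular run-time parameters, so they can be passed to varnishd with -p. A sketch of the relevant part of the startup command (the -a/-f/-s values below are made up, use your own):

# Hypothetical startup line; only the -p parameters reflect the values discussed above
varnishd -a :80 \
         -f /etc/varnish/files.vcl \
         -s file,/var/lib/varnish/storage.bin,100G \
         -p thread_pools=8 \
         -p thread_pool_min=800 \
         -p thread_pool_add_delay=2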

Hash bucket size

I don't know much about the hashing internals, but I do know we have tens of millions of files, possibly more, so we have to make sure the hash table used to store cached objects is big enough to prevent too many hashing collisions.

This is controlled by the -h option to varnishd. The default bucket count is 50023. We set it to 500009 (-h classic,500009). This way, even if we kept 10 million files in memory, we would only have about 20 entries in each bucket on average. That's not ideal, but it's better than the default.

We didn't experiment with the new hashing algorithms like critbit.

Session-related parameters

Not so much on this particular server, but in general, we had to bump up the sess_workspace parameter. The default is 16 KB (16384 bytes). sess_workspace controls the amount of memory Varnish dedicates to each connection (a "session" in Varnish speak), used as working memory for HTTP header manipulation. We set it to 32 KB here. On other servers, where we use a more elaborate VCL config, we use 128 KB as the default value.
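
sess_workspace is also a regular parameter, so besides setting it at startup with -p, you can usually change it on a running instance through the management interface. A sketch, assuming the management port is localhost:6082 (adjust to your -T setting, and add -S if your version requires the secret file):

varnishadm -T localhost:6082 param.show sess_workspace
varnishadm -T localhost:6082 param.set sess_workspace 32768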

Grace and health checking

Varnish can check that your defined backends are "healthy", meaning that they respond to probe requests within the defined time and don't miss heartbeats. You enable health checks just by defining a .probe block in your backend definition (search the Varnish wiki for details).
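
Since the details are on the wiki, here is only a rough sketch of what a backend definition with a probe can look like in Varnish 2.x VCL (the host, URL and thresholds below are made-up values):

backend mybackend {
    .host = "10.0.0.10";        # hypothetical backend address
    .port = "80";
    .probe = {
        .url = "/ping";         # any cheap URL the backend answers quickly
        .interval = 5s;
        .timeout = 1s;
        .window = 5;            # look at the last 5 probes ...
        .threshold = 3;         # ... and require at least 3 of them to succeed
    }
}

With that in place, req.backend.healthy (used below) reflects the probe results.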

Having health checks is very convenient: you can instruct Varnish to extend the grace period when/if your backend is dead. This means that if Varnish detects your backends are dead or overloaded and missing heartbeats, it will keep serving stale objects from its cache, even if they have expired (their TTL is already over). You enable this behaviour by saying:

sub vcl_recv {
   set req.backend = mybackend;

   # Default grace period is 10s
   set req.grace = 10s;

   # OMG. Backend dead. Keep serving stuff until we recover them.
   if (! req.backend.healthy) {
      set req.grace = 4h;
   }
    
   ...
}

sub vcl_fetch {

   # Renew cached objects every minute ...
   set obj.ttl = 60s;

   # ... but keep all objects way past their expire date
   # in case we need them because backends died
   set obj.grace = 4h;

   ...

}

That's it. We're continuing to refine our configs and best practices for Varnish servers. If you have feedback, leave a comment or drop me an email.

Primary to secondary master failover

Here's how to fail over from the primary to the secondary master.
This was written following the My Opera case, and we use MySQL, but it should be fairly generic.

Disable monitoring checks

  • Pause any Pingdom checks that are running for the affected systems
  • Set downtime or disable notifications in Nagios for the affected systems
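
How you script this depends on your setup. As one example, Nagios downtime can be scheduled through its external command file; here's a sketch where the host name, duration and command file path are all assumptions:

# Schedule one hour of fixed downtime for host "db1" via the Nagios external command file
now=$(date +%s)
printf '[%s] SCHEDULE_HOST_DOWNTIME;db1;%s;%s;1;0;3600;root;master failover\n' \
    "$now" "$now" "$((now + 3600))" >> /var/lib/nagios3/rw/nagios.cmd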

Record log file and position

Assuming your secondary master (DB2) is idle, record the binary log file and position by issuing a SHOW MASTER STATUS command:

mysql> SHOW MASTER STATUS\G
*************************** 1. row ***************************
            File: mysql-bin.000024    <- MASTER LOG FILE
        Position: 91074774            <- MASTER LOG POSITION
    Binlog_Do_DB:
Binlog_Ignore_DB:

1 row in set (0.00 sec)

Write them down somewhere.
If you need to perform any write or ALTER query on this host, you will have to issue SHOW MASTER STATUS again, because the position will change.

Also try repeating the command a few times: as long as the host is idle, the log file and position should not change between runs.
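
If you prefer to grab the coordinates from a script rather than by hand, something like this works (the hostname matches the example script further down; credentials are assumed to be handled by your client config, e.g. ~/.my.cnf):

# Capture the binlog coordinates from the secondary master into shell variables
read MASTER_LOG_FILE MASTER_LOG_POS < <(
    mysql -h sec-master-hostname -N -e 'SHOW MASTER STATUS' | awk '{print $1, $2}'
)
echo "File: $MASTER_LOG_FILE  Position: $MASTER_LOG_POS"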

Enable maintenance mode

Now is the time to enable your maintenance or downtime mode for the site or service. That will of course depend on your deployment tools.

Stop backend servers

Your backend/application servers might need to stay up and running. For example, in the case of the Auth service we want this, because we're going to serve static responses (HTML, XML, etc.) to the clients instead of just letting the connections hang.

In other cases, it's fine to just shut down the backends. You may want to do this for 2 reasons:

  • to make sure that nobody is accidentally hitting your master database, from your internal network or otherwise
  • because doing so should close all the connections to your master database. This actually depends on the wait_timeout variable in the MySQL server: the connections won't go away until wait_timeout seconds have passed. This is normal behaviour, so don't panic if you still see connections after you shut down the backends (see the quick check below).
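
A quick way to check what is still connected to the old primary, and how long idle connections will linger (hostname borrowed from the failover script below, credentials assumed to be handled by your client config):

# Who is still connected, and how long until idle connections time out?
mysql -h pri-master-hostname -e 'SHOW PROCESSLIST'
mysql -h pri-master-hostname -e "SHOW VARIABLES LIKE 'wait_timeout'"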

Switch to the new master now

This depends on how you actually perform the switch. I can imagine at least two ways to do this:

  • by instructing LVS to direct all connections to the secondary master
  • by taking over the IP address, either manually or using keepalived

On My Opera, we use keepalived with a private group between the two master database servers, so it's just a matter of:

  • stopping keepalived on the primary master database
  • starting keepalived on the secondary master database

Here is a quick and dirty bash script that checks which server is currently the master and performs the switch.

#!/bin/bash

DB1=pri-master-hostname
DB2=sec-master-hostname

function toggle_keepalive() {
        host=$1
        if [[ `ssh $host pidof keepalived` == "" ]]; then
                ssh $host /etc/init.d/keepalived start
                if [[ `ssh $host pidof keepalived` == "" ]]; then
                        echo '*** KEEPALIVED START FAILED ***'
                        echo 'Aborting the master failover procedure'
                        exit 1
                fi
        else
                ssh $host /etc/init.d/keepalived stop
                if [[ `ssh $host pidof keepalived` != "" ]]; then
                        echo '*** KEEPALIVED STOP FAILED ***'
                        echo 'Aborting the master failover procedure'
                        exit 1
                fi
        fi
}

echo "Master Database failover"
echo

# Find out who's the primary master now, and swap them
if [[ `ssh $DB1 pidof keepalived` == "" ]]; then
        PRIMARY=$DB2
        SECONDARY=$DB1
else
        PRIMARY=$DB1
        SECONDARY=$DB2
fi

echo Primary is $PRIMARY
echo Secondary is $SECONDARY

# Shutdown primary first, then enable secondary
toggle_keepalive $PRIMARY
toggle_keepalive $SECONDARY

As soon as you do that, the secondary master will be promoted to primary master.
Since the two masters are assumed to be already replicating from each other, nothing changes for them. It does change, however, for all the slaves that were replicating from the old primary master. We'll see what to do about that later.

Restart backend servers

Now it's the right time to restart the backend servers, and check that they correctly connect to the new primary master.

On My Opera, we're using a virtual address, w-mlb (write-mysql-load-balancer), to refer to the active primary master database. We use this name in the configuration files everywhere.

This means that we don't have to change anything in the backend servers' configuration. We just restart them, and they will connect to the new primary master, thanks to the IP takeover step described above.
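
A quick sanity check at this point, assuming the w-mlb name resolves from wherever you run it, is to ask the virtual address which physical server is actually answering:

# Which physical host is currently behind the virtual write address?
mysql -h w-mlb -e 'SELECT @@hostname'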

Turn off maintenance mode

If the backends are working correctly and connecting to the new master DB, it's time to remove the maintenance page, so do that.

We enable and disable maintenance mode by toggling a virtual host configuration on our frontends and reloading or restarting the frontend httpd servers.

From now on, your application is hopefully up and running and receiving client requests, so your downtime window is over.

Check replication lag

The database slaves at this point are still replicating from the former primary master database (DB1).

But DB1 is not receiving any traffic (queries) anymore, so it's basically idle, as it should be. Any queries hitting DB1 now mean that something is seriously wrong. There might be lingering connections, but there should be no activity.

It's important that all the slaves show no replication lag, so issuing a SHOW SLAVE STATUS command should report zero seconds behind master.

mysql> SHOW SLAVE STATUS\G
*************************** 1. row ***************************
             Slave_IO_State: Waiting for master to send event
                Master_Host: <DB1-ip>
                Master_Port: <port>
              Connect_Retry: 60
            Master_Log_File: mysql-bin.000025
        Read_Master_Log_Pos: 13691126
...
      Seconds_Behind_Master: 0

1 row in set (0.00 sec)

It's important that Seconds_Behind_Master is zero.
If it's not, it means the slave needs more time to replay all the traffic that had been going on on the former primary master, DB1.

Remember that the primary master is now DB2, while DB1 is the secondary master.

Change master on the slaves

Now you can run the CHANGE MASTER TO command on all the slaves.

This is where the MASTER LOG FILE and MASTER LOG POSITION you noted down earlier come into play.

First, stop the slave replication.

mysql> STOP SLAVE;

Then the exact command to issue, if nothing else about your replication changed, is:

mysql> CHANGE MASTER TO MASTER_HOST='<DB2-ip>', MASTER_LOG_FILE='<master_log_file>', MASTER_LOG_POS=<master_log_position>;

Then restart the slave replication:

mysql> START SLAVE;
mysql> SHOW SLAVE STATUS\G

The SHOW SLAVE STATUS\G command above should show that replication is running and, depending on how long it took you to change master after the new master took over the IP, some seconds of replication lag.

This number should rapidly go down towards zero.
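
To watch it go down without retyping the command, something like this does the job (replace <your_slave> as in the example further down):

# Poll the replication lag every 5 seconds until it reaches zero
watch -n 5 "mysql -h <your_slave> -e 'SHOW SLAVE STATUS\G' | grep Seconds_Behind_Master"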

If it's not, then you might have a problem. Go hide now or take the first flight to Australia or something.

We wrote a switch-master Perl script that proved to be very effective and useful. Example:

./switch-master --host <your_slave> --new-master <new_master_ip> --log-file <master_log_file> --log-pos <master_log_position>

This script performs a lot of sanity checks. Before switching master, it checks that the replication lag is zero; if it's not, it waits a bit and checks again, and so on…

It's made to try to prevent disaster from striking. Very useful and quick to use.

Enable monitoring checks

Now verify that everything looks fine: replication lag is zero, your backends are working correctly, and the site behaves normally when you use it a bit.

If everything's fine, enable or unpause the monitoring checks.
You have made it!

Handling a file-server workload with varnish

During the last couple of months, we have been playing with Varnish for files.myopera.com, the main My Opera file server.

I'm not sure this is a typical use case for Varnish (maybe it is), but it has a few unique challenges, which I'll try to explain here:

  • a really high number of connections (in the 10k range)
  • a large file set, ~100 million files or more
  • the longer the TTL, the better (10 days is the default)
  • really simple or no VCL logic

In other Varnish installations we maintain here at Opera, the real challenge is to seamlessly interface with backend application servers, but in this case the "backend" is just another HTTP file server with little extra logic.

Searching around, and using the resources I mentioned a few blog posts ago, we found a few critical settings that need to be tuned to achieve consistent performance levels. Some are obvious, some others are not so obvious:

  • Threads-related parameters
  • Hash bucket size
  • Session-related parameters
  • Grace config and health checking

I'll explain all the settings we had to change and how they affected us in a later post.