Handling a file-server workload with varnish – part 2

In a previous post I wrote about tuning varnish for a file server. I'd like to continue by detailing what we had to change from the varnish defaults to achieve good, stable performance.

These were the main topics I had mentioned:

  • threads related parameters
  • hash bucket size
  • Session related parameters
  • Grace config and health checking

Threads-related parameters

One thing is certain: you'd better bump up the minimum number of threads. When varnish starts up, it creates "n" threads. If more threads are needed, it can create as many as "m" threads, but no more.

"n" is given by thread_pool_min * thread_pools. The defaults are 2 thread pools, and thread_pool_min is 200, so varnish will create 400 threads when starting up. We found that we need at least 6,000 threads, sometimes peaking at 8,000. In this case, it's better to start up directly with 7-8,000 threads. We set:

  • thread_pools = 8, since we have an 8-core machine
  • thread_pool_min = 800

With these settings, on start Varnish will create 6,400 threads and keep them running all the time.

We also set a related parameter, thread_pool_add_delay, to 2 ms instead of the default (20 ms, I believe). This lets Varnish create a lot of threads more quickly at startup, or whenever more threads are needed. The default 20 ms value could slow down thread creation and prevent Varnish from serving requests quickly enough.
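To make the numbers above concrete, here is a small back-of-the-envelope sketch in Python. The function names are made up for illustration, and the ramp-up estimate assumes Varnish creates at most one thread per thread_pool_add_delay interval overall, which is a deliberate simplification and not a description of the real scheduler.

```python
def startup_threads(thread_pools: int, thread_pool_min: int) -> int:
    """Threads created at startup: thread_pool_min in each pool."""
    return thread_pools * thread_pool_min


def ramp_up_ms(current: int, target: int, add_delay_ms: int) -> int:
    """Rough time to grow from `current` to `target` threads.

    Assumption: one new thread per add_delay_ms interval (simplified model).
    """
    missing = max(0, target - current)
    return missing * add_delay_ms


print(startup_threads(2, 200))    # defaults quoted in the post: 400
print(startup_threads(8, 800))    # our settings: 6400
print(ramp_up_ms(400, 6000, 20))  # 112000 ms with a 20 ms delay
print(ramp_up_ms(400, 6000, 2))   # 11200 ms with a 2 ms delay
```

Under this simplified model, reaching 6,000 threads from the default 400 would take almost two minutes with a 20 ms delay, which is why starting with enough threads matters.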

Hash bucket size

I don't know much about the hashing internals, but I know we have tens of millions of files, or even more, so we have to make sure the hash tables used to store cached objects are big enough to prevent too many hash collisions.

This is controlled by the -h option to varnishd. The default number of buckets is 50023. We set it to 500009 (-h classic,500009). This way, even if we could keep 10 million files in memory, we would only have 20 entries in each bucket on average, instead of about 200 with the default. That's not ideal, but it's better than the default.
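The arithmetic behind those figures can be checked with a few lines of Python. This sketch assumes objects are spread evenly across buckets; the function name is just for illustration.

```python
def average_chain_length(objects: int, buckets: int) -> float:
    """Average number of cached objects per hash bucket (uniform spread assumed)."""
    return objects / buckets


for buckets in (50023, 500009):
    print(buckets, round(average_chain_length(10_000_000, buckets)))
# 50023 200
# 500009 20
```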

We didn't experiment with the new hashing algorithms like critbit.

Session-related parameters

This matters less on this particular server, but in general we had to bump up the sess_workspace parameter. The default is 16 KB (16384 bytes). sess_workspace controls the amount of memory varnish dedicates to each connection (a "session" in varnish speak) as working memory for HTTP header manipulation. We set it to 32 KB here. On other servers, where we use a more elaborate VCL config, we use 128 KB.

Grace and health checking

Varnish can check that your defined backends are "healthy". That means that they respond to queries in the defined time, and they don't miss heartbeats. You enable health checks just by defining a .probe block in your backend definition (search the Varnish wiki for details).

Having health checks is very convenient. You can instruct varnish to extend the grace period when your backend is dead. This means that if Varnish detects that your backends are dead or overloaded and missing heartbeats, it will keep serving stale objects from its cache, even if they have expired (their TTL is already over). You enable this behaviour with:

sub vcl_recv {
   set req.backend = mybackend;

   # Default grace period is 10s
   set req.grace = 10s;

   # OMG. Backend dead. Keep serving stuff until we recover them.
   if (! req.backend.healthy) {
      set req.grace = 4h;
   }
    
   ...
}

sub vcl_fetch {

   # Renew cached objects every minute ...
   set obj.ttl = 60s;

   # ... but keep all objects way past their expire date
   # in case we need them because backends died
   set obj.grace = 4h;

   ...

}
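To see how these settings interact, here is a simplified Python model of the decision. It is only a sketch of the idea, not Varnish's actual implementation: it assumes an expired object may still be served as long as its age is within the TTL plus the smaller of the object grace and the request grace. All names are made up for the example.

```python
HEALTHY_GRACE_S = 10          # req.grace when the backend is healthy
SICK_GRACE_S = 4 * 3600       # req.grace when the backend is dead
TTL_S = 60                    # obj.ttl
OBJ_GRACE_S = 4 * 3600        # obj.grace


def request_grace(backend_healthy: bool) -> int:
    """Mirror of the vcl_recv logic: longer grace when the backend is sick."""
    return HEALTHY_GRACE_S if backend_healthy else SICK_GRACE_S


def may_serve_cached(age_s: int, ttl_s: int, obj_grace_s: int, req_grace_s: int) -> bool:
    """Simplified model: serve if within TTL plus the smaller grace window."""
    return age_s <= ttl_s + min(obj_grace_s, req_grace_s)


print(may_serve_cached(65, TTL_S, OBJ_GRACE_S, request_grace(True)))     # True
print(may_serve_cached(100, TTL_S, OBJ_GRACE_S, request_grace(True)))    # False
print(may_serve_cached(3600, TTL_S, OBJ_GRACE_S, request_grace(False)))  # True
```

With a healthy backend, an object is only served up to 10 seconds past its TTL. With a dead backend, the same object stays usable for up to 4 hours.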

That's it. We're continuing to refine our configs and best practices for Varnish servers. If you have feedback, leave a comment or drop me an email.

Primary to secondary master failover

Here's how to failover from primary to secondary master.
This was written following the My Opera case, and we use MySQL, but should be fairly generic.

Disable monitoring checks

  • Pause any Pingdom checks that are running for the affected systems
  • Set downtime or disable notifications in Nagios for the affected systems

Record log file and position

Assuming your secondary master (DB2) is idle, record its current log file and position by issuing a SHOW MASTER STATUS command:


mysql> SHOW MASTER STATUS \G
*************************** 1. row ***************************
            File: mysql-bin.000024    <- MASTER LOG FILE
        Position: 91074774            <- MASTER LOG POSITION
    Binlog_Do_DB:
Binlog_Ignore_DB:

1 row in set (0.00 sec)

Write them down somewhere.
If you need to perform any kind of write/alter query on this host, you have to issue the SHOW MASTER STATUS command again, because the position will change.

Also try repeating this command. You should see that the log file and position do not change between runs.
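If you'd rather not copy the values by hand, the vertical (\G) output is easy to parse. The following Python sketch is just an illustration with hypothetical function names; it works on the text of the output and does not talk to MySQL itself.

```python
import re


def parse_vertical_row(text: str) -> dict:
    """Parse 'Name: value' lines from MySQL vertical (\\G) output into a dict."""
    row = {}
    for line in text.splitlines():
        match = re.match(r"\s*(\w+):\s*(.*?)\s*$", line)
        if match:
            row[match.group(1)] = match.group(2)
    return row


def master_coordinates(text: str) -> tuple:
    """Return (log_file, log_position) from SHOW MASTER STATUS output."""
    row = parse_vertical_row(text)
    return row["File"], int(row["Position"])


SAMPLE = """
*************************** 1. row ***************************
            File: mysql-bin.000024
        Position: 91074774
    Binlog_Do_DB:
Binlog_Ignore_DB:

1 row in set (0.00 sec)
"""

print(master_coordinates(SAMPLE))  # ('mysql-bin.000024', 91074774)
```

Running this twice on fresh output and comparing the two results is a quick way to confirm that the host really is idle.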

Enable maintenance mode

Now is the time to enable your maintenance or downtime mode for the site or service. That will of course depend on your deployment tools.

Stop backend servers

Your backend/application servers might need to stay up and running. For example, in case of the Auth service, we want this, because we're going to serve static responses (html, xml, etc…) to the clients instead of just letting the connections hang.

In other cases, it's fine to just shut down the backends. You may want to do this for 2 reasons:

  • to make sure that nobody is accidentally hitting your master database, from your internal network or otherwise
  • because doing so should close all the connections to your master database. This actually depends on the wait_timeout variable in the mysql server: the connections won't go away until wait_timeout seconds have passed. This is the normal behaviour, so don't panic if you still see connections after you shut down the backends.

Switch to the new master now

This depends on how you actually perform the switch. I can imagine at least 2 ways to do this:

  • by instructing LVS to direct all connections to the secondary master
  • take over the IP address either manually or using keepalived

On My Opera, we use keepalived with a private group between the two master database servers, so it's just a matter of:

  • stopping keepalived on the primary master database
  • starting keepalived on the secondary master database

Here is a quick and dirty bash script that checks which host is currently the master and makes the switch.

#!/bin/bash

DB1=pri-master-hostname
DB2=sec-master-hostname

function toggle_keepalive() {
        host=$1
        if [[ `ssh $host pidof keepalived` == "" ]]; then
                ssh $host /etc/init.d/keepalived start
                if [[ `ssh $host pidof keepalived` == "" ]]; then
                        echo '*** KEEPALIVE START FAILED ***'
                        echo 'Aborting the master failover procedure'
                        exit 1
                fi
        else
                ssh $host /etc/init.d/keepalived stop
                if [[ `ssh $host pidof keepalived` != "" ]]; then
                        echo '*** KEEPALIVE STOP FAILED ***'
                        echo 'Aborting the master failover procedure'
                        exit 1
                fi
        fi
}

echo "Master Database failover"
echo

# Find out who's the primary master now, and swap them
if [[ `ssh $DB1 pidof keepalived` == "" ]]; then
        PRIMARY=$DB2
        SECONDARY=$DB1
else
        PRIMARY=$DB1
        SECONDARY=$DB2
fi

echo Primary is $PRIMARY
echo Secondary is $SECONDARY

# Shutdown primary first, then enable secondary
toggle_keepalive $PRIMARY
toggle_keepalive $SECONDARY

As soon as you do that, the secondary master will be promoted to primary master.
Since the two masters are assumed to be already replicating from each other, nothing will change for them. It will change, however, for all the slaves that were replicating from the primary master. We'll see what to do about that later.

Restart backend servers

Now it's the right time to restart the backend servers, and check that they correctly connect to the new primary master.

On My Opera, we're using a virtual address, w-mlb (write-mysql-load-balancer), to refer to the active primary master database. We use this name in the configuration files everywhere.

This means that we don't have to change anything in the backend servers configuration. We just restart them, and they will connect to the new primary master, due to the IP takeover step described above.

Turn off maintenance mode

If the backends are working correctly and connecting to the new master db, it's time to remove the maintenance page.

We're enabling and disabling maintenance mode by enabling and disabling a virtual host configuration in our frontends and reloading or restarting the frontend httpd servers.

From now on, your application is hopefully up and running and receiving client requests, so your downtime window is over.

Check replication lag

The database slaves at this point are still replicating from the former primary master database (DB1).

But DB1 now is not receiving any traffic (queries) anymore, so it's basically idle, and it should be. Any queries happening on DB1 now mean that something is seriously wrong. There might be lingering connections, but no activity.

It's important that none of the slaves show any replication lag, so a SHOW SLAVE STATUS command should report zero seconds behind the master.

mysql> SHOW SLAVE STATUS \G
*************************** 1. row ***************************
             Slave_IO_State: Waiting for master to send event
                Master_Host: <DB1-ip>
                Master_Port: <port>
              Connect_Retry: 60
            Master_Log_File: mysql-bin.000025
        Read_Master_Log_Pos: 13691126
...
      Seconds_Behind_Master: 0

1 row in set (0.00 sec)

It's important that Seconds_Behind_Master is zero.
If it's not, the slave needs more time to fully replicate all the past traffic that went through the former primary master, DB1.
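This check can be automated by looking for Seconds_Behind_Master in the command output. The Python sketch below is an illustration with made-up function names; it treats a NULL or missing value as "not caught up", since NULL means replication is not running.

```python
import re
from typing import Optional


def replication_lag(status_text: str) -> Optional[int]:
    """Return Seconds_Behind_Master as an int, or None if NULL or missing."""
    match = re.search(r"Seconds_Behind_Master:\s*(\S+)", status_text)
    if match is None or not match.group(1).isdigit():
        return None
    return int(match.group(1))


def is_caught_up(status_text: str) -> bool:
    """True only when the slave reports exactly zero seconds of lag."""
    return replication_lag(status_text) == 0


SAMPLE = """
             Slave_IO_State: Waiting for master to send event
            Master_Log_File: mysql-bin.000025
        Read_Master_Log_Pos: 13691126
      Seconds_Behind_Master: 0
"""

print(is_caught_up(SAMPLE))                           # True
print(is_caught_up("Seconds_Behind_Master: 42"))      # False
print(is_caught_up("Seconds_Behind_Master: NULL"))    # False
```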

Remember that the primary master is now DB2, while DB1 is the secondary master.

Change master on the slaves

Now you can perform the CHANGE MASTER TO command on all the slaves.

This is where you need the MASTER LOG FILE and MASTER LOG POSITION values you wrote down earlier.

First, stop the slave replication.

mysql> STOP SLAVE;

Then the exact command to issue, if nothing else about your replication changed, is:

mysql> CHANGE MASTER TO MASTER_HOST='<DB2-ip>', MASTER_LOG_FILE='<master_log_file>', MASTER_LOG_POS=<master_log_position>;

Then restart the slave replication:

mysql> START SLAVE;
mysql> SHOW SLAVE STATUS \G

The SHOW SLAVE STATUS \G command should now show replication running and, depending on how long it took you to change master after the new master took over the IP, a few seconds of replication lag.

This number should rapidly go down towards zero.

If it doesn't, then you might have a problem. Go hide now or take the first flight to Australia or something.
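When you have many slaves, it helps to generate the statements instead of typing them. Here is a minimal Python sketch of the STOP SLAVE / CHANGE MASTER TO / START SLAVE sequence. The function name and the sample IP address are hypothetical, the values are assumed to come from a trusted source (no SQL escaping is done), and MASTER_LOG_POS is passed as a bare number.

```python
def change_master_statements(new_master_host: str, log_file: str, log_pos: int) -> list:
    """Build the statements to repoint one slave at the new master.

    Assumption: inputs are trusted operator notes, so no escaping is applied.
    """
    if not isinstance(log_pos, int) or log_pos < 0:
        raise ValueError("log_pos must be a non-negative integer")
    change = (
        "CHANGE MASTER TO MASTER_HOST='{host}', "
        "MASTER_LOG_FILE='{file}', MASTER_LOG_POS={pos};"
    ).format(host=new_master_host, file=log_file, pos=log_pos)
    return ["STOP SLAVE;", change, "START SLAVE;"]


# 10.0.0.2 is a placeholder address, not a real host from our setup.
for statement in change_master_statements("10.0.0.2", "mysql-bin.000024", 91074774):
    print(statement)
```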

We wrote a switch-master Perl script that proved to be very effective and useful. Example:

./switch-master --host <your_slave> --new-master <new_master_ip> --log-file <master_log_file> --log-pos <master_log_position>

This script performs a lot of sanity checks. Before switching master, it checks that replication lag is zero. If it's not, it waits a bit and checks again, and so on.

It's made to try to prevent disaster from striking. Very useful and quick to use.
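The "check, wait, check again" idea can be sketched in a few lines of Python. This is not the switch-master script, which is written in Perl and not shown here; it's only an illustration of the retry loop, with hypothetical names. The lag reader and the sleep function are passed in, so the loop can be tried without a database.

```python
import time


def wait_for_zero_lag(get_lag, attempts: int = 10, delay_s: float = 1.0,
                      sleep=time.sleep) -> bool:
    """Poll get_lag() until it returns 0; give up after `attempts` tries."""
    for attempt in range(attempts):
        if get_lag() == 0:
            return True
        if attempt < attempts - 1:
            sleep(delay_s)  # wait a bit before checking again
    return False


# Simulated slave that catches up on the third check (no real sleeping).
lags = iter([5, 2, 0])
print(wait_for_zero_lag(lambda: next(lags), sleep=lambda s: None))  # True

# Simulated slave that never catches up.
print(wait_for_zero_lag(lambda: 30, attempts=3, sleep=lambda s: None))  # False
```

Only when the function returns True would it be safe to go on and change master on that slave.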

Enable monitoring checks

Now verify that everything looks fine: replication lag is zero and your backends are working correctly. Try to use your site a bit.

If everything's fine, enable or unpause the monitoring checks.
You have made it!

Handling a file-server workload with varnish

During the last couple of months, we have been playing with varnish for files.myopera.com, the main My Opera file server.

I'm not sure this is a typical use case for varnish, maybe it is, but it has a few unique challenges which I'll try to explain here:

  • really high number of connections (in the 10k range)
  • large file set, ~100 million files
  • the longer the TTL, the better (10 days is the default)
  • really simple or no VCL logic

In other Varnish installations we're maintaining here at Opera, the real challenge is to seamlessly interface with backend application servers, but in this case the "backend" is just another http file server with little more logic.

Searching around, and using the resources I mentioned some blog posts ago, we have found a few critical settings that need to be tuned to achieve consistent performance levels. Some are obvious, some others are not so obvious:

  • threads related parameters
  • hash bucket size
  • Session related parameters
  • Grace config and health checking

I'll explain all the settings we had to change and how they affected us in a later post.

Looking at Cassandra

When I start to get interested in a project, I usually join the project's users mailing list and read it for a couple of months, to get a general feeling of what's going on, what problems people have, etc…

I became very interested in Cassandra, "A highly scalable, eventually consistent, distributed, structured key-value store".
So, almost 2 months ago, I joined the cassandra-users mailing list.

Turns out that Cassandra is in production at huge sites like Twitter and Digg. Which doesn't really tell you much if you don't know how they use it, and what they use it for. However, guess what? There's a lot of information out there.

Here's a few links I'd like to share about Cassandra:

The MySQL Sandbox

I learned about the MySQL Sandbox last year in Lisbon for the European Perl conference. I remember I talked about it with Giuseppe Maxia, the original author. I promised him to try it.

I wasn't really convinced about it until the first time I tried it.

That was when on My Opera we switched from master-slave to master-master replication. It was during a weekend last November. I was a bit scared of doing the switch. Then I remembered about MySQL Sandbox, and I tried it on an old laptop.

I was absolutely amazed to discover that I could simulate pretty much any setup, from simple to complex, master-slave, master-master, circular multi-master, etc…
It was also very quick to set up, and it's very fast to install new sandboxed servers.
You can set up a 5-server replication setup in less than 30 seconds.

MySQL Sandbox is a Perl tool that you can easily install from CPAN with:

$ sudo cpan MySQL::Sandbox

You get the make_sandbox command that allows you to create new sandboxes. Right now I'm trying it again for a maintenance operation we have to do on My Opera soon. I'm using the master-master setup like this:

make_replication_sandbox --master_master --server-version=/home/cosimo/mysql/mysql-5.0.51a-....tar.gz

so I can simulate the entire operation and try to minimize the risk of messing up the production boxes while also getting a bit more confident about these procedures.

MySQL Sandbox, try it. You won't go back :-)

A geolocating, distributed DNS service, geodns

It's been 2 months now that I formally changed my team from My Opera to Core Services. For most of my day, I still work on My Opera, but I get to work on other projects as well.

One of these projects regarded the browser-ballot screen, even if now it's being used for other purposes as well. It is a very interesting project, named Geodns. It is a DNS server.

Its purpose is not unique or new: create geographically-aware DNS zones. Example (just an example): my.geo.opera.com, a geographically-aware my.opera.com, that sends you to our US-based My Opera servers if your browser presents itself with a US IP address, to Norwegian servers if you are in Norway, etc… So, nothing new or particularly clever. Actually, some argue that DNS shouldn't be used in this way. But it's really convenient anyway…

So this DNS server is written in Perl, and it uses the omnipresent GeoIP library to find out every client IP address position on Earth, and then uses this information to send the client to the appropriate server based on some simple matching rules:

  • by specific country
  • by specific continent
  • if none match specifically, extract a random server from the pool of those available to serve requests that don't match any other rule
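The real server is written in Perl, but the matching order is simple enough to sketch in a few lines of Python. Everything here (function name, rule tables, example.com hostnames) is made up for illustration and is not taken from the geodns code.

```python
import random


def pick_server(country, continent, by_country, by_continent, fallback_pool, rng=random):
    """Apply the rules in order: country, then continent, then random fallback."""
    if country in by_country:
        return by_country[country]
    if continent in by_continent:
        return by_continent[continent]
    return rng.choice(fallback_pool)


# Hypothetical rule tables; the hostnames are placeholders.
BY_COUNTRY = {"US": "us.example.com", "NO": "no.example.com"}
BY_CONTINENT = {"EU": "eu.example.com"}
FALLBACK = ["a.example.com", "b.example.com"]

print(pick_server("NO", "EU", BY_COUNTRY, BY_CONTINENT, FALLBACK))  # no.example.com
print(pick_server("IT", "EU", BY_COUNTRY, BY_CONTINENT, FALLBACK))  # eu.example.com
print(pick_server("JP", "AS", BY_COUNTRY, BY_CONTINENT, FALLBACK))  # one of the fallback servers
```

In the real service, the country and continent would come from a GeoIP lookup on the client IP address.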

I also made geodns log to a special file that lets us use our own OpenGL engine to display realtime DNS hits on a photo-realistic 3D Earth.

In this picture you can see blue and red dots. The higher the spikes, the more requests there are. Blue is where our datacenters are, red is where clients are sending requests from.

I'm trying to get this published as open source, even if, as I said, it's not really complex or anything. Just a good out-of-the-box solution. It's been running in production for about 3 weeks now, and it's serving around 300 requests per second on average. Stable and fast, but now we're looking at increasing the traffic. My goal is to reach at least 2000 req/s on a single machine. We'll see… :)

Varnish “sess_workspace” and why it is important

When using Varnish on a high traffic site like opera.com or my.opera.com, it is important to reach a stable and sane configuration (both VCL and general service tuning).

If you're just starting using Varnish now, it's easy to overlook things (like I did, for example :) and later experience some crashes or unexpected problems.

Of course, you should read the Varnish wiki, but I'd suggest you also read at least the following links. I found them to be very useful for me:

A couple of weeks ago, we experienced some random Varnish crashes, 1 per day on average. That happened during a weekend. As usual, we didn't really notice that Varnish was crashing until we looked at our Munin graphs. Once you know that Varnish is crashing, everything is easier :)

Just look at your syslog file. We did, and we found the following error message:

Feb 26 06:58:26 p26-01 varnishd[19110]: Child (27707) died signal=6
Feb 26 06:58:26 p26-01 varnishd[19110]: Child (27707) Panic message: Missing errorhandling code in HSH_Prepare(), cache_hash.c line 188:#012  Condition((p) != 0) not true.  thread = (cache-worker)sp = 0x7f8007c7f008 {#012  fd = 239, id = 239, xid = 1109462166,#012  client = 213.236.208.102:39798,#012  step = STP_LOOKUP,#012  handling = hash,#012  ws = 0x7f8007c7f078 { overflow#012    id = "sess",#012    {s,f,r,e} = {0x7f8007c7f808,,+16369,(nil),+16384},#012  },#012    worker = 0x7f82c94e9be0 {#012    },#012    vcl = {#012      srcname = {#012        "input",#012        "Default",#012        "/etc/varnish/accept-language.vcl",#012      },#012    },#012},#012
Feb 26 06:58:26 p26-01 varnishd[19110]: Child cleanup complete
Feb 26 06:58:26 p26-01 varnishd[19110]: child (3710) Started
Feb 26 06:58:26 p26-01 varnishd[19110]: Child (3710) said Closed fds: 3 4 5 10 11 13 14
Feb 26 06:58:26 p26-01 varnishd[19110]: Child (3710) said Child starts
Feb 26 06:58:26 p26-01 varnishd[19110]: Child (3710) said Ready
Feb 26 18:13:37 p26-01 varnishd[19110]: Child (7327) died signal=6
Feb 26 18:13:37 p26-01 varnishd[19110]: Child (7327) Panic message: Missing errorhandling code in HSH_Prepare(), cache_hash.c line 188:#012  Condition((p) != 0) not true.  thread = (cache-worker)sp = 0x7f8008e84008 {#012  fd = 248, id = 248, xid = 447481155,#012  client = 213.236.208.101:39963,#012  step = STP_LOOKUP,#012  handling = hash,#012  ws = 0x7f8008e84078 { overflow#012    id = "sess",#012    {s,f,r,e} = {0x7f8008e84808,,+16378,(nil),+16384},#012  },#012    worker = 0x7f81a4f5fbe0 {#012    },#012    vcl = {#012      srcname = {#012        "input",#012        "Default",#012        "/etc/varnish/accept-language.vcl",#012      },#012    },#012},#012
Feb 26 18:13:37 p26-01 varnishd[19110]: Child cleanup complete
Feb 26 18:13:37 p26-01 varnishd[19110]: child (30662) Started
Feb 26 18:13:37 p26-01 varnishd[19110]: Child (30662) said Closed fds: 3 4 5 10 11 13 14
Feb 26 18:13:37 p26-01 varnishd[19110]: Child (30662) said Child starts
Feb 26 18:13:37 p26-01 varnishd[19110]: Child (30662) said Ready

A quick search led me to sess_workspace.

We found out we had to increase the default (16 KB), especially since we're doing quite a bit of HTTP header copying and rewriting. In fact, all of that work for a session has to fit in a memory space of at most sess_workspace bytes.

If you happen to need more space, maybe because clients are sending long HTTP header values, or because you are (like we are) writing lots of additional varnish-specific headers, then Varnish won't be able to allocate enough memory. It will just log the failed assert condition to syslog and kill the child process, as shown above.
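The panic message itself shows how close to the limit the workspace was. Assuming the `{s,f,r,e}` fields are the start, free, reserved and end pointers of the workspace (my reading of the output, not something stated in the log), the two `+N` numbers are the bytes used and the total size. A small Python sketch with hypothetical names to pull them out:

```python
import re


def workspace_usage(panic_text: str):
    """Return (used_bytes, total_bytes) from a panic's {s,f,r,e} line, or None."""
    match = re.search(
        r"\{s,f,r,e\} = \{0x[0-9a-f]+,+\+(\d+),\(nil\),\+(\d+)\}", panic_text
    )
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


# Fragment copied from the first panic message above.
PANIC = (
    'ws = 0x7f8007c7f078 { overflow#012    id = "sess",#012    '
    "{s,f,r,e} = {0x7f8007c7f808,,+16369,(nil),+16384},#012  },"
)

used, total = workspace_usage(PANIC)
print(used, total, total - used)  # 16369 16384 15
```

With only 15 bytes left out of 16384, the next header allocation could not fit, which matches the "overflow" marker in the message.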

So, we bumped sess_workspace to 256 KB by setting the following in the startup file:


-p sess_workspace=262144

Since then we haven't had any more crashes.

More varnish, now also on www.opera.com

I have been working on setting up and troubleshooting Varnish installations quite a bit lately. After deploying Varnish on My Opera for many different uses, namely APIs, avatars, user pictures and the frontpage, we also decided to try using Varnish on www.opera.com.

While "www" might seem much simpler than My Opera, it has its own challenges.
It doesn't have logged-in users or user-generated content but, as with My Opera, a single URL is used to generate many (slightly) different versions of the content. Think about older versions of Opera, or the newest ones (betas, 10.50), mobile browsers, Opera Mini, the site in "Mobile view", different languages, etc…

That makes caching with Varnish tricky, because you have to consider all of these variables, and instruct Varnish to cache each of these variations separately. No doubt opera.com in this respect is even more difficult than My Opera.

So, we decided to:

  • cache only the most trafficked pages (for now only the Opera startup page)
  • cache them only for Opera 10.x browsers
  • differentiate caching by specific version (the "x" in 10.x)
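Our actual rules live in VCL, but the decision can be illustrated with a short Python sketch. Two things here are assumptions made for the example: the startup page path is a placeholder, and the browser version is read from a "Version/10.xx" token in the User-Agent string.

```python
import re

# Placeholder path for the startup page; the real URL is not given in this post.
STARTUP_PATH = "/startup/"


def cache_variant(path: str, user_agent: str):
    """Return a cache variant key, or None if the request should not be cached."""
    if path != STARTUP_PATH:
        return None                      # only the most trafficked page
    match = re.search(r"Version/10\.(\d+)", user_agent)
    if match is None:
        return None                      # only Opera 10.x browsers
    return "opera-10." + match.group(1)  # one cached copy per 10.x version


UA_1050 = "Opera/9.80 (Windows NT 6.1; U; en) Presto/2.5.22 Version/10.50"

print(cache_variant("/startup/", UA_1050))        # opera-10.50
print(cache_variant("/download/", UA_1050))       # None
print(cache_variant("/startup/", "Mozilla/5.0"))  # None
```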

We basically used the same Varnish config as My Opera, with the accept-language hack, changing only the URL-specific logic. With this setup, we managed to cut down around 15% of backend requests on opera.com.

URL shortening in Ubiquity for Opera too…

Given the cool recent activity around url shortening in Opera, I thought I could also give my small contribution.

In fact, a url shortening command is missing in Ubiquity for Opera. Or rather, it was missing.

There's a new command, shorten-url, based on bit.ly's API, that allows you, as usual with Ubiquity commands, to shorten the current open tab URL, or shorten any URL you type in the Ubiquity window. Here you can see a screenshot as example:

This new command also uses the amazing ajax-enabling UserJS library by xErath. Another interesting bit of news is that from now on, I'll use YUI Compressor to also ship a minified version of the ubiquity javascript code. It almost halves the size, which is good, since we're already at ~80 KB uncompressed.

As usual, you can download Ubiquity for Opera (or the minified version), or go to the Ubiquity for Opera github repository.

Enjoy :-)