Tag Archives: performance

Net::Statsd::Server, a Perl port of Flickr/Etsy’s statsd

If you’re looking for a Perl client to connect to a statsd daemon, checkout Net::Statsd on CPAN, now at version 0.08.

This post is about the server component of statsd.

Tracking metrics: up to now

The idea of statsd started in Flickr by Cal Henderson, and some code is still available, but it’s not very functional or complete.

Since reading about statsd, I found the concept brilliant. I have been using a similar technique long before hearing about statsd though. I learned it from colleagues here at Opera in 2008. They were using it to track application metrics for the Opera Link server. I thought it was great, so I also implemented it, extending it by making it very easy to add metrics and to see the output automatically in Munin. Here’s how it worked basically:

# ...
use Opera::Stats;
# ...
Opera::Stats::count("site.logins");
# ...

The project code would have typically tens or hundreds of these calls. Each call would store/increment a counter in a local or remote memcached. Then a complementary Opera::Stats::Munin module would automatically generate the output needed to implement a full Munin plugin given the metrics to be exposed.

So far, so good. Except there were a few things that didn’t work quite right:

  • Using TCP connections, maybe even to remote machines, even though it was never a problem, could be in case the memcached machines went down
  • Volume was a concern. I had to worry about tracking too many metrics. How would that affect functioning of memcached for regularly stored keys and values? Would those metrics-related keys cause evictions in the regular memcached content?
  • Even though the munin integration made it very easy to have charts, there were still some limitations: creating new charts requires some wrapper plugin with 1 or 2 lines of Perl code. Flexibility was also an issue.

Enter statsd

I have been thinking of replacing this system with statsd for a while. However, I wanted to have a more in-depth look at it before deploying it.

Turns out that statsd is a simple project, which I like, but requires nodejs. Knowing next to nothing about nodejs, I took some time to learn a few things.

I also realized I have been wanting to learn about AnyEvent for a long time.

Net::Statsd::Server

Two weeks ago, I spent a busy weekend reimplementing 95% of statsd in Perl. On Sunday night, I had a functional version of statsd written in Perl with AnyEvent.

AnyEvent stuff is surprising at times. I found especially interesting to debug the cases where your timer (AE::timer) doesn’t fire unless you actually save it to a scalar, as in:

# This won't fire!
AE::timer 10, 10, \&do_something;

# This will though.
# This behaviour is triggered by "defined wantarray"
my $t = AE::timer 10, 10, \&do_something;

Since that weekend, I have spent a few more nights tweaking Net::Statsd::Server. Yesterday I wrote a new piece of functionality (a new “File” backend) that is actually not in the original statsd.

It looks like I might need new backends as well, so I think it’s “an investment with a good ROI”, even though I did it mainly for fun and in my free time.

Performance

I wanted to make sure my statsd server implementation would be fast. I started by bringing up the nodejs statsd and firing my official benchmark script with 1 million iterations, and then comparing the results with my own statsd server.

That didn’t work out very well. Or rather, it worked out brilliantly, showing around 40K requests/s being handled by nodejs-statsd and 50K requests/s by Net::Statsd::Server. Problem is: how do you measure the performance of a UDP server? Or, for that matter, of a UDP client?

I figured out that, being UDP connection-less fire-and-forget, it doesn’t really matter how many packets/s the client fires, as long as you can generate more than your server can handle. Just as a data point, I reached around 73-75k statsd API calls per second (for the gauge API, around 55-58k for counters and timers). What really matters is how many packets reach the server.

BTW, I used another amazing piece of software called Devel::NYTProf to optimize the performance of the incoming packets code path as much as I could.

The test setup

To measure how many packets are received on the server-side, I prepared a test configuration:

{ graphitePort: 2003
, graphiteHost: "graphite.localdomain"
, host: "0.0.0.0"
, port: 8125
, backends: [ "./backends/graphite", "./backends/console" ]
, mgmt_address: "0.0.0.0"
, mgmt_port: 8126
}

The same configuration file for the Perl server becomes:

{ "graphitePort": 2003,
  "graphiteHost": "graphite.localdomain",
  "host" : "0.0.0.0",
  "port": 8125,
  "mgmt_address" : "0.0.0.0",
  "mgmt_port": 8126,
  "backends": [ "Graphite", "Console" ],
  "log" : {
    "backend" : "stdout",
    "level" : "LOG_WARN",
  }
}

Using the benchmark.pl code mentioned above, run with:

$ perl benchmark.pl 1000000

I started up first the nodejs statsd, then the Net::Statsd::Server daemon and captured their output. Both servers are configured to use their Graphite backend and flush to a valid and active graphite host. The Console backend is also active for both servers, so I could capture the output and look at the statsd.packets_received counter and directly measure how many packets are received in the server.

The benchmark utility with first argument = 1000000 generates 5 million statsd API calls, that is, 5 million UDP packets.

Of these 5 million packets, nodejs statsd was able to capture 2106768, 1596275, 1479145 and 1490640 packets over several runs.

Net::Statsd::Server, again in 3 different runs, was able to capture 2106242, 1884810, 1822042 and 1866500 packets.

I have performed more tests, and they had a very low deviation from the last runs (1.5M for etsy’s statsd and 1.8M for Net::Statsd::Server). Removing the 2 peak results of ~2.1Mb, it would seem that the Perl statsd is capable of receiving 22% more packets than the original statsd daemon written in javascript.

Of course, this is just my test. I have tried to run the test on different hardware, but I haven’t got significantly different results. If you try yourself, please let me know what numbers you get. I’d be curious to know :-)

SO_RCVBUF

Given the massive amount of UDP packets that were lost in the tests (50%+ in the best runs), I tried to figure out a way to improve this and I stumbled on SO_RCVBUF.

My understanding was that bumping up SO_RCVBUF on the listening UDP socket would dramatically decrease packet loss. However, I hadn’t been able to prove the theory because I hadn’t seen an improvement in the total number of packets received. At least until I read this article on UDP packet loss on stackoverflow.com, that pointed me to the net.core.rmem_max sysctl.

After modifying net.core.rmem_max, setting it to 100M, just to avoid its effect, and using the following code in Net::Statsd::Server:

# Bump up SO_RCVBUF on UDP socket, to buffer up incoming
# UDP packets, to avoid massive packet loss when load is very high.
setsockopt($self->{server}->fh, SOL_SOCKET, SO_RCVBUF, 1*1024*1024)
or die "Couldn't set SO_RCVBUF: $!";

I can see some very interesting effect.

Re-running the node.js statsd, I could see an increased amount of captured packets (1691700, 1675902, ~10% increase).
Running again the Net::Statsd::Server daemon, I recorded 2678507 and 2477246 packets, for an impressive ~40% increase!

As a last effort, I tried varying the SO_RCVBUF size from 1 to 64Mb to see what effect it had on the amount of captured packets (or UDP packet loss if you prefer).

I haven’t run any scientific set of tests, but I can’t see any statistically significant increase for values greater than 4-8Mb, so I haven’t decided where to set the default in Net::Statsd::Server yet. Any chosen value is likely to need specific sysctl tuning anyway, so YMMV.

Why?

Did I really do it for fun? Yes, mainly, but also because:

  • I don’t like adding node.js to our production stack just to run statsd. I have never operated a node.js server, so I don’t want to take this “risk”. The product we’re building is going live soon! :-) And note that this does apply to anything, it’s not about node.js per se :-)
  • to learn how statsd was put together
  • to learn AnyEvent
  • to learn how to build a high performance UDP server
  • Basically, to learn :-)

Code is up on CPAN, as usual: https://metacpan.org/module/Net::Statsd::Server.

If you happen to use it, please give me some feedback!

Problems with bnx2 kernel module and high traffic

We're seeing an "elevated" level of traffic these days on the My Opera servers. As usual with operations matters, it's difficult to find one exact clear root cause. The rest of the post explains what we found and the fix for it.

TL;DR

You want to try options bnx2 disable_msi=1 in your /etc/modprobe.d/bnx2.conf if:

  • using squeeze and bnx2 version is 2.0.2
  • you see high traffic (10K+ connections)
  • you see errors on public network interface
  • server is dropping packets/connections randomly or it's really slow

The gory details

During last Tuesday the DDoS attack (that is still continuing now) on the My Opera servers ramped up from ~4k req/s/frontend to ~16k+ req/s/frontend. Both frontends were dist-upgraded (including a kernel upgrade) on May 23rd, but not rebooted, so the kernel update was armed but not actually live.

We started seeing these bad problems of dropped connections and general slowness after the frontend servers were rebooted. The reason why there were rebooted is because we have been hitting another really weird problem, the 210 days uptime timer bug. See this and this bug reports for more details.

Anyway, I'm not sure how to verify this, because I didn't restart the boxes myself, but my theory is after they were rebooted, the new bnx2 kernel module version 2.0.2 was loaded.

Then later on we found out about this very specific bnx2 v2.0.2 bug that only triggers in high traffic situations, at least on Debian Squeeze and Ubuntu, that causes network interfaces to stop working correctly, dropping traffic.

Long story short, there's a magic option that prevents this from happening. rmmod'ing and modprobing back the bnx2 module with this option fixed the problem so far.

# /etc/modprobe.d/bnx2.conf
options bnx2 disable_msi=1

Regarding what the option is about, I'm not even going to lie about it. I have no idea… We found it with this search:

https://encrypted.google.com/search?client=opera&rls=en&q=bnx2+debian+2.0.2+traffic&sourceid=opera&ie=utf-8&oe=utf-8&channel=suggest

First hit is our own Sven from sysadmin team:

http://lists.us.dell.com/pipermail/linux-poweredge/2011-October/045485.html

Second hit is the solution we used:

http://ubuntuforums.org/archive/index.php/t-1726045.html

We also did some tweaking for the large amount of TIME_WAIT connections that were resulting from this bnx2 bug, namely bumped up net.sys.ipv4.tcp_max_tw_buckets quite a bit.

Take aways

  1. Before rebooting a machine, check what's going to happen, when was last upgrade etc…, f.ex. /var/log/dpkg.log.
  2. In case you have firewall rules, iptables-save > /root/iptables-rules.YYYYMMDD and later restore if needed with iptables-restore < iptables-rules.YYYYMMDD
  3. Always check if the conntrack module is enabled. Most times you don't need it, and it will cause performance to drop under very high traffic (of course).

In this case what happened is that the conntrack module was accidentally also re-enabled by the reboot. We had previously disabled it, but didn't make the change permanent. This is because on My Opera we're still not using our config management infrastructure… Looking forward to make that happen. Soon. Hopefully :)

My experience at Velocity Europe 2011 in Berlin

TL;DR

This year there was the 1st edition of Velocity Europe. I got to present a talk on a DDoS attack we faced at Opera, and it was really awesome to be there.

The long version…

Around July this year I knew there was going to be a Velocity Conference in Europe, and I decided I would try to propose a few ideas for talks. I didn't have my hopes too high, but I wanted to give it a shot anyways, pushing myself way out of my comfort zone :)

The worst that could happen was that the talks didn't get accepted. After a month or so the crazy thing happened, and I got an invite to speak at Velocity, due in November.

Preparation

The first few months passed while I was slowly gathering material for the talk. The idea was talking about the DDoS attack that struck us in October 2010. Almost a year had passed, so if we hadn't taken notes and collected all sorts of logs and information, we wouldn't have had any chance to reconstruct all the "story" with enough detail to be interesting.

Anyway, weeks went by, and in September I started writing down an outline of the talk. It consisted in describing what happened during the DDoS and how we faced it, what we did, how we figured out what to do, etc… but I didn't have a clear idea of what to convey with the actual presentation. What would be the core message, if any?

If I learned anything out of all this, is that writing an outline is absolutely the best favor you can do to yourself to avoid so many problems later on. Just write it down as a text, a blog post, a story. Mind maps are also useful for me.

Last 3-4 weeks flew away while I was trying to put together a decent deck of slides.

In Berlin: pre-conference

"Birds of feathers" was the pre-conference event that took place on Monday 7th (November 2011, if you're reading this in the future), put together by a local team led by Schlomo Schapiro, which I got to talk to also during the conference. It was a good event, Steve Souders and John Allspaw and many other conference attendees were already there. There were sponsor companies presenting their products.

The most interesting sessions of the pre-conference IMO were:

  • 100ms: Steve Souders pushed everyone to think about the next level of web performance. How to bring down the "loading time" of web pages to 100ms. There was an interesting discussion about that. My point was that loading time really needs to be divided into at least dns resolution, server processing, network transfers, client rendering. So there's at least 4 totally different chunks that make up the load time and all of them can be optimized, but with varying levels of gain and complexity.
  • Dyn inc presentation about their product dashboard, that led to a better productivity and communication between teams. Cory van Wollerstein explained their mash-up of Jira and Confluence, used to automatically pull information from the tickets db and provide high-level overviews to executive teams. Very cool. He also argued whether having product managers is a good thing for a company.

The rest of the day I was busy polishing my presentation, and trying to rehearse at the hotel. A month before the conference, I had bought The Naked Presenter (ebook edition), hoping that it would help me do a decent presentation. The book of course recommended to rehearse. It felt very weird and embarassing, but I'm *so* glad I did it. I managed to streamline the presentation, and memorize the sequence of slides.

The Conference – Day 1

Schedule:

http://velocityconf.com/velocityeu/public/schedule/grid/2011-11-08

Keynotes

Opening remarks, plus Theo Schlossnagle, one of the minds behind SurgeCon, on how good operations dudes are usually generalists and need to have a wide spectrum knowledge instead of being "(Perl|Python|Ruby|Java) developers". I really recognize myself in this more generalist role than, for example, the Ruby-on-Rails guy.

Lightning demos

These were lightning demos during the first morning:

Rest of Day 1 went to hell

I had to convert all my slides to 4:3 and test again with the on-site equipment. I was also freaking out at the same time, so I missed everything else until my talk. Sorry :)

Most talks have been recorded and are already up on the Velocity site. Particularly interesting IMO, but video not available yet, are:

My talk

As I said, it was about the DDoS attack to my.opera.com of October 2010. I basically talked about how we found out we were under DDoS, and how we struggled to find our way to keep the site up and running despite the traffic. This was a mid-scale DDoS with around 18k distinct IPs attacking us. We had a hard time, but it was also very much fun in retrospect :) We learned quite a lot in the process, about HTTP and TCP/IP, nkiller2 and the TCP zero-window exploit. Most importantly, we learned to make better use of old and new tools to do troubleshooting. You will find all of this in the slides.

I did my best, and I think it was well received by the audience. While on stage, I really had the feeling that people enjoyed it, plus several folks came to say hello afterwards. One of the most frequent comments I heard was that people found my talk honest. That is the single thing I appreciate the most, because that had been one of my goals since the start. To tell an honest and detailed story of how things went, without pretending to be the super awesome heroes that know everything and can fix anything in no time.

Unfortunately, after the conference I was informed that there had been no recording of the talk. That is really sad. However, since there's no recording, I can pretend I was a nice speaker, given the ratings :). Seriously, if you have a picture or video recording, contact me :)

Here's the slides if you're interested:

http://velocityconf.com/velocityeu/public/schedule/detail/21653

The Conference – Day 2

Schedule:

http://velocityconf.com/velocityeu/public/schedule/grid/2011-11-09

Keynotes

Very inspirational talk by Jeff Veen, Typekit.com

Very well presented, great visuals. Great overall. How to create conditions for teams to work and work well.

http://velocityconf.com/velocityeu/public/schedule/detail/21788

Anticipation: What could possibly go wrong? by John Allspaw, Etsy

A great talk about how to prevent, analyze, respond to Operations problems. I very much like John's style, I think he's a pioneer, at least he introduced me to many great ideas, one above all, continuous deployment. I also like his many references to aviation, aerospace and military engineering fields.

http://velocityconf.com/velocityeu/public/schedule/detail/22258

Full stack awareness, Artur Bergman, Fastly

He's Artur Bergman. Listen to him :-) If anything, because he's really authentic.

http://velocityconf.com/velocityeu/public/schedule/detail/22914

Lightning demos

Another session of lightning demos, for our pleasure:

Browser performance track

This was a track in itself. I lost all of it, since I mainly followed the Operations track, but this was really interesting I heard. Recent speed enhancements in Opera, Chrome, Firefox and Javascript in general were explained in detail.

Afternoon talks

Deploying large payloads at scale, Ramon van Alteren (hyves.nl)

Biggest social network in the Netherlands (4M daily active users, ~10M total users). Ramon is a very cool guy. They have 3.5K servers, and their main application consists of 750Mb compiled php binaries to deploy. And they are experimenting with bittorrent tools to do that :)

I had a few hours of engaging talk with Ramon at one of the social events that followed the conference. We found lots of similarities in how we're dealing with infrastructural growth, scaling, etc… We both use config management tools like puppet extensively in our organizations. We promised each other to remain in touch about deployment matters.

http://velocityconf.com/velocityeu/public/schedule/detail/21571

HTTP connection management from 10 users to 100 millions, Bradley Heilbrun, YouTube

Really interesting dive into YouTube early (2005-2007) architecture with Apache, load balancers, GSLB.

I met Bradley later on that day and we had a quick chat. Turns out they use(d) PowerDNS with its pipe backend for geographic load balancing, much like as we do in Opera with GeoDNS. That made my day :-) It's a pity that companies like YouTube don't talk much about their current technology. They usually tell you about 2-3 years old architectures. That's still very valuable, of course.

http://velocityconf.com/velocityeu/public/schedule/detail/21708

Conclusions

If you're even remotely interested in operations, devops, running a service, scaling, performance, infrastructure, then Velocity is the conference. Surge is another one, probably even better, more hardcore-engineering focused. From my perspective, there's a couple of things that could be improved:

  • while I understand that sponsors are what makes conferences like Velocity possible, some sponsors took too much time out of the actual talk tracks. One or two talks were very promotional in nature, and it was clear to everyone that these companies were pushing their products or themselves. Maybe it wasn't their intention, but to me and to others I talked to, it came out that way.
    I think Velocity needs to screen better this type of talks and separate them from the authentic content that people want, the "stories from the trenches". As a counter-example, Google, among other companies, were doing sponsoring (and recruiting!) activities in a separate hall. That worked very well for everyone. Please let's keep it that way.
  • the on-site technical team wasn't fully prepared to handle presentations made with Open Office. That is not acceptable if you ask me, even if the majority of speakers have a Mac. It's 2011 (2012 now even), so you really need to be prepared to read OpenOffice files. I realize that wasn't Velocity organizers' fault, but I think it's something to consider for next time.

That said, I'm really really happy about my experience at Velocity Europe, both as a speaker and as attendee. It was really awesome, and worth every moment I spent working to prepare for it. Thank you O'Reilly, and I hope to be able to participate again some day :)

How to detect TCP retransmit timeouts in your network

Some months ago, while investigating on a problem in our infrastructure, I put together a small tool to help detecting TCP retransmits happening during HTTP requests.

TCP retransmissions can happen, for example, when a client sends a SYN packet to the server, the server responds with a SYN-ACK but, for any reason, the client never receives the SYN-ACK. In this case, the client correctly waits for a given time, called the TCP Retransmission Timeout. In the usual case, this time is set to 3 seconds.

There's probably a million reasons why the client may never receive a SYN-ACK. The one I've seen more often is packet loss, which in turn can have lots of reasons, for example a malfunctioning or misconfigured network switch.

However, you can immediately spot if your timeout/hang problems are caused by TCP retransmission because they happen to cause response times that are unusually frequently distributed around 3, 9 and 21 seconds (and on, of course).

In fact, the TCP retransmission timeout starts at 3 seconds, but if the client tries to resend after a timeout and still receives no answer, it doubles the wait to 6 s, so the total response time will be 9 seconds, assuming that the client now finally receives the SYN-ACK. Otherwise, 3 + 6 + 12 = 21, then 3 + 6 + 12 + 24 = 45 s and so on and so forth.

So, this little tool fires a quick batch of HTTP requests to a given server and measures the response times, highlighting slow responses (> 0.5s). If you see that the reported response times are 3.002s, 9.005s or similar, then you are probably in presence of TCP retransmission and/or packet loss.

Finally, here it is:


#!/usr/bin/env perl
#
# https://gist.github.com/1101500
#
# Fires HTTP request batches at the specified hostname
# and analyzes the response times.
#
# If you have suspicious frequency of 3.00x, 9.00x, 21.00x
# seconds, then most probably you have a problem of packet loss
# in your network.
#
# cosimo@opera.com, sometime in 2011
#

use strict;
use LWP::UserAgent ();
use Time::HiRes ();

$| = 1;

my $ua = LWP::UserAgent->new();
$ua->agent("$0/0.01");

# Tests this hostname
my $server = $ARGV[0] || die "Usage: $0 <hostname>n";

# Picks the URLs to test in this list, one after the other
my @url_pool = qw(
	/ping.html
);

my $total_reqs = 0;
my $total_elapsed = 0.0;
my $n_pick = 0;
my $url_to_fire;

my $max_elapsed = 0.0;
my $max_elapsed_when = '';
my $failed_reqs = 0;
my $slow_responses = 0;
my $terminate_now = 0;

sub output_report {
	print "Report for:            $server at " . localtime() . "n";
	printf "Total requests:        %d in %.3f sn", $total_reqs, $total_elapsed;
	print "Failed requests:       $failed_reqsn";
	print "Slow responses (>1s):  $slow_responses (slowest $max_elapsed s at $max_elapsed_when)n";
	printf "Average response time: %.3f s (%.3f req/s)n", $total_elapsed / $total_reqs, $total_reqs / $total_elapsed;
	print "--------------------------------------------------------------------n";
	sleep 1;
	return;
}

$SIG{INT} = sub { $terminate_now = 1 };

while (not $terminate_now) {

	$url_to_fire = "http://" . $server . $url_pool[$n_pick];

	my $t0 = [ Time::HiRes::gettimeofday() ];
	my $resp = $ua->get($url_to_fire);
	my $elapsed = Time::HiRes::tv_interval($t0);

	$failed_reqs++ if ! $resp->is_success;

	$total_reqs++;
	$total_elapsed += $elapsed;

	if ($elapsed > $max_elapsed) {
		$max_elapsed = $elapsed;
		$max_elapsed_when = scalar localtime;
		printf "[SLOW] %s, %s served in %.3f sn", $max_elapsed_when, $url_to_fire, $max_elapsed;
	}

	$slow_responses++ if $elapsed >= 0.5;
	$n_pick = 0       if ++$n_pick > $#url_pool;
	output_report()   if $total_reqs > 0 and ($total_reqs % 1000 == 0);

}
continue {
    Time::HiRes::usleep(100000);
}

output_report();

# End

It's also published here on Github, https://gist.github.com/1101500. Have fun!

Handling a file-server workload with varnish – part 2

I wrote about tuning varnish for a file server. I'd like to continue, detailing what we had to change compared to the varnish defaults to achieve a good and stable performance.

These were the main topics I had mentioned:

  • threads related parameters
  • hash bucket size
  • Session related parameters
  • Grace config and health checking

Threads-related parameters

One thing is certain. You'd better bump up the minimum threads number. When varnish starts up, it creates "n" threads. If more threads are needed, it can create as much as "m" threads but no more.

"n" is given by thread_pool_min * thread_pools. The defaults are 2 thread pools, and thread_pool_min is 200, so varnish will create 400 threads when starting up. We found that we need at least 6,000 threads, sometimes peaking at 8,000. In this case, it's better to start up directly with 7-8,000 threads. We set:

  • thread_pools = 8 since we have a 8 cores machine
  • thread_pool_min = 800

With these settings, on start Varnish will create 6,400 threads and keep them running all the time.

We also set a related param, thread_pool_add_delay to 2 ms, instead of the default I believe 20 ms. This allows Varnish to create a lot of threads more quickly when it starts up, or when more threads are needed. Using the default 20 ms value could slow down the threads creation process, and prevent Varnish to serve requests quickly enough.

Hash bucket size

Don't know much about the hashing internals, but I know we have tens of millions of files, even more, so we have to make sure the hash tables used to store cached objects are big enough, to prevent too many hashing collisions.

This is controlled by the -h option to varnishd. The default bucket size is 50023. We set it to 500009 (-h classic,500009). In this way, even if we could keep 10 million files in memory, we would only have 20 entries in each bucket on average. That's not ideal, but it's better than the default.

We didn't experiment with the new hashing algorithms like critbit.

Session-related parameters

Not so much on this particular server, but in general, we had to bump up the sess_workspace parameter. The default is 16kbytes (16384). sess_workspace controls the amount of memory dedicated by varnish to each connection (session in varnish speak), that is used as a working memory for the HTTP header manipulations. We set it to 32k here. On other servers, where we use a more elaborate VCL config, we use 128k as the default value.

Grace and health checking

Varnish can check that your defined backends are "healthy". That means that they respond to queries in the defined time, and they don't miss heartbeats. You enable health checks just by defining a .probe block in your backend definition (search the Varnish wiki for details).

Having health checks is very convenient. You can instruct varnish to extend the grace period when/if your backend is dead. This means: if Varnish detects that your backends are dead or overloaded and they miss some heartbeats, it will keep serving stale objects from its cache, even if they expired (their TTL is already over). You enable this behaviour by saying:

sub vcl_recv {
   set req.backend = mybackend;

   # Default grace period is 10s
   set req.grace = 10s;

   # OMG. Backend dead. Keep serving stuff until we recover them.
   if (! req.backend.healthy) {
      set req.grace = 4h;
   }
    
   ...
}

sub vcl_fetch {

   # Renew cached objects every minute ...
   set obj.ttl = 60s;

   # ... but keep all objects way past their expire date
   # in case we need them because backends died
   set obj.grace = 4h;

   ...

}

That's it. We're continuing to refine our configs and best practices for Varnish servers. If you have feedback, leave a comment or drop me an email.

Handling a file-server workload with varnish

During last couple of months, we have been playing with varnish for files.myopera.com, the My Opera main files server.

I'm not sure this is a typical use case for varnish, maybe it is, but it has a few unique challenges which I'll try to explain here:

  • really high number of connections (in the 10k range)
  • large file set, ~100 millions
  • the longer the TTL, the better (10 days it's the default)
  • really simple or no VCL logic

In other Varnish installations we're maintaining here at Opera, the real challenge is to seamlessly interface with backend application servers, but in this case the "backend" is just another http file server with little more logic.

Searching around, and using the resources I mentioned some blog posts ago, we have found a few critical settings that need to be tuned to achieve consistent performance levels. Some are obvious, some others are not so obvious:

  • threads related parameters
  • hash bucket size
  • Session related parameters
  • Grace config and health checking

I'll explain all the settings we had to change and how they affected us in a later post.

Varnish “sess_workspace” and why it is important

When using Varnish on a high traffic site like opera.com or my.opera.com, it is important to reach a stable and sane configuration (both VCL and general service tuning).

If you're just starting using Varnish now, it's easy to overlook things (like I did, for example :) and later experience some crashes or unexpected problems.

Of course, you should read the Varnish wiki, but I'd suggest you also read at least the following links. I found them to be very useful for me:

A couple of weeks ago, we experienced some random Varnish crashes, 1 per day on average. That happened during a weekend. As usual, we didn't really notice that Varnish was crashing until we looked at our Munin graphs. Once you know that Varnish is crashing, everything is easier :)

Just look at your syslog file. We did, and we found the following error message:

Feb 26 06:58:26 p26-01 varnishd[19110]: Child (27707) died signal=6
Feb 26 06:58:26 p26-01 varnishd[19110]: Child (27707) Panic message: Missing errorhandling code in HSH_Prepare(), cache_hash.c line 188:#012  Condition((p) != 0) not true.  thread = (cache-worker)sp = 0x7f8007c7f008 {#012  fd = 239, id = 239, xid = 1109462166,#012  client = 213.236.208.102:39798,#012  step = STP_LOOKUP,#012  handling = hash,#012  ws = 0x7f8007c7f078 { overflow#012    id = "sess",#012    {s,f,r,e} = {0x7f8007c7f808,,+16369,(nil),+16384},#012  },#012    worker = 0x7f82c94e9be0 {#012    },#012    vcl = {#012      srcname = {#012        "input",#012        "Default",#012        "/etc/varnish/accept-language.vcl",#012      },#012    },#012},#012
Feb 26 06:58:26 p26-01 varnishd[19110]: Child cleanup complete
Feb 26 06:58:26 p26-01 varnishd[19110]: child (3710) Started
Feb 26 06:58:26 p26-01 varnishd[19110]: Child (3710) said Closed fds: 3 4 5 10 11 13 14
Feb 26 06:58:26 p26-01 varnishd[19110]: Child (3710) said Child starts
Feb 26 06:58:26 p26-01 varnishd[19110]: Child (3710) said Ready
Feb 26 18:13:37 p26-01 varnishd[19110]: Child (7327) died signal=6
Feb 26 18:13:37 p26-01 varnishd[19110]: Child (7327) Panic message: Missing errorhandling code in HSH_Prepare(), cache_hash.c line 188:#012  Condition((p) != 0) not true.  thread = (cache-worker)sp = 0x7f8008e84008 {#012  fd = 248, id = 248, xid = 447481155,#012  client = 213.236.208.101:39963,#012  step = STP_LOOKUP,#012  handling = hash,#012  ws = 0x7f8008e84078 { overflow#012    id = "sess",#012    {s,f,r,e} = {0x7f8008e84808,,+16378,(nil),+16384},#012  },#012    worker = 0x7f81a4f5fbe0 {#012    },#012    vcl = {#012      srcname = {#012        "input",#012        "Default",#012        "/etc/varnish/accept-language.vcl",#012      },#012    },#012},#012
Feb 26 18:13:37 p26-01 varnishd[19110]: Child cleanup complete
Feb 26 18:13:37 p26-01 varnishd[19110]: child (30662) Started
Feb 26 18:13:37 p26-01 varnishd[19110]: Child (30662) said Closed fds: 3 4 5 10 11 13 14
Feb 26 18:13:37 p26-01 varnishd[19110]: Child (30662) said Child starts
Feb 26 18:13:37 p26-01 varnishd[19110]: Child (30662) said Ready

A quick research brought me to sess_workspace.

We found out we had to increase the default (16kb), especially since we're doing quite a bit of HTTP header copying and rewriting around. In fact, if you do that, each varnish thread uses a memory space at most sess_workspace bytes.

If you happen to need more space, maybe because clients are sending long HTTP header values, or because you are (like we do) writing lots of additional varnish-specific headers, then Varnish won't be able to allocate enough memory, and will just write the assert condition on syslog and drop the request.

So, we bumped sess_workspace to 256kb by setting the following in the startup file:


-p sess_workspace=262144

And since then we haven't been having crashes anymore.

More varnish, now also on www.opera.com

I have been working on setting up and troubleshooting Varnish installations quite a bit lately. After deploying Varnish on My Opera for many different uses, namely APIs, avatars, user pictures and the frontpage, we also decided to try using Varnish on www.opera.com.

While "www" might seem much simpler than My Opera, it has its own challenges.
It doesn't have logged in users, or user-generated content, but, as with My Opera, a single URL is used to generate many (slightly) different versions of the content. Think about older versions of Opera, or maybe newest (betas, 10.50), mobile browsers, Opera Mini, site in "Mobile view", different languages, etc…

That makes caching with Varnish tricky, because you have to consider all of these variables, and instruct Varnish to cache each of these variations separately. No doubt opera.com in this respect is even more difficult than My Opera.

So, we decided to:

  • cache only the most trafficked pages (for now only the Opera startup page)
  • cache them only for Opera 10.x browsers
  • differentiate caching by specific version (the "x" in 10.x)

We basically used the same Varnish config as My Opera, with the accept-language hack, changing only the URL-specific logic. With this setup, we managed to cut down around 15% of backend requests on opera.com.

My Opera front page caching and Varnish hacking

The My Opera front page

According to our internal statistics, the front page of My Opera makes up for a consistent part of the entire traffic we get on our servers. So it's normal we have been working to optimize it for a very long time.

When we knew that Opera Mini 5.0 would be released with our front page as one of the preloaded speed dials, then we started to study the situation in more depth and plan what to do (and quickly!).

Mini 5 is already out, and used by lots of people, and during the last months, we have been getting more and more front page views than ever. What I'm going to tell you is the last (final?) step of the front page performance optimizations we worked on. If it works well, we could be able to apply it to other heavy parts of the site.

Enter Varnish…

Varnish is a reverse proxy cache software.
If you know Varnish already, I suggest you take a look at this great presentation from OSCON 2009.

During October 2009, we deployed our first Varnish server for My Opera, for some very specific and mostly static content. At that time, for me it was very experimental. I hardly knew anything about Varnish :) and in fact, we had some problems here and there. Then we gradually acquired some experience, and so we thought of using Varnish also, and for the first time, for a dynamic request.

Front page caching

Caching a full HTML page presents more challenges than caching a picture. For pictures, you can ignore the User-Agent and the cookies. At least in our case. You can ignore user language preferences. You can also ignore the Accept-Language HTTP header. For My Opera, we also have the Mobile view feature.

All of that means that if you're going to cache, say, the front page of My Opera, you can have:

  • 4 main types of browsers: Opera Mini, Opera Mobile, IE and the standards compliant;
  • 18 different languages, the ones in the language selector at the bottom, from Bulgarian to LOLCAT and Simplified Chinese, selected by either the sticky "language" cookie or by the Accept-Language header.
  • 2 views, mobile and full/desktop view

That makes a grand total of nearly 100 different versions of one single page.
Of course all of this is just for the logged out users. We don't want (and couldn't either) cache each single logged in user version of the frontpage (with the activity feed and all the rest).

Reducing the variations

For the caching to work properly, and be effective, we needed to find a way to reduce the possible number of versions of the front page. So in the Varnish VCL file, we match the User-Agent string, to reduce it to any of 4 predefined strings like "operamini", "operamobile", "msie", or "nomatch". So instead of having &inf; user agent strings, we get only 4.

Then another similar problem is the Accept-Language header. This header can be quite complex, depending on your browser settings, and there's no easy method to "figure-out" what language you want. From a string such as:

de-DE,tr;q=0.999,en;q=0.75,fr;q=0.9,it;q=0.8,ja;q=0.2,ru;q=0.1

you have to build a list of prioritized language preferences and match them against the languages your site can offer.

Failing to do that means, by default, having a different version of the frontpage for each different Accept-Language header, which is very variable across clients, even if there are very common values. A brief statistics gathering session showed 500 distinct values in about 10,000 browser requests.

accept-language.vcl

Varnish allows you to embed C code inside a VCL file. This is a pretty advanced feature that is not very much talked about. Given that using regexp to massage Accept-Language appeared to be messy, we discussed another crazy idea. Writing a C function to parse Accept-Language, and then embed that function into the Varnish VCL config.

Let's say that your site has English and Japanese. Your user browsers will send every possible Accept-Language header on Earth. If you enable Vary: Accept-Language on Varnish or on your backends (and you should)
the cache hit ratio will rapidly drop, because of the huge variations in Accept-Language contents. Varnish will store one version of the page for every different accept language string. That's bad.

With this hack, the Accept-Language header will be "rewritten" to just "en" or "ja", depending on your client settings. If no match occurs, a default language will be set ("en"). This brings the language variants down to exactly 2, the number of languages your site supports. In our case it's 18 versions, so down from ~500 to 18.

It seems a bit weird that we're the only ones having this problem :)

Most probably we're trying to solve this problem directly in Varnish, while usually this is dealt with at the backend level. Solving this inside Varnish is very nice, because it allows to scale more easily to other pages as well, with no modifications to the backends config or code.

If you think this might be useful for you too, you're welcome to get the code and try it out. It's on Github:

http://github.com/cosimo/varnish-accept-language/

Pay attention! It's experimental stuff, don't try it in production without extensive testing. And let me know how it goes :)