Net::Statsd::Server, a Perl port of Flickr/Etsy’s statsd

If you're looking for a Perl client to connect to a statsd daemon, check out Net::Statsd on CPAN, now at version 0.08.

This post is about the server component of statsd.

Tracking metrics: up to now

The idea of statsd started at Flickr with Cal Henderson, and some code is still available, but it's not very functional or complete.

Ever since reading about statsd, I've found the concept brilliant, though I had been using a similar technique long before hearing about statsd. I learned it from colleagues here at Opera in 2008, who were using it to track application metrics for the Opera Link server. I thought it was great, so I implemented it too, extending it to make it very easy to add metrics and to see the output automatically in Munin. Here's basically how it worked:

# ...
use Opera::Stats;
# ...
Opera::Stats::count("site.logins");
# ...

The project code would typically have tens or hundreds of these calls. Each call would store or increment a counter in a local or remote memcached. A complementary Opera::Stats::Munin module would then automatically generate the output needed to implement a full Munin plugin, given the metrics to be exposed.
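Opera::Stats itself was internal code, but the counting part is simple enough to sketch. A minimal, hypothetical version of such a count() function, using plain Cache::Memcached (module name and server address here are made up):

package My::Stats;    # hypothetical stand-in for Opera::Stats

use strict;
use warnings;
use Cache::Memcached ();

my $memd = Cache::Memcached->new({ servers => ['127.0.0.1:11211'] });

sub count {
    my ($metric, $amount) = @_;
    $amount ||= 1;
    # incr() fails on keys that don't exist yet,
    # so initialize the counter on first use
    $memd->incr($metric, $amount)
        or $memd->add($metric, $amount);
    return;
}

1;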

So far, so good. Except there were a few things that didn’t work quite right:

  • Using TCP connections, possibly even to remote machines, could have become a problem if the memcached machines went down, even though in practice it never was
  • Volume was a concern. I had to worry about tracking too many metrics. How would that affect the functioning of memcached for regularly stored keys and values? Would those metrics-related keys cause evictions of regular memcached content?
  • Even though the Munin integration made it very easy to get charts, there were still some limitations: creating new charts required a wrapper plugin with 1 or 2 lines of Perl code, and flexibility was also an issue.

Enter statsd

I have been thinking of replacing this system with statsd for a while. However, I wanted to have a more in-depth look at it before deploying it.

It turns out that statsd is a simple project, which I like, but it requires node.js. Knowing next to nothing about node.js, I took some time to learn a few things.

I also realized I had been wanting to learn AnyEvent for a long time.

Net::Statsd::Server

Two weeks ago, I spent a busy weekend reimplementing 95% of statsd in Perl. On Sunday night, I had a functional version of statsd written in Perl with AnyEvent.

AnyEvent is surprising at times. I found it especially interesting to debug the cases where your timer (AE::timer) doesn't fire unless you actually save the returned guard object to a scalar, as in:

# This won't fire! The guard object returned by AE::timer
# is immediately destroyed, and the timer with it.
AE::timer 10, 10, \&do_something;

# This will fire, though.
# This behaviour is triggered by "defined wantarray".
my $t = AE::timer 10, 10, \&do_something;
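In a long-running server, the usual pattern is to stash the guard somewhere that lives as long as you need the timer, for example in the server object itself (attribute and method names below are made up):

# Keep the guard alive for the object's lifetime,
# e.g. a periodic flush timer in a statsd-like server
$self->{flush_timer} = AE::timer 10, 10, sub {
    $self->flush_metrics;
};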

Since that weekend, I have spent a few more nights tweaking Net::Statsd::Server. Yesterday I wrote a piece of functionality (a new "File" backend) that is actually not in the original statsd.

It looks like I might need new backends as well, so I think it’s “an investment with a good ROI”, even though I did it mainly for fun and in my free time.

Performance

I wanted to make sure my statsd server implementation would be fast. I started by bringing up the node.js statsd and firing my benchmark.pl script at it with 1 million iterations, then comparing the results with my own statsd server.

That didn't work out very well. Or rather, it worked out brilliantly, showing around 40k requests/s handled by node.js statsd and 50k requests/s by Net::Statsd::Server. The problem is: how do you measure the performance of a UDP server? Or, for that matter, of a UDP client?

I figured out that, UDP being connection-less and fire-and-forget, it doesn't really matter how many packets/s the client fires, as long as you can generate more than your server can handle. Just as a data point, I reached around 73-75k statsd API calls per second for the gauge API, and around 55-58k for counters and timers. What really matters is how many packets reach the server.

BTW, I used another amazing piece of software, Devel::NYTProf, to optimize the incoming-packets code path as much as I could.
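If you haven't used Devel::NYTProf before, the workflow is roughly this (the server command line here is hypothetical):

# Run the server under the profiler, generate some load, stop the
# server, then turn the resulting nytprof.out into an HTML report
$ perl -d:NYTProf your-statsd-server.pl
$ nytprofhtml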

The test setup

To measure how many packets are received on the server side, I prepared a test configuration:

{ graphitePort: 2003
, graphiteHost: "graphite.localdomain"
, host: "0.0.0.0"
, port: 8125
, backends: [ "./backends/graphite", "./backends/console" ]
, mgmt_address: "0.0.0.0"
, mgmt_port: 8126
}

The same configuration file for the Perl server becomes:

{ "graphitePort": 2003,
  "graphiteHost": "graphite.localdomain",
  "host" : "0.0.0.0",
  "port": 8125,
  "mgmt_address" : "0.0.0.0",
  "mgmt_port": 8126,
  "backends": [ "Graphite", "Console" ],
  "log" : {
    "backend" : "stdout",
    "level" : "LOG_WARN",
  }
}

Using the benchmark.pl code mentioned above, run with:

$ perl benchmark.pl 1000000

I first started up the node.js statsd, then the Net::Statsd::Server daemon, and captured their output. Both servers were configured to use their Graphite backend and flush to a valid and active Graphite host. The Console backend was also active for both servers, so I could capture the output, look at the statsd.packets_received counter, and directly measure how many packets were received by the server.

The benchmark utility with first argument = 1000000 generates 5 million statsd API calls, that is, 5 million UDP packets.
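For reference, a stripped-down sketch of such a benchmark loop, assuming the Net::Statsd client API (the real benchmark.pl may differ):

#!/usr/bin/perl

use strict;
use warnings;
use Net::Statsd;

$Net::Statsd::HOST = 'localhost';
$Net::Statsd::PORT = 8125;

my $iterations = shift || 1_000_000;

for (1 .. $iterations) {
    # 5 API calls per iteration = 5 UDP packets
    Net::Statsd::increment('bench.counter');
    Net::Statsd::decrement('bench.counter');
    Net::Statsd::update_stats('bench.counter', 2);
    Net::Statsd::timing('bench.timer', 42);
    Net::Statsd::gauge('bench.gauge' => 100);
}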

Of these 5 million packets, node.js statsd was able to capture 2106768, 1596275, 1479145 and 1490640 packets over 4 different runs.

Net::Statsd::Server, also over 4 different runs, was able to capture 2106242, 1884810, 1822042 and 1866500 packets.

I have performed more tests, and they showed a very low deviation from these runs (~1.5M packets for Etsy's statsd and ~1.8M for Net::Statsd::Server). Removing the 2 peak results of ~2.1M packets, it would seem that the Perl statsd is capable of receiving about 22% more packets than the original statsd daemon written in JavaScript.

Of course, this is just my test. I tried running it on different hardware, but I didn't get significantly different results. If you try it yourself, please let me know what numbers you get. I'd be curious to know :-)

SO_RCVBUF

Given the massive number of UDP packets lost in the tests (50%+ even in the best runs), I tried to figure out a way to improve this, and stumbled on SO_RCVBUF.

My understanding was that bumping up SO_RCVBUF on the listening UDP socket would dramatically decrease packet loss. However, at first I wasn't able to prove the theory, because I saw no improvement in the total number of packets received. At least until I read this article about UDP packet loss on stackoverflow.com, which pointed me to the net.core.rmem_max sysctl: the kernel silently caps SO_RCVBUF at that value.

After bumping net.core.rmem_max up to 100 MB, just so it would no longer cap the socket buffer, and using the following code in Net::Statsd::Server:

# SOL_SOCKET and SO_RCVBUF come from the core Socket module
use Socket qw(SOL_SOCKET SO_RCVBUF);

# Bump up SO_RCVBUF on the UDP socket, to buffer up incoming
# UDP packets and avoid massive packet loss when load is very high.
setsockopt($self->{server}->fh, SOL_SOCKET, SO_RCVBUF, 1*1024*1024)
    or die "Couldn't set SO_RCVBUF: $!";
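As a sanity check, you can read the option back to see what the kernel actually granted (this sketch reuses the Socket constants imported above; note that Linux reports back twice the value you set, to account for bookkeeping overhead):

# Read back the effective receive buffer size
my $rcvbuf = unpack 'i',
    getsockopt($self->{server}->fh, SOL_SOCKET, SO_RCVBUF);
warn "Effective SO_RCVBUF: $rcvbuf bytes\n";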

I saw some very interesting effects.

Re-running the node.js statsd, I saw an increased number of captured packets (1691700 and 1675902, a ~10% increase).
Running the Net::Statsd::Server daemon again, I recorded 2678507 and 2477246 packets: an impressive ~40% increase!

As a last effort, I tried varying the SO_RCVBUF size from 1 to 64 MB to see what effect it had on the number of captured packets (or on UDP packet loss, if you prefer).

I haven't run any scientific set of tests, but I couldn't see any statistically significant increase for values greater than 4-8 MB, so I haven't decided where to set the default in Net::Statsd::Server yet. Any chosen value is likely to need specific sysctl tuning anyway, so YMMV.

Why?

Did I really do it for fun? Yes, mainly, but also because:

  • I don't like adding node.js to our production stack just to run statsd. I have never operated a node.js server, so I didn't want to take that "risk": the product we're building is going live soon! :-) Note that this would apply to any technology I've never operated, it's not about node.js per se :-)
  • to learn how statsd was put together
  • to learn AnyEvent
  • to learn how to build a high-performance UDP server
  • Basically, to learn :-)

Code is up on CPAN, as usual: https://metacpan.org/module/Net::Statsd::Server.

If you happen to use it, please give me some feedback!

Using Perl and Google Chromium’s CLD to identify the language of a text

For a new project I'm working on, given a body of text, I need to identify which language it's written in (English, Russian, Chinese, etc…).

I'm not exactly the first person on Earth to do this, and it turns out there's Google's CLD library for exactly this purpose. Surprisingly, several people around here didn't know about it. The library is open source and very good too, so I immediately looked for Perl bindings for it.

There is a great Perl module on CPAN called Lingua::Identify::CLD. This module bundles a copy of the CLD library, and fully automates build and link steps too. So I gave it a shot.

How to use Lingua::Identify::CLD

It's amazingly easy to use. Here's a sample of the code:


#!/usr/bin/perl

use strict;
use warnings;
use feature 'say';
use Lingua::Identify::CLD ();

# Slurp the text to analyze from stdin, or from the files
# given as command-line arguments
my $text;
while (<>) { $text .= $_ }
chomp $text;

# In my case, the content is HTML
my $cld = Lingua::Identify::CLD->new(isPlainText => 0);

# identify() returns something like ("ENGLISH", "en", 64)
my @lang = $cld->identify($text);
say "Language: $lang[0]";

Failing tests

I decided to start using this module in my project. The build phase went fine (perl ./Build), but the tests failed (./Build test). Here's the log of a failed test run:


$ ./Build test
cc -I/usr/lib/perl/5.14/CORE -fPIC -c -D_REENTRANT -D_GNU_SOURCE -DDEBIAN -fstack-protector -fno-strict-aliasing -pipe -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64 -O2 -g -o /tmp/gAc_glZta2/library.o /tmp/gAc_glZta2/library.c
cc -I/usr/lib/perl/5.14/CORE -fPIC -c -D_REENTRANT -D_GNU_SOURCE -DDEBIAN -fstack-protector -fno-strict-aliasing -pipe -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64 -O2 -g -o /tmp/gAc_glZta2/test.o /tmp/gAc_glZta2/test.c
cc -shared -L/usr/local/lib -fstack-protector -o /tmp/gAc_glZta2/libfoo.so /tmp/gAc_glZta2/library.o
cc -fstack-protector -L/usr/local/lib -o /tmp/gAc_glZta2/foo /tmp/gAc_glZta2/test.o -L/tmp/gAc_glZta2 -lfoo

** Preparing XS code
t/00-load.t ....... 1/1 Bailout called.  Further testing stopped:  

#   Failed test 'use Lingua::Identify::CLD;'
#   at t/00-load.t line 6.
#     Tried to use 'Lingua::Identify::CLD'.
#     Error:  Not a CODE reference at /usr/lib/perl/5.14/DynaLoader.pm line 207.
# END failed--call queue aborted at .../Lingua-Identify-CLD-0.05/blib/lib/Lingua/Identify/CLD.pm line 207.
# BEGIN failed--compilation aborted at .../Lingua-Identify-CLD-0.05/blib/lib/Lingua/Identify/CLD.pm line 24.
# Compilation failed in require at (eval 4) line 2.
# BEGIN failed--compilation aborted at (eval 4) line 2.
Use of uninitialized value $Lingua::Identify::CLD::VERSION in concatenation (.) or string at t/00-load.t line 9.
# Testing Lingua::Identify::CLD , Perl 5.014002, /usr/bin/perl
# Looks like you failed 1 test of 1.
FAILED--Further testing stopped.

Just the day before, I had successfully compiled and run the tests for the same version of the module, but on Ubuntu 11.10, which I was using at the time. Then I decided to upgrade to 12.10, and that's when I got this failed test run.

Contacting the author

Then I decided to contact the author of the module. Alberto being quite a well-known author, with lots of CPAN contributions, I hoped he would answer my query within 2-3 days. That would give me some time to do other stuff, and hopefully give him time to analyze the failure.

As usual with the best CPAN authors ;-) he answered in a couple of hours, which was fantastic for me. He had already identified a few failures like mine, thanks to another awesome resource we have in the Perl community: the CPAN Testers service.

CPAN Testers

CPAN Testers is a group of users who regularly (or not so regularly) report back the build/test status of everything that's released to CPAN, on a multitude of platforms and versions of Perl. I think this is one of the most underestimated awesome resources we have in the Perl community. The CPAN Testers status of Lingua::Identify::CLD shows one report that looks exactly like the failure I experienced, on Ubuntu 12.10 with the stock Perl 5.14.2.

The ugly patch

I tried to analyze the problem, apparently located in DynaLoader, and came up with a shotgun-debugging-driven patch that I copy/paste here for reference:

@@ -18,10 +18,23 @@ Version 0.05

 our $VERSION = '0.05';
 
-use XSLoader;
-BEGIN {
+eval {
+
+    require XSLoader;
     XSLoader::load('Lingua::Identify::CLD', $VERSION);
-}
+
+} or do {
+
+    # This warning triggers on Ubuntu 12.10 with the
+    # stock perl 5.14.2. Strangely enough, this doesn't
+    # seem to affect the tests at all.
+    #
+    # Not a CODE reference at /usr/lib/perl/5.14/DynaLoader.pm line 207.
+    # END failed--call queue aborted at .../blib/lib/Lingua/Identify/CLD.pm line 207.
+    # ) at .../blib/lib/Lingua/Identify/CLD.pm line 28."
+    #
+    #warn "Something's wrong with XSLoader? ($@)";
+};
 
 =head1 SYNOPSIS

It's shotgun debugging because I don't really know what's going on; I just came up with this patch based on the assumptions and information I've gathered over the years on how DynaLoader/XSLoader and BEGIN {} blocks work and interact with the rest of the code :-)

Anyway, it makes the tests pass again, even if with a weird warning. I agree with Alberto that it's not wise to incorporate this patch into Lingua::Identify::CLD until we understand why the original code fails, and why for just 2 people in the world.

All this blah-blah, to say: please do help! If you have seen the same problem, help us figure out what it is. My repository with the forked/patched code is on Github:

https://github.com/cosimo/Lingua-Identify-CLD

Have fun!

My first attempt at a responsive layout: OSQA

So-called responsive layouts are all the rage these days, and frankly, I feel ashamed that some of the personal projects I work on at home are not responsive yet. Partly that's because I don't have that much spare time to dedicate to them, but partly it's also because I had no idea how to transform a "fixed" or traditional layout into a responsive one.

I tried to remedy this by studying a few responsive layouts I stumbled upon. However, I didn't find it very easy to just stare at the code and understand what's going on; modern CSS holds quite a dose of magic for me. So I searched the web for responsive layout guides and tried to read them. I remembered we must have quite some information on responsive layouts on Dev Opera.

A search for "responsive" turns up lots of good results, including Love your devices: adaptive web design with media queries and more… by colleague Chris Mills.

Recently I heard that Chris published the "Practical CSS3" book, so I just dived into the article, eager to learn everything about responsive layouts, adapting to devices, etc… The occasion for it is another tiny personal project I'm working on during nights and weekends.

It's yet another Stack Overflow clone, but for parenting: newborns, pregnancy, etc…, powered by the open source Q&A software OSQA. I'm in the process of splitting the existing web site, with its articles, comments, questions and answers, into two different sites: a pure blog with articles and comments, and another site with just questions and answers. The latter is what I'm talking about in this article.

When you install OSQA, by default it looks like this:

It's not bad, but I like the default Stack Overflow layout much better. So I spent a few days learning about OSQA and importing the existing content. I found it robust and well designed. It has everything I need, including themes/skins that you can build, and a custom-CSS feature: you can stack your own CSS on top of the selected skin, much like what we have in My Opera too.

After a bit of CSS fiddling, I came up with the following layout:

Unfortunately, the default OSQA layout is not responsive at all, and it looks terrible on mobile devices (initial-scale, anyone?). So I started this journey into unknown territory, guided by Chris Mills' article, to discover how to make a layout responsive from scratch. Now, I'm sure there's a crapload of useless/harmful stuff in my custom CSS, but the final result left me really satisfied:

… apart from the "Cerca"/Search button and a few minor things. In the end, I had to duplicate the default skin to make a few very small changes, but apart from that, all the rest is accomplished by the custom CSS snippet, the heart of which is of course the media query for mobile devices.

Please tell me where I screwed up, KTHXBYE :-)

Displaying realtime memcached traffic on a backend

Sometimes I like to write down posts like this to remind myself how to do something; a sort of mental note.
Suppose you have a few application servers that use one or more memcached servers, and you want some way to display the outbound traffic, getting some insight into the most used keys, counters, etc…

Here's a quick way to do that, assuming you're using the memcached text protocol:

tcpflow -ce dst port 11211 \
    | cut -b53- \
    | grep ^get \
    | pipestat --clear --runtime 60 --field 2 --time 1 --limit 40

What this does is:

  • Use tcpflow to capture all outbound traffic to destination port 11211, the default memcached port.
  • Remove the first 53 bytes from each line, to strip out the source and destination IPs/ports.
  • Only display get requests (alternatively, use set, incr, …).
  • Feed the resulting data to pipestat, a simple but great Perl tool that aggregates the data and displays the most frequent entries. The specific options I used are good if you want to display quick statistics, similar to tools like top, mytop, or varnishstat.

It goes without saying that these tools are automatically installed on all servers that our Devops team here at Opera manages. I couldn't work without them :)

How to find unused CSS selectors, a quick solution

I was talking to a colleague today, and he mentioned the problem he was working on: trying to find site-wide unused CSS selectors. That is, given a static CSS file on disk, go through all the selectors in there and see if there are any matching elements across an entire site, crawling it page by page.

I thought it was a really interesting problem, so I gave it a quick shot by gluing together CSS::Tiny, Mojo::UserAgent and Mojo::DOM::CSS.

This is what came out of it. I'd say it's a decent first quick solution.
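A rough sketch of the approach (not the actual script: selector groups are naively split on commas, and any selector Mojo::DOM can't parse is simply skipped):

#!/usr/bin/perl
# Sketch: list the selectors from a CSS file that never match
# on any of the given pages
use strict;
use warnings;
use CSS::Tiny ();
use Mojo::UserAgent ();

my ($css_file, @urls) = @ARGV;

my $css = CSS::Tiny->read($css_file)
    or die "Can't parse $css_file: " . CSS::Tiny->errstr;

# CSS::Tiny objects are hashes keyed on selector groups ("h1, h2")
my %unused = map { $_ => 1 }
             map { split /\s*,\s*/ } keys %$css;

my $ua = Mojo::UserAgent->new;

for my $url (@urls) {
    my $dom = $ua->get($url)->res->dom;
    for my $sel (keys %unused) {
        # at() uses Mojo::DOM::CSS to find the first matching element
        delete $unused{$sel} if eval { $dom->at($sel) };
    }
}

print "$_\n" for sort keys %unused;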

I also learned about the deadweight project, which apparently can also crawl a site by logging in, kind of WWW::Mechanize-style. It would be interesting to improve this initial solution :-)

Dist::Zilla, Y U suddenly no work anymore? [FIXED!]

I'm trying to understand why Dist::Zilla doesn't work anymore on my laptop. Here's the epic wall of warnings I get when running dzil test:


$ dzil test
Could not create the 'reader' method for zilla because : The method '_inline_store' was not found in the inheritance hierarchy for Moose::Meta::Class::__ANON__::SERIAL::9 at /usr/local/lib/perl/5.10.1/Class/MOP/Class.pm line 1053
	Class::MOP::Class::__ANON_Moose::Meta::Class=HASH(0x3556088) called at /usr/local/lib/perl/5.10.1/Class/MOP/Class.pm line 1098
	Class::MOP::Class::add_around_method_modifier('Moose::Meta::Class=HASH(0x3556088)', '_inline_store', 'CODE(0x351cea8)') called at /usr/local/lib/perl/5.10.1/Moose/Meta/Role/Application/ToClass.pm line 231
	Moose::Meta::Role::Application::ToClass::apply_method_modifiers('Moose::Meta::Role::Application::ToClass=HASH(0x3556b40)', 'around', 'Moose::Meta::Role=HASH(0x351dc28)', 'Moose::Meta::Class=HASH(0x3556088)') called at /usr/local/lib/perl/5.10.1/Moose/Meta/Role/Application.pm line 78
	Moose::Meta::Role::Application::apply_around_method_modifiers('Moose::Meta::Role::Application::ToClass=HASH(0x3556b40)', 'Moose::Meta::Role=HASH(0x351dc28)', 'Moose::Meta::Class=HASH(0x3556088)') called at /usr/local/lib/perl/5.10.1/Moose/Meta/Role/Application.pm line 64
	Moose::Meta::Role::Application::apply('Moose::Meta::Role::Application::ToClass=HASH(0x3556b40)', 'Moose::Meta::Role=HASH(0x351dc28)', 'Moose::Meta::Class=HASH(0x3556088)') called at /usr/local/lib/perl/5.10.1/Moose/Meta/Role/Application/ToClass.pm line 36
	Moose::Meta::Role::Application::ToClass::apply('Moose::Meta::Role::Application::ToClass=HASH(0x3556b40)', 'Moose::Meta::Role=HASH(0x351dc28)', 'Moose::Meta::Class=HASH(0x3556088)', 'HASH(0x354ce50)') called at /usr/local/lib/perl/5.10.1/Moose/Meta/Role.pm line 470
	Moose::Meta::Role::apply('Moose::Meta::Role=HASH(0x351dc28)', 'Moose::Meta::Class=HASH(0x3556088)') called at /usr/local/lib/perl/5.10.1/Moose/Util.pm line 160
	Moose::Util::_apply_all_roles('Moose::Meta::Class=HASH(0x3556088)', undef, 'MooseX::SetOnce::Accessor') called at /usr/local/lib/perl/5.10.1/Moose/Util.pm line 99
	Moose::Util::apply_all_roles('Moose::Meta::Class=HASH(0x3556088)', 'MooseX::SetOnce::Accessor') called at /usr/local/lib/perl/5.10.1/Moose/Meta/Class.pm line 104
	Moose::Meta::Class::create('Moose::Meta::Class', 'Moose::Meta::Class::__ANON__::SERIAL::9', 'roles', 'ARRAY(0x33e50d8)', 'weaken', '', 'superclasses', 'ARRAY(0x353a7e8)') called at /usr/local/lib/perl/5.10.1/Class/MOP/Package.pm line 120
	Class::MOP::Package::create_anon('Moose::Meta::Class', 'superclasses', 'ARRAY(0x353a7e8)', 'roles', 'ARRAY(0x33e50d8)', 'cache', 1) called at /usr/local/lib/perl/5.10.1/Class/MOP/Class.pm line 474
	Class::MOP::Class::create_anon_class('Moose::Meta::Class', 'superclasses', 'ARRAY(0x353a7e8)', 'roles', 'ARRAY(0x33e50d8)', 'cache', 1) called at /usr/share/perl5/MooseX/SetOnce.pm line 27
	Class::MOP::Class:::around('CODE(0x1c87bf0)', 'Moose::Meta::Class::__ANON__::SERIAL::8=HASH(0x3556a50)') called at /usr/local/lib/perl/5.10.1/Class/MOP/Method/Wrapped.pm line 162
	Class::MOP::Method::Wrapped::__ANON_Moose::Meta::Class::__ANON__::SERIAL::8=HASH(0x3556a50) called at /usr/local/lib/perl/5.10.1/Class/MOP/Method/Wrapped.pm line 91
	Moose::Meta::Class::__ANON__::SERIAL::8::accessor_metaclass('Moose::Meta::Class::__ANON__::SERIAL::8=HASH(0x3556a50)') called at /usr/local/lib/perl/5.10.1/Class/MOP/Attribute.pm line 389
	Class::MOP::Attribute::__ANON__() called at /usr/share/perl5/Try/Tiny.pm line 76
	eval {...} called at /usr/share/perl5/Try/Tiny.pm line 67
	Try::Tiny::try('CODE(0x3543bb8)', 'Try::Tiny::Catch=REF(0x354c718)') called at /usr/local/lib/perl/5.10.1/Class/MOP/Attribute.pm line 401
	Class::MOP::Attribute::_process_accessors('Moose::Meta::Class::__ANON__::SERIAL::8=HASH(0x3556a50)', 'reader', 'zilla', undef) called at /usr/local/lib/perl/5.10.1/Moose/Meta/Attribute.pm line 1074
	Moose::Meta::Attribute::_process_accessors('Moose::Meta::Class::__ANON__::SERIAL::8=HASH(0x3556a50)', 'reader', 'zilla', undef) called at /usr/local/lib/perl/5.10.1/Class/MOP/Attribute.pm line 428
	Class::MOP::Attribute::install_accessors('Moose::Meta::Class::__ANON__::SERIAL::8=HASH(0x3556a50)') called at /usr/local/lib/perl/5.10.1/Moose/Meta/Attribute.pm line 1013
	Moose::Meta::Attribute::install_accessors('Moose::Meta::Class::__ANON__::SERIAL::8=HASH(0x3556a50)') called at /usr/local/lib/perl/5.10.1/Class/MOP/Class.pm line 891
	Class::MOP::Class::__ANON__() called at /usr/share/perl5/Try/Tiny.pm line 76
	eval {...} called at /usr/share/perl5/Try/Tiny.pm line 67
	Try::Tiny::try('CODE(0x354c5b0)', 'Try::Tiny::Catch=REF(0x3435780)') called at /usr/local/lib/perl/5.10.1/Class/MOP/Class.pm line 896
	Class::MOP::Class::_post_add_attribute('Moose::Meta::Class=HASH(0x35122a0)', 'Moose::Meta::Class::__ANON__::SERIAL::8=HASH(0x3556a50)') called at /usr/local/lib/perl/5.10.1/Class/MOP/Mixin/HasAttributes.pm line 44
	Class::MOP::Mixin::HasAttributes::add_attribute('Moose::Meta::Class=HASH(0x35122a0)', 'Moose::Meta::Class::__ANON__::SERIAL::8=HASH(0x3556a50)') called at /usr/local/lib/perl/5.10.1/Moose/Meta/Class.pm line 570
	Moose::Meta::Class::add_attribute('Moose::Meta::Class=HASH(0x35122a0)', 'zilla', 'is', 'ro', 'writer', 'set_zilla', 'lazy_required', 1, 'isa', ...) called at /usr/local/lib/perl/5.10.1/Moose.pm line 79
	Moose::has('Moose::Meta::Class=HASH(0x35122a0)', 'zilla', 'is', 'ro', 'isa', 'Moose::Meta::TypeConstraint::Class=HASH(0x3092830)', 'traits', 'ARRAY(0x350d590)', 'writer', ...) called at /usr/local/lib/perl/5.10.1/Moose/Exporter.pm line 382
	Moose::has('zilla', 'is', 'ro', 'isa', 'Moose::Meta::TypeConstraint::Class=HASH(0x3092830)', 'traits', 'ARRAY(0x350d590)', 'writer', 'set_zilla', ...) called at /usr/local/share/perl/5.10.1/Dist/Zilla/MVP/RootSection.pm line 22
	require Dist/Zilla/MVP/RootSection.pm called at /usr/local/share/perl/5.10.1/Dist/Zilla/MVP/Assembler/Zilla.pm line 13
	Dist::Zilla::MVP::Assembler::Zilla::BEGIN() called at /usr/local/share/perl/5.10.1/Dist/Zilla/MVP/RootSection.pm line 0
	eval {...} called at /usr/local/share/perl/5.10.1/Dist/Zilla/MVP/RootSection.pm line 0
	require Dist/Zilla/MVP/Assembler/Zilla.pm called at /usr/local/share/perl/5.10.1/Dist/Zilla/Dist/Builder.pm line 204
	Dist::Zilla::Dist::Builder::_load_config('Dist::Zilla::Dist::Builder', 'HASH(0x342fe00)') called at /usr/local/share/perl/5.10.1/Dist/Zilla/Dist/Builder.pm line 27
	Dist::Zilla::Dist::Builder::from_config('Dist::Zilla::Dist::Builder', 'HASH(0x33e2608)') called at /usr/local/share/perl/5.10.1/Dist/Zilla/App.pm line 112
	Dist::Zilla::App::__ANON__() called at /usr/share/perl5/Try/Tiny.pm line 76
	eval {...} called at /usr/share/perl5/Try/Tiny.pm line 67
	Try::Tiny::try('CODE(0x3084e60)', 'Try::Tiny::Catch=REF(0x33a8848)') called at /usr/local/share/perl/5.10.1/Dist/Zilla/App.pm line 120
	Dist::Zilla::App::zilla('Dist::Zilla::App=HASH(0x204eb48)') called at /usr/local/share/perl/5.10.1/Dist/Zilla/App/Command.pm line 13
	Dist::Zilla::App::Command::zilla('Dist::Zilla::App::Command::test=HASH(0x280b910)') called at /usr/local/share/perl/5.10.1/Dist/Zilla/App/Command/test.pm line 28
	Dist::Zilla::App::Command::test::execute('Dist::Zilla::App::Command::test=HASH(0x280b910)', 'Getopt::Long::Descriptive::Opts::__OPT__::2=HASH(0x291d7c0)', 'ARRAY(0x13bef10)') called at /usr/share/perl5/App/Cmd.pm line 220
	App::Cmd::execute_command('Dist::Zilla::App=HASH(0x204eb48)', 'Dist::Zilla::App::Command::test=HASH(0x280b910)', 'Getopt::Long::Descriptive::Opts::__OPT__::2=HASH(0x291d7c0)') called at /usr/share/perl5/App/Cmd.pm line 159
	App::Cmd::run('Dist::Zilla::App') called at /usr/bin/dzil line 11
 at /usr/local/lib/perl/5.10.1/Class/MOP/Attribute.pm line 400
	Class::MOP::Attribute::__ANON_The method '_inline_store' was not found in the inheritance... called at /usr/share/perl5/Try/Tiny.pm line 100
	Try::Tiny::try('CODE(0x3543bb8)', 'Try::Tiny::Catch=REF(0x354c718)') called at /usr/local/lib/perl/5.10.1/Class/MOP/Attribute.pm line 401
	Class::MOP::Attribute::_process_accessors('Moose::Meta::Class::__ANON__::SERIAL::8=HASH(0x3556a50)', 'reader', 'zilla', undef) called at /usr/local/lib/perl/5.10.1/Moose/Meta/Attribute.pm line 1074
	Moose::Meta::Attribute::_process_accessors('Moose::Meta::Class::__ANON__::SERIAL::8=HASH(0x3556a50)', 'reader', 'zilla', undef) called at /usr/local/lib/perl/5.10.1/Class/MOP/Attribute.pm line 428
	Class::MOP::Attribute::install_accessors('Moose::Meta::Class::__ANON__::SERIAL::8=HASH(0x3556a50)') called at /usr/local/lib/perl/5.10.1/Moose/Meta/Attribute.pm line 1013
	Moose::Meta::Attribute::install_accessors('Moose::Meta::Class::__ANON__::SERIAL::8=HASH(0x3556a50)') called at /usr/local/lib/perl/5.10.1/Class/MOP/Class.pm line 891
	Class::MOP::Class::__ANON__() called at /usr/share/perl5/Try/Tiny.pm line 76
	eval {...} called at /usr/share/perl5/Try/Tiny.pm line 67
	Try::Tiny::try('CODE(0x354c5b0)', 'Try::Tiny::Catch=REF(0x3435780)') called at /usr/local/lib/perl/5.10.1/Class/MOP/Class.pm line 896
	Class::MOP::Class::_post_add_attribute('Moose::Meta::Class=HASH(0x35122a0)', 'Moose::Meta::Class::__ANON__::SERIAL::8=HASH(0x3556a50)') called at /usr/local/lib/perl/5.10.1/Class/MOP/Mixin/HasAttributes.pm line 44
	Class::MOP::Mixin::HasAttributes::add_attribute('Moose::Meta::Class=HASH(0x35122a0)', 'Moose::Meta::Class::__ANON__::SERIAL::8=HASH(0x3556a50)') called at /usr/local/lib/perl/5.10.1/Moose/Meta/Class.pm line 570
	Moose::Meta::Class::add_attribute('Moose::Meta::Class=HASH(0x35122a0)', 'zilla', 'is', 'ro', 'writer', 'set_zilla', 'lazy_required', 1, 'isa', ...) called at /usr/local/lib/perl/5.10.1/Moose.pm line 79
	Moose::has('Moose::Meta::Class=HASH(0x35122a0)', 'zilla', 'is', 'ro', 'isa', 'Moose::Meta::TypeConstraint::Class=HASH(0x3092830)', 'traits', 'ARRAY(0x350d590)', 'writer', ...) called at /usr/local/lib/perl/5.10.1/Moose/Exporter.pm line 382
	Moose::has('zilla', 'is', 'ro', 'isa', 'Moose::Meta::TypeConstraint::Class=HASH(0x3092830)', 'traits', 'ARRAY(0x350d590)', 'writer', 'set_zilla', ...) called at /usr/local/share/perl/5.10.1/Dist/Zilla/MVP/RootSection.pm line 22
	require Dist/Zilla/MVP/RootSection.pm called at /usr/local/share/perl/5.10.1/Dist/Zilla/MVP/Assembler/Zilla.pm line 13
	Dist::Zilla::MVP::Assembler::Zilla::BEGIN() called at /usr/local/share/perl/5.10.1/Dist/Zilla/MVP/RootSection.pm line 0
	eval {...} called at /usr/local/share/perl/5.10.1/Dist/Zilla/MVP/RootSection.pm line 0
	require Dist/Zilla/MVP/Assembler/Zilla.pm called at /usr/local/share/perl/5.10.1/Dist/Zilla/Dist/Builder.pm line 204
	Dist::Zilla::Dist::Builder::_load_config('Dist::Zilla::Dist::Builder', 'HASH(0x342fe00)') called at /usr/local/share/perl/5.10.1/Dist/Zilla/Dist/Builder.pm line 27
	Dist::Zilla::Dist::Builder::from_config('Dist::Zilla::Dist::Builder', 'HASH(0x33e2608)') called at /usr/local/share/perl/5.10.1/Dist/Zilla/App.pm line 112
	Dist::Zilla::App::__ANON__() called at /usr/share/perl5/Try/Tiny.pm line 76
	eval {...} called at /usr/share/perl5/Try/Tiny.pm line 67
	Try::Tiny::try('CODE(0x3084e60)', 'Try::Tiny::Catch=REF(0x33a8848)') called at /usr/local/share/perl/5.10.1/Dist/Zilla/App.pm line 120
	Dist::Zilla::App::zilla('Dist::Zilla::App=HASH(0x204eb48)') called at /usr/local/share/perl/5.10.1/Dist/Zilla/App/Command.pm line 13
	Dist::Zilla::App::Command::zilla('Dist::Zilla::App::Command::test=HASH(0x280b910)') called at /usr/local/share/perl/5.10.1/Dist/Zilla/App/Command/test.pm line 28
	Dist::Zilla::App::Command::test::execute('Dist::Zilla::App::Command::test=HASH(0x280b910)', 'Getopt::Long::Descriptive::Opts::__OPT__::2=HASH(0x291d7c0)', 'ARRAY(0x13bef10)') called at /usr/share/perl5/App/Cmd.pm line 220
	App::Cmd::execute_command('Dist::Zilla::App=HASH(0x204eb48)', 'Dist::Zilla::App::Command::test=HASH(0x280b910)', 'Getopt::Long::Descriptive::Opts::__OPT__::2=HASH(0x291d7c0)') called at /usr/share/perl5/App/Cmd.pm line 159
	App::Cmd::run('Dist::Zilla::App') called at /usr/bin/dzil line 11
Compilation failed in require at /usr/local/share/perl/5.10.1/Dist/Zilla/MVP/Assembler/Zilla.pm line 13.
BEGIN failed--compilation aborted at /usr/local/share/perl/5.10.1/Dist/Zilla/MVP/Assembler/Zilla.pm line 13.
Compilation failed in require at /usr/local/share/perl/5.10.1/Dist/Zilla/Dist/Builder.pm line 204.

Due to a chronic lack of time, I blindly tried to upgrade Moose, MooseX::Types, Dist::Zilla and Config::MVP, but no luck.

Before I start dealing with this madness… any idea?

EDIT: thanks to the comments, I found out about moose-outdated, a script that reports the installed Moose(X) modules that have newer versions on CPAN. Running moose-outdated, I got back the following list:

$ moose-outdated
MooseX::LazyRequire
MooseX::Role::Parameterized
MooseX::SetOnce

Then I just ran:

$ cpanm MooseX::LazyRequire MooseX::Role::Parameterized MooseX::SetOnce

After doing this, dzil started working again. Thanks everyone for your comments and help!

Problems with bnx2 kernel module and high traffic

We're seeing an "elevated" level of traffic these days on the My Opera servers. As usual with operations matters, it's difficult to find one exact, clear root cause. The rest of the post explains what we found and the fix for it.

TL;DR

You want to try options bnx2 disable_msi=1 in your /etc/modprobe.d/bnx2.conf if:

  • you're running Debian Squeeze and your bnx2 version is 2.0.2
  • you see high traffic (10k+ connections)
  • you see errors on the public network interface
  • the server is dropping packets/connections randomly, or it's really slow

The gory details

Last Tuesday, the DDoS attack on the My Opera servers (still ongoing now) ramped up from ~4k req/s per frontend to ~16k+ req/s per frontend. Both frontends had been dist-upgraded (including a kernel upgrade) on May 23rd, but not rebooted, so the kernel update was armed but not actually live.

We started seeing these bad problems of dropped connections and general slowness after the frontend servers were rebooted. The reason they were rebooted is that we had been hitting another really weird problem: the 210-days uptime timer bug. See this and this bug report for more details.

Anyway, I'm not sure how to verify this, because I didn't restart the boxes myself, but my theory is that after they were rebooted, the new bnx2 kernel module, version 2.0.2, was loaded.

Later on, we found out about a very specific bnx2 v2.0.2 bug that only triggers in high-traffic situations, at least on Debian Squeeze and Ubuntu, and causes network interfaces to stop working correctly, dropping traffic.

Long story short, there's a magic option that prevents this from happening. rmmod'ing and modprobing back the bnx2 module with this option fixed the problem so far.

# /etc/modprobe.d/bnx2.conf
options bnx2 disable_msi=1

As for what the option actually does, I'm not even going to lie about it: I have no idea… We found it with this search:

https://encrypted.google.com/search?client=opera&rls=en&q=bnx2+debian+2.0.2+traffic&sourceid=opera&ie=utf-8&oe=utf-8&channel=suggest

The first hit is from our own Sven on the sysadmin team:

http://lists.us.dell.com/pipermail/linux-poweredge/2011-October/045485.html

The second hit is the solution we used:

http://ubuntuforums.org/archive/index.php/t-1726045.html

We also did some tweaking for the large number of TIME_WAIT connections that resulted from this bnx2 bug, namely bumping up net.ipv4.tcp_max_tw_buckets quite a bit.

Takeaways

  1. Before rebooting a machine, check what's going to happen, e.g. when the last upgrade was (see /var/log/dpkg.log).
  2. In case you have firewall rules, iptables-save > /root/iptables-rules.YYYYMMDD and later restore if needed with iptables-restore < iptables-rules.YYYYMMDD
  3. Always check whether the conntrack module is enabled. Most of the time you don't need it, and it will cause performance to drop under very high traffic (of course).

In this case, the conntrack module was accidentally re-enabled by the reboot. We had previously disabled it, but didn't make the change permanent. That's because on My Opera we're still not using our config management infrastructure… Looking forward to making that happen. Soon. Hopefully :)

Verifying MySQL behaviour with automated test suites and mytap

You know everything about how MySQL treats UTF8 and LATIN1 charsets, and how the collation tables impact selection and insertion of data, right?

Great, then stop reading :)

I don't, and since I'm in the process of setting up a new version of the Opera accounts database, I really don't want to screw things up. I tried to fully understand how MySQL works in this respect (charsets, collations, etc…), but reading documentation and memorizing it wasn't very easy. Plus, there are thousands of blog posts on the matter, not always 100% accurate.

So I thought I'd better get hands-on, and I wrote a kind of database test suite.

Now this test suite is hooked up to the main project builds on Jenkins. Here's a sample output:


[...]
[workspace] $ /bin/sh -xe /tmp/hudson3255767718598715423.sh
+ ./bin/run-dbtest-suite
basedir=/var/lib/jenkins/jobs/auth-db/workspace
/var/lib/jenkins/jobs/auth-db/workspace/t/database-tests/__initdb__.my ........................................... 
1..2
ok 1 - Using utf8tests database
ok 2 - Server charset is latin1
ok
/var/lib/jenkins/jobs/auth-db/workspace/t/database-tests/collation-utf8_bin.my ................................... 
1..6
ok 1 - All our records are there. No duplicate key error.
ok 2 - utf8_bin collation does not collate a/â/à/A/...
ok 3 - utf8_bin collation does not collate a/â/à/A/...
ok 4 - utf8_bin collation does not collate a/â/à/A/...
ok 5 - Query for mixed-case username does not return lowercase username
ok 6 - Query for upper-case username does not return lowercase username
ok
/var/lib/jenkins/jobs/auth-db/workspace/t/database-tests/collation-utf8_general_ci.my ............................ 
1..7
ok 1 - Collation for t007 is utf8_general_ci
ok 2 - utf8_general_ci collation normalizes accents, diacritics and the like
ok 3 - A and Å are collated to the same character in the utf8_general_ci table
ok 4 - å and Å are collated to the same character in the utf8_general_ci table
ok 5 - lower/upper case chars are collated in the utf8_general_ci table
ok 6 - lower/upper case chars are collated in the utf8_general_ci table
ok 7 - We are allowed to insert all records just because there is no unique constraint
ok
/var/lib/jenkins/jobs/auth-db/workspace/t/database-tests/collation-utf8_unicode_ci.my ............................ 
1..7
ok 1 - Collation for t005 is utf8_unicode_ci
ok 2 - utf8_unicode_ci collation normalizes accents, diacritics and the like
ok 3 - A and Å are collated to the same character in the utf8_unicode_ci table
ok 4 - å and Å are collated to the same character in the utf8_unicode_ci table
ok 5 - lower/upper case chars are collated in the utf8_unicode_ci table
ok 6 - lower/upper case chars are collated in the utf8_unicode_ci table
ok 7 - We are allowed to insert all records just because there is no unique constraint
ok
/var/lib/jenkins/jobs/auth-db/workspace/t/database-tests/default-table-charset.my ................................ 
1..3
ok 1 - Default character set is utf8 when no charset is specified (from server)
ok 2 - Default character set is utf8 when "CHARSET utf8" specified in the CREATE TABLE
ok 3 - Default character set is utf8 when "CHARSET utf8" and "COLLATE" specified in the CREATE TABLE
ok
...
/var/lib/jenkins/jobs/auth-db/workspace/t/database-tests/username-with-utf8-chars.my ............................. 
1..5
ok 1 - We have some UTF-8 encoded string in our hands (hex)
ok 2 - We have some UTF-8 encoded string in our hands (charset)
ok 3 - Can select back UTF-8 content from a CHARSET utf8 table
ok 4 - Given string is exactly 24 bytes long (length)
ok 5 - Given string is exactly 8 (wide) characters long (char_length)
ok
All tests successful.
Files=11, Tests=80, 0.739731 wallclock secs ( 0.05 usr  0.02 sys +  0.07 cusr  0.01 csys =  0.15 CPU)
Result: PASS
Recording test results
Finished: SUCCESS

And here's an example of a "sanity check" test case, which doesn't do much:


-- Check that we can insert and retrieve UTF-8 content correctly

BEGIN;

SET NAMES utf8;

SELECT tap.plan(5);

USE auth_utf8tests;

SET @username = '今日话题今日话题';
SET @encoded  = 'E4BB8AE697A5E8AF9DE9A298E4BB8AE697A5E8AF9DE9A298';

SELECT tap.eq(
    HEX(@username),
    @encoded,
    'We have some UTF-8 encoded string in our hands (hex)'
);

SELECT tap.eq(
    CHARSET(@username),
    'utf8',
    'We have some UTF-8 encoded string in our hands (charset)'
);

INSERT INTO t001 (f1) VALUES (@username);

SELECT tap.eq(
    (SELECT HEX(f1) FROM t001 WHERE f1 = @username),
    @encoded,
    'Can select back UTF-8 content from a CHARSET utf8 table'
);

SELECT tap.eq(
    (SELECT LENGTH(f1) FROM t001 WHERE f1 = @username),
    24,
    'Given string is exactly 24 bytes long (length)'
);

SELECT tap.eq(
    (SELECT CHAR_LENGTH(f1) FROM t001 WHERE f1 = @username),
    8,
    'Given string is exactly 8 (wide) characters long (char_length)'
);

-- Finish the tests and clean up.
CALL tap.finish();
ROLLBACK;

This SQL test code uses mytap. You can see how the SELECT tap.* calls are just the equivalents of Perl's TAP testing functions: SELECT tap.eq() is the equivalent of Test::More::is(), and so on.
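For comparison, the Perl equivalent of one of those assertions under Test::More would read:

use strict;
use warnings;
use Test::More tests => 1;    # the equivalent of SELECT tap.plan(1)

# The equivalent of SELECT tap.eq($got, $expected, $description)
is( 'utf8', 'utf8',
    'We have some UTF-8 encoded string in our hands (charset)' );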

Another, more interesting test case, is the following:


--
-- Verify how the utf8_unicode_ci collation works
--

BEGIN;

SET NAMES utf8;

SELECT tap.plan(12);

USE auth_utf8tests;

[...]

SELECT tap.eq(
    (SELECT TABLE_COLLATION FROM information_schema.TABLES WHERE TABLE_SCHEMA=SCHEMA() AND TABLE_NAME='t015'),
    'utf8_unicode_ci',
    'Collation for t015 is utf8_unicode_ci'
);

[...]

SELECT tap.eq(
    (SELECT GROUP_CONCAT(id) FROM t015 WHERE username = 'testuser1a' ORDER BY id),
    '10',
    'utf8_unicode_ci collation normalizes accents, diacritics and the like'
);

SELECT tap.eq(
    (SELECT GROUP_CONCAT(id) FROM t015 WHERE username = 'testuser1Å' ORDER BY id),
    '10',
    'A and Å are collated to the same character in the utf8_unicode_ci table'
);

SELECT tap.eq(
    (SELECT GROUP_CONCAT(id) FROM t015 WHERE username = 'testuser1å' ORDER BY id),
    '10',
    'å and Å are collated to the same character in the utf8_unicode_ci table'
);

SELECT tap.eq(
    (SELECT GROUP_CONCAT(id) FROM t015 WHERE username = 'TestUser1A' ORDER BY id),
    '10',
    'lower/upper case chars are collated in the utf8_unicode_ci table'
);

SELECT tap.eq(
    (SELECT GROUP_CONCAT(id) FROM t015 WHERE username = 'TESTUSER1A' ORDER BY id),
    '10',
    'lower/upper case chars are collated in the utf8_unicode_ci table'
);

SELECT tap.eq(
    (SELECT COUNT(*) FROM t015),
    1,
    'We are allowed to insert only 1 record, because the others collate to the same string'
);

-- Finish the tests and clean up.
CALL tap.finish();
ROLLBACK;

An interesting thing I didn't know how to do in the beginning was trapping errors. I left that part out of the test code above to simplify, but here it is:


DELIMITER //

DROP PROCEDURE IF EXISTS populate_table //

CREATE PROCEDURE populate_table ()
BEGIN

    DECLARE CONTINUE HANDLER FOR SQLSTATE '23000' BEGIN
        SELECT tap.ok(
            1,
            'We should get dupkey errors when inserting data with collation utf8_unicode_ci'
        );
    END;

    INSERT INTO t015 (id,username,note) VALUES (10, 'testuser1a', 'plain');
    INSERT INTO t015 (id,username,note) VALUES (20, 'testuser1â', 'circumflex a');
    INSERT INTO t015 (id,username,note) VALUES (30, 'testuser1à', 'a grave');
    INSERT INTO t015 (id,username,note) VALUES (40, 'testuser1Å', 'A circ');
    INSERT INTO t015 (id,username,note) VALUES (50, 'TestUser1A', 'mixed case');
    INSERT INTO t015 (id,username,note) VALUES (60, 'TESTUSER1A', 'upper case');

END;

//

DELIMITER ;

[...]

/* Should generate 5 dupkey errors (taken as successful tests) */
CALL populate_table;

It's a bit convoluted. To trap errors you have to use the DECLARE HANDLER statement. DECLARE CONTINUE HANDLER FOR SQLSTATE '23000' means: whenever SQLSTATE '23000' is raised (the state corresponding to a duplicate key error), execute this block of code. All of that must necessarily be wrapped in a stored procedure; handlers outside of stored procedures are not allowed.

In this particular test, the table uses the utf8_unicode_ci collation, so we expect a duplicate key error on username whenever we insert the string 'testuser1à' or 'TESTUSER1A', because 'testuser1a' was already inserted at the beginning. Of all the INSERT statements, only the first one is bound to succeed, so I put a SELECT tap.ok(1) in the duplicate key HANDLER and expect 5 tests to pass when I run CALL populate_table;.

This may of course seem trivial. And I guess it is, but for me it's a much better way of learning than scouring the manuals or the many blog posts out there, which may or may not reflect the environment I'm working with.

Routinely running this kind of test suite makes it possible and easy to verify the database behaviour:

  • instantly
  • after upgrades (5.1 -> 5.5? -> 6?) or storage engine changes
  • after MySQL configuration changes. For example, this is how I discovered that adding default-charset=utf8 to my MySQL config breaks everything.

I consider this my live documentation on how MySQL works. I would really appreciate if you have any feedback on this. Have fun!

Report from the Varnish Users Group (VUG5) meeting in Paris – Day 1

Last week I attended the VUG5 meeting (https://www.varnish-cache.org/vug5). The following is my report of Day 1 of the conference, the "Users" day.

TL;DR

I learned a lot about (for me) gray areas of Varnish, like 3.0, VMODs, ESI and various corner cases. My presentation on how we use Varnish at Opera sparked a lot of interest, especially in our thumbnail service.

Day 1, VUG5 users day

Day 1 was held at La Défense, a mega business district just outside Paris. The whole day was filled with presentations by Varnish Software people and a few other companies. On with the list, and my notes on the side.

Keynote: Varnish in 2020 by Poul Henning Kamp, Varnish Software

Poul runs thttpd; he's not a Varnish user, so he welcomes feedback from all users. That's the whole point of the VUGs.

Varnish today is "The HTTP delivery engine". And in 2020? Hard to predict; PHK usually predicts things really badly. What we _can_ see is:

  • HTTP/2.0 reached Last Call status just a few weeks ago
  • Google's SPDY support in Varnish? Most likely. Depends on future development and what/how many clients pick it up
  • HTTP over UDP? Lots of interest in this lately

The most likely future work on Varnish:

  • Clearer split of transport and semantics
    (could speak HTTP no matter whether over UDP, TCP or SPDY)
  • Generic pluggable protocols (SPDY, f.ex.)
  • Decouple client protocol and backend protocol. Talk SPDY to client, talk HTTP to backend.

SSL in Varnish? Unlikely, just use Pound or nginx or whatever. Pound is simple and robust.

Varnish Book by Kristian Lyngstøl, Varnish Software

Expanding and improving on the existing training course material, Kristian and some contributors created a "Varnish Book" to help people getting started with Varnish. It will be freely available at https://www.varnish-software.com/static/book/ (right now there's only a cute bunny there, though).

Varnish + Escenic by Richard Zuidhof, Escenic?

Richard explained how he used Varnish to migrate away from an Apache/Squid/Apache sandwich, making it better and faster, and saving his company a lot of money in the process.

Interesting points:

50x errors received from the backends are served by doing a restart in vcl_fetch(), but hitting a "dummy" backend, a sort of static version of a real backend. Something like:


  sub vcl_recv {
     ...
     if (req.restarts > 0) {
         set req.backend = dummy;
     }
  }
 
  sub vcl_fetch {
     if (beresp.status == 500) {
        return (restart); # Or whatever this is
     }
  }

He also talked about various backend timeouts, like:

 
  backend some_backend {
    .host = "...";
    .first_byte_timeout = 1s;
    .between_bytes_timeout = 1s;
  }

and how he needed to bump them back up to 120s/180s for some of their pages to work.

He said that a timeout event from the backend should cause Varnish to fall back to stale content, but that's not the case currently: Varnish will abort the fetch operation instead. So pay attention.

Mobile device detection by Lasse Karsten, Varnish Software

Talked about various libraries and ways to detect mobile devices, including:

  • libvarnish-deviceatlas
  • WURFL
  • … others I didn't write down in time

Basically it was a way to survey how many people use this technology, and to say that Varnish Software has a commercial solution that they are going to open source Soon(tm), or something along these lines.

I was a bit distracted because I was having problems with my laptop and my presentation was coming up, so… I plan to go back to this presentation once the slides are up.

ESI and Varnish by Federico Schwindt, RBS

A summary of how RBS uses ESI for an internal website for RBS employees.

Basically, the service is composed of various "boxes", small windows in the page with some information that depends on location, department or other things. They use Varnish to cache those small boxes, and ESI to compose the final page.

Problems:

  • They can't find a way to also keep the fully composed page as a cache object.
  • Invalidation logic is complex because of inter-dependent content between different boxes.

Interesting: they use an HTTP header sent by the backend to instruct Varnish on when to do ESI processing, so ESI is not an all-or-nothing switch; it can be triggered on specific pages. This is very cool, because it could also solve the development/production setup problem I had always feared when using ESI: the complication of development environments, where every dev installation needs an ESI-aware Varnish.
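In Varnish 3.0 VCL, such a backend-controlled switch could look something like this (the header name is hypothetical):

sub vcl_fetch {
    # Only enable ESI processing when the backend asks for it
    if (beresp.http.X-Do-ESI == "1") {
        set beresp.do_esi = true;
    }
}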

Varnish at Opera by me

I talked about how we use Varnish in our projects. I mentioned a few Varnish extensions I worked on, including varnish-accept-language and varnish-geoip, plus other tools like http-cuke.

There were plenty of real-world examples of the VCL configuration we use in our various projects. I also talked about the Varnish puppet module we wrote, which comes with a bunch of interesting customizations and fixes, included in the puppet-modules repository on Github.

If you're interested, slides are published here:

http://www.slideshare.net/cstrep/vug5-varnish-at-opera-software

I got lots of feedback and questions about our picture thumbnail service, so I'll probably write more about it soon.

Security with VCL by Kacper Wysocki, Redpill Linpro

Easily one of the best talks of the day. Kacper explained his security.vcl project. Here are a few highlights, but it's really interesting stuff; I hope the slides will be up soon.

  • He wrote a mod_security rules parser and converter to VCL
  • Eduardo Scarpellini, in a Master's thesis on OWASP, worked on a varnish-firewall project, similar in scope, and did in-depth research, finding that out of the OWASP top broken apps, he could automatically block 73% of XSS and SQL injections
  • security.vcl is now used on ~10 sites with lots of traffic
  • A drawback compared to mod_security is that no POST data can be analyzed (yet)
  • In the future, we will see a merge of the security.vcl and varnish-firewall projects

Varnish modules by Kristian Lyngstøl, Varnish Software

I don't remember much from this one, but I think Kristian basically tried to get more people to use VMODs, and mentioned there's now a nice page listing all known VMODs:

http://www.varnish-cache.org/vmods

where you can register your own VMODs and have them listed.

Stay tuned for the "Day 2, Developers day" part.

Using hypnotoad in production, anyone?

So, you're using hypnotoad in production, and it works perfectly for you. Maybe you have Nginx or Apache in front of it, configured as a reverse proxy. Everything's great. Right? Right. Then I have a zillion questions for you.

Maybe I don't understand how it works, but I'm having the following problems:

  • "sometimes" hypnotoad won't stop. I usually try to stop it with:
    hypnotoad --stop /path/to/my/script
  • I use symlinks to deploy applications, so for example I deploy in /opt/myapp and each new deployment gets a timestamped folder, /opt/myapp/releases/20120224-180801.

    Then there's a symlink that always points to the last deployed version: /opt/myapp/current -> /opt/myapp/releases/{whatever-datetime}. Now, using hypnotoad --stop /opt/myapp/current doesn't work, because hypnotoad probably uses the actual filename, not the symlink, to identify the running application.

    That's fine, but then how can I stop it reliably? I wish it had a hypnotoad --force-stop mode or something.

  • Last problem: when I push a new deployment and stop and restart hypnotoad, the application often doesn't work properly; it just generates exceptions for unknown reasons. Stopping and restarting it again manually usually fixes the problem…
I was a bit frustrated today, so I decided to switch back to Starman. I have never ever had a problem with it, so I will stick with it for now. But I would still be interested to know whether you use hypnotoad in production and how well it works for you. Write in the small box below, you don't need to register. Thanks :)