If you’re looking for a Perl client to connect to a statsd daemon, check out Net::Statsd on CPAN, now at version 0.08.
This post is about the server component of statsd.
Tracking metrics: up to now
The idea of statsd started at Flickr with Cal Henderson, and some code is still available, but it’s not very functional or complete.
Since I first read about statsd, I have found the concept brilliant. I had been using a similar technique long before hearing about statsd, though. I learned it from colleagues here at Opera in 2008, who were using it to track application metrics for the Opera Link server. I thought it was great, so I implemented it too, extending it to make it very easy to add metrics and to see the output automatically in Munin. Here’s basically how it worked:
# ...
use Opera::Stats;
# ...
# Store/increment the "site.logins" counter in memcached
Opera::Stats::count("site.logins");
# ...
The project code would typically have tens or hundreds of these calls. Each call would store or increment a counter in a local or remote memcached. A complementary Opera::Stats::Munin module would then automatically generate the output needed to implement a full Munin plugin, given the metrics to be exposed.
So far, so good. Except there were a few things that didn’t work quite right:
- Using TCP connections, possibly to remote machines, was never an actual problem, but it could have become one if the memcached machines went down
- Volume was a concern. I had to worry about tracking too many metrics. How would that affect the functioning of memcached for regularly stored keys and values? Would the metrics-related keys cause evictions in the regular memcached content?
- Even though the Munin integration made it very easy to have charts, there were still limitations: creating a new chart required a wrapper plugin with one or two lines of Perl code (see the sketch after this list). Flexibility was also an issue.
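For illustration, such a wrapper plugin looked roughly like this. The run() interface and its parameters here are assumptions for the sake of the example, not the actual Opera::Stats::Munin API:

#!/usr/bin/perl
# Hypothetical Munin wrapper plugin; the run() interface is an
# assumption, not the actual Opera::Stats::Munin API.
use Opera::Stats::Munin;

Opera::Stats::Munin->run(
    title   => "Site logins",
    metrics => [ "site.logins" ],
);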
Enter statsd
I have been thinking of replacing this system with statsd for a while. However, I wanted to have a more in-depth look at it before deploying it.
It turns out that statsd is a simple project, which I like, but it requires nodejs. Knowing next to nothing about nodejs, I took some time to learn a few things.
I also realized that I had been wanting to learn AnyEvent for a long time.
Net::Statsd::Server
Two weeks ago, I spent a busy weekend reimplementing 95% of statsd in Perl. On Sunday night, I had a functional version of statsd written in Perl with AnyEvent.
AnyEvent stuff is surprising at times. I found it especially interesting to debug the cases where your timer (AE::timer) doesn’t fire unless you actually save it to a scalar, as in:
use AnyEvent;
sub do_something { warn "timer fired\n" }

# This won't fire! The guard returned by AE::timer is discarded
# immediately. (This behaviour is triggered by "defined wantarray".)
AE::timer 10, 10, \&do_something;
# This will though: saving the guard keeps the timer alive.
my $t = AE::timer 10, 10, \&do_something;
AnyEvent->condvar->recv;   # enter the event loop
Since that weekend, I have spent a few more nights tweaking Net::Statsd::Server. Yesterday I wrote a new piece of functionality (a “File” backend) that is not in the original statsd.
It looks like I might need more backends as well, so I think it’s “an investment with a good ROI”, even though I did it mainly for fun and in my free time.
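For a rough idea of the shape of such a backend, here is a minimal sketch of a file-writing backend. The package name, constructor and flush() signature are illustrative assumptions, not the actual Net::Statsd::Server backend API:

# Hypothetical sketch of a file-writing statsd backend; the
# package name and the flush() signature are assumptions, not
# the actual Net::Statsd::Server backend API.
package My::Statsd::Backend::File;

use strict;
use warnings;

sub new {
    my ($class, %opt) = @_;
    return bless { file => $opt{file} }, $class;
}

# Called at every flush interval with a snapshot of the metrics
sub flush {
    my ($self, $timestamp, $metrics) = @_;
    open my $fh, '>>', $self->{file} or die "open: $!";
    for my $name (sort keys %{ $metrics->{counters} }) {
        print {$fh} "$timestamp $name $metrics->{counters}{$name}\n";
    }
    close $fh;
}

1;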
Performance
I wanted to make sure my statsd server implementation would be fast. I started by bringing up the nodejs statsd and firing my benchmark.pl script at it with 1 million iterations, then comparing the results with my own statsd server.
That didn’t work out very well. Or rather, it worked out brilliantly, showing around 40k requests/s being handled by nodejs-statsd and 50k requests/s by Net::Statsd::Server. Problem is: how do you measure the performance of a UDP server? Or, for that matter, of a UDP client?
I figured out that, UDP being connection-less and fire-and-forget, it doesn’t really matter how many packets/s the client fires, as long as you can generate more than your server can handle. Just as a data point, I reached around 73-75k statsd API calls per second for the gauge API, and around 55-58k for counters and timers. What really matters is how many packets reach the server.
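To make the fire-and-forget point concrete, here is a minimal sketch of a UDP statsd client. The host, port and metric names are made up for the example; the wire format ("name:value|c" for counters, "|ms" for timers, "|g" for gauges) is the standard statsd protocol:

# Minimal fire-and-forget UDP statsd client (host, port and
# metric names are illustrative). Nothing tells the sender
# whether the server actually received any of these datagrams.
use strict;
use warnings;
use IO::Socket::INET;

my $sock = IO::Socket::INET->new(
    Proto    => 'udp',
    PeerAddr => '127.0.0.1',
    PeerPort => 8125,
) or die "socket: $!";

$sock->send("site.logins:1|c");           # counter
$sock->send("api.response_time:250|ms");  # timer
$sock->send("queue.depth:42|g");          # gauge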
BTW, I used another amazing piece of software called Devel::NYTProf to optimize the performance of the incoming packets code path as much as I could.
The test setup
To measure how many packets are received on the server side, I prepared a test configuration:
{ graphitePort: 2003
, graphiteHost: "graphite.localdomain"
, host: "0.0.0.0"
, port: 8125
, backends: [ "./backends/graphite", "./backends/console" ]
, mgmt_address: "0.0.0.0"
, mgmt_port: 8126
}
The same configuration file for the Perl server becomes:
{ "graphitePort": 2003,
"graphiteHost": "graphite.localdomain",
"host" : "0.0.0.0",
"port": 8125,
"mgmt_address" : "0.0.0.0",
"mgmt_port": 8126,
"backends": [ "Graphite", "Console" ],
"log" : {
"backend" : "stdout",
"level" : "LOG_WARN",
}
}
Using the benchmark.pl script mentioned above, run with:
$ perl benchmark.pl 1000000
I first started up the nodejs statsd, then the Net::Statsd::Server daemon, and captured the output of both. Both servers were configured to use their Graphite backend and flush to a valid and active graphite host. The Console backend was also active for both servers, so I could capture the output, look at the statsd.packets_received counter, and directly measure how many packets were received by the server.
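As an aside, both daemons also listen on the management port (8126 in this configuration). Assuming the management interface accepts the same plain-text commands as the nodejs statsd, the counters can also be inspected interactively:

# Assumes the management interface supports the nodejs statsd
# "counters" command
$ echo counters | nc localhost 8126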
The benchmark utility, with its first argument set to 1000000, generates 5 million statsd API calls (5 calls per iteration), that is, 5 million UDP packets.
Of these 5 million packets, the nodejs statsd was able to capture 2106768, 1596275, 1479145 and 1490640 packets over 4 runs.
Net::Statsd::Server, again over 4 different runs, was able to capture 2106242, 1884810, 1822042 and 1866500 packets.
I have performed more tests, and they showed very low deviation from these runs (~1.5M packets for Etsy’s statsd and ~1.8M for Net::Statsd::Server). Removing the two ~2.1M peak results, it would seem that the Perl statsd is capable of receiving 22% more packets than the original statsd daemon written in javascript.
Of course, this is just my test. I have tried to run the test on different hardware, but I haven’t got significantly different results. If you try yourself, please let me know what numbers you get. I’d be curious to know :-)
SO_RCVBUF
Given the massive amount of UDP packets lost in the tests (50%+ even in the best runs), I tried to figure out a way to improve this, and I stumbled on SO_RCVBUF.
My understanding was that bumping up SO_RCVBUF on the listening UDP socket would dramatically decrease packet loss. However, at first I wasn’t able to prove that theory, because I saw no improvement in the total number of packets received. That changed when I read an article about UDP packet loss on stackoverflow.com, which pointed me to the net.core.rmem_max sysctl.
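For reference, net.core.rmem_max is the kernel-wide cap on socket receive buffer sizes, in bytes, and SO_RCVBUF requests above it are silently clamped. Raising it looks like this (104857600 being the 100M value used below):

# Raise the kernel cap on socket receive buffers (bytes);
# SO_RCVBUF requests above this limit are silently clamped.
$ sudo sysctl -w net.core.rmem_max=104857600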
After modifying net.core.rmem_max, setting it to 100M just to rule it out as a limiting factor, and using the following code in Net::Statsd::Server:
# SOL_SOCKET and SO_RCVBUF constants come from the core Socket module
use Socket qw(SOL_SOCKET SO_RCVBUF);

# Bump up SO_RCVBUF on the UDP socket, to buffer up incoming
# UDP packets and avoid massive packet loss when load is very high.
setsockopt($self->{server}->fh, SOL_SOCKET, SO_RCVBUF, 1*1024*1024)
    or die "Couldn't set SO_RCVBUF: $!";
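To check that the new buffer size was actually applied, the option can be read back. This snippet is a sketch, not code from Net::Statsd::Server; note that on Linux the kernel reports back double the requested value, to account for bookkeeping overhead:

# Read back the effective receive buffer size (sketch only).
# Linux returns double the requested value, clamped according
# to net.core.rmem_max.
my $rcvbuf = unpack "i", getsockopt($self->{server}->fh, SOL_SOCKET, SO_RCVBUF);
warn "SO_RCVBUF is now $rcvbuf bytes\n";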
With the bigger receive buffer in place, I could see a very interesting effect.
Re-running the node.js statsd, I could see an increased number of captured packets (1691700 and 1675902, a ~10% increase).
Running the Net::Statsd::Server daemon again, I recorded 2678507 and 2477246 packets, an impressive ~40% increase!
As a last effort, I tried varying the SO_RCVBUF size from 1 to 64MB to see what effect it had on the number of captured packets (or on UDP packet loss, if you prefer).
I haven’t run any scientific set of tests, but I couldn’t see any statistically significant increase for values greater than 4-8MB, so I haven’t decided where to set the default in Net::Statsd::Server yet. Any chosen value is likely to need specific sysctl tuning anyway, so YMMV.
Why?
Did I really do it for fun? Yes, mainly, but also because:
- I don’t like adding node.js to our production stack just to run statsd. I have never operated a node.js server, so I don’t want to take that “risk”. The product we’re building is going live soon! :-) Note that this would apply to any technology that’s new to us, it’s not about node.js per se :-)
- To learn how statsd was put together
- To learn AnyEvent
- To learn how to build a high performance UDP server
- Basically, to learn :-)
Code is up on CPAN, as usual: https://metacpan.org/module/Net::Statsd::Server.
If you happen to use it, please give me some feedback!