
MySQL (Percona XtraDB) slave replication crash resilience settings

It’s been a geological age since my last blog post!

Oh, so many things happened in the meantime. For the past four years, I worked on the development and operations side of the news recommendation system that powered Opera Discover. Energy permitting, I plan to write a recommender systems “primer” series. We’ll see.

Meanwhile, I’d like to keep these notes here. They’ve been useful in making MySQL replication recover gracefully from network instability, abrupt disconnections and datacenter failures in general. Here they are.

Coming to MySQL 5.6 on Debian Wheezy, we began to experience MySQL replication breakages after abrupt shutdowns or sudden machine crashes. When systems came back up, more often than not, MySQL replication would stop due to corrupted slave relay logs.

I started investigating this problem and soon found documentation and blog posts describing the log corruption issues and how MySQL development addressed them. Here are the pages I used as references:

Additionally, we had (I believe unrelated) problems with some mysql meta tables that couldn’t be queried, even though they were listed as existing in the mysql shell and in the filesystem.
We solved this problem with the following steps:

DROP TABLE innodb_table_stats;
ALTER TABLE innodb_table_stats DISCARD TABLESPACE;
stop mysql
rm -rf /var/lib/mysql/mysql/innodb_table_stats.*
restart mysql

These steps have to be executed in this order, even if altering a table after having dropped it may seem nonsensical. It is nonsensical, as sometimes mysql things are.

Crash safe replication settings

We’ve distilled a set of standalone replication settings that will provide years and years of unlimited crash-safe replication fun (maybe). Here they are:

# More resilient slave crash recovery
master-info-repository = TABLE
relay-log-info-repository = TABLE
relay-log-recovery = ON
sync-master-info = 1
sync-relay-log-info = 1

Let’s see what each of these settings does.

master-info-repository=TABLE and relay-log-info-repository=TABLE instruct MySQL to store master and relay log information in the mysql database rather than in separate *.info files in the /var/lib/mysql folder.
This is important because in case of crashes, we would like to ensure that master/relay log information is subject to the same ACID properties that the database itself provides. Corollary: make sure the relevant meta tables have InnoDB as storage engine.
For example, a SHOW CREATE TABLE slave_master_info should say Engine=InnoDB.

relay-log-recovery=ON is critical in case of corruption of relay log files on a slave system. When MySQL encounters corrupted relay log files during startup, by default it will drop the ball and halt. Setting this option to ON will cause MySQL to attempt to refetch the relay log files from the master database. The master should then be configured to keep its binlogs for a suitable amount of time (I often use 2 weeks, but it really depends on the volume of database changes). As a test, it’s possible to replace the current relay log file with a corrupted copy (from /dev/urandom for example). MySQL will discard the corrupted log file and attempt to download it again from the master, after which a regular startup will be carried out. Fully automatic recovery!

sync-master-info=1 and sync-relay-log-info=1 enable the synchronized commit of both master and relay log information to the database with every transaction commit. Again, this is something that must be evaluated for each application. If you have a high volume of writes, you most probably don’t want to enable it. However, if the write rate is low enough, this option won’t cost any noticeable performance and should instead make sure that the slave_master_info and slave_relay_log_info tables are always consistent with the state of the replication and of the rest of the database.
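As a quick sanity check, the settings above can be verified against an option file with a few lines of shell. This is only a sketch: the function name check_crash_safe and the file path in the usage note are made up for the example, and the comparison is deliberately strict, so an equivalent spelling such as relay-log-recovery = 1 would be reported as missing.

```shell
# Sketch: report which crash-safe replication settings are missing from a
# MySQL option file. check_crash_safe is an illustrative name.
check_crash_safe() {
    cnf=$1
    missing=""
    for want in \
        "master-info-repository=TABLE" \
        "relay-log-info-repository=TABLE" \
        "relay-log-recovery=ON" \
        "sync-master-info=1" \
        "sync-relay-log-info=1"
    do
        # Drop comments and whitespace, then compare whole lines.
        # MySQL accepts dashes or underscores in option names, so
        # underscores are normalized to dashes before matching.
        if ! sed -e 's/#.*//' -e 's/[[:space:]]//g' "$cnf" \
            | tr '_' '-' | grep -qix "$want"; then
            missing="$missing $want"
        fi
    done
    if [ -n "$missing" ]; then
        echo "missing:$missing"
        return 1
    fi
    echo "all crash-safe replication settings present"
}
```

Usage would be something like check_crash_safe /etc/mysql/my.cnf, assuming that is where your option file lives.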

That is all. I’d love to hear any feedback or corrections to this information.

Display and filter traffic at the varnish level: vlogdump

Haven’t written much in the last few months. The reason is that I’ve been at work building the Opera Discover service backend, that we launched on Opera mobile for Android just a few days ago.

A few weeks before, during the first stress test sessions of Discover, I wrote this little tool called vlogdump that Opera allowed me to put up on github. The main purpose, besides learning awk :-) is to display and filter traffic coming into your varnish daemon.

vlogdump is not meant to replace varnishlog but I know that sometimes varnishlog gives me too much output to deal with, especially if I want to pinpoint a single client or a single request. I know that the varnishlog that ships with varnish 3.0.x is way better in this regard, but we’re using 2.1.x, and that version of varnishlog is not as capable.

vlogdump is easier to look at than varnishlog, but at the same time it conveys much more information than varnishncsa or the typical access.log format.

Here’s an example of output:

$ varnishlog | vlogdump -v only_misses=1
172.22.0.15 => GET /assets/e85ed0a7b1b87120a0a2bfa025531c6733a48802 HTTP/1.0 MISS
            <= 200 OK 28.432 ms
172.22.0.18 => GET /assets/5a9e9440c5c85e8dc5d65e03e15c95e390901fa7 HTTP/1.0 MISS
            <= 200 OK 36.905 ms
172.22.0.18 => GET /icons/categories/te/icon32x32-technology.png HTTP/1.0 MISS
            <= 304 Not Modified 0.589 ms
172.22.0.15 => GET /api/fetch/article-preview/?client=2&language=en-GB HTTP/1.1 MISS
            <= 301 MOVED PERMANENTLY 8.381 ms
172.22.0.18 => GET /assets/c3830e95b717761005e26ce49ebab253e0ccb40b HTTP/1.0 MISS
            <= 200 OK 291.354 ms
172.22.0.18 => GET /api/category?client=2&language=en-GB HTTP/1.1 MISS
            <= 200 OK 58.025 ms
...

Another interesting example.

Show request and response headers of transactions that resulted in cache hits and had request headers (any of them) matching "Android":

$ varnishlog | vlogdump -v show_req_headers=1 -v show_resp_headers=1 -v req_headers_match=Android -v only_hits=1
83.149.37.122 => GET /api/category/?... HTTP/1.1 HIT
            <= 200 OK         0.088 ms
   req.http.Accept = application/json;version=1
   req.http.Accept-Encoding = gzip
   req.http.Host = ...opera.com
   req.http.Connection = Keep-Alive
   req.http.User-Agent = Mozilla/5.0 (Linux; Android 4.1.2; GT-N7100 Build/JZO54K) AppleWebKit/537.31 (KHTML, like Gecko) Chrome/26.0.1410.58 Mobile Safari/537.31 OPR/14.0.1074.57768
 
   beresp.http.Server = Apache
   beresp.http.Content-Encoding = gzip
   beresp.http.Content-Type = application/json
   beresp.http.Vary = Accept-Encoding, Origin
   beresp.http.Content-Length = 4217
   beresp.http.Date = Sat, 25 May 2013 07:59:53 GMT
   beresp.http.X-Varnish = 1611090407 1611007435
   beresp.http.Age = 267
   beresp.http.Via = 1.1 varnish
   beresp.http.Connection = keep-alive

Now that you're eager to try it :-), you can do so in a few commands, and assuming you have the right awk installed:

wget -q -Ovlogdump https://raw.github.com/cosimo/vlogdump/master/vlogdump
varnishlog | ./vlogdump [options]

The documentation lists all the available options.

You can do more interesting things:

  • display request or response headers for each transaction
    (-v show_req_headers=1, -v show_resp_headers=1)
  • show only requests slower than 200ms (vlogdump -v only_slow=200)
  • show only cache misses or hits (-v only_hits=1 or only_misses=1)
  • show only transactions where the URL matches regexp X
    (-v url_match='X' or -v url_match='!X' for negative match)
  • show only transactions where the HTTP status code was X
    (-v only_status=X)
  • show only transactions where the request or response headers match a given regular expression (-v req_headers_match=Blah, -v resp_headers_match=Error)

You can also combine most of these options together. That is very useful when you are interested in a small fraction of the traffic, but you want to see the whole in-flight transactions.

One recommendation though. It is my first (last?) significant awk script :-) I know it works well, and I'm using it, but due to the way it works, I wouldn't leave it running for long periods of time, as it will slowly eat your memory keeping track of all transactions and clients.

If you have feedback or questions, feel free to comment on github or send me an email.

My experience at Velocity Europe 2011 in Berlin

TL;DR

This year there was the 1st edition of Velocity Europe. I got to present a talk on a DDoS attack we faced at Opera, and it was really awesome to be there.

The long version…

Around July this year I knew there was going to be a Velocity Conference in Europe, and I decided I would try to propose a few ideas for talks. I didn't have my hopes too high, but I wanted to give it a shot anyways, pushing myself way out of my comfort zone :)

The worst that could happen was that the talks didn't get accepted. After a month or so the crazy thing happened, and I got an invite to speak at Velocity, due in November.

Preparation

The first few months passed while I was slowly gathering material for the talk. The idea was talking about the DDoS attack that struck us in October 2010. Almost a year had passed, so if we hadn't taken notes and collected all sorts of logs and information, we wouldn't have had any chance to reconstruct all the "story" with enough detail to be interesting.

Anyway, weeks went by, and in September I started writing down an outline of the talk. It consisted of describing what happened during the DDoS and how we faced it, what we did, how we figured out what to do, etc… but I didn't have a clear idea of what to convey with the actual presentation. What would be the core message, if any?

If I learned anything out of all this, it is that writing an outline is absolutely the best favor you can do yourself to avoid so many problems later on. Just write it down as a text, a blog post, a story. Mind maps are also useful for me.

The last 3-4 weeks flew by while I was trying to put together a decent deck of slides.

In Berlin: pre-conference

"Birds of feathers" was the pre-conference event that took place on Monday 7th (November 2011, if you're reading this in the future), put together by a local team led by Schlomo Schapiro, which I got to talk to also during the conference. It was a good event, Steve Souders and John Allspaw and many other conference attendees were already there. There were sponsor companies presenting their products.

The most interesting sessions of the pre-conference IMO were:

  • 100ms: Steve Souders pushed everyone to think about the next level of web performance. How to bring down the "loading time" of web pages to 100ms. There was an interesting discussion about that. My point was that loading time really needs to be divided into at least dns resolution, server processing, network transfers, client rendering. So there's at least 4 totally different chunks that make up the load time and all of them can be optimized, but with varying levels of gain and complexity.
  • Dyn inc presentation about their product dashboard, that led to a better productivity and communication between teams. Cory van Wollerstein explained their mash-up of Jira and Confluence, used to automatically pull information from the tickets db and provide high-level overviews to executive teams. Very cool. He also argued whether having product managers is a good thing for a company.

The rest of the day I was busy polishing my presentation, and trying to rehearse at the hotel. A month before the conference, I had bought The Naked Presenter (ebook edition), hoping that it would help me do a decent presentation. The book of course recommended rehearsing. It felt very weird and embarrassing, but I'm *so* glad I did it. I managed to streamline the presentation, and memorize the sequence of slides.

The Conference – Day 1

Schedule:

http://velocityconf.com/velocityeu/public/schedule/grid/2011-11-08

Keynotes

Opening remarks, plus Theo Schlossnagle, one of the minds behind SurgeCon, on how good operations dudes are usually generalists and need to have a wide spectrum knowledge instead of being "(Perl|Python|Ruby|Java) developers". I really recognize myself in this more generalist role than, for example, the Ruby-on-Rails guy.

Lightning demos

These were lightning demos during the first morning:

Rest of Day 1 went to hell

I had to convert all my slides to 4:3 and test again with the on-site equipment. I was also freaking out at the same time, so I missed everything else until my talk. Sorry :)

Most talks have been recorded and are already up on the Velocity site. Particularly interesting IMO, but video not available yet, are:

My talk

As I said, it was about the DDoS attack against my.opera.com in October 2010. I basically talked about how we found out we were under DDoS, and how we struggled to find our way to keep the site up and running despite the traffic. This was a mid-scale DDoS with around 18k distinct IPs attacking us. We had a hard time, but it was also very much fun in retrospect :) We learned quite a lot in the process, about HTTP and TCP/IP, nkiller2 and the TCP zero-window exploit. Most importantly, we learned to make better use of old and new tools to do troubleshooting. You will find all of this in the slides.

I did my best, and I think it was well received by the audience. While on stage, I really had the feeling that people enjoyed it, plus several folks came to say hello afterwards. One of the most frequent comments I heard was that people found my talk honest. That is the single thing I appreciate the most, because that had been one of my goals since the start. To tell an honest and detailed story of how things went, without pretending to be the super awesome heroes that know everything and can fix anything in no time.

Unfortunately, after the conference I was informed that there had been no recording of the talk. That is really sad. However, since there's no recording, I can pretend I was a nice speaker, given the ratings :). Seriously, if you have a picture or video recording, contact me :)

Here's the slides if you're interested:

http://velocityconf.com/velocityeu/public/schedule/detail/21653

The Conference – Day 2

Schedule:

http://velocityconf.com/velocityeu/public/schedule/grid/2011-11-09

Keynotes

Very inspirational talk by Jeff Veen, Typekit.com

Very well presented, great visuals. Great overall. How to create conditions for teams to work and work well.

http://velocityconf.com/velocityeu/public/schedule/detail/21788

Anticipation: What could possibly go wrong? by John Allspaw, Etsy

A great talk about how to prevent, analyze, respond to Operations problems. I very much like John's style, I think he's a pioneer, at least he introduced me to many great ideas, one above all, continuous deployment. I also like his many references to aviation, aerospace and military engineering fields.

http://velocityconf.com/velocityeu/public/schedule/detail/22258

Full stack awareness, Artur Bergman, Fastly

He's Artur Bergman. Listen to him :-) If anything, because he's really authentic.

http://velocityconf.com/velocityeu/public/schedule/detail/22914

Lightning demos

Another session of lightning demos, for our pleasure:

Browser performance track

This was a track in itself. I missed all of it, since I mainly followed the Operations track, but from what I heard it was really interesting. Recent speed enhancements in Opera, Chrome, Firefox and Javascript in general were explained in detail.

Afternoon talks

Deploying large payloads at scale, Ramon van Alteren (hyves.nl)

Biggest social network in the Netherlands (4M daily active users, ~10M total users). Ramon is a very cool guy. They have 3.5K servers, and their main application consists of 750 MB of compiled PHP binaries to deploy. And they are experimenting with bittorrent tools to do that :)

I had a few hours of engaging talk with Ramon at one of the social events that followed the conference. We found lots of similarities in how we're dealing with infrastructural growth, scaling, etc… We both use config management tools like puppet extensively in our organizations. We promised each other to remain in touch about deployment matters.

http://velocityconf.com/velocityeu/public/schedule/detail/21571

HTTP connection management from 10 users to 100 millions, Bradley Heilbrun, YouTube

Really interesting dive into YouTube early (2005-2007) architecture with Apache, load balancers, GSLB.

I met Bradley later on that day and we had a quick chat. Turns out they use(d) PowerDNS with its pipe backend for geographic load balancing, much like we do at Opera with GeoDNS. That made my day :-) It's a pity that companies like YouTube don't talk much about their current technology. They usually tell you about 2-3 year old architectures. That's still very valuable, of course.

http://velocityconf.com/velocityeu/public/schedule/detail/21708

Conclusions

If you're even remotely interested in operations, devops, running a service, scaling, performance, infrastructure, then Velocity is the conference. Surge is another one, probably even better, more hardcore-engineering focused. From my perspective, there's a couple of things that could be improved:

  • while I understand that sponsors are what makes conferences like Velocity possible, some sponsors took too much time out of the actual talk tracks. One or two talks were very promotional in nature, and it was clear to everyone that these companies were pushing their products or themselves. Maybe it wasn't their intention, but to me and to others I talked to, it came out that way.
    I think Velocity needs to screen this type of talk better and separate it from the authentic content that people want, the "stories from the trenches". As a counter-example, Google, among other companies, were doing sponsoring (and recruiting!) activities in a separate hall. That worked very well for everyone. Please let's keep it that way.
  • the on-site technical team wasn't fully prepared to handle presentations made with Open Office. That is not acceptable if you ask me, even if the majority of speakers have a Mac. It's 2011 (2012 now even), so you really need to be prepared to read OpenOffice files. I realize that wasn't Velocity organizers' fault, but I think it's something to consider for next time.

That said, I'm really really happy about my experience at Velocity Europe, both as a speaker and as attendee. It was really awesome, and worth every moment I spent working to prepare for it. Thank you O'Reilly, and I hope to be able to participate again some day :)

How to detect the Debian version of a server without logging in

As Ops team, we're slowly taking over operations for several other teams here at Opera. One of our first tasks is to:

The first idea to check whether a server is Debian Lenny or Squeeze was to log in and cat /etc/debian_version. However, if you haven't accessed that machine before, and your ssh keys are not there, you can't do that. In our case, we have to file a request for it, and it can take time. Wondering if there was a quicker way, I came up with this trick:

#!/bin/sh
#
# Tells the Debian version reading the OpenSSH banner
# Requires OpenSSH to be running and ssh port to be open.
#
# Usage: $0 <hostname>
#
# Cosimo, 23/11/2011

HOST=$1

if [ "x$HOST" = "x" ]; then
    echo "Usage: $0 <hostname>"
    exit 1
fi

OPENSSH_BANNER=$(echo "n" | nc ${HOST} 22 | head -1)

#echo "OPENSSH_BANNER=$OPENSSH_BANNER"

IS_SQUEEZE=$(echo $OPENSSH_BANNER | egrep '^SSH-.*OpenSSH_5.*Debian-6')
IS_LENNY=$(echo $OPENSSH_BANNER   | egrep '^SSH-.*OpenSSH_5.*Debian-5')
IS_ETCH=$(echo $OPENSSH_BANNER    | egrep '^SSH-.*OpenSSH_4.*Debian-9')

# SSH-2.0-OpenSSH_5.1p1 Debian-5
# SSH-2.0-OpenSSH_4.3p2 Debian-9etch3
# SSH-2.0-OpenSSH_5.5p1 Debian-6+squeeze1

#echo "Squeeze: $IS_SQUEEZE"
#echo "Lenny: $IS_LENNY"
#echo "Etch: $IS_ETCH"

if [ "x$IS_SQUEEZE" != "x" ]; then
    echo "$HOST is Debian 6.x (squeeze)"
    exit 0
fi

if [ "x$IS_LENNY" != "x" ]; then
    echo "$HOST is Debian 5.x (lenny)"
    exit 0
fi

if [ "x$IS_ETCH" != "x" ]; then
    echo "$HOST is Debian 4.x (etch)"
    exit 0
fi

echo "I don't know what $HOST is."
echo "Here's the openssh banner: '$OPENSSH_BANNER'"

exit 1

It reads the OpenSSH server banner to determine the major Debian version (Etch, Lenny, Squeeze). It's really fast, it's very simple and hopefully reliable too. Enjoy. Download from https://gist.github.com/1389206/.
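The matching step can also be pulled out into a function that takes the banner as a plain string, which makes it possible to try it without any network access. This is a sketch using shell glob patterns in place of the egrep calls; the function name debian_release_from_banner is made up for the example and is not part of the gist. The three sample banners are the ones quoted in the script's comments.

```shell
# Sketch: classify an OpenSSH banner string without touching the network.
# debian_release_from_banner is an illustrative name, not part of the gist.
debian_release_from_banner() {
    case "$1" in
        SSH-*OpenSSH_5*Debian-6*) echo "squeeze" ;;
        SSH-*OpenSSH_5*Debian-5*) echo "lenny" ;;
        SSH-*OpenSSH_4*Debian-9*) echo "etch" ;;
        *) echo "unknown"; return 1 ;;
    esac
}

# Sample banners taken from the comments in the script above.
debian_release_from_banner "SSH-2.0-OpenSSH_5.5p1 Debian-6+squeeze1"   # squeeze
debian_release_from_banner "SSH-2.0-OpenSSH_5.1p1 Debian-5"            # lenny
debian_release_from_banner "SSH-2.0-OpenSSH_4.3p2 Debian-9etch3"       # etch
```

Any banner that matches none of the patterns prints "unknown" and returns a non-zero status, mirroring the exit 1 at the end of the script.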

Migration of VCL configuration from Varnish 2.0 to 2.1

Recently we migrated most of our services from Varnish 2.0 to 2.1.
I'd like to explain what we changed with code (VCL) examples side by side,
in case anyone still needs to migrate to 2.1 and needs some help as well :-)

req., bereq., beresp., and obj.

Usually this naming difference in VCL is not really explained. They say "x has been renamed to y"
and you should change the name. That's kind of annoying. In reality, yes, the names changed, and at first
it is annoying, but trying to understand why they changed allows them to stick
in your mind very easily.

In vcl_fetch(), obj. is now beresp.. Why?
Because vcl_fetch() is the part of the request stage where Varnish has
already performed a request against a backend and got a response from it. That means that
if you refer to obj. in vcl_fetch(), it really means that you're
touching the backend response, hence beresp..

Similarly in vcl_pipe(), that is executed when the result of vcl_recv()
is to switch to pipe mode. In that case, however, Varnish hasn't made the request
to the backend yet, so if you used obj. in vcl_pipe() you really meant
to change the request that was going to be made to the backend, hence bereq..

Let's see the changes we had to make:

 sub vcl_fetch {
 
-    set obj.ttl = 88s;
-    set obj.grace = 10m;
-    set obj.http.X-My-Opera = "http://youtube.com/watch?v=br79xGSpgF4";
+    set beresp.ttl = 88s;
+    set beresp.grace = 60m;
+    set beresp.http.X-dramatic = "http://www.youtube.com/watch?v=a1Y73sPHKxw";
 
 }

And:

 sub vcl_pipe {
     # Streaming files too (see related vcl_recv() rule).
     # We need to close the request, or varnish remains in pipe
     # mode for the entire session with that client.
-    set req.http.connection = "close";
+    set bereq.http.connection = "close";
 }

Backend probes and .initial

Backend probing allows Varnish to detect that backends are either "healthy"
or "sick". The probe VCL config block lets you tweak how this should work. In
particular, .threshold is the number of successful probes that are necessary
for Varnish to consider a backend healthy. .interval is the number of seconds
between one probe and the following one.

As an example, you can define that a backend should be considered working
(healthy) when it answers successfully to at least 3 probes, with an interval of
10 seconds between each probe. In Varnish 2.0.4, this means that if restarted,
Varnish will wait 3 times 10 = 30 seconds before serving any requests
from that backend, because all backends were considered dead (sick) at startup.

In 2.1 this limitation is removed by introducing an .initial attribute
in the probe block. .initial is the number of probes considered successful
when the service is started, or the backend is added, and there's no information about it.
The default value is assumed to be equal to .threshold, so backends are considered
healthy as soon as they are introduced.
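The warm-up arithmetic above can be sketched as a tiny shell function. It is a simplified model that follows the reasoning in this section (a backend needs .threshold good probes, and .initial of them are credited up front); real Varnish timing also depends on when the first probe fires. The function name probe_warmup_seconds is made up for the example.

```shell
# Sketch: seconds before a freshly added backend can be considered healthy,
# under the simplified model described above.
# Arguments: threshold, interval in seconds, initial (defaults to 0).
probe_warmup_seconds() {
    threshold=$1 interval=$2 initial=${3:-0}
    needed=$((threshold - initial))
    [ "$needed" -lt 0 ] && needed=0
    echo $((needed * interval))
}

probe_warmup_seconds 3 10      # 2.0.4 example from the text: prints 30
probe_warmup_seconds 3 10 3    # 2.1 with .initial = 3: prints 0
```

With .threshold = 3 and .interval = 5s the same function gives 15, which is the startup delay that led us to disable the probe under 2.0.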

I think you can understand from these tiny details how well Varnish is engineered.
This just makes sense, doesn't it? :-) Here's the diff from 2.0 to 2.1:

 backend nginx {
     .host = "localhost";
     .port = "8080";
-
-    # Disabled to avoid the 15s startup
-    # 2.0.4-5 doesn't have .initial
-    #
-    #.probe = {
-    #  .url = "/ping.html";
-    #    .interval = 5s;
-    #    .timeout = 1s;
-    #    .window = 5;
-    #    .threshold = 3;
-    #}
+    .probe = {
+        .url = "/ping.html";
+        .interval = 10s;
+        .timeout = 2s;
+        .window = 10;
+        .threshold = 3;
+        .initial = 3;
+    }
 }

And in vcl_recv():

 sub vcl_recv {

 [...]

-    #----------
-    # DISABLED: Only enable when .probe block above is enabled
-    #----------
     # Detect broken backends and keep serving requests if possible
-    #if (! req.backend.healthy) {
-    #    set req.grace = 10m;
-    #} else {
-    #    set req.grace = 5s;
-    #}
+    if (! req.backend.healthy) {
+        set req.grace = 60m;
+    } else {
+        set req.grace = 5s;
+    }

Regular expression matching

Another "big" difference is the use in 2.1 of a Perl-compatible regular expression engine,
(PCRE) instead of the POSIX-style regex matching that used to be in 2.0.
This is a good change for me, as I'm pretty much used to Perl regex and I know next to nothing
about POSIX.

This change actually created a subtle problem that I caught only through thorough testing
of our configurations. We use regex matching in a few places in our VCL configuration,
usually to analyze cookies
and set special "flags" that are then used to force
a HTTP Vary header, to make Varnish store different cached versions of the same
URL.

One of these cases is the language cookie, where we store a sticky
user preference about site language. Here's how the code changed:

  # STD: Sticky language cookie
  if (req.http.Cookie ~ "language=") {
      set req.http.X-Language =
-         regsub(req.http.Cookie, "^.*?language=([^;]*?);*.*$", "\1");
+         regsub(req.http.Cookie, "^.*?language=([^;]*);*.*$", "\1");
  }

  ...

  # Mobile view cookie
  if (req.http.Cookie ~ "mobile=") {
-     set req.http.X-Mobile = 
-         regsub(req.http.Cookie, "^.*?mobile=([^;]*?);*.*$", "\1");
+     set req.http.X-Mobile =
+         regsub(req.http.Cookie, "^.*?mobile=([^;]*);*.*$", "\1");
  }

In case you find it difficult to spot the change, it's the removal of the *?
(non-greedy star) operator. Non-greedy matching was used in 2.0, POSIX matching, to make
sure that the * didn't match too many characters, and thus eat part of other cookies. Except
POSIX regex matching does NOT have a non-greedy star operator. I just
didn't know that, and it's of course a bug, but it had worked perfectly so far… WTF???

For even more weirdness, why did I take the non-greedy star (*?) away now that it should
be supported with PCRE-matching? I removed it because otherwise the result of those
regsub() expressions is always empty!

Believe it or not, it looks exactly like 2.0 had PCRE and 2.1 has POSIX, which is
obviously not what's happening. If you know more about this and you can shed some light,
please contact me or leave a comment below.

Hope you liked this 2.0 -> 2.1 migration journey. I'm looking forward to 2.1 -> 3.0!
It's a bit more work there, because I will need to migrate
my accept-language C extension
to the new vmod system, which I already started working on :-)

Have fun!

Surge 2010 scalability conference in Baltimore, USA – DAY 2

This is a summary of day 2 of the Surge conference that took place in Baltimore, USA, 30th of September and 1st of October 2010. For a quite comprehensive blog post about day 1, you can read my previous post.

Here comes the list of talks I attended during Day 2.

Bryan Cantrill – failures in commodity hardware

What happens when commodity hardware is used in an "enterprise" hardware project? Bryan guided the audience through this industrial hw project. There was no recorded video of this talk, due to the content being potentially "sensitive". Very interesting talk, and Bryan is IMO a very good speaker.

Benjamin Black – FastIP

Benjamin presented a way, new to me, to analyze the metrics of a network, named "Flow". Flow-based network metrics can represent network activity in a way that is completely different and much more accurate than what's usually done by operations and sysadmin departments. The downside is that it generates a lot of data. The advantage is that you can analyze and even replay? any traffic that took place between any two nodes of the network. I'm sure I didn't understand correctly because this would be amazing.

There are products out there that offer flow-based network analysis: Cisco Netflow, Ntop NProbe, etc… There's also an IETF working group about flow. We couldn't see any example/demo because there was a problem with the slides, IIRC.

FastIP also offers a related service. I contacted Benjamin about this after his talk. Maybe we'll be able to try something out or at least have a demonstration.

My TODO list:

Gavin Roy – Scaling MyYearBook.com

One of the most interesting talks in this conference IMO. MyYearbook is a Postgres shop, among the top 25 trafficked sites in the USA.

Gavin talked about many things they did to scale their site as the traffic was growing. Here's some of the things I remember:

  • DB connection pooling very important for them. Made a world of difference. They use PgBouncer and pgPool2
  • DB Horizontal scaling with PL/Proxy. TODO: look it up
  • DB Replication w/ Londiste, Slony, Bucardo
  • Postgresql 9.0 based standby to increase read-only capacity, and for hot-standby.
  • Partitioned the database by table, feature available since Pg 8.1

They have a primary-to-secondary master failover procedure. They looked into automating it, but a technical judgement call is really necessary in case something goes wrong, so they will keep it manual. This was a question I asked Gavin, since we've thought about automating our failover procedure for MySQL, but it's not so easy to just decide when to trigger the failover…

For user storage, they use Isilon IQ Series, apparently a FreeBSD appliance with on-board NFS. For DB servers, they looked at different solutions, but they keep coming back to direct attached storage. Their main DB server is a massively powerful machine: IIRC, 512 GB of RAM and 128 cores. I have to double check this because it seems really impressive.

John Allspaw – Go or No Go

Another great talk by John, well presented and with great content. Not easy to summarize. The main topic was the "Go or no-go meeting", a 10-minute get-together of all involved parties before releasing changes or launching any new feature live.

This meeting basically consists of Yes/No questions:

  • Have you tested enough to deploy? QA still needed?
  • Has the feature been communicated (blog/forum/…)?
  • Does everyone know: when it will go live? who will push the feature?
  • Has the feature been in production for staff (or beta users)? That can be tricky to implement if the new feature implies social interactions (beta user tagging non-beta user)
  • Is it possible to dark launch this feature? Will we?
  • Is it possible to turn on this feature on a % of users? Will we?
  • Does it involve new infrastructure? If so: is there monitoring in place? (BLOCKER)
  • On/Off switch in the code/config is in place? Is it documented?
  • Are all the relevant people available for communication and launch?
  • Is there a place for users to provide feedback about the feature?
  • Post-launch "it's all done" time agreed?
  • Contingency checklist done and everyone reviewed it? (BLOCKER)

The "Contingency list" should answer the question: "What could possibly go wrong? What will we do about it?", with a list of potential issues and how to solve them in case shit hits the fan.

Apart from the Go/No-go meeting, which would be, also according to my past experience, a great way to avoid problems, there's at least a couple more really nice things to keep in mind when developing or launching a new feature:

  • "Dark launches": a dark launch is essentially a full launch of the new feature, but in such a way that is invisible to users. So if you're making db queries and processing stuff, you keep doing all that, you just throw the data away. You will be able to realize the (almost) full impact of the new feature on your application and compensate accordingly.
  • Feature "sampling" (% of users): you just enable the full feature for a small, and then growing, percentage of your user base. You can gradually grow to 100% and test the effect of the changes.
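
Both ideas can be sketched in a few lines of Python. This is only a minimal illustration, not what Etsy or anyone else actually runs: the names `rollout_bucket`, `is_enabled` and `render_page` are made up for this example, and hashing the user id into 100 buckets is just one assumed way of picking "a % of users" that stays stable between requests.

```python
import hashlib


def rollout_bucket(user_id: str, feature: str) -> int:
    """Map a user to a stable bucket in the range 0..99 for a given feature."""
    digest = hashlib.sha256(f"{feature}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def is_enabled(user_id: str, feature: str, percent: int) -> bool:
    """Feature sampling: the feature is on for roughly `percent`% of users."""
    return rollout_bucket(user_id, feature) < percent


def render_page(user_id: str, dark_launch: bool, new_feature) -> str:
    """Dark launch: run the new code path for real, then throw the result away."""
    if dark_launch:
        try:
            new_feature(user_id)  # real queries and processing happen here
        except Exception:
            pass  # a dark-launched feature must never break the visible page
    return "old page"  # users keep seeing the existing page
```

Because a user's bucket never changes, raising `percent` from 5 to 20 keeps the feature on for everyone who already had it and only adds new users, which is what makes the gradual growth to 100% predictable.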

Great stuff.

Neil Gunther – Quantifying scalability

Here I was a bit too excited, due to my talk coming next, so unfortunately I didn't pay too much attention. It was a full analysis of scalability seen as a mathematical function: how the capacity of your system changes as the load increases.

Cosimo Streppone – Scaling challenges of my.opera.com

I think

I used 5 minutes to show a live demo of the My Opera realtime monitor application that we built and afterwards I got very interesting questions, and also some nice twitter messages about it.

I also talked about how we've experimented in distributing requests across the different datacenters with our little geodns tool.

All in all, for me it was a fantastic experience. Practice will make me better, so I look forward to a next time :-)

Baron Schwartz – Scaling without sharding

Baron works for Percona. I had read some talks of his. I think he's a really good speaker. He explained in detail the scenarios that arise when dealing with database scaling, the typical characteristics of reads and writes, and single-server vs. multi-server deployments.

Basically what the talk tries to suggest is that very few situations require sharding your database. Single server setups can go very far, by optimizing the way the db works. Quote: "Sharding should be your last resort". Sharding only becomes necessary when write demand exceeds write capacity, so avoid sharding if you can: try to buffer/collate writes, defer update work, etc.
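
To make "buffer/collate writes" a bit more concrete, here is a minimal Python sketch. It is my own illustration, not code from Baron's talk: the `CounterWriteBuffer` class and the `flush_fn` callback (which stands in for the real database write) are hypothetical names. The idea is that many small counter updates to the same row get merged in memory and written out as one batch.

```python
from collections import defaultdict


class CounterWriteBuffer:
    """Collate counter increments in memory and write them out in batches."""

    def __init__(self, flush_fn, max_pending=1000):
        self.flush_fn = flush_fn        # hypothetical: performs the real DB write
        self.max_pending = max_pending  # flush once this many increments pile up
        self.pending = defaultdict(int)
        self.count = 0

    def increment(self, key, amount=1):
        """Record an increment; increments to the same key are merged."""
        self.pending[key] += amount
        self.count += 1
        if self.count >= self.max_pending:
            self.flush()

    def flush(self):
        """Hand the collated batch to flush_fn and start a fresh buffer."""
        if not self.pending:
            return
        batch = dict(self.pending)
        self.pending.clear()
        self.count = 0
        self.flush_fn(batch)
```

With something like this, a thousand page-view increments spread over a handful of rows turn into a handful of writes. The trade-off is that whatever is still in the buffer is lost if the process dies, so it only suits data where that is acceptable.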

Closing day 2

Theo Schlossnagle closed the conference with a plenary keynote about a semi-serious "brief history of computing". Much fun, and a goodbye until next year's Surge.

For a glimpse of what happened live at the conference, you can also check out the Twitter stream for #surgecon.

Definitely a great conference. Stay tuned for videos and slides on the official site, http://omniti.com/surge/2010.

Surge 2010 scalability conference in Baltimore, USA – DAY 1

This was the first year the Surge conference took place, in Baltimore, USA. OmniTI is the company that organized it.

30" summary (TL;DR)

The conference was amazing. Main topic was scalability. Met a lot of people. 2 days, 2 tracks and 20+ speakers. Several interesting new products and technologies to evaluate.

The long story

The conference topics were scalability, databases and web operations. It took place over two days filled with high-level talks about experiences, failures, and advice on scaling web sites.

The only downside is that I had to miss half of the talks, being alone :). The good thing is that all videos and slides will be up on the conference website Soon™.

Lots of things to be mentioned but I'll try to summarize what happened in Day 1.

John Allspaw – Web Engineering

First keynote session by John Allspaw, former Flickr dev, now Etsy.com.

Summary: Web engineering (aka Web Operations) is still a young field. We must set out to achieve much higher goals, be more scientific. We don't need to invent anything. We should be able to get inspiration and prior art from other fields like aerospace, civil engineering, etc…

He had lots of examples in his slides. I want to go through this talk again. Really inspiring.

Theo Schlossnagle – Scalable Design Patterns

Theo's message was clear. Tools can work no matter what technology. Bend technologies to your needs. You don't need the shiniest/awesomest/webscalest. Monitoring is key. Tie metrics to your business. Be relevant to your business people.

Ronald Bradford – Most common MySQL scalability mistakes

If you're starting with MySQL, or don't have too much experience, then you definitely want to listen to Ronald's talk. Will save you a few years of frustration. :)

Companion website, monitoring-mysql.com.

Ruslan Belkin – Scaling LinkedIn

Ruslan is very prepared and technical, but maybe I expected a slightly different type of content. I must read the slides again when they're up. LinkedIn is mostly ("99%") Java and uses Lucene as its main search tier. Very interesting: they mentioned that since 2005-2006, they have been using several specific services (friends, groups, profiles, etc…) instead of one big database. This allows them to scale better and more predictably.

They also seem to use a really vast array of different technologies, like Voldemort, and many others whose names I don't remember right now.

Robert Treat – Database scalability patterns

Robert is without doubt a very experienced DBA. He talked about all the different types of MySQL configurations available to developers in need of scaling their apps, explaining them and providing examples: horizontal/vertical partitioning, h/v scaling, etc…

I was late for this talk so I only got the final part.

Tom Cook – A day in the life of Facebook operations

I listened to the first 10-15 minutes of this talk, and I had the impression that this was probably the third time I had listened to the same talk, the one that tells us how big Facebook is, upload numbers, status updates, etc… without going into specific details. This of course is very impressive, but it's the low-level stuff that's more interesting, at least for me.

The last time I attended this talk was in Brussels for Fosdem. I was a bit disappointed so I left early. According to some later tweets, the last part was the most interesting. I have to go back to this one and watch the video. Well… at least I got to listen to the last part of…

Artur Bergman – Scaling Wikia

Lots of Varnish knowledge (and more) in this talk!

I had read some earlier talks by Artur, always about Varnish, and I have learnt a lot from him.
I strongly suggest going through his talks if you're interested in Varnish.

They "abused" the Urchin tracker (Google Analytics) javascript code to measure their own statistics about server errors and client-side page loading times. Another cool trick is the use of a custom made-up X-Vary-URL HTTP header to keep all linked URLs (view/edit/etc. regarding a single wiki page) in one varnish hash slot. This way, with a single purge command you can get rid of all relevant pages linked to the same content.
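
The grouping idea behind that header can be shown with a toy model in Python. To be clear, this is not Varnish code or VCL, and I don't know how Wikia actually wired the header into their configuration: the `GroupedCache` class and its `group_key` argument (playing the role of the X-Vary-URL value) are invented for this sketch.

```python
class GroupedCache:
    """Toy cache: every URL is stored under a group key and purged as a unit."""

    def __init__(self):
        self.slots = {}  # group key -> {url: cached body}

    def store(self, url, body, group_key):
        """Cache a response under the group it belongs to."""
        self.slots.setdefault(group_key, {})[url] = body

    def lookup(self, url, group_key):
        """Return the cached body, or None on a miss."""
        return self.slots.get(group_key, {}).get(url)

    def purge(self, group_key):
        """Drop every variant in the group; return how many were removed."""
        return len(self.slots.pop(group_key, {}))
```

If the view, edit and history URLs of one wiki page are all stored with the same group key, a single `purge()` call with that key removes all of them at once, while pages in other groups stay cached.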

They use SSDs extensively. A typical Wikia server (Varnish and/or DB) has got:

  • 2 x 6-core Westmere processors
  • 6 x Intel X25 SSD (~ $2000)
  • 2 x spinning drives for transaction logs (db)

"SSD allows you JOINs with no performance degradation."

Peak speeds reached (this is random not sequential: amazing!):

  • 500 Mbyte/s random read with avg latency of 0.2 ms
  • 220 Mbyte/s random writes

They use their own CDN based on Dynect (I think a Dyn Inc. service, see below).
Still using Akamai for a minor part of their static content.

Wikia is looking into using Riak, and a Riak-based filesystem to hook up directly to Varnish for really fast file serving.

Mike Malone – SimpleGeo

SimpleGeo implemented a geographic database on top of Apache Cassandra, able to answer spatial queries. They looked into using PostGIS (a Postgres-based GIS DB, a very common product), but it wasn't as flexible as they needed (I don't remember exactly why).

TODO: look into "Distributed indexes over-DHT". He indicated it as prior art for their system.
This talk was a bit complicated for me to follow, so I'll have to watch it again.

Closing day 1

At the end of the day, there was a SQL vs NoSQL panel, which I skipped entirely. Maybe it was interesting :) The after-hours event that closed day 1 was organized by Dyn Inc. It was fantastic. Lots of good beer, martinis, and good food. I went to bed early, since I was still jetlagged. Day 2 started at 9 AM.

Time for a break :)

And then on to Day 2:

http://my.opera.com/cstrep/blog/2010/10/07/surge-2010-scalability-conference-in-baltimore-usa-day-2