
Migration of VCL configuration from Varnish 2.0 to 2.1

Recently we migrated most of our services from Varnish 2.0 to 2.1.
I'd like to explain what we changed, with side by side code (VCL) examples,
in case anyone else still needs to migrate to 2.1 and could use some help :-)

req., bereq., beresp., and obj.

Usually this naming difference in VCL is not really explained. You're just told that "x has been renamed to y"
and that you should change the name, which is kind of annoying. In reality, yes, the names changed, and at first
that is annoying, but trying to understand why they changed makes them stick
in your mind very easily.

In vcl_fetch(), obj. is now beresp.. Why?
Because vcl_fetch() is the part of the request stage where Varnish has
already performed a request against a backend and got a response from it. That means that
if you refer to obj. in vcl_fetch(), you're really
touching the backend response, hence beresp..

Similarly with vcl_pipe(), which is executed when the result of vcl_recv()
is to switch to pipe mode. In that case, however, Varnish hasn't made the request
to the backend yet, so if you used obj. in vcl_pipe() you really meant
to change the request that was going to be made to the backend, hence bereq..

Let's see the changes we had to make:

 sub vcl_fetch {
 
-    set obj.ttl = 88s;
-    set obj.grace = 10m;
-    set obj.http.X-My-Opera = "http://youtube.com/watch?v=br79xGSpgF4";
+    set beresp.ttl = 88s;
+    set beresp.grace = 60m;
+    set beresp.http.X-dramatic = "http://www.youtube.com/watch?v=a1Y73sPHKxw";
 
 }

And:

 sub vcl_pipe {
     # Streaming files too (see related vcl_recv() rule).
     # We need to close the request, or varnish remains in pipe
     # mode for the entire session with that client.
-    set req.http.connection = "close";
+    set bereq.http.connection = "close";
 }

Backend probes and .initial

Backend probing allows Varnish to detect that backends are either "healthy"
or "sick". The probe VCL config block allows to tweak how this should work. In
particular, .threshold is the number of successful probes that are necessary
for Varnish to consider a backend healthy. .interval is the number of seconds
between one probe and the following one.

As an example, you can define that a backend should be considered working
(healthy) when it answers at least 3 probes successfully, with an interval of
10 seconds between probes. In Varnish 2.0.4, this means that after a restart
Varnish will wait 3 times 10 = 30 seconds before serving any requests
from that backend, because all backends are considered dead (sick) at startup.

In 2.1 this limitation is removed by introducing an .initial attribute
in the probe block. .initial is the number of probes considered successful
when the service is started or the backend is added and there's no information about it yet.
The default value is assumed to be equal to .threshold, so backends are considered
healthy as soon as they are introduced.

I think you can understand from these tiny details how well Varnish is engineered.
This just makes sense, doesn't it? :-) Here's the diff from 2.0 to 2.1:

 backend nginx {
     .host = "localhost";
     .port = "8080";
-
-    # Disabled to avoid the 15s startup
-    # 2.0.4-5 doesn't have .initial
-    #
-    #.probe = {
-    #  .url = "/ping.html";
-    #    .interval = 5s;
-    #    .timeout = 1s;
-    #    .window = 5;
-    #    .threshold = 3;
-    #}
+    .probe = {
+        .url = "/ping.html";
+        .interval = 10s;
+        .timeout = 2s;
+        .window = 10;
+        .threshold = 3;
+        .initial = 3;
+    }
 }

And in vcl_recv():

 sub vcl_recv {

 [...]

-    #----------
-    # DISABLED: Only enable when .probe block above is enabled
-    #----------
     # Detect broken backends and keep serving requests if possible
-    #if (! req.backend.healthy) {
-    #    set req.grace = 10m;
-    #} else {
-    #    set req.grace = 5s;
-    #}
+    if (! req.backend.healthy) {
+        set req.grace = 60m;
+    } else {
+        set req.grace = 5s;
+    }

Regular expression matching

Another "big" difference is the use in 2.1 of a Perl-compatible regular expression engine,
(PCRE) instead of the POSIX-style regex matching that used to be in 2.0.
This is a good change for me, as I'm pretty much used to Perl regex and I know next to nothing
about POSIX.

This change actually created a subtle problem that I caught only with thorough testing
of our configurations. We use regex matching in a few places in our VCL configuration,
usually to analyze cookies and set special "flags" that are then used to force
an HTTP Vary header, to make Varnish store different cached versions of the same
URL.
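
For context, the Vary part of that trick looks more or less like this. It's a simplified
sketch, not our actual production VCL:

sub vcl_fetch {
    # If vcl_recv() set one of our flag headers, keep a separate
    # cached copy of this URL per value of that header.
    # (A real config would probably append to an existing Vary
    # header instead of overwriting it.)
    if (req.http.X-Language) {
        set beresp.http.Vary = "X-Language";
    }
}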

One of these cases is the language cookie, where we store a sticky
user preference about site language. Here's how the code changed:

  # STD: Sticky language cookie
  if (req.http.Cookie ~ "language=") {
      set req.http.X-Language =
-         regsub(req.http.Cookie, "^.*?language=([^;]*?);*.*$", "\1");
+         regsub(req.http.Cookie, "^.*?language=([^;]*);*.*$", "\1");
  }

  ...

  # Mobile view cookie
  if (req.http.Cookie ~ "mobile=") {
-     set req.http.X-Mobile =
-         regsub(req.http.Cookie, "^.*?mobile=([^;]*?);*.*$", "\1");
+     set req.http.X-Mobile =
+         regsub(req.http.Cookie, "^.*?mobile=([^;]*);*.*$", "\1");
  }

In case you find it difficult to spot the change, it's the removal of the *?
(non-greedy star) operator. Non-greedy matching was used in the 2.0 (POSIX) configuration to make
sure that the * didn't match too many characters and thus eat part of other cookies. Except
POSIX regex matching does NOT have a non-greedy star operator. I just
didn't know that, and it's of course a bug, but it had worked perfectly so far… WTF???

For even more weirdness, why did I take the non-greedy star (*?) away now that it should
be supported with PCRE matching? I removed it because otherwise the results of those
regsub() expressions are always empty!

Believe it or not, it looks exactly as if 2.0 had PCRE and 2.1 had POSIX, which is
obviously not what's happening. If you know more about this and can shed some light,
please contact me or leave a comment below.
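
For what it's worth, here's a tiny standalone Perl sketch (mine, not part of our VCL)
that reproduces the empty-capture behaviour with PCRE-style lazy quantifiers: the lazy
([^;]*?) group is allowed to match nothing at all, because the rest of the pattern can
still succeed, so the capture comes out empty.

#!/usr/bin/env perl
use strict;
use warnings;

my $cookie = "foo=bar; language=en; mobile=1";

# Lazy capture: ([^;]*?) matches as little as possible, and
# ";*.*$" still matches the rest, so $1 ends up empty
(my $lazy = $cookie) =~ s/^.*?language=([^;]*?);*.*$/$1/;

# Greedy capture: ([^;]*) grabs everything up to the next ";"
(my $greedy = $cookie) =~ s/^.*?language=([^;]*);*.*$/$1/;

print "lazy:   '$lazy'\n";    # lazy:   ''
print "greedy: '$greedy'\n";  # greedy: 'en'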

Hope you liked this 2.0 -> 2.1 migration journey. I'm looking forward to 2.1 -> 3.0!
It's a bit more work there, because I will need to migrate my
accept-language C extension
to the new vmod system, which I have already started working on :-)

Have fun!

Puppet external nodes classifier script in Perl

The upgrade to puppet 2.6.2 worked out fine. Coming from 0.24.5, I noticed a really welcome speed improvement. However, I had a tricky problem.

While upgrading to 2.6, I also decided to switch to an external nodes classifier script. If you don't know about it, it's a nice puppet feature that I had planned to use since the start. It allows you to write a small script, in any language you want, that basically tells the puppetmaster, given a hostname, what you want that machine to be.
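
Hooking the script up to the puppetmaster is just a couple of settings, roughly like this (the script path is only an example):

# /etc/puppet/puppet.conf on the puppetmaster
[master]
    node_terminus  = exec
    external_nodes = /etc/puppet/node_classifier.pl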

Puppet calls your script with a single argument: the hostname of the machine that is asking for its catalog of resources. Your script then has to output something like the following:

---
classes:
  - perl
  - apache
  - ntp
  - munin
environment: production
parameters:
  puppet_server: my.puppetmaster.com

You can specify all the "classes" of that machine, so basically what you want puppet to install (or repair) on it. So far so good. My classifier script looks into some preset directories for project-specific JSON files, and then checks whether any of these JSON files contain the name of the machine puppet is asking for. Code follows:

#!/usr/bin/env perl
#
# http://docs.puppetlabs.com/guides/external_nodes.html
#

use strict;
use warnings;
use File::Basename ();
use File::Slurp ();
use JSON::XS ();
use YAML ();

our $nodes_dir = '/etc/puppet/manifests/nodes';

our %default_params = (
    puppet_server => 'my.puppetmaster.com',
);

# ...
# A few very simple subs
# omitted for brevity
# ...

# The hostname puppet asks for
my $wanted = $ARGV[0];

# The data structure found in the JSON file
my $node_info = search_into_json_files_and_find($wanted);

my $puppet_classifier_info = {
    classes => $node_info->{puppet_classes},
    environment => 'production',
    parameters => \%default_params,
};

print YAML->Dump($puppet_classifier_info);

Now, I don't know if you can immediately spot the problem here, but I didn't, so I wasted a good part of an afternoon chasing a bug I didn't even know existed. The resulting YAML (puppet wants YAML) was this one:

--- YAML 
--- 
classes: 
    - geodns::production::backend 
environment: production 
name: z01-06-02 
parameters: 
    puppet_server: z01-06-02

The problem with this is that it looks innocent and valid, and in fact it is valid, but it's two YAML documents, not one. So puppet will parse the --- YAML line, since that alone is a complete YAML document, and ignore the rest.

And why is that happening in the first place? Because of the YAML->Dump() call I wrote, instead of the correct YAML::Dump()… Eh :) The arrow syntax is a method call, so the class name 'YAML' is passed to Dump() as an extra first argument, and Dump() emits every argument as a separate YAML document. So the correct code is:

print YAML::Dump($puppet_classifier_info);
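
A standalone snippet (not from the classifier itself) makes the difference easy to see:

#!/usr/bin/env perl
use strict;
use warnings;
use YAML ();

# Method call: Dump() receives ('YAML', $data) and emits one
# document per argument, so you get "--- YAML" plus your data
print YAML->Dump({ foo => 1 });

# Function call: Dump() receives only $data, so you get one document
print YAML::Dump({ foo => 1 });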

Never use YAML->Something()

Upgrade of puppet from 0.24.5 to 2.6.2: “Got nil value for content”

Something I should have probably done a while ago. Upgrading our puppet clients and master from 0.24.5 (current Debian stable) to 2.6.2 (current backports). I've been advised to go all the way to 2.6.4, that contains an important security fix. I'll probably do that soon.

So, the first mistake I made was to upgrade one client before the puppetmaster. In my mental model it's always ok to upgrade the server (the puppetmaster) first, but I remembered being kind of surprised when I learned that puppet wants the clients upgraded first, so that was the bit I thought I remembered well: first upgrade the clients, then the puppetmaster.

So I did, and that didn't work at all. The client couldn't retrieve the catalog. I discovered I had to update the puppetmaster first. Another very common thing with puppet: if something doesn't work, wipe out the SSL certificates. Almost every puppet user, maybe inexperienced like me, will tell you to remove the entire /var/lib/puppet/ssl directory and restart your client. Good. So I did.

After fiddling for a while with client- and server-side SSL certificates, removals, --waitforcert and puppetca invocations, I was finally able to connect the client to the puppetmaster and have them talk to each other correctly, downloading the manifests and applying the changes. However, one problem remained…
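
For reference, the dance looks roughly like this (hostnames are examples, not our real ones):

# On the client: wipe the certificates and request a new one
rm -rf /var/lib/puppet/ssl
puppetd --test --waitforcert 60 --server puppet.example.com

# On the puppetmaster: list and sign the client's new certificate request
puppetca --list
puppetca --sign client01.example.com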

Digression: custom facter plugins

I wrote a custom facter plugin that provides a datacenter fact. Just as an example, it works like this:

#
# Provide an additional 'datacenter' fact
# to use in generic modules to provide datacenter
# specific settings, such as resolv.conf
#
# Cosimo, 03/Aug/2010
#
# $Id: datacenter.rb 15240 2011-01-03 14:27:44Z cosimo $

Facter.add("datacenter") do
    setcode do

        datacenter = "unknown"

        # Get current ip address from Facter's own database
        ipaddr = Facter.value(:ipaddress)

        if ipaddr.match(/^12\.34\.56\./)
            datacenter = "dc1"
        elsif ipaddr.match(/^34\.56\.78\./)
            datacenter = "dc2"
        elsif ipaddr.match(/^56\.78\.12\./)
            datacenter = "dc3"
        # etc ...
        # etc ...
        end

        datacenter
    end
end

This allows for very cool things, like specific DNS resolver settings by datacenter:

# Data-center based settings
case $datacenter {
    "dc1" :  { include opera::datacenters::dc1 }
    "dc2" :  { include opera::datacenters::dc2 }
    "dc3" :  { include opera::datacenters::dc3 }
    # ...
    default: { include opera::datacenters::dc1 }
}

where each opera::datacenters::dcN class contains all the datacenter-dependent settings. In my case, just the resolver for now.

class opera::datacenters::dc1 {
    resolver::dns { "dc1-ns":
        domain => "opera.com",
        nameservers => [ "1.2.3.4", "5.6.7.8" ],
    }
}

resolver::dns is in turn a small define in a resolver module I wrote to generate resolv.conf contents from a small template. It's so small I can copy/paste it here:

# "resolver/manifests/init.pp"
define resolver::dns ($nameservers, $search="", $domain="") {
    file { "/etc/resolv.conf":
        ensure => "present",
        owner  => "root",
        group  => "root",
        mode   => 644,
        content => template("resolver/resolv.conf.erb"),
    }
}

# "resolver/templates/resolv.conf.erb"
# File managed by puppet <%= puppetversion %> on <%= fqdn %>
# Data center: <%= name %>
<% if domain != "" %>domain <%= domain %>
<% end %>
<% if search != "" %>search <%= search %>
<% end %>
<% nameservers.each do |ns| %>nameserver <%= ns %>
<% end %>

One problem remained…

Let's get back to the remaining problem. Every client run was spitting out this error:

info: Retrieving plugin
err: /File[/var/lib/puppet/lib]: Could not evaluate: Got nil value for content

After a lot of searching and reading through similar (but not quite the same) messages, I found this thread where one guy was having a very similar problem, but with a different file, /root/.gitconfig, which he had explicitly specified in his manifests.

My problem happened with /var/lib/puppet/lib, which is never specified in any of my manifests. But the symlink bit got me thinking. At some point, to make the custom facter plugin work, and having read about the differences between older and 2.6.x versions of puppet, I had put it into my module's lib/facter folder, but I had also created a symlink called plugins (required by the new puppet). Doing so, I thought, would prevent problems, as the new puppet could read the file from the new folder, while the older version could read it from "lib".

Well, it turns out that puppet will not read custom facter plugins from a plugins folder that is a symlink. Removing the symlink and making {module}/plugins/facter a real directory solved that problem too.
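
In practice the fix boiled down to something like this (the module path is just an example):

cd /etc/puppet/modules/mymodule
rm plugins                          # was a symlink: plugins -> lib
mkdir -p plugins/facter
cp lib/facter/datacenter.rb plugins/facter/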

I really hope to save someone else the time I spent on this. Cheers :)