Fast VCL checks for personalized backend responses

I’d like to talk about a problem I encountered a few years ago and one possible solution to it. This particular problem stuck with me for a long time for several reasons.
The first one is that at the time I considered the problem basically unsolvable. It would be like having a cake and eating it too, as the proverb goes. Another reason is that this problem had me spinning my wheels thinking about a solution for a good while.

Without any pretense of this being a particularly clever solution or anything like that, I’d like to illustrate what the general problem is and a possible solution I came up with. Hopefully this will be useful to you.

The general problem

Suppose you have a backend request of some sort, an API or a particular web page. In my case it was a json-based recommendations API, which returned a list of recommended news articles to read. The specific purpose of the request is not terribly important. What’s more important is the fact that this request can be personalized depending on the user that makes the request. I believe this is a quite common scenario.

In a recommendations context, it’s also common for a user not to be signed in to the service, or to be invoking the API for the first time. In this case, the recommendations engine does not have any previous information about the user, also called the cold start case.

In this specific project, we had operated in a “permanent cold start mode”, meaning the recommendations we were offering were never differentiated per user. There were a few knobs and settings to influence which type of recommendations one would get from the system (f.ex. less Sports articles and more Arts or Travel), but the system would not learn over time or change its recommendations based on user signals like articles read.

Among other things, this mode of operation allowed us to serve our entire userbase (around 90M monthly active users, around 10M weekly) with only two servers per data-center, also thanks to a very aggressive caching strategy.

When we started experimenting with personalized recommendations, it was immediately clear that we would not be able to handle the additional backend load caused by all the per-user requests. We estimated that, given the cache hit ratio drop, we would need something ridiculous like 50x the amount of servers. For each API request, we would have to:

  1. fetch the distinct user profile
  2. check if the profile contained any information about previously read articles or otherwise useful information to personalize the offered recommendations
  3. compute and return the personalized recommendations

These steps can only be performed by the recommendations engine backend. This implies that we would not be having any help from our caching in Varnish, which made personalized recommendations much harder to implement for us, at least without employing inordinate amounts of servers and having to significantly rebuild our system infrastructure.

You could very well say that that is a problem in itself, and it probably is :-)

A possible solution

I remember spending quite some time thinking about this, not seeing any possible solution. One day I attended a meetup. One of the engineers there talked about the Varnish API engine. The API Engine is a commercial Varnish add-on that can implement authentication and paywalls directly in the caching layer. The person talking about this mentioned how API engine embedded the SQLite3 database, and how this was crucial to the performance of it, since the caching layer is effectively the first bottleneck of a system.

I connected the dots almost immediately and I realized I had a possible way forward to solve my problem. This is how I imagined I could approach the problem:

  • organize user signals collection (what articles each user is reading, etc…) and user profile building as a completely separate batch activity
  • every x number of hours, build a sqlite database with a single table, user_profiles, consisting of two columns, a user_id string and a has_profile boolean. With such table in place, looking up whether we can build a significantly personalized recommendations set for a user is a only an SQL primary-key lookup away.
  • Using the excellent SQLite3 vmod, implement this SQL lookup in our existing Varnish VCL layer. Make sure that for every possible case this code never fails. For example, if the database file does not exist, or the file is for some reason corrupt, etc… we want to behave as if the particular user for the running request had no personalized profile.
  • Ensure that we would be able to update the SQLite database file at any time, without stopping Varnish, and the new file would be visible to the SQL queries immediately or at least after a short delay.

We tested the whole assembly and it seemed to work correctly. The final step consisted in actually computing the personalized profiles, building the real SQLite database, syncing it to the backend systems, and performing the dispatch logic in the VCL layer.

This is more or less the final logic I used:

  • If the request was for an anonymous user, don’t even perform the user profiles SQL lookup, and return the generic recommendations cached payload.
  • If the request comes from a user that has no personalized profile, that is, no record is present in the SQLite table, also return the generic recommendations payload.
  • If the user profiles lookup is positive, that is, a record exists in the user profiles table in SQLite and its has_profile flag is true, then pass the request on to the backend. We know it is a request that must be personalized and only the backend can do that.

Using such logic allows to serve the majority of your user base, which presumably has not logged in, or does not have any significant user profile yet, caching as much as possible. But it also allows personalized recommendations for all users that do have a profile.

We are shifting the critical decision as early in the chain as possible, that is, in your caching layer, either Varnish or similar, before the backend service is even consulted. Taking the decision to the backend service would not be feasible for the reasons already discussed.

The actual code

We used Puppet as configuration management tool back then, with a custom varnish module. I extended the existing manifest to add a new user_profiles.vcl file and to install by default the sqlite3 vmod for Varnish.

The existing VCL code was also modified to:

  • perform the personalized profile SQL query
  • decide whether to pass the request based on the result of the SQL query

The following code illustrates those two steps:

diff --git a/config.vcl b/config.vcl
index 8e25a8a..50c70ce 100644
--- a/config.vcl
+++ b/config.vcl
@@ -1,22 +1,23 @@
 # Recommender system VCL config

 include "/etc/varnish/accept-encoding.vcl";
 include "/etc/varnish/purge.vcl";
 include "/etc/varnish/x-forwarded-for.vcl";
 include "/etc/varnish/auth.vcl";
 include "/etc/varnish/stats.vcl";
+include "/etc/varnish/user_profiles.vcl";
 include "/etc/varnish/strip-tracking-cookies.vcl";

 backend apache {
     .host  = "127.0.0.1";
     .port  = "8000";
     .probe = {
         .url       = "/ping.html";
         .interval  = 10s;
         .timeout   = 5s;
         .window    = 20;
         .threshold = 3;
         .initial   = 3;
     }
 }
@@ -147,45 +148,49 @@ sub vcl_recv {
     if (req.backend.healthy && req.http.User-Agent ~ "McHammer") {
         return (pass);
     }

     # Client clicks must go through the backend (*with* client-id cookie)
     if (req.url ~ "^/api/1\.0/feedback/") {
         return (pass);
     }

     call check_authorization;
+    call check_user_profile;
     call accept_encoding_normalize;

+    # Users with tracking cookies can be served personalized results
+    if (req.http.X-Profile == "1") {
+        std.log("User has customized profile. Rolling the dice.");
+        # Initially keep the percentage of PASS very low, to test the
waters.
+        if (std.random(0, 100) < 1.0) {
+            std.log("User has customized profile and within 1.0%.
Passing.");
+            return (pass);
+        }
+    }

 }

The new user_profiles.vcl file consisted of the following code:

#-----------------------------------------------------------------------------
# Fast check for personalized user profiles
#-----------------------------------------------------------------------------
#
# The general idea is to use this fast check to send users who we know
# have a personalized user profile to the backend without caching, while
# retaining the ability to send cached objects for everyone else.
#
# Uses a SQLite3 database and libvmod-sqlite3 by Federico Schwindt:
# https://github.com/fgsch/libvmod-sqlite3
#
# Extracts the `clientId' from the HTTP Cookie header.
# Looks up the profile_id key having value equal to the `clientId' cookie.
# The underlying schema is very simple:
#
#   CREATE TABLE user_profiles (
#       profile_id char(100) PRIMARY KEY NOT NULL,
#       data text
#   );
#
# At least initially we will not use the data column.

import sqlite3;

sub vcl_init {
    sqlite3.open("/etc/varnish/user_profiles.db", "|;");
}

sub check_user_profile {

    # Quick yes/no test for the clientId cookie
    if (req.http.Cookie ~ "userId=") {

        # Extract a userId value from the Cookie header,
        # which remains untouched. Make sure we can still extract a clientId
        # value even if there's other cookies before/after ours.
        #
        # XXX Not sure what happens when client sends multiple Cookie lines.
        set req.http.X-Profile-Id = regsub(req.http.Cookie,
            "(?:^|.*;\s*)(?:userId=(.*?))\s*(?:;.*|$)", "\1");

        # No need to do anything if userId hasn't been found
        if (req.http.X-Profile-Id != "") {
            #std.log("Checking profile_id: " + req.http.X-Profile-Id);

            # First case of VCL-injection vulnerability :-)
            set req.http.X-Profile = sqlite3.exec(
                "SELECT 1 FROM user_profiles WHERE profile_id='"
                + req.http.X-Profile-Id
                + "'");

            # req.http.X-Profile !~ "^SQL" to catch errors like missing DB,
            # but seems a bit fragile. Depends on libsqlite3 and/or the vmod.
            if (req.http.X-Profile == "1") {
                std.log("User profile " + req.http.X-Profile-Id
                    + " found (" + req.http.X-Profile + ")");
            }
            else {
                std.log("User profile " + req.http.X-Profile-Id
                    + " not found");
            }
        }
    }
}

The commit message

I believe that good solutions deserve awesome commit messages. Here’s what I wrote:

Date:   Thu Jan 28 19:36:46 2016 +0100

    Fast VCL check for personalized profile existence

    How to have the cake and eat it too. Serve cached objects to the majority of
    users while personalizing recommendations to the ones that actually have a
    significant user profile available.

    Got the idea from the Varnish API engine[1].

    It's possible to perform tens of thousands of sqlite database lookups a second
    while processing requests in Varnish through VCL, thanks to SQLite3 being very
    lightweight and in this case embedded right inside Varnish through the sqlite3
    vmod[2].

    This commit hopefully adds all there is to it. The last bit is obviously the
    database file, which I placed in `/etc/varnish/user_profiles.db'. We will need
    to generate the .db file from the clicker and sync it to all frontends.

    Updates seem to be received immediately.

    When no database file is present, as will be in the initial deployment, the
    `check_user_profile()' function will work normally, signaling that no custom
    user profile has been found.

    [1] https://www.varnish-software.com/products/varnish-api-engine
    [2] https://github.com/fgsch/libvmod-sqlite3

How to rollout gradually?

Another interesting aspect is the way we could “control the flow” to this personalized recommendations API, that is, deciding what percentage of users that had personalized profiles, would actually get personalized recommendations.

A gradual rollout would certainly be the best approach, and it was implemented in two different ways:

  • once the SQL lookup was performed and the result was positive, we would still “roll the dice” and only allow 1% (or 5%, 10%) to actually pass through to the backend as personalized recommendations. This was an additional safety measure.
  • when batch building the SQLite database, we could decide to curtail the amount of users with personalized profiles. For example, excluding all users that had not read at least 5 or 10 articles. This barrier served two purposes. It effectively limited the amount of users that would be included in the SQLite database and at the same time made sure we had accumulated significant user profile information before attempting to serve personalized recommendations. A sort of win-win I didn’t expect at first :-)

As usual, if you have any feedback, email me or write below (but comments are subject to approval due to lots of spam).

Long-lived JVM applications memory usage tuning

A few days have passed since the last blog post about jvm memory usage monitoring tools, and I have learned so much about patterns of JVM memory usage and magic flags to use to influence it. I still can’t call myself an expert, but judging from the corpus of stackoverflow posts about jvm and memory, at least I’m not totally clueless. :-)

EDIT: this post has now been further extended and published on the newly published Kahoot engineering blog.

JVM memory usage monitoring tools

During the last few days I looked into JVM memory monitoring tools.

That is one area of expertise that I am definitely not familiar with. The general approach in my career so far has been to avoid Java for long running processes :-) The only other (empirical) pearl of wisdom I remember from past experiences with Java™ is that the max heap memory should be less or equal to the max system memory divided by 2 (maxHeap <= totalMem / 2). I’ve seen this relation work nicely over the years for a fairly busy Solr search cluster with anywhere from 24 to 96Gb memory servers, where maxHeap was never higher than 8Gb though, as those servers had a lot of other processes running at the same time. Obviously if other processes are running concurrently on the system, the max heap size should be decreased accordingly.

When observing servers, it’s typical to monitor the “usable” memory (in Datadog, that would be system.mem.usable) which is a sum of the system current free and cached or buffered memory. That is, the total memory that the system can grab and use at any time if needed.

This measure doesn’t necessarily tell us which processes are using memory and why. We’ve also observed spikes of memory usage when log aggregation tools like filebeat are reading, parsing and shipping logs to the logging servers. It would be useful to start tracking how much memory is used by the specific merlin java process rather than looking at an aggregated memory metric.

Searching around, I found a few useful articles (among them https://www.pushtechnology.com/support/kb/understanding-the-java-virtual-machine-heap-for-high-performance-applications/) and several tools that helped dig deeper and extract more information about what is happening on the gameservers with regards to memory usage. I’d like to mention them here for future reference and to collect eventual feedback from others.

jvmtop

​URL: github.com/patric-r/jvmtop

Simple console application that provides high level stats about heap usage and garbage collection CPU usage. Here’s an example from the documentation page:

JvmTop 0.8.0 alpha   amd64  8 cpus, Linux 2.6.32-27, load avg 0.12
 https://github.com/patric-r/jvmtop

  PID MAIN-CLASS      HPCUR HPMAX NHCUR NHMAX    CPU     GC    VM USERNAME   #T DL
 3370 rapperSimpleApp  165m  455m  109m  176m  0.12%  0.00% S6U37 web        21
11272 ver.resin.Resin [ERROR: Could not attach to VM]
27338 WatchdogManager   11m   28m   23m  130m  0.00%  0.00% S6U37 web        31
19187 m.jvmtop.JvmTop   20m 3544m   13m  130m  0.93%  0.47% S6U37 web        20
16733 artup.Bootstrap  159m  455m  166m  304m  0.12%  0.00% S6U37 web        46

where the various columns are:

PID = process id
MAIN-CLASS = the "jvm name" but often the entry point class (with used main() method)
HPCUR = currently used heap memory
HPMAX = maximum heap memory the jvm can allocate
NHCUR = currently used non-heap memory (e.g. PermGen)
NHMAX = maximum non-heap memory the jvm can allocate
CPU = CPU utilization
GC = percentage of time spent in garbage collection (~100% means that the process does garbage collection only)
VM = Shows JVM vendor, java version and release number (S6U37 = Sun JVM 6, Update 37)
USERNAME = Username which owns this jvm process
#T = Number of jvm threads
DL = If !D is shown if the jvm detected a thread deadlock

Useful to get a quick glance at a few critical parameters. I have tested the most recent version (0.9.0) and the compilation from the source code was quick and easy.

jvm-mon

​URL: github.com/ajermakovics/jvm-mon

Another console application, but a bit more sophisticated than jvmtop. It also displays trends as it’s meant to be run for a longer period of time. I think this is the appropriate level of fancy I like :-)

I quite like jvm-mon, it’s clear and the data is easy to understand. The charts resize dynamically based on how long you keep it running.

jps_stat

URL: github.com/amarjeetanandsingh/jps_stat

A simpler shell script that displays more or less the same information shown by jvmtop and also keeps running and updating the stats. AFAIK, it’s not possible to run it “one-shot”. That would be useful to build our own metric monitoring.

This is what jps_stat looks like when executed:jps_stat console screenshot

jstat

https://docs.oracle.com/javase/7/docs/technotes/tools/share/jstat.html

Last of the lot is jstat which is part of the JVM distribution. Usage is very simple and it can easily be embedded in a one-liner or a shell script, as in:

watch -d -n1 jstat -gc $(pidof java)

`jstat -gc` outputs garbage collection statistics, some of which I still haven’t understood the purpose. A sample output is the following:

Every 1.0s: jstat -gc 12743                                                                                                                                                                Thu Jun  6 09:33:23 2019

 S0C    S1C    S0U    S1U      EC       EU        OC         OU       MC     MU    CCSC   CCSU   YGC     YGCT    FGC    FGCT     GCT
34048.0 34048.0 10229.3  0.0   272640.0 37062.4  2731264.0  1519963.6  32192.0 30728.8 3804.0 3510.3  34926  873.839  26      2.781  876.620

These metrics are a bit more detailed than just looking at heap usage as a whole. I’m not really sure I need to look into all of this specifically.

Conclusion

Not much to say, but for now it was useful to get a bit more details of how the server is running, how much heap memory is consumed, how much non-heap memory is used, and how the total memory used by the java process itself compares to the total memory used on the server.

Plus, it’s handy to write down these things so I can refer back to them when needed :-)

My Retro Arcade Videogames Screensaver for Linux

Screensavers are magic, they’re like magnets, at least to me. I’ve always been fascinated by them. There’s great creativity and programming energy that goes into making one.

I think the first screensavers I encountered were probably the ones written for Windows 3.1, even though the loading colored bars of the Commodore 64 could perhaps qualify as a screensaver as well.

One of the greatest screensavers I remember is the one that came with Novell Netware servers. This was around 1997, and it looked approximately like this video:

https://www.youtube.com/watch?v=RitHgCBNrNs

This is a work of art. It was a text-mode screensaver, and it displayed a few “snakes” going around the screen. Each snake represented a processor. Multi-core processors were very much in the future back then, but we were running our Netware server with a SMP machine and two Pentium Pro processors, which was pretty awesome for the time.

If the processor occupancy was high, the snake would move faster and at the same time, IIRC the higher the number of processes active on each processor, the longer the snake. Simple and brilliant. With a simple look at the screen, you could instantly tell how much the system was loaded, and that was very important, since that Netware machine enabled hundreds of employees to do their daily job.

I have used Linux as my workstation operating system for at least fifteen years now, and jwz’s xscreensaver package has always been present and very appreciated, so much so
that one day I started my own screensaver project for fun.
I decided to rewrite the Netware screensaver obviously and
I wrote about it in the past.

All this to explain that screensavers are a thing for me :-)

Fast forward to now, I think I found the next level of screensaver goodness, and I’m quite satisfied of it. It is a retro-arcade-videogames screensaver. The craze started when I changed job and started using a new Linux workstation. I noticed all my colleagues running MacOS running this rather stunning aerial video footage as screensaver.

I soon discovered that it’s possible to run the same on Linux, and there’s a project called xscreensaver-aerial that you can run as part of xscreensaver. While that is very cool,
it became boring quickly. The idea of xscreensaver-aerial however is neat. You can display your own video files in fullscreen, just fill up a folder with video files and when xscreensaver is invoked, one of those videos will be played. The shell script linked above does this, with the help of mpv, a nice video player for Linux, even though in a slightly convoluted way.

The following step was obvious: use mpv and adapt the shell script to do what I wanted. Pick random video files from a folder. I started downloading longplay videos of my favourite video games. For example, Bomb Jack or Space Ace, all the rage 80’s arcades or Amiga games, for
that nice nostalgia factor.

Started using this version for a while. I run a two monitor setup, one in portrait mode for code and text editing, and another one in landscape mode for browser and every other need.

Sometimes the screensaver would play the same video on both monitors (lame) or play a video with an inappropriate aspect ratio for the monitor, filling the screen with vertical or horizontal black bars.

Thus, last improvement was to automatically detect the aspect ratio of each monitor and pick from either a `landscape` or `portrait` video folder.

The trick I used to do this is nothing else than some shell code:

 #
 # Overwrites $MODE with either `landscape` or `portrait`
 #
 check_monitor_mode() {
     MODE="landscape"
     local tmpfile="$(mktemp)"
     logger -t xscreensaver "xwinid: $XSCREENSAVER_WINDOW"
     xwininfo -id "${XSCREENSAVER_WINDOW}" > "${tmpfile}"
     local width=$(grep "Width: " "${tmpfile}" | awk '{ print $2 }')
     local height=$(grep "Height: " "${tmpfile}" | awk '{ print $2 }')
     unlink "${tmpfile}"
     if [ $width -gt $height ]; then
         logger -t xscreensaver "landscape mode (${width}x${height})"
     else
         logger -t xscreensaver "portrait mode (${width}x${height})"
         MODE="portrait"
     fi
 }

The command to play the video, which must be saved as an xscreensaver module, is:

mpv --really-quiet --no-audio --fs --no-stop-screensaver --wid="$XSCREENSAVER_WINDOW" "${themovie}" &

By customizing ~/.xscreensaver it’s possible to add your own executable programs, give them a name, etc, so for example my ~/.xscreensaver contains the following:

...
 programs:
 - "Apple TV Aerial" atv4                 \n\
 - retro-arcade-screensaver               \n\
 - "Novell Netware Screensaver" loadsnake \n\
 ...

Here’s the final result:

How to profile Python/Django applications

Using the django-profiler module available at https://github.com/char0n/django-profiler

from profiling import profile

and later:

@profile(stats=True)
def view_to_be_profiled(self, request):
    ...

There’s a couple of settings that it’s possible to enable to view SQL queries and tweak the logger name:

PROFILING_SQL_QUERIES = True
PROFILING_LOGGER_NAME = 'profiler'

Just a reminder for myself. Nothing more.

MySQL (Percona XtraDB) slave replication crash resilience settings

It’s been a geological age since my last blog post!

Oh, so many things happened in the meantime. For the past four years, I worked on the development and operations side of the news recommendation system that powered Opera Discover. With enough energy, I have planned to write a recommender systems “primer” series. We’ll see.

Meanwhile, I’d like to keep these notes here. They’ve been useful to make MySQL replication recover gracefully from network instability, abrupt disconnections and generally datacenter failures. Here they are.

Coming MySQL 5.6 on Debian Wheezy, we began to experience mysql replication breakages after abrupt shutdowns or sudden machine crashes. When systems came back up, more frequently then not, mysql replication would stop due to corrupted slave relay logs.

I started investigating this problem and soon found documentation and blog posts describing the log corruption issues and how mysql development addressed that. Here’s the pages I used as references:

Additionally, we had (I believe unrelated) problems with some mysql meta tables that couldn’t be queried, even though they were listed as existing in the mysql shell and in the filesystem.
We solved this problem with the following steps:

DROP TABLE innodb_table_stats;
ALTER TABLE innodb_table_stats DISCARD TABLESPACE;
stop mysql
rm -rf /var/lib/mysql/mysql/innodb_table_stats.*
restart mysql

These steps have to be executed in this order, even if altering a table after having dropped it may seem nonsensical. It is nonsensical, as sometimes mysql things are.

Crash safe replication settings

We’ve distilled a set of standalone replication settings that will provide years and years of unlimited crash-safe replication fun (maybe). Here they are:

# More resilient slave crash recovery
master-info-repository = TABLE
relay-log-info-repository = TABLE
relay-log-recovery = ON
sync-master-info = 1
sync-relay-log-info = 1

Let’s see what each of these settings does.

master-info-repository=TABLE and relay-log-info-repository=TABLE instruct mysql to store master and relay log information into the mysql database rather than in separated *.info files in the /var/lib/mysql folder.
This is important because in case of crashes, we would like to ensure that master/relay log information is subject to the same ACID properties that the database itself provides. Corollary: make sure the relevant meta tables have InnoDB as storage engine.
For example, a SHOW CREATE TABLE slave_master_info should say Engine=InnoDB.

relay-log-recovery=ON is critical in case of corruption of relay log files on a slave system. When MySQL encounters corrupted relay log files during startup, by default it will drop the ball and halt. This option set to ON, will cause mysql to attempt refetching the relay log files from the master database. The master should then be configured to keep its binlogs for a suitable amount of time (often I use 2 weeks, but really depends on the volume of database changes). As a test, it’s possible to replace the current relay log file with a corrupted copy (from /dev/urandom for example). MySQL will discard the corrupted log file and attempt download from the master, after which a regular startup will be carried out. Fully automatic recovery!

sync-master-info=1 and sync-relay-log-info=1 enable the synchronized commit of both master and relay log information to the database with every transaction commit. This is again something that must be evaluated in each single application. Most probably if you have a high volume of writes, you don’t want to enable it. However, if the writes rate is low enough, this option won’t cost any additional performance and should instead make sure that the slave_master_info and slave_relay_log_info tables are always consistent with the state of the replication and of the rest of the database.

That is all. I’d love to hear any feedback or corrections to this information.

The quest for a perfect keyboard — part 2

You might want to read episode 1 first, in which I started documenting my quest to find an excellent keyboard that matches me.

Filco Majestouch 2 Ninja TKL The keyboard I currently use at work is one I bought myself, a Filco Majestouch 2 TKL US layout, TKL meaning without the numeric keypad (so called Ten Key Less). In part one I said I was experiencing some strain in my left hand typically after a day of work or so.

That doesn’t happen anymore.

To understand why, I’ll go back a few months from now. I was reading
Twitter, and I noticed a message from a friend.

He wasn’t using his Kinesis keyboard, and asked if anyone was interested. He was basically giving away his Kinesis.

(here’s an interesting article involving another Kinesis)

Surprise!

We got in touch and Tim shipped his keyboard to Norway. I had
been playing with the thought of buying one for a while, but it seemed
too expensive and I was not convinced at all that I could learn to type on it comfortably.

Kinesis Advantage black keyboard

The very first time I plugged the keyboard in, I was at home (I mean I was literally at home), relaxed, very excited about learning how the Kinesis worked, where all the keys were located, and what was the hand-brain response. Everything went better than expected.

Next I brought the keyboard at work. That is when things started to
break down :-)

Kinesis at work

I’m slowly learning the Norwegian language. At work, where I spend most of the time, the “official” language is English. That means I don’t get a lot of practice for free, and it’s taking me a long time to become proficient.

When an opportunity to practice Norwegian comes up, I feel some resistance to use it because I suck at it. Add that English is more comfortable to use and everyone can understand it, and you have a good recipe to keep me from learning even longer :-)

Learning to type on a Kinesis is like speaking a new language. All the movements and keystrokes that are hardwired in your muscle memory suddenly cannot work anymore like you’re used to and you feel frustrated and helpless. I noticed I was at least 5-10 times slower than I was on my usual keyboard. This was the first week of usage.

Exercise!

The instruction manual for the Kinesis is something you definitely want to read and use. For many reasons. It explains how to activate many functions and special keys that otherwise you wouldn’t know how to.

More importantly, it warns you about this awkward reaction your body and fingers are going to have when you start using the keyboard and advices to perform some exercises to retrain your brain and muscle memory.

Those exercises are really effective, and after just a couple of days of doing them I saw a big difference in typing comfort and rate of errors. What I didn’t expect was to realize that they also help you type better no matter what keyboard you use. One of the key (ehm) lessons these exercises will teach you is when it’s better to use your left or right fingers.

For example, if I have to type # (the hash or square sign, on a US layout keyboard, upper symbol on the 3 key), I usually press the left SHIFT key with my pinky and then press 3 with my (left) middle finger.

The Kinesis manual strongly advices against that, and instead helps you learn to type # by depressing the right-side SHIFT instead with your right hand, and the 3 with your left hand. It will be much more comfortable and will result in much less strain after a long work session.

I guess here I just gave away that I’m no trained typist :-)
Oh well, that’s ok.

Back to black

Back to my (black) regular keyboard, I could really see that this way of typing helps in the long run, so I tried to practice it and now I find myself using these straining finger combinations a lot less.

But… what happened to the Kinesis then?

I learned to type decently on it, with an OK speed, still making more mistakes than I’d like, but at least not feeling frustrated. However, the Kinesis proved not to be the keyboard for me!

I truly believe it is an excellent keyboard. However, here’s my personal list of cons (beware, this is a nitpicky list):

  • function keys and all other keys on that row, including escape, are too small and close to each other and lack the same mechanical switch feel of the rest of the keyboard.
  • palm-rests are slightly too high, at least for me. I found them a bit uncomfortable.
  • no Control key on the right (or was it left?) side, so sometimes using the correct key combination takes either two keystrokes or must be done with the same hand instead of using (properly) two.

keyboard.io

I’ve been following Jesse Vincent’s blog for a while now, enough to know he’s a keyboard enthusiast (nerd?) and has been building a lot of keyboard prototypes, to find “The One”, and he went on to found the keyboard.io initiative.

That’s interesting, because following his posts, I can see that I am looking at exactly the same features and ergonomicity(?) when evaluating a keyboard for daily — programming — use.

Here’s the keyboard.io “00” model:

Keyboard.io Model 00. Copyright Jesse Vincent / keyboard.io

Keyboard.io Model 00. Photo courtesy Jesse Vincent / keyboard.io

It would be very interesting to try it. It’s not sold yet, so feedback can be limited to what I can see:

  • It’s clear that a lot of design and thought has gone into this keyboard, and it’s probably just right for Jesse that built it, so it can’t be any better from that point of view.
  • The use of wood instead of a more keyboardy material feels strange, like those “Diamond iPhone” products, or crocodile skin shoes or something like that. Maybe it’s awesome, but that’s how it feels to me.
  • Key placement seems to be the most compelling and studied feature of this keyboard. It seems to map actual hands and fingers closer than any other keyboard I’ve seen, including the ergonomic ones. It’s hard to imagine (without having one here with me at the moment :-) how to use the thumb, especially with the white key. That one seems to be right under the finger “body”. Not sure.
  • The “butterfly” shape doesn’t feel like professional to me. I wouldn’t associate it with a model that a keyboard nerd would want to buy. From the pictures it seems like the vertical (longitudinal) dimension is very large compared to the width. I guess it’s not a keyboard that people want to carry around too much :-)
  • I hope the cable is at least 1.5m in length. It’s never long enough.

That’s it for part 2.

I’m still sticking to the Filco for now! :-)

The quest for a perfect keyboard — part 1

I think I have become addicted to computer keyboards.

Maybe I always was, but I didn’t notice until about a year ago. Let’s go back in time a bit.

Year is 1981. I am about 6 years old. Not going to school yet,
in fact I would skip first grade and start straight in second. I must have been too adorable, so my mom didn’t want to let me go to school :-)

My dad brings home something that would change my life forever, some sort of toy. One of the first mass market home computers ever made. The Commodore VIC-20.

The computer *is* the keyboard. I think that is where
the magic started. Fast forward a few years, and I got
the C64, then the Amiga 500, the best home computer ever built:

Tried a few more (old) computer models in school, and then the second “strange” encounter.

My dad again :-) was trying to buy a computer powerful
enough to run some architectural CAD program, that only
came with PowerPC Macs series 6000 at the time.

That’s when I saw my first split keyboard.

Don’t know why, but it felt irresistibly attractive for me, subconsciously imprinting in my mind the link between split keyboards
and raw processing power :-)

Fast forward another few years when my then girlfriend,
about 15 years later still my wife, gave me a completely
unexpected and totally awesome gift:

A split keyboard, similar to the Microsoft Natural Elite,
the old model. I have been working on this very keyboard
ever since, for more than ten years. It has undergone
three or four full disassemblies and cleaning cycles.

Up to last year, when it started having the first aging
problems, particularly with the space bar that needed
a few hits to record a keystroke.

Not sure if I was more sad or more excited to start
searching for a replacement keyboard :-)

After a few days, I found what I think are the reference
communities for keyboard enthusiasts: GeekHack and Deskthority.

A month later I already had an IBM Model “M” scavenged
from a recycled electronics bin here at work, and having annoyed my
office colleagues with the buckling springs noise
for a couple of weeks, I finally went on to buy
myself this little gem:

This is a “Filco Majestouch Ninja 2 TKL” with US layout.
I’ve been using US layouts for my entire life more or less.

The Filco is a really good keyboard, very good build quality,
sturdy, heavy, awesome to look at and to work with.
A bit noisy, yes, I’m reminded every now and then :-)

After months of using the Filco, all I can say is that it’s
great, and I love working with it. If you map CAPS LOCK to
CTRL it’s even better.

I use it with awesomewm, so the Windows/Meta4 key is used a lot.
On some particularly long sessions, I can feel some strain on
my left hand fingers, pinky and thumb.

Which made me wonder whether I should go back to a split layout…

Stay tuned for part 2 – The search continues.

Display and filter traffic at the varnish level: vlogdump

Haven’t written much in the last few months. The reason is that I’ve been at work building the Opera Discover service backend, that we launched on Opera mobile for Android just a few days ago.

A few weeks before, during the first stress test sessions of Discover, I wrote this little tool called vlogdump that Opera allowed me to put up on github. The main purpose, besides learning awk :-) is to display and filter traffic coming into your varnish daemon..

vlogdump is not meant to replace varnishlog but I know that sometimes varnishlog gives me too much output to deal with, especially if I want to pinpoint a single client or a single request. I know that the varnishlog that ships with varnish 3.0.x is way better in this regard, but we’re using 2.1.x, and that version of varnishlog is not as capable.

vlogdump is easier to look at than varnishlog, but at the same time it conveys much more information than varnishncsa or the typical access.log format.

Here’s an example of output:

$ varnishlog | vlogdump -v only_misses=1
172.22.0.15 => GET /assets/e85ed0a7b1b87120a0a2bfa025531c6733a48802 HTTP/1.0 MISS
            <= 200 OK 28.432 ms 172.22.0.18 => GET /assets/5a9e9440c5c85e8dc5d65e03e15c95e390901fa7 HTTP/1.0 MISS
            <= 200 OK 36.905 ms 172.22.0.18 => GET /icons/categories/te/icon32x32-technology.png HTTP/1.0 MISS
            <= 304 Not Modified 0.589 ms 172.22.0.15 => GET /api/fetch/article-preview/?client=2&language=en-GB HTTP/1.1 MISS
            <= 301 MOVED PERMANENTLY 8.381 ms 172.22.0.18 => GET /assets/c3830e95b717761005e26ce49ebab253e0ccb40b HTTP/1.0 MISS
            <= 200 OK 291.354 ms 172.22.0.18 => GET /api/category?client=2&language=en-GB HTTP/1.1 MISS
            <= 200 OK 58.025 ms   ...

Another interesting example.

Show request and response headers of transactions that resulted in cache hits and had request headers (any of them) matching "Android":

$ varnishlog | vlogdump -v show_req_headers=1 -v show_resp_headers=1 -v req_headers_match=Android -v only_hits=1
83.149.37.122 => GET /api/category/?... HTTP/1.1 HIT
            <= 200 OK         0.088 ms
   req.http.Accept = application/json;version=1
   req.http.Accept-Encoding = gzip
   req.http.Host = ...opera.com
   req.http.Connection = Keep-Alive
   req.http.User-Agent = Mozilla/5.0 (Linux; Android 4.1.2; GT-N7100 Build/JZO54K) AppleWebKit/537.31 (KHTML, like Gecko) Chrome/26.0.1410.58 Mobile Safari/537.31 OPR/14.0.1074.57768
 
   beresp.http.Server = Apache
   beresp.http.Content-Encoding = gzip
   beresp.http.Content-Type = application/json
   beresp.http.Vary = Accept-Encoding, Origin
   beresp.http.Content-Length = 4217
   beresp.http.Date = Sat, 25 May 2013 07:59:53 GMT
   beresp.http.X-Varnish = 1611090407 1611007435
   beresp.http.Age = 267
   beresp.http.Via = 1.1 varnish
   beresp.http.Connection = keep-alive

Now that you're eager to try it :-), you can do so in a few commands, and assuming you have the right awk installed:

wget -q -Ovlogdump https://raw.github.com/cosimo/vlogdump/master/vlogdump
varnishlog | ./vlogdump [options]

The documentation lists all the available options.

You can do more interesting things:

  • display request or response headers for each transaction
    (-v show_req_headers=1, -v show_resp_headers=1)
  • show only requests slower than 200ms (vlogdump -v only_slow=200)
  • show only cache misses or hits (-v only_hits=1 or only_misses=1)
  • show only transactions where the URL matches regexp X
    (-v url_match='X' or -v url_match='!X' for negative match)
  • show only transactions where the HTTP status code was X
    (-v only_status=X)
  • show only transactions where the request or response headers match a given regular expression (-v req_headers_match=Blah, -v resp_headers_match=Error)

You can also combine most of these options together. That is very useful when you are interested in a small fraction of the traffic, but you want to see the whole in-flight transactions.

One recommendation though. It is my first (last?) significant awk script :-) I know it works well, and I'm using it, but due to the way it works, I wouldn't leave it running for long periods of time, as it will slowly eat your memory keeping track of all transactions and clients.

If you have feedback or questions, feel free to comment on github or send me an email.

Logging nagios remote commands

Quick trick, I needed it to debug execution of remote nagios commands.

Just drop this file into /var/nagios/.bashrc, assuming your local nagios user is configured to use the /bin/bash shell:

#!/bin/bash
#
# Log every command run by the nagios user
# into /var/log/auth.log (at least on Debian and derivatives)
#
trap 'logger -p auth.info -t nagios "Running $BASH_COMMAND"' DEBUG

The trap function executes a given command or list of commands when the list of signals specified as arguments are raised,
as in:

trap [COMMAND] [SIGNALS]

The DEBUG signal is special: it will fire every time a command is executed. Using logger ensures that whatever command the nagios user is trying to execute will be logged.

Last bit, how do you get the command text? It’s available in the $BASH_COMMAND variable.

Here’s an extract of the resulting log information:

Mar 30 10:48:05 big1 nagios: Running /usr/lib/nagios/plugins/check_cpu -i 5 -w 90 -c 98
Mar 30 10:49:42 big1 nagios: Running /usr/lib/nagios/plugins/check_tcp -p 3306
Mar 30 10:49:42 big1 nagios: Running /var/nagios/libexec/check_load -w40,40,40 -c50,50,50
Mar 30 10:50:26 big1 nagios: Running /usr/lib/nagios/plugins/check_procs -w 1:1 -c 1:1 -a /usr/sbin/cron
Mar 30 10:50:44 big1 nagios: Running /usr/lib/nagios/plugins/check_disk -w 20 -c 10 -r "^/(ssd|store[1-3])?$"
...

To learn more about traps, here’s a web search on “bash traps”.