Author Archives: cosimo

Deploying Large Deep Learning Models in Production

Most deep learning or machine learning (ML) articles and tutorials focus on how to build, train and evaluate a model. The model deployment stage is rarely covered in detail, even though it is just as important if not fundamental part of a ML system. In other words, how do we take a working ML model from a jupyter notebook to a production ML-powered API?

I hope more and more practitioners will cover the deployment aspect of ML models. For now, I can offer my own experience about how I approached this problem, hoping this will be useful to some of you out there.

Creating a useful ML model

How to create a useful ML model is the part of the work I won’t cover in this post. :-)

I assume that you already have:

  • a model or pipeline that is either pre-trained or that you have trained yourself
  • a model based on PyTorch, though most of the information here will probably help with any ML framework
  • some idea on how to make your model available as a RESTful API

First step: defining a simple API

The rest of this article will use Python as a programming language, for various reasons, the most important being that the ML model is based on PyTorch. In my specific case, the problem I worked on was text clustering.

Given a set of sentences, the API should output a list of clusters. A cluster is a group of sentences that have a similar meaning, or as similar as possible. This task is usually referred to with the term “semantic similarity”.
Here’s an example. Given the sentences:

  • “Dog Walking: 10 Simple Steps”
  • “The Secrets of Dog Walking”
  • “Why You Need To Dog Walking”
  • “The Art of Dog Walking”
  • “The Joy of Dog Walking”
  • “Public Speaking For The Modern Age”,
  • “Learn The Art of Public Speaking”
  • “Master The Art of Public Speaking”
  • “The Best Way To Public Speaking”

The API should return the following clusters:

  • Cluster 1 = (“Dog Walking: 10 Simple Steps”, “The Secrets of Dog Walking”, “Why You Need To Dog Walking”, “The Art of Dog Walking”, “The Joy of Dog Walking”)
  • Cluster 2 = (“Public Speaking For The Modern Age”, “Learn The Art of Public Speaking”, “Master The Art of Public Speaking”, “The Best Way To Public Speaking”)

The model

I plan to describe the details of the specific model and algorithm I used in a future post. For now, the important aspect is that this model can be loaded in memory with some function we define as follows:

model = get_model()

This model will likely be a very large in-memory object. We only want to load it once in our backend process and use it throughout the request lifecycle, possibly for more than just one request. A typical model will take a long time to load. Ten seconds or more is not unheard of, and we can’t afford to load it for every request. It would make our service terribly slow and unusable.

A simple Python backend module

Last year I discovered FastAPI, and I immediately liked it. It’s easy to use, intuitive and yet flexible. It allowed me to quickly build up every aspect of my service, including its documentation, auto-generated from the code.

FastAPI provides a well-structured base to build upon, whether you are just starting with Python or you are already an expert. It encourages use of type hints and model classes for each request and response. Even if you have no idea what these are, just follow along FastAPI’s good defaults and you will likely find this way of working quite neat.

Let’s build our service from scratch. I usually start from a python virtualenv, an isolated python environment where you can install your dependencies.

virtualenv --python /usr/bin/python3.8 .venv
source .venv/bin/activate

If you are not familiar with virtualenv, there are many tutorials you can read online.
Next step, we write our requirements file, with all the python modules we need to run our project. Here’s an example:

# --- requirements.txt
fastapi~=0.61.1

Save the file as requirements.txt. You can install the modules with pip. There are plenty of guides on how to get pip on your system if you don’t have it:

pip install -r requirements.txt

Doing so will install FastAPI. Let’s create our backend now. Copy the following skeleton API into a main.py file. If you prefer, you can clone the FastAPI template published at https://github.com/cosimo/fastapi-ml-api:

from typing import Optional

from fastapi import FastAPI

app = FastAPI()
model = get_model()

@app.post("/cluster")
def cluster():
return {"Hello": "World"}

You can run this service with:

uvicorn main:app --reload

You’ll notice right away that any changes to the code will trigger a reload of the server: if you are using the production ML model, the model own load time will quickly become a nuisance. I haven’t managed to solve this problem yet. One approach I could see working is to either mock the model results if possible, or use a lighter model for development.

Invoking uvicorn in this way is recommended for development. For production deployments, FastAPI’s docs recommend using gunicorn with the uvicorn workers. I haven’t looked into other options in depth. There might be better ways to deploy a production service. For now this has proven to be reliable for my needs. I did have to tweak gunicorn’s configuration to my specific case.

Running our service with gunicorn

The gunicorn start command looks like the following:

gunicorn -c gunicorn_conf.py -k uvicorn.workers.UvicornWorker --preload main:app

Note the arguments to gunicorn:

  • -k tells gunicorn to use a specific worker class
  • main:app instructs gunicorn to load the main module and use app (in this case the FastAPI instance) as the application code that all workers should be running
  • --preload causes gunicorn to change the worker startup procedure

Preloading our application

Normally gunicorn would create a number of workers, and then have each worker load the application code. The --preload option inverts the sequence of operations by loading the application instance first and then forking all worker processes. Because of how fork() works, each worker process will be a copy of the main gunicorn process and will share (part of) the same memory space.

Making our ML model part of the FastAPI application (or making our model load when the FastAPI application is first created) will cause our model variable to be “shared” across all processes!

The effect of this change is massive. If our model, once loaded into memory, occupies 1 Gb of RAM, and we want to run 4 gunicorn workers, the net gain is 3 Gb of memory that we will have available for other uses. In a container-based deployment, it is especially important to keep the memory usage low. Reclaiming 75% of the total memory that would otherwise be used is an excellent result.

I don’t know enough details about PyTorch models or Python itself to understand how this sharing keeps being valid across the process lifetime. I believe that modifying the model in any way will cause copy-on-write operations and ultimately the model variable to be copied in each process memory space.

Complications

Turns out we don’t get this advantage for free. There are a few complications with having a PyTorch model shared across different processes. The PyTorch documentation covers them in detail, even though I’m not sure I did in fact understand all of it.

In my project I tried several approaches, without success:

  • use pytorch.multiprocessing in the gunicorn configuration module
  • modify gunicorn itself (!) to use pytorch.multiprocessing to load the model. I did it just as a prototype, but even then… bad idea
  • investigate alternative worker models instead of prefork. I don’t remember the results of this investigation, but they must have been unsuccessful
  • use /dev/shm (Linux shared memory tmpfs) as a filesystem where to store the Pytorch model file

A Solution?

The approach I ended up using is the following.

gunicorn must create the FastAPI application to start it, so I loaded the model (as a global) when creating the FastAPI application, and verified the model was loaded before that, and only loaded once.

I added the preload_app = True option to gunicorn’s configuration module.

I limited the amount of workers (my tests showed 3 to work best for my use case), and limited the amount of requests each gunicorn worker will serve. I used max_requests = 50. I limited the amount of requests because I noticed a sudden increase in memory usage in each worker regularly some minutes after startup. I couldn’t trace it back to something specific, so I used this dirty workaround.

Another tweak was to allow the gunicorn workers to start up in a longer than default time, otherwise they would be killed and respawned by gunicorn’s own watchdog as they were taking too long to load the ML model on startup. I used a timeout of 60 seconds instead of the default 30.

The most difficult problem to troubleshoot was workers suddenly stopping and not serving any more requests after a short while. I solved that by not using `async` on my FastAPI application methods. Other people have reported this solution not working for them… This remains to be understood.

Lastly, when loading the Pytorch model, I used the .eval() and .share_memory() methods on it, before returning it to the FastAPI application. This is happening just on first load.

For example, this is how my model loading looks like:

def load_language_model() -> SentenceTransformer:
    language_model = SentenceTransformer(SOME_MODEL_NAME)
    language_model.eval()
    language_model.share_memory()

    return language_model

The value returned by this method is assigned to a global loaded before the FastAPI application instance is created.

I doubt this is the way to do things, but I did not find any clear guide on how to do this. Information about deploying production models seems quite scarce, if you remember the premise to this post.

In summary:

  • preload_app = True
  • Load the ML model before the FastAPI (or wsgi) application is created
  • Use .eval() and .share_memory() if your model is PyTorch-based
  • Limit the amount of workers/requests
  • Increase the worker start timeout period

Read on for other tips about dockerization of all this. But first…

Gunicorn configuration

Here’s more or less all the customizations needed for the gunicorn configuration:

# Preload the FastAPI application, so we can load the PyTorch model
# in the parent gunicorn process and share its memory with all the workers
preload_app = True

# Limit the amount of requests a single worker will handle, so as to
# curtail the increase in memory usage of each worker process
max_requests = 50

Bundling model and application in a Docker container

Your choice of deployment target might be different. What I used for our production environment is a Dockerfile. It’s easily applicable as a development option but also good for production in case you deploy to a platform like Kubernetes like I did.

Initially I tried to build a Dockerfile with everything I needed. I kept the PyTorch model file as binary in the git repository. The binary was larger than 500Mb, and that required the use of git-lfs at least for Github repositories. I found that to be a problem when trying to build Docker containers from Github Actions. I couldn’t easily reconstruct the git-lfs objects at build time. Another shortcoming of this approach is that the large model file makes the docker container context huge, increasing build times.

Two stage Docker build

In cases like this, splitting the Docker build in two stages can help. I decided to bundle the large model binary into a first stage Docker container, and then build up my application layer on top as stage two.

Here’s how it works in practice:

# --- Dockerfile.stage1

# https://github.com/tiangolo/uvicorn-gunicorn-fastapi-docker
FROM tiangolo/uvicorn-gunicorn-fastapi:python3.8

# Install PyTorch CPU version
# https://pytorch.org/get-started/locally/#linux-pip
RUN pip3 install torch==1.7.0+cpu torchvision==0.8.1+cpu torchaudio==0.7.0 -f https://download.pytorch.org/whl/torch_stable.html

# Here I'm using sentence_transformers, but you can use any library you need
# and make it download the model you plan using, or just copy/download it
# as appropriate. The resulting docker image should have the model bundled.
RUN pip3 install sentence_transformers==0.3.8
RUN python -c 'from sentence_transformers import SentenceTransformer; model = SentenceTransformer("")'

Build and push this container image to your docker container registry as stage1 tag.

After that, you can build your stage2 docker image starting from the stage1 image.

# --- Dockerfile
FROM $(REGISTRY)/$(PROJECT):stage1

# Gunicorn config uses these env variables by default
ENV LOG_LEVEL=info
ENV MAX_WORKERS=3
ENV PORT=8000

# Give the workers enough time to load the language model (30s is not enough)
ENV TIMEOUT=60

# Install all the other required python dependencies
COPY ./requirements.txt /app
RUN pip3 install -r /app/requirements.txt

COPY ./config/gunicorn_conf.py /gunicorn_conf.py
COPY ./src /app
# COPY ./tests /tests

You may need to increase the runtime shared memory to be able to load the ML model in a preload scenario.
If that’s the case, or if you get errors on model load when running your project in Docker or Kubernetes, you need to run docker with --shm-size=1.75G for example, or any suitable amount of memory for your own model, as in:

docker run --shm-size=1.75G --rm <command>

The equivalent directive for a helm chart to deploy in Kubernetes is (WARNING: POSSIBLY MANGLED YAML AHEAD):

apiVersion: apps/v1
kind: Deployment
metadata:
  ...
spec:
  ...
  template:
    ...
    spec:
      volumes:
        - name: modelsharedmem
          emptyDir:
            sizeLimit: "1750Mi"
            medium: "Memory"
      containers:
        - name: {{ .Chart.Name }}
          ...
          volumeMounts:
            - name: modelsharedmem
              mountPath: /dev/shm
          ...

A Makefile to bind it all together

I like to add a Makefile to my projects, to create a memory of the commands needed to start a server, run tests or build containers. I don’t need to use brain power to memorize any of that, and it’s easy for colleagues to understand what commands are used for which purpose.

Here’s my sample Makefile:

# --- Makefile
PROJECT=myproject
BRANCH=main
REGISTRY=your.docker.registry/project

.PHONY: docker docker-push start test

start:
    ./scripts/start.sh

# Stage 1 image is used to avoid downloading 2 Gb of PyTorch + nlp models
# every time we build our container
docker-stage1:
    docker build -t $(REGISTRY)/$(PROJECT):stage1 -f Dockerfile.stage1 .
    docker push $(REGISTRY)/$(PROJECT):stage1

docker:
    docker build -t $(REGISTRY)/$(PROJECT):$(BRANCH) .

docker-push:
    docker push $(REGISTRY)/$(PROJECT):$(BRANCH)

test:
    JSON_LOGS=False ./scripts/test.sh

Other observations

I had initially opted for Python 3.7, but I tried upgrading to Python 3.8 because of a comment on a related FastAPI issue on Github, and in my tests I found that Python 3.8 uses slightly less memory than Python 3.7 over time.

See also

I published a sample repository to get started with a project like the one I just described: https://github.com/cosimo/fastapi-ml-api.

And these are the links to issues I either followed or commented on while researching my solutions:

The Perl echo chamber, marketing and … is Perl really dying?

Recently I came across this tweet from Curtis/Ovid, which references longer post about a proposal to integrate a better, more modern object-oriented “system” (Corinna) in Perl 5.

The proposal itself is not what I’d like to address here. I haven’t followed Corinna’s evolution. I believe it goes in a positive direction for the language, FWIW.

From that original tweet, a comment from Rafael followed:

[…] but I’m still wondering what are the real factors that make companies seek an exit strategy from Perl 5. Who makes this kind of expensive decision, and why? I suspect obscure OO syntax is not a major one.

This is what I replied with:

This is indicative of the fundamental problem in the Perl echo chamber. Some people still have no idea why companies are moving away from Perl. If you want to hear the perspective from someone who has seen this happen in multiple companies, let me know :-)

Sorry for this premise, but I was afraid what follows would make no sense otherwise.

Why is Perl dying today?

First of all, I don’t think “<language> is dying” is a useful question to ask, nor it is indicative of anything particularly interesting. I’m sure everyone reading this will have encountered plenty of “C is dying”, “Java is dying” or similar, and yet, C and Java are still being used everywhere. In one sense, no language really dies ever. In Perl’s situation, things are slightly different though, as (I believe) Python slowly conquered Perl’s space over time.

What does it mean for a language to die, or to be dead?

From an end user point of view, let’s say a random programmer employed in a company or freelance, a language could be dying if a task they want to accomplish using that language is hard because there are no supporting libraries for it (think CPAN or PyPi), or the libraries are so old they don’t work anymore. That situation surely conveys the idea that the language is not in use anymore, or very few people must be using that language. One would expect that a common task in 2021 must be easy to accomplish with a language worth using in 2021.

What about a company‘s point of view? The reality is that companies don’t have an opinion on languages, only people do. Teams do have an opinion on languages. The group dynamics inside a team influence what languages are acceptable for current and new projects.

Is Perl dying then?

My experience

Some years ago I was a fairly active member of the Perl community, I attended and presented at various Perl conferences around Europe, talking about my experience using Perl at a few small and large companies.

I remember picking up Perl for the first time based on a suggestion from my manager back then. He gave me a hard copy print-out of the whole of Perl 5.004 man pages, and said: “We are going to use this language. It’s amazing, take some time to study it and we’ll start!”. This was 1998, and I had such a fantastic time :-). I was such a noob, but Perl was amazing. It could do everything you needed and then some, and it was easy and simple. The language was fast already back then, and it got faster over time. At that point in time, I was working in a very small company, we were three people initially, and we ended up writing a complete web framework from scratch that is still in use today, after more than 20 years. If that’s not phenomenal, I don’t know what is. It’d be cool to talk about this framework: it was more advanced than anything that’s ever been done even considering it’s 2021… a story for another time.

And by the way, we were running our Perl code on *anything*, and I mean anything, Windows PCs, Linux, Netware and even AS/400, a limited subset of it at least, at a time when Java’s “write once, run everywhere” was just an empty marketing promise. Remember this was the time of Netscape Navigator and Java applets. Ramblings, I know, but perhaps useful to understand where things have gone wrong.

In 2007, I left my job in Italy and moved to Norway to work for Opera Software. Back then, Opera’s browser was still running the Presto engine, and a little department inside Opera was in charge of web services. That’s where I was headed. Most services there were written in Perl. Glorious times for me, I would learn an awful lot there, meet a lot of skilled developers. Soon after I started working there, 2007, some colleagues were already making fun of Perl. It’s a “write-only language”, “not meant for serious stuff”, “lack of web frameworks”, etc… Those were the times when Python frameworks started to emerge, some of which would eventually disappear. I remember a few colleagues strongly arguing to move to this Python framework called Pylons, and then eventually to Django.

I believe this general attitude towards Perl originated from different factors:

  • personal preference towards other languages and/or dislike towards Perl
  • the desire to be working with the latest “hip” framework or language
  • the discomfort of maintaining an aging codebase with problems

These factors exist and are legitimate reasons to want to move away from any language or framework. I’m not saying they are justified, but I do understand why people wanted that. In our field, I have seen it’s quite common to try and avoid the objective difficulties of maintaining a legacy project, going the greener way of an overly optimistic rewrite, which normally ends in tears.

Throughout the years, I noticed other contributing factors to the progressive abandonment of Perl, even in companies like Opera.
I’ll mention two that I experienced directly:

  1. Outdated or non existent supporting libraries
  2. Teams composition

There was a time a few years ago, when CPAN was awesome, the best language support system in existence and every other language community was envying it. CPAN pretty much was selling Perl by itself. In my case, the libraries on CPAN educated me and made me adopt a testing culture that no other language (in my knowledge) had before Perl. Today, seeing npm modules being installed without running tests makes me uncomfortable :-)

Then over time (years) a shift happened. You would search on CPAN for a library that would help you with a common task and you wouldn’t find anything, or you would only find quick hacks that didn’t really work properly. In my case, I remember the first example of that being OAuth2. If I had to speculate, I would say this is a product of many elements, one of which is the average age of Perl programmers getting higher.

Another related shift I remember from those years is companies publishing their APIs/SDKs started dismissing Perl, at first relying on some CPAN module to eventually appear, then completely omitting Perl support. In the beginning, we politely complained to those companies, trying to make a point, but unfortunately there was no turning back. These days almost no SDK comes with a Perl component.

The second major aspect I have experienced is related to teams. In 2012 I was tasked with writing my first ever greenfield project, entirely from scratch, a project that would turn out to be one of the things I’m most proud of, Opera Discover, an online news recommendation system for the Opera browser, still working today! A team of three veteran engineers (myself included) was assembled, and there and then, we were faced with a decision: what language should we use for this?

While I was most experienced in Perl and knew Python a little, the other two colleagues didn’t know Perl. They had experience in C++ mostly, as this was Opera after all. We were chosen not based on our programming language expertise, rather (I suppose) based on our capability to tackle such a big and complex project. While I could have proposed that the project be written in Perl, in good conscience I knew that choice was not viable. Django was readily available and could provide a wide range of functionality we actually needed. No alternative in the Perl world could come close to such a good value proposition. The fact that Python was (like Perl had been for me!) a very accessible choice, simple to pick up, easily installed on any Linux system, and with plenty of solid up-to-date libraries, made the choice obvious.

With the Discover project, I started learning Python properly as a day-to-day programming language. I remember being horrified (and making fun of) the httplib2/httplib3 situation initially. Then I learned about the requests module and forgot all about it. This is to say, Python also has its quirks of course. The disastrous Python 2 vs Python 3 decision in the Python community caused a lot of grief and uncertainty for people (Perl could have learned something from that…). Nowadays, that’s a non-argument, everything runs on Python 3 and if you still haven’t moved, you will soon.

In general, having learned Python quite well, my mindset with regards to programming and my job changed completely. I’m not a Perl programmer. I’m not a Python programmer either. I can use different tools whenever they are more suited to what I need to do. In fact, in my last four years I have written software in NodeJS and Java of all things… I used to despise and make fun of Java, but I had never worked on any professional project before. While I do maintain that Java has some horrible aspects, contrary to my expectations, I have enjoyed working with it, it has an efficient runtime, awesome threading, solid libraries and debugging/inspection tools.

While I do understand Ovid’s point about wanting to keep the business going, and enjoying Perl as a language, I have personally moved on many years ago. I still use Perl for the occasional script when it’s convenient, but for other use cases, like web APIs, I prefer Python and FastAPI, PyTorch for machine learning, etc.. so my conclusion is that it’s the libraries and the ecosystem that drive language use, and not the language itself.

A better OO system will unfortunately do nothing for Perl (in my opinion at least). Better marketing will without a doubt do nothing for Perl. As if a prettier website could change the situation and the aspects I talked about… it can’t! The situation we have in front of us in 2021 is the result of technological and social changes started at least a decade ago.

I realize this may be an incoherent post. Sorry about that, I tried to write it right away or it would have probably never come out.
If you have questions or comments, let me know and I’ll try to address them if I can.

Most importantly, I do not wish to convince anyone that what I wrote is true. It is simply my experience. If there’s one thing I wish people would take from it, it’s to move away from the thought of yourself being a “X Programmer” and broaden your horizons and set of tools available to you. It was a tremendously positive move for myself, one I wished I had done before.

Peace.

pgtop – a top clone for PostgreSQL

According to meta::cpan records, the first release of pgtop is dated April 26, 2005, which makes this little software more than 15 years old!

Back then I had just found out about the brilliant mytop by Jeremy Zawodny, and my day-to-day experience being on Postgres, IIRC version 6.5.3, I decided to try and “convert” mytop to Postgres.

Being quite naive, I thought the endeavour would be much easier than it really was. I’m glad I started though, which is why pgtop exists in the first place. It’s not the only one either. I seem to remember a few similar pgtop projects by other programmers.

After using MySQL and Percona Server for many years, due to a new job, I have gone back to Postgres, version 9.5 and 10 at this time. In recent months, I have done some work to improve performance of our database queries, and remembered writing and using pgtop years before.

Since I lost(*) the original sources, I tried the pgtop version I last uploaded to CPAN, 0.05, dated 2008. It did work, in the sense that I could run the same perl code unmodified, a great testament to Perl as language and as runtime. It didn’t work because the underlying Postgres meta tables that were used in version 6 changed their schema in the 10-12 years since :-)

I spent some time to adapt the metadata queries to work with recent Postgres versions, and was slightly amused by the quality of my 15 year old code… The best feeling about this little tool was to rediscover how useful a few dozen lines of code can be. The service provider monitoring helps, but doesn’t even come close to the level of detail pgtop can provide.

After getting pgtop to work again, I quickly added a few more useful features. I was pleased by the efficiency with which I could work on this tool, considering its age.

So far I added just what was strictly necessary to me:

  • Updated pgtop to the current decade. Now requires perl >= 5.014
  • Fixed to work with Postgres >= 9.0
  • Added a sample Dockerfile to build and run pgtop as Docker container
  • Added a --config option, to load arbitrary config files. This is useful if you want to monitor several databases at once, for example in a tmux session. The config file supports all the options that are available on the command line.
  • Implemented a query killer command, activated pressing K to kill at once all queries slower than a given threshold, in seconds. This is useful if the database is overwhelmed by a lot of slow queries. I don’t recommend using it, particularly if it involves killing UPDATE or INSERT queries, but it can be quite useful.
  • Added a --slow_threshold option, to consider queries slow if they have been running for longer than the given value (in seconds). Now the tool highlights slow queries in bold yellow, and logs all the slow queries to a pgtop.log file.
  • Added a --slack_webhook option, to automatically notify a slack channel if a query crosses the slow threshold runtime value. All the information about the slow query including the SQL will be included in the slack message.

Please let me know if you give it a try! :-)

Five tips to be a more effective command line user

In the movies, heroes manipulate complex graphics environments using only their keyboard; no mouse in sight. Descending from the movies realm to reality, the command line, not the GUI, is where heroes save the day.

This article intends to be helpful for those who are already command line (CLI) users. Complete beginners are of course encouraged to read on, even though they may not grasp all the advantages immediately and perhaps there is a lot of other more important things to learn when starting. On the other hand, I expect long time CLI users to already work similarly. I do hope they might also find interesting tricks to adopt.

The motivation

First of all, why be more effective? Not everyone wants to, and that is fine. These tips contribute to two primary objectives:

Looking back on everything I tried over the years, I'd like to illustrate those tips that I believe brought me the most "bang for the buck", the most value with the smallest effort and the ones that are more easily applicable to anyone else.

Assumptions

I'm going to assume you are using bash on Linux or MacOSX. A recent MacOSX install shouldn't be too different. Windows has bash too these days, so hopefully these suggestions will be widely applicable.

I'm also going to assume that you, reader, already know how to comfortably move around the CLI (CTRL + A, CTRL + E, CTRL + W, …), recall past commands (!<nnn>, !!, CTRL + R) or arguments (ALT + .). There is enough material to write other posts about this. Let me know and I'll be happy to write it!

1. Shell History

Here we go, first recommendation: use your shell history capabilities.

If you are not already doing that, you can search through your shell history — all the commands you have typed — with CTRL + R. However, the default configuration for bash only keeps up to a certain number of commands.

Computers and hard drives being what they are in 2020, there's no reason why you shouldn't extend your shell history to record all commands you have ever typed from the very beginning of your system history. When I setup a new computer, I normally copy over all my $HOME files, so my command history extends, time-wise, well beyond the system I am writing this on.

My shell command history starts in October 2015, when I first learned this trick. Here's how to do it:

# /etc/profile.d/extended_history.sh

# Show the timestamp for each entry of the history file
export HISTTIMEFORMAT="%Y-%m-%dT%H:%M:%S "

# Ensure the history file size and entry number is large
# enough to record years upon years of history
export HISTFILESIZE=500000000
export HISTSIZE=50000000

At least on Debian and derivative systems, dropping a file into /etc/profile.d/ makes it part of the system-wide profile settings, so that is a handy way of applying those settings to all users.

As a result, the history command will work as before, but the numeric index of each command will not reset every time you open a new shell, or every time the history file gets over a certain size, either in number of entries or in file size.

Here's how the history command output looks like with those settings:

23  2015-10-06T19:51:30 git diff
24  2015-10-06T19:51:33 git add locale/en/LC_MESSAGES/django.pot
25  2015-10-06T19:51:49 git status -uno
26  2015-10-06T19:51:51 git commit -a
27  2015-10-06T19:52:11 git push
28  2015-10-06T20:11:35 make test-recommender_translations
29  2015-10-07T18:53:33 vim ~/notes/recsys/impressions-tracking.txt

At the moment, my shell history file (~/.bash_history) is almost 7 Mb, corresponding to a little less than five years worth of commands. There is really no risk of running out of disk space, so keep those commands around.

Keeping a full history has obvious advantages:

  • If you don't remember how you did something or specific options to a command, you can always use history | grep xyz (or CTRL + R) to find out, and all the commands from months (or years!) back will be there. Obviously this does not apply retroactively :-)
  • If you remember only when you did something but not what it was, it's also easy to grep for specific dates and times.
  • You can easily analyze your shell usage patterns, for example finding what are the top 50 shell commands you have ever used:
$ history \
    | awk '{ print substr($0, length($1 $2) + 3) }' \
    | sort | uniq -c \
    | sort -rn \
    | head -50

# on one line:
$ history | awk '{ print substr($0, length($1 $2) + 3) }' | sort | uniq -c | sort -rn | head -50

In order, those lines do the following:

  1. history: take all history entries
  2. awk ...: remove the entry numeric index and timestamp, to only display the command itself and all the arguments
  3. sort | uniq -c: count the number of occurrences for all the distinct entries
  4. sort -rn: reverse sort numerically all the commands
  5. head -50: take the first 50 commands

If you are confused by all these commands, don't worry too much about them. It's just a way to count the most typed commands in your history.
As a curiousity, here's some of my top commands:

13071  ls -l
 7422  git diff
 6338  git status
 3469  cd ..
 2219  git push
 1816  git pull
 1499  git commit -a
 1367  git log
  940  git commit
  851  gpr
  687  gcs
  400  srdm platf
  348  vimp
  333  l1
  314  srdm merl
  306  dcu
  302  mp;rl-f
  206  gce
  196  realias
  169  gcm
  153  mptr;rl-f
  152  gc-

2. Fast Directory Changes

One of the most frequent operations on the command line is moving among directories, with the cd built-in command.

Especially if you've worked for a long time on many projects, or if you work with Java, you tend to have a lot of directory levels nested quite deeply. The cd commands is then tedious to type. Using tab to invoke your shell autocomplete comes in handy, but not having to type at all can easily beat that.

This is a trick I learned from Damian Conway's Productive Programmer course. He was in Oslo a few years ago, and with the help of my company we organized to have him hold this course internally.

The idea is to use a bespoke shell (or Perl, Python, Node, …) script, to quickly navigate to any directory. Example. Currently I am working on a project called merlin, whose parent directory is ~/src/work. Every time I want to do something in this project, I have to:

cd ~/src/work/merlin

Within that project, there are a bunch of directories, so you could end up writing something like:

cd ~/src/work/merlin/gameserver/prototype/java/src/

The idea is to construct a program that can do the "typing" for you, so you'd use the following command instead:

cd2 src w merl g p j s

I called it cd2 but you can call it however you like of course. This program should:

  • take as input a list of string arguments
  • try to expand them to the closest directory name entry
  • if a directory is found, navigate to it
  • take the next argument and repeat this cycle

When this is done, your shell will be left into the target directory of your choice without any long typing or waiting for autocomplete misfired tabs.

I chose to implement my script in bash for simplicity and call it ~/bin/search-directory.sh. The code is almost trivial and here it is in its entirety:

#!/bin/bash
#
# Search through the home directory space based on the
# list of string arguments given as input.
# From an idea by Damian Conway.
#

# Start from my home directory
SRC_DIR=~

# Take all arguments given as input
ARGS="$*"

# For each argument, try to expand it into the nearest directory name
# at the current level
for dir in $ARGS ; do
    sub=$(find -L "$SRC_DIR/" -mindepth 1 -maxdepth 1 -type d -name "$dir*" \
        | sort \
        | egrep -v '\.egg-info' \
        | head -1)
    if [ ! -z "$sub" ]; then
        # We found a subdir, search will proceed from there
        SRC_DIR=$sub
    else
        # Stop: we didn't find any matching entry.
        exit 1
    fi
done

echo "$SRC_DIR"

exit 0

One could clearly do better than this by employing more sophisticated logic. Initially I thought I'd need better, but this simple script has served me well for the past years, and I don't want to complicate it unnecessarily.

There is one more obstacle to clear though. The script will print the final directory match and exit, without affecting the parent shell's current directory.

How to make the active shell actually change directory? I added a tiny function to my ~/.bashrc file:

# `srd` stands for 'source directory'
srd () {
    match=$(~/bin/search-directory.sh src $*)
    if [ ! -z "$match" ]; then
    echo "→  $match"
        cd "$match"
    fi
}

I made the function always supply the src directory by default, so I don't have to type that either. With these bits set up, you can then move to the example directory above with the command:

srd w merl g p j s

And this is just the beginning :-)
Read on for how to combine this technique with the power of aliases and shorten the command even more.

3. Aliases

Shell aliases are a simple way to define or redefine commands.
The typical example would be to shorten common options for your commands. If you know you always type ls -la, you might want to teach that to your shell. The way to do that is:

$ alias ls='ls -la'

From then on, every time you type ls, your shell will automatically expand the command to ls -la.

Based on what I have seen during my career, shell aliases are something that relatively few people are using. Right now, my shell configuration contains almost 500 lines of aliases, of which around 200 I keep active and probably 30-50 I normally use.

I wasn't always such a heavy alias user. I became one when I had the fantastic experience to work with the Fastmail team in Australia for a period of a few months. I learned how they were doing infrastructural and development work and from the first day I saw they were using a ton of shell commands that were completely obscure to me.

I was quite good at operations/sysadmin work, but after seeing how that team worked, the bar was forever raised and it sank in that I had still a lot to learn. I still do :-)

I use aliases for many things, but mainly to not have to remember a lot of unnecessary details. Here's a few from my list:

Alias Expanded command What/why
less less -RS shortening and options expansion. -RS is to show ANSI color escapes correctly and avoid line wrapping
gd git diff shortening
gc- git checkout - switch to the previous git branch you were on
vmi vim saver for when I type too quickly
cdb cd .. cd "back"
cdb5 cd ../../../../../ to quickly back out of nested directories
kill-with-fire killall -9 for those docker processes…
f. find . -type f -name find file names under the current directory tree
x1 xargs -I{} -L1 simplify using xargs, invoking commands for each line of input f.ex.
awk<n> awk '{ print $<n> }' for when you need to extract field number from a text file or similar. Ex.: awk5 < file extracts the 5th field from the file
vde1 ssh varnish-de-1.domain.com host-based alias. I don't want to have to remember hostnames, so I add aliases instead, with simple mnemonic rules, such as vde1 -> varnish node 1 in the german cluster
jq. jq -C . when you want to inspect JSON payloads, f.ex. curl https://some.api | jq.
dcd docker-compose down is anybody really typing docker-compose ?
dcp docker-compose pull
dcu docker-compose up
dkwf docker-kill-with-fire shorthand for docker stop + docker rm or whatever sequence of commands you need to stop a container. See? I don't have to remember :-)
db docker-bash db postgres instead of docker exec -it container-id bash
dl docker-logs same for docker logs -f ...

Some aliases that I have added thinking they'd be useful I have rarely used. Some have become a staple of my daily CLI life. Sometimes, if a new alias catches on only depends on the first few days. If you make a mindful effort to use it, there's a good chance it will stick (if it's actually good).

To make aliases persistent, instead of typing the alias command in your shell, you can add it to your ~/.bashrc file as you can with any other command. You can also create a ~/.aliases file and keep all your aliases there. If you do that, you then need to include your aliases file in your bash configuration. You do that by adding (only once) this line to your ~/.bashrc:

# ~/.bashrc
...
source ~/.aliases

Every time you feel the need to add a new alias, you can simply edit the ~/.aliases file and reload it into your active shell (source ~/.aliases). When you get tired of that, you can use another trick from Conway's course, and add the last alias you will ever need:

alias realias="${EDITOR:-vim} ~/.aliases; source ~/.aliases"

Typing realias will bring up the alias file in your editor and when you save it and exit, all the new aliases will be immediately available in the parent shell.

Once you start down this path, your creativity won't stop seeing new ways to work smarter and faster.

4. Directory Autorun

This is one of the most recent additions to my arsenal. I found myself typing the same commands over and over whenever I entered specific directories.

The idea is then simply to have a sequence of commands automatically executed for me whenever I enter a directory. This is extremely useful in many occasions. For example, if you want to select a specific Python virtualenv, a Node.js version or AWS profile whenever you enter a specific directory.

I chose to do this by dropping an .autorun file in the target directory. Here's a tiny .autorun I have in a Javascript-based project:

#!/bin/bash
REQUIRED="v11.4.0"
CURRENT=$(nvm current)

if [ "$CURRENT" != "$REQUIRED" ]; then
    nvm use $REQUIRED
fi

In this case I want the shell to automatically activate the correct node.js version I need for this project whenever I enter the directory. If the current version, obtained through nvm current, is already the one I need, nothing is done.

It's quite handy, and I immediately got used to it. I can't do without it now. Another example, to select the correct AWS credentials profile and Python virtualenv:

#!/bin/bash

if [ -z "$AWS_PROFILE" -o "$AWS_PROFILE" != "production" ] ; then
    export AWS_PROFILE=production
    echo "— AWS_PROFILE set to $AWS_PROFILE"
fi

if [ -z "$VIRTUAL_ENV" ] ; then
    source .venv/bin/activate
    echo "— Activated local virtualenv"
fi

The glue to make this work is a couple of lines added to your ~/.bashrc file:

# Support for `.autorun` commands when entering a directory
PROMPT_COMMAND+=$'\n'"[ -s .autorun ] && source ./.autorun"

If you are concerned other users could use your machine, or even in general if you like to keep things tidy, ensure you set appropriate permissions for these .autorun files. A chmod 0600 .autorun could be in order.

Remember to run source ~/.bashrc if you make changes to that file, or they won't immediately reflect on your active shell session.

5. SSH Configuration

SSH is one of the most powerful tools in your arsenal. It can be used to tunnel, encrypt and compress data for connections to arbitrary protocols. I'm not going to cover that functionality here. There are good tutorials out there already, such as this and this.

A smart ssh configuration can help you be more effective on the command line. I'd like to show three specific examples that I use every day:

  1. Persistent ssh connections
  2. Hostname aliases
  3. Automatic ssh key selection

Persistent ssh connections

If you connect to remote hosts often, I'm sure you have noticed the amount of time it takes to establish a new ssh connection. The higher the latency, the longer it takes. It is normal for that initial handshake — where a lot of things happen — to take 2 to 5 seconds.

Performing many small operations via ssh can waste a notable amount of time. One solution to this problem is the transparent use of persistent ssh connections.

The connection is established the first time you ssh (or scp) to a host, and next time you perform a similar operation towards the same host and port, the same TCP/IP connection will be used. This implies that the connection remains active after the ssh command has completed.

The ssh configuration directives that enable this behaviour are the following:

# Normally this is in ~/.ssh/config
ControlMaster auto
ControlPath /var/tmp/ssh_mux_%h_%p_%r
ControlPersist 1h

ControlMaster auto enables this behaviour automatically, without you having to specify whether you want to use shared connections (the ones already opened from before) or not. In particular cases, you may want to specify ControlMaster no on the command line to prevent ssh from using an already open connection. Generally this is not desired though, so ControlMaster auto will normally do what you want.

ControlPath is the filename template that will be used to create the socket files, where:

  • %h is the hostname
  • %p is the port number
  • %r is the username used to connect

ControlPersist is the option that determines how long the connections will stay shared waiting for new clients after being established. In my case, I set it to 1h (one hour) and that works well for me.

In case you want to know more about ssh configuration, I recommend reading the related man page. On linux, that is available with:

man 5 ssh_config

Hostname aliases and key selection

I mentioned I want to get unnecessary details out of my memory as much as possible. The ssh configuration file has lots of useful directives. One of these is the per-host configuration blocks.

If you need to connect to a host quite often and its name is not particularly memorable, like an AWS or GCP hostname, you can add host-specific directives to your ~/.ssh/config file:

# ~/.ssh/config

...

Host aws-test
    Hostname 1.2.3.4
    User my-username

From then on, you can use the command ssh aws-test to connect to this host. You won't have to remember the IP address, or the username you need to use to connect to this host. This is particularly useful if you have dozens of hosts or even projects that use different usernames or host naming schemes.

When you have to work with different projects, it's good practice to employ distinct ssh key-pairs instead of a single one. When you start using ssh, you have a ~/.ssh/id_rsa (or ~/.ssh/id_dsa) file, depending on the type of key and an associated ~/.ssh/id_rsa.pub (or ~/.ssh/id_dsa.pub).

I like to have several key-pairs and use them in different circumstances. For example, the key that is used to connect to a production environment is never the same used to connect to a staging or test environment. Same goes for completely different projects, or customers if you do any freelance work.

Continuing from the example above, you can tell ssh to use a specific private key when connecting to a host:

Host aws-test
   Hostname 1.2.3.4
   User my-username
   IdentityFile ~/.ssh/test_rsa

Host aws-prod
   Hostname 42.42.42.42
   User my-username
   IdentityFile ~/.ssh/prod_rsa

Host patterns work too:

Host *.amazonaws.*
   User my-aws-username
   IdentityFile ~/.ssh/aws_rsa

Host *.secretproject.com
   User root
   IdentityFile ~/.ssh/secret_rsa

Final tip

The more I write, the more it feels there is to write about the command line :-) I'll stop here for now, but please let me know if you'd like me to cover some more basic — or maybe more advanced? — use cases. There are a lot of useful tools that can make you more effective when using the command line.

My suggestion is to periodically gather information about how you use the command line, and spend some time to reassess what are the most frequent commands you use, if there are new ways to perform the same actions, perhaps entirely removing the need to type lots of commands.

When I have to do boring, repetitive tasks, I can't help but look into ways to get myself out of those. Sometimes writing a program is the best way to automate those tasks away. It may take more in the beginning, but at least I managed to transform a potentially boring task into programming, of which luckily I'm never bored :-)

Fast VCL checks for personalized backend responses

I’d like to talk about a problem I encountered a few years ago and one possible solution to it. This particular problem stuck with me for a long time for several reasons.
The first one is that at the time I considered the problem basically unsolvable. It would be like having a cake and eating it too, as the proverb goes. Another reason is that this problem had me spinning my wheels thinking about a solution for a good while.

Without any pretense of this being a particularly clever solution or anything like that, I’d like to illustrate what the general problem is and a possible solution I came up with. Hopefully this will be useful to you.

The general problem

Suppose you have a backend request of some sort, an API or a particular web page. In my case it was a json-based recommendations API, which returned a list of recommended news articles to read. The specific purpose of the request is not terribly important. What’s more important is the fact that this request can be personalized depending on the user that makes the request. I believe this is a quite common scenario.

In a recommendations context, it’s also common for a user not to be signed in to the service, or to be invoking the API for the first time. In this case, the recommendations engine does not have any previous information about the user, also called the cold start case.

In this specific project, we had operated in a “permanent cold start mode”, meaning the recommendations we were offering were never differentiated per user. There were a few knobs and settings to influence which type of recommendations one would get from the system (f.ex. less Sports articles and more Arts or Travel), but the system would not learn over time or change its recommendations based on user signals like articles read.

Among other things, this mode of operation allowed us to serve our entire userbase (around 90M monthly active users, around 10M weekly) with only two servers per data-center, also thanks to a very aggressive caching strategy.

When we started experimenting with personalized recommendations, it was immediately clear that we would not be able to handle the additional backend load caused by all the per-user requests. We estimated that, given the cache hit ratio drop, we would need something ridiculous like 50x the amount of servers. For each API request, we would have to:

  1. fetch the distinct user profile
  2. check if the profile contained any information about previously read articles or otherwise useful information to personalize the offered recommendations
  3. compute and return the personalized recommendations

These steps can only be performed by the recommendations engine backend. This implies that we would not be having any help from our caching in Varnish, which made personalized recommendations much harder to implement for us, at least without employing inordinate amounts of servers and having to significantly rebuild our system infrastructure.

You could very well say that that is a problem in itself, and it probably is :-)

A possible solution

I remember spending quite some time thinking about this, not seeing any possible solution. One day I attended a meetup. One of the engineers there talked about the Varnish API engine. The API Engine is a commercial Varnish add-on that can implement authentication and paywalls directly in the caching layer. The person talking about this mentioned how API engine embedded the SQLite3 database, and how this was crucial to the performance of it, since the caching layer is effectively the first bottleneck of a system.

I connected the dots almost immediately and I realized I had a possible way forward to solve my problem. This is how I imagined I could approach the problem:

  • organize user signals collection (what articles each user is reading, etc…) and user profile building as a completely separate batch activity
  • every x number of hours, build a sqlite database with a single table, user_profiles, consisting of two columns, a user_id string and a has_profile boolean. With such table in place, looking up whether we can build a significantly personalized recommendations set for a user is a only an SQL primary-key lookup away.
  • Using the excellent SQLite3 vmod, implement this SQL lookup in our existing Varnish VCL layer. Make sure that for every possible case this code never fails. For example, if the database file does not exist, or the file is for some reason corrupt, etc… we want to behave as if the particular user for the running request had no personalized profile.
  • Ensure that we would be able to update the SQLite database file at any time, without stopping Varnish, and the new file would be visible to the SQL queries immediately or at least after a short delay.

We tested the whole assembly and it seemed to work correctly. The final step consisted in actually computing the personalized profiles, building the real SQLite database, syncing it to the backend systems, and performing the dispatch logic in the VCL layer.

This is more or less the final logic I used:

  • If the request was for an anonymous user, don’t even perform the user profiles SQL lookup, and return the generic recommendations cached payload.
  • If the request comes from a user that has no personalized profile, that is, no record is present in the SQLite table, also return the generic recommendations payload.
  • If the user profiles lookup is positive, that is, a record exists in the user profiles table in SQLite and its has_profile flag is true, then pass the request on to the backend. We know it is a request that must be personalized and only the backend can do that.

Using such logic allows to serve the majority of your user base, which presumably has not logged in, or does not have any significant user profile yet, caching as much as possible. But it also allows personalized recommendations for all users that do have a profile.

We are shifting the critical decision as early in the chain as possible, that is, in your caching layer, either Varnish or similar, before the backend service is even consulted. Taking the decision to the backend service would not be feasible for the reasons already discussed.

The actual code

We used Puppet as configuration management tool back then, with a custom varnish module. I extended the existing manifest to add a new user_profiles.vcl file and to install by default the sqlite3 vmod for Varnish.

The existing VCL code was also modified to:

  • perform the personalized profile SQL query
  • decide whether to pass the request based on the result of the SQL query

The following code illustrates those two steps:

diff --git a/config.vcl b/config.vcl
index 8e25a8a..50c70ce 100644
--- a/config.vcl
+++ b/config.vcl
@@ -1,22 +1,23 @@
 # Recommender system VCL config

 include "/etc/varnish/accept-encoding.vcl";
 include "/etc/varnish/purge.vcl";
 include "/etc/varnish/x-forwarded-for.vcl";
 include "/etc/varnish/auth.vcl";
 include "/etc/varnish/stats.vcl";
+include "/etc/varnish/user_profiles.vcl";
 include "/etc/varnish/strip-tracking-cookies.vcl";

 backend apache {
     .host  = "127.0.0.1";
     .port  = "8000";
     .probe = {
         .url       = "/ping.html";
         .interval  = 10s;
         .timeout   = 5s;
         .window    = 20;
         .threshold = 3;
         .initial   = 3;
     }
 }
@@ -147,45 +148,49 @@ sub vcl_recv {
     if (req.backend.healthy && req.http.User-Agent ~ "McHammer") {
         return (pass);
     }

     # Client clicks must go through the backend (*with* client-id cookie)
     if (req.url ~ "^/api/1\.0/feedback/") {
         return (pass);
     }

     call check_authorization;
+    call check_user_profile;
     call accept_encoding_normalize;

+    # Users with tracking cookies can be served personalized results
+    if (req.http.X-Profile == "1") {
+        std.log("User has customized profile. Rolling the dice.");
+        # Initially keep the percentage of PASS very low, to test the
waters.
+        if (std.random(0, 100) < 1.0) {
+            std.log("User has customized profile and within 1.0%.
Passing.");
+            return (pass);
+        }
+    }

 }

The new user_profiles.vcl file consisted of the following code:

#-----------------------------------------------------------------------------
# Fast check for personalized user profiles
#-----------------------------------------------------------------------------
#
# The general idea is to use this fast check to send users who we know
# have a personalized user profile to the backend without caching, while
# retaining the ability to send cached objects for everyone else.
#
# Uses a SQLite3 database and libvmod-sqlite3 by Federico Schwindt:
# https://github.com/fgsch/libvmod-sqlite3
#
# Extracts the `clientId' from the HTTP Cookie header.
# Looks up the profile_id key having value equal to the `clientId' cookie.
# The underlying schema is very simple:
#
#   CREATE TABLE user_profiles (
#       profile_id char(100) PRIMARY KEY NOT NULL,
#       data text
#   );
#
# At least initially we will not use the data column.

import sqlite3;

sub vcl_init {
    sqlite3.open("/etc/varnish/user_profiles.db", "|;");
}

sub check_user_profile {

    # Quick yes/no test for the clientId cookie
    if (req.http.Cookie ~ "userId=") {

        # Extract a userId value from the Cookie header,
        # which remains untouched. Make sure we can still extract a clientId
        # value even if there's other cookies before/after ours.
        #
        # XXX Not sure what happens when client sends multiple Cookie lines.
        set req.http.X-Profile-Id = regsub(req.http.Cookie,
            "(?:^|.*;\s*)(?:userId=(.*?))\s*(?:;.*|$)", "\1");

        # No need to do anything if userId hasn't been found
        if (req.http.X-Profile-Id != "") {
            #std.log("Checking profile_id: " + req.http.X-Profile-Id);

            # First case of VCL-injection vulnerability :-)
            set req.http.X-Profile = sqlite3.exec(
                "SELECT 1 FROM user_profiles WHERE profile_id='"
                + req.http.X-Profile-Id
                + "'");

            # req.http.X-Profile !~ "^SQL" to catch errors like missing DB,
            # but seems a bit fragile. Depends on libsqlite3 and/or the vmod.
            if (req.http.X-Profile == "1") {
                std.log("User profile " + req.http.X-Profile-Id
                    + " found (" + req.http.X-Profile + ")");
            }
            else {
                std.log("User profile " + req.http.X-Profile-Id
                    + " not found");
            }
        }
    }
}

The commit message

I believe that good solutions deserve awesome commit messages. Here’s what I wrote:

Date:   Thu Jan 28 19:36:46 2016 +0100

    Fast VCL check for personalized profile existence

    How to have the cake and eat it too. Serve cached objects to the majority of
    users while personalizing recommendations to the ones that actually have a
    significant user profile available.

    Got the idea from the Varnish API engine[1].

    It's possible to perform tens of thousands of sqlite database lookups a second
    while processing requests in Varnish through VCL, thanks to SQLite3 being very
    lightweight and in this case embedded right inside Varnish through the sqlite3
    vmod[2].

    This commit hopefully adds all there is to it. The last bit is obviously the
    database file, which I placed in `/etc/varnish/user_profiles.db'. We will need
    to generate the .db file from the clicker and sync it to all frontends.

    Updates seem to be received immediately.

    When no database file is present, as will be in the initial deployment, the
    `check_user_profile()' function will work normally, signaling that no custom
    user profile has been found.

    [1] https://www.varnish-software.com/products/varnish-api-engine
    [2] https://github.com/fgsch/libvmod-sqlite3

How to rollout gradually?

Another interesting aspect is the way we could “control the flow” to this personalized recommendations API, that is, deciding what percentage of users that had personalized profiles, would actually get personalized recommendations.

A gradual rollout would certainly be the best approach, and it was implemented in two different ways:

  • once the SQL lookup was performed and the result was positive, we would still “roll the dice” and only allow 1% (or 5%, 10%) to actually pass through to the backend as personalized recommendations. This was an additional safety measure.
  • when batch building the SQLite database, we could decide to curtail the amount of users with personalized profiles. For example, excluding all users that had not read at least 5 or 10 articles. This barrier served two purposes. It effectively limited the amount of users that would be included in the SQLite database and at the same time made sure we had accumulated significant user profile information before attempting to serve personalized recommendations. A sort of win-win I didn’t expect at first :-)

As usual, if you have any feedback, email me or write below (but comments are subject to approval due to lots of spam).

Long-lived JVM applications memory usage tuning

A few days have passed since the last blog post about jvm memory usage monitoring tools, and I have learned so much about patterns of JVM memory usage and magic flags to use to influence it. I still can’t call myself an expert, but judging from the corpus of stackoverflow posts about jvm and memory, at least I’m not totally clueless. :-)

EDIT: this post has now been further extended and published on the newly published Kahoot engineering blog.

JVM memory usage monitoring tools

During the last few days I looked into JVM memory monitoring tools.

That is one area of expertise that I am definitely not familiar with. The general approach in my career so far has been to avoid Java for long running processes :-) The only other (empirical) pearl of wisdom I remember from past experiences with Java™ is that the max heap memory should be less or equal to the max system memory divided by 2 (maxHeap <= totalMem / 2). I’ve seen this relation work nicely over the years for a fairly busy Solr search cluster with anywhere from 24 to 96Gb memory servers, where maxHeap was never higher than 8Gb though, as those servers had a lot of other processes running at the same time. Obviously if other processes are running concurrently on the system, the max heap size should be decreased accordingly.

When observing servers, it’s typical to monitor the “usable” memory (in Datadog, that would be system.mem.usable) which is a sum of the system current free and cached or buffered memory. That is, the total memory that the system can grab and use at any time if needed.

This measure doesn’t necessarily tell us which processes are using memory and why. We’ve also observed spikes of memory usage when log aggregation tools like filebeat are reading, parsing and shipping logs to the logging servers. It would be useful to start tracking how much memory is used by the specific merlin java process rather than looking at an aggregated memory metric.

Searching around, I found a few useful articles (among them https://www.pushtechnology.com/support/kb/understanding-the-java-virtual-machine-heap-for-high-performance-applications/) and several tools that helped dig deeper and extract more information about what is happening on the gameservers with regards to memory usage. I’d like to mention them here for future reference and to collect eventual feedback from others.

jvmtop

URL: github.com/patric-r/jvmtop

Simple console application that provides high level stats about heap usage and garbage collection CPU usage. Here’s an example from the documentation page:

JvmTop 0.8.0 alpha   amd64  8 cpus, Linux 2.6.32-27, load avg 0.12
 https://github.com/patric-r/jvmtop

  PID MAIN-CLASS      HPCUR HPMAX NHCUR NHMAX    CPU     GC    VM USERNAME   #T DL
 3370 rapperSimpleApp  165m  455m  109m  176m  0.12%  0.00% S6U37 web        21
11272 ver.resin.Resin [ERROR: Could not attach to VM]
27338 WatchdogManager   11m   28m   23m  130m  0.00%  0.00% S6U37 web        31
19187 m.jvmtop.JvmTop   20m 3544m   13m  130m  0.93%  0.47% S6U37 web        20
16733 artup.Bootstrap  159m  455m  166m  304m  0.12%  0.00% S6U37 web        46

where the various columns are:

PID = process id
MAIN-CLASS = the "jvm name" but often the entry point class (with used main() method)
HPCUR = currently used heap memory
HPMAX = maximum heap memory the jvm can allocate
NHCUR = currently used non-heap memory (e.g. PermGen)
NHMAX = maximum non-heap memory the jvm can allocate
CPU = CPU utilization
GC = percentage of time spent in garbage collection (~100% means that the process does garbage collection only)
VM = Shows JVM vendor, java version and release number (S6U37 = Sun JVM 6, Update 37)
USERNAME = Username which owns this jvm process
#T = Number of jvm threads
DL = If !D is shown if the jvm detected a thread deadlock

Useful to get a quick glance at a few critical parameters. I have tested the most recent version (0.9.0) and the compilation from the source code was quick and easy.

jvm-mon

URL: github.com/ajermakovics/jvm-mon

Another console application, but a bit more sophisticated than jvmtop. It also displays trends as it’s meant to be run for a longer period of time. I think this is the appropriate level of fancy I like :-)

I quite like jvm-mon, it’s clear and the data is easy to understand. The charts resize dynamically based on how long you keep it running.

jps_stat

URL: github.com/amarjeetanandsingh/jps_stat

A simpler shell script that displays more or less the same information shown by jvmtop and also keeps running and updating the stats. AFAIK, it’s not possible to run it “one-shot”. That would be useful to build our own metric monitoring.

This is what jps_stat looks like when executed:jps_stat console screenshot

jstat

https://docs.oracle.com/javase/7/docs/technotes/tools/share/jstat.html

Last of the lot is jstat which is part of the JVM distribution. Usage is very simple and it can easily be embedded in a one-liner or a shell script, as in:

watch -d -n1 jstat -gc $(pidof java)

`jstat -gc` outputs garbage collection statistics, some of which I still haven’t understood the purpose. A sample output is the following:

Every 1.0s: jstat -gc 12743                                                                                                                                                                Thu Jun  6 09:33:23 2019

 S0C    S1C    S0U    S1U      EC       EU        OC         OU       MC     MU    CCSC   CCSU   YGC     YGCT    FGC    FGCT     GCT
34048.0 34048.0 10229.3  0.0   272640.0 37062.4  2731264.0  1519963.6  32192.0 30728.8 3804.0 3510.3  34926  873.839  26      2.781  876.620

These metrics are a bit more detailed than just looking at heap usage as a whole. I’m not really sure I need to look into all of this specifically.

Conclusion

Not much to say, but for now it was useful to get a bit more details of how the server is running, how much heap memory is consumed, how much non-heap memory is used, and how the total memory used by the java process itself compares to the total memory used on the server.

Plus, it’s handy to write down these things so I can refer back to them when needed :-)

My Retro Arcade Videogames Screensaver for Linux

Screensavers are magic, they’re like magnets, at least to me. I’ve always been fascinated by them. There’s great creativity and programming energy that goes into making one.

I think the first screensavers I encountered were probably the ones written for Windows 3.1, even though the loading colored bars of the Commodore 64 could perhaps qualify as a screensaver as well.

One of the greatest screensavers I remember is the one that came with Novell Netware servers. This was around 1997, and it looked approximately like this video:

This is a work of art. It was a text-mode screensaver, and it displayed a few “snakes” going around the screen. Each snake represented a processor. Multi-core processors were very much in the future back then, but we were running our Netware server with a SMP machine and two Pentium Pro processors, which was pretty awesome for the time.

If the processor occupancy was high, the snake would move faster and at the same time, IIRC the higher the number of processes active on each processor, the longer the snake. Simple and brilliant. With a simple look at the screen, you could instantly tell how much the system was loaded, and that was very important, since that Netware machine enabled hundreds of employees to do their daily job.

I have used Linux as my workstation operating system for at least fifteen years now, and jwz’s xscreensaver package has always been present and very appreciated, so much so
that one day I started my own screensaver project for fun.
I decided to rewrite the Netware screensaver obviously and
I wrote about it in the past.

All this to explain that screensavers are a thing for me :-)

Fast forward to now, I think I found the next level of screensaver goodness, and I’m quite satisfied of it. It is a retro-arcade-videogames screensaver. The craze started when I changed job and started using a new Linux workstation. I noticed all my colleagues running MacOS running this rather stunning aerial video footage as screensaver.

I soon discovered that it’s possible to run the same on Linux, and there’s a project called xscreensaver-aerial that you can run as part of xscreensaver. While that is very cool,
it became boring quickly. The idea of xscreensaver-aerial however is neat. You can display your own video files in fullscreen, just fill up a folder with video files and when xscreensaver is invoked, one of those videos will be played. The shell script linked above does this, with the help of mpv, a nice video player for Linux, even though in a slightly convoluted way.

The following step was obvious: use mpv and adapt the shell script to do what I wanted. Pick random video files from a folder. I started downloading longplay videos of my favourite video games. For example, Bomb Jack or Space Ace, all the rage 80’s arcades or Amiga games, for
that nice nostalgia factor.

Started using this version for a while. I run a two monitor setup, one in portrait mode for code and text editing, and another one in landscape mode for browser and every other need.

Sometimes the screensaver would play the same video on both monitors (lame) or play a video with an inappropriate aspect ratio for the monitor, filling the screen with vertical or horizontal black bars.

Thus, last improvement was to automatically detect the aspect ratio of each monitor and pick from either a `landscape` or `portrait` video folder.

The trick I used to do this is nothing else than some shell code:

 #
 # Overwrites $MODE with either `landscape` or `portrait`
 #
 check_monitor_mode() {
     MODE="landscape"
     local tmpfile="$(mktemp)"
     logger -t xscreensaver "xwinid: $XSCREENSAVER_WINDOW"
     xwininfo -id "${XSCREENSAVER_WINDOW}" > "${tmpfile}"
     local width=$(grep "Width: " "${tmpfile}" | awk '{ print $2 }')
     local height=$(grep "Height: " "${tmpfile}" | awk '{ print $2 }')
     unlink "${tmpfile}"
     if [ $width -gt $height ]; then
         logger -t xscreensaver "landscape mode (${width}x${height})"
     else
         logger -t xscreensaver "portrait mode (${width}x${height})"
         MODE="portrait"
     fi
 }

The command to play the video, which must be saved as an xscreensaver module, is:

mpv --really-quiet --no-audio --fs --no-stop-screensaver --wid="$XSCREENSAVER_WINDOW" "${themovie}" &

By customizing ~/.xscreensaver it’s possible to add your own executable programs, give them a name, etc, so for example my ~/.xscreensaver contains the following:

...
 programs:
 - "Apple TV Aerial" atv4                 \n\
 - retro-arcade-screensaver               \n\
 - "Novell Netware Screensaver" loadsnake \n\
 ...

Here’s the final result:

How to profile Python/Django applications

Using the django-profiler module available at https://github.com/char0n/django-profiler

from profiling import profile

and later:

@profile(stats=True)
def view_to_be_profiled(self, request):
    ...

There’s a couple of settings that it’s possible to enable to view SQL queries and tweak the logger name:

PROFILING_SQL_QUERIES = True
PROFILING_LOGGER_NAME = 'profiler'

Just a reminder for myself. Nothing more.

MySQL (Percona XtraDB) slave replication crash resilience settings

It’s been a geological age since my last blog post!

Oh, so many things happened in the meantime. For the past four years, I worked on the development and operations side of the news recommendation system that powered Opera Discover. With enough energy, I have planned to write a recommender systems “primer” series. We’ll see.

Meanwhile, I’d like to keep these notes here. They’ve been useful to make MySQL replication recover gracefully from network instability, abrupt disconnections and generally datacenter failures. Here they are.

Coming MySQL 5.6 on Debian Wheezy, we began to experience mysql replication breakages after abrupt shutdowns or sudden machine crashes. When systems came back up, more frequently then not, mysql replication would stop due to corrupted slave relay logs.

I started investigating this problem and soon found documentation and blog posts describing the log corruption issues and how mysql development addressed that. Here’s the pages I used as references:

Additionally, we had (I believe unrelated) problems with some mysql meta tables that couldn’t be queried, even though they were listed as existing in the mysql shell and in the filesystem.
We solved this problem with the following steps:

DROP TABLE innodb_table_stats;
ALTER TABLE innodb_table_stats DISCARD TABLESPACE;
stop mysql
rm -rf /var/lib/mysql/mysql/innodb_table_stats.*
restart mysql

These steps have to be executed in this order, even if altering a table after having dropped it may seem nonsensical. It is nonsensical, as sometimes mysql things are.

Crash safe replication settings

We’ve distilled a set of standalone replication settings that will provide years and years of unlimited crash-safe replication fun (maybe). Here they are:

# More resilient slave crash recovery
master-info-repository = TABLE
relay-log-info-repository = TABLE
relay-log-recovery = ON
sync-master-info = 1
sync-relay-log-info = 1

Let’s see what each of these settings does.

master-info-repository=TABLE and relay-log-info-repository=TABLE instruct mysql to store master and relay log information into the mysql database rather than in separated *.info files in the /var/lib/mysql folder.
This is important because in case of crashes, we would like to ensure that master/relay log information is subject to the same ACID properties that the database itself provides. Corollary: make sure the relevant meta tables have InnoDB as storage engine.
For example, a SHOW CREATE TABLE slave_master_info should say Engine=InnoDB.

relay-log-recovery=ON is critical in case of corruption of relay log files on a slave system. When MySQL encounters corrupted relay log files during startup, by default it will drop the ball and halt. This option set to ON, will cause mysql to attempt refetching the relay log files from the master database. The master should then be configured to keep its binlogs for a suitable amount of time (often I use 2 weeks, but really depends on the volume of database changes). As a test, it’s possible to replace the current relay log file with a corrupted copy (from /dev/urandom for example). MySQL will discard the corrupted log file and attempt download from the master, after which a regular startup will be carried out. Fully automatic recovery!

sync-master-info=1 and sync-relay-log-info=1 enable the synchronized commit of both master and relay log information to the database with every transaction commit. This is again something that must be evaluated in each single application. Most probably if you have a high volume of writes, you don’t want to enable it. However, if the writes rate is low enough, this option won’t cost any additional performance and should instead make sure that the slave_master_info and slave_relay_log_info tables are always consistent with the state of the replication and of the rest of the database.

That is all. I’d love to hear any feedback or corrections to this information.