Category Archives: Development

Deploying Large Deep Learning Models in Production

Most deep learning and machine learning (ML) articles and tutorials focus on how to build, train and evaluate a model. The model deployment stage is rarely covered in detail, even though it is just as important a part of an ML system, if not a fundamental one. In other words, how do we take a working ML model from a Jupyter notebook to a production ML-powered API?

I hope more and more practitioners will cover the deployment aspect of ML models. For now, I can offer my own experience about how I approached this problem, hoping this will be useful to some of you out there.

Creating a useful ML model

How to create a useful ML model is the part of the work I won’t cover in this post. :-)

I assume that you already have:

  • a model or pipeline that is either pre-trained or that you have trained yourself
  • a model based on PyTorch, though most of the information here will probably help with any ML framework
  • some idea on how to make your model available as a RESTful API

First step: defining a simple API

The rest of this article will use Python as a programming language, for various reasons, the most important being that the ML model is based on PyTorch. In my specific case, the problem I worked on was text clustering.

Given a set of sentences, the API should output a list of clusters. A cluster is a group of sentences that have a similar meaning, or as similar as possible. This task is usually referred to with the term “semantic similarity”.
Here’s an example. Given the sentences:

  • “Dog Walking: 10 Simple Steps”
  • “The Secrets of Dog Walking”
  • “Why You Need To Dog Walking”
  • “The Art of Dog Walking”
  • “The Joy of Dog Walking”
  • “Public Speaking For The Modern Age”
  • “Learn The Art of Public Speaking”
  • “Master The Art of Public Speaking”
  • “The Best Way To Public Speaking”

The API should return the following clusters:

  • Cluster 1 = (“Dog Walking: 10 Simple Steps”, “The Secrets of Dog Walking”, “Why You Need To Dog Walking”, “The Art of Dog Walking”, “The Joy of Dog Walking”)
  • Cluster 2 = (“Public Speaking For The Modern Age”, “Learn The Art of Public Speaking”, “Master The Art of Public Speaking”, “The Best Way To Public Speaking”)

The model

I plan to describe the details of the specific model and algorithm I used in a future post. For now, the important aspect is that this model can be loaded in memory with some function we define as follows:

model = get_model()

This model will likely be a very large in-memory object. We only want to load it once in our backend process and then reuse it across requests, for as many requests as possible. A typical model takes a long time to load: ten seconds or more is not unheard of, so we can’t afford to load it for every request. That would make our service terribly slow and unusable.
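
As a minimal sketch of such a loader (assuming a sentence-transformers model; the model name and the memoization are my own additions, not the actual project code), get_model() could be memoized so that repeated calls within a process never reload the model:

# A hypothetical get_model(), not taken from the original project
from functools import lru_cache

from sentence_transformers import SentenceTransformer

# Placeholder: substitute whatever model your pipeline actually uses
MODEL_NAME = "paraphrase-distilroberta-base-v1"

@lru_cache(maxsize=1)
def get_model() -> SentenceTransformer:
    # Loading is expensive (seconds), so within a single process the
    # model is only ever loaded once and then reused for every request
    return SentenceTransformer(MODEL_NAME)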

A simple Python backend module

Last year I discovered FastAPI, and I immediately liked it. It’s easy to use, intuitive and yet flexible. It allowed me to quickly build up every aspect of my service, including its documentation, auto-generated from the code.

FastAPI provides a well-structured base to build upon, whether you are just starting with Python or are already an expert. It encourages the use of type hints and of model classes for each request and response. Even if you have no idea what these are, just follow FastAPI’s good defaults and you will likely find this way of working quite neat.
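
For the clustering API described above, for example, those model classes might look like the following sketch (the ClusterRequest and ClusterResponse names and fields are mine, not taken from the actual project):

from typing import List

from pydantic import BaseModel

class ClusterRequest(BaseModel):
    # The sentences to be grouped by semantic similarity
    sentences: List[str]

class ClusterResponse(BaseModel):
    # Each cluster is a list of sentences with (roughly) the same meaning
    clusters: List[List[str]]

FastAPI will validate incoming requests against the request model and include both models in the auto-generated documentation.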

Let’s build our service from scratch. I usually start from a python virtualenv, an isolated python environment where you can install your dependencies.

virtualenv --python /usr/bin/python3.8 .venv
source .venv/bin/activate

If you are not familiar with virtualenv, there are many tutorials you can read online.
Next step, we write our requirements file, with all the python modules we need to run our project. Here’s an example:

# --- requirements.txt
fastapi~=0.61.1
uvicorn
gunicorn

Save the file as requirements.txt. You can install the modules with pip. There are plenty of guides on how to get pip on your system if you don’t have it:

pip install -r requirements.txt

Doing so will install FastAPI, along with uvicorn and gunicorn, which we will use to run the service. Let’s create our backend now. Copy the following skeleton API into a main.py file. If you prefer, you can clone the FastAPI template published at https://github.com/cosimo/fastapi-ml-api:

from typing import Optional

from fastapi import FastAPI

app = FastAPI()
# get_model() is the model-loading function discussed above; it needs to be
# defined in (or imported into) this module
model = get_model()

@app.post("/cluster")
def cluster():
    return {"Hello": "World"}

You can run this service with:

uvicorn main:app --reload

You’ll notice right away that any change to the code triggers a reload of the server: if you are using the production ML model, the model’s own load time will quickly become a nuisance. I haven’t managed to solve this problem yet. One approach I could see working is to mock the model results if possible, or to use a lighter model for development.

Invoking uvicorn in this way is recommended for development. For production deployments, FastAPI’s docs recommend using gunicorn with the uvicorn worker class. I haven’t looked into other options in depth, and there might be better ways to deploy a production service, but so far this has proven reliable for my needs. I did have to tweak gunicorn’s configuration for my specific case.

Running our service with gunicorn

The gunicorn start command looks like the following:

gunicorn -c gunicorn_conf.py -k uvicorn.workers.UvicornWorker --preload main:app

Note the arguments to gunicorn:

  • -k tells gunicorn to use a specific worker class
  • main:app instructs gunicorn to load the main module and use app (in this case the FastAPI instance) as the application code that all workers should be running
  • --preload causes gunicorn to change the worker startup procedure

Preloading our application

Normally gunicorn would create a number of workers, and then have each worker load the application code. The --preload option inverts the sequence of operations by loading the application instance first and then forking all worker processes. Because of how fork() works, each worker process will be a copy of the main gunicorn process and will share (part of) the same memory space.

Making our ML model part of the FastAPI application (or making our model load when the FastAPI application is first created) will cause our model variable to be “shared” across all processes!

The effect of this change is massive. If our model occupies 1 GB of RAM once loaded into memory, and we want to run 4 gunicorn workers, the net gain is 3 GB of memory that become available for other uses. In a container-based deployment it is especially important to keep memory usage low, and reclaiming 75% of the memory that would otherwise be used is an excellent result.

I don’t know enough about the details of PyTorch models or Python itself to understand how this sharing remains valid across the process lifetime. I believe that modifying the model in any way will trigger copy-on-write operations and ultimately cause the model variable to be copied into each process’s memory space.

Complications

Turns out we don’t get this advantage for free. There are a few complications with having a PyTorch model shared across different processes. The PyTorch documentation covers them in detail, even though I’m not sure I did in fact understand all of it.

In my project I tried several approaches, without success:

  • use torch.multiprocessing in the gunicorn configuration module
  • modify gunicorn itself (!) to use torch.multiprocessing to load the model. I did it just as a prototype, but even then… bad idea
  • investigate alternative worker models instead of prefork. I don’t remember the results of this investigation, but they must have been unsuccessful
  • use /dev/shm (the Linux shared-memory tmpfs) as the filesystem where the PyTorch model file is stored

A Solution?

The approach I ended up using is the following.

gunicorn must create the FastAPI application in order to start it, so I loaded the model (as a global) when creating the FastAPI application, making sure the model was loaded before the application instance was created, and loaded only once.

I added the preload_app = True option to gunicorn’s configuration module.

I limited the number of workers (my tests showed 3 to work best for my use case) and limited the number of requests each gunicorn worker serves, using max_requests = 50. I limited the requests because I noticed a sudden increase in memory usage in each worker some minutes after startup, at regular intervals. I couldn’t trace it back to anything specific, so I used this dirty workaround.

Another tweak was to give the gunicorn workers a longer-than-default time to start up, otherwise they would be killed and respawned by gunicorn’s own watchdog for taking too long to load the ML model on startup. I used a timeout of 60 seconds instead of the default 30.

The most difficult problem to troubleshoot was workers suddenly stopping and not serving any more requests after a short while. I solved that by not using `async` on my FastAPI application methods. Other people have reported this solution not working for them… This remains to be understood.

Lastly, when loading the PyTorch model, I call the .eval() and .share_memory() methods on it before returning it to the FastAPI application. This happens only on first load.

For example, this is what my model loading looks like:

from sentence_transformers import SentenceTransformer

def load_language_model() -> SentenceTransformer:
    # SOME_MODEL_NAME is the name (or path) of the sentence-transformers model in use
    language_model = SentenceTransformer(SOME_MODEL_NAME)
    # Inference-only mode, and shared memory so forked workers can reuse the same copy
    language_model.eval()
    language_model.share_memory()

    return language_model

The value returned by this method is assigned to a global loaded before the FastAPI application instance is created.
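
A minimal sketch of that arrangement (hypothetical and simplified, not the actual project code):

# --- main.py (sketch)
from fastapi import FastAPI

# load_language_model() is the helper shown above. Because this runs at
# import time, gunicorn's preload_app = True executes it once in the
# master process, and the forked workers share the already-loaded model.
language_model = load_language_model()

app = FastAPI()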

I doubt this is the way to do things, but I did not find any clear guide on how to do this. Information about deploying production models seems quite scarce, if you remember the premise to this post.

In summary:

  • preload_app = True
  • Load the ML model before the FastAPI (or wsgi) application is created
  • Use .eval() and .share_memory() if your model is PyTorch-based
  • Limit the number of workers and the number of requests each worker serves
  • Increase the worker start timeout period

Read on for other tips about dockerization of all this. But first…

Gunicorn configuration

Here’s more or less all the customizations needed for the gunicorn configuration:

# Preload the FastAPI application, so we can load the PyTorch model
# in the parent gunicorn process and share its memory with all the workers
preload_app = True

# Limit the amount of requests a single worker will handle, so as to
# curtail the increase in memory usage of each worker process
max_requests = 50

Bundling model and application in a Docker container

Your choice of deployment target might be different. What I used for our production environment is a Dockerfile. It’s easily applicable as a development option, but it is also good for production if you deploy to a platform like Kubernetes, as I did.

Initially I tried to build a Dockerfile with everything I needed, keeping the PyTorch model file as a binary in the git repository. The binary was larger than 500 MB, which required the use of git-lfs, at least for GitHub repositories. I found that to be a problem when building Docker containers from GitHub Actions: I couldn’t easily reconstruct the git-lfs objects at build time. Another shortcoming of this approach is that the large model file makes the Docker build context huge, increasing build times.

Two stage Docker build

In cases like this, splitting the Docker build into two stages can help. I decided to bundle the large model binary into a first-stage Docker image, and then build my application layer on top of it as stage two.

Here’s how it works in practice:

# --- Dockerfile.stage1

# https://github.com/tiangolo/uvicorn-gunicorn-fastapi-docker
FROM tiangolo/uvicorn-gunicorn-fastapi:python3.8

# Install PyTorch CPU version
# https://pytorch.org/get-started/locally/#linux-pip
RUN pip3 install torch==1.7.0+cpu torchvision==0.8.1+cpu torchaudio==0.7.0 -f https://download.pytorch.org/whl/torch_stable.html

# Here I'm using sentence_transformers, but you can use any library you need
# and make it download the model you plan using, or just copy/download it
# as appropriate. The resulting docker image should have the model bundled.
RUN pip3 install sentence_transformers==0.3.8
RUN python -c 'from sentence_transformers import SentenceTransformer; model = SentenceTransformer("")'

Build and push this container image to your Docker container registry with the stage1 tag.

After that, you can build your stage2 docker image starting from the stage1 image.

# --- Dockerfile
FROM $(REGISTRY)/$(PROJECT):stage1

# Gunicorn config uses these env variables by default
ENV LOG_LEVEL=info
ENV MAX_WORKERS=3
ENV PORT=8000

# Give the workers enough time to load the language model (30s is not enough)
ENV TIMEOUT=60

# Install all the other required python dependencies
COPY ./requirements.txt /app
RUN pip3 install -r /app/requirements.txt

COPY ./config/gunicorn_conf.py /gunicorn_conf.py
COPY ./src /app
# COPY ./tests /tests

You may need to increase the runtime shared memory to be able to load the ML model in a preload scenario.
If that’s the case, or if you get errors on model load when running your project in Docker or Kubernetes, you need to run docker with --shm-size=1.75G for example, or any suitable amount of memory for your own model, as in:

docker run --shm-size=1.75G --rm <command>

The equivalent directive for a helm chart to deploy in Kubernetes is (WARNING: POSSIBLY MANGLED YAML AHEAD):

apiVersion: apps/v1
kind: Deployment
metadata:
  ...
spec:
  ...
  template:
    ...
    spec:
      volumes:
        - name: modelsharedmem
          emptyDir:
            sizeLimit: "1750Mi"
            medium: "Memory"
      containers:
        - name: {{ .Chart.Name }}
          ...
          volumeMounts:
            - name: modelsharedmem
              mountPath: /dev/shm
          ...

A Makefile to bind it all together

I like to add a Makefile to my projects, to create a memory of the commands needed to start a server, run tests or build containers. I don’t need to use brain power to memorize any of that, and it’s easy for colleagues to understand what commands are used for which purpose.

Here’s my sample Makefile:

# --- Makefile
PROJECT=myproject
BRANCH=main
REGISTRY=your.docker.registry/project

.PHONY: docker docker-push start test

start:
    ./scripts/start.sh

# Stage 1 image is used to avoid downloading 2 Gb of PyTorch + nlp models
# every time we build our container
docker-stage1:
    docker build -t $(REGISTRY)/$(PROJECT):stage1 -f Dockerfile.stage1 .
    docker push $(REGISTRY)/$(PROJECT):stage1

docker:
    docker build -t $(REGISTRY)/$(PROJECT):$(BRANCH) .

docker-push:
    docker push $(REGISTRY)/$(PROJECT):$(BRANCH)

test:
    JSON_LOGS=False ./scripts/test.sh

Other observations

I had initially opted for Python 3.7, but I tried upgrading to Python 3.8 because of a comment on a related FastAPI issue on Github, and in my tests I found that Python 3.8 uses slightly less memory than Python 3.7 over time.

See also

I published a sample repository to get started with a project like the one I just described: https://github.com/cosimo/fastapi-ml-api.

I also followed or commented on a number of related GitHub issues while researching these solutions.

The Perl echo chamber, marketing and … is Perl really dying?

Recently I came across this tweet from Curtis/Ovid, which references a longer post about a proposal to integrate a better, more modern object-oriented “system” (Corinna) into Perl 5.

The proposal itself is not what I’d like to address here. I haven’t followed Corinna’s evolution. I believe it goes in a positive direction for the language, FWIW.

From that original tweet, a comment from Rafael followed:

[…] but I’m still wondering what are the real factors that make companies seek an exit strategy from Perl 5. Who makes this kind of expensive decision, and why? I suspect obscure OO syntax is not a major one.

This is what I replied with:

This is indicative of the fundamental problem in the Perl echo chamber. Some people still have no idea why companies are moving away from Perl. If you want to hear the perspective from someone who has seen this happen in multiple companies, let me know :-)

Sorry for this premise, but I was afraid what follows would make no sense otherwise.

Why is Perl dying today?

First of all, I don’t think “<language> is dying” is a useful question to ask, nor is it indicative of anything particularly interesting. I’m sure everyone reading this will have encountered plenty of “C is dying”, “Java is dying” or similar claims, and yet C and Java are still being used everywhere. In one sense, no language ever really dies. Perl’s situation is slightly different, though, as (I believe) Python has slowly conquered Perl’s space over time.

What does it mean for a language to die, or to be dead?

From an end user’s point of view, say a random programmer employed at a company or freelancing, a language could be dying if a task they want to accomplish with that language is hard because there are no supporting libraries for it (think CPAN or PyPI), or because the libraries are so old they no longer work. That situation surely conveys the idea that the language is no longer in use, or that very few people are using it. One would expect a common task in 2021 to be easy to accomplish with a language worth using in 2021.

What about a company‘s point of view? The reality is that companies don’t have an opinion on languages, only people do. Teams do have an opinion on languages. The group dynamics inside a team influence what languages are acceptable for current and new projects.

Is Perl dying then?

My experience

Some years ago I was a fairly active member of the Perl community, I attended and presented at various Perl conferences around Europe, talking about my experience using Perl at a few small and large companies.

I remember picking up Perl for the first time based on a suggestion from my manager back then. He gave me a hard copy print-out of the whole of the Perl 5.004 man pages, and said: “We are going to use this language. It’s amazing, take some time to study it and we’ll start!”. This was 1998, and I had such a fantastic time :-). I was such a noob, but Perl was amazing. It could do everything you needed and then some, and it was easy and simple. The language was fast already back then, and it got faster over time. At that point I was working in a very small company (we were three people initially), and we ended up writing a complete web framework from scratch that is still in use today, after more than 20 years. If that’s not phenomenal, I don’t know what is. It’d be cool to talk about this framework some day: it was more advanced than anything that’s been done since, even by 2021 standards… a story for another time.

And by the way, we were running our Perl code on *anything*, and I mean anything: Windows PCs, Linux, Netware and even AS/400 (a limited subset of it, at least), at a time when Java’s “write once, run everywhere” was just an empty marketing promise. Remember, this was the time of Netscape Navigator and Java applets. Ramblings, I know, but perhaps useful to understand where things have gone wrong.

In 2007 I left my job in Italy and moved to Norway to work for Opera Software. Back then, Opera’s browser was still running the Presto engine, and a little department inside Opera was in charge of web services. That’s where I was headed. Most services there were written in Perl. Glorious times for me: I would learn an awful lot there and meet a lot of skilled developers. Soon after I started, though, some colleagues were already making fun of Perl: it’s a “write-only language”, “not meant for serious stuff”, there is a “lack of web frameworks”, etc… Those were the times when Python frameworks started to emerge, some of which would eventually disappear. I remember a few colleagues strongly arguing to move to a Python framework called Pylons, and then eventually to Django.

I believe this general attitude towards Perl originated from different factors:

  • personal preference towards other languages and/or dislike towards Perl
  • the desire to be working with the latest “hip” framework or language
  • the discomfort of maintaining an aging codebase with problems

These factors exist and are legitimate reasons to want to move away from any language or framework. I’m not saying they are justified, but I do understand why people wanted that. In our field it’s quite common to try to avoid the objective difficulties of maintaining a legacy project by going the greener way of an overly optimistic rewrite, which normally ends in tears.

Throughout the years, I noticed other contributing factors to the progressive abandonment of Perl, even in companies like Opera.
I’ll mention two that I experienced directly:

  1. Outdated or non-existent supporting libraries
  2. Team composition

There was a time, a few years ago, when CPAN was awesome: the best language support system in existence, one every other language community envied. CPAN pretty much sold Perl by itself. In my case, the libraries on CPAN educated me and made me adopt a testing culture that (to my knowledge) no other language had before Perl. Today, seeing npm modules being installed without running their tests makes me uncomfortable :-)

Then, over the years, a shift happened. You would search CPAN for a library to help you with a common task and you wouldn’t find anything, or you would only find quick hacks that didn’t really work properly. The first example of that I remember was OAuth2. If I had to speculate, I would say this is the product of many elements, one of which is the rising average age of Perl programmers.

Another related shift I remember from those years is that companies publishing APIs and SDKs started dismissing Perl, at first relying on some CPAN module to eventually appear, then omitting Perl support entirely. In the beginning we politely complained to those companies, trying to make a point, but unfortunately there was no turning back. These days almost no SDK comes with a Perl component.

The second major aspect I have experienced is related to teams. In 2012 I was tasked with writing my first ever greenfield project, entirely from scratch, a project that would turn out to be one of the things I’m most proud of, Opera Discover, an online news recommendation system for the Opera browser, still working today! A team of three veteran engineers (myself included) was assembled, and there and then, we were faced with a decision: what language should we use for this?

While I was most experienced in Perl and knew a little Python, the other two colleagues didn’t know Perl. Their experience was mostly in C++; this was Opera, after all. We had been chosen not for our programming language expertise but rather (I suppose) for our ability to tackle such a big and complex project. While I could have proposed that the project be written in Perl, in good conscience I knew that choice was not viable. Django was readily available and provided a wide range of functionality we actually needed; no alternative in the Perl world came close to such a good value proposition. The fact that Python was (like Perl had been for me!) a very accessible choice, simple to pick up, easily installed on any Linux system, and with plenty of solid, up-to-date libraries, made the choice obvious.

With the Discover project I started learning Python properly, as a day-to-day programming language. I remember initially being horrified by (and making fun of) the httplib2/httplib3 situation; then I learned about the requests module and forgot all about it. This is to say that Python has its quirks too, of course. The disastrous Python 2 vs Python 3 decision in the Python community caused a lot of grief and uncertainty (Perl could have learned something from that…). Nowadays that’s a non-argument: everything runs on Python 3, and if you still haven’t moved, you will soon.

In general, having learned Python quite well, my mindset about programming and my job changed completely. I’m not a Perl programmer. I’m not a Python programmer either. I can use different tools whenever they are better suited to what I need to do. In fact, in the last four years I have written software in NodeJS and Java, of all things… I used to despise and make fun of Java, but I had never worked with it on a professional project before. While I do maintain that Java has some horrible aspects, contrary to my expectations I have enjoyed working with it: it has an efficient runtime, awesome threading, and solid libraries and debugging/inspection tools.

While I do understand Ovid’s points about wanting to keep the business going and about enjoying Perl as a language, I personally moved on many years ago. I still use Perl for the occasional script when it’s convenient, but for other use cases, like web APIs, I prefer Python and FastAPI, PyTorch for machine learning, and so on. My conclusion is that it’s the libraries and the ecosystem that drive language adoption, not the language itself.

A better OO system will unfortunately do nothing for Perl (in my opinion, at least). Better marketing will, without a doubt, do nothing for Perl either. As if a prettier website could change the situation and the aspects I talked about… it can’t! The situation we have in front of us in 2021 is the result of technological and social changes that started at least a decade ago.

I realize this may be an incoherent post. Sorry about that: I had to write it right away, or it would probably never have come out.
If you have questions or comments, let me know and I’ll try to address them if I can.

Most importantly, I do not wish to convince anyone that what I wrote is true. It is simply my experience. If there’s one thing I wish people would take from it, it’s to move away from thinking of yourself as an “X programmer” and to broaden your horizons and the set of tools available to you. It was a tremendously positive move for me, one I wish I had made earlier.

Peace.

Fast VCL checks for personalized backend responses

I’d like to talk about a problem I encountered a few years ago and one possible solution to it. This particular problem stuck with me for a long time for several reasons.
The first one is that at the time I considered the problem basically unsolvable. It would be like having a cake and eating it too, as the proverb goes. Another reason is that this problem had me spinning my wheels thinking about a solution for a good while.

Without any pretense of this being a particularly clever solution or anything like that, I’d like to illustrate what the general problem is and a possible solution I came up with. Hopefully this will be useful to you.

The general problem

Suppose you have a backend request of some sort, an API or a particular web page. In my case it was a json-based recommendations API, which returned a list of recommended news articles to read. The specific purpose of the request is not terribly important. What’s more important is the fact that this request can be personalized depending on the user that makes the request. I believe this is a quite common scenario.

In a recommendations context, it’s also common for a user not to be signed in to the service, or to be invoking the API for the first time. In this case, the recommendations engine does not have any previous information about the user, also called the cold start case.

In this specific project we had operated in a “permanent cold start mode”, meaning the recommendations we offered were never differentiated per user. There were a few knobs and settings to influence which type of recommendations one would get from the system (for example, fewer Sports articles and more Arts or Travel), but the system would not learn over time or change its recommendations based on user signals such as the articles read.

Among other things, this mode of operation allowed us to serve our entire userbase (around 90M monthly active users, around 10M weekly) with only two servers per data-center, also thanks to a very aggressive caching strategy.

When we started experimenting with personalized recommendations, it was immediately clear that we would not be able to handle the additional backend load caused by all the per-user requests. We estimated that, given the cache hit ratio drop, we would need something ridiculous like 50x the amount of servers. For each API request, we would have to:

  1. fetch the distinct user profile
  2. check if the profile contained any information about previously read articles or otherwise useful information to personalize the offered recommendations
  3. compute and return the personalized recommendations

These steps can only be performed by the recommendations engine backend, which means we would get no help from our caching in Varnish. That made personalized recommendations much harder to implement for us, at least without employing an inordinate number of servers and significantly rebuilding our system infrastructure.

You could very well say that that is a problem in itself, and it probably is :-)

A possible solution

I remember spending quite some time thinking about this without seeing any possible solution. Then one day I attended a meetup where one of the engineers talked about the Varnish API Engine, a commercial Varnish add-on that can implement authentication and paywalls directly in the caching layer. The speaker mentioned how the API Engine embeds the SQLite3 database, and how this was crucial to its performance, since the caching layer is effectively the first bottleneck of a system.

I connected the dots almost immediately and I realized I had a possible way forward to solve my problem. This is how I imagined I could approach the problem:

  • organize user signals collection (what articles each user is reading, etc…) and user profile building as a completely separate batch activity
  • every x hours, build an SQLite database with a single table, user_profiles, consisting of two columns: a user_id string and a has_profile boolean. With such a table in place, looking up whether we can build a significantly personalized set of recommendations for a user is only an SQL primary-key lookup away.
  • Using the excellent SQLite3 vmod, implement this SQL lookup in our existing Varnish VCL layer. Make sure that for every possible case this code never fails. For example, if the database file does not exist, or the file is for some reason corrupt, etc… we want to behave as if the particular user for the running request had no personalized profile.
  • Ensure that we would be able to update the SQLite database file at any time, without stopping Varnish, and the new file would be visible to the SQL queries immediately or at least after a short delay.

We tested the whole assembly and it seemed to work correctly. The final step consisted in actually computing the personalized profiles, building the real SQLite database, syncing it to the backend systems, and performing the dispatch logic in the VCL layer.
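
As a rough sketch, the batch job that builds the lookup database could look like this (here I use the user_profiles schema that appears later in the VCL comments; the function and its inputs are hypothetical, not the actual system):

# --- build_user_profiles_db.py (sketch)
import sqlite3

def build_profiles_db(profile_ids, db_path="user_profiles.db"):
    # Rebuild the SQLite file from scratch on every batch run, then
    # sync it to the Varnish frontends.
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS user_profiles ("
        " profile_id char(100) PRIMARY KEY NOT NULL,"
        " data text)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO user_profiles (profile_id) VALUES (?)",
        ((pid,) for pid in profile_ids),
    )
    conn.commit()
    conn.close()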

This is more or less the final logic I used:

  • If the request was for an anonymous user, don’t even perform the user profiles SQL lookup, and return the generic recommendations cached payload.
  • If the request comes from a user that has no personalized profile, that is, no record is present in the SQLite table, also return the generic recommendations payload.
  • If the user profiles lookup is positive, that is, a record exists in the user profiles table in SQLite and its has_profile flag is true, then pass the request on to the backend. We know it is a request that must be personalized and only the backend can do that.

Such logic allows us to serve the majority of the user base, which presumably has not logged in or does not have any significant user profile yet, caching as much as possible, while still allowing personalized recommendations for all users that do have a profile.

We are shifting the critical decision as early in the chain as possible, that is, into the caching layer (Varnish or similar), before the backend service is even consulted. Leaving the decision to the backend service would not be feasible, for the reasons already discussed.

The actual code

We used Puppet as our configuration management tool back then, with a custom varnish module. I extended the existing manifest to add a new user_profiles.vcl file and to install the sqlite3 vmod for Varnish by default.

The existing VCL code was also modified to:

  • perform the personalized profile SQL query
  • decide whether to pass the request based on the result of the SQL query

The following code illustrates those two steps:

diff --git a/config.vcl b/config.vcl
index 8e25a8a..50c70ce 100644
--- a/config.vcl
+++ b/config.vcl
@@ -1,22 +1,23 @@
 # Recommender system VCL config

 include "/etc/varnish/accept-encoding.vcl";
 include "/etc/varnish/purge.vcl";
 include "/etc/varnish/x-forwarded-for.vcl";
 include "/etc/varnish/auth.vcl";
 include "/etc/varnish/stats.vcl";
+include "/etc/varnish/user_profiles.vcl";
 include "/etc/varnish/strip-tracking-cookies.vcl";

 backend apache {
     .host  = "127.0.0.1";
     .port  = "8000";
     .probe = {
         .url       = "/ping.html";
         .interval  = 10s;
         .timeout   = 5s;
         .window    = 20;
         .threshold = 3;
         .initial   = 3;
     }
 }
@@ -147,45 +148,49 @@ sub vcl_recv {
     if (req.backend.healthy && req.http.User-Agent ~ "McHammer") {
         return (pass);
     }

     # Client clicks must go through the backend (*with* client-id cookie)
     if (req.url ~ "^/api/1\.0/feedback/") {
         return (pass);
     }

     call check_authorization;
+    call check_user_profile;
     call accept_encoding_normalize;

+    # Users with tracking cookies can be served personalized results
+    if (req.http.X-Profile == "1") {
+        std.log("User has customized profile. Rolling the dice.");
+        # Initially keep the percentage of PASS very low, to test the waters.
+        if (std.random(0, 100) < 1.0) {
+            std.log("User has customized profile and within 1.0%. Passing.");
+            return (pass);
+        }
+    }

 }

The new user_profiles.vcl file consisted of the following code:

#-----------------------------------------------------------------------------
# Fast check for personalized user profiles
#-----------------------------------------------------------------------------
#
# The general idea is to use this fast check to send users who we know
# have a personalized user profile to the backend without caching, while
# retaining the ability to send cached objects for everyone else.
#
# Uses a SQLite3 database and libvmod-sqlite3 by Federico Schwindt:
# https://github.com/fgsch/libvmod-sqlite3
#
# Extracts the `clientId' from the HTTP Cookie header.
# Looks up the profile_id key having value equal to the `clientId' cookie.
# The underlying schema is very simple:
#
#   CREATE TABLE user_profiles (
#       profile_id char(100) PRIMARY KEY NOT NULL,
#       data text
#   );
#
# At least initially we will not use the data column.

import sqlite3;

sub vcl_init {
    sqlite3.open("/etc/varnish/user_profiles.db", "|;");
}

sub check_user_profile {

    # Quick yes/no test for the clientId cookie
    if (req.http.Cookie ~ "userId=") {

        # Extract a userId value from the Cookie header,
        # which remains untouched. Make sure we can still extract a clientId
        # value even if there's other cookies before/after ours.
        #
        # XXX Not sure what happens when client sends multiple Cookie lines.
        set req.http.X-Profile-Id = regsub(req.http.Cookie,
            "(?:^|.*;\s*)(?:userId=(.*?))\s*(?:;.*|$)", "\1");

        # No need to do anything if userId hasn't been found
        if (req.http.X-Profile-Id != "") {
            #std.log("Checking profile_id: " + req.http.X-Profile-Id);

            # First case of VCL-injection vulnerability :-)
            set req.http.X-Profile = sqlite3.exec(
                "SELECT 1 FROM user_profiles WHERE profile_id='"
                + req.http.X-Profile-Id
                + "'");

            # req.http.X-Profile !~ "^SQL" to catch errors like missing DB,
            # but seems a bit fragile. Depends on libsqlite3 and/or the vmod.
            if (req.http.X-Profile == "1") {
                std.log("User profile " + req.http.X-Profile-Id
                    + " found (" + req.http.X-Profile + ")");
            }
            else {
                std.log("User profile " + req.http.X-Profile-Id
                    + " not found");
            }
        }
    }
}

The commit message

I believe that good solutions deserve awesome commit messages. Here’s what I wrote:

Date:   Thu Jan 28 19:36:46 2016 +0100

    Fast VCL check for personalized profile existence

    How to have the cake and eat it too. Serve cached objects to the majority of
    users while personalizing recommendations to the ones that actually have a
    significant user profile available.

    Got the idea from the Varnish API engine[1].

    It's possible to perform tens of thousands of sqlite database lookups a second
    while processing requests in Varnish through VCL, thanks to SQLite3 being very
    lightweight and in this case embedded right inside Varnish through the sqlite3
    vmod[2].

    This commit hopefully adds all there is to it. The last bit is obviously the
    database file, which I placed in `/etc/varnish/user_profiles.db'. We will need
    to generate the .db file from the clicker and sync it to all frontends.

    Updates seem to be received immediately.

    When no database file is present, as will be in the initial deployment, the
    `check_user_profile()' function will work normally, signaling that no custom
    user profile has been found.

    [1] https://www.varnish-software.com/products/varnish-api-engine
    [2] https://github.com/fgsch/libvmod-sqlite3

How to rollout gradually?

Another interesting aspect is the way we could “control the flow” to this personalized recommendations API, that is, decide what percentage of users with personalized profiles would actually get personalized recommendations.

A gradual rollout would certainly be the best approach, and it was implemented in two different ways:

  • once the SQL lookup was performed and the result was positive, we would still “roll the dice” and only allow 1% (or 5%, 10%) to actually pass through to the backend as personalized recommendations. This was an additional safety measure.
  • when batch building the SQLite database, we could decide to curtail the number of users with personalized profiles, for example by excluding all users that had not read at least 5 or 10 articles (see the sketch below). This barrier served two purposes: it effectively limited the number of users included in the SQLite database, and at the same time it made sure we had accumulated significant user profile information before attempting to serve personalized recommendations. A sort of win-win I didn’t expect at first :-)
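
To illustrate the second point, the selection of profile IDs for the batch build could apply such a threshold with a simple aggregate query. This is a hypothetical sketch: the article_clicks table and the idea that the click data sits in SQLite are my assumptions, not the actual system:

# --- sketch: select only users with enough reading history
import sqlite3

def eligible_profile_ids(clicks_db_path, min_articles=5):
    # Keep only users who read at least `min_articles` articles
    conn = sqlite3.connect(clicks_db_path)
    rows = conn.execute(
        "SELECT user_id FROM article_clicks"
        " GROUP BY user_id"
        " HAVING COUNT(*) >= ?",
        (min_articles,),
    ).fetchall()
    conn.close()
    return [user_id for (user_id,) in rows]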

As usual, if you have any feedback, email me or write below (but comments are subject to approval due to lots of spam).

Long-lived JVM applications memory usage tuning

A few days have passed since the last blog post about JVM memory usage monitoring tools, and I have learned so much about patterns of JVM memory usage and the magic flags that influence it. I still can’t call myself an expert, but judging from the corpus of Stack Overflow posts about the JVM and memory, at least I’m not totally clueless. :-)

EDIT: this post has now been further extended and published on the newly published Kahoot engineering blog.

How to profile Python/Django applications

Using the django-profiler module available at https://github.com/char0n/django-profiler

from profiling import profile

and later:

@profile(stats=True)
def view_to_be_profiled(self, request):
    ...

There are a couple of settings you can enable to log SQL queries and to tweak the logger name:

PROFILING_SQL_QUERIES = True
PROFILING_LOGGER_NAME = 'profiler'
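
django-profiler emits its output through Python’s standard logging module, so the logger named above needs a handler. A minimal LOGGING snippet for settings.py might look like this (assuming console output is enough; this is my own sketch, not taken from the module’s docs):

# --- settings.py (sketch)
LOGGING = {
    'version': 1,
    'disable_existing_loggers': False,
    'handlers': {
        'console': {'class': 'logging.StreamHandler'},
    },
    'loggers': {
        # Must match PROFILING_LOGGER_NAME
        'profiler': {
            'handlers': ['console'],
            'level': 'DEBUG',
        },
    },
}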

Just a reminder for myself. Nothing more.