Category Archives: Uncategorized

embed.ly supports My Opera urls

On My Opera, we started using the OEmbed specification a long time ago. The goal was to provide an easy way to get metadata about our content, for example albums or pictures. Example: given a picture URL on My Opera:

http://my.opera.com/365/albums/showpic.dml?album=2721921&picture=42077711

we wanted to give our users an easy way to extract metadata about that URL. That's what OEmbed is about. So you have the following OEmbed API URL:

http://api.my.opera.com/service/oembed?url=http%3A//my.opera.com/365/albums/showpic.dml%3Falbum%3D2721921%26picture%3D42077711

and the result of an HTTP GET to that URL is:

{
   "width" : "3443",
   "author_name" : "giulia-maria",
   "author_url" : "http://my.opera.com/giulia-maria/",
   "provider_url" : "http://my.opera.com/",
   "version" : "1.0",
   "provider_name" : "My Opera Community",
   "height" : "2293",
   "url" : "http://files.myopera.com/giulia-maria/albums/2721921/DSC_5396.JPG",
   "title" : "April 27, 2010. Pink jasmine or pink lilac?",
   "type" : "photo"
}

So that gives you some useful metadata if you need to embed the picture in a page of yours, or even within a widget. Various sites have supported OEmbed for many years now. You can see a brief list on oembed.com. However, there's now a new service called embed.ly that works as an OEmbed "hub", as it transparently supports URLs from many different websites.

And that includes My Opera too! Try it here:

http://api.embed.ly/embed?url=http%3A//my.opera.com/365/albums/showpic.dml?album=2721921%26picture=42077711

or try it with your favourite site.
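To script against either endpoint, the picture URL has to be percent-encoded before it goes into the query string, exactly as in the API URL shown above. A minimal bash sketch (the curl line is illustrative and left commented out):

```shell
#!/bin/bash
# Percent-encode a string for use as a query-string value.
urlencode() {
    local s="$1" out="" c i
    for (( i=0; i<${#s}; i++ )); do
        c="${s:$i:1}"
        case "$c" in
            [a-zA-Z0-9./_~-]) out+="$c" ;;                # leave safe chars alone
            *) out+=$(printf '%%%02X' "'$c") ;;           # encode the rest as %XX
        esac
    done
    printf '%s\n' "$out"
}

pic='http://my.opera.com/365/albums/showpic.dml?album=2721921&picture=42077711'
api="http://api.my.opera.com/service/oembed?url=$(urlencode "$pic")"
echo "$api"
# -> http://api.my.opera.com/service/oembed?url=http%3A//my.opera.com/365/albums/showpic.dml%3Falbum%3D2721921%26picture%3D42077711
# curl -s "$api"   # fetches the JSON document shown above
```

Swapping the host part for api.embed.ly/embed gives you the hub variant of the same request.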

Pimp my Debian

Have you ever reinstalled your workstation and found out that your Perl scripts need a shitload of modules that you don't have anymore? Or maybe on a server?

I have. In such cases you have to:

  • run your script,
  • find out which module is missing,
  • figure out if there's a Debian package for it,
  • install the Debian package,
  • GOTO 10

Today I was so annoyed and lazy, that I decided to put an end to this madness. So I wrote pimp-my-debian. It's an innocent script that you can run as follows:

$ pimp-my-debian --command 'perl ./myscript'

It will keep running your command (perl ./myscript), reading its output, and if the output contains something like Can't locate Foo/Bar.pm in @INC, or Base class package "Foo::Bar" is empty, it will try to figure out a suitable Debian package, install it, and retry your command.
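The core of the trick is mapping a missing Perl module to its conventional Debian package name (Foo::Bar usually ships as libfoo-bar-perl) and looping until the command stops failing. A hedged sketch of that loop in bash (pimp-my-debian itself is a Perl script; the function names here are purely illustrative, and the naming convention doesn't hold for every module):

```shell
#!/bin/bash
# Map a module path like Foo/Bar.pm (or a name like Foo::Bar)
# to the conventional Debian package name libfoo-bar-perl.
module_to_package() {
    local m="$1"
    m="${m%.pm}"       # strip trailing .pm
    m="${m//::/-}"     # Foo::Bar -> Foo-Bar
    m="${m//\//-}"     # Foo/Bar  -> Foo-Bar
    echo "lib$(echo "$m" | tr 'A-Z' 'a-z')-perl"
}

# Run the command, look for a missing-module error in its output,
# install the guessed package, and try again.
run_until_happy() {
    local cmd="$1" out mod
    while out=$(eval "$cmd" 2>&1); [ $? -ne 0 ]; do
        mod=$(echo "$out" | \
              sed -n "s/.*Can't locate \([A-Za-z0-9_/]*\.pm\) in @INC.*/\1/p" | head -1)
        [ -z "$mod" ] && break   # not a missing-module failure: give up
        sudo apt-get install -y "$(module_to_package "$mod")"
    done
}
```

For example, run_until_happy 'perl ./myscript' would keep installing guessed packages until the script runs clean or fails for some other reason.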

It worked so well that I can't wait to use it again… :-)
Get pimp-my-debian here and have fun!

Another quick update to Ubiquity for Opera

I just pushed a small update to my Ubiquity for Opera user javascript. Two tiny changes:

  • Now the ESC key hides the Ubiquity window
  • Fixed the text selection and focus when you reopen the Ubiquity window and you had input a command before.

Thanks to Martin Šrank for contributing the ubiq_focus() fix.

To download the latest version, go to http://github.com/cosimo/ubiquity-opera/. There's also a minified version there.

MySQL alter table in a transaction

I had quite a painful experience yesterday, I would say…

We were about to run live database migrations, so I decided to try them first on a staging database server, to get a more precise idea of how long they would take. I connected to the MySQL test system:

mysql> use myopera
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed
mysql> BEGIN WORK;
Query OK, 0 rows affected (0.00 sec)

mysql> ALTER TABLE blah ADD COLUMN xxx ...
...

The ALTER operation completed in a minute or so. Then I wanted to revert it, so I typed:

mysql> ROLLBACK;

So the transaction was rolled back. Just to double check, I verified that the table was indeed left in its original state and, to my surprise, it wasn't. Yes, the table is an InnoDB table, and yes, I had put the ALTER statement inside a transaction.

Even after a ROLLBACK, the table is left altered… Isn't that surprising? For me, it was. The reason is that DDL statements such as ALTER TABLE cause an implicit commit in MySQL, so they can never be rolled back. It is documented MySQL behavior, and it's filed as a bug/feature request.

Handling a file-server workload with varnish – part 2

I recently wrote about tuning Varnish for a file server. I'd like to continue by detailing what we had to change compared to the Varnish defaults to achieve good and stable performance.

These were the main topics I had mentioned:

  • threads-related parameters
  • hash bucket size
  • session-related parameters
  • grace config and health checking

Threads-related parameters

One thing is certain: you'd better bump up the minimum number of threads. When Varnish starts up, it creates "n" threads. If more threads are needed, it can create as many as "m" threads, but no more.

"n" is given by thread_pool_min * thread_pools. The defaults are 2 thread pools, and thread_pool_min is 200, so varnish will create 400 threads when starting up. We found that we need at least 6,000 threads, sometimes peaking at 8,000. In this case, it's better to start up directly with 7-8,000 threads. We set:

  • thread_pools = 8, since we have an 8-core machine
  • thread_pool_min = 800

With these settings, on start Varnish will create 6,400 threads and keep them running all the time.

We also set a related parameter, thread_pool_add_delay, to 2 ms, instead of the default of (I believe) 20 ms. This allows Varnish to create a lot of threads more quickly when it starts up, or when more threads are needed. The default 20 ms value could slow down thread creation and prevent Varnish from serving requests quickly enough.
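Put together, the thread settings above translate to varnishd parameters along these lines (a sketch only; the listen address and VCL path are illustrative and the rest of your command line will differ):

```shell
# Thread tuning discussed above: 8 pools x 800 min threads = 6,400 at startup
varnishd -a :80 -f /etc/varnish/default.vcl \
    -p thread_pools=8 \
    -p thread_pool_min=800 \
    -p thread_pool_add_delay=2
```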

Hash bucket size

I don't know much about the hashing internals, but I know we have tens of millions of files, maybe more, so we have to make sure the hash tables used to store cached objects are big enough to prevent too many hash collisions.

This is controlled by the -h option to varnishd. The default bucket size is 50023. We set it to 500009 (-h classic,500009). This way, even if we kept 10 million files in memory, we would only have 20 entries in each bucket on average. That's not ideal, but it's better than the default.
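The average-chain arithmetic behind that claim is simple to sanity-check:

```shell
# Average hash-chain length with the classic hasher:
# number of cached objects divided by the bucket count.
objects=10000000
buckets=500009
echo $(( objects / buckets ))    # -> 19, i.e. roughly 20 entries per bucket
```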

We didn't experiment with the new hashing algorithms like critbit.

Session-related parameters

Not so much on this particular server, but in general, we had to bump up the sess_workspace parameter. The default is 16 KB (16384 bytes). sess_workspace controls the amount of memory Varnish dedicates to each connection ("session" in Varnish speak), used as working memory for HTTP header manipulation. We set it to 32 KB here. On other servers, where we use a more elaborate VCL config, we use 128 KB as the default value.

Grace and health checking

Varnish can check that your defined backends are "healthy". That means that they respond to queries within the defined time and don't miss heartbeats. You enable health checks just by defining a .probe block in your backend definition (search the Varnish wiki for details).
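For reference, a .probe block might look like the following (the host, URL, and thresholds are illustrative, not taken from our config):

```
backend mybackend {
    .host = "10.0.0.10";
    .port = "80";
    .probe = {
        .url = "/ping";       # any cheap URL your backend answers
        .interval = 5s;       # heartbeat frequency
        .timeout = 1s;        # answer within this or miss a beat
        .window = 8;          # look at the last 8 probes...
        .threshold = 6;       # ...at least 6 must succeed to be healthy
    }
}
```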

Having health checks is very convenient: you can instruct Varnish to extend the grace period when your backend is dead. This means that if Varnish detects your backends are dead or overloaded and missing heartbeats, it will keep serving stale objects from its cache, even if their TTL is already over. You enable this behaviour by saying:

sub vcl_recv {
   set req.backend = mybackend;

   # Default grace period is 10s
   set req.grace = 10s;

   # OMG. Backend dead. Keep serving stuff until we recover them.
   if (! req.backend.healthy) {
      set req.grace = 4h;
   }
    
   ...
}

sub vcl_fetch {

   # Renew cached objects every minute ...
   set obj.ttl = 60s;

   # ... but keep all objects way past their expire date
   # in case we need them because backends died
   set obj.grace = 4h;

   ...

}

That's it. We're continuing to refine our configs and best practices for Varnish servers. If you have feedback, leave a comment or drop me an email.

Primary to secondary master failover

Here's how to failover from primary to secondary master.
This was written following the My Opera case, and we use MySQL, but it should be fairly generic.

Disable monitoring checks

  • Pause any Pingdom checks that are running for the affected systems
  • Set downtime or disable notifications in Nagios for the affected systems

Record log file and position

Assuming your secondary master (DB2) is idle, record the log file and position by issuing a SHOW MASTER STATUS command:


mysql> SHOW MASTER STATUS\G
*************************** 1. row ***************************
            File: mysql-bin.000024    <- MASTER LOG FILE
        Position: 91074774            <- MASTER LOG POSITION
    Binlog_Do_DB:
Binlog_Ignore_DB:

1 row in set (0.00 sec)

Write them down somewhere.
If you need to perform any kind of write/ALTER query on this host, you have to issue the SHOW MASTER STATUS command again, because the position will change.

Also try repeating this command. You should see that the log file and position do not change between different runs.

Enable maintenance mode

Now is the time to enable your maintenance or downtime mode for the site or service. That will of course depend on your deployment tools.

Stop backend servers

Your backend/application servers might need to stay up and running. For example, in the case of the Auth service, we want this, because we're going to serve static responses (HTML, XML, etc…) to the clients instead of just letting the connections hang.

In other cases, it's fine to just shut down the backends. You may want to do this for 2 reasons:

  • to make sure that nobody is accidentally hitting your master database, from your internal network or otherwise
  • because doing so should close all the connections to your master database. This actually depends on the wait_timeout variable in the MySQL server: the connections won't go away until wait_timeout seconds have passed. This is normal behaviour, so don't panic if you still see connections after you shut down the backends.

Switch to the new master now

This depends on how you actually perform the switch. I can imagine at least 2 ways to do this:

  • by instructing LVS to direct all connections to the secondary master
  • by taking over the IP address, either manually or using keepalived

On My Opera, we use keepalived with a private group between the two master database servers, so it's just a matter of:

  • stopping keepalived on the primary master database
  • starting keepalived on the secondary master database

Here is a quick and dirty bash script that verifies which host is the master and makes the switch.

#!/bin/bash

DB1=pri-master-hostname
DB2=sec-master-hostname

function toggle_keepalive() {
        host=$1
        if [[ `ssh $host pidof keepalived` == "" ]]; then
                ssh $host /etc/init.d/keepalived start
                if [[ `ssh $host pidof keepalived` == "" ]]; then
                        echo '*** KEEPALIVE START FAILED ***'
                        echo 'Aborting the master failover procedure'
                        exit 1
                fi
        else
                ssh $host /etc/init.d/keepalived stop
                if [[ `ssh $host pidof keepalived` != "" ]]; then
                        echo '*** KEEPALIVE STOP FAILED ***'
                        echo 'Aborting the master failover procedure'
                        exit 1
                fi
        fi
}

echo "Master Database failover"
echo

# Find out who's the primary master now, and swap them
if [[ `ssh $DB1 pidof keepalived` == "" ]]; then
        PRIMARY=$DB2
        SECONDARY=$DB1
else
        PRIMARY=$DB1
        SECONDARY=$DB2
fi

echo Primary is $PRIMARY
echo Secondary is $SECONDARY

# Shutdown primary first, then enable secondary
toggle_keepalive $PRIMARY
toggle_keepalive $SECONDARY

As soon as you do that, the secondary master will be promoted to primary master.
Since the two masters are assumed to be already replicating from each other, nothing changes for them. It does change, however, for all the slaves that were replicating from the primary master. We'll see what to do about that later.

Restart backend servers

Now it's the right time to restart the backend servers, and check that they correctly connect to the new primary master.

On My Opera, we're using a virtual address, w-mlb (write-mysql-load-balancer), to refer to the active primary master database. We use this name in the configuration files everywhere.

This means that we don't have to change anything in the backend servers configuration. We just restart them, and they will connect to the new primary master, due to the IP takeover step described above.

Turn off maintenance mode

If the backends are working correctly and connecting to the new master DB, it's time to remove the maintenance page, so do that.

We're enabling and disabling maintenance mode by enabling and disabling a virtual host configuration in our frontends and reloading or restarting the frontend httpd servers.

From now on, your application is hopefully up and running and receiving client requests, so your downtime window is over.

Check replication lag

The database slaves at this point are still replicating from the former primary master database (DB1).

But DB1 is not receiving any traffic (queries) anymore, so it's basically idle, as it should be. Any queries happening on DB1 now mean that something is seriously wrong. There might be lingering connections, but no activity.

Now it's important that all the slaves show no replication lag, so issuing a SHOW SLAVE STATUS command should show zero seconds behind.

mysql> SHOW SLAVE STATUS\G
*************************** 1. row ***************************
             Slave_IO_State: Waiting for master to send event
                Master_Host: <DB1-ip>
                Master_Port: <port>
              Connect_Retry: 60
            Master_Log_File: mysql-bin.000025
        Read_Master_Log_Pos: 13691126
...
      Seconds_Behind_Master: 0

1 row in set (0.00 sec)

It's important that Seconds_Behind_Master is zero.
If it's not, it means the slave needs more time to fully replicate all the past traffic that had been going on on the former primary master, DB1.
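Checking the lag on every slave by hand gets tedious; the relevant line can be pulled out of the SHOW SLAVE STATUS output with a small helper like this (a sketch; the host name in the comment is illustrative):

```shell
#!/bin/bash
# Extract the Seconds_Behind_Master value from SHOW SLAVE STATUS \G output.
lag_of() {
    echo "$1" | sed -n 's/^ *Seconds_Behind_Master: *//p'
}

# Typically you would feed it live output, e.g.:
#   lag_of "$(mysql -h some-slave -e 'SHOW SLAVE STATUS\G')"
```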

Remember that the primary master is now DB2, while DB1 is the secondary master.

Change master on the slaves

Now you can perform the CHANGE MASTER TO command on all the slaves.

This is where you need the notes you took earlier about the MASTER LOG FILE and MASTER LOG POSITION.

First, stop the slave replication.

mysql> STOP SLAVE;

Then the exact command to issue, if nothing else about your replication changed, is:

mysql> CHANGE MASTER TO MASTER_HOST='<DB2-ip>', MASTER_LOG_FILE='<master_log_file>', MASTER_LOG_POS=<master_log_position>;

Then restart the slave replication:

mysql> START SLAVE;
mysql> SHOW SLAVE STATUS\G

The SHOW SLAVE STATUS\G command should show that replication is running and, depending on how long it took you to change master after the new master took over the IP, the number of seconds of replication lag.

This number should rapidly go down towards zero.

If it's not, then you might have a problem. Go hide now or take the first flight to Australia or something.

We wrote a switch-master Perl script that proved to be very effective and useful. Example:

./switch-master --host <your_slave> --new-master <new_master_ip> --log-file <master_log_file> --log-pos <master_log_position>

This script performs a lot of sanity checks. Before switching master, it checks that the replication lag is zero. If it's not, it waits a bit and checks again, and so on.

It's made to try to prevent disaster from striking. Very useful and quick to use.

Enable monitoring checks

Now verify that everything looks fine, replication lag is zero, your backends are working correctly, try to use your site a bit.

If everything's fine, enable or unpause the monitoring checks.
You have made it!

Handling a file-server workload with varnish

During the last couple of months, we have been playing with Varnish for files.myopera.com, the main My Opera file server.

I'm not sure this is a typical use case for varnish, maybe it is, but it has a few unique challenges which I'll try to explain here:

  • a really high number of connections (in the 10k range)
  • a large file set, ~100 million files
  • the longer the TTL, the better (10 days is the default)
  • really simple or no VCL logic

In other Varnish installations we're maintaining here at Opera, the real challenge is to seamlessly interface with backend application servers, but in this case the "backend" is just another HTTP file server with a little extra logic.

Searching around, and using the resources I mentioned some blog posts ago, we have found a few critical settings that need to be tuned to achieve consistent performance levels. Some are obvious, some others are not so obvious:

  • threads-related parameters
  • hash bucket size
  • session-related parameters
  • grace config and health checking

I'll explain all the settings we had to change and how they affected us in a later post.

Looking at Cassandra

When I start to get interested in a project, I usually join the project's users mailing list and read it for a couple of months, to get a general feeling of what's going on, what problems people have, etc…

I became very interested in Cassandra, "A highly scalable, eventually consistent, distributed, structured key-value store".
So, almost 2 months ago, I joined the cassandra-users mailing list.

Turns out that Cassandra is in production at huge sites like Twitter and Digg. Which doesn't really tell you much if you don't know how they use it, and what they use it for. However, guess what? There's a lot of information out there.

Here are a few links I'd like to share about Cassandra:

The MySQL Sandbox

I learned about the MySQL Sandbox last year in Lisbon, at the European Perl conference. I remember talking about it with Giuseppe Maxia, the original author, and promising him I would try it.

I wasn't really convinced about it until the first time I tried it.

That was when on My Opera we switched from master-slave to master-master replication. It was during a weekend last November. I was a bit scared of doing the switch. Then I remembered about MySQL Sandbox, and I tried it on an old laptop.

I was absolutely amazed to discover that I could simulate pretty much any setup, from simple to complex: master-slave, master-master, circular multi-master, etc…
It was also very quick to set up, and installing new sandboxed servers is very fast.
You can set up a five-server replication topology in less than 30 seconds.

MySQL Sandbox is a Perl tool that you can easily install from CPAN with:

$ sudo cpan MySQL-Sandbox

You get the make_sandbox command, which allows you to create new sandboxes. Right now I'm trying it again for a maintenance operation we have to do on My Opera soon. I'm using the master-master setup like this:

make_replication_sandbox --master_master --server-version=/home/cosimo/mysql/mysql-5.0.51a-....tar.gz

so I can simulate the entire operation, minimize the risk of messing up the production boxes, and get a bit more confident about these procedures.

MySQL Sandbox, try it. You won't go back :-)

A geolocating, distributed DNS service, geodns

It's been 2 months now that I formally changed my team from My Opera to Core Services. For most of my day, I still work on My Opera, but I get to work on other projects as well.

One of these projects concerned the browser-ballot screen, even if it's now being used for other purposes as well. It is a very interesting project, named geodns. It is a DNS server.

Its purpose is not unique or new: create geographically-aware DNS zones. Example (just an example): my.geo.opera.com, a geographically-aware my.opera.com that sends you to our US-based My Opera servers if your browser presents itself with an American IP address, to Norwegian servers if you are in Norway, etc… So, nothing new or particularly clever. Some actually argue that DNS shouldn't be used this way. But it's really convenient anyway…

So this DNS server is written in Perl, and it uses the omnipresent GeoIP library to geolocate every client IP address, then uses this information to send the client to the appropriate server based on some simple matching rules:

  • by specific country
  • by specific continent
  • if nothing matches specifically, pick a random server from the pool of those available
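The matching rules above can be sketched roughly as follows (the server names and the country/continent tables are made up for illustration; the real geodns is a Perl daemon driven by GeoIP lookups):

```shell
#!/bin/bash
# Pick a server for a client: by country first, then continent,
# then fall back to a random member of the whole pool.
pick_server() {
    local country="$1" continent="$2"
    case "$country" in
        US) echo "us1.example.com"; return ;;
        NO) echo "no1.example.com"; return ;;
    esac
    case "$continent" in
        NA) echo "us1.example.com"; return ;;
        EU) echo "no1.example.com"; return ;;
    esac
    # No rule matched: random pick from the pool
    local pool=(us1.example.com no1.example.com)
    echo "${pool[RANDOM % ${#pool[@]}]}"
}
```

For example, pick_server NO EU matches the country rule, pick_server FR EU falls through to the continent rule, and pick_server JP AS gets a random server.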

I also made geodns log to a special file that lets us use our own OpenGL engine to display realtime DNS hits on a photo-realistic 3D Earth.

In this picture you can see blue and red dots. The higher the spikes, the more requests there are. Blue is where our datacenters are, red is where clients are sending requests from.

I'm trying to get this published as open source, even if, as I said, it's not really complex or anything. Just a good out-of-the-box solution. It's been running in production for about 3 weeks now, and it's serving around 300 requests per second on average. Stable and fast, but now we're looking at increasing the traffic. My goal is to reach at least 2000 req/s on a single machine. We'll see… :)