Tag Archives: conference

SREcon EMEA 2022 Logo

My experience at SREcon EMEA 2022

A couple of weeks ago I attended SREcon EMEA in Amsterdam. Here’s some sparse thoughts about it, with no pretense of being exhaustive or coherent.

Looking Back

There are only a handful of conferences I’ve attended where I felt “at home”. Going back in time, Surge was the first one, then came Velocity. I’m adding SREcon to that list. It definitely felt like I was among people that speak the same language and have similar breadth and depth of expertise, and yet it is somewhat strange at the same time.

As I see it, there’s at least three “tiers” for such a big and niche conference. The FAANG folks, the tiny company with a sysadmin or devops or two, and then the big ocean of mid-sized companies, where people like us are. Our SRE team is four people and we manage a service with millions of monthly users. Needless to say, we have a lot on our plate :-)

I came to SREcon after a hiatus from conferences for some years. After a while, conferences tend to become self-referential and people start talking about the same things over and over again. I wanted to understand how things had changed in our field, what were people talking the most about, get some fresh perspectives and perhaps connect with people from other companies. What prompted me to do this was Niall Murphy tearing the SRE bible book apart.

The Question of SRE Identity

This year’s conference topic was “What could SRE be?”.
No surprise, then, that a good portion of the talks were about what I refer to as the question of identity for SREs. We have seen the same happen — and a lot more during all these years — for the DevOps movement.

What could SRE be, then? According to some presentations, one would conclude that whatever SRE is, it’s no longer what Google intended, it’s not what anyone else thinks it is either, it’s just what you think it is: a subjectivist view.

Among the Usenix slack conversations, there was a lot of chit-chat about SRE identity. My personal contribution was the following meme:

Other funny memes that were shared:

An interesting fact I learned during the conference is that the Google SRE book was written by assembling contributions from the best teams at Google, picking out their respective best practices. Paradoxically, this implies that the SRE book is not representative of how even Google itself does SRE. If you also consider that, at the time the SRE book was published (2016), Google employed about 1,200 people in the various SRE teams, the only possible conclusion is… if you are not Google, there is likely very little that you can apply to your everyday mere-mortal-SRE life.

Before you think I’m exaggerating, such conclusion was claimed by (ex-)Google engineers themselves, for example in Alex Hidalgo’s “Diamonds under Pressure” talk and (in my opinion) in one of the best talks of the conference, Emil Stolarsky’s Unified Theory of SRE. Another entertaining presentation in the same vein was Andrew Clay Shafer’s “SRE as She Is Spoke”. Andrew expressed this thesis that “progress [on the SRE journey] stops when the needs are met”, which seems a reasonable and pragmatic approach.
The videos are not up yet, but they should be in a few weeks.

Alongside to the “subjectivist” view, there were other talks, which could be classified as systems thinking, that focused on the more general and broad aspects of what SREs do, how to handle complex systems, human factors, etc… Among the best IMO were:

What else?

The question of SRE identity accounted for a notable part of the talks, but thankfully not all. It’s good to pause and reflect on our role, but personally that’s not why I was interested in SREcon, not primarily at least. What I like are the deep technical talks, where I get to know more about how other companies actually do the stuff we call SRE. Given my past conference experience, I expected Facebook/Meta’s talk to be somewhat disappointing, and it was. While some details of how Meta is structured were shared, and are always interesting, I expected a bit more on how the incident actually happened.

I loved Effie Mouzeli’s talk on how to make teams resilient, “Is Our Team as Resilient as Our Systems?”. We naturally focus on systems, but teams are a crucial part of the equation. My team and I have had to work on this a lot in the past years, and I’m hoping to share more about this soon. I felt this talk had a lot of good insights, some of which we’ve also applied over time.

Another talk that deserves a mention is Chris Sinjakli’s reflection on broadening the scope of how we work on reliability for our systems. This is sometimes difficult to do when toil is a big part of our jobs. Luckily it’s not for our team, not anymore at least, so this talk felt very relevant to me, and I recommend it.

I couldn’t attend some of the talks due to the two parallel tracks. I hope to catch-up when slides and videos will be published later on.

What about the hallway track?

In general, people say that conferences are most useful because of the casual conversations you can have in the hallways. While I do agree with it, the opportunities to have conversations vary depending on the type of person you are, and the people you meet, of course. My impression is that while some people at SREcon were happy to have conversations, most were likewise happy to be left alone, which is fair enough :-)
Just to say that it was really nice to meet people and chat, and almost all I talked to knew Kahoot! directly and were happy to share details about what they’re doing and equally interested in what we’re doing.

In some of these conversations I’ve been trying to motion for more concrete, down to earth, talks on how smaller companies like ours do SRE. It’s ok to aspire or be interested in how Google runs, but you come away with absolutely zero information that’s useful to your work life. Possibly there’s a downside even: people going home thinking they have to do whatever Google does (see chapters above) so ultimately… let’s give less importance to the Googles of the world, please!

Besides the hallway track, there was a nice “sidewalk” track. We walked around the city, 15 km a day on average — you gotta track those SLOs… — and I also managed to snap some nice pictures of Amsterdam at sunrise and sunset.

The Venue and Organization

Loved all of it, honestly the best conference I’ve ever been to. The venue was spectacular, there was plenty of space, slides were clearly visible on screen, and the food was awesome! We also used one of the available meeting rooms to participate in our own company hackaton after the conference finished, until they kicked us out. Here’s a sneak peek of what our team was working on:

I hope to return to SREcon next year in Dublin. By then, I’d love to see more not-Google, not-Meta, etc… talks on the program. Perhaps we (or you!) should think about presenting too, why not?

My experience at Velocity Europe 2011 in Berlin

TL;DR

This year there was the 1st edition of Velocity Europe. I got to present a talk on a DDoS attack we faced at Opera, and it was really awesome to be there.

The long version…

Around July this year I knew there was going to be a Velocity Conference in Europe, and I decided I would try to propose a few ideas for talks. I didn't have my hopes too high, but I wanted to give it a shot anyways, pushing myself way out of my comfort zone :)

The worst that could happen was that the talks didn't get accepted. After a month or so the crazy thing happened, and I got an invite to speak at Velocity, due in November.

Preparation

The first few months passed while I was slowly gathering material for the talk. The idea was talking about the DDoS attack that struck us in October 2010. Almost a year had passed, so if we hadn't taken notes and collected all sorts of logs and information, we wouldn't have had any chance to reconstruct all the "story" with enough detail to be interesting.

Anyway, weeks went by, and in September I started writing down an outline of the talk. It consisted in describing what happened during the DDoS and how we faced it, what we did, how we figured out what to do, etc… but I didn't have a clear idea of what to convey with the actual presentation. What would be the core message, if any?

If I learned anything out of all this, is that writing an outline is absolutely the best favor you can do to yourself to avoid so many problems later on. Just write it down as a text, a blog post, a story. Mind maps are also useful for me.

Last 3-4 weeks flew away while I was trying to put together a decent deck of slides.

In Berlin: pre-conference

"Birds of feathers" was the pre-conference event that took place on Monday 7th (November 2011, if you're reading this in the future), put together by a local team led by Schlomo Schapiro, which I got to talk to also during the conference. It was a good event, Steve Souders and John Allspaw and many other conference attendees were already there. There were sponsor companies presenting their products.

The most interesting sessions of the pre-conference IMO were:

  • 100ms: Steve Souders pushed everyone to think about the next level of web performance. How to bring down the "loading time" of web pages to 100ms. There was an interesting discussion about that. My point was that loading time really needs to be divided into at least dns resolution, server processing, network transfers, client rendering. So there's at least 4 totally different chunks that make up the load time and all of them can be optimized, but with varying levels of gain and complexity.
  • Dyn inc presentation about their product dashboard, that led to a better productivity and communication between teams. Cory van Wollerstein explained their mash-up of Jira and Confluence, used to automatically pull information from the tickets db and provide high-level overviews to executive teams. Very cool. He also argued whether having product managers is a good thing for a company.

The rest of the day I was busy polishing my presentation, and trying to rehearse at the hotel. A month before the conference, I had bought The Naked Presenter (ebook edition), hoping that it would help me do a decent presentation. The book of course recommended to rehearse. It felt very weird and embarassing, but I'm *so* glad I did it. I managed to streamline the presentation, and memorize the sequence of slides.

The Conference – Day 1

Schedule:

http://velocityconf.com/velocityeu/public/schedule/grid/2011-11-08

Keynotes

Opening remarks, plus Theo Schlossnagle, one of the minds behind SurgeCon, on how good operations dudes are usually generalists and need to have a wide spectrum knowledge instead of being "(Perl|Python|Ruby|Java) developers". I really recognize myself in this more generalist role than, for example, the Ruby-on-Rails guy.

Lightning demos

These were lightning demos during the first morning:

Rest of Day 1 went to hell

I had to convert all my slides to 4:3 and test again with the on-site equipment. I was also freaking out at the same time, so I missed everything else until my talk. Sorry :)

Most talks have been recorded and are already up on the Velocity site. Particularly interesting IMO, but video not available yet, are:

My talk

As I said, it was about the DDoS attack to my.opera.com of October 2010. I basically talked about how we found out we were under DDoS, and how we struggled to find our way to keep the site up and running despite the traffic. This was a mid-scale DDoS with around 18k distinct IPs attacking us. We had a hard time, but it was also very much fun in retrospect :) We learned quite a lot in the process, about HTTP and TCP/IP, nkiller2 and the TCP zero-window exploit. Most importantly, we learned to make better use of old and new tools to do troubleshooting. You will find all of this in the slides.

I did my best, and I think it was well received by the audience. While on stage, I really had the feeling that people enjoyed it, plus several folks came to say hello afterwards. One of the most frequent comments I heard was that people found my talk honest. That is the single thing I appreciate the most, because that had been one of my goals since the start. To tell an honest and detailed story of how things went, without pretending to be the super awesome heroes that know everything and can fix anything in no time.

Unfortunately, after the conference I was informed that there had been no recording of the talk. That is really sad. However, since there's no recording, I can pretend I was a nice speaker, given the ratings :). Seriously, if you have a picture or video recording, contact me :)

Here's the slides if you're interested:

http://velocityconf.com/velocityeu/public/schedule/detail/21653

The Conference – Day 2

Schedule:

http://velocityconf.com/velocityeu/public/schedule/grid/2011-11-09

Keynotes

Very inspirational talk by Jeff Veen, Typekit.com

Very well presented, great visuals. Great overall. How to create conditions for teams to work and work well.

http://velocityconf.com/velocityeu/public/schedule/detail/21788

Anticipation: What could possibly go wrong? by John Allspaw, Etsy

A great talk about how to prevent, analyze, respond to Operations problems. I very much like John's style, I think he's a pioneer, at least he introduced me to many great ideas, one above all, continuous deployment. I also like his many references to aviation, aerospace and military engineering fields.

http://velocityconf.com/velocityeu/public/schedule/detail/22258

Full stack awareness, Artur Bergman, Fastly

He's Artur Bergman. Listen to him :-) If anything, because he's really authentic.

http://velocityconf.com/velocityeu/public/schedule/detail/22914

Lightning demos

Another session of lightning demos, for our pleasure:

Browser performance track

This was a track in itself. I lost all of it, since I mainly followed the Operations track, but this was really interesting I heard. Recent speed enhancements in Opera, Chrome, Firefox and Javascript in general were explained in detail.

Afternoon talks

Deploying large payloads at scale, Ramon van Alteren (hyves.nl)

Biggest social network in the Netherlands (4M daily active users, ~10M total users). Ramon is a very cool guy. They have 3.5K servers, and their main application consists of 750Mb compiled php binaries to deploy. And they are experimenting with bittorrent tools to do that :)

I had a few hours of engaging talk with Ramon at one of the social events that followed the conference. We found lots of similarities in how we're dealing with infrastructural growth, scaling, etc… We both use config management tools like puppet extensively in our organizations. We promised each other to remain in touch about deployment matters.

http://velocityconf.com/velocityeu/public/schedule/detail/21571

HTTP connection management from 10 users to 100 millions, Bradley Heilbrun, YouTube

Really interesting dive into YouTube early (2005-2007) architecture with Apache, load balancers, GSLB.

I met Bradley later on that day and we had a quick chat. Turns out they use(d) PowerDNS with its pipe backend for geographic load balancing, much like as we do in Opera with GeoDNS. That made my day :-) It's a pity that companies like YouTube don't talk much about their current technology. They usually tell you about 2-3 years old architectures. That's still very valuable, of course.

http://velocityconf.com/velocityeu/public/schedule/detail/21708

Conclusions

If you're even remotely interested in operations, devops, running a service, scaling, performance, infrastructure, then Velocity is the conference. Surge is another one, probably even better, more hardcore-engineering focused. From my perspective, there's a couple of things that could be improved:

  • while I understand that sponsors are what makes conferences like Velocity possible, some sponsors took too much time out of the actual talk tracks. One or two talks were very promotional in nature, and it was clear to everyone that these companies were pushing their products or themselves. Maybe it wasn't their intention, but to me and to others I talked to, it came out that way.
    I think Velocity needs to screen better this type of talks and separate them from the authentic content that people want, the "stories from the trenches". As a counter-example, Google, among other companies, were doing sponsoring (and recruiting!) activities in a separate hall. That worked very well for everyone. Please let's keep it that way.
  • the on-site technical team wasn't fully prepared to handle presentations made with Open Office. That is not acceptable if you ask me, even if the majority of speakers have a Mac. It's 2011 (2012 now even), so you really need to be prepared to read OpenOffice files. I realize that wasn't Velocity organizers' fault, but I think it's something to consider for next time.

That said, I'm really really happy about my experience at Velocity Europe, both as a speaker and as attendee. It was really awesome, and worth every moment I spent working to prepare for it. Thank you O'Reilly, and I hope to be able to participate again some day :)

From disaster to stability: the scaling challenges of My Opera (Surge 2010)

I just read John Allspaw's blog post about his talk at Surge 2010. John's talks were among the best of the conference IMO. So I was amazed to read his post.

I was really honored to take part in this conference as speaker. Just before my proposal being accepted, I had bought the book Web Operations. When the Surge speakers list was announced, I was thrilled to discover that many contributors to the book, including Allspaw, were also speaking at Surge, and even more honored to have the privilege to being in the same group and knowing them in person.

If you're interested in the talk I presented, while the Omniti folks bring up videos and slides on the Surge website, you can download them from Slideshare:

I hope the video is delayed, so I can avoid the embarrassment for a little while :-)

Surge 2010 scalability conference in Baltimore, USA – DAY 2

This is a summary of day 2 of the Surge conference that took place in Baltimore, USA, 30th of September and 1st of October 2010. For a quite comprehensive blog post about day 1, you can read my previous post.

Here comes the list of talks I attended during Day 2.

Brian Cantryll – failures in commodity hardware

What happens when commodity hardware is used in an "enterprise" hardware project? Brian guided the audience through this industrial hw project. There was no recorded video of this talk, due to the content being potentially "sensitive". Very interesting talk, and Brian is IMO a very good speaker.

Benjamin Black – FastIP

Benjamin presented a – for me – new way to analyze metrics of a network, named "Flow". The flow-based network metrics can represent a network activity in a way that is completely different and much more accurate than what's usually done by operations and sysadmin departments. The downside is that is generates a lot of data. The advantage is that you can analyze and even replay? any traffic that took place between any two nodes of the network. I'm sure I didn't understand correctly because this would be amazing.

There's products out there that offer flow-based network analysis: Cisco Netflow, Ntop NProbe, etc… There's also a IETF working group about flow. We couldn't see any example/demo because there was a problem with the slides, IIRC.

FastIP also offers a related service. I contacted Benjamin about this after his talk. Maybe we'll be able to try something out or at least have a demonstration.

My TODO list:

Gavin Roy – Scaling MyYearBook.com

One of the most interesting talks in this conference IMO. MyYearbook is a Postgres shop, among the top 25 trafficked sites in the USA.

Gavin talked about many things they did to scale their site as the traffic was growing. Here's some of the things I remember:

  • DB connection pooling very important for them. Made a world of difference. They use PgBouncer and pgPool2
  • DB Horizontal scaling with pIProxy. TODO: look it up
  • DB Replication w/ Londiste, Slony, Bucardo
  • Postgresql 9.0 based standby to increase read-only capacity, and for hot-standby.
  • Partitioned the database by table, feature available since Pg 8.1

They have a primary-to-secondary master failover procedure. They looked into automating it, but a tech judgement is really necessary in case something goes wrong, so they will keep it manual. This was a question I asked to Gavin, since we've thought about automating our failover procedure for MySQL, but it's not so easy to just decide when to trigger the failover…

For user storage, they use Isilon IQ Series, apparently a FreeBSD appliance with on-board NFS. For DB servers, they looked at different solutions, but they keep coming back to direct attached storage. Their man db server, they have a massively powerful machine, IIRC, 512Gb of RAM and 128 cores machine. I have to double check this because it seems really impressive.

John Allspaw – Go or No Go

Another great talk by John, well presented and with great content. Not easy to summarize. The main topic was the "Go or no-go meeting", a 10 minutes get together of all involved parties before releasing changes or launching any new feature live.

This meeting basically consists of Yes/No questions:

  • Have you tested enough to deploy? QA still needed?
  • Has the feature being communicated (blog/forum/…)?
  • Does everyone know: when it will go live? who will push the feature?
  • Has the feature been in production for staff (or beta users)? That can be tricky to implement if the new feature implies social interactions (beta user tagging non-beta user)
  • Is it possible to dark launch this feature? Will we?
  • Is it possible to turn on this feature on a % of users? Will we?
  • Does it involve new infrastructure? If so: is there monitoring in place? (BLOCKER)
  • On/Off switch in the code/config is in place? Is it documented?
  • Are all the relevant people available for communication and launch?
  • Is there a place for users to provide feedback about the feature?
  • Post-launch "it's all done" time agreed?
  • Contingency checklist done and everytime reviewed it? (BLOCKER)

The "Contingency list" should answer the question: "What could possibly go wrong? What will we do about it?", with a list of potential issues and how to solve them in case shit hits the fan.

Apart from the Go/No-go meeting, which would be, also according to my past experience, a great way to avoid problems, there's at least a couple more really nice things to keep in mind when developing or launching a new feature:

  • "Dark launches": a dark launch is essentially a full launch of the new feature, but in such a way that is invisible to users. So if you're making db queries and processing stuff, you keep doing all that, you just throw the data away. You will be able to realize the (almost) full impact of the new feature on your application and compensate accordingly.
  • Feature "sampling" (% of users): you just enable the full feature for a small, and then growing, percentage of your user base. You can gradually grow to 100% and test the effect of the changes.

Great stuff.

Neil Gunther – Quantifying scalability

Here I was a bit too excited, due to my talk coming next, so unfortunately I didn't pay too much attention. It's a full analysis of scalability seen as a mathematical function, as capacity of your system as the load increases.

Cosimo Streppone – Scaling challenges of my.opera.com

I think

I used 5 minutes to show a live demo of the My Opera realtime monitor application that we built and afterwards I got very interesting questions, and also some nice twitter messages about it.

I also talked about how we've experimented in distributing requests across the different datacenters with our little geodns tool.

All in all, for me it was a fantastic experience. Practice will make me better, so I look forward to a next time :-)

Baron Schwartz – Scaling without sharding

Baron works for Percona. I had read some talks of his. I think he's a really good speaker. He explained in detail the scenarios that arise when dealing with database scaling, the typical characteristics of reads and writes, single server vs multiple servers deployments.

Basically what the talk tries to suggest is that very few situations require to shard your database. Single server setups can go very far, by optimizing the way the db works. Quote: "Sharding should be your last resort". Sharding should be enforced when write demand exceeds write capacity, so avoid sharding if you can, try to buffer/collate writes, defer update work, etc..

Closing day 2

Theo Schlossnagle closed the conference with a plenary keynote about a semi-serious "brief history of computing". Much fun, and a goodbye to next year's Surge.

For a glimpse of what happened live at the conference, you can also check out the Twitter stream for #surgecon.

Definitely a great conference. Stay tuned for videos and slides on the official site, http://omniti.com/surge/2010.

Communities in Action 2010 in Oslo

Last Monday, 10th of May, the regular Oslo.pm (Oslo Perl Mongers) meeting was a little special. In fact, there wasn't any Oslo.pm meeting, but we went to a free, one evening mini conference sponsored by several norwegian companies: "Communities in Action".

The basic idea was to put together several different communities in the Oslo area, so the conference program was diverse and exciting. There were 7 different tracks. Among them, the most interesting for me, apart from Oslo.pm, were:

  • XP (extreme programming, agile, etc…)
  • scalabin (about the scala language)
  • OWASP
  • Oslo C++ users group

Other tracks were by the NNUG, Norwegian .NET users group and IASA, some norwegian association about something…

There were a couple of colleagues from Opera, and other guys I know from Oslo.pm, but being there alone, I couldn't follow more than one track, so I tried jumping a bit between different rooms. I found the XP talk a bit boring, so I settled on the Oslo.pm track. There was a talk on rakudo * by Karl Rune Nilsen and a talk about meta-object programming in Perl vs Ruby by Matt Trout.

Both talks rocked. I think I had attended a very similar rakudo talk before, but the Ruby one was entirely new. It was funny when Matt, just before starting, tried to attract Ruby folks screaming and shouting
throughout the hotel hall :-)

After the conference, a small group of us gathered and headed over to tilt, a nice pub/pinball place, where I ended up making the day record even if I suck at pinballs…

European Perl Conference, Day 1

Every YAPC::EU (Yet Another Perl Conference Europe) is a really big event in the Perl world, with lots of people from every part of the planet. I got to know some of them already, so we just meet like good friends :-) This year's theme was Corporate Perl, how Perl is used in the corporate world.

This time though I was presenting a talk during the first day of the conference: How Opera uses Perl, that's up on Slideshare right now. If you take a look at it, you will find out that we actually use Perl for a lot of systems, from the very tiny to very complex, mission-critical ones. It's been quite some fun preparing the talk, and I think it also went decently.

There were lots of other interesting talks, even lightning talks, like Giuseppe Maxia's MySQL Sandbox, or Sue Spencer's talk about "Perl at Cisco Systems". There was also a talk on roles and inheritance in OO systems by Curtis Poe of the BBC, and a really funny lightning talk by Alex Kapranoff, a russian guy, but I don't remember the title. Merijn Brand presented lots of ways to improve your Perl modules. This guy's amazing. Also avid Opera user.

During lunch we met up with Martin Berends and Carl Mäsak and talked about Perl 6 syntax, CPAN 6, etc… really cool people.