Problems with bnx2 kernel module and high traffic

We're seeing an "elevated" level of traffic on the My Opera servers these days. As usual with operations matters, it's difficult to pin down a single clear root cause. The rest of this post explains what we found and how we fixed it.

TL;DR

You want to try options bnx2 disable_msi=1 in your /etc/modprobe.d/bnx2.conf if:

  • you are running Debian Squeeze and your bnx2 version is 2.0.2
  • you see high traffic (10K+ connections)
  • you see errors on the public network interface
  • the server is dropping packets/connections randomly, or it's really slow
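
A quick way to check the first and third conditions is sketched below. This is only a sketch: eth0 is an assumed interface name, and the helper name is_affected_bnx2 is made up for illustration. It parses the output of ethtool -i, which reports the driver and version bound to an interface.

```shell
#!/bin/sh
# Hypothetical helper: reads `ethtool -i IFACE` output on stdin and
# succeeds only if the driver is bnx2 at exactly version 2.0.2.
is_affected_bnx2() {
    awk -F': ' '
        $1 == "driver"  { driver = $2 }
        $1 == "version" { version = $2 }
        END { exit !(driver == "bnx2" && version == "2.0.2") }
    '
}

# Example usage (eth0 is an assumption, use your public interface):
#   ethtool -i eth0 | is_affected_bnx2 && echo "bnx2 2.0.2 is loaded"
#   ip -s link show dev eth0    # RX/TX errors and dropped counters
```

Because ethtool -i reports the driver that is actually bound to the interface, this checks the running module rather than whatever happens to be installed on disk.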

The gory details

Last Tuesday, the DDoS attack on the My Opera servers (which is still going on) ramped up from ~4k req/s/frontend to ~16k+ req/s/frontend. Both frontends had been dist-upgraded (including a kernel upgrade) on May 23rd, but not rebooted, so the kernel update was armed but not actually live.

We started seeing these problems of dropped connections and general slowness after the frontend servers were rebooted. They were rebooted because we had been hitting another really weird problem, the 210 days uptime timer bug. See this and this bug report for more details.

Anyway, I'm not sure how to verify this, because I didn't restart the boxes myself, but my theory is that after the reboot, the new bnx2 kernel module, version 2.0.2, was loaded.

Later on we found out about a very specific bnx2 v2.0.2 bug that only triggers in high-traffic situations, at least on Debian Squeeze and Ubuntu, and causes network interfaces to stop working correctly and drop traffic.

Long story short, there's a magic option that prevents this from happening. rmmod'ing the bnx2 module and modprobing it back with this option has fixed the problem so far.

# /etc/modprobe.d/bnx2.conf
options bnx2 disable_msi=1
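
For completeness, here is roughly how the change can be checked and applied. Treat it as a sketch under assumptions: the function names are invented for this post, and the sysfs path in the usage comment assumes the driver exposes disable_msi as a readable module parameter. Unloading bnx2 takes down every interface it drives, so do this from a console or out-of-band session, not over SSH through that same interface.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the given modprobe config file sets
# disable_msi=1 for bnx2 on a non-commented "options" line.
bnx2_msi_disabled_in() {
    grep -Eq '^[[:space:]]*options[[:space:]]+bnx2[[:space:]]+(.*[[:space:]])?disable_msi=1([[:space:]]|$)' "$1"
}

# Hypothetical helper: reload the driver so the option takes effect.
# Not called here; interfaces may need to be brought back up afterwards.
reload_bnx2() {
    rmmod bnx2 && modprobe bnx2
}

# Example usage:
#   bnx2_msi_disabled_in /etc/modprobe.d/bnx2.conf && reload_bnx2
#   cat /sys/module/bnx2/parameters/disable_msi   # should print 1 if exposed
```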

Regarding what the option is about, I'm not even going to lie about it. I have no idea… We found it with this search:

https://encrypted.google.com/search?client=opera&rls=en&q=bnx2+debian+2.0.2+traffic&sourceid=opera&ie=utf-8&oe=utf-8&channel=suggest

The first hit is our own Sven from the sysadmin team:

http://lists.us.dell.com/pipermail/linux-poweredge/2011-October/045485.html

The second hit is the solution we used:

http://ubuntuforums.org/archive/index.php/t-1726045.html

We also did some tweaking for the large number of TIME_WAIT connections that were resulting from this bnx2 bug, namely we bumped up net.ipv4.tcp_max_tw_buckets quite a bit.
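
To get a feel for the numbers before touching the limit, something like the following can help. It is a sketch: count_time_wait is a made-up helper that counts TIME_WAIT sockets in netstat -ant output, the sysctl key is net.ipv4.tcp_max_tw_buckets, and NEW_LIMIT is a placeholder, not the value we used.

```shell
#!/bin/sh
# Hypothetical helper: reads `netstat -ant` output on stdin and prints
# the number of sockets in TIME_WAIT (state is the 6th column).
count_time_wait() {
    awk '$6 == "TIME_WAIT" { n++ } END { print n + 0 }'
}

# Example usage:
#   netstat -ant | count_time_wait
#   sysctl net.ipv4.tcp_max_tw_buckets               # current limit
#   sysctl -w net.ipv4.tcp_max_tw_buckets=NEW_LIMIT  # placeholder value
```

A value set with sysctl -w does not survive a reboot; to keep it, it also has to go into /etc/sysctl.conf.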

Takeaways

  1. Before rebooting a machine, check what's going to change: when the last upgrade happened, which packages it touched, etc. For example, look at /var/log/dpkg.log.
  2. In case you have firewall rules, save them with iptables-save > /root/iptables-rules.YYYYMMDD and later restore them if needed with iptables-restore < /root/iptables-rules.YYYYMMDD
  3. Always check if the conntrack module is enabled. Most of the time you don't need it, and it will cause performance to drop under very high traffic (of course).

In this case, the conntrack module was also accidentally re-enabled by the reboot. We had previously disabled it, but hadn't made the change permanent. This is because on My Opera we're still not using our config management infrastructure… Looking forward to making that happen. Soon. Hopefully :)
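
The checks above can be turned into a small pre-reboot routine. The sketch below is only that: the helper names are invented, the dpkg.log parsing assumes the usual "date time action package old-version new-version" line layout, and the conntrack check simply looks for any loaded module with conntrack in its name.

```shell
#!/bin/sh
# Hypothetical helper: reads dpkg.log lines on stdin and prints one line
# per upgraded package: date, package, old version, new version.
list_upgrades() {
    awk '$3 == "upgrade" { print $1, $4, $5, "->", $6 }'
}

# Hypothetical helper: reads `lsmod` output on stdin and succeeds if any
# loaded module has "conntrack" in its name.
conntrack_loaded() {
    awk '$1 ~ /conntrack/ { found = 1 } END { exit !found }'
}

# Example usage before a reboot:
#   list_upgrades < /var/log/dpkg.log
#   iptables-save > /root/iptables-rules.$(date +%Y%m%d)
#   lsmod | conntrack_loaded && echo "conntrack is loaded"
```

Running the same conntrack check again after the reboot would have caught our problem right away.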
