We're seeing an "elevated" level of traffic these days on the My Opera servers. As usual with operations matters, it's difficult to pin down a single clear root cause. The rest of the post explains what we found and how we fixed it.
You want to try options bnx2 disable_msi=1 in your modprobe configuration if:
- you are using Debian Squeeze and your bnx2 version is 2.0.2
- you see high traffic (10K+ connections)
- you see errors on the public network interface
- the server is dropping packets/connections randomly, or it's really slow
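One way to check for errors on an interface is to read the counters the kernel exposes under sysfs. This is a minimal sketch, not something taken from our own tooling: the iface_errors helper name, the eth0 default and the optional directory override are all assumptions made for illustration.

```shell
#!/bin/sh
# Print error and drop counters for one network interface.
# Assumptions: the interface is eth0 unless given as $1, and the counters
# live in the standard Linux sysfs statistics directory. $2 optionally
# overrides that directory, which is handy for a dry run.
iface_errors() {
    stats_dir="${2:-/sys/class/net/${1:-eth0}/statistics}"
    for counter in rx_errors tx_errors rx_dropped tx_dropped; do
        # Skip counters that are missing or unreadable.
        if [ -r "$stats_dir/$counter" ]; then
            printf '%s=%s\n' "$counter" "$(cat "$stats_dir/$counter")"
        fi
    done
}
```

Calling iface_errors eth0 twice, a few seconds apart, shows whether the counters are still climbing.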
The gory details
Last Tuesday, the DDoS attack on the My Opera servers (which is still ongoing) ramped up from ~4k req/s/frontend to ~16k+ req/s/frontend. Both frontends had been dist-upgraded (including a kernel upgrade) on May 23rd, but not rebooted, so the kernel update was armed but not actually live.
We started seeing these problems of dropped connections and general slowness after the frontend servers were rebooted. They were rebooted because we had been hitting another really weird problem, the 210 days uptime timer bug. See this and this bug report for more details.
Anyway, I'm not sure how to verify this, because I didn't restart the boxes myself, but my theory is that after they were rebooted, the new bnx2 kernel module version 2.0.2 was loaded.
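For what it's worth, the driver version currently in use can be read from the output of ethtool -i. The sketch below only parses that output; the driver_version name is made up for this example, and eth0 in the usage note is an assumed interface name.

```shell
#!/bin/sh
# Extract the driver version from `ethtool -i` output read on stdin.
# The exact match on "version" skips the "firmware-version" line.
driver_version() {
    awk -F': ' '$1 == "version" { print $2 }'
}
```

Typical use would be ethtool -i eth0 | driver_version. This shows what is loaded now, not what was loaded before the reboot, so it cannot confirm the theory on its own.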
Then later on we found out about this very specific bnx2 v2.0.2 bug that only triggers in high traffic situations, at least on Debian Squeeze and Ubuntu, that causes network interfaces to stop working correctly, dropping traffic.
Long story short, there's a magic option that prevents this from happening. rmmod'ing and modprobing back the bnx2 module with this option fixed the problem so far.
# /etc/modprobe.d/bnx2.conf
options bnx2 disable_msi=1
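Applying the option boils down to writing that file and reloading the module. A minimal sketch, assuming root access and a console that does not depend on the bnx2 interfaces; write_bnx2_conf and reload_bnx2 are hypothetical helper names, and the directory argument exists only so the first function can be tried against a scratch directory.

```shell
#!/bin/sh
# Write the option file. $1 optionally overrides the target directory
# (default: /etc/modprobe.d) so the function can be tested safely.
write_bnx2_conf() {
    conf_dir="${1:-/etc/modprobe.d}"
    printf '%s\n' 'options bnx2 disable_msi=1' > "$conf_dir/bnx2.conf"
}

# Reload the module so modprobe picks up the new option.
# Needs root, and every bnx2 interface goes down while the module is out.
reload_bnx2() {
    rmmod bnx2 && modprobe bnx2
}
```

After the reload, /proc/interrupts should, as far as I can tell, no longer list the interface under a PCI-MSI interrupt type.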
Regarding what the option is about, I'm not even going to lie about it. I have no idea… We found it with this search:
First hit is our own Sven from sysadmin team:
Second hit is the solution we used:
We also did some tweaking for the large number of TIME_WAIT connections that resulted from this bnx2 bug, namely bumping up net.ipv4.tcp_max_tw_buckets quite a bit.
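To get a feel for how many sockets are sitting in TIME_WAIT, the output of netstat or ss can be counted. A small sketch; count_time_wait is a name invented for this example, and it accepts both spellings of the state because netstat prints TIME_WAIT while ss prints TIME-WAIT.

```shell
#!/bin/sh
# Count TIME_WAIT sockets in `netstat -tan` or `ss -tan` output on stdin.
# Prints 0 when there are none (n + 0 forces a numeric result).
count_time_wait() {
    awk '/TIME[_-]WAIT/ { n++ } END { print n + 0 }'
}
```

For example, netstat -tan | count_time_wait gives the current count, sysctl net.ipv4.tcp_max_tw_buckets shows the current limit, and sysctl -w net.ipv4.tcp_max_tw_buckets=N raises it until the next reboot, where N is a placeholder you pick for your own load.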
- Before rebooting a machine, check what's going to happen: when the last upgrade was, etc.
- In case you have firewall rules, save them with iptables-save > /root/iptables-rules.YYYYMMDD and later restore them if needed with iptables-restore < /root/iptables-rules.YYYYMMDD
- Always check if the conntrack module is enabled. Most of the time you don't need it, and it will cause performance to drop under very high traffic (of course).
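The checks in this list can be scripted. The helpers below are a rough sketch under a few assumptions: a Debian-style /var/log/dpkg.log, the /root/iptables-rules.YYYYMMDD naming used above, and function names (last_upgrades, rules_backup_path, conntrack_loaded) that are invented for this example.

```shell
#!/bin/sh
# Show the most recent package upgrades recorded by dpkg.
# $1 optionally overrides the log path, $2 the number of lines (default 5).
last_upgrades() {
    grep ' upgrade ' "${1:-/var/log/dpkg.log}" | tail -n "${2:-5}"
}

# Print today's firewall backup path in the YYYYMMDD naming scheme.
rules_backup_path() {
    printf '/root/iptables-rules.%s\n' "$(date +%Y%m%d)"
}

# Succeed if any conntrack module shows up in `lsmod` output on stdin.
conntrack_loaded() {
    grep -q 'conntrack'
}
```

Before a reboot that could look like: last_upgrades to see whether a kernel or driver upgrade is waiting, iptables-save > "$(rules_backup_path)" to keep the rules, and lsmod | conntrack_loaded && echo "conntrack is loaded" both before and after the reboot.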
In this case, the conntrack module was also accidentally re-enabled by the reboot. We had previously disabled it, but hadn't made the change permanent. This is because on My Opera we're still not using our config management infrastructure… Looking forward to making that happen. Soon. Hopefully :)