When using Varnish on a high traffic site like opera.com or my.opera.com, it is important to reach a stable and sane configuration (both VCL and general service tuning).
If you're just starting to use Varnish, it's easy to overlook things (like I did, for example :) and later experience crashes or unexpected problems.
Of course, you should read the Varnish wiki, but I'd suggest you also read at least the following links. I found them very useful:
- Kristian Lyngstøl's blog: the Varnish-related posts, but other stuff as well. I had the opportunity to attend a 2-day Varnish training at Linpro, where he taught the course. I can't say enough good things about the advice Kristian gives in his blog. Really, go read it now!
- Other users' mails on the varnish-misc mailing list. In particular, two messages that carry so much helpful information that one could probably study them for a month: this one by Twitter's John Adams, and this one by Audun Ytterdal, now working at VG.no, one of the biggest Norwegian newspapers.
- Artur Bergman's OSCON 2009 talk on Varnish (PDF, or on slideshare). Dense with useful tips; Bergman runs a high-traffic site chaining multiple distributed Varnish servers.
A couple of weeks ago, we experienced some random Varnish crashes, 1 per day on average. That happened during a weekend. As usual, we didn't really notice that Varnish was crashing until we looked at our Munin graphs. Once you know that Varnish is crashing, everything is easier :)
Just look at your syslog file. We did, and we found the following error message:
Feb 26 06:58:26 p26-01 varnishd[19110]: Child (27707) died signal=6
Feb 26 06:58:26 p26-01 varnishd[19110]: Child (27707) Panic message: Missing errorhandling code in HSH_Prepare(), cache_hash.c line 188:#012 Condition((p) != 0) not true. thread = (cache-worker)sp = 0x7f8007c7f008 {#012 fd = 239, id = 239, xid = 1109462166,#012 client = 213.236.208.102:39798,#012 step = STP_LOOKUP,#012 handling = hash,#012 ws = 0x7f8007c7f078 { overflow#012 id = "sess",#012 {s,f,r,e} = {0x7f8007c7f808,,+16369,(nil),+16384},#012 },#012 worker = 0x7f82c94e9be0 {#012 },#012 vcl = {#012 srcname = {#012 "input",#012 "Default",#012 "/etc/varnish/accept-language.vcl",#012 },#012 },#012},#012
Feb 26 06:58:26 p26-01 varnishd[19110]: Child cleanup complete
Feb 26 06:58:26 p26-01 varnishd[19110]: child (3710) Started
Feb 26 06:58:26 p26-01 varnishd[19110]: Child (3710) said Closed fds: 3 4 5 10 11 13 14
Feb 26 06:58:26 p26-01 varnishd[19110]: Child (3710) said Child starts
Feb 26 06:58:26 p26-01 varnishd[19110]: Child (3710) said Ready
Feb 26 18:13:37 p26-01 varnishd[19110]: Child (7327) died signal=6
Feb 26 18:13:37 p26-01 varnishd[19110]: Child (7327) Panic message: Missing errorhandling code in HSH_Prepare(), cache_hash.c line 188:#012 Condition((p) != 0) not true. thread = (cache-worker)sp = 0x7f8008e84008 {#012 fd = 248, id = 248, xid = 447481155,#012 client = 213.236.208.101:39963,#012 step = STP_LOOKUP,#012 handling = hash,#012 ws = 0x7f8008e84078 { overflow#012 id = "sess",#012 {s,f,r,e} = {0x7f8008e84808,,+16378,(nil),+16384},#012 },#012 worker = 0x7f81a4f5fbe0 {#012 },#012 vcl = {#012 srcname = {#012 "input",#012 "Default",#012 "/etc/varnish/accept-language.vcl",#012 },#012 },#012},#012
Feb 26 18:13:37 p26-01 varnishd[19110]: Child cleanup complete
Feb 26 18:13:37 p26-01 varnishd[19110]: child (30662) Started
Feb 26 18:13:37 p26-01 varnishd[19110]: Child (30662) said Closed fds: 3 4 5 10 11 13 14
Feb 26 18:13:37 p26-01 varnishd[19110]: Child (30662) said Child starts
Feb 26 18:13:37 p26-01 varnishd[19110]: Child (30662) said Ready
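If you'd rather not wait for a graph to tell you, the "died signal=" lines are easy to pick out with a few lines of script. Here is a minimal Python sketch, assuming syslog lines in the same format as the excerpt above; the function names are my own, and the log path mentioned in the comment is just an assumption about a typical setup.

```python
import re
from collections import Counter

# Matches the manager's "Child (PID) died signal=N" lines, in the syslog
# format shown in the excerpt above.
DIED_RE = re.compile(
    r"^(?P<day>\w{3}\s+\d+) (?P<time>\d\d:\d\d:\d\d) \S+ varnishd\[\d+\]: "
    r"Child \((?P<pid>\d+)\) died signal=(?P<signal>\d+)"
)


def find_child_deaths(lines):
    """Return a (day, time, pid, signal) tuple for every child death found."""
    deaths = []
    for line in lines:
        m = DIED_RE.match(line)
        if m:
            deaths.append((m["day"], m["time"], int(m["pid"]), int(m["signal"])))
    return deaths


def deaths_per_day(lines):
    """Count child deaths per calendar day (as printed by syslog)."""
    return Counter(day for day, _, _, _ in find_child_deaths(lines))


# A few lines copied from the log above, used as sample input.
sample = [
    "Feb 26 06:58:26 p26-01 varnishd[19110]: Child (27707) died signal=6",
    "Feb 26 06:58:26 p26-01 varnishd[19110]: Child cleanup complete",
    "Feb 26 06:58:26 p26-01 varnishd[19110]: child (3710) Started",
    "Feb 26 18:13:37 p26-01 varnishd[19110]: Child (7327) died signal=6",
]

for death in find_child_deaths(sample):
    print(death)
print(dict(deaths_per_day(sample)))

# On a real machine you would pass an open file instead of the sample list,
# e.g. open("/var/log/syslog") -- the path is an assumption and varies by distro.
```

Run from cron against the real syslog file, something like this could raise an alert the first time a child dies.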
A quick search led me to the sess_workspace parameter.
We found out we had to increase the default (16 KB), especially since we do quite a bit of HTTP header copying and rewriting. For that kind of work, each Varnish thread can use at most sess_workspace bytes of memory.
If you happen to need more space, maybe because clients are sending long HTTP header values, or because you are (like we are) writing lots of additional Varnish-specific headers, then Varnish won't be able to allocate enough memory. It will just write the failed assert condition to syslog, as in the log above, where the child process dies with signal 6 and is restarted, and the request is dropped.
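The panic messages above hint at how full the workspace was when the assert fired. Below is a small Python sketch that pulls the numbers out of the `ws = ... {s,f,r,e} = {...}` part of a panic line. It rests on an assumption: that the second offset is the number of bytes already used and the last offset is the total workspace size. I have not verified that against the Varnish source, but the last value (16384) does match the 16 KB default. The function name is hypothetical.

```python
import re

# Extracts the session workspace dump from a Varnish panic message.
# ASSUMPTION: in "{s,f,r,e} = {start,+used,reserved,+size}" the "+used"
# offset is how much of the workspace is consumed and "+size" is its total
# size. This is an interpretation, not something the log states.
WS_RE = re.compile(
    r"ws = 0x[0-9a-f]+ \{ (?P<overflow>overflow)?.*?"
    r"\{s,f,r,e\} = \{0x[0-9a-f]+,+\+(?P<used>\d+),[^,]*,\+(?P<size>\d+)\}"
)


def workspace_usage(panic_text):
    """Return used/size/free/overflow for the workspace, or None if absent."""
    m = WS_RE.search(panic_text)
    if m is None:
        return None
    used, size = int(m["used"]), int(m["size"])
    return {
        "used": used,
        "size": size,
        "free": size - used,
        "overflow": m["overflow"] is not None,
    }


# Fragment copied from the first panic message in the log above.
panic = (
    'ws = 0x7f8007c7f078 { overflow#012 id = "sess",#012 '
    "{s,f,r,e} = {0x7f8007c7f808,,+16369,(nil),+16384},#012 },"
)

print(workspace_usage(panic))
```

Under that reading, both crashes happened with only a handful of bytes left out of 16384, and with the `overflow` flag set, which fits the explanation above.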
So, we bumped sess_workspace to 256 KB by setting the following in the startup file:
-p sess_workspace=262144
Since then, we haven't had any more crashes.
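The value passed to -p is in bytes, so 256 KB becomes 256 × 1024 = 262144. If you want to try other sizes, here is a tiny Python helper that does the arithmetic and prints the option; the helper name is made up for illustration, and only the `-p sess_workspace=...` form itself comes from the setup described above.

```python
def sess_workspace_flag(kib):
    """Build the varnishd -p option for a session workspace of `kib` KB."""
    if kib <= 0:
        raise ValueError("workspace size must be positive")
    # sess_workspace is given in bytes: 1 KB = 1024 bytes here.
    return f"-p sess_workspace={kib * 1024}"


print(sess_workspace_flag(16))   # the default size mentioned above
print(sess_workspace_flag(256))  # the value we ended up using
```

Keep in mind that, as described above, this is memory each thread may use, so a much larger value multiplies across all your worker threads.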
This may just be the reason for a Varnish 5.1.3 crash I just experienced. Will test that parameter out. Already many thanks for sharing!