Some months ago, while investigating on a problem in our infrastructure, I put together a small tool to help detecting TCP retransmits happening during HTTP requests.
TCP retransmissions can happen, for example, when a client sends a SYN packet to the server, the server responds with a SYN-ACK but, for any reason, the client never receives the SYN-ACK. In this case, the client correctly waits for a given time, called the TCP Retransmission Timeout. In the usual case, this time is set to 3 seconds.
There's probably a million reasons why the client may never receive a SYN-ACK. The one I've seen more often is packet loss, which in turn can have lots of reasons, for example a malfunctioning or misconfigured network switch.
However, you can immediately spot if your timeout/hang problems are caused by TCP retransmission because they happen to cause response times that are unusually frequently distributed around 3, 9 and 21 seconds (and on, of course).
In fact, the TCP retransmission timeout starts at 3 seconds, but if the client tries to resend after a timeout and still receives no answer, it doubles the wait to 6 s, so the total response time will be 9 seconds, assuming that the client now finally receives the SYN-ACK. Otherwise, 3 + 6 + 12 = 21, then 3 + 6 + 12 + 24 = 45 s and so on and so forth.
So, this little tool fires a quick batch of HTTP requests to a given server and measures the response times, highlighting slow responses (> 0.5s). If you see that the reported response times are 3.002s, 9.005s or similar, then you are probably in presence of TCP retransmission and/or packet loss.
Finally, here it is:
#!/usr/bin/env perl
#
# https://gist.github.com/1101500
#
# Fires HTTP request batches at the specified hostname
# and analyzes the response times.
#
# If you have suspicious frequency of 3.00x, 9.00x, 21.00x
# seconds, then most probably you have a problem of packet loss
# in your network.
#
# cosimo@opera.com, sometime in 2011
#
use strict;
use LWP::UserAgent ();
use Time::HiRes ();
$| = 1;
my $ua = LWP::UserAgent->new();
$ua->agent("$0/0.01");
# Tests this hostname
my $server = $ARGV[0] || die "Usage: $0 <hostname>n";
# Picks the URLs to test in this list, one after the other
my @url_pool = qw(
/ping.html
);
my $total_reqs = 0;
my $total_elapsed = 0.0;
my $n_pick = 0;
my $url_to_fire;
my $max_elapsed = 0.0;
my $max_elapsed_when = '';
my $failed_reqs = 0;
my $slow_responses = 0;
my $terminate_now = 0;
sub output_report {
print "Report for: $server at " . localtime() . "n";
printf "Total requests: %d in %.3f sn", $total_reqs, $total_elapsed;
print "Failed requests: $failed_reqsn";
print "Slow responses (>1s): $slow_responses (slowest $max_elapsed s at $max_elapsed_when)n";
printf "Average response time: %.3f s (%.3f req/s)n", $total_elapsed / $total_reqs, $total_reqs / $total_elapsed;
print "--------------------------------------------------------------------n";
sleep 1;
return;
}
$SIG{INT} = sub { $terminate_now = 1 };
while (not $terminate_now) {
$url_to_fire = "http://" . $server . $url_pool[$n_pick];
my $t0 = [ Time::HiRes::gettimeofday() ];
my $resp = $ua->get($url_to_fire);
my $elapsed = Time::HiRes::tv_interval($t0);
$failed_reqs++ if ! $resp->is_success;
$total_reqs++;
$total_elapsed += $elapsed;
if ($elapsed > $max_elapsed) {
$max_elapsed = $elapsed;
$max_elapsed_when = scalar localtime;
printf "[SLOW] %s, %s served in %.3f sn", $max_elapsed_when, $url_to_fire, $max_elapsed;
}
$slow_responses++ if $elapsed >= 0.5;
$n_pick = 0 if ++$n_pick > $#url_pool;
output_report() if $total_reqs > 0 and ($total_reqs % 1000 == 0);
}
continue {
Time::HiRes::usleep(100000);
}
output_report();
# End
It's also published here on Github, https://gist.github.com/1101500. Have fun!