Say you run an Apache server with tens, perhaps hundreds, of virtual hosts on it. You see a load spike and you're left wondering which virtual host just landed on the front page of Reddit. Or maybe you need to know, right now, which site Googlebot has chosen to spider heavily.
This is a fairly common scenario for web server sysadmins. Unfortunately, I never found an Apache module that showed just this. It turns out that all I needed was tcpdump and a Perl script I threw together in five minutes.
The script reads the raw dump coming from tcpdump with the -s option set to the maximum packet length (65535 octets) and the -w option set to "-", which makes tcpdump write the raw dump to STDOUT. A regular expression matches the Host: line from the HTTP headers and counts each host in a hash. Every 5 seconds a SIGALRM is triggered and the handler prints the hosts sorted by request count, in descending order.
Now you can see the most visited hosts listed at the top, updated every 5 seconds. It's as simple a snapshot as can be of what's going on with your virtual host server, without all the detailed memory information provided by mod_status.
I've never been able to keep up with the multitude of Apache modules available out there. Perhaps others may chip in and provide an alternative method of doing this using a standard module instead.
So, here’s the hack.
Copy the following script and name it tcpd_host_filter.pl
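#!/usr/bin/perl
# http://zefonseca.com/blogs/zen/
use strict;
use warnings;

our %hosts       = ();    # request count per Host header value
our $match_count = 0;     # total number of Host headers seen so far

# Print the statistics every 5 seconds.
$SIG{ALRM} = \&dump_stats;
alarm(5);

# Read the raw capture from STDIN, one line at a time.
while (<>) {
    # Match the Host header and capture its value.
    if ( m{Host:\s+(\S+)} ) {
        $hosts{$1}++;
        $match_count++;
    }
}

sub dump_stats {
    print "\n\nACTIVE HOSTS\n";
    # Busiest hosts first.
    foreach my $host ( sort { $hosts{$b} <=> $hosts{$a} } keys %hosts ) {
        my $ratio = $hosts{$host} / $match_count;
        printf "%-5d %-32s %.2f %% \n", $hosts{$host}, $host, $ratio * 100;
    }
    alarm(5);    # re-arm the timer for the next report
}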
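Before pointing the script at live traffic, you may want to try the matching and counting logic offline. The following is a minimal sketch rather than part of the original script: count_hosts is a hypothetical helper, and the sample requests are made up for illustration. It reads from an in-memory filehandle, so it needs neither tcpdump nor root privileges.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Made-up sample input: three plain-text HTTP requests as they might appear
# inside a capture. Real tcpdump output also contains binary packet data.
my $sample = join '',
    "GET / HTTP/1.1\r\nHost: site1.com\r\nUser-Agent: test\r\n\r\n",
    "GET /about HTTP/1.1\r\nHost: site2.com\r\n\r\n",
    "GET /feed HTTP/1.1\r\nHost: site1.com\r\n\r\n";

# count_hosts is a hypothetical helper (not in the original script).
# It applies the same Host regex to every line read from the given
# filehandle and returns a hash reference of host => request count.
sub count_hosts {
    my ($fh) = @_;
    my %hosts;
    while ( my $line = <$fh> ) {
        $hosts{$1}++ if $line =~ m{Host:\s+(\S+)};
    }
    return \%hosts;
}

# Open the string as if it were a file and count the hosts in it.
open my $fh, '<', \$sample or die "cannot open in-memory handle: $!";
my $hosts = count_hosts($fh);
close $fh;

# Print the busiest hosts first.
foreach my $host ( sort { $hosts->{$b} <=> $hosts->{$a} } keys %$hosts ) {
    printf "%-5d %s\n", $hosts->{$host}, $host;
}
```

With the sample data above, site1.com should be listed first with 2 requests, followed by site2.com with 1.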
Now make the script executable.
chmod 755 tcpd_host_filter.pl
Now make tcpdump dump everything coming in and out of the server, then filter it using our script.
tcpdump -s 65535 -w - | ./tcpd_host_filter.pl
There you go. Every 5 seconds it prints out how many hits each host has received since the script started, and what percentage that is of the total number of hits to the WWW server.
Sample output:
ACTIVE HOSTS
120   site1.com                        30.61 %
116   site2.com                        29.59 %
74    zefonseca.com                    18.88 %
29    site3.com                        7.40 %
9     site4.com                        2.30 %
8     site5.com                        2.04 %
7     site6.com                        1.79 %
7     site7.com                        1.79 %
[ ... ]
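The counts in this report are cumulative, since the script never clears its hash. If you would rather see only the traffic from the last interval, one possible variation is sketched below. It is an assumption about how you might adapt the script, not the original behavior: format_report is a hypothetical helper that builds the report lines, breaks ties by host name, and then empties the hash so the next report starts from zero.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# format_report is a hypothetical helper (not in the original script).
# It takes a reference to the host => count hash, returns one formatted
# line per host with the busiest first (ties broken by name), and then
# empties the hash so the next report covers only the next interval.
sub format_report {
    my ($hosts) = @_;

    # Total hits in this interval, used for the percentage column.
    my $total = 0;
    $total += $_ for values %$hosts;

    my @lines;
    foreach my $host (
        sort { $hosts->{$b} <=> $hosts->{$a} or $a cmp $b } keys %$hosts )
    {
        # $total is never zero here: the loop only runs if the hash has entries.
        push @lines,
            sprintf( "%-5d %-32s %.2f %%",
                $hosts->{$host}, $host, 100 * $hosts->{$host} / $total );
    }

    %$hosts = ();    # start the next interval from zero
    return @lines;
}

# Made-up counts for illustration.
my %hosts = ( 'site1.com' => 3, 'site2.com' => 1 );
print "$_\n" for format_report( \%hosts );
```

In the main script, dump_stats could print the lines returned by format_report( \%hosts ) instead of looping over the hash itself.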
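Feel free to adapt the tcpdump line to your needs, especially the packet filtering options. For instance, appending a filter expression such as 'tcp dst port 80' captures only traffic headed to the standard HTTP port, which cuts down the amount of data the script has to read.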
The usual disclaimer: You may want to try this on a test server first. I’ve tested this and used it actively on several production servers, and it never caused any problems. Then again I can’t be responsible if it does something unexpected for you. As far as I know it can’t possibly hurt your server, but it’s pretty much standard to include a disclaimer for everything these days, so there you go: the script is provided as-is, etc, no guarantees for any purpose, etc, blah blah so use it at your own risk.