Say you run an Apache server with tens, perhaps hundreds, of virtual hosts on it. You see a load spike and you’re left wondering which virtual host is the one featured on the front page of Reddit. Or maybe you need to know, instantly, which site Googlebot has chosen to spider heavily.
This is a fairly common scenario for web server sysadmins. Unfortunately I never found an Apache module that showed just this. It turns out that just what I needed was tcpdump and a Perl script I threw together in 5 minutes.
The script reads the raw dump coming from tcpdump, run with the -s option set to the maximum packet length (65535 octets) and the -w option set to “-”, which makes it write the raw dump to STDOUT. A regular expression matches the Host: line from the HTTP headers and counts each host in a hash. Every 5 seconds a SIGALRM fires and its handler prints the hosts sorted in descending order of request count.
Now you can see the most visited hosts listed at the top, updated every 5 seconds. It’s as simple a snapshot as can be of what’s going on with your virtual host server, without all the detailed memory information provided by mod_status.
I’ve never been able to keep up with the multitude of Apache modules available out there. Perhaps others may chip in and provide an alternative method of doing this using a standard module instead.
So, here’s the hack.
Copy the following script and name it tcpd_host_filter.pl
#!/usr/bin/perl
# http://zefonseca.com/blogs/zen/
use strict;
use warnings;
our %hosts = ();
our $match_count = 0;
$SIG{ALRM} = \&dump_stats;
alarm(5);
while (<>) {
    # count every Host: header seen in the raw capture
    if ( m{Host:\s+(\S+)} ) {
        $hosts{$1}++;
        $match_count++;
    }
}
sub dump_stats {
    print "\n\nACTIVE HOSTS\n";
    # most requested hosts first
    foreach my $host ( sort { $hosts{$b} <=> $hosts{$a} } keys %hosts ) {
        my $ratio = $hosts{$host} / $match_count;
        printf "%-5d %-32s %.2f %% \n", $hosts{$host}, $host, $ratio*100;
    }
    # re-arm the timer for the next report
    alarm(5);
}
Now make the script executable.
chmod 755 tcpd_host_filter.pl
Now make tcpdump dump everything coming in and out of the server, then filter it using our script.
tcpdump -s 65535 -w - | ./tcpd_host_filter.pl
There you go. Every 5 seconds it prints out how many hits each host received and what percentage it is compared to the total number of hits to the WWW server.
Sample output:
ACTIVE HOSTS
120   site1.com                        30.61 %
116   site2.com                        29.59 %
74    zefonseca.com                    18.88 %
29    site3.com                        7.40 %
9     site4.com                        2.30 %
8     site5.com                        2.04 %
7     site6.com                        1.79 %
7     site7.com                        1.79 %
[ ... ]
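Feel free to adapt the tcpdump line to your needs, especially the packet filtering options. Appending a filter expression such as 'tcp dst port 80', for example, limits the capture to incoming HTTP requests.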
The usual disclaimer: You may want to try this on a test server first. I’ve tested this and used it actively on several production servers, and it never caused any problems. Then again I can’t be responsible if it does something unexpected for you. As far as I know it can’t possibly hurt your server, but it’s pretty much standard to include a disclaimer for everything these days, so there you go: the script is provided as-is, etc, no guarantees for any purpose, etc, blah blah so use it at your own risk.
This may well be a useless post, since you should never, ever, use strtok() in any serious application. I’ll write about it because first, strtok() is part of the ANSI C standard libraries and is probably lurking in your operating system core somewhere, and second, because it is a good example of flawed design principles which got by the best minds in the business and ended up in the C standard library.
If you’re still reading, you probably know that strtok() is a standard C function, declared in the string.h standard header. It takes the string you wish to tokenize as its first parameter and a string of delimiter characters as its second. The first call returns the first token; subsequent calls with NULL as the first parameter return the successive tokens found between delimiters, and NULL once the string is exhausted.
Here’s an example:
The line "col1:col2:col3:col4" could be tokenized as follows:
char *data = "col1:col2:col3:col4";
char *tok = NULL;
// no allocation needed: strtok() returns pointers into data itself
tok = strtok(data, ":");
printf("%s\n", tok);
while ( (tok = strtok(NULL, ":")) != NULL ) {
printf("%s\n", tok);
}
The above example will *not* work; it only illustrates the intended use of strtok(). Why won’t it work? Because char *data points to a string literal, which is typically stored in read-only memory, and strtok() will attempt to modify it(!!) in order to do its job: it overwrites each delimiter it finds with a null byte. Yes, strtok() has serious side effects. It does not take a copy of the original string, it actually alters your data. This, in itself, is a deadly sin and reason enough to avoid this function at all costs.
strtok() is also not reentrant. It keeps its parsing position in hidden static state, so if another thread, or just another function called in the middle of your loop, calls strtok() with a different initial string, the next calls with NULL will parse that string instead. This state is shared, so any other part of your system that calls strtok() can break your tokenizing loop.
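The interference does not even need threads. As a minimal sketch (the function name and the sample data are made up for illustration), the following nests one strtok() loop inside another. The outer buffer holds three ';'-separated records, but the inner loop takes over strtok()'s hidden state, so the outer loop only ever sees the first one:

```c
#include <assert.h>
#include <string.h>

/* Hypothetical demo: returns how many ';'-separated records the outer
 * loop visits when an inner strtok() loop runs in between its calls. */
int count_records_with_nested_strtok(void)
{
    char records[] = "a,b;c,d;e,f";   /* three records, two fields each */
    char fields[16];                  /* scratch copy for the inner loop */
    int seen = 0;

    for (char *rec = strtok(records, ";"); rec != NULL; rec = strtok(NULL, ";")) {
        seen++;
        assert(strlen(rec) < sizeof fields);
        strcpy(fields, rec);

        /* This inner loop replaces strtok()'s saved position with one
         * inside fields and runs it to exhaustion. */
        for (char *f = strtok(fields, ","); f != NULL; f = strtok(NULL, ",")) {
            /* a real program would process each field here */
        }
        /* The outer strtok(NULL, ";") now continues the exhausted inner
         * scan, gets NULL, and the remaining records are never visited. */
    }
    return seen;
}
```

Calling count_records_with_nested_strtok() yields 1 rather than 3, and nothing in either loop hints at why.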
Here are two examples of this hideous function in action. The following short program produces a Bus Error or Segmentation Fault, depending on which OS you run it on:
/*
============================================================================
Name : strtok.c
Copyleft : ZeFonseca.com
Description : Collateral effects in strtok()
============================================================================
*/
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
int main(void) {
    // tok needs no memory of its own: strtok() returns pointers into its input
    char *tok = NULL;
    // temp points to a string literal, which must not be modified
    char *temp = "name:address:telephone:email";
    // strtok() writes a null byte over the first ':' in the literal;
    // this will generate a Bus Error, program ends here
    tok = strtok(temp, ":");
    printf("%s\n", tok);
    fflush(stdout);
    while ( (tok = strtok(NULL, ":")) != NULL ) {
        printf("%s\n", tok);
        fflush(stdout);
    }
    return EXIT_SUCCESS;
}
When strtok() attempts to modify the string temp points to, the OS immediately kills the process because it tried to write to a read-only memory region. The following paragraph, extracted from the GNU libc documentation, explains why:
String literals appear in C program source as strings of characters between double-quote characters (‘"’) where the initial double-quote character is immediately preceded by a capital ‘L’ (ell) character (as in L"foo"). In ISO C, string literals can also be formed by string concatenation: "a" "b" is the same as "ab". For wide character strings one can either use L"a" L"b" or L"a" "b". Modification of string literals is not allowed by the GNU C compiler, because literals are placed in read-only storage.
The following program works around strtok()’s annoyances by first copying the string literal in temp to a writable memory region allocated at run-time, so strtok() can mutilate the copy instead of the literal. It also keeps a second, untouched copy to show what strtok() does to its input.
/*
============================================================================
Name : strtok2.c
Copyleft : ZeFonseca.com
Description : A frankensteinish detour of some of strtok()'s defects
============================================================================
*/
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#define STR_BUFF_SIZE 1024
int main(void) {
    // we'll work with this buffer, copying temp here
    char *dataline = NULL;
    // I'll store a copy of the original data on dataline2
    char *dataline2 = NULL;
    // tok will point at each parsed token inside dataline
    char *tok = NULL;
    // temp holds a line of tabular data we want to parse
    const char *temp = "name:address:telephone:email";
    // we copy strlen + 1 null byte
    size_t dll = strlen(temp) + 1;
    dataline = (char *)calloc(STR_BUFF_SIZE, 1);
    dataline2 = (char *)calloc(STR_BUFF_SIZE, 1);
    if ( dataline == NULL || dataline2 == NULL || dll > STR_BUFF_SIZE ) {
        fprintf(stderr, "Unable to set up two %d byte buffers.\n", STR_BUFF_SIZE);
        free(dataline);
        free(dataline2);
        return EXIT_FAILURE;
    }
    memcpy(dataline, temp, dll);
    memcpy(dataline2, dataline, dll);
    printf("string 1 before: %s\n", dataline);
    printf("string 2 before: %s\n", dataline2);
    // strtok() will no longer cause a Bus Error
    // dataline will gladly accept strtok()'s tampering
    // of the input data
    tok = strtok(dataline, ":");
    printf("%s\n", tok);
    fflush(stdout);
    while ( (tok = strtok(NULL, ":")) != NULL ) {
        printf("%s\n", tok);
        fflush(stdout);
    }
    // strtok has placed null bytes in our input data
    // so this printf will only print the first column name
    printf("string 1 after: %s\n", dataline);
    // while this printf will print the entire input line
    printf("string 2 after: %s\n", dataline2);
    free(dataline);
    free(dataline2);
    // we've made it here, we can call it
    // a success, considering we used strtok()
    return EXIT_SUCCESS;
}
Which design principles have been broken in the strtok() implementation?
1) It is non-orthogonal, because it changes the input data. This is a big no-no, an extremely unwelcome side effect. This sort of problem gets especially troublesome in larger systems where unexpected interactions cause bugs that are extremely hard to track down. For example, if some part of your system were to commit the original data back to a database after using strtok(), you’d be permanently altering the original input.
2) It uses a static variable to store state, making other unexpected interactions possible, such as a call to strtok() in one part of the system changing the state of a consumer elsewhere. Once more, one part of the system alters another in an unexpected way: orthogonality again.
Lastly, though not a design issue in itself, it was included in a standard library for widespread use despite these serious issues.
Conclusion
As you already knew, strtok() is a defective part of the C library and should not be used. There are substitutes for strtok() in most frameworks and libraries.
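For illustration only, here is a minimal sketch of what a tokenizer without those two flaws could look like. The name next_token() and its signature are my own invention, not part of any library: it takes the input as const char *, reports each token as a start pointer plus a length instead of writing null bytes, and keeps its position in a cursor owned by the caller. It relies only on the standard strspn() and strcspn() functions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns a pointer to the next token at or after
 * *cursor and stores its length in *len, or returns NULL when no tokens
 * remain. The input is never written to, and all state lives in *cursor. */
const char *next_token(const char **cursor, const char *delims, size_t *len)
{
    assert(cursor != NULL && delims != NULL && len != NULL);
    if (*cursor == NULL)
        return NULL;

    /* skip any delimiters in front of the token */
    const char *start = *cursor + strspn(*cursor, delims);
    if (*start == '\0') {
        *cursor = NULL;               /* input exhausted */
        return NULL;
    }

    *len = strcspn(start, delims);    /* token runs up to the next delimiter */
    *cursor = start + *len;           /* resume here on the next call */
    return start;
}
```

Since the tokens are not null-terminated, a caller would print one with printf("%.*s\n", (int)len, tok) or copy it out with memcpy(). The price is that the caller has to carry the length around, which is the trade-off for leaving the input alone.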
Thou shalt not substitute strsep() for strtok(): it keeps its position in a pointer owned by the caller rather than in a static variable, but it suffers from the same core problem, as both mess up the input data.
strtok_r() is a reentrant version of strtok() (the _r likely stands for reentrant). It uses a third parameter to store parsing state between calls. It solves the reentrancy problem but still modifies the first argument, which remains a serious design flaw in my opinion. Other variations exist, such as strtok_s() on Microsoft platforms, but I believe every function in this family alters the first parameter.
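To make both halves of that claim concrete, here is a small sketch using POSIX strtok_r(). The function name count_fields() and the sample data are assumptions for illustration, and the feature-test macro on the first line is there so that strict compiler modes still declare strtok_r(). Each loop keeps its own state pointer, so nesting works, yet the caller's buffer still comes back chopped up.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <string.h>

/* Hypothetical demo: counts all ','-separated fields across the
 * ';'-separated records in a writable buffer. */
int count_fields(char *records)
{
    assert(records != NULL);
    int fields = 0;
    char *outer_state = NULL;         /* position of the record loop */

    for (char *rec = strtok_r(records, ";", &outer_state);
         rec != NULL;
         rec = strtok_r(NULL, ";", &outer_state)) {
        char *inner_state = NULL;     /* independent position for the field loop */

        for (char *f = strtok_r(rec, ",", &inner_state);
             f != NULL;
             f = strtok_r(NULL, ",", &inner_state)) {
            fields++;
        }
    }
    return fields;
}
```

Given a writable buffer holding "a,b;c,d;e,f", count_fields() returns 6, so the nested loops no longer trample each other. Afterwards, though, the buffer reads as just "a" when treated as a string, because every delimiter has been overwritten with a null byte.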