This may well be a useless post, since you should never, ever, use strtok() in any serious application. I’ll write about it because first, strtok() is part of the ANSI C standard library and is probably lurking in your operating system core somewhere, and second, because it is a good example of flawed design principles which got past the best minds in the business and ended up in the C standard library.
In case you’re lost but still reading: strtok() is a standard C function, declared in the string.h standard header file. The first call takes the string you wish to tokenize as its first parameter and a string of delimiter characters as its second, and returns the first token. Future calls with a NULL first parameter will hopefully return the successive strings found in between delimiters, until the function returns NULL when none are left.
Here’s an example:
The line “col1:col2:col3:col4” could be tokenized as follows:
char *data = "col1:col2:col3:col4";
char *tok;
// no allocation needed: strtok() returns a pointer into data
tok = strtok(data, ":");
printf("%s\n", tok);
while ( (tok = strtok(NULL, ":")) != NULL ) {
printf("%s\n", tok);
}
The above example will *not* work; it only illustrates the intended regular use of strtok(). Why won’t it work? Because char *data points to a string literal, which must not be modified, and strtok() will attempt to modify it(!!) in order to do its job: it overwrites each delimiter it finds with a null byte. Yes, strtok() has serious side effects. It does not take a copy of the original string, it actually alters your data. This, in itself, is a deadly sin and reason enough to avoid this function at all costs.
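strtok() is also not reentrant. It keeps its parsing position in hidden static state, so if another thread, or simply another function called from inside your tokenizing loop, calls strtok() with a different initial string, the next calls with a NULL first parameter will parse that string instead. This state is typically global, so any other part of your system which calls strtok() can break your parsing.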
Here are two examples of this hideous function in action. The following short program produces a Bus Error or Segmentation Fault, depending on which OS you run it on:
/*
============================================================================
Name : strtok.c
Copyleft : ZeFonseca.com
Description : Collateral effects in strtok()
============================================================================
*/
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
int main(void) {
char *tok = NULL;
// temp points to a string literal, which must not be modified
char *temp = "name:address:telephone:email";
// no buffer is allocated for tok: strtok() returns a pointer into its input
// this will generate a Bus Error, program ends here
tok = strtok(temp, ":");
printf("%s\n", tok);
fflush(stdout);
while ( (tok = strtok(NULL, ":")) != NULL ) {
printf("%s\n", tok);
fflush(stdout);
}
return EXIT_SUCCESS;
}
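When strtok() attempts to modify the string literal that temp points to, the OS immediately kills the process, as it was trying to change a read-only memory region. (Formally, modifying a string literal is undefined behavior in C, so a crash is the typical outcome rather than a guaranteed one.) The following paragraph, extracted from the GNU libc documentation, explains why: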
String literals appear in C program source as strings of characters between double-quote characters (‘"’) where the initial double-quote character is immediately preceded by a capital ‘L’ (ell) character (as in L"foo"). In ISO C, string literals can also be formed by string concatenation: "a" "b" is the same as "ab". For wide character strings one can either use L"a" L"b" or L"a" "b". Modification of string literals is not allowed by the GNU C compiler, because literals are placed in read-only storage.
The following program works around strtok()’s annoyances by first copying the string literal in temp over to a dynamic memory region allocated at run-time. Because this region is writable, it can absorb strtok()’s rogue mutilation of our input data. A second copy preserves the original line for comparison.
/*
============================================================================
Name : strtok2.c
Copyleft : ZeFonseca.com
Description : A frankensteinish detour of some of strtok()'s defects
============================================================================
*/
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#define STR_BUFF_SIZE 1024
int main(void) {
// we'll work with this buffer, copying temp here
char *dataline = NULL;
// I'll store a copy of the original data on dataline2
char *dataline2 = NULL;
// tok will hold our parsed tokens
char *tok = NULL;
// temp holds a line of tabular data we want to parse
char *temp = "name:address:telephone:email";
// dll is strlen + 1 null byte, well under STR_BUFF_SIZE for this line
size_t dll = strlen(temp) + 1;
dataline = (char *)calloc(STR_BUFF_SIZE, 1);
dataline2 = (char *)calloc(STR_BUFF_SIZE, 1);
// tok needs no buffer: strtok() returns pointers into dataline
if ( dataline == NULL || dataline2 == NULL ) {
fprintf(stderr, "Out of memory. Unable to allocate %d bytes per buffer.\n", STR_BUFF_SIZE);
free(dataline);
free(dataline2);
return EXIT_FAILURE;
}
memcpy(dataline, temp, dll);
memcpy(dataline2, dataline, dll);
printf("string 1 before: %s\n", dataline);
printf("string 2 before: %s\n", dataline2);
// strtok() will no longer cause a Bus Error
// dataline will gladly accept strtok()'s tampering
// of the input data
tok = strtok(dataline, ":");
printf("%s\n", tok);
fflush(stdout);
while ( (tok = strtok(NULL, ":")) != NULL ) {
printf("%s\n", tok);
fflush(stdout);
}
// strtok has placed null bytes in our input data
// so this printf will only print the first column name
printf("string 1 after: %s\n", dataline);
// while this printf will print the entire input line
printf("string 2 after: %s\n", dataline2);
// we've made it here, we can call it
// a success, considering we used strtok()
free(dataline);
free(dataline2);
return EXIT_SUCCESS;
}
Which design principles have been broken in the strtok() implementation?
1) It is not orthogonal, because it changes the input data. This is a big no-no, an extremely unwelcome side effect. This sort of problem gets especially troublesome in larger systems, where unexpected interactions cause bugs that are extremely hard to track down. For example, if some part of your system were to commit the original data back to a database after using strtok(), you’d be permanently altering the original input.
2) It uses a static variable to store state, making other unexpected interactions possible, such as a call to strtok() in one part of the system changing the parsing state of another consumer elsewhere. One part of the system is once more altering another part in an unexpected way: orthogonality again.
Lastly, although this is not a design issue of the function itself, it was included in a standard library for widespread use despite these serious issues.
Conclusion
As you already knew, strtok() is a defective part of the C library and should not be used. There are substitutes for strtok() in most frameworks and libraries.
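Thou shalt not substitute strsep() for strtok() either. strsep() keeps its position in a caller-supplied pointer rather than in a hidden static variable, but it shares the main flaw: it also messes up the input data.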
strtok_r() is a reentrant version of strtok() (the _r likely stands for reentrant), specified by POSIX. It uses a third parameter to store parsing state in between calls. It solves the reentrancy problem but still modifies the first argument, which remains a serious design flaw in my opinion. Other variations exist, such as strtok_s() on Microsoft platforms, but I believe every function in this family alters the first parameter.