On the performance of strlen

by Fred Mameri


I was working on an optimization problem when I started wondering just how efficient the implementation of the C standard library (libc) is. I decided to start my investigation with one of the simplest functions in the whole libc: strlen.

The experiment I designed for measuring the performance of strlen is as follows:

  1. I generated a large (around 71K) file with text from Lorem Ipsum
  2. A program opens that file, fseeks to the end, ftells the current offset, fseeks back to the beginning, allocates as many bytes as required, and freads the whole file into that memory region.
  3. Finally, it calls mystrlen 100,000 times.

mystrlen is a wrapper around strlen as follows:

size_t mystrlen(const char *text, size_t len) {
    size_t l;

    l = strlen(text);
    if (l != len) {
        fprintf(stderr,
            "error in size function\n"
            "found: %zu, expected: %zu\n",
            l,
            len);
        exit(1);
    }

    return l;
}

After running this function 100K times, we will have scanned through 7,186,600,000 characters (yes, that’s over 7 billion characters!). Now let’s see just how fast that can be done.

I defined 4 implementations of strlen with different performance characteristics:

  1. c
  2. std
  3. asm
  4. fastc

c is a vanilla C implementation of strlen, the way any freshman computer science student would implement it.
I consider this the reference implementation: being pure C, it tests the compiler’s ability to perform optimizations, and it serves as a baseline for the other, less intuitive approaches.

size_t __cdecl strlen (const char* text) {
    size_t count = 0;
    while (*text++) count++;
    return count;
}

std is the implementation shipped with libc. It is often machine-dependent and written directly in Assembly language.

asm is an Assembly language implementation of strlen. It uses x86-specific string instructions.
The implementation below uses gcc’s syntax for inline assembly and AT&T assembly syntax for the assembly language.
The same code could’ve easily been written in an assembly file as a procedure called _strlen, and then linked against the C object containing the main function.

size_t __cdecl strlen (const char* text) {
    size_t count;
    asm("subl   %%ecx, %%ecx;"
        "not    %%ecx       ;"
        "movl   %1, %%edi   ;"
        "sub    %%al, %%al  ;"
        "cld                ;"
        "repne  scasb       ;"
        "notl   %%ecx       ;"
        "decl   %%ecx       ;"
        "movl   %%ecx, %0   ;"
        :"=r"(count)            /* output */
        :"r"(text)              /* input */
        :"%eax", "%ecx", "%edi", "memory" /* clobbered registers; "memory" because we read the string */
    );
    return count;
}

fastc is a C implementation that is not at all obvious. It is very machine-dependent, but it is very interesting to us because it allows us to test the compiler’s optimizing abilities, especially against asm.
We’ll revisit the details of fastc later.

We’ll test each of these 4 implementations of strlen in 3 different compiler output modes:

  1. No optimizations, generic x86 target architecture
  2. Full optimizations (-O3), generic x86 target architecture
  3. Full optimizations (-O3), core2 target architecture (-march=core2)

With the problem introduced, let’s look at the results.

With no optimizations, targeting a generic x86 architecture, we got the following results:

Implementation   Total time
asm              10.8830 s
c                17.9130 s
fastc             4.5830 s
std               1.7960 s

These results are actually along the lines of what I expected. asm is faster than c, and std is the fastest one. But how can fastc be faster than asm?

The answer to that question is in what fastc does, and how it is different from asm and c.

If you take a closer look at both asm and c, you’ll notice that they both operate on individual bytes. In other words, there’s a lot of data being moved (remember, we are operating on more than 7 GB of textual data). Even with a large cache, each byte needs to be moved from cache to a register. It’s actually surprising we can process 7 GB of data in about 10 seconds!

Which brings us to fastc. fastc is an implementation that operates on the whole word at the same time:

size_t __cdecl strlen (const char* text) {
    const char *s;
    const uint32_t* pdwText;
    register uint32_t dwText;

    pdwText = (const uint32_t*)text;
    while (1) {
        dwText = *pdwText;

        if ((dwText - 0x01010101UL) & ~dwText & 0x80808080UL) {
            s = (const char*)pdwText;
            while (*s) s++;
            return s - text;
        }

        pdwText++;
    }
}

This algorithm operates on as many bytes as can fit in a register (in the case of this 32-bit program, 4 bytes).
The if check is verifying whether any byte in that word is 0x00. This technique was described in Hacker’s Delight.

This algorithm can be further improved. Before we start analyzing words, we might want to make sure the pointer is word-aligned. If it’s not, we can fall back to the simple c algorithm until the pointer reaches a word boundary. Additionally, the inner while loop (the one executed once we’ve found a 0x00 byte) can be optimized.

These results made sense, right? Now, when we turn on optimizations, things take a surprising turn.

With full optimizations (-O3), targeting a generic x86 architecture, we got the following results:

Implementation   Total time
asm              10.8750 s
c                 3.4280 s
fastc             1.7430 s
std              10.8650 s

Before we proceed, I must say that the performance numbers change from run to run (even when averaged out), so keep that in mind.

Unsurprisingly, the numbers for asm didn’t change. We manually wrote it in Assembly, so turning on compiler optimizations wouldn’t affect it.

The fully optimized c is now faster than fastc with no optimizations (which shows that the C compiler can do a fantastic job optimizing). Which brings me to my first conclusion: oftentimes, writing simple C code is good enough. Of course, fastc is still faster than c (although now only twice as fast, as opposed to 4x before). Second conclusion: even with the best optimizing compilers, there’s still a difference between efficient code and inefficient code.

Both c and fastc are now faster than asm. That’s because the compiler can eliminate much of the redundant byte-at-a-time memory traffic.

The big surprise here was std. It is unclear to me why std is so slow, but one thing is clear: there’s no statistical difference between the runtime of asm and std, which leads me to believe that std is doing it one byte at a time. This was so surprising to me that I re-compiled and re-ran the tests several times.

My theory is that, with full optimizations on, the compiler was at liberty to change the code as it saw fit. But targeting a generic x86 architecture, it ended up producing something that would run fastest on lower-end machines, whereas fastc is hand-optimized for 32-bit machines. That theory can’t explain c being faster than std, though. If you can explain it, please leave a comment.

Lastly, with full optimizations (-O3), targeting the core2 architecture (-march=core2), we got the following results:

Implementation   Total time
asm              10.8500 s
c                 3.4270 s
fastc             1.7510 s
std               1.8500 s

asm, c and fastc haven’t changed. asm is hand-written Assembly and shouldn’t change. fastc makes assumptions about the underlying architecture, and therefore shouldn’t change. c not changing is a bit surprising. std is back to normal. This brings us to our third conclusion: always set the target architecture type.

Interestingly, our fastc implementation was able to beat std. So does that mean that everyone should write their own strlen? No. Remember, we are operating on over 7 GB of textual data here. Which means that for your average string, you will not be able to tell the difference between fastc and std, as long as you set the target architecture type!

In case you are curious, fastc64 (the version of fastc modified to operate on 64-bit words), running in 32-bit mode, is slower than fastc32. The total time for fastc64 running in 64-bit mode is 0.8770 s, which makes it, very unsurprisingly, twice as fast as fastc32 running in 32-bit mode. Fourth conclusion: if compiling for 64-bit processors, make use of it. Almost nobody does.

And last, but not at all least, don’t run strlen 100,000 times over the same string. If you expect that you’ll need to access the size of the string several times, consider storing it. No code is faster than no code.
