2012.03.23 - inlining, qsort, and std::sort
C++ enthusiasts are often quick to claim std::sort is faster than qsort. The typical citation
they provide is from Scott Meyers' book "Effective STL", Item 46: "At runtime, sort makes
inline calls to its comparison function while qsort calls its comparison function through a
pointer."
His statement, however, is only true when the definition of qsort is unavailable. Because
templates usually aren't external, the compiler tends to have the necessary definitions at
compile time and can inline as appropriate. What Scott failed to mention is that if the
definition of qsort and the comparison function are available, C compilers can perform the same
inlining!
This claim is often met with skepticism. The experiment below demonstrates
this skepticism is unwarranted.
The setup:
- 1. Use an implementation of qsort similar to std::sort. The selected version can be seen here.
- 2. In the C test program, make on version of qsort external, and the other version inline.
- 3. Implement an equivalent C++ version with equivalent data types and comparisons.
- 4. Invoke each version on a randomly sorted array of one hundred million elements. The array contents and ordering should be identical for each program before and after the sort.
The C source code and disassembly can be found
here. The C++
source and disassembly can be found
Compilation was done with GCC version 4.5.0, using the following command for the C code:
Prompt> gcc -O3 -Wall -Winline --std=c99 qsort_main.c external_qsort.c
Similarly, the C++ version was compiled with:
Prompt> g++ -O3 -Wall stlsort.cpp
When invoking the C version, the program takes two arguments. The first indicates whether to use
inline or external qsort (0=external, 1=inline). The second sets how many elements to sort. The
C++ version only takes how many elements to sort.
Finally, the test was run, each program sorted 100 000 000 elements. The results can be seen
below:
# External qsort
Prompt> qsort 0 100000000
Total time: 31.290000
# Inline qsort
Prompt> qsort 1 100000000
Total time: 11.228000
# STL sort
Prompt> stlsort 100000000
Total time: 9.291000
When the C compiler has the same information the C++ one has the programs perform nearly the same. When the compiler is forced to use an indirect function call, the performance is terrible. Looking at the disassembly, the call to external qsort has the following code:
movl $_foo_compare, 12(%esp)
movl $4, 8(%esp)
movl %esi, 4(%esp)
movl -308(%ebp), %ecx
movl %ecx, (%esp)
call _external_qsort
Here it is obvious no inlining is performed. The comparison function address is pushed onto the stack, and a call made to an ignorant external qsort. The remaining assembly for the function main contains no direct references to either inline_qsort or foo_comparison since both were completely inlined. Inlined instances of foo_compare can easily be found if the comparison code is modified to make a call to the function foo_token. Looking at the disassembled version, inlined instances can be found simply by searching for foo_token. For instance:
movl $LC0, (%esp)
call _foo_token
movl (%edi), %eax
cmpl %eax, (%ebx)
jns L28
movl %ebx, %edi
addl $4, %ebx
cmpl %ebx, %esi
jae L29
Many such inlined instances are seen scattered throughout the main function.
Hopefully the above experiment demonstrates the inaccuracy of Scott Meyers' initial claim.