Preface
I was wondering how big the overhead is before and after program execution. It made little sense to start with anything other than the simplest possible program with a blank main:

int main() { return 1; }
Theoretically, it is possible to compile a program with void main, which would remove the return value. However, that is not valid according to the ISO standards (C++ §3.6.1/2 | C §5.1.2.2/1), so we will skip it.
When we try to disassemble this code, we get an assembly with 5 instructions:
push rbp        // push the caller's base pointer on the stack to set up the stack frame
mov  rbp, rsp   // establish the new stack frame
mov  eax, 1     // place the return value (1) in eax
pop  rbp        // restore the caller's base pointer
ret             // return to the caller
Technically speaking, stack and stack pointer operations are not needed in this context, so the code could be reduced to:
mov eax, 1   // set the return value
ret          // return to the caller
Now, guess how many instructions it takes to execute the above program. The answer is .
This value is hard to believe, considering that the program itself is just 5 instructions. We will try to find out what exactly happens and why.
Analysis Tools
Mainly, two tools will be used for the analysis: valgrind --tool=callgrind and perf. For visualization, I used Callgrind because of its GUI tool and the greater clarity of its results. The most important metric I chose was instruction count, because it is machine-independent and deterministic: the runtime environment doesn’t have to be exactly the same to reproduce the profile, which is not true for cycles and especially for execution time.
Callgrind Call Map & Call List of the dynamically linked program
As we can see, 98.22% of the instructions were executed outside the program, in ld-linux-x86-64.so, which is the Linux dynamic linker. We see functions like _dl_sysdep_start, dl_start, dl_main, dl_relocate_object and dl_lookup_symbol_x. These are part of the dynamic loading process: their job is to load, initialize and relocate the symbols the program uses, which live, for example, in libc.so. Further down, handle_amd and handle_intel are involved in detecting and initializing specific processor features (such as support for the SSE and AVX extensions). Even if your program does not use these features directly, the runtime must still detect the CPU to adapt to the actual hardware environment.
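To get a feel for what symbol lookup means here, the sketch below performs the same kind of lookup explicitly at runtime with dlopen and dlsym. This is only an illustration of the concept, not what the loader literally does: the dynamic linker resolves imported symbols automatically at load time, while here we ask for one by hand (the library name libc.so.6 and the symbol puts are just examples).

#include <dlfcn.h>   /* dlopen, dlsym, dlclose, dlerror */
#include <stdio.h>

int main(void) {
    /* Map the C library explicitly; normally the dynamic linker has
       already done this before main() is ever reached. */
    void *libc = dlopen("libc.so.6", RTLD_NOW);
    if (!libc) {
        fprintf(stderr, "dlopen failed: %s\n", dlerror());
        return 1;
    }

    /* Look up a symbol's address by name - conceptually the same step
       that _dl_lookup_symbol_x performs for every relocation entry. */
    int (*puts_ptr)(const char *) = (int (*)(const char *))dlsym(libc, "puts");
    if (puts_ptr)
        puts_ptr("resolved 'puts' at runtime");

    dlclose(libc);
    return 0;
}

On older glibc versions this needs to be linked with -ldl; on glibc 2.34 and newer the dl* functions live in libc itself.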
Running perf stat: comparison of the static & dynamic approaches
Let’s run perf stat to get a unified result that we can compare later:

0.42 msec task-clock:u # 0.571 CPUs utilized
Okay, if most of the program’s time is taken up by the linker and by loading dynamic libraries, let’s link it statically -> gcc -static prog.cpp -o prog
0.21 msec task-clock:u # 0.529 CPUs utilized
We can see that the number of instructions has decreased 4.26 times. However, the binary size increased from 15,776 to 900,224 bytes, which is a huge difference (roughly 57 times larger). This is because the libraries that were previously linked dynamically at load time are now embedded directly in the program’s binary.
Callee List of the statically linked program
I won’t paste the glibc code here for the sake of cleanliness, but as you can see, the __tunables_init function is the main culprit. Its purpose is to let you configure the behavior of the glibc library via environment variables, so certain aspects of the library’s behavior can be customized without recompiling your program. A minimum set of these variables must be initialized because the C runtime doesn’t know in advance that your program is just a blank main and won’t use most of its features, so it prepares for all possibilities.
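As a small illustration of the mechanism (assuming a typical, non-setuid process, where glibc leaves the variable visible in the environment): tunables are passed in the GLIBC_TUNABLES environment variable and parsed by __tunables_init long before main runs. The toy program below only prints the raw string the process was started with.

#include <stdio.h>
#include <stdlib.h>   /* getenv */

int main(void) {
    /* __tunables_init has already consumed this variable during startup;
       here we only inspect the raw string the process received. */
    const char *tunables = getenv("GLIBC_TUNABLES");
    printf("GLIBC_TUNABLES = %s\n", tunables ? tunables : "(not set)");
    return 0;
}

Running it as GLIBC_TUNABLES=glibc.malloc.arena_max=2:glibc.malloc.mmap_threshold=65536 ./prog shows the colon-separated name=value format that __tunables_init parses.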
CPU and Memory Tunables
For example, there are tunables related to the CPU and to memory management. CPU variables such as cache sizes and the thresholds for optimized copy instructions must be set regardless of the actual CPU manufacturer. This explains why functions such as handle_intel and intel_check_word are called even though both my PC and my WSL instance run on a Ryzen. Their combined share of roughly 19% of the instructions can be partly explained by the cost of querying hardware and system constants.
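To show what this kind of detection boils down to, here is a minimal, x86-only sketch using GCC’s built-in CPU detection. glibc does its own probing with the cpuid instruction inside handle_amd and handle_intel; the builtins below are just a convenient stand-in for the same idea.

#include <stdio.h>

int main(void) {
    /* Initialize GCC's CPU model/feature detection (uses cpuid underneath). */
    __builtin_cpu_init();

    /* Vendor check - roughly the decision that routes glibc into
       handle_amd() or handle_intel(). */
    if (__builtin_cpu_is("amd"))
        puts("vendor: AMD");
    else if (__builtin_cpu_is("intel"))
        puts("vendor: Intel");

    /* Feature checks - the kind of information used to select optimized
       memcpy/memset variants and to fill the CPU-related tunables. */
    printf("sse2: %d\n", __builtin_cpu_supports("sse2") ? 1 : 0);
    printf("avx:  %d\n", __builtin_cpu_supports("avx")  ? 1 : 0);
    printf("avx2: %d\n", __builtin_cpu_supports("avx2") ? 1 : 0);
    return 0;
}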
Equally important are the variables related to memory management, particularly the entire set of glibc.malloc variables. These parameters control key aspects of memory allocation, such as the size and number of memory arenas, the thresholds for using mmap instead of sbrk, and the behavior of the thread cache. For example, glibc.malloc.arena_max can significantly affect memory usage on multi-core systems, while glibc.malloc.mmap_threshold determines when the system will use mmap to allocate larger blocks of memory.
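These environment tunables also have programmatic counterparts: glibc’s mallopt lets a single process apply similar settings from code instead of the environment. Below is a minimal sketch, with arbitrary values chosen purely for illustration (mallopt and the M_* constants are glibc-specific).

#include <malloc.h>   /* mallopt, M_ARENA_MAX, M_MMAP_THRESHOLD (glibc) */
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    /* Similar in spirit to glibc.malloc.arena_max=2: cap the number of
       malloc arenas, which matters on multi-core systems. */
    mallopt(M_ARENA_MAX, 2);

    /* Similar in spirit to glibc.malloc.mmap_threshold=65536: requests of
       at least 64 KiB should be served with mmap instead of the heap. */
    mallopt(M_MMAP_THRESHOLD, 64 * 1024);

    /* This 1 MiB request is above the threshold, so glibc should satisfy
       it with mmap and hand the memory back to the OS on free(). */
    void *big = malloc(1024 * 1024);
    printf("allocated 1 MiB at %p\n", big);
    free(big);
    return 0;
}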
You can find all of the tunables in the glibc documentation or by calling /lib64/ld-linux-x86-64.so.2 --list-tunables
Conclusion
This is my first technical article. It was intended to illustrate the differences between dynamic and static linking, show more or less the costs and benefits of each, and explain where the startup overhead comes from and what is responsible for it.
In the next one, we will dive deeper into C compiled without glibc and explore the shortest program in assembly and its binary. I really appreciate any feedback, so if you have any comments or suggestions, feel free to leave a comment below ⬇️.