Understanding the Compilation Process in C: A Step by Step Guide

Jan 11, 2024

man in orange and black vest wearing white helmet holding yellow and black power tool — Photo by Jeriden Villegas on Unsplash

2024 will be my 8th year as a programmer, and one area where I don’t have much experience is low-level systems programming. To rectify this, I’ve been writing some C programs. One day, while I was compiling a small C program, I realized I didn’t really know what was going on when I entered these commands…

gcc -c my_math.c -o my_math.o

gcc main.c my_math.o -o main

So for my own edification, and maybe yours as well, let’s see how we go from .c to executable.

A note for Windows Users

If you are a Windows user and want to follow along, you may not have a C compiler installed (I was surprised too). The easiest way to get one is probably with Visual Studio. You can download Visual Studio Community 2022 Edition here.

Once you have it installed, make sure you have the Desktop development with C++ workload included. This will give you a compiler. From there you can run these commands in your cmd prompt…

cd "C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Auxiliary\Build"

.\vcvarsall.bat x64

This sets up the environment variables necessary to call cl from the cmd prompt. If you don’t do this, you have to type the long path to your compiler, which on my Windows machine with VS 2022 is C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.34.31933\bin\Hostx64\x64.

Once you have a compiler, you can run this build_stages.bat file in the cmd prompt, in the location of your C files, to build the code.

@echo off
REM build_stages.bat

REM Ensure the Visual Studio environment variables are set with vcvarsall!

REM 1. Preprocessing
cl /P main.c
cl /P my_math.c

REM Since MSVC names preprocessed files with the .i extension by default, there's no need to specify output

REM 2. Compilation to Object Files
REM Directly compile to .obj files as MSVC doesn't focus on producing assembly files in the usual workflow.
cl /c main.c
cl /c my_math.c

REM 3. Assembly: no separate step needed, since cl /c already produced the .obj files

REM 4. Linking
link main.obj my_math.obj /OUT:main.exe

REM End of script

Breaking down the compilation steps

Preprocessing

There are four high-level steps to compilation. The first is preprocessing, which is done by the C preprocessor. It handles directives like #include, #define and #ifdef. The C preprocessor expands all macros and includes the contents of header files directly into the source code, generating a preprocessed source file.

To see that this is true, let’s run just the preprocessing step on a C program… We will use three files for this: my_math.h, my_math.c and main.c. The code for all three is listed below…

//my_math.h
#ifndef MY_MATH_H
#define MY_MATH_H

int add(int a, int b);

#endif

//my_math.c
#include "my_math.h"

int add(int a, int b) {
    return a + b;
}

//main.c
#include <stdio.h>
#include "my_math.h"

int main() {
    int sum = add(10, 20);
    printf("The sum is %d\n", sum);
    return 0;
}

To preprocess these files we use gcc with the -E flag (these commands work with Clang too, @MacUsers!). To make this easy, I’ll write a bash script called build_stages.sh to do this (Windows users, use the build_stages.bat file).

#!/bin/bash

# 1. Preprocessing
gcc -E main.c -o main.i
gcc -E my_math.c -o my_math.i

After saving the shell script, I can make it executable on my Linux machine by running chmod +x on build_stages.sh. After running it, my directory looks like this…

terminal with .c files compiled to .i files

I now have two additional files in my directory: a main.i file and a my_math.i file. Taking a look at the my_math.i file, we see this…

# 0 "my_math.c"
# 0 "<built-in>"
# 0 "<command-line>"
# 1 "/usr/include/stdc-predef.h" 1 3 4
# 0 "<command-line>" 2
# 1 "my_math.c"
# 1 "my_math.h" 1



int add(int a, int b);
# 2 "my_math.c" 2

int add(int a, int b) {
    return a + b;
}

The annotations we see in the file are linemarkers inserted by the preprocessor, indicating the origin of each part of the code. These lines aren't actual C code but are used by the compiler and debuggers to track where each piece of code came from, which is especially useful when the code is assembled from multiple files or through complex macro expansions. What is relevant here is that the function signature (aka function prototype) from our my_math.h file, int add(int a, int b);, has now been copied and pasted into the my_math.i file.

Now if we look at the main.i file, we are greeted with 735 lines of expanded macros and .h files coming from our #includes…

# 1 "/usr/include/stdio.h" 1 3 4
# 1 "/usr/include/x86_64-linux-gnu/bits/libc-header-start.h" 1 3 4
# 1 "/usr/include/x86_64-linux-gnu/bits/wordsize.h" 1 3 4
# 21 "/usr/include/features-time64.h" 2 3 4
# 1 "/usr/include/x86_64-linux-gnu/bits/timesize.h" 1 3 4
# 486 "/usr/include/features.h" 3 4
# 1 "/usr/include/x86_64-linux-gnu/sys/cdefs.h" 1 3 4
...

These files are headers from my system's C standard library. If I scroll alllll the way to the bottom, I can see what I included in my main.i file.

# 2 "main.c" 2
# 1 "my_math.h" 1




# 4 "my_math.h"
int add(int a, int b);
# 3 "main.c" 2

We can see that the function prototype of my_math.h is included in the main.i file. Now, main knows there is a function named add that takes two integers as parameters and returns an integer. This is important because main.c calls add(10, 20), and it needs to know about the function's existence and its signature to compile correctly.

Compilation to Assembly

The next step after preprocessor expansion is compilation with the -S flag. The compiler checks the syntax and semantics of the C code in main.i and my_math.i. It doesn't link the files yet, so it doesn't need the actual definition of add from my_math.i to compile main.i. It just needs to know the function's prototype, which it has from the header file. Thus, main.i is compiled knowing that there should be a function add matching the prototype. Each .i file is compiled into assembly instructions specific to the target architecture. Let's add the code to do this to our build_stages.sh and run it to get our .s files.

#!/bin/bash

# 1. Preprocessing
gcc -E main.c -o main.i
gcc -E my_math.c -o my_math.i

# 2. Compilation
gcc -S main.i -o main.s
gcc -S my_math.i -o my_math.s

After rerunning my .sh file, my directory now looks like this.

terminal with .i files compiled to .s files

If we open up our my_math.s file now, we see the actual assembly that was generated for this program.

	.file	"my_math.c"
	.text
	.globl	add
	.type	add, @function
add:
.LFB0:
	.cfi_startproc
	endbr64
	pushq	%rbp
	.cfi_def_cfa_offset 16
	.cfi_offset 6, -16
	movq	%rsp, %rbp
	.cfi_def_cfa_register 6
	movl	%edi, -4(%rbp)
	movl	%esi, -8(%rbp)
	movl	-4(%rbp), %edx
	movl	-8(%rbp), %eax
	addl	%edx, %eax
	popq	%rbp
	.cfi_def_cfa 7, 8
	ret
	.cfi_endproc
.LFE0:
	.size	add, .-add
	.ident	"GCC: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0"
	.section	.note.GNU-stack,"",@progbits
	.section	.note.gnu.property,"a"
	.align 8
	.long	1f - 0f
	.long	4f - 1f
	.long	5
0:
	.string	"GNU"
1:
	.align 8
	.long	0xc0000002
	.long	3f - 2f
2:
	.long	0x3
3:
	.align 8
4:

As a reminder, this is what we got after we included our .h file in our my_math.c file…

int add(int a, int b);

int add(int a, int b) {
    return a + b;
}

That’s a lot of assembly for what amounts to 4 lines of code! Granted, there is some stuff in there to help with debugging and identifying the platform, but that’s still a lot of code! Take a moment to thank our programming forebears who wrote assembly so that we could code in JavaScript and Python.

Our main.s file is much the same, but it also defines the entry point for our program. Surprisingly, while the two .i files were over 700 lines combined, main.s and my_math.s together are about 1/10th the size in assembly.

Assembly into an object file

Moving on to our third step, we get to assembling an object file. Let's add these lines to our build_stages.sh file and run it…

#!/bin/bash

# 1. Preprocessing
gcc -E main.c -o main.i
gcc -E my_math.c -o my_math.i

# 2. Compilation
gcc -S main.i -o main.s
gcc -S my_math.i -o my_math.s

# 3. Assembly
gcc -c main.s -o main.o
gcc -c my_math.s -o my_math.o

The .s files are assembled into machine code, producing object files. These files contain binary code but aren't yet executable. Unfortunately, now that it is binary, we can no longer open the file in a text editor and read it.

garbage output of opening a binary file in Emacs

Instead we can use a GNU tool called objdump to disassemble it back into assembly. Calling objdump -d on my_math.o, we get this…

A note for Windows Users: you can use dumpbin instead of objdump.
```
dumpbin /DISASM my_math.obj
```

disassembly printed to standard out in a terminal

The output is smaller than the original my_math.s because it strips out comments, assembler directives, and other extraneous information, leaving only the essential machine code and related symbols. This streamlined version focuses on the actual instructions and data used by the CPU.

Linking

Finally, it is time to link! Let's add the final step to our build_stages.sh file.

#!/bin/bash

# 1. Preprocessing
gcc -E main.c -o main.i
gcc -E my_math.c -o my_math.i

# 2. Compilation
gcc -S main.i -o main.s
gcc -S my_math.i -o my_math.s

# 3. Assembly
gcc -c main.s -o main.o
gcc -c my_math.s -o my_math.o

# 4. Linking
gcc main.o my_math.o -o main

When we run build_stages.sh one final time, the linker combines the object files into a single executable file. During linking, it resolves the reference to the add function in main.o by finding its definition in my_math.o. The linker ensures that all function calls in the object files are correctly matched to their definitions. Once all the references are resolved, it finalizes the executable, which can then be run. As a reminder, here is what we had in our main file…

#include <stdio.h>
#include "my_math.h"

int main() {
    int sum = add(10, 20);
    printf("The sum is %d\n", sum);
    return 0;
}

And here is the output of calling main…

the final output of our main program. 30

All of that for one printed sum. Compilers are amazing…

Call To Action 📣

Hi 👋 my name is Diego Crespo and I like to talk about technology, niche programming languages, and AI. I have a Twitter and a Mastodon, if you’d like to follow me on other social media platforms. If you liked the article, consider liking and subscribing. And if you haven’t why not check out another article of mine listed below! Thank you for reading and giving me a little of your valuable time. A.M.D.G

C Strings and my slow descent to madness
