Compiler
As we discussed in the previous sections, the compiler accepts the translation unit prepared by the preprocessor and generates the corresponding assembly instructions. When multiple C sources are compiled into their equivalent assembly code, the existing tools in the platform, such as the assembler and the linker, manage the rest by making relocatable object files out of the generated assembly code and finally linking them together (and possibly with other object files) to form a library or an executable file.
As an example, we spoke about as and ld as two of the many tools available in Unix for C development. These tools are mainly used to create platform-compatible object files, and they necessarily exist outside of gcc or any other compiler. By existing outside of any compiler, we mean that they are not developed as a part of gcc (we have chosen gcc as an example), and they should be available on any platform even when gcc is not installed. gcc only uses them in its compilation pipeline; they are not embedded into gcc.
That is because the platform itself knows best which instruction set its processor accepts and which formats and restrictions the operating system imposes. The compiler is not usually concerned with these constraints unless it wants to perform some optimization on the translation unit. Therefore, we can conclude that the most important task that gcc does is to translate the translation unit into assembly instructions. This is what we actually call compilation.
One of the challenges in C compilation is to generate correct assembly instructions that are accepted by the target architecture. It is possible to use gcc to compile the same C code for various architectures such as ARM, x86, x86-64 (AMD64), and many more. As we discussed before, each architecture has an instruction set that is accepted by its processor, and gcc (or any C compiler) is solely responsible for generating correct assembly code for a specific architecture.
The way that gcc (or any other C compiler) overcomes this difficulty is to split the mission into two steps: first parsing the translation unit into an intermediate and C-independent data structure called an Abstract Syntax Tree (AST), and then using the created AST to generate the equivalent assembly instructions for the target architecture. The first step is architecture-independent and can be done regardless of the target instruction set. The second step is architecture-dependent, and the compiler should be aware of the target instruction set. The subcomponent that performs the first step is called a compiler frontend, and the subcomponent that performs the latter step is called a compiler backend.
In the following sections, we are going to discuss these steps in more depth. First, let's talk about the AST.
Abstract syntax tree
As we have explained in the previous section, a C compiler frontend should parse the translation unit and create an intermediate data structure. The compiler creates this intermediate data structure by parsing the C source code according to the C grammar and saving the result in a tree-like data structure that is not architecture-dependent. The final data structure is commonly referred to as an AST.
ASTs can be generated for any programming language, not only C, so the AST structure must be abstract enough to be independent of C syntax.
This means that changing the compiler frontend is enough to support other languages. This is exactly why you can find GNU Compiler Collection (GCC), which gcc is a part of as the C compiler, or LLVM (originally short for Low-Level Virtual Machine), which clang is a part of as the C compiler, as collections of compilers for many languages beyond just C and C++, such as Objective-C, Fortran, and so on.
Once the AST is produced, the compiler backend can start to optimize the AST and generate assembly code based on the optimized AST for a target architecture. To get a better understanding of ASTs, we are going to take a look at a real AST. In this example, we have the following C source code:
int main() {
  int var1 = 1;
  double var2 = 2.5;
  int var3 = var1 + var2;
  return 0;
}
Code Box 2-7 [ExtremeC_examples_chapter2_2.c]: Simple C code whose AST is going to be generated
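The next step is to use clang to dump the AST for the preceding code, for example with a command such as clang -Xclang -ast-dump -fsyntax-only ExtremeC_examples_chapter2_2.c. In the following figure, Figure 2-1, you can see the AST: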
Figure 2-1: The AST generated and dumped for example 2.2
So far, we have used clang in various places as a C compiler, but let's introduce it properly. clang is a C compiler frontend developed as part of the LLVM project for the LLVM compiler backend. The LLVM Compiler Infrastructure Project uses an intermediate representation, or LLVM IR, as the abstract data structure passed between its frontends and its backend. LLVM is famous for letting you dump these intermediate data structures for research purposes. Note that the preceding tree-like output is the AST that clang builds from the source code of example 2.2; it is not yet LLVM IR, which clang generates from the AST in a later step.
What we have done here is introduce you to the basics of ASTs. We are not going through the details of the preceding AST output because each compiler has its own AST implementation. We would require several chapters to cover all of the details, and that is beyond the scope of this book.
However, if you pay attention to the above figure, you can find a line that starts with -FunctionDecl. This represents the main function. Before that, you can find meta information regarding the translation unit passed to the compiler.
If you continue after FunctionDecl, you will find tree entries, or nodes, for declaration statements, binary operator statements, the return statement, and even implicit cast statements. There are lots of interesting things residing in an AST, with countless things to learn!
Another benefit of having an AST for source code is that you can rearrange the order of instructions, prune some unused branches, and replace branches so that you have better performance while preserving the purpose of the program. As we pointed out before, this is called optimization, and it is usually done to a certain configurable extent by any C compiler.
The next component that we are going to discuss in more detail is the assembler.