Learning the frontend phases with Clang
To transform a source code program into LLVM IR bitcode, there are a few intermediate steps the source code must pass through. The following figure illustrates all of them, and they are the topics of this section:
Lexical analysis
The very first frontend step processes the source code's textual input by splitting it into a sequence of tokens (identifiers, keywords, literals, and punctuators) and discarding characters that carry no meaning for later phases, such as comments, white space, and tabs. Each token must be a valid element of the language, and reserved keywords are converted into internal compiler representations. The reserved words are defined in include/clang/Basic/TokenKinds.def. For example, see the definitions of the while reserved word and the < symbol, two well-known C/C++ tokens, in the following TokenKinds.def excerpt:
TOK(identifier)               // abcde123
// C++11 String Literals.
TOK(utf32_string_literal)     // U"foo"
…
PUNCTUATOR(r_paren, ")")
PUNCTUATOR...
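To make the role of such a definition file more concrete, the following is a minimal sketch of a toy lexer driven by a macro table in the same style. It is not Clang's implementation: the TOY_TOKEN_LIST table, the TokenKind enum, and the tokenName and lex functions are hypothetical names introduced only for illustration, and the KEYWORD macro here takes a single argument for simplicity. The sketch shows how one table can be expanded several times, with different macro definitions, to generate the token enumeration, the token names, and the keyword and punctuator lookups.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical token table written in the style of a .def file.
// The entries are illustrative and are not copied from Clang.
#define TOY_TOKEN_LIST        \
    TOK(identifier)           \
    TOK(numeric_constant)     \
    KEYWORD(while)            \
    KEYWORD(int)              \
    PUNCTUATOR(l_paren, "(")  \
    PUNCTUATOR(r_paren, ")")  \
    PUNCTUATOR(less, "<")     \
    PUNCTUATOR(semi, ";")

// First expansion: generate one enumerator per table entry.
enum class TokenKind {
#define TOK(X) X,
#define KEYWORD(X) kw_##X,
#define PUNCTUATOR(X, Y) X,
    TOY_TOKEN_LIST
#undef TOK
#undef KEYWORD
#undef PUNCTUATOR
    unknown
};

// Second expansion: map each enumerator to a printable name.
const char *tokenName(TokenKind K) {
    switch (K) {
#define TOK(X) case TokenKind::X: return #X;
#define KEYWORD(X) case TokenKind::kw_##X: return #X;
#define PUNCTUATOR(X, Y) case TokenKind::X: return #X;
        TOY_TOKEN_LIST
#undef TOK
#undef KEYWORD
#undef PUNCTUATOR
    case TokenKind::unknown:
        break;
    }
    return "unknown";
}

struct Token {
    TokenKind kind;
    std::string text;
};

// Splits the input into tokens, dropping white space and // comments.
std::vector<Token> lex(const std::string &src) {
    std::vector<Token> out;
    std::size_t i = 0;
    while (i < src.size()) {
        unsigned char c = static_cast<unsigned char>(src[i]);
        if (std::isspace(c)) { ++i; continue; }
        // Skip a line comment up to (not including) the newline.
        if (c == '/' && i + 1 < src.size() && src[i + 1] == '/') {
            while (i < src.size() && src[i] != '\n') ++i;
            continue;
        }
        if (std::isalpha(c) || c == '_') {
            std::size_t start = i;
            while (i < src.size() &&
                   (std::isalnum(static_cast<unsigned char>(src[i])) || src[i] == '_'))
                ++i;
            std::string word = src.substr(start, i - start);
            TokenKind kind = TokenKind::identifier;
            // Third expansion: a word that matches a keyword is reclassified.
#define TOK(X)
#define KEYWORD(X) if (word == #X) kind = TokenKind::kw_##X;
#define PUNCTUATOR(X, Y)
            TOY_TOKEN_LIST
#undef TOK
#undef KEYWORD
#undef PUNCTUATOR
            out.push_back({kind, word});
            continue;
        }
        if (std::isdigit(c)) {
            std::size_t start = i;
            while (i < src.size() && std::isdigit(static_cast<unsigned char>(src[i]))) ++i;
            out.push_back({TokenKind::numeric_constant, src.substr(start, i - start)});
            continue;
        }
        // Fourth expansion: single-character punctuators from the table.
        TokenKind kind = TokenKind::unknown;
        std::string text(1, src[i]);
#define TOK(X)
#define KEYWORD(X)
#define PUNCTUATOR(X, Y) if (text == Y) kind = TokenKind::X;
        TOY_TOKEN_LIST
#undef TOK
#undef KEYWORD
#undef PUNCTUATOR
        out.push_back({kind, text});
        ++i;
    }
    return out;
}
```

Under these assumptions, calling lex("while (i < 10) // loop\n") yields six tokens whose names, as reported by tokenName, are while, l_paren, identifier, less, numeric_constant, and r_paren; the comment and the white space produce no tokens. Keeping the token vocabulary in a single table means that adding a keyword or punctuator requires editing only one place.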