Implementing a lexer
The lexer implements part of the first phase of compiling a program: it tokenizes the stream of input characters, and the parser then consumes these tokens to construct an AST. A token is a string of one or more characters that is significant as a group, and the process of forming tokens from an input stream of characters is called tokenization. Certain delimiters, such as whitespace, are used to separate groups of characters into tokens. Note that while the overall syntax of a language is generally context-free, the individual token classes can usually be described by regular expressions. There are tools, such as lex, that automate lexical analysis, but the TOY lexer demonstrated in the following procedure is handwritten in C++.
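For example, given an input line such as def foo x y (the syntax here is only illustrative), the lexer would emit a token for the def keyword followed by three identifier tokens for foo, x, and y, with whitespace serving as the delimiter between them.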
Getting ready
We must have a basic understanding of the TOY language defined in the preceding recipe. Create a file named toy.cpp as follows:
$ vim toy.cpp
This file will contain all of the lexer, parser, and code generation logic.
How to do it…
While implementing a lexer, types of tokens are defined to categorize streams of input strings (similar to the states of an automaton).
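Before walking through the recipe's own code, here is a minimal, self-contained sketch of what such a handwritten lexer can look like. It assumes a Kaleidoscope-style design reading from standard input; the token names (EOF_TOKEN, DEF_TOKEN, and so on), the helper variables, and the test driver are illustrative assumptions, not necessarily the definitions used later in this recipe:

#include <cctype>
#include <cstdio>
#include <string>

// A minimal, illustrative set of token types for a TOY-like language.
// The names and the exact set of categories are assumptions made here
// for demonstration; the recipe's final code may differ.
enum Token_Type {
  EOF_TOKEN = 0,     // end of the input stream
  DEF_TOKEN,         // the 'def' keyword
  IDENTIFIER_TOKEN,  // names such as foo, x, y
  NUMERIC_TOKEN      // integer literals such as 16
};

static std::string Identifier_string; // set when IDENTIFIER_TOKEN is returned
static int Numeric_Val;               // set when NUMERIC_TOKEN is returned

// Reads characters from standard input and returns the next token.
// Unrecognized single characters (for example '+' or '(') are returned
// as their ASCII value so a parser can handle them directly.
static int get_token() {
  static int LastChar = ' ';

  // Whitespace acts as the delimiter between tokens.
  while (isspace(LastChar))
    LastChar = getchar();

  // Identifier or keyword: [a-zA-Z][a-zA-Z0-9]*
  if (isalpha(LastChar)) {
    Identifier_string = static_cast<char>(LastChar);
    while (isalnum(LastChar = getchar()))
      Identifier_string += static_cast<char>(LastChar);
    if (Identifier_string == "def")
      return DEF_TOKEN;
    return IDENTIFIER_TOKEN;
  }

  // Numeric literal: [0-9]+
  if (isdigit(LastChar)) {
    std::string NumStr;
    do {
      NumStr += static_cast<char>(LastChar);
      LastChar = getchar();
    } while (isdigit(LastChar));
    Numeric_Val = std::stoi(NumStr);
    return NUMERIC_TOKEN;
  }

  if (LastChar == EOF)
    return EOF_TOKEN;

  // Any other character is returned as-is.
  int ThisChar = LastChar;
  LastChar = getchar();
  return ThisChar;
}

// A small driver, for testing only, that prints the kind of each token.
int main() {
  for (int tok = get_token(); tok != EOF_TOKEN; tok = get_token())
    printf("token kind: %d\n", tok);
  return 0;
}

Compiling this sketch and piping a line such as def foo x y into it prints one token kind per token, which is a quick way to check that the categorization behaves as expected.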