For the end user, using a DSL is surely easier than writing XML code. However, the developer of the DSL is now left with the task of implementing it.
Implementing a DSL means developing a program that is able to read text written in that DSL, parse it, process it, and then possibly interpret it or generate code in another language. Depending on the aim of the DSL, this may require several phases, but most of these phases are typical of all implementations.
In this section, we only sketch the main aspects of implementing a DSL. For a deeper introduction to language implementations and the theory behind, we refer the interested reader to the book Aho et al. 2007.
First of all, when reading a program written in a programming language, the implementation has to make sure that the program respects the syntax of that language.
To this aim, we need to break the program into tokens. Each token is a single atomic element of the language; this can be a keyword (such as class in Java), an identifier (such as a Java class name), or a literal, that is, a fixed value. Examples of literals are string literals, typically surrounded by quotes, integer literals, and boolean literals (for example, true
and false
). Other kinds of tokens are operators (such as arithmetic and comparison operators) and separators (such as parentheses and terminating semicolons).
For instance, in the preceding example, employed
is a keyword, the parentheses are separators, James
is an identifier, and 50
is an integer literal.
The process of converting a sequence of characters into a sequence of tokens is called lexical analysis, and the program or procedure that performs such analysis is called a lexical analyzer, lexer, or simply a scanner. This analysis is usually implemented by using regular expressions syntax.
Having the sequence of tokens from the input file is not enough, we must make sure that they form a valid statement in our language; that is, they respect the syntactic structure expected by the language. This phase is called parsing or syntactic analysis. The program or procedure that performs such analysis is called a parser.
Let's recall the DSL to describe the name and age of various people and a possible input text:
In this example, each line of the input must respect the following structure:
- two identifiers
- the separator (
- one integer literal
- the separator )
- the optional keyword
employed
In our language, tokens are separated by white spaces, and lines are separated by a newline character.
You can now deduce that the parser relies on the lexer. In fact, the parser asks the lexer for tokens and tries to build valid statement of the language.
If you have never implemented a programming language, you might be scared at this point by the task of implementing a parser, for instance, in Java. You are probably right, since this is not easy. The DSL we just used as an example is very small and still it would require some effort to implement. What if your DSL has to deal also with, say, arithmetic expressions? In spite of their apparently simple structure, arithmetic expressions are recursive by their own nature; thus a parser implemented in Java would have to deal with recursion as well, and, in particular, it should avoid possible endless loops.
There are tools to deal with parsing so that you do not have to implement a parser by hand. In particular, there are DSLs to specify the grammar of the language, and from this specification, they automatically generate the code for the lexer and parser. For this reason, these tools are called parser generators or compiler-compilers. In this context, such specifications are called grammars. A grammar is a set of rules that describe the form of the elements that are valid according to the language syntax.
Here are some examples of tools for specifying grammars.
Bison and Flex (Levine 2009) are the most famous in the C context: from a high level specification of the syntactic structure (Bison) and lexical structure (Flex) of a language, they generate the parser and lexer in C, respectively. Bison is an implementation of Yacc (Yet Another Compiler-compiler, Brown et al. 1995), and there are variants for other languages as well, such as Racc for Ruby.
In the Java world, the most well-known is probably ANTLR (ANother Tool for Language Recognition) pronounced Antler, (see the book Parr 2007). This allows the programmer to specify the grammar of the language in one single file (without separating the syntactic and lexical specifications in different files), and then it automatically generates the parser in Java.
Just to have an idea of what the specification of grammars looks like in ANTLR, here is a very simplified grammar for an expression language for arithmetic expressions with sum and multiplication (as we will see in Chapter 9, Type Checking we cannot write a recursive rule such as the one shown in the following snippet):
Even if you do not understand all the details, it should be straightforward to get its meaning—an expression is either an integer literal or (recursively) two expressions with an operator in between (either *
or +
).
From such a specification, you automatically get the Java code that will parse such expressions.
The Abstract Syntax Tree (AST)
Parsing a program is only the first stage in a programming language implementation. Once the program is checked as correct from the syntactic point of view, the implementation will have to do something with the elements of the program.
First of all, the overall correctness of a program cannot always be determined during parsing. One of the correctness checks that cannot be performed during parsing is type checking, that is, checking that the program is correct with respect to types. For instance, in Java, you cannot assign a string value to an integer variable, or you can only assign instances of a variable's declared type or subclasses thereof.
Trying to embed type checking in a grammar specification could either make the specification more complex, or it could be simply impossible, since some type checks can be performed only when other program parts have already been parsed.
Type checking is part of the semantic analysis of a program. This often includes managing the symbol table, that is, for instance, handling the variables that are declared and that are visible only in specific parts of the program (think of fields in a Java class and their visibility in methods).
For these reasons, during parsing, we should also build a representation of the parsed program and store it in memory so that we can perform the semantic analysis on the memory representation without needing to parse the same text over and over again. A convenient representation in memory of a program is a tree structure called the Abstract Syntax Tree (AST). The AST represents the abstract syntactic structure of the program. Being abstract, the tree representation does not represent many details of the original program, such as grouping parentheses and formatting spaces. In this tree, each node represents a construct of the program.
Once the AST is stored in memory, we will not need to parse the program anymore, and we can perform all the additional semantic checks on the AST. If all the checks succeed, we can use the AST for the final stage of the implementation, which can be the interpretation of the program or code generation.
In order to build the AST, we need two additional things.
First of all, we need the code for representing the nodes of such a tree. If we are using Java, this means that we need to write some Java classes, typically one for each language construct. For instance, for the expression language, we might write one class for the integer literal and one for the binary expression. Remember that since the grammar is recursive, we need a base class for representing the abstract concept of an expression. For example:
Then, we need to annotate the grammar specification with actions that construct the AST during the parsing. These actions are basically Java code blocks embedded in the grammar specification itself. The following is just an (simplified) example, and it does not necessarily respect the actual ANTLR syntax: