Search icon CANCEL
Subscription
0
Cart icon
Your Cart (0 item)
Close icon
You have no products in your basket yet
Save more on your purchases! discount-offer-chevron-icon
Savings automatically calculated. No voucher code required.
Arrow left icon
Explore Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Newsletter Hub
Free Learning
Arrow right icon
timer SALE ENDS IN
0 Days
:
00 Hours
:
00 Minutes
:
00 Seconds
Arrow up icon
GO TO TOP
Implementing Domain-Specific Languages with Xtext and Xtend

You're reading from   Implementing Domain-Specific Languages with Xtext and Xtend Learn how to implement a DSL with Xtext and Xtend using easy-to-understand examples and best practices.

Arrow left icon
Product type Paperback
Published in Aug 2016
Publisher Packt
ISBN-13 9781786464965
Length 426 pages
Edition 2nd Edition
Languages
Tools
Arrow right icon
Author (1):
Arrow left icon
Lorenzo Bettini Lorenzo Bettini
Author Profile Icon Lorenzo Bettini
Lorenzo Bettini
Arrow right icon
View More author details
Toc

Table of Contents (18) Chapters Close

Preface Preface to the second edition
1. Implementing a DSL 2. Creating Your First Xtext Language FREE CHAPTER 3. Working with the Xtend Programming Language 4. Validation 5. Code Generation 6. Customizing Xtext Components 7. Testing 8. An Expression Language 9. Type Checking 10. Scoping 11. Continuous Integration 12. Xbase 13. Advanced Topics 14. Conclusions
A. Bibliography
Index

Implementing a DSL

For the end user, using a DSL is surely easier than writing XML code. However, the developer of the DSL is now left with the task of implementing it.

Implementing a DSL means developing a program that is able to read text written in that DSL, parse it, process it, and then possibly interpret it or generate code in another language. Depending on the aim of the DSL, this may require several phases, but most of these phases are typical of all implementations.

In this section, we only sketch the main aspects of implementing a DSL. For a deeper introduction to language implementations and the theory behind, we refer the interested reader to the book Aho et al. 2007.

Note

From now on, throughout the book, we will not distinguish, unless strictly required by the context, between DSL and programming language.

Parsing

First of all, when reading a program written in a programming language, the implementation has to make sure that the program respects the syntax of that language.

To this aim, we need to break the program into tokens. Each token is a single atomic element of the language; this can be a keyword (such as class in Java), an identifier (such as a Java class name), or a literal, that is, a fixed value. Examples of literals are string literals, typically surrounded by quotes, integer literals, and boolean literals (for example, true and false). Other kinds of tokens are operators (such as arithmetic and comparison operators) and separators (such as parentheses and terminating semicolons).

For instance, in the preceding example, employed is a keyword, the parentheses are separators, James is an identifier, and 50 is an integer literal.

The process of converting a sequence of characters into a sequence of tokens is called lexical analysis, and the program or procedure that performs such analysis is called a lexical analyzer, lexer, or simply a scanner. This analysis is usually implemented by using regular expressions syntax.

Having the sequence of tokens from the input file is not enough, we must make sure that they form a valid statement in our language; that is, they respect the syntactic structure expected by the language. This phase is called parsing or syntactic analysis. The program or procedure that performs such analysis is called a parser.

Let's recall the DSL to describe the name and age of various people and a possible input text:

James Smith (50)
John Anderson (40) employed

In this example, each line of the input must respect the following structure:

  • two identifiers
  • the separator (
  • one integer literal
  • the separator )
  • the optional keyword employed

In our language, tokens are separated by white spaces, and lines are separated by a newline character.

You can now deduce that the parser relies on the lexer. In fact, the parser asks the lexer for tokens and tries to build valid statement of the language.

If you have never implemented a programming language, you might be scared at this point by the task of implementing a parser, for instance, in Java. You are probably right, since this is not easy. The DSL we just used as an example is very small and still it would require some effort to implement. What if your DSL has to deal also with, say, arithmetic expressions? In spite of their apparently simple structure, arithmetic expressions are recursive by their own nature; thus a parser implemented in Java would have to deal with recursion as well, and, in particular, it should avoid possible endless loops.

There are tools to deal with parsing so that you do not have to implement a parser by hand. In particular, there are DSLs to specify the grammar of the language, and from this specification, they automatically generate the code for the lexer and parser. For this reason, these tools are called parser generators or compiler-compilers. In this context, such specifications are called grammars. A grammar is a set of rules that describe the form of the elements that are valid according to the language syntax.

Here are some examples of tools for specifying grammars.

Bison and Flex (Levine 2009) are the most famous in the C context: from a high level specification of the syntactic structure (Bison) and lexical structure (Flex) of a language, they generate the parser and lexer in C, respectively. Bison is an implementation of Yacc (Yet Another Compiler-compiler, Brown et al. 1995), and there are variants for other languages as well, such as Racc for Ruby.

In the Java world, the most well-known is probably ANTLR (ANother Tool for Language Recognition) pronounced Antler, (see the book Parr 2007). This allows the programmer to specify the grammar of the language in one single file (without separating the syntactic and lexical specifications in different files), and then it automatically generates the parser in Java.

Just to have an idea of what the specification of grammars looks like in ANTLR, here is a very simplified grammar for an expression language for arithmetic expressions with sum and multiplication (as we will see in Chapter 9, Type Checking we cannot write a recursive rule such as the one shown in the following snippet):

expression
  : INT
  | expression '*' expression
  | expression '+' expression
;

Even if you do not understand all the details, it should be straightforward to get its meaning—an expression is either an integer literal or (recursively) two expressions with an operator in between (either * or +).

From such a specification, you automatically get the Java code that will parse such expressions.

The Abstract Syntax Tree (AST)

Parsing a program is only the first stage in a programming language implementation. Once the program is checked as correct from the syntactic point of view, the implementation will have to do something with the elements of the program.

First of all, the overall correctness of a program cannot always be determined during parsing. One of the correctness checks that cannot be performed during parsing is type checking, that is, checking that the program is correct with respect to types. For instance, in Java, you cannot assign a string value to an integer variable, or you can only assign instances of a variable's declared type or subclasses thereof.

Trying to embed type checking in a grammar specification could either make the specification more complex, or it could be simply impossible, since some type checks can be performed only when other program parts have already been parsed.

Type checking is part of the semantic analysis of a program. This often includes managing the symbol table, that is, for instance, handling the variables that are declared and that are visible only in specific parts of the program (think of fields in a Java class and their visibility in methods).

For these reasons, during parsing, we should also build a representation of the parsed program and store it in memory so that we can perform the semantic analysis on the memory representation without needing to parse the same text over and over again. A convenient representation in memory of a program is a tree structure called the Abstract Syntax Tree (AST). The AST represents the abstract syntactic structure of the program. Being abstract, the tree representation does not represent many details of the original program, such as grouping parentheses and formatting spaces. In this tree, each node represents a construct of the program.

Once the AST is stored in memory, we will not need to parse the program anymore, and we can perform all the additional semantic checks on the AST. If all the checks succeed, we can use the AST for the final stage of the implementation, which can be the interpretation of the program or code generation.

In order to build the AST, we need two additional things.

First of all, we need the code for representing the nodes of such a tree. If we are using Java, this means that we need to write some Java classes, typically one for each language construct. For instance, for the expression language, we might write one class for the integer literal and one for the binary expression. Remember that since the grammar is recursive, we need a base class for representing the abstract concept of an expression. For example:

public interface Expression { }

public class Literal implements Expression {
  Integer value;
  // constructor and set methods...
}

public class BinaryExpression implements Expression {
  Expression left, right;
  String operator;
  // constructor and set methods...
}

Then, we need to annotate the grammar specification with actions that construct the AST during the parsing. These actions are basically Java code blocks embedded in the grammar specification itself. The following is just an (simplified) example, and it does not necessarily respect the actual ANTLR syntax:

expression:
  INT { $value = new Literal(Integer.parseInt($INT.text)); }
| left=expression '*' right=expression {
  $value = new BinaryExpression($left.value, $right.value);
  $value.setOperator("*");
}
| left=expression '+' right=expression {
  $value = new BinaryExpression($left.value, $right.value);
  $value.setOperator("+");
}
;
lock icon The rest of the chapter is locked
Register for a free Packt account to unlock a world of extra content!
A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.
Unlock this book and the full library FREE for 7 days
Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of
Renews at $19.99/month. Cancel anytime
Banner background image