# This is Jawher Moussa's blog

in which he writes about technical stuff

# The Litil chronicle - Chapter 3.2, Recursive Descent Parsing

In The previous post, I talked about some of the theory behind parsing. This post however will be about how to actually write down a parser, which given a list of tokens generated by the lexer would generate an AST.

For example, given the following Litil program:

``````let bool2int x = if x then 1 else 0
``````

Feeding it to the lexer would generate the following token stream:

1. `KEYWORD(let)`
2. `NAME(bool2int)`
3. `NAME(x)`
4. `SYM(=)`
5. `KEYWORD(if)`
6. `NAME(x)`
7. `KEYWORD(then)`
8. `NUM(1)`
9. `KEYWORD(else)`
10. `NUM(0)`

Feeding this token stream to the parser we’re going to create would generate this AST1:

As we’ve seen in the previous post, there are different kinds of parsers: Top-down (LL for example), Bottom-up (LR for example) and many concrete methods and algorithms to implement them.

I chose to use the recursive descent parsing in litil for the following reasons:

• Such parsers are easy to write by hand
• The resulting code is readable and easy to evolve and maintain

## From grammar to parser

What follows is how to create a recursive descent parser from an EBNF grammar.

We’re going to create a parser for a subset of litil. Here’s its EBNF grammar:

``````program → letBinding*

letBinding → 'let' NAME NAME* '=' expression
expression → 'if' expression 'then' expression 'else' expression | NAME | 'true' | 'false' | NUM
``````

In the language defined by the grammar above, a program is a set of let bindings.
A let binding starts with the `let` keyword followed by a name, an optional list of arguments, the `=` symbol followed by an expression.

An expression can either be a name (referencing a variable by its name), a numeric literal (5, 42, etc.), a boolean literal (`true` or `false`) or an `if` expression.
An `if` expression starts with the `if` keyword followed by a condition (which is an expression).
The branch values are specified as an expression after the `then` and `else` keywords.

Some examples of programs written in this language:

``````let x = 5

let b = true

let y = if true then 45 else x
``````

Granted, it is not a very useful language (it can’t even add 2 numbers together or compare them).
However, its grammar is compact and simple and it will serve for the purpose of demonstrating how to write a recursive descent parser.

### The framework

We’ll start by creating a class for the parser that takes a lexer as its constructor argument:

``````public class LitilParser {

this.lexer = lexer;
}
}
``````

Next, we’ll add a field that will hold the current token the lexer just read and a couple of functions that operate on it:

``````public class LitilParser {

private Token token;

this.lexer = lexer;
}

private boolean found(Token.Type key) {
Token tk = lexer.peek(1);
if (tk.type == key) {
lexer.pop();
token = tk;
return true;
} else {
return false;
}
}

private boolean found(Token.Type key, String value) {
Token tk = lexer.peek(1);
if (tk.type == key && tk.text.equals(value)) {
lexer.pop();
token = tk;
return true;
} else {
return false;
}
}
}
``````

The `found` function checks if the upcoming token’s type from the lexer matches its argument.
If that’s the case, the token is consumed and stored in the `token` field, and `true` is returned.
Else, no token gets consumed and `false` is returned.

Here’s an example of how this function can be used:

``````if(found(Token.Type.NAME)) {
String name = token.value;
}
``````

There’s a second variant that also checks the upcoming token’s value besides its type:

``````if(found(Token.Type.KEYWORD, "let")) {
// try and parse a let binding
}
``````

We’ll also need `expect`, which behaves like `found`, except that it throws an exception instead of returning false:

``````public class LitilParser {

private void expect(Token.Type key) {
if (!found(key)) {
throw error("Was expecting a token of type '" + key + "' but got " + lexer.peek(1));
}
}

private void expect(Token.Type key, String value) {
if (!found(key, value)) {
throw error("Was expecting '" + value + "'(" + key + ")");
}
}

private ParseException error(String msg) {
Token where = token == null ? lexer.peek(1) : token;
throw new ParseException(msg, lexer.getCurrentLine(), where.row, where.col);
}
}
``````

The `error` function constructs a `ParseException`, which holds an error message, the current line being parsed and the parse coordinates (row and column):

``````public class LitilParser {

private ParseException error(String msg) {
Token where = token == null ? lexer.peek(1) : token;
throw new ParseException(msg, lexer.getCurrentLine(), where.row, where.col);
}
}
``````

Here’s the `ParseException` declaration:

``````public class ParseException extends RuntimeException {

public ParseException(String message, String line, int row, int col) {
super(String.format("%s (@%d:%d)\n%s\n%s^\n", message, row, col, line, ParseException.indent(col - 1)));
}

private static String indent(int n) {
StringBuilder res = new StringBuilder();
for (int i = 0; i < n; i++) {
res.append(" ");
}
return res.toString();
}
}
``````

`ParseException` is a runtime exception (unchecked) that formats its arguments into a handy display format.
For instance, given this invalid input (due to a missing let binding name):

``````let = true
``````

The `ParseException` would generate a message resembling the following:

``````ParseException: Was expecting a token of type 'NAME' but got SYM ('=') @ 2:5 (@2:5)
let = true
^
``````

### Coding the production rules

Once we have the basic framework I showed above, converting the EBNF grammar into a parser becomes as easy as:

1. Create a method for every production rule using the left hand side non terminal’s name as the method’s name
2. For the right hand part:
1. A terminal maps to a call to `expect`
2. A non-terminal maps to a call to its production rule method
3. A choice (`|`) or an optional part (`?`) is handled with the `found` primitive and the host language’s `if` construct
4. A repetition is handled with the host language’s `while` construct

Let’s apply this to the grammar we defined above:

``````program → letBinding*
letBinding → 'let' NAME NAME* '=' expression
expression → 'if' expression 'then' expression 'else' expression | NAME | 'true' | 'false' | NUM
``````

Applying the rule 1, we create 3 methods, one for every product rule:

``````public void program() {
// TODO
}

public void letBinding() {
// TODO
}

public void expr() {
// TODO
}
``````

Let’s implement the `program` method.
As a reminder, here’s the right hand part of its production rule:

``````program → letBinding*
``````

i.e. zero or more repetitions of the letBinding non terminal.
Applying the rule 2.4, we know we’ll use a `while` loop the handle the repetition. We’ll also use a special feature in litil’s lexer, which produces a `NEWLINE` token for every non-empty line in the program.
Here’s how `program` should be implemented:

``````public void program() {
while (found(Token.Type.NEWLINE)) {
letBinding();
}
}
``````

Or in english:

• as long a we get a `NEWLINE` token: that’s the `while(found(NEWLINE))` part
• parse a let binding: this is done by simply calling the `letBinding` method (rule 2.2)

How about `letBinding` ?

Here’s its production rule:

``````letBinding → 'let' NAME NAME* '=' expression
``````

This cleanly maps to the following java code:

``````public void letBinding() {
expect(Token.Type.KEYWORD, "let");
expect(Token.Type.NAME);
while (found(Token.Type.NAME)) {
// NOP
}
expect(Token.Type.SYM, "=");
expr();
}
``````

We start by handling the ‘let’ terminal. We simply apply the rule 2.1 by calling `expect` to expect the `let` keyword.
Same for the name that follows it. We simply call `expect(NAME)`.

We then have to handle `NAME*` (the optional arguments list). This is achieved with a `while` loop and the `found` primitive (rule 2.4).
We then expect the `=` symbol and finally, parse an expression by calling the `expr` method.

Here’s the production rule for an expression:

``````expression → 'if' expression 'then' expression 'else' expression | NAME | 'true' | 'false' | NUM
``````

To handle the choice, we apply rule 2.3 and use the `if` construct with the `found` primitive:

``````private void expr() {
if (found(Token.Type.KEYWORD, "if")) {
expr();
expect(Token.Type.KEYWORD, "then");
expr();
expect(Token.Type.KEYWORD, "else");
expr();
} else if (found(Token.Type.NAME)) {
//NOP
} else if (found(Token.Type.NUM)) {
//NOP
} else if (found(Token.Type.BOOL)) {
//NOP
} else {
throw error("Unrecognized expression");
}
}
``````

That’s it. Seriously, that’s it. We just coded a recursive descent parser for the grammar given above.

To actually run this parser, we can use this snippet:

``````public static void main(String[] args) throws Exception {
String code = "let y = if true then 45 else x"
p.program();
}
``````

Simply instantiate a lexer and a parser, and call the `program` method to start the parsing process.

### AST

There still one thing missing though: in its current form, the parser can only ensure that a program is valid according to a grammar or throw a `ParseException` if it isn’t.

That’s useful in itself, but what we’d really like is to have it produce an AST so we can do more useful stuff later, like type checking or executing it for example.

In this case, and according the EBNF grammar defined above, we’ll define these AST node types:

#### Program

A program is a list of let bindings:

``````public class Program {
public final List<LetBinding> instructions;

public Program(List<LetBinding> instructions) {
this.instructions = instructions;
}
}
``````

#### LetBinding

A let binding can be described with an object containing the following properties:

Property Type Description
name `String` the name of the the variable or function being defined
args `List<String>` the list of argument names
body `Expression` the expression forming the value of the let binding

or in Java:

``````public class LetBinding {
public final String name;
public final List<String> args;
public final Expression body;

public LetBinding(String name, List<String> args, Expression body) {
this.name = name;
this.args = args;
this.body = body;
}
}
``````

#### Expression

Now, expressions are a bit trickier because of the choice in the production rule:

``````expression → 'if' expression 'then' expression 'else' expression | NAME | 'true' | 'false' | NUM
``````

So we’re going to need a base type:

``````public abstract class Expression {
}
``````

And a sub class for every choice:

##### EIf

`EIf` should contain the following properties:

Property Type Description
cond `Expression` the if condition
thenExpr `Expression` the expression to return when the `then` branch matches
elseExpr `Expression` the expression to return when the `else` branch matches

Which translates to the following Java code:

``````public class EIf extends Expression {
public final Expression cond;
public final Expression thenExpr;
public final Expression elseExpr;

public EIf(Expression cond, Expression thenExpr, Expression elseExpr) {
this.cond = cond;
this.thenExpr = thenExpr;
this.elseExpr = elseExpr;
}
}
``````
##### EName

Just a class with a name field:

``````public class EName extends Expression {
public final String name;

public EName(String name) {
this.name = name;
}
}
``````
##### EBool

Same for boolean literals expressions:

``````public class EBool extends Expression {
public final boolean value;

public EBool(boolean value) {
this.value = value;
}
}
``````
##### ENum

And also the same for numeric literals expressions:

``````public class ENum extends Expression {
public final long value;

public ENum(long value) {
this.value = value;
}
}
``````

### Generating the AST

Now that we have defined our AST node types, we need to make the parser construct it.

We’ll start by modifying the parse methods to return the correct AST type instead of `void`.

For example, where `expr()` used to return void:

``````private Expression expr() {
:
:
}
``````

We’ll make it return `Expression` instead:

``````private Expression expr() {
:
:
}
``````

We’ll also need to modify its implementation to actually return an `Expression`.

Let’s start with the `if` expressions parsing.
Here’s what we have got so far:

``````if (found(Token.Type.KEYWORD, "if")) {
expr(); // parse the if condition
expect(Token.Type.KEYWORD, "then");
expr(); // parse the then result
expect(Token.Type.KEYWORD, "else");
expr(); // parse the else result
} else
``````

What we’ll need to do is construct an `EIf` node using the condition and branches expressions:

``````if (found(Token.Type.KEYWORD, "if")) {
Expression cond = expr();
expect(Token.Type.KEYWORD, "then");
Expression thenExpr = expr();
expect(Token.Type.KEYWORD, "else");
Expression elseExpr = expr();
return new EIf(cond, thenExpr, elseExpr);
} else
``````

That was the hardest one.
The remaining ones (names, numeric and boolean literals) can be implemented as follow:

`````` else if (found(Token.Type.NAME)) {
return new EName(token.text);
} else if (found(Token.Type.NUM)) {
return new ENum(Long.parseLong(token.text));
} else if (found(Token.Type.BOOL)) {
return new EBool(Boolean.parseBoolean(token.text));
} else
``````

Now, here’s how the complete `expr` method:

``````private Expression expr() {
if (found(Token.Type.KEYWORD, "if")) {
Expression cond = expr();
expect(Token.Type.KEYWORD, "then");
Expression thenExpr = expr();
expect(Token.Type.KEYWORD, "else");
Expression elseExpr = expr();
return new EIf(cond, thenExpr, elseExpr);
} else if (found(Token.Type.NAME)) {
return new EName(token.text);
} else if (found(Token.Type.NUM)) {
return new ENum(Long.parseLong(token.text));
} else if (found(Token.Type.BOOL)) {
return new EBool(Boolean.parseBoolean(token.text));
} else {
throw error("Unrecognized expression");
}
}
``````

How about the `let` binding construct ? As a reminder, here’s how we coded the `let` parsing mthod:

``````public void letBinding() {
expect(Token.Type.KEYWORD, "let");
expect(Token.Type.NAME);
while (found(Token.Type.NAME)) {
// NOP
}
expect(Token.Type.SYM, "=");
expr();
}
``````

To get it to produce an AST node, we start by changing its return type from `void` to `LetBinding`:

``````public LetBinding letBinding() {

}
``````

And then, same as with the `if` parsing method, we’re going to collect the necessary information to construct the result:

• the `let` binding name
• the arguments names
• the body’s expression

The `let` binding name was parsed using the `expect` method:

``````expect(Token.Type.KEYWORD, "let");
expect(Token.Type.NAME);
``````

The `expect` function, as explained earlier, reads the next token produced by the lexer, and checks that its type and value match its arguments. If a match is found, the consumed token is then stored in the `token` field.

Hence, the `let` binding’s name can be retrieved after the `expect` call in `token.text`:

``````expect(Token.Type.KEYWORD, "let");
expect(Token.Type.NAME);
String name = token.text;
``````

The same applies to the `found` method, i.e. if it matches, the cosumed token is stored in the `token` field.
Thus, to collect the arguments names, we need to modify the empty `while` loop’s body to add the current token’s text into a list:

``````List<String> args = new ArrayList<String>();
while (found(Token.Type.NAME)) {
}
``````

It only remains to store the `let` bindings’s body expression into a variable:

``````Expression body = expr();
``````

And finally to construct and return a `LetBinding` instance using the collected name, argument list and body:

``````return new LetBinding(name, args, body);
``````

And we’re done ! Here’s the complete listing of the `let` parsing method:

``````public LetBinding letBinding() {
expect(Token.Type.KEYWORD, "let");
expect(Token.Type.NAME);
String name = token.text;

List<String> args = new ArrayList<String>();
while (found(Token.Type.NAME)) {
}
expect(Token.Type.SYM, "=");
Expression body = expr();
return new LetBinding(name, args, body);
}
``````

Easy peasy !

I think you got how it works by now. So I’ll simply paste the modified version of the program parsing method below:

``````public Program program() {
List<LetBinding> instructions = new ArrayList<LetBinding>();
while (found(Token.Type.NEWLINE)) {
}
return new Program(instructions);
}
``````

Now that the different parsing methods, and mainly `program` return the AST, we can go back to the `main` function to use the return value:

``````public static void main(String[] args) throws Exception {
String code = "let y = if true then 45 else x"