Harlan Haskins

Software Engineer | About | Projects | Resume | Contact

Building a Compiler in Swift with LLVM, Part 2: AST and the Parser

In Part 1 of this tutorial, we built a lexer in Swift that can tokenize the Kaleidoscope language. Now we're going to write a parser that can turn code from this language into an Abstract Syntax Tree, or AST.

This is the second part in a 4-part series where we'll build a compiler for the basic form of the Kaleidoscope sample language.

Quick Links

AST Structure

An AST represents the structure of your program as it was written in the source text. ASTs usually have different data structures for the different kinds of declarations, statements, and expressions in the language. In Kaleidoscope, we're going to have three kinds of structures.

Prototypes

A function prototype includes the name of the function and the names of the function's parameters.

For example, in the function definition:

def foo(a, b, c) 0;

the prototype will have the name "foo", and the parameters ["a", "b", "c"].

We'll make it a struct that includes both of those values:

struct Prototype {
    let name: String
    let params: [String]
}

This means we can use this prototype as both the prototype of an extern declaration and inside a function definition. Speaking of...

Externs and Definitions

Kaleidoscope allows the programmer to specify existing named functions in the C standard library without having to implement those functions.

For example, if I want to use the math.h function:

double sqrt(double n);

then I can declare its existence in Kaleidoscope as:

extern sqrt(n);

Because Kaleidoscope only has Double-typed numbers, we can get away with not specifying types in the declaration. All parameters are implicitly double, and all functions return double.

A Kaleidoscope function definition has two parts: a prototype and an expression. The expression will have variables that reference the parameters given to the function when it's called.

We can represent it with another struct, this time called Definition:

struct Definition {
    let prototype: Prototype
    let expr: Expr
}

Expressions

There are a few different kinds of expressions in the Kaleidoscope language. You have:

  • Numbers, which have a Double as a value
  • Variables
  • Binary operators, which have two subexpressions and an operator
  • Function Calls, which have a name and an array of expressions.
  • If-else expressions, which behave like ternary expressions in Swift

Because there is a small set of cases, we can represent the expressions with another Swift enum:

indirect enum Expr {
    case number(Double)
    case variable(String)
    case binary(Expr, BinaryOperator, Expr)
    case call(String, [Expr])
    case ifelse(Expr, Expr, Expr)
}

Parser

Kaleidoscope has a simple grammar that can be represented in Backus-Naur Form:

<prototype>  ::= <identifier> "(" <params> ")"
<params>     ::= <identifier>
               | <identifier>, <params>
<definition> ::= "def" <prototype> <expr> ";"
<extern>     ::= "extern" <prototype> ";"
<operator>   ::= "+" | "-" | "*" | "/" | "%"
<expr>       ::= <binary> | <call> | <identifier> | <number> | <ifelse>
               | "(" <expr> ")"
<binary>     ::= <expr> <operator> <expr>
<call>       ::= <identifier> "(" <arguments> ")"
<ifelse>     ::= "if" <expr> "then" <expr> "else" <expr>
<arguments>  ::= <expr>
               | <expr> "," <arguments>

The names on the left of the ::= are called "terms", and the values on the right are the possible values that may represent these terms.

In this form, each term can be substituted with the value to the right.

We'll be building a recursive descent parser in Swift that can turn tokens emitted by our Lexer into the AST we declared earlier.

A Note on Recursive Descent Parsers

Notice in the grammar above that <expr> has <binary> as a case, and that <binary> has two <expr>s inside. That's where the "recursive" comes in. The parsers for <expr> and <binary> are going to call into each other.

Kaleidoscope's Parser

First, begin with a class definition for our Parser. It needs to keep a list of tokens and a current index into that list.

class Parser {
    let tokens: [Token]
    var index = 0

    init(tokens: [Token]) {
        self.tokens = tokens
    }
}

Next, just like our Lexer, we're going to want some abstractions that make it easier to parse each of the terms. Let's define those in the class:

var currentToken: Token? {
    return index < tokens.count ? tokens[index] : nil
}

func advance() {
    index += 1
}

/// Eats the specified token if it's the currentToken,
/// otherwise throws an error.
func parse(_ token: Token) throws {
    guard let tok = currentToken else {
        throw ParseError.unexpectedEOF
    }
    guard token == tok else {
        throw ParseError.unexpectedToken(token)
    }
    advance()
}

We'll also use a ParseError enum that specifies a few errors our parser can throw:

enum ParseError: Error {
    case unexpectedToken(Token)
    case unexpectedEOF
}

Okay, now we're going to parse each term we use in the grammar. We can combine some terms that are only used in one place. First, we'll want a function that parses a single identifier. We need to ensure that the current token is an .identifier token, and extract the name from it.

func parseIdentifier() throws -> String {
    guard let token = currentToken else {
        throw ParseError.unexpectedEOF
    }
    guard case .identifier(let name) = token else {
        throw ParseError.unexpectedToken(token)
    }
    advance()
    return name
}

Once we have this, we can make use of it inside the parsers for other terms.

Next, we'll write the parser for Prototypes. We'll want to parse:

  • a single identifier, then
  • a left paren, then
  • a comma-separated list of identifiers, and finally
  • a right paren.

But wait — we've seen this pattern before in the grammar. Left paren, comma-separated list, then right paren... that's very similar to the grammar for function call arguments! The only difference is the function call arguments contain a list of expressions instead of a list of identifiers.

Because we dislike code duplication, we can use some more advanced Swift features to reuse the logic: generics and higher-order functions.

Let's make a function:

func parseCommaSeparated<TermType>(_ parseFn: () throws -> TermType) throws -> [TermType]

This function is generic, which means it will collect whatever type the passed-in function returns into a list.

func parseCommaSeparated<TermType>(_ parseFn: () throws -> TermType) throws -> [TermType] {
    try parse(.leftParen)
    var vals = [TermType]()
    while let tok = currentToken, tok != .rightParen {
        let val = try parseFn()
        if case .comma? = currentToken {
            try parse(.comma)
        }
        vals.append(val)
    }
    try parse(.rightParen)
    return vals
}

Doing this, our parser for the prototype term is simple:

func parsePrototype() throws -> Prototype {
    let name = try parseIdentifier()
    let params = try parseCommaSeparated(parseIdentifier)
    return Prototype(name: name, params: params)
}

And now the way we parse extern definitions is simple, just parse the .extern token, then return a prototype:

func parseExtern() throws -> Prototype {
    try parse(.extern)
    let proto = try parsePrototype()
    try parse(.semicolon)
    return proto
}

Expressions

Parsing expressions is slightly more complicated. Since there are different kinds of expressions, we'll have to switch on the token to determine which term we're going to parse.

It'll look like this:

  • If we see a left paren, then we're going to eat it, recurse to parse the inner expression, then eat the right paren.
  • If we see a number, then we can eat it and make a .number expression.
  • If we see an identifier, then we can eat it.
    • If after that we see a left paren, then we can parse the arguments to a function call and make a .call expression
    • Otherwise, make a .variable expression.
  • If we see the .if token, then we eat it, parse the condition expression, eat the .then, parse the then expression, eat the .else, and parse the else expression.
  • If we see any other token, throw an error.

We keep track of this expression, then check for an operator token.

  • If we see an operator token, then eat it and recurse to parse another expression. We then package these two expressions into a .binary expression.
  • Otherwise, return the first expression we parsed.

The full code for this is attached at the bottom.

Definitions

We're almost done!

The last thing we need to parse are function definitions. Recall that a function definition includes a prototype and an expression for the body. We already have parsers for both of those terms, so we can just re-use those parsers and be done.

func parseDefinition() throws -> Definition {
    try parse(.def)
    let prototype = try parsePrototype()
    let expr = try parseExpr()
    let def = Definition(prototype: prototype, expr: expr)
    try parse(.semicolon)
    return def
}

All Together Now

Now that we have parsers for each of the terms in the language, we can make a struct that holds all the file-level declarations in the program. These can be function definitions, extern declarations, or expressions.

We just switch over the current token until we have no more tokens left. If the code was malformed during the parsing, it will throw an error and we will bubble it up to the caller.

If the code was correct, then we just read until there are no more tokens and return a File class with the file-level definitions.

func parseFile() throws -> File {
    let file = File()
    while let tok = currentToken {
        switch tok {
        case .extern:
            file.addExtern(try parseExtern())
        case .def:
            file.addDefinition(try parseDefinition())
        default:
            let expr = try parseExpr()
            try consume(.semicolon)
            file.addExpression(expr)
        }
    }
    return file
}

Let's Give It a Shot!

let toks = Lexer(input: "extern sqrt(n); def foo(n) (n * sqrt(n * 200) + 57 * n % 2);").lex()
let topLevel = try Parser(tokens: toks).parseTopLevel()
print(topLevel)
// TopLevel(externs: [Prototype(name: "sqrt", params: ["n"])], definitions: [Definition(prototype: Prototype(name: "foo", params: ["n"]), expr: Expr.binary(Expr.variable("n"), BinaryOperator.times, Expr.binary(Expr.call("sqrt", [Expr.binary(Expr.variable("n"), BinaryOperator.times, Expr.number(200.0))]), BinaryOperator.plus, Expr.binary(Expr.number(57.0), BinaryOperator.times, Expr.binary(Expr.variable("n"), BinaryOperator.mod, Expr.number(2.0))))))])

It works!

Now we have a parser and syntax tree. The next step is to generate LLVM IR for this code and then we'll have a real functioning compiler!

Code Listing

The full code for the tutorial so far is listed below and available as a gist: