Filip Smola

Yesterday I completed the first development version of basilisk, my custom programming language and LLVM frontend. This marks the first milestone on my road to learning how programming languages work and how compilers are implemented.

In its current state, basilisk is not very useful, as it only has the most basic features required to write a program that executes. Most notably, it is missing any data type other than double and has almost no way to control the flow of execution (such as conditional blocks). But it is a complete system from lexer to object code emitter, which makes it a great base to build upon.

For detailed information about the language, see the Grammar and Language descriptions in the GitHub repository. This article describes how the compiler is designed.

General Design

The compiler is an LLVM frontend, which allows me to use all compatible architectures as targets as well as leverage LLVM’s in-built optimization passes. This means that the system is split into three parts:

  • Lexer, which takes the source code and splits it into tokens.
  • Parser, which forms an Abstract Syntax Tree (AST) from the tokens.
  • Code generator, which generates LLVM IR from the AST.

There is also a single executable that uses these three parts to process a source file and, using LLVM functionality, output an object file.

Both the lexer and the parser use function pointers provided to them to get input, and the lexer also uses one to produce output. This allows just about any buffer-like data structure to be used with these systems; for example, vector-backed buffers are used when unit testing the lexer, while a file input stream is used in the compiler executable. Another advantage of this approach is that the lexer and parser can, to a certain degree, operate in parallel, with the lexer producing tokens and the parser consuming them.
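
As a minimal sketch of this interface (names and signatures here are illustrative, not basilisk's actual ones), a vector-backed source and sink can be wired up with lambdas:

#include <cstddef>
#include <functional>
#include <iostream>
#include <string>
#include <utility>
#include <vector>

// Illustrative stand-in; basilisk's real Token also carries a tag.
struct Token { std::string content; };

// The lexer only ever sees these three callbacks, so any buffer-like data
// structure can back them. Toy rule: one token per character.
void lex(const std::function<char()> &get,
         const std::function<char()> &peek,
         const std::function<void(Token)> &append) {
    while (peek() != '\0') {
        append(Token{std::string(1, get())});
    }
}

int main() {
    // Vector-backed buffers, as used when unit testing the lexer:
    std::vector<char> input{'1', '+', '2', ';'};
    std::size_t pos = 0;
    std::vector<Token> output;

    lex([&] { return pos < input.size() ? input[pos++] : '\0'; }, // get
        [&] { return pos < input.size() ? input[pos] : '\0'; },   // peek
        [&](Token t) { output.push_back(std::move(t)); });        // append
    std::cout << output.size() << " tokens\n";                    // 4 tokens
}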

Due to my plans of slowly extending both the language and the compiler, an important objective is extensibility of the code. Adding new tokens, AST nodes and generating code from them should be as easy as possible, so that the language can one day become useful beyond a learning experience.

Lexer

Basilisk’s lexer was hand-written for this project. I considered using a lexer generator, but all of the available options were too restrictive about how the grammar has to be presented. Furthermore, generating a lexer wouldn’t teach me anything about how they are built.

The current design uses three provided functions:

  • char get() — returns and consumes the next character in the input,
  • char peek() — returns the next character in the input without consuming it,
  • void append(Token) — appends a token to the output.

It uses peek() to examine the next character and decide what kind of token it could be a part of. Single-character tokens (e.g. semicolon) are immediately recognised and dealt with. Tokens consisting of multiple characters are dealt with by specialised functions that use the characters to progressively constrain the set of possible tokens.

For example, “returning” could be either an identifier or the return keyword. The function reads the maximum-length identifier and then decides, based on the string actually read, whether to produce a RETURN or an IDENTIFIER token.
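
A sketch of that disambiguation, under the same assumed callbacks (the real lexer handles more keywords and token detail):

#include <cctype>
#include <functional>
#include <string>

// Hypothetical token representation for this sketch.
enum class TokenTag { RETURN, IDENTIFIER };
struct Token { TokenTag tag; std::string content; };

// Reads the maximum-length identifier and only then decides whether the
// characters spell a keyword.
Token lex_identifier_or_keyword(const std::function<char()> &get,
                                const std::function<char()> &peek) {
    std::string text;
    while (std::isalnum(static_cast<unsigned char>(peek())) || peek() == '_') {
        text.push_back(get());
    }
    if (text == "return") {
        return {TokenTag::RETURN, {}};   // keyword: no content needed
    }
    return {TokenTag::IDENTIFIER, text}; // e.g. "returning"
}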

Lexing continues until an end-of-file character is encountered, at which point a special token is appended to the output to notify the parser, and the lexer terminates.

Parser

The parser is hand-written as well, for the same reasons as the lexer. It uses two provided functions similar to the ones used in the lexer:

  • Token get() — returns and consumes the next token in the input,
  • Token peek(unsigned int) — returns the nth token from the next without consuming it.

No append function is used, as the output is the root node of an Abstract Syntax Tree. The parameter of the peek function is needed because the grammar I use is not LL(1): due to the lack of data types, a function definition and a function call start with the same tokens (e.g. f(a, b, c) and f(x, y, z);), so the parser needs lookahead to discover whether the parenthesised expression list is followed by a left curly bracket or a semicolon.
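
For illustration, a simplified version of that check (hypothetical tag names; it assumes peek(0) is the next token, the input starts at the identifier, and there are no nested parentheses):

#include <functional>

enum class TokenTag { IDENTIFIER, LPAR, RPAR, COMMA, LBRACE, SEMICOLON };
struct Token { TokenTag tag; };

// Decides between definition and call without consuming any tokens.
bool is_function_definition(const std::function<Token(unsigned int)> &peek) {
    unsigned int n = 2;                         // skip IDENTIFIER and LPAR
    while (peek(n).tag != TokenTag::RPAR) {
        ++n;                                    // walk over the expression list
    }
    return peek(n + 1).tag == TokenTag::LBRACE; // '{' definition, ';' call
}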

While the lexer is structured as a relatively small set of functions, the parser is a set of classes that each parse a category of AST nodes. Due to the nested nature of expressions, classes result in a much clearer system. These classes use each other following a coarse-grained hierarchy of the AST nodes; for example, StatementParser makes use of ExpressionParser in most of its methods. Whenever one parser object makes use of another, it passes along pointers to its own provided functions to make sure both use the same token source. Therefore, any tokens the contained parser consumes are effortlessly accounted for in its creator.
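
A rough sketch of this composition, with the AST reduced to a plain double for brevity (the real classes construct AST nodes):

#include <functional>
#include <utility>

struct Token { double value; }; // reduced to just a number for this sketch

using Get = std::function<Token()>;
using Peek = std::function<Token(unsigned int)>;

// Parses a single expression; here just one number literal.
class ExpressionParser {
  public:
    ExpressionParser(Get get, Peek peek)
        : get(std::move(get)), peek(std::move(peek)) {}
    double parse() { return get().value; } // consumes one token
  private:
    Get get;
    Peek peek;
};

// Parses "expression ;" by handing its own callbacks to an inner
// ExpressionParser, so both draw from the same token stream.
class StatementParser {
  public:
    StatementParser(Get get, Peek peek)
        : get(std::move(get)), peek(std::move(peek)) {}
    double parse() {
        ExpressionParser expression(get, peek); // shares the token source
        double value = expression.parse();      // consumption propagates
        get();                                  // consume the semicolon
        return value;
    }
  private:
    Get get;
    Peek peek;
};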

This all results in a decoupled system with separate sections of code dedicated to specialized tasks, which should make the parser easier to extend and debug.

AST Visitor

When designing the code generator, I found that the visitor pattern would be the best fit. As a bonus, it should allow me to implement some language-specific optimization and analysis passes in the future.

For this purpose, I created an abstract base class Visitor with an overload of the visit(...) method for each AST node type. All overloads except visit(Node&) have a simple default implementation that casts the argument to the node type's parent in the AST node hierarchy and calls the visit(...) overload for that type; for example, visit(statements::Return&) calls visit(Statement&) if not overridden. This way, if a concrete Visitor implementation doesn't override the visit(...) method for a particular AST node type, any calls to it will be handled by the most specific visit(...) overload that it does override. Because the general type Node has no parent in the hierarchy, visit(Node&) is the only pure virtual method and represents the most general visit action.

This allows a concrete Visitor implementation to work only on a subset of the AST nodes, implicitly disregarding all other nodes, by overriding only the desired visit(...) overloads and implementing visit(Node&) as follows:

void visit(Node &) override {}
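
Put together, the scheme looks roughly like this (a toy hierarchy; basilisk's real one has many more node types):

#include <iostream>

struct Node { virtual ~Node() = default; };
struct Statement : Node {};
struct Return : Statement {};

// Every overload falls back to its parent type until an overridden one is
// reached; only visit(Node &) is pure virtual.
class Visitor {
  public:
    virtual ~Visitor() = default;
    virtual void visit(Node &) = 0;
    virtual void visit(Statement &node) { visit(static_cast<Node &>(node)); }
    virtual void visit(Return &node) { visit(static_cast<Statement &>(node)); }
};

// A concrete visitor that only cares about statements:
class StatementCounter : public Visitor {
  public:
    unsigned int count = 0;
    void visit(Node &) override {}                // ignore all other nodes
    void visit(Statement &) override { ++count; } // catches Return too
};

int main() {
    Return node;
    StatementCounter counter;
    Visitor &visitor = counter;
    visitor.visit(node); // Return & falls back to the Statement & override
    std::cout << counter.count << "\n"; // prints 1
}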

A simple visitor example is the PrintVisitor in basilisk::ast::util, which prints an AST as an indented string. This implementation is much easier to extend with new nodes than an earlier one that used instanceof-style type checks to determine the node types.

Code Generator

While I was familiar with the concepts of lexers and parsers from my courses, interaction with LLVM was completely new to me. Some familiarity with assembly helped, but this was still by far the most confusing part of the frontend. The LLVM Kaleidoscope tutorial was a great help with this.

The code generator is a set of visitors that operate on subsets of the AST node hierarchy (for example, all expressions) and modify an LLVM module. The function and expression visitors each process a set of AST node types to generate the relevant LLVM constructs; the expression visitor, for example, turns expression nodes into llvm::Value instances representing the expression results. The program visitor prepares the module by adding the standard library (at the moment just a single print function) and the global variable initializer function. It then passes all definitions to a function visitor (which also handles variable definitions, because at the moment they are just wrappers around assignment statements).
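
As a flavour of what the expression visitor emits, here is a fragment in the style of the LLVM Kaleidoscope tutorial (not basilisk's actual code):

#include "llvm/ADT/APFloat.h"
#include "llvm/IR/Constants.h"
#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/LLVMContext.h"

// Visiting a literal node yields a constant; since every basilisk value is
// a double, floating-point instructions are used throughout.
llvm::Value *emit_literal(llvm::LLVMContext &context, double value) {
    return llvm::ConstantFP::get(context, llvm::APFloat(value));
}

// Visiting an addition node combines the already-generated operand values.
llvm::Value *emit_addition(llvm::IRBuilder<> &builder,
                           llvm::Value *lhs, llvm::Value *rhs) {
    return builder.CreateFAdd(lhs, rhs, "addtmp");
}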

Problems Encountered

Because the only data type is double, a wrapper function is needed to satisfy the requirement that the main function return int or void. When a function named main is present in the source, it gets renamed to main_. Once all definitions are read, the program visitor creates a main function that just calls main_ and casts the returned value to an integer. While this is a hacky solution, it only needs to be in place until more data types are added, and so is sufficient for now.
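
A sketch of how the program visitor might build that wrapper with the LLVM API (identifiers are illustrative; the real generator differs in detail):

#include "llvm/IR/BasicBlock.h"
#include "llvm/IR/Function.h"
#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/Module.h"

// Emits int main() { return (int) main_(); } assuming double main_() is
// already present in the module.
void emit_main_wrapper(llvm::Module &module, llvm::IRBuilder<> &builder) {
    llvm::Function *inner = module.getFunction("main_");

    auto *wrapper_type = llvm::FunctionType::get(builder.getInt32Ty(), false);
    auto *wrapper = llvm::Function::Create(
        wrapper_type, llvm::Function::ExternalLinkage, "main", &module);

    auto *entry =
        llvm::BasicBlock::Create(module.getContext(), "entry", wrapper);
    builder.SetInsertPoint(entry);

    llvm::Value *result = builder.CreateCall(inner);        // call main_()
    llvm::Value *as_int =
        builder.CreateFPToSI(result, builder.getInt32Ty()); // cast to i32
    builder.CreateRet(as_int);
}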

Not necessarily a problem, but more of an ambiguity, is the ability to redefine global variables. I have found that there is no inherent problem with having multiple definitions of the same global variable, and as it stands, nothing in the language or compiler stops you from doing this. The variable uses the last initializer provided, so the program starts with the variable holding that value. This can, however, be confusing and hard to read, so I wouldn't recommend it. In a later version, when combining multiple source files is added (probably working like #include in C), I might only allow a single definition per global variable, as multiple definitions could become very hard to manage once several source files are in play.

Future Extensions

The first planned language extension is more data types. This includes at least the typical primitives like int, bool, … and hopefully arrays and structs as well. These are important because a lot of further features rely on them. In keeping with the general theme, I want to make these as extensible as possible. I also plan to have these data types mirror those used in LLVM (e.g. i32) to provide as much fine-grained control as possible. The main immediate benefit will be resolving the problem with the main function, allowing me to remove the current hacky solution.

Another feature that should be fairly straightforward to implement is nested blocks. These provide good control over scope, making code more readable and allowing better variable naming. This will also give me some practice towards later implementing namespaces, as well as providing a base on which the next feature can be built.

The previous two features come together to allow me to implement conditional statements and loops: if-then-else blocks, and while, do-while and for loops. Once these are implemented, the utility of the language should jump considerably and open the door to further features.

When it comes to compiler extensions, adding more information to tokens is a priority. Currently, tokens only contain their type and, optionally, content. This severely limits the information that can be included in error messages. I want to include at least the source file name and the line and character positions of the token. These would allow error messages to show exactly where the problem is, greatly simplifying the debugging of basilisk programs.
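
The extended token might look something like this (hypothetical; the current Token has only the first two fields):

#include <string>

struct Token {
    enum class Tag { RETURN, IDENTIFIER, NUMBER /* ... */ } tag;
    std::string content;    // present only for some token types

    // Planned additions for error reporting:
    std::string file;       // source file name
    unsigned int line;      // line of the token's first character
    unsigned int column;    // character position within that line
};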