chore(docs): Update compiler walkthrough #2092

Merged
47 changes: 35 additions & 12 deletions docs/contributor/compiler_walkthrough.md
@@ -4,49 +4,68 @@ This guide will take you through all of the phases of the compiler to give you a

We'll largely be following the `next_state` function in [compile.re](https://github.com/grain-lang/grain/blob/main/compiler/src/compile.re).

## An overview of the compiler

The Grain compiler is a [multi-stage](https://en.wikipedia.org/wiki/Multi-pass_compiler) compiler, which means instead of converting directly from Grain syntax into `wasm` code, we send the input program through multiple phases, transforming from one intermediate representation to the next until we get to the final output. This approach allows us to have a more maintainable compiler and perform deeper analysis of the source code, which lets us provide better errors and better code output.
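As a rough illustration of the multi-stage idea (not Grain's actual `next_state` implementation), the sketch below threads a value through an ordered list of phases, each one consuming the previous representation. The toy functions `lex`, `parse`, and `emit`, and the `PHASES` list, are hypothetical stand-ins chosen for this example.

```python
from typing import Any, Callable, List, Tuple

# Each phase takes the previous representation and returns the next one.
Phase = Tuple[str, Callable[[Any], Any]]


def lex(source: str) -> List[str]:
    # Toy lexer: split on whitespace only.
    return source.split()


def parse(tokens: List[str]) -> Tuple[str, ...]:
    # Toy parser: wrap the tokens in a single "program" node.
    return ("program", *tokens)


def emit(tree: Tuple[str, ...]) -> str:
    # Toy code generator: render the tree as one line of text.
    return " ".join(tree)


# Hypothetical phase list; the real compiler has many more stages.
PHASES: List[Phase] = [("lex", lex), ("parse", parse), ("emit", emit)]


def compile_program(source: str) -> Any:
    state: Any = source
    for _name, run in PHASES:
        state = run(state)
    return state
```

Keeping the phases in a list makes it easy to stop after any stage and inspect the intermediate representation, which is one of the maintainability benefits described above.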

## File Structure
All files directly related to the compiler can be found in `compiler/src`. The sub-folders are mapped out below:
* `src/parsing` - all code related to lexing and parsing
* `src/typed` - all code related to typechecking and the typed phases of the compiler
* `src/codegen` - all code related to generating both the mashtree and the final wasm output, which are the last two compilation steps before linking
* `src/linking` - the Grain linker, which links the intermediate wasm modules into the final wasm output
* `src/diagnostics` - all code related to parsing and handling comments for `graindoc`
* `src/formatting` - all code related to the Grain formatter
* `src/language_server` - all code related to the language server
* `src/utils` - common helpers used in various places throughout the compiler

## Lexing

Lexing is the process of breaking up a string input into tokens. A Grain program string is tokenized into things like:
The first stage of the compiler is [lexing](https://en.wikipedia.org/wiki/Lexical_analysis), the process of breaking up an input string into tokens that are easier to parse into an abstract syntax tree later. A Grain program string is tokenized into things like:

- keywords (`let`, `import`, `data`, `assert`, etc.)
- constants (`17`, `'foobar'`, etc.)
- delimiters (`{`, `}`, `[`, `]`, `,`, `;`, etc.)
- keywords (`let`, `include`, `type`, `assert`, etc.)
- constants (`17`, `"foobar"`, `1uL`, `'a'`, etc.)
- delimiters (`{`, `}`, `[`, `]`, `,`, etc.)
- operators (`*`, `+`, `==`, `&&`, etc.)
- identifiers (`myVar`, `List`, etc.)
- comments (`// this is a comment`, etc.)

To make this happen, we use [ocamllex](https://caml.inria.fr/pub/docs/manual-ocaml/lexyacc.html). `ocamllex` is a tool that generates OCaml code to do this based on rules we've defined in [parsing/lexer.re](https://github.com/grain-lang/grain/blob/main/compiler/src/parsing/lexer.re).
The Grain compiler uses [sedlex](https://github.com/ocaml-community/sedlex) to build tokenization rules from easy-to-maintain patterns that we've defined in [parsing/lexer.re](https://github.com/grain-lang/grain/blob/main/compiler/src/parsing/lexer.re).
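To make the idea of tokenization rules concrete, here is a minimal sketch of a regex-driven tokenizer. It is not how `lexer.re` or sedlex works internally; the rule names, the `KEYWORDS` set, and the `tokenize` function are assumptions made for illustration, and only a handful of the token kinds listed above are covered.

```python
import re
from typing import List, Tuple

# A small, illustrative subset of keywords.
KEYWORDS = {"let", "include", "type", "assert"}

# Order matters: earlier rules win, and "==" is tried before "=".
TOKEN_SPEC = [
    ("string", r'"[^"\n]*"'),
    ("number", r"\d+"),
    ("ident", r"[A-Za-z_][A-Za-z0-9_]*"),
    ("operator", r"==|&&|[*+=]"),
    ("delimiter", r"[{}\[\],()]"),
    ("skip", r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def tokenize(source: str) -> List[Tuple[str, str]]:
    """Return (kind, text) pairs, dropping whitespace."""
    tokens = []
    pos = 0
    while pos < len(source):
        match = MASTER.match(source, pos)
        if match is None:
            raise SyntaxError(f"unexpected character {source[pos]!r} at offset {pos}")
        kind, text = match.lastgroup, match.group()
        pos = match.end()
        if kind == "skip":
            continue
        # Identifiers that appear in the keyword set are reclassified.
        if kind == "ident" and text in KEYWORDS:
            kind = "keyword"
        tokens.append((kind, text))
    return tokens
```

For example, `tokenize('let x = 17')` yields a keyword, an identifier, an operator, and a number, in that order.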

## Parsing

Once we've got our tokens, we move on to parsing. The goal of parsing is to take our tokens and produce an abstract syntax tree, or AST. An AST is a representation of the program's structure in a format that's easier for us to work with than the tokens themselves. For example, think of the tokens `1` `+` `2` `+` `3`. In reality, the `+` operator only works on two operands at once, so after parsing, we end up with a tree that looks like this:
After the program has been tokenized, we move to the [parsing](https://en.wikipedia.org/wiki/Parsing) stage. The goal of parsing is to convert our lexed tokens to an [abstract syntax tree](https://en.wikipedia.org/wiki/Abstract_syntax_tree), or AST. An AST is a more contextual representation of the given program structure in a format that is easier to analyze and understand compared to individual tokens. For example, operator precedence isn't very clear in a tokenized program. For the list of tokens `` `1`, `+`, `2`, `*`, `3` ``, it's unclear that the multiplication should happen first. After parsing, we end up with a tree that looks like this:

```plaintext
    add
   /   \
 add    3
 / \
1   2

  add
 /   \
1   times
    /   \
   2     3
```
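The precedence example for the tokens `1`, `+`, `2`, `*`, `3` can be sketched with a small hand-written precedence-climbing parser. This is only an illustration of how precedence shapes the tree; the `PRECEDENCE` table, the node names, and `parse_expression` are hypothetical, and it handles only integer operands with `+` and `*`.

```python
from typing import List, Tuple, Union

Tree = Union[int, Tuple[str, "Tree", "Tree"]]

# Higher numbers bind tighter.
PRECEDENCE = {"+": 1, "*": 2}
NAMES = {"+": "add", "*": "times"}


def parse_expression(tokens: List[str], min_prec: int = 1) -> Tree:
    """Consume tokens from the front of the list and return a tree."""
    left: Tree = int(tokens.pop(0))
    while tokens and PRECEDENCE.get(tokens[0], 0) >= min_prec:
        op = tokens.pop(0)
        # Parse the right side with a higher minimum so that tighter
        # operators are grouped first and equal ones associate left.
        right = parse_expression(tokens, PRECEDENCE[op] + 1)
        left = (NAMES[op], left, right)
    return left
```

With this sketch, `1 + 2 * 3` nests the multiplication under the addition, while `1 + 2 + 3` groups to the left.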

Writing a parser by hand is great when you've got a stable grammar, but the language is still rapidly evolving. Using a parser generator allows us to iterate quickly. [Menhir](http://gallium.inria.fr/~fpottier/menhir/) is an excellent production-grade parser generator that produces OCaml code for a parser based on some rules we've defined. We call these rules a "grammar" and you can find the grammar for the Grain language in [parsing/parser.mly](https://github.com/grain-lang/grain/blob/main/compiler/src/parsing/parser.mly). If you'd like to learn more about BNF grammars, check out [this resource](http://people.cs.ksu.edu/~schmidt/300s05/Lectures/GrammarNotes/bnf.html).
Writing a parser by hand is great when you've got a stable language grammar, but Grain is in rapid development. To allow us to quickly make changes to the language, we use a parser generator. [Menhir](http://gallium.inria.fr/~fpottier/menhir/) is an excellent production-grade parser generator that produces OCaml code for a parser based on parser rules we've defined. We call these rules a "grammar" and you can find the grammar for the Grain language in [parsing/parser.mly](https://github.com/grain-lang/grain/blob/main/compiler/src/parsing/parser.mly). If you'd like to learn more about BNF grammars, check out [this resource](http://people.cs.ksu.edu/~schmidt/300s05/Lectures/GrammarNotes/bnf.html). Menhir also supports custom messages for specific syntax errors, which are configured through the [parsing/parser.messages](https://github.com/grain-lang/grain/blob/main/compiler/src/parsing/parser.messages) file.

The definition for the Grain AST (which we often refer to as the parsetree) can be found in [parsing/parsetree.re](https://github.com/grain-lang/grain/blob/main/compiler/src/parsing/parsetree.re).

## Well-formedness

This is just a fancy term for asking the question "does this program, for the most part, make sense?" In Grain, type identifiers must always start with a capital letter, so there's a well-formedness check that enforces this. In general, we like to be as lenient as possible while parsing and provide helpful error messages from well-formedness checks. If a user writes a program like `data foo = ...`, it's much better to say `Error: 'foo' should be capitalized` rather than `Syntax error`.
This is just a fancy term for asking the question "does this program, for the most part, make sense?" In Grain, type identifiers must always start with a capital letter, so there's a well-formedness check that enforces this. In general, we like to be as lenient as possible while parsing and provide helpful error messages from well-formedness checks. If a user writes a program like `type foo = ...`, it's much better to say `Error: 'foo' should be capitalized` rather than `Syntax error`.

You can find the Grain well-formedness checks in [parsing/well_formedness.re](https://github.com/grain-lang/grain/blob/main/compiler/src/parsing/well_formedness.re).
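A well-formedness check of this kind can be sketched as a pass that runs over already-parsed names and collects friendly errors. The function below is a hypothetical illustration, not the logic in `well_formedness.re`; it assumes the parser has already accepted the program and handed over a list of declared type names.

```python
from typing import List


def check_type_names(type_names: List[str]) -> List[str]:
    """Return one error message per type name that is not capitalized."""
    errors = []
    for name in type_names:
        # Slicing avoids an IndexError on an empty name.
        if not name[:1].isupper():
            errors.append(f"Error: '{name}' should be capitalized")
    return errors
```

Because the parser stays lenient and this pass runs afterwards, the user sees a targeted message instead of a generic syntax error.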

## Typechecking

Grain implements a [Hindley-Milner type system](https://en.wikipedia.org/wiki/Hindley%E2%80%93Milner_type_system). This is by far the most academically challenging step of the compilation process. As such, the Grain typechecker is largely borrowed from the [OCaml compiler](https://github.com/ocaml/ocaml) (yay open source 🎉). This is the process that infers the type of all Grain expressions, and makes sure they line up.
Grain implements a [Hindley-Milner type system](https://en.wikipedia.org/wiki/Hindley%E2%80%93Milner_type_system). This is by far the most academically challenging step of the compilation process. As such, the Grain typechecker is largely borrowed from the [OCaml compiler](https://github.com/ocaml/ocaml) (yay open source 🎉). [Typechecking](https://en.wikipedia.org/wiki/Type_system#Type_checking) is the process of verifying that the program consistently applies the right kinds of functions to the right kinds of data. The Grain typechecker is also responsible for inferring the type of all Grain expressions. For example, the typechecker will infer the type of `add` in `let add = (x, y) => x + y` to be `(x: Number, y: Number) => Number`. If the user calls `add` with an argument of the wrong type, such as `add(1, "test")`, the typechecker reports an error at compile time. Typechecking drastically reduces the number of bugs encountered at runtime.
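The checking half of this process can be sketched as follows. This is a deliberately tiny illustration that only compares literal arguments against a known function signature; it performs no inference or unification, so it is not Hindley-Milner and not how `typecore.re` works. The `ENV` table, `type_of_literal`, and `check_call` are hypothetical names.

```python
from typing import Dict, List, Tuple, Union

# A function type is (parameter types, return type); other types are names.
FuncType = Tuple[Tuple[str, ...], str]
Type = Union[str, FuncType]


def type_of_literal(value: object) -> str:
    # bool is tested first because Python's bool is a subclass of int.
    if isinstance(value, bool):
        return "Bool"
    if isinstance(value, (int, float)):
        return "Number"
    if isinstance(value, str):
        return "String"
    raise TypeError(f"unsupported literal: {value!r}")


def check_call(env: Dict[str, Type], name: str, args: List[object]) -> str:
    """Return the call's result type or raise TypeError on a mismatch."""
    func = env[name]
    if isinstance(func, str):
        raise TypeError(f"{name} is not a function")
    params, result = func
    if len(params) != len(args):
        raise TypeError(f"{name} expects {len(params)} arguments, got {len(args)}")
    for index, (expected, arg) in enumerate(zip(params, args), start=1):
        actual = type_of_literal(arg)
        if actual != expected:
            raise TypeError(f"argument {index} of {name}: expected {expected}, got {actual}")
    return result


# Assumed signature for the `add` example: two Numbers in, one Number out.
ENV: Dict[str, Type] = {"add": (("Number", "Number"), "Number")}
```

Here `check_call(ENV, "add", [1, 2])` returns `"Number"`, while `check_call(ENV, "add", [1, "test"])` raises a `TypeError` naming the offending argument.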

The internals pretty much never need to be touched 🙏, though it's sometimes necessary to make changes to how we make calls to the typechecker in [typed/typemod.re](https://github.com/grain-lang/grain/blob/main/compiler/src/typed/typemod.re) or [typed/typecore.re](https://github.com/grain-lang/grain/blob/main/compiler/src/typed/typecore.re).

After typechecking a module, we're left with a typedtree. You can find the definition in [typed/typedtree.re](https://github.com/grain-lang/grain/blob/main/compiler/src/typed/typedtree.re).

## Typed Well-formedness

After typechecking, we have more information about the program. We do a second well-formedness pass to further weed out invalid programs. This takes place in [typed/typed_well_formedness.re](https://github.com/grain-lang/grain/blob/main/compiler/src/typed/typed_well_formedness.re).

## Linearization

In this step, we convert the typedtree into [A-normal Form](https://en.wikipedia.org/wiki/A-normal_form), or ANF. The purpose of this step is to create a linear set of expressions that could be performed in order from start to finish. For example, given the expression `foo(3 * 4, bar(5))`, we'd want to produce:
@@ -127,6 +146,10 @@ The code generation (or codegen) step is where we generate the actual WebAssembl

If you're curious about the wasm spec in general, you can check it out [here](https://webassembly.github.io/spec/core/index.html).

## Linking

Each Grain source file is compiled to a Grain-specific wasm file. To create the final program, we merge all of the files together in a step known as linking. This takes place in [linking/link.re](https://github.com/grain-lang/grain/blob/main/compiler/src/linking/link.re).

## Emission

Lastly, we write the wasm to a file. And that's it! If you want an in-depth dive into any of the stages of the compiler, feel free to ask in the Discord and someone will be more than happy to walk you through it.