Bytecode Implementation #1

Merged
merged 13 commits into from
May 21, 2024
36 changes: 34 additions & 2 deletions README.md
@@ -4,6 +4,8 @@

- A hobbyist functional programming language and interpreter project done in Golang as a way of understanding Golang.

> Inspired by Thorsten Ball's books, "Writing an Interpreter in Go" and "Writing a Compiler in Go". All credit goes to him for the inspiration.

## Introduction
- The goal of this project is to turn a `tree-walking, on-the-fly evaluating interpreter` into a `bytecode compiler` and a `virtual machine` that executes the bytecode.

@@ -300,8 +302,9 @@ cyrus("name"); // Cyrus
- This might sound counter-intuitive, since interpreters and compilers are often seen as opposites, but while their approaches differ, they share a lot in their construction. Both have a frontend that reads in source code written in the source language and turns it into a data structure.

- In both compilers and interpreters, the frontend is usually made up of a lexer and a parser that together generate a syntax tree. Their frontends are similar; their paths diverge afterwards, when each traverses the AST.

- Let's take a look at the lifecycle of code being translated to machine code below:
![Compiler Lifecycle](/assets/compiler-lifecycle.png)

1. The source code is tokenized and parsed by the lexer and parser respectively. This is the frontend. The source code is turned from text to AST.

@@ -378,4 +381,33 @@ cyrus("name"); // Cyrus

- We can see we need to implement two instruction types in total: one for pushing a value onto the stack and another for adding values that are on the stack.

- Let's define the opcodes and how they are encoded in bytecode, then extend the compiler to generate instructions, then create a VM that decodes and executes the instructions. We'll create a new package `code` to define the bytecode instructions and the compiler.

- What we know is that bytecode is made up of instructions, which are a series of bytes. A single instruction consists of an opcode, which is exactly 1 byte wide, followed by zero or more operands.
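
The layout described above can be sketched in a few lines of Go. This is a minimal illustration, not the project's `Make` function: the `opConstant` value and the `encodeConstant` helper are hypothetical names, and the two-byte big-endian operand is assumed from the `OpConstant` definition in the `code` package.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// opConstant is a hypothetical opcode value used only for this sketch.
const opConstant byte = 0

// encodeConstant builds a 3-byte instruction: one opcode byte followed by
// a two-byte operand stored in big-endian order.
func encodeConstant(index uint16) []byte {
	ins := make([]byte, 3)
	ins[0] = opConstant
	binary.BigEndian.PutUint16(ins[1:], index)
	return ins
}

func main() {
	fmt.Println(encodeConstant(65534)) // [0 255 254]
}
```

The most significant byte of the operand comes first, which is why `65534` shows up as `255, 254`.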

- In our `code` package we create `Instructions`, a slice of bytes, and an `Opcode` byte. We define `Instructions` as a `[]byte` because it's far easier to work with a plain `[]byte` and treat a single instruction implicitly than to encode instruction definitions in Go's type system.

- A `Bytecode` definition is missing from the `code` package because we'd run into a nasty import cycle if we defined it there. We will define it in the `compiler` package later.

- What if later on we wanted to push other things to the stack from our Chui code? String literals, for example. Putting those into the bytecode is also possible, since it’s just made of bytes, but it would also be a lot of bloat and would sooner or later become unwieldy.

- That's where `constants` come into play. In this context, “constant” is short for “constant expression” and refers to expressions whose value doesn’t change, is constant, and can be determined at compile time:

![Constants](/assets/constants.png)

- This means we don't have to run the program to know what these expressions evaluate to. A compiler can find them in the code and store the values they evaluate to. It can then reference the constants in the instructions it generates, instead of embedding the values directly in them. The reference is a plain integer that serves as an index into the data structure that holds all constants, known as the `constant pool`. This is what our compiler will do.

- When we come across an integer literal (a constant expression) while compiling, we’ll evaluate it and keep track of the resulting `*object.Integer` by storing it in memory and assigning it a number.
- In the bytecode instructions we’ll refer to the `*object.Integer` by this number. When compiling is done and we pass the instructions to the VM for execution, we’ll also hand over all the constants we found by putting them in a data structure, our constant pool, where the number assigned to each constant can be used as an index to retrieve it.
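
As a rough sketch of that idea, a constant pool can be as small as a slice plus a function that returns the index of the value it just stored. The `pool` type and its `add` method are hypothetical names for illustration, and plain `int64` values stand in for `*object.Integer`.

```go
package main

import "fmt"

// pool is a hypothetical stand-in for the compiler's constant pool.
// It stores int64 values instead of *object.Integer to stay self-contained.
type pool struct {
	constants []int64
}

// add appends a value and returns the index under which it can be retrieved.
func (p *pool) add(v int64) int {
	p.constants = append(p.constants, v)
	return len(p.constants) - 1
}

func main() {
	p := &pool{}
	first := p.add(2)
	second := p.add(40)
	fmt.Println(first, second, p.constants[second]) // 0 1 40
}
```

The returned index is the number an instruction would carry as its operand.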

- Each opcode definition will have an `Op` prefix, and its value will be determined by `iota`. We let `iota` generate increasing byte values because we don’t care about the actual values our opcodes represent. They only need to be distinct from each other and fit in one byte, and `iota` makes sure of that for us.
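
A small sketch of what `iota` produces here, using the same two opcode names as the `code` package. The printed values are incidental; the point is only that they differ and fit in a byte.

```go
package main

import "fmt"

// Opcode is one byte wide, so every opcode value must fit in a byte.
type Opcode byte

const (
	OpConstant Opcode = iota // iota starts at 0 in each const block
	OpAdd                    // the expression is repeated implicitly, giving 1
)

func main() {
	fmt.Println(OpConstant, OpAdd) // 0 1
}
```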

- The definition for `OpConstant` says that its only operand is two bytes wide, making it a `uint16` and limiting its maximum value to `65535`.
- If we include `0`, the number of representable values is `65536`, which should be enough, since I don’t think we’re going to reference more than `65536` constants in our Chui programs.
- Using a `uint16` instead of, say, a `uint32` also helps keep the resulting instructions smaller, because there are fewer unused bytes.
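
One consequence of the two-byte operand is worth seeing in code: converting a larger `int` to `uint16` does not fail, it wraps around. The sketch below uses a hypothetical `fitsOperand` helper that is not part of the PR; it only illustrates the range check a caller could perform before encoding an index.

```go
package main

import (
	"fmt"
	"math"
)

// fitsOperand reports whether a constant index fits in a two-byte operand.
// It is a hypothetical helper for illustration only.
func fitsOperand(index int) bool {
	return index >= 0 && index <= math.MaxUint16
}

func main() {
	fmt.Println(fitsOperand(65535)) // true
	fmt.Println(fitsOperand(65536)) // false

	overflow := 65536
	fmt.Println(uint16(overflow)) // 0, the conversion wraps around silently
}
```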

- We want a system that works end to end as soon as possible, not one that can only be turned on once it’s feature-complete. Our goal in this [PR #1](https://github.com/Cyrus-0101/chui/pull/1) is to build the smallest possible compiler, which should only do one thing for now: produce two `OpConstant` instructions that later instruct the VM to correctly load the integers 2 and 2 onto the stack.
- To achieve that, the minimal compiler has to: traverse the AST passed in, find the `*ast.IntegerLiteral` nodes, evaluate them by turning them into `*object.Integer` objects, add the objects to the `constant pool`, and finally emit `OpConstant` instructions that reference the constants in said pool.
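
To make the other end of that pipeline concrete, here is a rough sketch of how a VM might consume such output. It is not the VM from this project: `run`, the opcode values, and the use of `int64` constants are all assumptions made for illustration. It decodes one opcode byte at a time, reads the two-byte operand of the constant instruction, and pushes the referenced constant onto a stack.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// Hypothetical opcode values mirroring the two opcodes discussed above.
const (
	opConstant byte = iota
	opAdd
)

// run executes instructions against a constant pool and returns the final stack.
func run(ins []byte, constants []int64) ([]int64, error) {
	stack := []int64{}
	for ip := 0; ip < len(ins); ip++ {
		switch ins[ip] {
		case opConstant:
			// The operand occupies the two bytes after the opcode.
			if ip+2 >= len(ins) {
				return nil, fmt.Errorf("truncated operand at %d", ip)
			}
			idx := int(binary.BigEndian.Uint16(ins[ip+1:]))
			if idx >= len(constants) {
				return nil, fmt.Errorf("constant index %d out of range", idx)
			}
			stack = append(stack, constants[idx])
			ip += 2
		case opAdd:
			if len(stack) < 2 {
				return nil, fmt.Errorf("stack underflow at %d", ip)
			}
			n := len(stack)
			// Replace the top two values with their sum.
			stack = append(stack[:n-2], stack[n-2]+stack[n-1])
		default:
			return nil, fmt.Errorf("unknown opcode %d", ins[ip])
		}
	}
	return stack, nil
}

func main() {
	// Two constant instructions referencing pool indexes 0 and 1.
	ins := []byte{opConstant, 0, 0, opConstant, 0, 1}
	stack, err := run(ins, []int64{2, 2})
	fmt.Println(stack, err) // [2 2] <nil>
}
```

Appending a single `opAdd` byte to `ins` would leave one value, the sum, on the stack.
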
Binary file added assets/constants.png
142 changes: 142 additions & 0 deletions src/code/code.go
@@ -0,0 +1,142 @@
// Package code provides functionality for working with bytecode instructions.
//
// It defines the Instructions type, which is a slice of bytes, and the Opcode type, which is a byte.
package code

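import (
	"bytes"
	"encoding/binary"
	"fmt"
)

// Instructions is a sequence of encoded bytecode instructions.
type Instructions []byte

// Opcode identifies an instruction and is always one byte wide.
type Opcode byte

const (
	// OpConstant references a constant in the constant pool by index.
	OpConstant Opcode = iota
	// OpAdd adds values on the stack.
	OpAdd
)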

// Definition represents the definition of an opcode, including its name and the widths of its operands, which is used to determine how many bytes to read to extract the operands.
type Definition struct {
	Name          string
	OperandWidths []int
}

// definitions maps opcodes to their definitions.
var definitions = map[Opcode]*Definition{
	OpConstant: {"OpConstant", []int{2}},
	OpAdd:      {"OpAdd", []int{}},
}

// Lookup retrieves the definition of an opcode.
func Lookup(op byte) (*Definition, error) {
	def, ok := definitions[Opcode(op)]
	if !ok {
		return nil, fmt.Errorf("opcode %d undefined", op)
	}

	return def, nil
}

// Make creates a bytecode instruction from an opcode and its operands.
// Operands are written in big-endian order. An unknown opcode yields an empty slice.
func Make(op Opcode, operands ...int) []byte {
	def, ok := definitions[op]

	if !ok {
		return []byte{}
	}

	instructionLen := 1

	for _, w := range def.OperandWidths {
		instructionLen += w
	}

	instruction := make([]byte, instructionLen)
	instruction[0] = byte(op)
	offset := 1

	for i, o := range operands {
		width := def.OperandWidths[i]

		switch width {

		case 2:
			binary.BigEndian.PutUint16(instruction[offset:], uint16(o))
		}

		offset += width
	}

	return instruction
}

// String returns a string representation of the bytecode instructions, including the offset of each instruction in the bytecode.
func (ins Instructions) String() string {
	var out bytes.Buffer

	i := 0

	for i < len(ins) {
		def, err := Lookup(ins[i])

		if err != nil {
			fmt.Fprintf(&out, "ERROR: %s\n", err)
			// Skip the unknown byte; without advancing i this loop would never terminate.
			i++
			continue
		}

		operands, read := ReadOperands(def, ins[i+1:])

		fmt.Fprintf(&out, "%04d %s\n", i, ins.fmtInstruction(def, operands))

		i += 1 + read
	}

	return out.String()
}

// fmtInstruction formats an instruction for printing.
func (ins Instructions) fmtInstruction(def *Definition, operands []int) string {
	operandCount := len(def.OperandWidths)

	if len(operands) != operandCount {
		return fmt.Sprintf("ERROR: operand len %d does not match defined %d\n",
			len(operands), operandCount)
	}

	switch operandCount {

	case 0:
		return def.Name

	case 1:
		return fmt.Sprintf("%s %d", def.Name, operands[0])
	}

	return fmt.Sprintf("ERROR: unhandled operandCount for %s\n", def.Name)
}

// ReadOperands reads the operands of an instruction and returns them with the number of bytes read.
func ReadOperands(def *Definition, ins Instructions) ([]int, int) {
	operands := make([]int, len(def.OperandWidths))
	offset := 0

	for i, width := range def.OperandWidths {
		switch width {

		case 2:
			operands[i] = int(ReadUint16(ins[offset:]))
		}

		offset += width
	}

	return operands, offset
}

// ReadUint16 reads a big-endian uint16 from a byte slice.
func ReadUint16(ins Instructions) uint16 {
	return binary.BigEndian.Uint16(ins)
}
84 changes: 84 additions & 0 deletions src/code/code_test.go
@@ -0,0 +1,84 @@
package code

import "testing"

func TestMake(t *testing.T) {
	tests := []struct {
		op       Opcode
		operands []int
		expected []byte
	}{
		{OpConstant, []int{65534}, []byte{byte(OpConstant), 255, 254}},
		{OpAdd, []int{}, []byte{byte(OpAdd)}},
	}

	for _, tt := range tests {
		instruction := Make(tt.op, tt.operands...)

		if len(instruction) != len(tt.expected) {
			t.Errorf("instruction has wrong length. want=%d, got=%d",
				len(tt.expected), len(instruction))
			// Skip the byte comparison: indexing a shorter slice would panic.
			continue
		}

		for i, b := range tt.expected {
			if instruction[i] != b {
				t.Errorf("wrong byte at pos %d. want=%d, got=%d",
					i, b, instruction[i])
			}
		}
	}
}

func TestInstructionsString(t *testing.T) {
	instructions := []Instructions{
		Make(OpAdd),
		Make(OpConstant, 2),
		Make(OpConstant, 65535),
	}

	expected := `0000 OpAdd
0001 OpConstant 2
0004 OpConstant 65535
`
	concatted := Instructions{}

	for _, ins := range instructions {
		concatted = append(concatted, ins...)
	}

	if concatted.String() != expected {
		t.Errorf("instructions wrongly formatted.\nwant=%q\ngot=%q",
			expected, concatted.String())
	}
}

func TestReadOperands(t *testing.T) {
	tests := []struct {
		op        Opcode
		operands  []int
		bytesRead int
	}{
		{OpConstant, []int{65535}, 2},
	}
	for _, tt := range tests {
		instruction := Make(tt.op, tt.operands...)

		def, err := Lookup(byte(tt.op))

		if err != nil {
			t.Fatalf("definition not found: %q\n", err)
		}

		operandsRead, n := ReadOperands(def, instruction[1:])

		if n != tt.bytesRead {
			t.Fatalf("n wrong. want=%d, got=%d", tt.bytesRead, n)
		}

		for i, want := range tt.operands {
			if operandsRead[i] != want {
				t.Errorf("operand wrong. want=%d, got=%d", want, operandsRead[i])
			}
		}
	}
}
103 changes: 103 additions & 0 deletions src/compiler/compiler.go
@@ -0,0 +1,103 @@
// Package compiler provides functionality for compiling AST nodes into bytecode instructions.
//
// It emits the result of the compilation, including the emitted instructions and the constant pool.
package compiler

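import (
	"chui/ast"
	"chui/code"
	"chui/object"
	"fmt"
)

// Compiler holds the instructions emitted so far and the constant pool.
type Compiler struct {
	instructions code.Instructions
	constants    []object.Object
}

// New returns a Compiler with no instructions and an empty constant pool.
func New() *Compiler {
	return &Compiler{
		instructions: code.Instructions{},
		constants:    []object.Object{},
	}
}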

// Compile walks the AST recursively and emits bytecode for each node it supports.
func (c *Compiler) Compile(node ast.Node) error {
	switch node := node.(type) {

	case *ast.Program:
		for _, s := range node.Statements {
			err := c.Compile(s)

			if err != nil {
				return err
			}
		}

	case *ast.ExpressionStatement:
		err := c.Compile(node.Expression)

		if err != nil {
			return err
		}

	case *ast.InfixExpression:
		// Operands are compiled first so their values are emitted before the operator.
		err := c.Compile(node.Left)

		if err != nil {
			return err
		}

		err = c.Compile(node.Right)

		if err != nil {
			return err
		}

		switch node.Operator {
		case "+":
			c.emit(code.OpAdd)

		default:
			return fmt.Errorf("unknown operator %s", node.Operator)
		}

	case *ast.IntegerLiteral:
		integer := &object.Integer{Value: node.Value}
		c.emit(code.OpConstant, c.addConstant(integer))
	}

	return nil
}

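// addConstant appends obj to the constant pool and returns its index.
func (c *Compiler) addConstant(obj object.Object) int {
	c.constants = append(c.constants, obj)

	return len(c.constants) - 1
}

// emit builds an instruction, appends it, and returns its starting position.
func (c *Compiler) emit(op code.Opcode, operands ...int) int {
	ins := code.Make(op, operands...)
	pos := c.addInstruction(ins)

	return pos
}

// addInstruction appends ins and returns the position at which it starts.
func (c *Compiler) addInstruction(ins []byte) int {
	posNewInstruction := len(c.instructions)
	c.instructions = append(c.instructions, ins...)

	return posNewInstruction
}

// Bytecode returns the emitted instructions together with the constant pool.
func (c *Compiler) Bytecode() *Bytecode {
	return &Bytecode{
		Instructions: c.instructions,
		Constants:    c.constants,
	}
}

// Bytecode is the result of compilation that is passed on to the VM.
type Bytecode struct {
	Instructions code.Instructions
	Constants    []object.Object
}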