By the end of this guide we’ll have a minimal, working implementation
of a small part of Lua from scratch. It will be able to run the
following program (among others):
function fib(n)
if n < 2 then
return n;
end
local n1 = fib(n-1);
local n2 = fib(n-2);
return n1 + n2;
end
print(fib(30));
This is my second project in Rust and only the third time I've
invented an instruction set, so don't take my style as gospel. However,
I have found some Rust parsing tutorials overly complex, so I'm hoping
you'll find this one simpler.
All source code is available on GitHub.
Entrypoint
Running cargo init
will give the boilerplate necessary. In
src/main.rs
we'll accept a file name from the command line, perform
lexical analysis to retrieve all tokens from the file, perform grammar
analysis on the tokens to retrieve a tree structure, compile the tree
to a linear set of virtual machine instructions, and finally interpret
the virtual machine instructions.
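Concretely, that pipeline is spread across four files, declared by the mod lines below (src/lex.rs and src/parse.rs are written in the next sections; src/eval.rs holds the compiler and virtual machine):
src/main.rs   -- entrypoint
src/lex.rs    -- lexical analysis
src/parse.rs  -- grammar analysis
src/eval.rs   -- compilation and the virtual machine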
mod eval;
mod lex;
mod parse;
use std::env;
use std::fs;
fn main() {
let args: Vec<String> = env::args().collect();
let contents = fs::read_to_string(&args[1]).expect("Could not read file");
let raw: Vec<char> = contents.chars().collect();
let tokens = match lex::lex(&raw) {
Ok(tokens) => tokens,
Err(msg) => panic!("{}", msg),
};
let ast = match parse::parse(&raw, tokens) {
Ok(ast) => ast,
Err(msg) => panic!("{}", msg),
};
let pgrm = eval::compile(&raw, ast);
eval::eval(pgrm);
}
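Assuming the Fibonacci program from the top of this guide is saved as fib.lua (any file name works, since we just read the first argument), the finished interpreter runs like any other Cargo binary and should print 832040 once the remaining sections are filled in:
cargo run -- fib.lua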
Easy peasy. Now let's implement lex.
Lexical analysis
Lexical analysis drops whitespace (Lua is not whitespace
sensitive) and chunks all source code characters into their
smallest possible meaningful pieces like commas, numbers, identifiers,
keywords, etc.
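For example, the statement local n1 = fib(n-1); from our program lexes into the ten tokens local, n1, =, fib, (, n, -, 1, ) and ;.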
In order to have useful error messages, we'll keep track of state in
the file with a Location struct that implements increment and debug.
This goes in src/lex.rs.
#[derive(Copy, Clone, Debug)]
pub struct Location {
col: i32,
line: i32,
index: usize,
}
The increment
function will update line and column numbers as well
as the current index in the file.
impl Location {
fn increment(&self, newline: bool) -> Location {
if newline {
Location {
index: self.index + 1,
col: 0,
line: self.line + 1,
}
} else {
Location {
index: self.index + 1,
col: self.col + 1,
line: self.line,
}
}
}
And the debug
function will dump the current line with a pointer in
text to the current column along with a message.
pub fn debug<S: Into<String>>(&self, raw: &[char], msg: S) -> String {
let mut line = 0;
let mut line_str = String::new();
// Find the whole line of original source
for c in raw {
if *c == '\n' {
line += 1;
// Done discovering line in question
if !line_str.is_empty() {
break;
}
continue;
}
if self.line == line {
line_str.push_str(&c.to_string());
}
}
let space = " ".repeat(self.col as usize);
format!("{}nn{}n{}^ Near here", msg.into(), line_str, space)
}
}
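To make that concrete: if the lexer later reports an unrecognized character, say a stray # in print(#fib(30));, the message produced by debug would look roughly like this:
Unrecognized character while lexing:

print(#fib(30));
      ^ Near here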
The smallest individual unit after lexical analysis is a token which
is either a keyword, number, identifier, operator, or syntax. (This
implementation is clearly skipping lots of real Lua syntax like
strings.)
#[derive(Debug, PartialEq, Eq, Clone)]
pub enum TokenKind {
Identifier,
Syntax,
Keyword,
Number,
Operator,
}
#[derive(Debug, Clone)]
pub struct Token {
pub value: String,
pub kind: TokenKind,
pub loc: Location,
}
The top-level lex
function will iterate over the file and call a lex
helper for each kind of token, returning an array of all tokens on
success. In between lexing it will "eat whitespace".
pub fn lex(s: &[char]) -> Result<Vec<Token>, String> {
let mut loc = Location {
col: 0,
index: 0,
line: 0,
};
let size = s.len();
let mut tokens: Vec<Token> = vec![];
let lexers = [
lex_keyword,
lex_identifier,
lex_number,
lex_syntax,
lex_operator,
];
'outer: while loc.index < size {
loc = eat_whitespace(s, loc);
if loc.index == size {
break;
}
for lexer in lexers {
let res = lexer(s, loc);
if let Some((t, next_loc)) = res {
loc = next_loc;
tokens.push(t);
continue 'outer;
}
}
return Err(loc.debug(s, "Unrecognized character while lexing:"));
}
Ok(tokens)
}
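If you want a quick sanity check once the helper lexers below are in place, a small test like this (optional, and the exact shape is up to you) can sit at the bottom of src/lex.rs:
#[cfg(test)]
mod lex_tests {
    use super::*;

    #[test]
    fn lexes_a_local_declaration() {
        let raw: Vec<char> = "local a = 1;\n".chars().collect();
        let tokens = lex(&raw).unwrap();

        // The statement should break into exactly these five tokens.
        let values: Vec<String> = tokens.iter().map(|t| t.value.clone()).collect();
        assert_eq!(values, vec!["local", "a", "=", "1", ";"]);

        let kinds: Vec<TokenKind> = tokens.iter().map(|t| t.kind.clone()).collect();
        assert_eq!(
            kinds,
            vec![
                TokenKind::Keyword,
                TokenKind::Identifier,
                TokenKind::Syntax,
                TokenKind::Number,
                TokenKind::Syntax
            ]
        );
    }
}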
Whitespace
Eating whitespace is just incrementing the location while we see a
space, tab, newline, etc.
fn eat_whitespace(raw: &[char], initial_loc: Location) -> Location {
let mut c = raw[initial_loc.index];
let mut next_loc = initial_loc;
while [' ', '\n', '\r', '\t'].contains(&c) {
next_loc = next_loc.increment(c == '\n');
if next_loc.index == raw.len() {
break;
}
c = raw[next_loc.index];
}
next_loc
}
Numbers
Lexing numbers iterates through the source starting at a position
until it stops seeing decimal digits (this implementation only
supports integers).
fn lex_number(raw: &[char], initial_loc: Location) -> Option<(Token, Location)> {
let mut ident = String::new();
let mut next_loc = initial_loc;
let mut c = raw[initial_loc.index];
while c.is_digit(10) {
ident.push_str(&c.to_string());
next_loc = next_loc.increment(false);
c = raw[next_loc.index];
}
If there are no digits in the string then this is not a number.
if !ident.is_empty() {
Some((
Token {
value: ident,
loc: initial_loc,
kind: TokenKind::Number,
},
next_loc,
))
} else {
None
}
}
Identifiers
Identifiers are any collection of alphabet characters, numbers, and
underscores.
fn lex_identifier(raw: &[char], initial_loc: Location) -> Option<(Token, Location)> {
let mut ident = String::new();
let mut next_loc = initial_loc;
let mut c = raw[initial_loc.index];
while c.is_alphanumeric() || c == '_' {
ident.push_str(&c.to_string());
next_loc = next_loc.increment(false);
c = raw[next_loc.index];
}
But they cannot start with a number.
// First character must not be a digit
if ident.len() > 0 && !ident.chars().next().unwrap().is_digit(10) {
Some((
Token {
value: ident,
loc: initial_loc,
kind: TokenKind::Identifier,
},
next_loc,
))
} else {
None
}
}
Keywords
Keywords are alphabetical like identifiers are but they cannot be
reused as variables by the user.
fn lex_keyword(raw: &[char], initial_loc: Location) -> Option<(Token, Location)> {
let syntax = ["function", "end", "if", "then", "local", "return"];
let mut next_loc = initial_loc;
let mut value = String::new();
'outer: for possible_syntax in syntax {
let mut c = raw[initial_loc.index];
next_loc = initial_loc;
while c.is_alphanumeric() || c == '_' {
value.push_str(&c.to_string());
next_loc = next_loc.increment(false);
c = raw[next_loc.index];
let n = next_loc.index - initial_loc.index;
// Stop if we've read past the end of this keyword (e.g. "function1")
if n > possible_syntax.len() || value != possible_syntax[..n] {
value = String::new();
continue 'outer;
}
}
// Not a complete match
if value.len() < possible_syntax.len() {
value = String::new();
continue;
}
// If it got to this point it found a match, so exit early.
// We don't need a longest match.
break;
}
if value.is_empty() {
return None;
}
Aside from matching a list of strings we have to make sure
there is a complete match. For example function1
is not a keyword,
it's a valid identifier. Whereas function 1
is a valid set of tokens
(the keyword function
and the number 1
), even if it's not a valid
Lua grammar.
// If the next character would be part of a valid identifier, then
// this is not a keyword.
if next_loc.index < raw.len() - 1 {
let next_c = raw[next_loc.index];
if next_c.is_alphanumeric() || next_c == '_' {
return None;
}
}
Some((
Token {
value: value,
loc: initial_loc,
kind: TokenKind::Keyword,
},
next_loc,
))
}
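To see the complete-match rule in action, a quick test (again optional, not needed by the interpreter itself) can check that function followed by a space lexes as a keyword while function1 does not:
#[cfg(test)]
mod keyword_tests {
    use super::*;

    #[test]
    fn keywords_require_a_complete_match() {
        let loc = Location { col: 0, line: 0, index: 0 };

        // "function " is the keyword followed by whitespace.
        let exact: Vec<char> = "function ".chars().collect();
        assert!(lex_keyword(&exact, loc).is_some());

        // "function1" reads past the keyword, so lex_keyword rejects it
        // and lex_identifier will pick it up instead.
        let prefixed: Vec<char> = "function1 ".chars().collect();
        assert!(lex_keyword(&prefixed, loc).is_none());
    }
}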
Syntax
Syntax (in this context) is just language junk that isn't
operators. Things like commas, parentheses, etc.
fn lex_syntax(raw: &[char], initial_loc: Location) -> Option<(Token, Location)> {
let syntax = [";", "=", "(", ")", ","];
for possible_syntax in syntax {
let c = raw[initial_loc.index];
let next_loc = initial_loc.increment(false);
// TODO: this won't work with multiple-character syntax bits like >= or ==
if possible_syntax == c.to_string() {
return Some((
Token {
value: possible_syntax.to_string(),
loc: initial_loc,
kind: TokenKind::Syntax,
},
next_loc,
));
}
}
None
}
Operators
Operators are things like plus, minus, and less than
symbols. Operators are syntax but it helps us later on to break these
out into a separate type of token.
fn lex_operator(raw: &[char], initial_loc: Location) -> Option<(Token, Location)> {
let operators = ["+", "-", "<"];
for possible_syntax in operators {
let c = raw[initial_loc.index];
let next_loc = initial_loc.increment(false);
// TODO: this won't work with multiple-character operators like >= or ==
if possible_syntax == c.to_string() {
return Some((
Token {
value: possible_syntax.to_string(),
loc: initial_loc,
kind: TokenKind::Operator,
},
next_loc,
));
}
}
None
}
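The TODO above (and the matching one in lex_syntax) is easy to address if you ever need multi-character operators like == or <=: list the longest candidates first and advance the location once per matched character. Here's a sketch of that variant; it isn't used in this guide's minimal implementation:
fn lex_long_operator(raw: &[char], initial_loc: Location) -> Option<(Token, Location)> {
    // Longest candidates first so "<=" wins over "<".
    let operators = ["==", "<=", ">=", "+", "-", "<"];
    for possible_operator in operators {
        let mut next_loc = initial_loc;
        let mut matched = true;
        for expected in possible_operator.chars() {
            // Bail out if we run off the end of the file or hit a mismatch.
            if next_loc.index >= raw.len() || raw[next_loc.index] != expected {
                matched = false;
                break;
            }
            next_loc = next_loc.increment(false);
        }
        if matched {
            return Some((
                Token {
                    value: possible_operator.to_string(),
                    loc: initial_loc,
                    kind: TokenKind::Operator,
                },
                next_loc,
            ));
        }
    }
    None
}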
And now we're all done lexing!
Grammar analysis
Parsing finds grammatical (tree) patterns in a flat list of
tokens. This is called a syntax tree or abstract syntax tree (AST).
The boring part is defining the tree. Generally speaking (and
specifically for this project), the syntax tree is a list of
statements. Statements can be function definitions or expression
statements or if statements or return statements or local
declarations.
This goes in src/parse.rs.
#[derive(Debug)]
pub enum Statement {
Expression(Expression),
If(If),
FunctionDeclaration(FunctionDeclaration),
Return(Return),
Local(Local),
}
pub type Ast = Vec<Statement>;
There's almost nothing special at all about the rest of the tree
definitions.
#[derive(Debug)]
pub enum Literal {
Identifier(Token),
Number(Token),
}
#[derive(Debug)]
pub struct FunctionCall {
pub name: Token,
pub arguments: Vec<Expression>,
}
#[derive(Debug)]
pub struct BinaryOperation {
pub operator: Token,
pub left: Box<Expression>,
pub right: Box<Expression>,
}