By the end of this guide we’ll have a minimal, working implementation
of a small part of Lua from scratch. It will be able to run the
following program (among others):
function fib(n)
if n < 2 then
return n;
end
local n1 = fib(n-1);
local n2 = fib(n-2);
return n1 + n2;
end
print(fib(30));
This is my second project in Rust and only the third time I've
invented an instruction set, so don't take my style as gospel. However,
I have found some Rust parsing tutorials overly complex, so I'm hoping
you'll find this one simpler.
All source code is available on GitHub.
Entrypoint
Running cargo init
will give the boilerplate necessary. In
src/main.rs
we'll accept a file name from the command line, perform
lexical analysis to retrieve all tokens from the file, perform grammar
analysis on the tokens to retrieve a tree structure, compile the tree
to a linear set of virtual machine instructions, and finally interpret
the virtual machine instructions.
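Concretely, that pipeline is spread across four files, declared by the mod lines below (src/lex.rs and src/parse.rs are written in the next sections; src/eval.rs holds the compiler and virtual machine):
src/main.rs   -- entrypoint
src/lex.rs    -- lexical analysis
src/parse.rs  -- grammar analysis
src/eval.rs   -- compilation and the virtual machine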
mod eval;
mod lex;
mod parse;
use std::env;
use std::fs;
fn main() {
let args: Vec<String> = env::args().collect();
let contents = fs::read_to_string(&args[1]).expect("Could not read file");
let raw: Vec<char> = contents.chars().collect();
let tokens = match lex::lex(&raw) {
Ok(tokens) => tokens,
Err(msg) => panic!("{}", msg),
};
let ast = match parse::parse(&raw, tokens) {
Ok(ast) => ast,
Err(msg) => panic!("{}", msg),
};
let pgrm = eval::compile(&raw, ast);
eval::eval(pgrm);
}
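Assuming the Fibonacci program from the top of this guide is saved as fib.lua (any file name works, since we just read the first argument), the finished interpreter runs like any other Cargo binary and should print 832040 once the remaining sections are filled in:
cargo run -- fib.lua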
Easy peasy. Now let's implement lex.
Lexical analysis
Lexical analysis drops whitespace (Lua is not whitespace
sensitive) and chunks all source code characters into their
smallest possible meaningful pieces like commas, numbers, identifiers,
keywords, etc.
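For example, the statement local n1 = fib(n-1); from our program lexes into the ten tokens local, n1, =, fib, (, n, -, 1, ) and ;.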
In order to have useful error messages, we'll keep track of state in
the file with a Location struct that implements increment and debug.
This goes in src/lex.rs.
#[derive(Copy, Clone, Debug)]
pub struct Location {
col: i32,
line: i32,
index: usize,
}
The increment
function will update line and column numbers as well
as the current index in the file.
impl Location {
fn increment(&self, newline: bool) -> Location {
if newline {
Location {
index: self.index + 1,
col: 0,
line: self.line + 1,
}
} else {
Location {
index: self.index + 1,
col: self.col + 1,
line: self.line,
}
}
}
And the debug
function will dump the current line with a pointer in
text to the current column along with a message.
pub fn debug<S: Into<String>>(&self, raw: &[char], msg: S) -> String {
let mut line = 0;
let mut line_str = String::new();
// Find the whole line of original source
for c in raw {
if *c == '\n' {
line += 1;
// Done discovering line in question
if !line_str.is_empty() {
break;
}
continue;
}
if self.line == line {
line_str.push_str(&c.to_string());
}
}
let space = " ".repeat(self.col as usize);
format!("{}nn{}n{}^ Near here", msg.into(), line_str, space)
}
}
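To make that concrete: if the lexer later reports an unrecognized character, say a stray # in print(#fib(30));, the message produced by debug would look roughly like this:
Unrecognized character while lexing:

print(#fib(30));
      ^ Near here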
The smallest individual unit after lexical analysis is a token which
is either a keyword, number, identifier, operator, or syntax. (This
implementation is clearly skipping lots of real Lua syntax like
strings.)
#[derive(Debug, PartialEq, Eq, Clone)]
pub enum TokenKind {
Identifier,
Syntax,
Keyword,
Number,
Operator,
}
#[derive(Debug, Clone)]
pub struct Token {
pub value: String,
pub kind: TokenKind,
pub loc: Location,
}
The top-level lex
function will iterate over the file and call a lex
helper for each kind of token, returning an array of all tokens on
success. In between lexing it will "eat whitespace".
pub fn lex(s: &[char]) -> Result<Vec<Token>, String> {
let mut loc = Location {
col: 0,
index: 0,
line: 0,
};
let size = s.len();
let mut tokens: Vec<Token> = vec![];
let lexers = [
lex_keyword,
lex_identifier,
lex_number,
lex_syntax,
lex_operator,
];
'outer: while loc.index < size {
loc = eat_whitespace(s, loc);
if loc.index == size {
break;
}
for lexer in lexers {
let res = lexer(s, loc);
if let Some((t, next_loc)) = res {
loc = next_loc;
tokens.push(t);
continue 'outer;
}
}
return Err(loc.debug(s, "Unrecognized character while lexing:"));
}
Ok(tokens)
}
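If you want a quick sanity check once the helper lexers below are in place, a small test like this (optional, and the exact shape is up to you) can sit at the bottom of src/lex.rs:
#[cfg(test)]
mod lex_tests {
    use super::*;

    #[test]
    fn lexes_a_local_declaration() {
        let raw: Vec<char> = "local a = 1;\n".chars().collect();
        let tokens = lex(&raw).unwrap();

        // The statement should break into exactly these five tokens.
        let values: Vec<String> = tokens.iter().map(|t| t.value.clone()).collect();
        assert_eq!(values, vec!["local", "a", "=", "1", ";"]);

        let kinds: Vec<TokenKind> = tokens.iter().map(|t| t.kind.clone()).collect();
        assert_eq!(
            kinds,
            vec![
                TokenKind::Keyword,
                TokenKind::Identifier,
                TokenKind::Syntax,
                TokenKind::Number,
                TokenKind::Syntax
            ]
        );
    }
}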
Whitespace
Eating whitespace is just incrementing the location while we see a
space, tab, newline, etc.
fn eat_whitespace(raw: &[char], initial_loc: Location) -> Location {
let mut c = raw[initial_loc.index];
let mut next_loc = initial_loc;
while [' ', '\n', '\r', '\t'].contains(&c) {
next_loc = next_loc.increment(c == '\n');
if next_loc.index == raw.len() {
break;
}
c = raw[next_loc.index];
}
next_loc
}
Numbers
Lexing numbers iterates through the source starting at a position
until it stops seeing decimal digits (this implementation only
supports integers).
fn lex_number(raw: &[char], initial_loc: Location) -> Option<(Token, Location)> {
let mut ident = String::new();
let mut next_loc = initial_loc;
let mut c = raw[initial_loc.index];
while c.is_digit(10) {
ident.push_str(&c.to_string());
next_loc = next_loc.increment(false);
c = raw[next_loc.index];
}
If there are no digits in the string then this is not a number.
if !ident.is_empty() {
Some((
Token {
value: ident,
loc: initial_loc,
kind: TokenKind::Number,
},
next_loc,
))
} else {
None
}
}
Identifiers
Identifiers are any collection of alphabet characters, numbers, and
underscores.
fn lex_identifier(raw: &[char], initial_loc: Location) -> Option<(Token, Location)> {
let mut ident = String::new();
let mut next_loc = initial_loc;
let mut c = raw[initial_loc.index];
while c.is_alphanumeric() || c == '_' {
ident.push_str(&c.to_string());
next_loc = next_loc.increment(false);
c = raw[next_loc.index];
}
But they cannot start with a number.
// First character must not be a digit
if ident.len() > 0 && !ident.chars().next().unwrap().is_digit(10) {
Some((
Token {
value: ident,
loc: initial_loc,
kind: TokenKind::Identifier,
},
next_loc,
))
} else {
None
}
}
Keywords
Keywords are alphabetical like identifiers are but they cannot be
reused as variables by the user.
fn lex_keyword(raw: &[char], initial_loc: Location) -> Option<(Token, Location)> {
let syntax = ["function", "end", "if", "then", "local", "return"];
let mut next_loc = initial_loc;
let mut value = String::new();
'outer: for possible_syntax in syntax {
let mut c = raw[initial_loc.index];
next_loc = initial_loc;
while c.is_alphanumeric() || c == '_' {
value.push_str(&c.to_string());
next_loc = next_loc.increment(false);
c = raw[next_loc.index];
let n = next_loc.index - initial_loc.index;
// Stop if we've read past the end of this keyword (e.g. "function1")
if n > possible_syntax.len() || value != possible_syntax[..n] {
value = String::new();
continue 'outer;
}
}
// Not a complete match
if value.len() < possible_syntax.len() {
value = String::new();
continue;
}
// If it got to this point it found a match, so exit early.
// We don't need a longest match.
break;
}
if value.is_empty() {
return None;
}
Aside from matching a list of strings we have to make sure
there is a complete match. For example function1
is not a keyword,
it's a valid identifier. Whereas function 1
is a valid set of tokens
(the keyword function
and the number 1
), even if it's not a valid
Lua grammar.
// If the next character would be part of a valid identifier, then
// this is not a keyword.
if next_loc.index < raw.len() - 1 {
let next_c = raw[next_loc.index];
if next_c.is_alphanumeric() || next_c == '_' {
return None;
}
}
Some((
Token {
value: value,
loc: initial_loc,
kind: TokenKind::Keyword,
},
next_loc,
))
}
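To see the complete-match rule in action, a quick test (again optional, not needed by the interpreter itself) can check that function followed by a space lexes as a keyword while function1 does not:
#[cfg(test)]
mod keyword_tests {
    use super::*;

    #[test]
    fn keywords_require_a_complete_match() {
        let loc = Location { col: 0, line: 0, index: 0 };

        // "function " is the keyword followed by whitespace.
        let exact: Vec<char> = "function ".chars().collect();
        assert!(lex_keyword(&exact, loc).is_some());

        // "function1" reads past the keyword, so lex_keyword rejects it
        // and lex_identifier will pick it up instead.
        let prefixed: Vec<char> = "function1 ".chars().collect();
        assert!(lex_keyword(&prefixed, loc).is_none());
    }
}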
Syntax
Syntax (in this context) is just language junk that isn't
operators. Things like commas, parentheses, etc.
fn lex_syntax(raw: &[char], initial_loc: Location) -> Option<(Token, Location)> {
let syntax = [";", "=", "(", ")", ","];
for possible_syntax in syntax {
let c = raw[initial_loc.index];
let next_loc = initial_loc.increment(false);
// TODO: this won't work with multiple-character syntax bits like >= or ==
if possible_syntax == c.to_string() {
return Some((
Token {
value: possible_syntax.to_string(),
loc: initial_loc,
kind: TokenKind::Syntax,
},
next_loc,
));
}
}
None
}
Operators
Operators are things like plus, minus, and less than
symbols. Operators are syntax but it helps us later on to break these
out into a separate type of token.
fn lex_operator(raw: &[char], initial_loc: Location) -> Option<(Token, Location)> {
let operators = ["+", "-", "<"];
for possible_syntax in operators {
let c = raw[initial_loc.index];
let next_loc = initial_loc.increment(false);
// TODO: this won't work with multiple-character operators like >= or ==
if possible_syntax == c.to_string() {
return Some((
Token {
value: possible_syntax.to_string(),
loc: initial_loc,
kind: TokenKind::Operator,
},
next_loc,
));
}
}
None
}
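The TODO above (and the matching one in lex_syntax) is easy to address if you ever need multi-character operators like == or <=: list the longest candidates first and advance the location once per matched character. Here's a sketch of that variant; it isn't used in this guide's minimal implementation:
fn lex_long_operator(raw: &[char], initial_loc: Location) -> Option<(Token, Location)> {
    // Longest candidates first so "<=" wins over "<".
    let operators = ["==", "<=", ">=", "+", "-", "<"];
    for possible_operator in operators {
        let mut next_loc = initial_loc;
        let mut matched = true;
        for expected in possible_operator.chars() {
            // Bail out if we run off the end of the file or hit a mismatch.
            if next_loc.index >= raw.len() || raw[next_loc.index] != expected {
                matched = false;
                break;
            }
            next_loc = next_loc.increment(false);
        }
        if matched {
            return Some((
                Token {
                    value: possible_operator.to_string(),
                    loc: initial_loc,
                    kind: TokenKind::Operator,
                },
                next_loc,
            ));
        }
    }
    None
}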
And now we're all done lexing!
Grammar analysis
Parsing finds grammatical (tree) patterns in a flat list of
tokens. This is called a syntax tree or abstract syntax tree (AST).
The boring part is defining the tree. Generally speaking (and
specifically for this project), the syntax tree is a list of
statements. Statements can be function definitions or expression
statements or if statements or return statements or local
declarations.
This goes in src/parse.rs.
#[derive(Debug)]
pub enum Statement {
Expression(Expression),
If(If),
FunctionDeclaration(FunctionDeclaration),
Return(Return),
Local(Local),
}
pub type Ast = Vec<Statement>;
There's almost nothing special at all about the rest of the tree
definitions.
#[derive(Debug)]
pub enum Literal {
Identifier(Token),
Number(Token),
}
#[derive(Debug)]
pub struct FunctionCall {
pub name: Token,
pub arguments: Vec<Expression>,
}
#[derive(Debug)]
pub struct BinaryOperation {
pub operator: Token,
pub left: Box<Expression>,
pub right: Box<Expression>,
}