“lex and yacc” are two terms that are used together so often that they can seem
indivisible. lex
takes a string as input and divides it up
into lexemes (roughly speaking “words”) and yacc
takes a sequence
of lexemes and parses them. Since some parsing approaches such as
recursive descent parsing unify these phases, why do lex
and yacc
split them apart?
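To make “unify these phases” concrete, here is a minimal, hypothetical sketch in Rust (all names are my own invention) of a recursive descent parser for assignments like “x = 1;” that consumes characters directly. There is no token stream: recognising “words” (identifiers, integers) and recognising structure (statements) happen in the same functions.

// A hypothetical sketch of recursive descent parsing straight over
// characters: lexing and parsing are not separate phases.
struct Parser<'a> {
    s: &'a [u8],
    pos: usize,
}

impl<'a> Parser<'a> {
    fn skip_ws(&mut self) {
        while self.s.get(self.pos).is_some_and(|c| c.is_ascii_whitespace()) {
            self.pos += 1;
        }
    }

    fn eat(&mut self, c: u8) -> Result<(), String> {
        self.skip_ws();
        if self.s.get(self.pos) == Some(&c) {
            self.pos += 1;
            Ok(())
        } else {
            Err(format!("expected '{}'", c as char))
        }
    }

    // Consume a non-empty run of characters matching `f`: this is the
    // "lexing" part, but here it is just another parsing function.
    fn span(&mut self, f: fn(u8) -> bool, what: &str) -> Result<String, String> {
        self.skip_ws();
        let start = self.pos;
        while self.s.get(self.pos).copied().is_some_and(f) {
            self.pos += 1;
        }
        if start == self.pos {
            return Err(format!("expected {what}"));
        }
        Ok(String::from_utf8_lossy(&self.s[start..self.pos]).into_owned())
    }

    // stmt: ID "=" INT ";"
    fn stmt(&mut self) -> Result<(String, String), String> {
        let id = self.span(|c| c.is_ascii_lowercase(), "an identifier")?;
        self.eat(b'=')?;
        let n = self.span(|c| c.is_ascii_digit(), "an integer")?;
        self.eat(b';')?;
        Ok((id, n))
    }
}

fn main() {
    let mut p = Parser { s: b"x = 1;", pos: 0 };
    println!("{:?}", p.stmt()); // Ok(("x", "1"))
}

With a separate lexing phase, stmt would instead consume ready-made tokens rather than raw characters.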
Running example
Let’s start with a simple language, which allows us to assign simple
expressions to variables:
x = 1;
y = x;
Using lrlex
from the Rust grmtools libraries, I can
define a lexer file for this fragment as follows:
%%
[a-z]+ "ID"
[0-9]+ "INT"
= "="
; ";"
[ \t\n\r]+ ;
If there were any declarations (i.e. settings) they would come before the
“%%” line: there are none in this example. Each line after the “%%” defines
a lexing rule. Lexing rules consist of a regular expression and either an
identifier or “;”. For example, the rule [0-9]+ "INT" says that every
non-empty sequence of digits defines an integer, assigning it the token type
“INT”. If a lexing rule has “;” instead of an identifier, any text matching
the regular expression is discarded: as is common, in this example the lexer
simply discards whitespace (spaces, tabs, and newlines).
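For intuition, the generated lexer behaves roughly like the following hand-written sketch (hypothetical Rust; this is not grmtools’ actual output): at each position it matches one rule, then either emits a (token type, lexeme) pair or, for the whitespace rule, discards the matched text.

// A hypothetical sketch of what the lexing rules above boil down to.
fn lex(s: &str) -> Result<Vec<(&'static str, &str)>, String> {
    let mut toks = Vec::new();
    let b = s.as_bytes();
    let mut i = 0;
    while i < b.len() {
        let start = i;
        if b[i].is_ascii_lowercase() {
            // [a-z]+ "ID"
            while i < b.len() && b[i].is_ascii_lowercase() { i += 1; }
            toks.push(("ID", &s[start..i]));
        } else if b[i].is_ascii_digit() {
            // [0-9]+ "INT"
            while i < b.len() && b[i].is_ascii_digit() { i += 1; }
            toks.push(("INT", &s[start..i]));
        } else if b[i] == b'=' {
            // = "="
            i += 1;
            toks.push(("=", "="));
        } else if b[i] == b';' {
            // ; ";"
            i += 1;
            toks.push((";", ";"));
        } else if b[i].is_ascii_whitespace() {
            // [ \t\n\r]+ ;  -- matched but discarded
            while i < b.len() && b[i].is_ascii_whitespace() { i += 1; }
        } else {
            return Err(format!("lexing error at byte {i}"));
        }
    }
    Ok(toks)
}

On the input above this returns [("ID", "x"), ("=", "="), ("INT", "1"), (";", ";"), ("ID", "y"), ("=", "="), ("ID", "x"), (";", ";")], i.e. the same information lrlex prints below.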
Helpfully, we can run lrlex
on our lex file and input to see
what lexemes it produces:
$ lrlex /tmp/t.l /tmp/t
ID x
= =
INT 1
; ;
ID y
= =
ID x
; ;
Each line tells us the token type and the lexeme (i.e. the text matched).
We can now define a simple yacc grammar which allows us to write
files with multiple “statements” (i.e. variable definitions or assignments):
%%
stmts: stmts stmt | stmt ;
stmt: "ID" "=" expr ";" ;
expr: "INT" | "ID" ;
If you’re unfamiliar with yacc grammars, the stmts rule is a long-winded
way of writing the more regex-like “stmts: stmt+”, i.e. that a valid input
is one or more statements.
We can then use grmtools’ nimbleparse to parse our input relative to our
lexer and grammar:
$ nimbleparse /tmp/t.l /tmp/t.y /tmp/t
stmts
 stmts
  stmt
   ID x
   = =
   expr
    INT 1
   ; ;
 stmt
  ID y
  = =
  expr
   ID x
  ; ;
The indentation in nimbleparse’s output shows us the tree structure: if
you’re not sure what you’re looking at, you might find it helpful to mentally
rotate the tree 90° clockwise.
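Equivalently, each indented line is either an interior rule node or a (token type, lexeme) leaf. As a hypothetical Rust rendering (this is not a grmtools API):

// A hypothetical rendering of nimbleparse's output as a tree value.
enum Tree {
    Node(&'static str, Vec<Tree>),    // rule name, children
    Leaf(&'static str, &'static str), // token type, lexeme
}

fn main() {
    use Tree::*;
    // The subtree for the first statement, "x = 1;":
    let _first_stmt = Node("stmt", vec![
        Leaf("ID", "x"),
        Leaf("=", "="),
        Node("expr", vec![Leaf("INT", "1")]),
        Leaf(";", ";"),
    ]);
}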
Overlapping lexing rules
Let’s now assume that I want to introduce mathematical constants into our
language, specifically the constant “pi”:
z = pi;
Since pi
is a constant, I want to forbid people from
assigning to it. In other words I want this expression to cause an error:
pi = 3;
I could do this in a “type checking” phase after parsing, but there are good
reasons why I might want to make my parser reject such inputs, including: I’m
lazy and don’t want to add a type checking phase; it’s easier and faster for
external tools (e.g. your IDE) to spot incorrect input if it’s encoded
in the parser; and it’s generally easier for users to understand valid inputs
from a grammar than it is from a description of type rules.
Let’s give it a go by changing our lexer to explicitly identify pi:
%%
[a-z]+ "ID"
[0-9]+ "INT"
= "="
; ";"
[ \t\n\r]+ ;
pi "PI"
and our grammar to only allow pi in the right hand side of an expression:
%%
stmts: stmts stmt | stmt ;
stmt: "ID" "=" expr ";" ;
expr: "INT" | "ID" | "PI" ;
Unfortunately when I run it on the input I expected to be rejected, it is parsed without error:
$ nimbleparse /tmp/t.l /tmp/t.y /tmp/t
stmts
 stmt
  ID pi
  = =
  expr
   INT 3
  ; ;
Clearly I’ve done something wrong, but what?
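One observation requires no parsing knowledge at all: in the new lexer, the input “pi” is matched by two rules, [a-z]+ and pi, and both match exactly the same two characters, so the lexer has to pick one of them. A quick, hypothetical way to check the overlap using Rust’s regex crate (not part of grmtools):

// Both of the new lexer's relevant rules match the input "pi", and both
// match the same number of characters, so the lexer must somehow choose
// between them. (Requires the `regex` crate.)
use regex::Regex;

fn main() {
    let rules = [("ID", r"^[a-z]+"), ("PI", r"^pi")];
    for (tok, re) in rules {
        if let Some(m) = Regex::new(re).unwrap().find("pi") {
            println!("{tok} matches {:?}", m.as_str()); // both print "pi"
        }
    }
}

Whatever choice the lexer makes here happens before the parser is ever involved.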
From experience, when a yacc grammar doesn’t do what’s expected, a lot of
people immediately start trying to understand the LR parsing algorithm.
I’ve surprised 3 people in the last month by saying, quite truthfully, that I
no longer remember how LR parsing works. I’m sure I could refresh my memory
fairly quickly if I really needed to, but I have not found my forgetfulness to
be a problem because fixing these sorts of problems rarely
requires understanding the parsing