“lex and yacc” are two terms that are used together so often that they can seem
indivisible. lex
takes a string as input and divides it up
into lexemes (roughly speaking “words”) and yacc
takes a sequence
of lexemes and parses them. Since some parsing approaches such as
recursive descent parsing unify these phases, why do lex
and yacc
split them apart?
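To make “unify these phases” concrete, here is a minimal, hypothetical sketch in Rust (all names are my own invention) of a recursive descent parser for assignments like “x = 1;” that consumes characters directly. There is no token stream: recognising “words” (identifiers, integers) and recognising structure (statements) happen in the same functions.

// A hypothetical sketch of recursive descent parsing straight over
// characters: lexing and parsing are not separate phases.
struct Parser<'a> {
    s: &'a [u8],
    pos: usize,
}

impl<'a> Parser<'a> {
    fn skip_ws(&mut self) {
        while self.s.get(self.pos).is_some_and(|c| c.is_ascii_whitespace()) {
            self.pos += 1;
        }
    }

    fn eat(&mut self, c: u8) -> Result<(), String> {
        self.skip_ws();
        if self.s.get(self.pos) == Some(&c) {
            self.pos += 1;
            Ok(())
        } else {
            Err(format!("expected '{}'", c as char))
        }
    }

    // Consume a non-empty run of characters matching `f`: this is the
    // "lexing" part, but here it is just another parsing function.
    fn span(&mut self, f: fn(u8) -> bool, what: &str) -> Result<String, String> {
        self.skip_ws();
        let start = self.pos;
        while self.s.get(self.pos).copied().is_some_and(f) {
            self.pos += 1;
        }
        if start == self.pos {
            return Err(format!("expected {what}"));
        }
        Ok(String::from_utf8_lossy(&self.s[start..self.pos]).into_owned())
    }

    // stmt: ID "=" INT ";"
    fn stmt(&mut self) -> Result<(String, String), String> {
        let id = self.span(|c| c.is_ascii_lowercase(), "an identifier")?;
        self.eat(b'=')?;
        let n = self.span(|c| c.is_ascii_digit(), "an integer")?;
        self.eat(b';')?;
        Ok((id, n))
    }
}

fn main() {
    let mut p = Parser { s: b"x = 1;", pos: 0 };
    println!("{:?}", p.stmt()); // Ok(("x", "1"))
}

With a separate lexing phase, stmt would instead consume ready-made tokens rather than raw characters.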
Running example
Let’s start with a simple language, which allows us to assign simple
expressions to variables:
x = 1;
y = x;
Using lrlex
from the Rust grmtools libraries, I can
define a lexer file for this fragment as follows:
%%
[a-z]+ "ID"
[0-9]+ "INT"
= "="
; ";"
[ \t\n\r]+ ;
If there were any declarations (i.e. settings) they would come before the
“%%” line: there are none in this example. Each line after the “%%” defines
a lexing rule. Lexing rules consist of a regular expression and either an
identifier or “;”. For example, the rule [0-9]+ "INT" says that every
non-empty sequence of digits defines an integer, assigning it the token type
“INT”. If a lexing rule has “;” instead of an identifier, any text matching
the regular expression is discarded: as is common, in this example the lexer
simply discards whitespace (spaces, tabs, and newlines).
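For intuition, the generated lexer behaves roughly like the following hand-written sketch (hypothetical Rust; this is not grmtools’ actual output): at each position it matches one rule, then either emits a (token type, lexeme) pair or, for the whitespace rule, discards the matched text.

// A hypothetical sketch of what the lexing rules above boil down to.
fn lex(s: &str) -> Result<Vec<(&'static str, &str)>, String> {
    let mut toks = Vec::new();
    let b = s.as_bytes();
    let mut i = 0;
    while i < b.len() {
        let start = i;
        if b[i].is_ascii_lowercase() {
            // [a-z]+ "ID"
            while i < b.len() && b[i].is_ascii_lowercase() { i += 1; }
            toks.push(("ID", &s[start..i]));
        } else if b[i].is_ascii_digit() {
            // [0-9]+ "INT"
            while i < b.len() && b[i].is_ascii_digit() { i += 1; }
            toks.push(("INT", &s[start..i]));
        } else if b[i] == b'=' {
            // = "="
            i += 1;
            toks.push(("=", "="));
        } else if b[i] == b';' {
            // ; ";"
            i += 1;
            toks.push((";", ";"));
        } else if b[i].is_ascii_whitespace() {
            // [ \t\n\r]+ ;  -- matched but discarded
            while i < b.len() && b[i].is_ascii_whitespace() { i += 1; }
        } else {
            return Err(format!("lexing error at byte {i}"));
        }
    }
    Ok(toks)
}

On the input above this returns [("ID", "x"), ("=", "="), ("INT", "1"), (";", ";"), ("ID", "y"), ("=", "="), ("ID", "x"), (";", ";")], i.e. the same information lrlex prints below.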
Helpfully, we can run lrlex
on our lex file and input to see
what lexemes it produces:
$ lrlex /tmp/t.l /tmp/t
ID x
= =
INT 1
; ;
ID y
= =
ID x
; ;
Each line tells us the token type and the lexeme (i.e. the text matched).
We can now define a simple yacc grammar which allows us to write
files with multiple “statements” (i.e. variable definitions or assignments):
%%
stmts: stmts stmt | stmt ;
stmt: "ID" "=" expr ";" ;
expr: "INT" | "ID" ;
If you’re unfamiliar with yacc grammars, the stmts rule is a long-winded
way of writing the more regex-like “stmts: stmt+”, i.e. that a valid input
is one or more statements.
We can then use grmtools’ nimbleparse to parse our input relative to our
lexer and grammar:
$ nimbleparse /tmp/t.l /tmp/t.y /tmp/t
stmts
 stmts
  stmt
   ID x
   = =
   expr
    INT 1
   ; ;
 stmt
  ID y
  = =
  expr
   ID x
  ; ;
The indentation in nimbleparse’s output shows us the tree structure: if
you’re not sure what you’re looking at, you might find it helpful to mentally
rotate the tree 90° clockwise.
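Equivalently, each indented line is either an interior rule node or a (token type, lexeme) leaf. As a hypothetical Rust rendering (this is not a grmtools API):

// A hypothetical rendering of nimbleparse's output as a tree value.
enum Tree {
    Node(&'static str, Vec<Tree>),    // rule name, children
    Leaf(&'static str, &'static str), // token type, lexeme
}

fn main() {
    use Tree::*;
    // The subtree for the first statement, "x = 1;":
    let _first_stmt = Node("stmt", vec![
        Leaf("ID", "x"),
        Leaf("=", "="),
        Node("expr", vec![Leaf("INT", "1")]),
        Leaf(";", ";"),
    ]);
}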
Overlapping lexing rules
Let’s now assume that I want to introduce mathematical constants into our
language, specifically the constant “pi”:
z = pi;
Since pi
is a constant, I want to forbid people from
assigning to it. In other words I want this expression to cause an error:
pi = 3;
I could do this in a “type checking” phase after parsing, but there are good
reasons why I might want to make my parser reject such inputs, including: I’m
lazy and don’t want to add a type checking phase; it’s easier and faster for
external tools (e.g. your IDE) to spot incorrect input if it’s encoded
in the parser; and it’s generally easier for users to understand valid inputs
from a grammar than it is from a description of type rules.
Let’s give it a go by changing our lexer to explicitly identify pi:
%%
[a-z]+ "ID"
[0-9]+ "INT"
= "="
; ";"
[ \t\n\r]+ ;
pi "PI"
and our grammar to only allow pi in the right hand side of an expression:
%%
stmts: stmts stmt | stmt ;
stmt: "ID" "=" expr ";" ;
expr: "INT" | "ID" | "PI" ;
Unfortunately when I run it on the input I expected to be rejected, it is parsed without error:
$ nimbleparse /tmp/t.l /tmp/t.y /tmp/t
stmts
 stmt
  ID pi
  = =
  expr
   INT 3
  ; ;
Clearly I’ve done something wrong, but what?
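One observation requires no parsing knowledge at all: in the new lexer, the input “pi” is matched by two rules, [a-z]+ and pi, and both match exactly the same two characters, so the lexer has to pick one of them. A quick, hypothetical way to check the overlap using Rust’s regex crate (not part of grmtools):

// Both of the new lexer's relevant rules match the input "pi", and both
// match the same number of characters, so the lexer must somehow choose
// between them. (Requires the `regex` crate.)
use regex::Regex;

fn main() {
    let rules = [("ID", r"^[a-z]+"), ("PI", r"^pi")];
    for (tok, re) in rules {
        if let Some(m) = Regex::new(re).unwrap().find("pi") {
            println!("{tok} matches {:?}", m.as_str()); // both print "pi"
        }
    }
}

Whatever choice the lexer makes here happens before the parser is ever involved.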
From experience, when a yacc grammar doesn’t do what’s expected, a lot of
people immediately start trying to understand the LR parsing algorithm.
I’ve surprised 3 people in the last month by saying, quite truthfully, that I
no longer remember how LR parsing works. I’m sure I could refresh my memory
fairly quickly if I really needed to, but I have not found my forgetfulness to
be a problem because fixing these sorts of problems rarely
requires understanding the parsing