A grammar is left recursive if it comes in the following form
[
langle S rangle ::= langle S rangle a mid b
]
When parsing or generating a string using this grammar (using a backtracking algorithm),
where $ langle S rangle $ is the start symbol, you
are likely to enter a state of infinite recursion. Any ideas on how to fix this?
[
langle S rangle ::= b mid langle S rangle a
]
Rewriting it in above form won’t change a thing, except the fact that it’ll try to read or
generate the terminal $b$ first. How do we fix this then? The root of our problem is
left recursion, so if we can somehow make it right recursive then it’ll fix our issue.
Before reading any further, give it a try yourself.
—
title: example string “aaa”
—
stateDiagram-v2
[*] –> S
S –> b
S –> Sa
Sa –> S
b –> [*]
For given example string $text{aaa}$ the parser will never terminate. Not even to inform
us that the string does not belong to language generated by corresponding grammar.
Language Analysis
Let’s start by generating some sample strings in the given language, and try to discover
a pattern. The following are expansion paths for $ langle S rangle $ :
[
begin{align}
L(S) &= b \
L(S) &= langle S rangle a rightarrow
langle S rangle a a rightarrow
langle S rangle a a … a rightarrow
b a a … a
end{align}
]
So we have two possible expansions $b$ or $b a^*$. The symbol $*$ means as many repetitions
of anything that comes before it. It can be nothing, like an empty string, or it can be
something. So, this lets us know that a left recursive grammar always starts with $b$,
and then you generate or match as many $a$ as possible. The interesting thing about $a^*$
is that it can be generated with right recursion as well, with a rule like :
[
langle N rangle = a langle N rangle mid epsilon
]
stateDiagram-v2
[*] –> N
N –> aN : prepend a
aN –> N : expand N
N –> [*]
Now this can also be generated with a left recursion as well, but we’re doing all this
just to avoid it, so we only care about right recursion. Now combining all the information
we’ve gather up until now, we can
[
begin{align}
langle S rangle &::= b langle S’ rangle \
langle S’ rangle &::= a langle S’ rangle mid a
end{align}
]
This grammar as you can see is completely right recursive.
—
title: example string “aaa”
—
stateDiagram-v2
[*] –> S
S –> bS’
S’ –> aS’
S’ –> a
bS’ –> S’
aS’ –> S’
a –> [*]
This new machine will quickly terminate when it notices that the example string $ text{aaa} $ does not start
with $b$ like in our language.
Real Life Example
So, I’ve been (re)writing a C++ demangler for RizinOrg’s rz-libdemangle
which was previously licensed to GPL, and I’m re-writing it to relicense it to LGPL v3, which is more
permissive for commercial usage. You are allowed to link to precompiled binaries licensed with LGPLv3.
While re-writing grammar for Itanium ABI, I encountered
a left recursive grammar.
[
begin{align}
langle text{prefix} rangle &::= langle text{unqualified-name} rangle \
&quad mid langle text{prefix} rangle langle text{unqualified-name} rangle \
&quad mid langle text{template-prefix} rangle langle text{template-args} rangle \
&quad mid langle text{closure-prefix} rangle \
&quad mid langle text{template-param} rangle \
&quad mid langle text{decltype} rangle \
&quad mid langle text{substitution} rangle
end{align}
]
Notice how production number $6$ is left recursive in nature. Wanna know how I fixed it in code?
You’ll have to follow this post a bit more. Real life examples are bit more complex than theory.
Life happend here as well, and made some things not-so-straightforward.
A mutual left recursion happens when two rules mutually call each other by expanding
first non terminal as other rule. It generally looks like this :
[
begin{align}
langle S rangle &::= langle A rangle mid langle B rangle \
langle A rangle &::= langle B rangle k mid X \
langle B rangle &::= langle A rangle m mid Y
end{align}
]
Notice how production $13$ and $14$ call each other, making it a mutual recursion.
Using a similar analysis as done for simple left recursion, we can remove left recursion
here as well, by making each rule right recursion only.
stateDiagram-v2
[*] –> S
S –> A
S –> B
A –> Bk
Bk –> B
B –> Am
Am –> A
A –> X
B –> Y
X –> [*]
Y –> [*]
You can visually see the mutual recursion in path $ textbf{A} rightarrow text{Bk} rightarrow textbf{B} rightarrow text{Am} rightarrow textbf{A} $,
wherein $A$ ends up calling $B$ which eventually calls $A$.
Language Analysis
If you try to follow the pattern, then you’ll get the following possible languages
[
begin{align}
L(A) &= X (mk)^* \
L(A) &= Y k(mk)^* \
L(B) &= Y (km)^* \
L(B) &= X m(km)^*
end{align}
]
Noticing this pattern again, I can clearly see the right recursion this time again,
so, I’ll re-write the mutually left recursive grammar as follows
[
begin{align}
langle S rangle & ::= & langle S1 rangle mid langle S2 rangle \
langle S1 rangle & ::= & langle A rangle mid langle A rangle langle R1 rangle \
langle S2 rangle & ::= & langle B rangle mid langle B rangle langle R2 rangle \
langle A rangle & ::= & X mid Y k \
langle B rangle & ::= & Y mid X m \
langle R1 rangle & ::= & m k langle R1 rangle mid m k \
langle R2 rangle & ::= & k m langle R2 rangle mid k m
end{align}
]
Again, notice that since the only recursive productions $20$ and $21$ are right
recursive only, the whole grammar is right recursive, and at the same time,
the grammar is no more mutually left recursive.
stateDiagram-v2
[*] –> S
S –> S1
S –> S2
S1 –> A
S1 –> AR1
AR1 –> R1 : expand R1
R1 –> mkR1
mkR1 –> R1
S2 –> B
S2 –> BR2
BR2 –> R2 : expand R2
R2 –> kmR2
kmR2 –> R2
A –> X
A –> Yk
B –> Y
B –> Xm
X –> [*]
Y –> [*]
Yk –> [*]
Xm –> [*]
R1 –> mk
R2 –> km
mk –> [*]
km –> [*]
Real Life Example
So, while (re)writing the demangler, I came across some rules that are mutually
recursive. Take a look at the following grammar, and comment what you see :
[
begin{align}
langle text{prefix} rangle &::= langle text{unqualified-name} rangle \
&quad mid langle text{prefix} rangle langle text{unqualified-name} rangle \
&quad mid langle text{template-prefix} rangle langle text{template-args} rangle \
&quad mid langle text{closure-prefix} rangle \
&quad mid langle text{template-param} rangle \
&quad mid langle text{decltype} rangle \
&quad mid langle text{substitution} rangle
end{align}
]
[
begin{align}
langle text{template-prefix} rangle &::= langle text{template unqualified-name} rangle \
&quad mid langle text{prefix} rangle langle text{template unqualified-name} rangle \
&quad mid langle text{template-param} rangle \
&quad mid langle text{substitution} rangle
end{align}
]
[
begin{align}
langle text{closure-prefix} rangle &::= [ langle text{prefix} rangle ] langle text{variable or member unqualified-name} rangle M \
&quad mid langle text{variable template template-prefix} rangle langle text{template-args} rangle M
end{align}
]
Did you notice it? $langle text{prefix} rangle$ is mutually left recursive with $langle text{template-prefix} rangle$
and $langle text{closure-prefix} rangle$ at the same time, and this is where things get a bit complicated.
The solution is not hard, it’s really simple though. I’ll have to show my original solution though.
For making things easy while writing the demangler, and for easy implementation of grammar rules
and productions, I’ve devised some macros that make it look like I’m using a DSL to write the demangler.
So, to help you understand, I’ll just show you the before and after code, and leave you to diff these out
by yourself to understand. You can see the complete source co
6 Comments
mrkeen
Imo this missed the punchline.
I banged my head on the table for days trying to fix a left-recursive grammar, just by moving bits around and hoping.
The key is precedence! Parse loose expressions first (plus, minus, etc), these loose expressions can in turn call into tighter expressions (mul, div), and so on down to parentheses.
Parentheses do form a loop – that is, they call all the way back to the top – but at long as the parser makes progress i.e. can only proceed if it moves past a '(', no infinite loops and you're golden.
tempodox
An interesting and worthwhile topic. This kind of problem pops up regularly.
Those diagrams are really handy, does anyone know how they were made? I'm always looking for alternatives / companions to the Railroad Diagram Generator (https://rr.red-dove.com/ui).
earleybird
Use an Earley parser. A properly implemented Earley parser will do unambiguous grammars with left/right/mutual recursion in linear time. The worst case ambiguous I've worked with is UBDA (Earley, 1970). UBDA produces the equivalent of all the permutations of a binary tree – which is Catalan(n). If you generate all the combinations as you parse it won't complete before the heat death of the universe for fairly short strings.
briffid
An example of an actual grammar, like arithmetic expressions, would be helpful.
fjfaase
It is possible to write a (top-down) parser that can deal with the left recursion without having to rewrite the grammar itself. Simply split the left recursive ones and not. Then first try to accept a non-left recursive rule and if that succeeds, use the result to try to parse any of the left recursive ones. (If I am not mistaken, this also works for mutual recursion of left recursive rules.) I have implemented this several times.
hyperhello
Left recursive BNF was an interest of mine. For a much less technically oriented solution you can check out https://hypervariety.com/BNFToLPEG/