realise were at best buried, and at worst inadequately explained, in that
previous post.
Summarising the argument
In essence I believe Tiark is making two points:
- Making parsing a major component of teaching compilers is misguided.
- Context-free grammars, and their associated parsing techniques, don’t align
well with real-world compilers, and thus we should deemphasise CFGs (Context-Free
Grammars) and their associated parsing algorithms.
I agree with the first point. Learning has to involve prioritisation: we can’t
learn the underlying theory of everything. Parsing seems to me to be
one of those topics where a fairly superficial understanding of the underlying
theory is sufficient for most people to use it effectively. Personally I would
rather spend the time that a deep understanding of the underlying theory of parsing
requires elsewhere: there are many other parts of compilers that better reward
a deeper understanding .
However, I disagree with the second point — or, at least, I disagree
with part of it. It is true that many real-world languages’ syntaxes are not compatible
with LR parsing (and the like) and must be parsed using
recursive descent parsing.
Without wishing to put words in his mouth, I suspect that Tiark sees this as a virtue of
recursive descent parsing.
In contrast, I have come, over time, to see
languages that require recursive descent parsing as reflecting
a flaw in our approach to language design .
I believe it has led us to design languages
that are not just hard to parse, that don’t just tend to have buggy parsers,
but which are harder than necessary to understand.
Differentiating LR and recursive descent parsing
Let’s start with some basic definitions. A grammar is a
specification of the valid sentences in a language. Note that grammars need not
be executable — indeed, they may be specified formally or informally,
explicitly or implicitly. A parser is an implementation of a grammar, which
checks whether an input is valid with respect to that grammar or not. Grammars
are ambiguous (e.g. a grammar such as E: E "+" E | E "-"
is ambiguous because
E2+3*4
can be parsed as equivalent to
either (2+3)*4
or 2+(3*4)
with it) or unambiguous.
For example, all of us share a common understanding of the grammar for basic
arithmetic expressions. We can then implement an LR or a recursive descent
parser which takes arithmetic expressions as inputs and checks whether they
conform to the arithmetic grammar.
In practise, grammars and parsers are often closer in nature than their
abstract definitions suggest. In one direction, we can automatically derive
parsers from some grammars (e.g. those written to conform to the LR subset of
CFGs). In the other direction, a parser implicitly defines a grammar . There is
a subtle corollary to this: if I manually create a parser from the
specification of a grammar, I may introduce bugs, which means that the parser’s
implicit grammar may not quite match the explicit grammar I thought I was
implementing!
There are various ways of implementing parsers, but for me there are
generally only two which we really need to consider: LR parsing and recursive
descent parsing.
LR parsing defines a subset of CFGs