=encoding utf-8
Apocalypse 5: Pattern Matching
Larry Wall <[email protected]>
Maintainer: Larry Wall <[email protected]> Date: 4 Jun 2002 Last Modified: 18 May 2006 Number: 5 Version: 7
This is the Apocalypse on Pattern Matching, generally having to do with
what we call “regular expressions”, which are only marginally related
to real regular expressions. Nevertheless, the term has grown with the
capabilities of our pattern matching engines, so I’m not going to try
to fight linguistic necessity here. I will, however, generally call
them “regexes” (or “regexen”, when I’m in an Anglo-Saxon mood).
Here are the RFCs covered in this Apocalypse. PSA stands for “problem,
solution, acceptance”, my private rating of how this RFC will fit into
Perl 6. Doubtless I have misclassified your RFC, though the other
ratings are pretty accurate. :-)
RFC PSA Title --- --- ----- 072 aaa Variable-length lookbehind. 093 abb Regex: Support for incremental pattern matching 110 bbb counting matches 112 acc Assignment within a regex 135 acr Require explicit m on matches, even with ?? and // as delimiters. 144 aaa Behavior of empty regex should be simple 145 acr Brace-matching for Perl Regular Expressions 150 acc Extend regex syntax to provide for return of a hash of matched subpatterns 156 aaa Replace first match function (C...?>) with a flag to the match command. 164 ccr Replace =~, !~, m//, s///, and tr// with match(), subst(), and trade() 165 acc Allow Variables in tr/// 166 abc Alternative lists and quoting of things 191 bbc smart container slicing 197 cdr Numeric Value Ranges In Regular Expressions 198 adr Boolean Regexes 261 dbr Pattern matching on perl values 274 acc Generalised Additions to Regexs 276 aaa Localising Paren Counts in qr()s. 308 dar Ban Perl hooks into regexes 316 bcr Regex modifier for support of chunk processing and prefix matching 317 aaa Access to optimisation information for regular expressions 331 acc Consolidate the $1 and 1 notations 332 abc Regex: Make /$/ equivalent to /z/ under the '/s' modifier 348 bcc Regex assertions in plain Perl code 360 acb Allow multiply matched groups in regexes to return a listref of all matches 361 abb Simplifying split()
Interestingly, there were no withdrawn RFCs for pattern matching. That
means either that there were no cork-brained ideas proposed, or that
regex culture is so cork-brained already that the cork-brained ideas
blend right in. I know where my money is… :-)
In fact, regular expression culture is a mess, and I share some of the
blame for making it that way. Since my mother always told me to clean
up my own messes, I suppose I’ll have to do just that.
For prior Apocalypses, I’ve used the RFCs as a springboard for
discussion of my thinking, but this one is special, because none of the
RFCs were courageous enough (or foolhardy enough) to look at the big
picture and propose radical change where it’s needed. But Perl has
often been tagged as a language in which it’s easy to write programs
that are difficult to read, and it’s no secret that regular expression
syntax that has been the chief culprit. Funny that other languages have
been borrowing Perl’s regular expressions as fast as they can…
That’s primarily because we took several large steps in Perl 5 to
enhance regex capabilities. We took one large step forwards with the
/x
option, which allowed whitespace between regex tokens. But we
also took several large steps sideways with the (?...)
extension
syntax. I call them steps sideways, but they were simultaneously steps
forward in terms of functionality and steps backwards in terms of
readability. At the time, I rationalized it all in the name of backward
compatibility, and perhaps that approach was correct for that time and
place. It’s not correct now, since the Perl 6 approach is to break
everything that needs breaking all at once.
And unfortunately, there’s a lot of regex culture that needs breaking.
Regex culture has gone wrong in a variety of ways, but it’s not my
intent to assign blame–there’s plenty of blame to go around, and
plenty of things that have gone wrong that are nobody’s fault in
particular. For example, it’s nobody’s fault that you can’t
realistically complement a character set anymore. It’s just an accident
of the way Unicode defines combining characters. The whole notion of
character classes is mutating, and that will have some bearing on the
future of regular expression syntax.
Given all this, I need to warn you that this Apocalypse is going to be
somewhat radical. We’ll be proposing changes to certain “sacred”
features of regex culture, and this is guaranteed to result in future
shock for some of our more conservative citizens. Do not be alarmed. We
will provide ways for you to continue programming in old-fashioned
regular expressions if you desire. But I hope that once you’ve thought
about it a little and worked through some examples, you’ll like most of
the changes we’re proposing here.
So although the RFCs did contribute greatly to my thinking for this
Apocalypse, I’m going to present my own vision first for where regex
culture should go, and then analyze the RFCs with respect to that
vision.
First, let me enumerate some of the things that are wrong with current
regex culture.
- Too much history
- Too compact and “cute”
- Poor Huffman coding
- Too much reliance on too few metacharacters
- Different things look too similar
- Poor end-weight design
- Too much reliance on modifiers
- Too many special rules and boobytraps
- Backreferences not useful enough
- Too hard to match a literal string
- Two-level interpretation is problematic
- Too little abstraction
- Little support for named captures
- Difficult to use nested patterns
- Little support for grammars
- Inability to define variants
- Poor integration with “real” language
- Missing backtracking controls
- Difficult to define assertions
I’m sure there are other problems, but that’ll do for starters. Let’s
look at each of these in more detail.
Too much history
Most of the other problems stem from trying to deal with a rich
history. Now there’s nothing wrong with history per se, but those of us
who are doomed to repeat it find that many parts of history are
suboptimal and contradictory. Perl has always tried to err on the side
of incorporating as much history as possible, and sometimes Perl has
succeeded in that endeavor.
Cultural continuity has much to be said for it, but what can you do
when the culture you’re trying to be continuous with is itself
discontinuous? As it says in Ecclesiastes, there’s a time to build up,
and a time to tear down. The first five versions of Perl mostly built
up without tearing down, so now we’re trying to redress that omission.
Too compact and “cute”
Regular expressions were invented by computational linguists who love
to write examples like /aa*b*(cd)*ee/
. While these are conducive to
reasoning about pattern matching in the abstract, they aren’t so good
for pattern matching in the concrete. In real life, most atoms are
longer than “a
” or “b
“. In real life, tokens are more
recognizable if they are separated by whitespace. In the abstract,
/a+/
is reducible to /aa*/
. In real life, nobody wants to repeat
a 15 character token merely to satisfy somebody’s idea of theoretical
purity. So we have shortcuts like the +
quantifier to say “one or
more”.
Now, you may rightly point out that +
is something we already have,
and we already introduced /x
to allow whitespace, so why is this
bullet point here? Well, there’s a lot of inertia in culture, and the
problem with /x
is that it’s not the default, so people don’t think
to turn it on when it would probably do a lot of good. The culture is
biased in the wrong direction. Whitespace around tokens should be the
norm, not the exception. It should be acceptable to use whitespace to
separate tokens that could be confused. It should not be considered
acceptable to define new constructs that contain a plethora of
punctuation, but we’ve become accustomed to constructs like
(?<=...)
and (??{...})
and [rnckp{Zl}p{Zp}]
, so we
don't complain. We're frogs who are getting boiled in a pot full of
single-character morphemes, and we don't notice.
Poor Huffman coding
Huffman invented a method of data compaction in which common characters
are represented by a small number of bits, and rarer characters are
represented by more bits. The principle is more general, however, and
language designers would do well to pay attention to the "other" Perl
slogan: Easy things should be easy, and hard things should be possible.
However, we haven't always taken our own advice. Consider those two
regex constructs we just saw:
(?<=...) (??{...})
Which one do you think is likely to be the most common in everyday use?
Guess which one is longer...
There are many examples of poor Huffman coding in current regexes.
Consider these:
(...) (?:...)
Is it really the case that grouping is rarer than capturing? And by two
gobbledygooky character's worth? Likewise there are many constructs
that are the same length that shouldn't be:
(?:...) (?#...)
Grouping is much more important than the ability to embed a comment.
Yet they're the same length currently.
Too much reliance on too few metacharacters
A lot of our Huffman troubles came about because we were trying to
shoehorn new capabilities into an old syntax without breaking anything.
The (?...)
construct succeeded at that goal, but it was new wine in
old wineskins, as they say. More successful was the *?
minimal
matching hack, but it's still symptomatic of the problem that we only
had three characters to choose from that would have worked at that
point in the grammar. We've pretty nearly exhausted the available
backslash sequences.
The waterbed theory of linguistic complexity says that if you push down
one place, it goes up somewhere else. If you arbitrarily limit yourself
to too few metacharacters, the complexity comes out somewhere else. So
it seems obvious to me that the way out of this mess is to grab a few
more metacharacters. And the metacharacters I want to grab are...well,
we'll see in a moment.
Different things look too similar
Consider these constructs:
(??{...}) (?{...}) (?#...) (?:...) (?i:...) (?=...) (?!...) (?<=...) (?...) (?(...)...|...)
These all look quite similar, but some of them do radically different
things. In particular, the (?<...)
does not mean the opposite of
the (?>...)
. The underlying visual problem is the overuse of
parentheses, as in Lisp. Programs are more readable if different things
look different.
Poor end-weight design
In linguistics, the notion of end-weight is the idea that people tend
to prefer sentences where the short things come first and the long
things come last. That minimizes the amount of stuff you have to
remember while you're reading or listening. Perl violates this with
regex modifiers. It's okay when you say something short like this:
s/foo/bar/g
But when you say something like we find in RFC 360:
while ($text =~ /name:s*(.*?)ns* children:s*(?:(?@S+)[, ]*)*ns* favorite colors:s*(?:(?@S+)[, ]*)*n/sigx) {...}
it's not until you read the /sigx
at the end that you know how to
read the regex. This actually causes problems for the Perl 5 parser,
which has to defer parsing the regular expression till it sees the
/x
, because that changes how whitespace and comments work.
Too much reliance on modifiers
The /s
modifier in the previous example changes the meaning of the
.
metacharacter. We could, in fact, do away with the /s
modifier
entirely if we only had two different representations for "any
character", one of which matched a newline, and one which didn't. A
similar argument applies to the /m
modifier. The whole notion of
something outside the regex changing the meaning of the regex is just a
bit bogus, not because we're afraid of context sensitivity, but because
we need to have better control within the regex of what we mean, and in
this case the context supplied outside the regex is not precise enough.
(Perl 5 has a way to control the inner contexts, but it uses the
self-obfuscating (?...)
notation.)
Modifiers that control how the regex is used as a whole do make some
sense outside the regex. But they still have the end-weight problem.
Too many special rules and boobytraps
Without knowing the context, you cannot know what the pattern //
will do. It might match a null string, or it might match the previously
successful match.
The local
operator behaves differently inside regular expressions
than it does outside.
It's too easy to write a null pattern accidentally. For instance, the
following will never match anything but the null string:
/ | foo | bar | baz /x
Even when it's intentional, it may not look intentional:
(a|b|c|)
That's hard to read because it's difficult to make the absence of
something visible.
It's too easy to confuse the multiple meanings of dot. Or the multiple
meanings of ^
, and $
. And the opposite of A
is frequently not
Z
, but z
. Tell me again, when do I say 1
, and when do I say
$1
? Why are they different?
Backreferences not useful enough
Speaking of 1
, backreferences have a number of shortcomings. The
first is actually getting ahold of the right backreference. Since
captures are numbered from the beginning, you have to count, and you
can easily count wrong. For many purposes it would be better if you
could ask for the last capture, or the one before that. Or perhaps if
there were a way to restart the numbering part way through...
Another major problem with backreferences is that you can't easily
modify one to search for a variant. Suppose you match an opening
parenthesis, bracket, or curly. You'll like to search for everything up
to the corresponding closing parenthesis, bracket, or curly, but
there's no way to transmogrify the opening version to the closing
version, because the backref search is hardwired independently of
ordinary variable matching. And that's because Perl doesn't instantiate
$1
soon enough. And that's because Perl relies on variable
interpolation to get subexpressions into regexes. Which leads us to...
Too hard to match a literal string
Since regexes undergo an interpolation pass before they're compiled,
anything you interpolate is forced to be treated as a regular
expression. Often that's not what you want, so we have the klunky
Q$stringE
mechanism to hide regex metacharacters. And that's
because...
Two-level interpretation is problematic
The problem with Q$stringE
arises because of the fundamental
mistake of using interpolation to build regexes instead of letting the
regex control how it treats the variables it references. Regexes aren't
strings, they're programs. Or, rather, they're strings only in the
sense that any piece of program is a string. Just as you have to work
to eval a string as a program, you should have to work to eval a string
as a regular expression. Most people tend to expect a variable in a
regular expression to match its contents literally. Perl violates that
expectation. And because it violates that expectation, we can't make
$1
synonymous with 1
. And interpolated parentheses throw off the
capture count, so you can't easily use interpolation to call subrules,
so we invented (??{$var})
to get around that. But then you can't
actually get at the parentheses captured by the subrule. The
ramifications go on and on.
Too little abstraction
Historically, regular expressions were considered a very low-level
language, a kind of glorified assembly language for the regex engine.
When you're only dealing with ASCII, there is little need for
abstraction, since the shortest way to say [a-z]
is just that. With
the advent of the eighth bit, we started getting into a little bit of
trouble, and POSIX started thinking about names like [:alpha:]
to
deal with locale difficulties. But as with the problem of conciseness,
the culture was still biased away from naming abstractly anything that
could be expressed concretely.
However, it's almost impossible to write a parser without naming
things, because you have to be able to name the separate grammar rules
so that the various rules can refer to each other.
It's difficult to deal with any subset of Unicode without naming it.
These days, if you see [a-z]
in a program, it's probably an outright
bug. It's much better to use a named character property so that your
program will work right in areas that don't just use ASCII.
Even where we do allow names, it tends to be awkward because of the
cultural bias against it. To call a subrule by name in Perl 5 you have
to say this:
(??{$rule})
That has 4 or 5 more characters than it ought to. Dearth of abstraction
produces bad Huffman coding.
Little support for named captures
Make that "no support" in Perl, unless you include assignment to a
list. This is just a part of the bias against naming things. Instead we
are forced to number our capturing parens and count. That works okay
for the top-level regular expression, when we can do list assignment or
assign $1
to $foo
. But it breaks down as soon as you start trying
to use nested regexes. It also breaks down when the capturing
parentheses match more than once. Perl handles this currently by
returning only the last match. This is slightly better than useless,
but not by much.
Difficult to use nested patterns
For many of the reasons we've mentioned, it's difficult to make regexes
refer to each other, and even if you do, it's almost impossible to get
the nested information back out of them. And there are entire classes
of parsing problems that are not solvable without recursive
definitions.
Little support for grammars
Even if it were easier for regexes to refer to other regexes, we'd
still have the problem that those other regexes aren't organized in any
meaningful way. They might be off in variables that come and go at the
whim of the surrounding context.
When we have an organized system of parsing rules, we call it a
grammar. One advantage of having a grammar is that you can optimize
based on the assumption that the rules maintain their relationship to
each other. For instance, if you think of grammar rules as a funny kind
of subroutine, you can write an optimizer to inline some of the
subrules--but only if you know the subrule is fixed in the grammar.
Without support for grammar classes, there's no decent way to think of
deriving one grammar from another. And if you can't derive one grammar
from another, you can't easily evolve your language to handle new kinds
of problems.
Inability to define variants
If we want to have variant grammars for Perl dialects, then what about
regex dialects? Can regexes be extended either at compile time or at
run time? Perl 5 has some rudimentary overloading magic for rewriting
regex strings, but that's got the same problems as source filters for
Perl code; namely that you just get the raw regex source text and have
to parse it yourself. Once again the fundamental assumption is that a
regex is a funny kind of string, existing only at the behest of the
surrounding program.
Do we think of regexes as a real, living language?
Poor integration with rich languages
Let's face it, in the culture of computing, regex languages are mostly
considered second-class citizens, or worse. "Real" languages like C and
C++ will exploit regexes, but only through a strict policy of
apartheid. Regular expressions are our servants or slaves; we tell them
what to do, they go and do it, and then they come back to say whether
they succeeded or not.
At the other extreme, we have languages like Prolog or Snobol where the
pattern matching is built into the very control structure of the
language. These languages don't succeed in the long run because
thinking about that kind of control structure is rather difficult in
actual fact, and one gets tired of doing it constantly. The path to
freedom is not to make everyone a slave.
However, I would like to think that there is some happy medium between
those two extremes. Coming from a C background, Perl has historically
treated regexes as servants. True, Perl has treated them as trusted
servants, letting them move about in Perl society better than any other
C-like language to date. Nevertheless, if we emancipate regexes to
serve as co-equal control structures, and if we can rid ourselves of
the regexist attitudes that many of us secretly harbor, we'll have a
much more productive society than we currently do. We need to empower
regexes with a sense of control (structure). It needs to be just as
easy for a regex to call Perl code as it is for Perl code to call a
regex.
Missing backtracking controls
Perl 5 started to give regexes more control of their own destiny with
the "grab" construct, (?>...)
, which tells the regex engine that
when it fails to match the rest of the pattern, it should not backtrack
into the innards of the grab, but skip back to before it. That's a
useful notion, but there are problems. First, the notation sucks, but
you knew that already. Second, it doesn't go far enough. There's no way
to backtrack out of just the current grouping. There's no way to
backtrack out of just the current rule. Both of these are crucial for
giving first-class status to the control flow of regexes.
Difficult to define assertions
Notionally, a regex is an organization of assertions that either
succeed or fail. Some assertions are easily expressed in traditional
regex language, while others are more easily expressed in a procedural
language like Perl.
The natural (but wrong) solution is to try to reinvent Perl expressions
within regex language. So, for instance, I'm rejecting those RFCs that
propose special assertion syntax for numerics or booleans. The better
solution is to make it easier to embed Perl assertions within regexes.
I've just made a ton of negative assertions about the current state of
regex culture. Now I'd like you to perform a cool mental trick and turn
all those negatives assertions into positive assertions about what I'm
going to say, because I'm not intending to give the rationale again,
but just present the design as it stands. Damian will discuss an
extended example in his Exegesis 5, which will show the big picture of
how these various features work together to produce a much more
readable whole.
So anyway, here's what's new.
Metacharacter Reform
Some things stay the same: (...)
captures text just as it did
before, and the quantifiers *
, +
, and ?
are also unchanged.
The vertical bar |
still separates alternatives. The backslash
still protects the following character from its ordinary
interpretation. The ?
suffix character still does mini