… in which we look at one or two ways to make life easier
when working with Python regular expressions.
tl;dr: You can compose verbose regular expressions using f-strings.
Here’s a real-world example – instead of this:
|
… do this:
|
|
For comparison, the same pattern without f-strings:
|
|
It’s better than the non-verbose one,
but even with careful formatting and comments,
the repetition makes it pretty hard to follow
– and wait until you have to change something!
Read on for details and some caveats.
Prerequisites
Formatted string literals (f-strings) were added in Python 3.6,
and provide a way to embed expressions inside string literals,
using a syntax similar to that of str.format():
>>> name = "world"
>>>
>>> "Hello, {name}!".format(name=name)
'Hello, world!'
>>>
>>> f"Hello, {name}!"
'Hello, world!'
Verbose regular expressions (re.VERBOSE) have been around since forever,
and allow writing regular expressions
with non-significant whitespace and comments:
>>> import re
>>>
>>> text = "H1 code (AH2b+EUH3) fancy code"
>>>
>>> code = r"[A-Z]*H\d+[a-z]*"
>>> re.findall(code, text)
['H1', 'AH2b', 'EUH3']
>>>
>>> code = r"""
... [A-Z]*H  # prefix
... \d+      # digits
... [a-z]*   # suffix
... """
>>> re.findall(code, text, re.VERBOSE)
['H1', 'AH2b', 'EUH3']
The “one weird trick”
Once you see it, it’s obvious
– you can use f-strings to compose regular expressions:
>>> multicode = fr"""
... (?: \( )?         # maybe open paren
... {code}            # one code
... (?: \+ {code} )*  # maybe other codes, plus-separated
... (?: \) )?         # maybe close paren
... """
>>> re.findall(multicode, text, re.VERBOSE)
['H1', '(AH2b+EUH3)']
It’s so obvious, it only took me three years to do it
after I started using Python 3.6+,
despite using both features during all that time.
Of course, there’s any number of libraries
for building regular expressions;
the benefit of this is that it has zero dependencies,
and zero extra things you need to learn.
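To show how this scales past two levels, here is a minimal sketch that builds a pattern out of smaller named pieces. The key=value format and the names key, value, pair, pairs and PAIRS_RE are made up for illustration; they are not from the real-world example above.

```python
import re

# Hypothetical building blocks. Each one ends on its own line,
# so its trailing comment cannot leak into the enclosing pattern.
key = r"""
    [A-Za-z_]\w*           # identifier-like key
"""
value = r"""
    (?: \d+ | "[^"]*" )    # integer or double-quoted string
"""
pair = fr"""
    {key} \s* = \s* {value}    # key = value
"""
pairs = fr"""
    {pair}
    (?: \s* , \s* {pair} )*    # more pairs, comma-separated
"""

PAIRS_RE = re.compile(pairs, re.VERBOSE)

# A well-formed list matches in full; a doubled comma does not.
print(bool(PAIRS_RE.fullmatch('a=1, name = "x y", b2=30')))  # True
print(PAIRS_RE.fullmatch('a=1,,b=2'))                        # None
```

Changing what counts as a value now means editing one string, and every pattern built on top of it picks up the change.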
Caveats
Hashes and spaces need to be escaped
Because a hash is used to mark the start of a comment,
and spaces are mostly ignored,
you have to represent them in some other way.
The documentation of re.VERBOSE is quite helpful:
When a line contains a # that is not in a character class
and is not preceded by an unescaped backslash,
all characters from the leftmost such # through the end of the line are ignored.
That is, this won’t behave like the non-verbose version:
>>> re.findall(r"\d+#\d+", "1#23a")
['1#23']
>>> re.findall(r"\d+ # \d+", "1#23a", re.VERBOSE)
['1', '23']
… but these will:
>>> re.findall(r"\d+ [#] \d+", "1#23a", re.VERBOSE)
['1#23']
>>> re.findall(r"\d+ \# \d+", "1#23a", re.VERBOSE)
['1#23']
The same is true for spaces:
>>> re.findall(r"\d+ [ ] \d+", "1 23a", re.VERBOSE)
['1 23']
>>> re.findall(r"\d+ \  \d+", "1 23a", re.VERBOSE)
['1 23']
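If the text you want to match literally comes from a variable, escaping it by hand gets tedious. One option, sketched below, is to run it through re.escape() before interpolating it; re.escape() backslash-escapes hashes and whitespace along with the other special characters. The sample text and the names literal, escaped and pattern are made up for illustration.

```python
import re

# Hypothetical literal text containing both a hash and spaces.
literal = "issue #42 fixed"

# re.escape() puts a backslash in front of '#' and ' ',
# which is what a verbose pattern needs to treat them literally.
escaped = re.escape(literal)
print(escaped)  # issue\ \#42\ fixed

pattern = fr"""
    {escaped}    # the literal text, escaped
    \s* : \s*    # a colon, optionally surrounded by whitespace
    (\w+)        # capture the word that follows
"""

match = re.search(pattern, "issue #42 fixed: yes", re.VERBOSE)
print(match.group(1))  # yes
```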
When composing regexes,
ending a pattern on the same line as a comment
might accidentally comment the following line in the enclosing pattern:
>>> one = "1 # comment"
>>> onetwo = f"{one} 2"
>>> re.findall(onetwo, '0123', re.VERBOSE)
['1']
>>> print(onetwo)
1 # comment 2
This can be avoided by always ending the pattern on a new line:
>>> one = """
... 1 # comment
... """
>>> onetwo = f"""
... {one} 2
... """
>>> re.findall(onetwo, '0123', re.VERBOSE)
['12']
While a bit cumbersome,
in real life most patterns would span multiple lines anyway,
so it’s not really an issue.
(Note that this is only needed if you use comments.)
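If some of the patterns you compose come from elsewhere and you can't be sure they end on a new line, a tiny helper can enforce it. This is only a sketch; closed() is a hypothetical name, not something the re module provides.

```python
import re


def closed(pattern):
    """Return the pattern, guaranteed to end with a newline (hypothetical helper).

    The trailing newline terminates any comment on the last line, so it
    cannot swallow what follows in an enclosing verbose pattern.
    """
    return pattern if pattern.endswith("\n") else pattern + "\n"


one = "1 # comment"
onetwo = f"{closed(one)} 2"

print(re.findall(onetwo, "0123", re.VERBOSE))  # ['12']
```

The helper only makes sense for patterns used with re.VERBOSE; in a non-verbose pattern the added newline would be matched literally.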
Brace quantifiers need to be escaped
Because f-strings already use braces for replacements,
to represent brace quantifiers you must double the braces:
>>> re.findall("m{2}", "entire mm but only two of mmm")
['mm', 'mm']
>>> letter = "m"
>>> pattern = f"{letter}{{2}}"
>>> re.findall(pattern, "entire mm but only two of mmm")
['mm', 'mm']
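The doubling works the same way in a verbose fr-string. Here is a small sketch with a made-up date pattern; the names digit and date are chosen for illustration.

```python
import re

digit = r"\d"

# Doubled braces become single braces after f-string formatting,
# so {digit}{{4}} ends up as \d{4} in the final pattern.
date = fr"""
    {digit}{{4}}    # year
    -
    {digit}{{2}}    # month
    -
    {digit}{{2}}    # day
"""

text = "released 2020-06-15, updated 2021-01-02"
print(re.findall(date, text, re.VERBOSE))  # ['2020-06-15', '2021-01-02']
```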
I don’t control the flags
Maybe you’d like to use verbose regexes,
but don’t control the flags passed to the re functions
(for example, because you’re passing the regex to an API).
Worry not! The regular ex