The unreasonable effectiveness of f-strings and re.VERBOSE by genericlemon24

… in which we look at one or two ways to make life easier
when working with Python regular expressions.

tl;dr: You can compose verbose regular expressions using f-strings.

Here’s a real-world example – instead of this:

pattern = r"((?:\(\s*)?[A-Z]*H\d+[a-z]*(?:\s*\+\s*[A-Z]*H\d+[a-z]*)*(?:\s*[\):+])?)(.*?)(?=(?:\(\s*)?[A-Z]*H\d+[a-z]*(?:\s*\+\s*[A-Z]*H\d+[a-z]*)*(?:\s*[\):+])?(?![^\w\s])|$)"

… do this:

code = r"""
[A-Z]*H  # prefix
\d+      # digits
[a-z]*   # suffix
"""

multicode = fr"""
(?: \( \s* )?               # maybe open paren and maybe space
{code}                      # one code
(?: \s* \+ \s* {code} )*    # maybe followed by other codes, plus-separated
(?: \s* [\):+] )?           # maybe space and maybe close paren or colon or plus
"""

pattern = fr"""
( {multicode} )             # code (capture)
( .*? )                     # message (capture): everything ...
(?=                         # ... up to (but excluding) ...
    {multicode}             # ... the next code
        (?! [^\w\s] )       # (but not when followed by punctuation)
    | $                     # ... or the end
)
"""

For comparison, the same pattern without f-strings:

pattern = r"""
(                       # code (capture)
    # BEGIN multicode

    (?: \( \s* )?       # maybe open paren and maybe space

    # code
    [A-Z]*H  # prefix
    \d+      # digits
    [a-z]*   # suffix

    (?:                 # maybe followed by other codes,
        \s* \+ \s*      # ... plus-separated

        # code
        [A-Z]*H  # prefix
        \d+      # digits
        [a-z]*   # suffix
    )*

    (?: \s* [\):+] )?   # maybe space and maybe close paren or colon or plus

    # END multicode
)

( .*? )                 # message (capture): everything ...

(?=                     # ... up to (but excluding) ...
    # ... the next code

    # BEGIN multicode

    (?: \( \s* )?       # maybe open paren and maybe space

    # code
    [A-Z]*H  # prefix
    \d+      # digits
    [a-z]*   # suffix

    (?:                 # maybe followed by other codes,
        \s* \+ \s*      # ... plus-separated

        # code
        [A-Z]*H  # prefix
        \d+      # digits
        [a-z]*   # suffix
    )*

    (?: \s* [\):+] )?   # maybe space and maybe close paren or colon or plus

    # END multicode

        # (but not when followed by punctuation)
        (?! [^\w\s] )

    # ... or the end
    | $
)

"""

It’s better than the non-verbose one,
but even with careful formatting and comments,
the repetition makes it pretty hard to follow
– and wait until you have to change something!

Read on for details and some caveats.

Prerequisites

Formatted string literals (f-strings) were added in Python 3.6,
and provide a way to embed expressions inside string literals,
using a syntax similar to that of str.format():

>>> name = "world"
>>>
>>> "Hello, {name}!".format(name=name)
'Hello, world!'
>>>
>>> f"Hello, {name}!"
'Hello, world!'

Verbose regular expressions (re.VERBOSE) have been around since forever,
and allow writing regular expressions
with non-significant whitespace and comments:

>>> import re
>>>
>>> text = "H1 code (AH2b+EUH3) fancy code"
>>>
>>> code = r"[A-Z]*H\d+[a-z]*"
>>> re.findall(code, text)
['H1', 'AH2b', 'EUH3']
>>>
>>> code = r"""
... [A-Z]*H  # prefix
... \d+      # digits
... [a-z]*   # suffix
... """
>>> re.findall(code, text, re.VERBOSE)
['H1', 'AH2b', 'EUH3']

The “one weird trick”

Once you see it, it’s obvious
– you can use f-strings to compose regular expressions:

>>> multicode = fr"""
... (?: \( )?         # maybe open paren
... {code}            # one code
... (?: \+ {code} )*  # maybe other codes, plus-separated
... (?: \) )?         # maybe close paren
... """
>>> re.findall(multicode, text, re.VERBOSE)
['H1', '(AH2b+EUH3)']

It’s so obvious, it only took me three years to do it
after I started using Python 3.6+,
despite using both features during all that time.

Of course, there’s any number of libraries
for building regular expressions;
the benefit of this is that it has zero dependencies,
and zero extra things you need to learn.
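
To make the pay-off concrete, here is a minimal sketch that runs the composed pattern from the introduction against a made-up line of text. The sample string and the names compiled and results are assumptions for illustration only; real input will likely need more testing than this.

```python
import re

code = r"""
[A-Z]*H  # prefix
\d+      # digits
[a-z]*   # suffix
"""

multicode = fr"""
(?: \( \s* )?               # maybe open paren and maybe space
{code}                      # one code
(?: \s* \+ \s* {code} )*    # maybe followed by other codes, plus-separated
(?: \s* [\):+] )?           # maybe space and maybe close paren or colon or plus
"""

pattern = fr"""
( {multicode} )             # code (capture)
( .*? )                     # message (capture): everything ...
(?=                         # ... up to (but excluding) ...
    {multicode}             # ... the next code
        (?! [^\w\s] )       # (but not when followed by punctuation)
    | $                     # ... or the end
)
"""

# Hypothetical input, made up for this sketch.
sample = "H1 first message (H2+H3) second message"

# Compile once with re.VERBOSE, then collect (code, message) pairs,
# stripping the whitespace around each message.
compiled = re.compile(pattern, re.VERBOSE)
results = [(m.group(1), m.group(2).strip()) for m in compiled.finditer(sample)]
print(results)
# [('H1', 'first message'), ('(H2+H3)', 'second message')]
```

Because every group inside multicode is non-capturing, only the two groups in pattern show up in the match, no matter how many times multicode is interpolated.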

Caveats

Hashes and spaces need to be escaped

Because a hash is used to mark the start of a comment,
and spaces are mostly ignored,
you have to represent them in some other way.

The documentation of re.VERBOSE is quite helpful:

When a line contains a # that is not in a character class and is not preceded by an unescaped backslash, all characters from the leftmost such # through the end of the line are ignored.

That is, this won’t behave like the non-verbose version:

>>> re.findall(r"\d+#\d+", "1#23a")
['1#23']
>>> re.findall(r"\d+ # \d+", "1#23a", re.VERBOSE)
['1', '23']

… but these will:

>>> re.findall(r"\d+ [#] \d+", "1#23a", re.VERBOSE)
['1#23']
>>> re.findall(r"\d+ \# \d+", "1#23a", re.VERBOSE)
['1#23']

The same is true for spaces:

>>> re.findall(r"\d+ [ ] \d+", "1 23a", re.VERBOSE)
['1 23']
>>> re.findall(r"\d+ \  \d+", "1 23a", re.VERBOSE)
['1 23']
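
When the literal text comes from a variable, escaping it by hand gets tedious. One option, sketched below, is to pass it through re.escape(), which backslash-escapes spaces and hashes along with the other special characters. The separator value and the sample text are made up for illustration.

```python
import re

# Hypothetical literal separator containing both spaces and a hash.
separator = " # "

# re.escape() turns " # " into backslash-escaped characters, so the
# spaces and the hash are matched literally even under re.VERBOSE.
pattern = fr"""
\d+                     # left number
{re.escape(separator)}  # literal separator, escaped
\d+                     # right number
"""

matches = re.findall(pattern, "1 # 23a", re.VERBOSE)
print(matches)
# ['1 # 23']
```

The function call inside the replacement field keeps the escaping next to the place where the text is interpolated, so the sub-pattern itself can stay a plain string.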

When composing regexes,
ending a pattern on the same line as a comment
might accidentally comment the following line in the enclosing pattern:

>>> one = "1 # comment"
>>> onetwo = f"{one} 2"
>>> re.findall(onetwo, '0123', re.VERBOSE)
['1']
>>> print(onetwo)
1 # comment 2

This can be avoided by always ending the pattern on a new line:

>>> one = """
... 1 # comment
... """
>>> onetwo = f"""
... {one} 2
... """
>>> re.findall(onetwo, '0123', re.VERBOSE)
['12']

While a bit cumbersome,
in real life most patterns would span multiple lines anyway,
so it’s not really an issue.

(Note that this is only needed if you use comments.)

Brace quantifiers need to be escaped

Because f-strings already use braces for replacements,
to represent brace quantifiers you must double the braces:

>>> re.findall("m{2}", "entire mm but only two of mmm")
['mm', 'mm']
>>> letter = "m"
>>> pattern = f"{letter}{{2}}"
>>> re.findall(pattern, "entire mm but only two of mmm")
['mm', 'mm']
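
The doubled braces work the same way inside a verbose pattern. A small sketch, using a made-up date format and sample text (the names digit and date are assumptions for this example):

```python
import re

digit = r"\d"

# {digit} is an f-string replacement field; {{4}} and {{2}} come out
# as the literal quantifiers {4} and {2} in the final pattern.
date = fr"""
{digit}{{4}}  # year
-
{digit}{{2}}  # month
-
{digit}{{2}}  # day
"""

matches = re.findall(date, "released 2022-02-14, updated 2022-03-01", re.VERBOSE)
print(matches)
# ['2022-02-14', '2022-03-01']
```

Sub-patterns that are not f-strings themselves, like digit here, keep using single braces for quantifiers; only the string doing the interpolation needs them doubled.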

I don’t control the flags

Maybe you’d like to use verbose regexes,
but don’t control the flags passed to the re functions
(for example, because you’re passing the regex to an API).

Worry not! The regular ex
