
Regex Isn’t Hard (2023) by asicsp
Regex gets a bad reputation for being very complex. That’s fair, but I also think that if you focus on a certain core
subset of regex, it’s not that hard. Most of the complexity comes from various “shortcuts” that are hard to remember.
If you ignore those, the language itself is fairly small and portable across programming languages.
It’s worth knowing regex because you can get A LOT done in very little code. If I try to replicate what my regex does
using normal procedural code, it’s often very verbose, buggy and significantly slower. It often takes hours or days to
do better than a couple minutes of writing regex.
NOTE: Some languages, like Rust, have parser combinators which can be as good as or better than regex in most of the ways I
care about. However, I often opt for regex anyway because it’s less to fit in my brain. There’s a single core subset of
regex that all major programming languages support.
There are four major concepts you need to know:
- Character sets
- Repetition
- Groups
- The `|`, `^` and `$` operators
Here I’ll highlight a subset of the regex language that’s not hard to understand or remember. Throughout I’ll also tell you what to
ignore. Most of these things are shortcuts that save a little verbosity at the expense of a lot of complexity. I’d rather have
verbosity than complexity, so I stick to this subset.
A character set is the smallest unit of text matching available in regex. It matches exactly one character.
Single characters
`a` matches a single character, always lowercase `a`. `aaa` is 3 consecutive character sets, each of which matches only `a`. Same
with `abc`, but the second and third match `b` and `c` respectively.
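As a quick illustration of the paragraph above, here is a minimal sketch using Python's standard `re` module. The helper name `contains` is made up for this example and is not part of any library.

```python
import re

def contains(pattern: str, text: str) -> bool:
    """Return True if the pattern matches anywhere in the text (hypothetical helper)."""
    return re.search(pattern, text) is not None

# Each literal character is its own one-character set, matched in order.
print(contains("aaa", "baaad"))    # True: three consecutive a's are present
print(contains("aaa", "aab"))      # False: only two a's in a row
print(contains("abc", "xxabcxx"))  # True: a, then b, then c
print(contains("abc", "acb"))      # False: the order matters
print(contains("a", "A"))          # False: always lowercase a
```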
Ranges
Match one of a set of characters.
- `[a]`: same as just `a`
- `[abc]`: matches `a`, `b`, or `c`
- `[a-c]`: same, but using `-` to specify a range of characters
- `[a-z]`: any lowercase character
- `[a-zA-Z]`: any lowercase or uppercase character
- `[a-zA-Z0-9!@#$%^&*()-]`: alphanumeric plus any of these symbols: `!@#$%^&*()-`
Note in that last point how `-` comes last, so it is read as a literal `-` rather than as a range. Also note that `^`
isn’t the first character in the range, because `^` can become an
operator if it occurs as the first character in a character set or regex.
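To make the ranges concrete, the sketch below checks which characters from a small sample each set accepts. It uses Python's `re` module; the helper `matching` and the sample string `candidates` are invented for this illustration.

```python
import re

def matching(pattern: str, chars: str) -> str:
    """Return the characters from `chars` that the one-character pattern matches."""
    return "".join(ch for ch in chars if re.fullmatch(pattern, ch))

# Sample characters: lowercase, uppercase, digit, '-', '^' and a space.
candidates = "abcdQ7-^ "

print(matching(r"[abc]", candidates))     # abc
print(matching(r"[a-c]", candidates))     # abc (same set, written as a range)
print(matching(r"[a-z]", candidates))     # abcd
print(matching(r"[a-zA-Z]", candidates))  # abcdQ
# '-' comes last and '^' is not first, so both are plain characters here.
print(matching(r"[a-zA-Z0-9!@#$%^&*()-]", candidates))  # abcdQ7-^
# With '-' at the end there is no range: 'b' is not matched.
print(matching(r"[ac-]", candidates))     # ac-
```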
There’s a parallel to boolean logic here:
- `ab` means “`a` AND `b`”
- `[ab]` means “`a` OR `b`”
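A small sketch of that AND/OR parallel, again assuming Python's `re` module. The helper `whole` is a name made up here; it requires the pattern to match the entire string.

```python
import re

def whole(pattern: str, text: str) -> bool:
    """Return True if the pattern matches the entire text (hypothetical helper)."""
    return re.fullmatch(pattern, text) is not None

# Sequence behaves like AND: both characters are required, in order.
print(whole("ab", "ab"))    # True
print(whole("ab", "a"))     # False

# A character set behaves like OR: exactly one character, either choice.
print(whole("[ab]", "a"))   # True
print(whole("[ab]", "b"))   # True
print(whole("[ab]", "ab"))  # False: a set matches a single character

# The two combine: (a OR b) AND (a OR b).
print(whole("[ab][ab]", "ba"))  # True
```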
You can build more complex logic using groups and negation.
Negation (`^`)
I mention this operator later, but in the context of character sets, it means “everything but these”.
Example:
- `[^ab]` means “everything but `a` or `b`”
- `[ab^]` means “`a`, `b` or `^`”. The `^` has to be the first cha