ReadStuffLater uses emojis to tag content †. It’s simple, it’s fun, and it affords basic content organization without encouraging users to spiral into reinvent-Dewey-Decimal territory.

There’s just one problem: data validation. When the client tells my server to tag a record, how can the server confirm the tag is actually an emoji? I mean, I shouldn’t accept and store just anything in that field, right?
This is a much gnarlier problem than it has any right to be. If you want the TL;DR, see what I did and what I wish I’d done, and a more technical solution!
My first thought was to google around for this, and everyone recommends regex! Everyone! Well that seemed easy.
There is a fairly recent extension to regex, Unicode property escapes (in JavaScript since ES2018), that lets you specifically ask, “is this an emoji?”
Except it’s wrong. And also not available everywhere.
const regex = /^\p{Emoji}$/gu;
console.log("🙂".match(regex));
console.log("*️⃣".match(regex));
console.log("👨🏾".match(regex));
> Array ["🙂"]
> null
> null
I mean, it produces kinda-okay results if you ask “does this string contain any number of emojis”. But it fails hard when you ask “Is this string made of exactly one emoji, and nothing else?”.
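To make that difference concrete, here is a minimal sketch that asks both questions of the same string. The helper names `containsEmoji` and `isSingleEmojiNaive` are mine, made up for illustration. It also shows a second wrinkle: the `Emoji` property covers the ASCII digits, `#`, and `*`, because they are the bases for keycap emojis.

```javascript
// Hypothetical helpers, named for illustration only.
const containsEmoji = (s) => /\p{Emoji}/u.test(s);
const isSingleEmojiNaive = (s) => /^\p{Emoji}$/u.test(s);

// Man + medium-dark skin tone modifier, written as escapes so nothing is invisible.
const man = "\u{1F468}\u{1F3FE}";

console.log(containsEmoji("hello " + man)); // true: there is an emoji in there somewhere
console.log(isSingleEmojiNaive(man));       // false: one glyph, but two code points
console.log(isSingleEmojiNaive("7"));       // true: digits carry the Emoji property
```

So the anchored version rejects things that look like one emoji and accepts things that clearly aren't.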
Also, it seems Postgres regex doesn’t support these special character classes, so validation would be strictly at the application layer.
EDIT: Someone showed how to patch the holes in this approach and make it work. Check it out below!
Why does the regex give the wrong answer?
I’m glad you asked! It turns out there isn’t really such a thing as “an emoji”. You have code points, and code point modifiers, and code point combinations.
A great primer on this is Bear Plus Snowflake Equals Polar Bear.
Here’s the dealio: Let’s say we want to display the emoji for a brown man, “👨🏾”. There isn’t a single code point for that. Instead we use “👨” followed by a skin tone modifier, “🏾”. Other combinations, like professions and families, glue their parts together with a ZWJ.
ZWJ is “zero-width joiner”. It’s a Unicode code point (U+200D) that gets used in I guess the Indian Devanagari writing system? But it’s also a fundamental building block for emojis.
Its job is “when a mommy code point loves a daddy code point very much, they come together and make a whole new glyph”.
Basically any emoji that includes at least 1 person who isn’t a boring yellow person doing nothing is several code points stapled together, with ZWJ or with a modifier like skin tone. Some other things work this way too.
Some examples include: 👨‍👩‍👦 (man + woman + boy), 👩‍✈️ (woman + airplane), and ❤️‍🔥 (heart + fire).
(And flags are multiple code points that aren’t connected by ZWJ! ††)
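If you want to see the stapling for yourself, here is a small sketch that lists the code points inside a few of these. The `codePoints` helper and the three sample strings are my own picks for illustration (the Canadian flag stands in for "flags" generally), and the strings are written as escapes so the invisible parts are visible.

```javascript
// List a string's code points as U+XXXX labels.
// Spreading a string iterates by code point, not by UTF-16 code unit.
const codePoints = (s) =>
  [...s].map((ch) => "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0"));

const manMediumDark = "\u{1F468}\u{1F3FE}";             // base + skin tone modifier
const womanPilot = "\u{1F469}\u{200D}\u{2708}\u{FE0F}"; // woman + ZWJ + airplane + variation selector
const flagCanada = "\u{1F1E8}\u{1F1E6}";                // regional indicators C + A, no ZWJ

console.log(codePoints(manMediumDark).join(" ")); // U+1F468 U+1F3FE
console.log(codePoints(womanPilot).join(" "));    // U+1F469 U+200D U+2708 U+FE0F
console.log(codePoints(flagCanada).join(" "));    // U+1F1E8 U+1F1E6
```

Three glyphs on screen, and three different ways of gluing code points together.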
(If your computer doesn’t have current or exhaustive emoji fonts (thanks, Linux!), you might see what’s supposed to be a single glyph instead displayed as several emojis side by side, like how my computer shows “Women With Bunny Ears Partying, Type-1-2” as “ 👯 🏻 ♀️”.)
So our regex can’t just check if a string is an emoji: many things we want to identify are several emojis stapled together.
(The way you want to think about your goal here is “graphemes” and “glyphs”, not “characters”.)
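As a rough illustration of the gap between those units, this sketch counts the same string three ways. It assumes a runtime that ships `Intl.Segmenter` (recent Node and browsers do); `graphemeCount` is a name I made up for the example, and this only counts graphemes, it doesn't decide whether they're emojis.

```javascript
// Count user-perceived characters (grapheme clusters) with Intl.Segmenter.
// Assumes the runtime supports Intl.Segmenter.
const segmenter = new Intl.Segmenter("en", { granularity: "grapheme" });
const graphemeCount = (s) => [...segmenter.segment(s)].length;

// Blond person + dark skin tone + ZWJ + male sign + variation selector.
const blondMan = "\u{1F471}\u{1F3FF}\u{200D}\u{2642}\u{FE0F}";

console.log(blondMan.length);         // 7 UTF-16 code units
console.log([...blondMan].length);    // 5 code points
console.log(graphemeCount(blondMan)); // 1 grapheme
```

Only the last number matches what a person would say they're looking at.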
Fortunately, when I experimented, it looked like you have to join characters in a specific order, so when you add both skin tone and gender (“👱🏿‍♂️”) you can count on it happening in exactly one canonical code point sequence. Otherwise, we’d have to dive into Unicode normalization (a good topic to understand anyway!).
Edit: Someone showed me how to make this work. Check it out below!
Alright, so we can’t just use the special regex “match me some emoji” feature. What about a regex full of Unicode character ranges? StackOverflow sure loves those!
Well, they’re all either too broad or too narrow.
You get stuff like “just capture anything that’s a ‘Unicode other symbol’” (/\p{So}+/gu). This fails for the same reasons as approach #1, and also for the bonus reason that this character class includes symbols that aren’t emojis (‘❤’).
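Here is a quick sketch of that "too broad and too narrow" problem with the other-symbol class. The helper name `onlyOtherSymbols` and the sample characters are my own choices for illustration.

```javascript
// Hypothetical helper: is the string made only of "other symbol" (So) code points?
const onlyOtherSymbols = (s) => /^\p{So}+$/u.test(s);

console.log(onlyOtherSymbols("\u{1F642}"));          // true: slightly smiling face
console.log(onlyOtherSymbols("\u{2103}"));           // true: DEGREE CELSIUS, not an emoji
console.log(onlyOtherSymbols("\u{266A}"));           // true: EIGHTH NOTE, also not an emoji
console.log(onlyOtherSymbols("\u{1F468}\u{1F3FE}")); // false: the skin tone modifier is category Sk
```

It waves through plain old symbols and trips over a perfectly ordinary emoji with a skin tone.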
Ah, but some other StackOverflow answer says to just use a regex for Unicode code points! That also fails the same way as approach #1, plus, nobody includes exhaustive code point ranges in their SO answers.
Here’s a partial li