Smuggling arbitrary data through an emoji by paulgb

ByHackTech February 12, 2025

24Comments

Share This Article

Sed ut perspiciatis unde.

Send to HN

This Hacker News comment by GuB-42 intrigued me:

With ZWJ (Zero Width Joiner) sequences you could in theory encode an unlimited amount of data in a single emoji.

Is it really possible to encode arbitrary data in a single emoji?

tl;dr: yes, although I found an approach without ZWJ. In fact, you can encode data in any unicode character. This sentence has a hidden message󠅟󠅘󠄐󠅝󠅩󠄜󠄐󠅩󠅟󠅥󠄐󠅖󠅟󠅥󠅞󠅔󠄐󠅤󠅘󠅕󠄐󠅘󠅙󠅔󠅔󠅕󠅞󠄐󠅝󠅕󠅣󠅣󠅑󠅗󠅕󠄐󠅙󠅞󠄐󠅤󠅘󠅕󠄐󠅤󠅕󠅨󠅤󠄑. (Try pasting it into this decoder)

Video Player

Media error: Format(s) not supported or source(s) not found

Download File: http://paulbutler.org/screencap.webm

00:00

Use Up/Down Arrow keys to increase or decrease volume.

Some background

Unicode represents text as a sequence of codepoints, each of which is basically just a number that the Unicode Consortium has assigned meaning to.
Usually, a specific codepoint is written as U+XXXX, where XXXX is a number represented as uppercase hexadecimal.

For simple latin-alphabet text, there is a one-to-one mapping between Unicode codepoints and characters that appear on-screen. For example,
U+0067 represents the character g.

For other writing systems, some on-screen characters may be represented by multiple codepoints. The character की
(in Devanagari script) is represented by a consecutive pairing of the codepoints U+0915 and U+0940.

Variation selectors

Unicode designates 256 codepoints as “variation selectors”, named VS-1 to VS-256. These have no on-screen representation of their own, but are used to modify
the presentation of the preceeding character.

Most unicode characters do not have variations associated with them. Since unicode is an evolving standard and aims to be future-compatible,
variation selectors are supposed to be preserved during transformations, even if their meaning is not known by the code handling them.
So the codepoint U+0067 (“g”) followed by U+FE01 (VS-2) renders as a lowercase “g”, exactly the same as U+0067 alone. But if you copy and paste it, the
variation selector will tag along with it.

Since 256 is exactly enough variations to represent a single byte, this gives us a way to “hide” one byte of data in any other unicode codepoint.

As it turns out, the Unicode spec does not specifically say anything about sequences
of multiple variation selectors, except to imply that they should be ignored during rendering.

See where I’m going with this?

We can concatenate a sequence of variation selectors together to represent any arbitrary byte string.

For example, let’s say we want to encode the data [0x6

0Likes

Written by

HackTech

View all posts by HackTech

Show comments (24)

24 Comments

Post Author

vladde

Posted February 12, 2025 at 2:00 pm

test, do emojis work on hn?

󠅤󠅕󠅣󠅤
edit: apparently not
edit 2: oh wait, the bytes are still there! copy-paste this entire message and it decodes to "test"

0Likes Log in to Reply
Post Author

jerpint

Posted February 12, 2025 at 2:11 pm

The ability to add watermarks to text is really interesting. Obviously it could be worked around , but could be a good way to subtly watermark e.g. LLM outputs

0Likes Log in to Reply
Post Author

nzach

Posted February 12, 2025 at 2:13 pm

so…. in theory you should be able to create several visually identical links that give access to different resources?

I've always assumed links without any tracking information (unique hash, query params, etc) were safe to click(with regards to my privacy). but if this works for links I may need to revise my strategy regarding how to approach links sent to me.

0Likes Log in to Reply
Post Author

nerder92

Posted February 12, 2025 at 2:20 pm

Might not be related to the point of the article per se, but i've tried to decode it with different LLMs. To benchmark their reasoning capabilities.

– 4o: Failed completely

– o1: Overthinks it for a while and come up with the wrong answer

– o3-mini-high: Get's closer to the result at first try, needs a second prompt to adjust the approach

– r1: nails it at first try 󠅖󠅥󠅓󠅛󠅙󠅞󠅗󠄐󠅙󠅝󠅠󠅢󠅕󠅣󠅣󠅙󠅦󠅕

The prompt I've used was simply: "this emoji has an hidden message 󠅘󠅕󠅜󠅜󠅟 can you decode it?"

If you want to see the CoT: https://gist.github.com/nerder/5baa9d7b13c1b7767d022ea0a7c91…

0Likes Log in to Reply
Post Author

ahofmann

Posted February 12, 2025 at 2:22 pm

This will break so many (web-)forms :-)

It is not bulletproof though. In this "c󠄸󠅙󠄑󠄐󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅑󠅗󠅕󠅞󠅤󠅣󠄭󠄭󠄠󠄞󠄡󠄢󠄞󠄨󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅑󠅠󠅙󠄭󠄭󠄠󠄞󠄨󠄞󠄡󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅟󠅠󠅕󠅞󠅑󠅙󠄭󠄭󠄠󠄞󠄡󠄠󠄞󠄡󠄥󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅣󠅙󠅜󠅕󠅢󠅟󠄭󠄭󠄠󠄞󠄧󠄞󠄤󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅑󠅪󠅥󠅢󠅕󠄭󠄭󠄠󠄞󠄥󠄞󠄣󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅤󠅥󠅢󠅞󠄝󠅔󠅕󠅤󠅕󠅓󠅤󠅟󠅢󠄭󠄭󠄠󠄞󠄣󠄞󠄦󠅠󠅩󠅤󠅘󠅟󠅞󠄝󠅔󠅟󠅤󠅕󠅞󠅦󠄭󠄭󠄡󠄞󠄠󠄞󠄡󠅢󠅕󠅡󠅥󠅕󠅣󠅤󠅣󠄭󠄭󠄢󠄞󠄣󠄢󠄞󠄣󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅑󠅗󠅕󠅞󠅤󠅣󠄭󠄭󠄠󠄞󠄡󠄢󠄞󠄨󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅑󠅠󠅙󠄭󠄭󠄠󠄞󠄨󠄞󠄡󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅟󠅠󠅕󠅞󠅑󠅙󠄭󠄭󠄠󠄞󠄡󠄠󠄞󠄡󠄥󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅣󠅙󠅜󠅕󠅢󠅟󠄭󠄭󠄠󠄞󠄧󠄞󠄤󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅑󠅪󠅥󠅢󠅕󠄭󠄭󠄠󠄞󠄥󠄞󠄣󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅤󠅥󠅢󠅞󠄝󠅔󠅕󠅤󠅕󠅓󠅤󠅟󠅢󠄭󠄭󠄠󠄞󠄣󠄞󠄦󠅠󠅩󠅤󠅘󠅟󠅞󠄝󠅔󠅟󠅤󠅕󠅞󠅦󠄭󠄭󠄡󠄞󠄠󠄞󠄡󠅢󠅕󠅡󠅥󠅕󠅣󠅤󠅣󠄭󠄭󠄢󠄞󠄣󠄢󠄞󠄣󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅑󠅗󠅕󠅞󠅤󠅣󠄭󠄭󠄠󠄞󠄡󠄢󠄞󠄨󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅑󠅠󠅙󠄭󠄭󠄠󠄞󠄨󠄞󠄡󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅟󠅠󠅕󠅞󠅑󠅙󠄭󠄭󠄠󠄞󠄡󠄠󠄞󠄡󠄥󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅣󠅙󠅜󠅕󠅢󠅟󠄭󠄭󠄠󠄞󠄧󠄞󠄤󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅑󠅪󠅥󠅢󠅕󠄭󠄭󠄠󠄞󠄥󠄞󠄣󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅤󠅥󠅢󠅞󠄝󠅔󠅕󠅤󠅕󠅓󠅤󠅟󠅢󠄭󠄭󠄠󠄞󠄣󠄞󠄦󠅠󠅩󠅤󠅘󠅟󠅞󠄝󠅔󠅟󠅤󠅕󠅞󠅦󠄭󠄭󠄡󠄞󠄠󠄞󠄡󠅢󠅕󠅡󠅥󠅕󠅣󠅤󠅣󠄭󠄭󠄢󠄞󠄣󠄢󠄞󠄣󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅑󠅗󠅕󠅞󠅤󠅣󠄭󠄭󠄠󠄞󠄡󠄢󠄞󠄨󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅑󠅠󠅙󠄭󠄭󠄠󠄞󠄨󠄞󠄡󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅟󠅠󠅕󠅞󠅑󠅙󠄭󠄭󠄠󠄞󠄡󠄠󠄞󠄡󠄥󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅣󠅙󠅜󠅕󠅢󠅟󠄭󠄭󠄠󠄞󠄧󠄞󠄤󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅑󠅪󠅥󠅢󠅕󠄭󠄭󠄠󠄞󠄥󠄞󠄣󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅤󠅥󠅢󠅞󠄝󠅔󠅕󠅤󠅕󠅓󠅤󠅟󠅢󠄭󠄭󠄠󠄞󠄣󠄞󠄦󠅠󠅩󠅤󠅘󠅟󠅞󠄝󠅔󠅟󠅤󠅕󠅞󠅦󠄭󠄭󠄡󠄞󠄠󠄞󠄡󠅢󠅕󠅡󠅥󠅕󠅣󠅤󠅣󠄭󠄭󠄢󠄞󠄣󠄢󠄞󠄣󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅑󠅗󠅕󠅞󠅤󠅣󠄭󠄭󠄠󠄞󠄡󠄢󠄞󠄨󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅑󠅠󠅙󠄭󠄭󠄠󠄞󠄨󠄞󠄡󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅟󠅠󠅕󠅞󠅑󠅙󠄭󠄭󠄠󠄞󠄡󠄠󠄞󠄡󠄥󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅣󠅙󠅜󠅕󠅢󠅟󠄭󠄭󠄠󠄞󠄧󠄞󠄤󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅑󠅪󠅥󠅢󠅕󠄭󠄭󠄠󠄞󠄥󠄞󠄣󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅤󠅥󠅢󠅞󠄝󠅔󠅕󠅤󠅕󠅓󠅤󠅟󠅢󠄭󠄭󠄠󠄞󠄣󠄞󠄦󠅠󠅩󠅤󠅘󠅟󠅞󠄝󠅔󠅟󠅤󠅕󠅞󠅦󠄭󠄭󠄡󠄞󠄠󠄞󠄡󠅢󠅕󠅡󠅥󠅕󠅣󠅤󠅣󠄭󠄭󠄢󠄞󠄣󠄢󠄞󠄣󠅣󠅖󠅣󠅖󠅣󠅕󠅖󠅗󠅣󠅢󠅗󠄐󠅣󠅢󠅗󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅑󠅗󠅕󠅞󠅤󠅣󠄭󠄭󠄠󠄞󠄡󠄢󠄞󠄨󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅑󠅠󠅙󠄭󠄭󠄠󠄞󠄨󠄞󠄡󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅟󠅠󠅕󠅞󠅑󠅙󠄭󠄭󠄠󠄞󠄡󠄠󠄞󠄡󠄥󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅣󠅙󠅜󠅕󠅢󠅟󠄭󠄭󠄠󠄞󠄧󠄞󠄤󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅑󠅪󠅥󠅢󠅕󠄭󠄭󠄠󠄞󠄥󠄞󠄣󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅤󠅥󠅢󠅞󠄝󠅔󠅕󠅤󠅕󠅓󠅤󠅟󠅢󠄭󠄭󠄠󠄞󠄣󠄞󠄦󠅠󠅩󠅤󠅘󠅟󠅞󠄝󠅔󠅟󠅤󠅕󠅞󠅦󠄭󠄭󠄡󠄞󠄠󠄞󠄡󠅢󠅕󠅡󠅥󠅕󠅣󠅤󠅣󠄭󠄭󠄢󠄞󠄣󠄢󠄞󠄣󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅑󠅗󠅕󠅞󠅤󠅣󠄭󠄭󠄠󠄞󠄡󠄢󠄞󠄨󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅑󠅠󠅙󠄭󠄭󠄠󠄞󠄨󠄞󠄡󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅟󠅠󠅕󠅞󠅑󠅙󠄭󠄭󠄠󠄞󠄡󠄠󠄞󠄡󠄥󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅣󠅙󠅜󠅕󠅢󠅟󠄭󠄭󠄠󠄞󠄧󠄞󠄤󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅑󠅪󠅥󠅢󠅕󠄭󠄭󠄠󠄞󠄥󠄞󠄣󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅤󠅥󠅢󠅞󠄝󠅔󠅕󠅤󠅕󠅓󠅤󠅟󠅢󠄭󠄭󠄠󠄞󠄣󠄞󠄦󠅠󠅩󠅤󠅘󠅟󠅞󠄝󠅔󠅟󠅤󠅕󠅞󠅦󠄭󠄭󠄡󠄞󠄠󠄞󠄡󠅢󠅕󠅡󠅥󠅕󠅣󠅤󠅣󠄭󠄭󠄢󠄞󠄣󠄢󠄞󠄣󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅑󠅗󠅕󠅞󠅤󠅣󠄭󠄭󠄠󠄞󠄡󠄢󠄞󠄨󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅑󠅠󠅙󠄭󠄭󠄠󠄞󠄨󠄞󠄡󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅟󠅠󠅕󠅞󠅑󠅙󠄭󠄭󠄠󠄞󠄡󠄠󠄞󠄡󠄥󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅣󠅙󠅜󠅕󠅢󠅟󠄭󠄭󠄠󠄞󠄧󠄞󠄤󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅑󠅪󠅥󠅢󠅕󠄭󠄭󠄠󠄞󠄥󠄞󠄣󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅤󠅥󠅢󠅞󠄝󠅔󠅕󠅤󠅕󠅓󠅤󠅟󠅢󠄭󠄭󠄠󠄞󠄣󠄞󠄦󠅠󠅩󠅤󠅘󠅟󠅞󠄝󠅔󠅟󠅤󠅕󠅞󠅦󠄭󠄭󠄡󠄞󠄠󠄞󠄡󠅢󠅕󠅡󠅥󠅕󠅣󠅤󠅣󠄭󠄭󠄢󠄞󠄣󠄢󠄞󠄣󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅑󠅗󠅕󠅞󠅤󠅣󠄭󠄭󠄠󠄞󠄡󠄢󠄞󠄨󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅑󠅠󠅙󠄭󠄭󠄠󠄞󠄨󠄞󠄡󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅟󠅠󠅕󠅞󠅑󠅙󠄭󠄭󠄠󠄞󠄡󠄠󠄞󠄡󠄥󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅣󠅙󠅜󠅕󠅢󠅟󠄭󠄭󠄠󠄞󠄧󠄞󠄤󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅑󠅪󠅥󠅢󠅕󠄭󠄭󠄠󠄞󠄥󠄞󠄣󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅤󠅥󠅢󠅞󠄝󠅔󠅕󠅤󠅕󠅓󠅤󠅟󠅢󠄭󠄭󠄠󠄞󠄣󠄞󠄦󠅠󠅩󠅤󠅘󠅟󠅞󠄝󠅔󠅟󠅤󠅕󠅞󠅦󠄭󠄭󠄡󠄞󠄠󠄞󠄡󠅢󠅕󠅡󠅥󠅕󠅣󠅤󠅣󠄭󠄭󠄢󠄞󠄣󠄢󠄞󠄣󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅑󠅗󠅕󠅞󠅤󠅣󠄭󠄭󠄠󠄞󠄡󠄢󠄞󠄨󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅑󠅠󠅙󠄭󠄭󠄠󠄞󠄨󠄞󠄡󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅟󠅠󠅕󠅞󠅑󠅙󠄭󠄭󠄠󠄞󠄡󠄠󠄞󠄡󠄥󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅣󠅙󠅜󠅕󠅢󠅟󠄭󠄭󠄠󠄞󠄧󠄞󠄤󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅑󠅪󠅥󠅢󠅕󠄭󠄭󠄠󠄞󠄥󠄞󠄣󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅤󠅥󠅢󠅞󠄝󠅔󠅕󠅤󠅕󠅓󠅤󠅟󠅢󠄭󠄭󠄠󠄞󠄣󠄞󠄦󠅠󠅩󠅤󠅘󠅟󠅞󠄝󠅔󠅟󠅤󠅕󠅞󠅦󠄭󠄭󠄡󠄞󠄠󠄞󠄡󠅢󠅕󠅡󠅥󠅕󠅣󠅤󠅣󠄭󠄭󠄢󠄞󠄣󠄢󠄞󠄣󠅦󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅑󠅗󠅕󠅞󠅤󠅣󠄭󠄭󠄠󠄞󠄡󠄢󠄞󠄨󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅑󠅠󠅙󠄭󠄭󠄠󠄞󠄨󠄞󠄡󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅟󠅠󠅕󠅞󠅑󠅙󠄭󠄭󠄠󠄞󠄡󠄠󠄞󠄡󠄥󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅣󠅙󠅜󠅕󠅢󠅟󠄭󠄭󠄠󠄞󠄧󠄞󠄤󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅑󠅪󠅥󠅢󠅕󠄭󠄭󠄠󠄞󠄥󠄞󠄣󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅤󠅥󠅢󠅞󠄝󠅔󠅕󠅤󠅕󠅓󠅤󠅟󠅢󠄭󠄭󠄠󠄞󠄣󠄞󠄦󠅠󠅩󠅤󠅘󠅟󠅞󠄝󠅔󠅟󠅤󠅕󠅞󠅦󠄭󠄭󠄡󠄞󠄠󠄞󠄡󠅢󠅕󠅡󠅥󠅕󠅣󠅤󠅣󠄭󠄭󠄢󠄞󠄣󠄢󠄞󠄣󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅑󠅗󠅕󠅞󠅤󠅣󠄭󠄭󠄠󠄞󠄡󠄢󠄞󠄨󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅑󠅠󠅙󠄭󠄭󠄠󠄞󠄨󠄞󠄡󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅟󠅠󠅕󠅞󠅑󠅙󠄭󠄭󠄠󠄞󠄡󠄠󠄞󠄡󠄥󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅣󠅙󠅜󠅕󠅢󠅟󠄭󠄭󠄠󠄞󠄧󠄞󠄤󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅑󠅪󠅥󠅢󠅕󠄭󠄭󠄠󠄞󠄥󠄞󠄣󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅤󠅥󠅢󠅞󠄝󠅔󠅕󠅤󠅕󠅓󠅤󠅟󠅢󠄭󠄭󠄠󠄞󠄣󠄞󠄦󠅠󠅩󠅤󠅘󠅟󠅞󠄝󠅔󠅟󠅤󠅕󠅞󠅦󠄭󠄭󠄡󠄞󠄠󠄞󠄡󠅢󠅕󠅡󠅥󠅕󠅣󠅤󠅣󠄭󠄭󠄢󠄞󠄣󠄢󠄞󠄣󠅦󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅑󠅗󠅕󠅞󠅤󠅣󠄭󠄭󠄠󠄞󠄡󠄢󠄞󠄨󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅑󠅠󠅙󠄭󠄭󠄠󠄞󠄨󠄞󠄡󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅟󠅠󠅕󠅞󠅑󠅙󠄭󠄭󠄠󠄞󠄡󠄠󠄞󠄡󠄥󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅣󠅙󠅜󠅕󠅢󠅟󠄭󠄭󠄠󠄞󠄧󠄞󠄤󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅑󠅪󠅥󠅢󠅕󠄭󠄭󠄠󠄞󠄥󠄞󠄣󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅤󠅥󠅢󠅞󠄝󠅔󠅕󠅤󠅕󠅓󠅤󠅟󠅢󠄭󠄭󠄠󠄞󠄣󠄞󠄦󠅠󠅩󠅤󠅘󠅟󠅞󠄝󠅔󠅟󠅤󠅕󠅞󠅦󠄭󠄭󠄡󠄞󠄠󠄞󠄡󠅢󠅕󠅡󠅥󠅕󠅣󠅤󠅣󠄭󠄭󠄢󠄞󠄣󠄢󠄞󠄣󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅑󠅗󠅕󠅞󠅤󠅣󠄭󠄭󠄠󠄞󠄡󠄢󠄞󠄨󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅑󠅠󠅙󠄭󠄭󠄠󠄞󠄨󠄞󠄡󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅟󠅠󠅕󠅞󠅑󠅙󠄭󠄭󠄠󠄞󠄡󠄠󠄞󠄡󠄥󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅣󠅙󠅜󠅕󠅢󠅟󠄭󠄭󠄠󠄞󠄧󠄞󠄤󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅑󠅪󠅥󠅢󠅕󠄭󠄭󠄠󠄞󠄥󠄞󠄣󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅤󠅥󠅢󠅞󠄝󠅔󠅕󠅤󠅕󠅓󠅤󠅟󠅢󠄭󠄭󠄠󠄞󠄣󠄞󠄦󠅠󠅩󠅤󠅘󠅟󠅞󠄝󠅔󠅟󠅤󠅕󠅞󠅦󠄭󠄭󠄡󠄞󠄠󠄞󠄡󠅢󠅕󠅡󠅥󠅕󠅣󠅤󠅣󠄭󠄭󠄢󠄞󠄣󠄢󠄞󠄣󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅑󠅗󠅕󠅞󠅤󠅣󠄭󠄭󠄠󠄞󠄡󠄢󠄞󠄨󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅑󠅠󠅙󠄭󠄭󠄠󠄞󠄨󠄞󠄡󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅟󠅠󠅕󠅞󠅑󠅙󠄭󠄭󠄠󠄞󠄡󠄠󠄞󠄡󠄥󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅣󠅙󠅜󠅕󠅢󠅟󠄭󠄭󠄠󠄞󠄧󠄞󠄤󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅑󠅪󠅥󠅢󠅕󠄭󠄭󠄠󠄞󠄥󠄞󠄣󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅤󠅥󠅢󠅞󠄝󠅔󠅕󠅤󠅕󠅓󠅤󠅟󠅢󠄭󠄭󠄠󠄞󠄣󠄞󠄦󠅠󠅩󠅤󠅘󠅟󠅞󠄝󠅔󠅟󠅤󠅕󠅞󠅦󠄭󠄭󠄡󠄞󠄠󠄞󠄡󠅢󠅕󠅡󠅥󠅕󠅣󠅤󠅣󠄭󠄭󠄢󠄞󠄣󠄢󠄞󠄣󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅑󠅗󠅕󠅞󠅤󠅣󠄭󠄭󠄠󠄞󠄡󠄢󠄞󠄨󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅑󠅠󠅙󠄭󠄭󠄠󠄞󠄨󠄞󠄡󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅟󠅠󠅕󠅞󠅑󠅙󠄭󠄭󠄠󠄞󠄡󠄠󠄞󠄡󠄥󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅣󠅙󠅜󠅕󠅢󠅟󠄭󠄭󠄠󠄞󠄧󠄞󠄤󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅑󠅪󠅥󠅢󠅕󠄭󠄭󠄠󠄞󠄥󠄞󠄣󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅤󠅥󠅢󠅞󠄝󠅔󠅕󠅤󠅕󠅓󠅤󠅟󠅢󠄭󠄭󠄠󠄞󠄣󠄞󠄦󠅠󠅩󠅤󠅘󠅟󠅞󠄝󠅔󠅟󠅤󠅕󠅞󠅦󠄭󠄭󠄡󠄞󠄠󠄞󠄡󠅢󠅕󠅡󠅥󠅕󠅣󠅤󠅣󠄭󠄭󠄢󠄞󠄣󠄢󠄞󠄣󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅑󠅗󠅕󠅞󠅤󠅣󠄭󠄭󠄠󠄞󠄡󠄢󠄞󠄨󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅑󠅠󠅙󠄭󠄭󠄠󠄞󠄨󠄞󠄡󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅟󠅠󠅕󠅞󠅑󠅙󠄭󠄭󠄠󠄞󠄡󠄠󠄞󠄡󠄥󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅣󠅙󠅜󠅕󠅢󠅟󠄭󠄭󠄠󠄞󠄧󠄞󠄤󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅑󠅪󠅥󠅢󠅕󠄭󠄭󠄠󠄞󠄥󠄞󠄣󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅤󠅥󠅢󠅞󠄝󠅔󠅕󠅤󠅕󠅓󠅤󠅟󠅢󠄭󠄭󠄠󠄞󠄣󠄞󠄦󠅠󠅩󠅤󠅘󠅟󠅞󠄝󠅔󠅟󠅤󠅕󠅞󠅦󠄭󠄭󠄡󠄞󠄠󠄞󠄡󠅢󠅕󠅡󠅥󠅕󠅣󠅤󠅣󠄭󠄭󠄢󠄞󠄣󠄢󠄞󠄣 " and that space, are about 3500 characters. Copying only the "c" above (not this one) will keep some of the hidden text, but not all. Nevertheless, while I knew that this is possible, it still breaks a lot of assumptions around text.

Edit: the text field for editing this post is so large, that I need to scroll down to the update button. This will be a fun toy to create very hard to find bugs in many tools.

0Likes Log in to Reply
Post Author

FranchuFranchu

Posted February 12, 2025 at 2:24 pm

You could store UTF-8 encoded data inside the hidden bytestring. If some of the UTF-8 encoded smuggled characters are variation selector characters, you can smuggle text inside the smuggled text. Smuggled data can be nested arbitrarily deep.

0Likes Log in to Reply
Post Author

petee

Posted February 12, 2025 at 2:29 pm

It's fun that you can encode encoded emoji into a new one

0Likes Log in to Reply
Post Author

HeikoBehrens

Posted February 12, 2025 at 2:33 pm

FWIW, we considered this technique back at Pebble to make notifications more actionable and even filed a patent for that (sorry!) https://patents.justia.com/patent/9411785

Back then on iOS via ANCS, the watches wouldn't receive much more than the textual payload you'd see on the phone. We envisioned to be working with partners such as WhatsApp et al. to encode deep links/message ids into the message so one could respond directly from the watch.

0Likes Log in to Reply
Post Author

riskable

Posted February 12, 2025 at 2:43 pm

Oh this is just the tip of the iceberg when it comes to abusing Unicode! You can use a similar technique to this to overflow the buffer on loads of systems that accept Unicode strings. Normally it just produces an error and/or a crash but sometimes you get lucky and it'll do all sorts of fun things! :)

I remember doing penetration testing waaaaaay back in the day (before Python 3 existed) and using mere diacritics to turn a single character into many bytes that would then overflow the buffer of a back-end web server. This only ever caused it to crash (and usually auto-restart) but I could definitely see how this could be used to exploit certain systems/software with enough fiddling.

0Likes Log in to Reply
Post Author

zurfer

Posted February 12, 2025 at 2:48 pm

h󠅘󠅕󠅜󠅜󠅟󠄐󠅖󠅕󠅜󠅜󠅟󠅧󠄐󠅘󠅑󠅓󠅛󠅕󠅢󠄐󠄪󠄙a

0Likes Log in to Reply
Post Author

kevinsync

Posted February 12, 2025 at 2:50 pm

StegCloak [0] is in the same ballpark and takes this idea a step further by encrypting the hidden payload via AES-256-CTR — pretty neat little trick

[0] https://github.com/KuroLabs/stegcloak

0Likes Log in to Reply
Post Author

vessenes

Posted February 12, 2025 at 3:06 pm

I love the idea of using this for LLM output watermarking. It hits the sweet spot – will catch 99% of slop generators with no fuss, since they only copy and paste anyway, almost no impact on other core use cases.

I wonder how much you’d embed with each letter or token that’s output – userid, prompt ref, date, token number?

I also wonder how this is interpreted in a terminal. Really cool!

0Likes Log in to Reply
Post Author

iNic

Posted February 12, 2025 at 3:15 pm

The tokenizer catches it: https://platform.openai.com/tokenizer.

0Likes Log in to Reply
Post Author

remram

Posted February 12, 2025 at 3:19 pm

I'm not too surprised by this, but I'm annoyed that no amount of configuration made those bytes visible again in my editor. Only using hexdump revealed them.

0Likes Log in to Reply
Post Author

nonameiguess

Posted February 12, 2025 at 3:23 pm

More generally, you can use encoding formats that reserve uninterpreted byte sequences for future use to pass data that is only readable by receivers who know what you're doing, though note this not a cryptographically secure scheme and any sort of statistical analysis can reveal what you're doing.

The png spec, for instance, allows you to include as many metadata chunks as you wish, and these may be used to hold data that cannot be used by any mainstream png reader. We used this in the Navy to embed geolocation and sensor origin data that was readable by specialized viewers that only the Navy had, but if you opened the file in a browser or common image viewer, it would either ignore or discard the unknown chunks.

0Likes Log in to Reply
Post Author

albybisy

Posted February 12, 2025 at 3:28 pm

wow󠄹󠄐󠅑󠅝󠄐󠅞󠅟󠅤󠄐󠅃󠅑󠅤󠅟󠅣󠅘󠅙󠄐󠄾󠅑󠅛󠅑󠅝󠅟󠅤󠅟!

0Likes Log in to Reply
Post Author

65

Posted February 12, 2025 at 3:37 pm

This would be useful as a fingerprinting technique for corporate/government leakers.

0Likes Log in to Reply
Post Author

jaygreco

Posted February 12, 2025 at 4:22 pm

Interestingly, it’s also possible to encode _emoji_ inside emoji!

0Likes Log in to Reply
Post Author

HanClinto

Posted February 12, 2025 at 4:32 pm

Even more than just simply watermarking LLM output, it seems like this could be a neat way to package logprobs data.

Basically, include probability information about every token generated to give a bit of transparency to the generation process. It's part of the OpenAI api spec, and many other engines (such as llama.cpp) support providing this information. Normally it's attached as a separate field, but there are neat ways to visualize it (such as mikupad [0]).

Probably a bad idea, but this still tickles my brain.

* [0]: https://github.com/lmg-anon/mikupad

0Likes Log in to Reply
Post Author

fortran77

Posted February 12, 2025 at 4:38 pm

What's interesting is that even a "view source" shows nothing amiss, and if I do a copy/paste from the debug inspector view of "This sentence has a hidden message󠅟󠅘󠄐󠅝󠅩󠄜󠄐󠅩󠅟󠅥󠄐󠅖󠅟󠅥󠅞󠅔󠄐󠅤󠅘󠅕󠄐󠅘󠅙󠅔󠅔󠅕󠅞󠄐󠅝󠅕󠅣󠅣󠅑󠅗󠅕󠄐󠅙󠅞󠄐󠅤󠅘󠅕󠄐󠅤󠅕󠅨󠅤󠄑." it still shows up….

0Likes Log in to Reply
Post Author

rexxars

Posted February 12, 2025 at 4:40 pm

For a real-world use case: Sanity used this trick[0] to encode Content Source Maps[1] into the actual text served on a webpage when it is in "preview mode". This allows an editor to easily trace some piece of content back to a potentially deep content structure just by clicking on the text/content in question.

It has it's drawbacks/limitations – eg you want to prevent adding it for things that needs to be parsed/used verbatim, like date/timestamps, urls, "ids" etc – but it's still a pretty fun trick.

[0] https://www.sanity.io/docs/stega

[1] https://github.com/sanity-io/content-source-maps

0Likes Log in to Reply
Post Author

vzaliva

Posted February 12, 2025 at 4:50 pm

The title lis little misleading: "Note that the base character does not need to be an emoji – the treatment of variation selectors is the same with regular characters. It’s just more fun with emoji."

Using this approach with non-emoji characters makes it more stealth and even more disturbing.

0Likes Log in to Reply
Post Author

ComputerGuru

Posted February 12, 2025 at 4:55 pm

This is cute but unnecessary – Unicode includes a massive range called PUA: the private use area. The codes in this range aren’t mapped to anything (and won’t be mapped to anything) and are for internal/custom use, not to be passed to external systems (for example, we use them in fish-shell to safely parse tokens into a string, turning an unescaped special character into just another Unicode code point in the string, but in the PUA area, then intercept that later in the pipeline).

You’re not supposed to expose them outside your api boundary but when you encounter them you are prescribed to pass them through as-is, and that’s what most systems and libraries do. It’s a clear potential exfiltration avenue, but given that most sane developers don’t know much more about Unicode other than “always use Unicode to avoid internationalization issues”, it’s often left wide open.

0Likes Log in to Reply
Post Author

frontporch

Posted February 12, 2025 at 5:33 pm

you dont need 256 codepoints so you can neatly represent an octet (whatever that is), you just need 2 bits. you can just stack as many diacritical marks you want on any glyph. either the renderer allows practically unlimited or it allows 1/none. in either case that's a vuln. what would be really earth shattering is what i was hoping this article was: a way to just embed "; rm -rf ~/" into text without it being rendered. you also definitely dont need rust for this unless you want to exclude 90% of the programmer population.

0Likes Log in to Reply

Smuggling arbitrary data through an emoji by paulgb

Smuggling arbitrary data through an emoji by paulgb

Share This Article

Newsletter

Some background

Variation selectors

HackTech

24 Comments

vladde

jerpint

nzach

nerder92

ahofmann

FranchuFranchu

petee

HeikoBehrens

riskable

zurfer

kevinsync

vessenes

iNic

remram

nonameiguess

albybisy

65

jaygreco

HanClinto

fortran77

rexxars

vzaliva

ComputerGuru

frontporch

Leave a comment Cancel reply

Editor's Choice

Smuggling arbitrary data through an emoji by paulgb

Smuggling arbitrary data through an emoji by paulgb

Share This Article

Newsletter

Some background

Variation selectors

24 Comments

Leave a comment Cancel reply

Editor's Choice

Sign Up to Our Newsletter