This Hacker News comment by GuB-42 intrigued me:
With ZWJ (Zero Width Joiner) sequences you could in theory encode an unlimited amount of data in a single emoji.
Is it really possible to encode arbitrary data in a single emoji?
tl;dr: yes, although I found an approach without ZWJ. In fact, you can encode data in any unicode character. This sentence has a hidden message󠅟󠅘󠄐󠅝󠅩󠄜󠄐󠅩󠅟󠅥󠄐󠅖󠅟󠅥󠅞󠅔󠄐󠅤󠅘󠅕󠄐󠅘󠅙󠅔󠅔󠅕󠅞󠄐󠅝󠅕󠅣󠅣󠅑󠅗󠅕󠄐󠅙󠅞󠄐󠅤󠅘󠅕󠄐󠅤󠅕󠅨󠅤󠄑. (Try pasting it into this decoder)
Some background
Unicode represents text as a sequence of codepoints, each of which is basically just a number that the Unicode Consortium has assigned meaning to.
Usually, a specific codepoint is written as U+XXXX, where XXXX is a number represented as uppercase hexadecimal.
For simple Latin-alphabet text, there is a one-to-one mapping between Unicode codepoints and characters that appear on-screen. For example, U+0067 represents the character g.
For other writing systems, some on-screen characters may be represented by multiple codepoints. The character की (in Devanagari script) is represented by a consecutive pairing of the codepoints U+0915 and U+0940.
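As a quick illustration of the U+XXXX notation, here is a minimal Python sketch that lists the codepoints behind a string. The helper name `codepoints` is made up for this example; the Devanagari character is written with escapes so the two codepoints are unambiguous.

```python
def codepoints(text: str) -> list[str]:
    """Return the U+XXXX notation for every codepoint in `text`."""
    # ord() gives the codepoint number; format it as at least 4 uppercase hex digits.
    return [f"U+{ord(ch):04X}" for ch in text]


print(codepoints("g"))             # ['U+0067']
print(codepoints("\u0915\u0940"))  # ['U+0915', 'U+0940'] (one on-screen character)
```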
Variation selectors
Unicode designates 256 codepoints as “variation selectors”, named VS-1 to VS-256. These have no on-screen representation of their own, but are used to modify the presentation of the preceding character.
Most Unicode characters do not have variations associated with them. Since Unicode is an evolving standard and aims to be future-compatible, variation selectors are supposed to be preserved during transformations, even if their meaning is not known by the code handling them.
So the codepoint U+0067 (“g”) followed by U+FE01 (VS-2) renders as a lowercase “g”, exactly the same as U+0067 alone. But if you copy and paste it, the variation selector will tag along with it.
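To make that concrete, the following Python sketch builds the same two-codepoint string and shows that the selector is really there even though it does not render. The variable names are only for illustration.

```python
base = "g"                 # U+0067 alone
with_selector = "g\ufe01"  # U+0067 followed by U+FE01 (VS-2)

# Both typically render as a plain "g", but they are different strings.
print(len(base))                      # 1
print(len(with_selector))             # 2
print(with_selector == base)          # False
print(with_selector.encode("utf-8"))  # b'g\xef\xb8\x81'
```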
Since 256 is exactly enough variations to represent a single byte, this gives us a way to “hide” one byte of data in any other Unicode codepoint.
As it turns out, the Unicode spec does not specifically say anything about sequences of multiple variation selectors, except to imply that they should be ignored during rendering.
See where I’m going with this?
We can concatenate a sequence of variation selectors together to represent any arbitrary byte string.
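A minimal Python sketch of this idea follows. It assumes the natural mapping of byte n to VS-(n+1): VS-1 to VS-16 live at U+FE00..U+FE0F and VS-17 to VS-256 at U+E0100..U+E01EF. The function names (`byte_to_selector`, `selector_to_byte`, `encode`, `decode`) are hypothetical and chosen for this example only.

```python
from typing import Optional


def byte_to_selector(b: int) -> str:
    """Map a byte (0-255) to a variation selector (assumed: byte n -> VS-(n+1))."""
    if not 0 <= b <= 255:
        raise ValueError("byte out of range")
    if b < 16:
        return chr(0xFE00 + b)        # VS-1 .. VS-16
    return chr(0xE0100 + (b - 16))    # VS-17 .. VS-256


def selector_to_byte(ch: str) -> Optional[int]:
    """Inverse mapping; returns None for anything that is not a variation selector."""
    cp = ord(ch)
    if 0xFE00 <= cp <= 0xFE0F:
        return cp - 0xFE00
    if 0xE0100 <= cp <= 0xE01EF:
        return cp - 0xE0100 + 16
    return None


def encode(base: str, payload: bytes) -> str:
    """Append one variation selector per payload byte to the base character."""
    return base + "".join(byte_to_selector(b) for b in payload)


def decode(text: str) -> bytes:
    """Collect the first consecutive run of variation selectors found in `text`."""
    out = bytearray()
    for ch in text:
        b = selector_to_byte(ch)
        if b is not None:
            out.append(b)
        elif out:
            break  # the first run of selectors has ended
    return bytes(out)


hidden = encode("\U0001F60A", b"hello")  # smiling-face emoji as the base
print(len(hidden))     # 6 codepoints: the emoji plus five selectors
print(decode(hidden))  # b'hello'
```

Stopping at the end of the first run is a design choice for this sketch; a decoder could just as well gather every selector in the text.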
For example, let’s say we want to encode the data [0x6
24 Comments
vladde
test, do emojis work on hn?
󠅤󠅕󠅣󠅤
edit: apparently not
edit 2: oh wait, the bytes are still there! copy-paste this entire message and it decodes to "test"
jerpint
The ability to add watermarks to text is really interesting. Obviously it could be worked around, but it could be a good way to subtly watermark e.g. LLM outputs
nzach
so…. in theory you should be able to create several visually identical links that give access to different resources?
I've always assumed links without any tracking information (unique hash, query params, etc) were safe to click (with regard to my privacy). But if this works for links, I may need to revise my strategy for how I approach links sent to me.
nerder92
Might not be related to the point of the article per se, but I've tried to decode it with different LLMs to benchmark their reasoning capabilities.
– 4o: Failed completely
– o1: Overthinks it for a while and comes up with the wrong answer
– o3-mini-high: Gets closer to the result on the first try, needs a second prompt to adjust the approach
– r1: nails it at first try 󠅖󠅥󠅓󠅛󠅙󠅞󠅗󠄐󠅙󠅝󠅠󠅢󠅕󠅣󠅣󠅙󠅦󠅕
The prompt I've used was simply: "this emoji has an hidden message 󠅘󠅕󠅜󠅜󠅟 can you decode it?"
If you want to see the CoT: https://gist.github.com/nerder/5baa9d7b13c1b7767d022ea0a7c91…
ahofmann
This will break so many (web-)forms :-)
It is not bulletproof though. In this "c󠄸󠅙󠄑󠄐󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅑󠅗󠅕󠅞󠅤󠅣󠄭󠄭󠄠󠄞󠄡󠄢󠄞󠄨󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅑󠅠󠅙󠄭󠄭󠄠󠄞󠄨󠄞󠄡󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅟󠅠󠅕󠅞󠅑󠅙󠄭󠄭󠄠󠄞󠄡󠄠󠄞󠄡󠄥󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅣󠅙󠅜󠅕󠅢󠅟󠄭󠄭󠄠󠄞󠄧󠄞󠄤󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅑󠅪󠅥󠅢󠅕󠄭󠄭󠄠󠄞󠄥󠄞󠄣󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅤󠅥󠅢󠅞󠄝󠅔󠅕󠅤󠅕󠅓󠅤󠅟󠅢󠄭󠄭󠄠󠄞󠄣󠄞󠄦󠅠󠅩󠅤󠅘󠅟󠅞󠄝󠅔󠅟󠅤󠅕󠅞󠅦󠄭󠄭󠄡󠄞󠄠󠄞󠄡󠅢󠅕󠅡󠅥󠅕󠅣󠅤󠅣󠄭󠄭󠄢󠄞󠄣󠄢󠄞󠄣󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅑󠅗󠅕󠅞󠅤󠅣󠄭󠄭󠄠󠄞󠄡󠄢󠄞󠄨󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅑󠅠󠅙󠄭󠄭󠄠󠄞󠄨󠄞󠄡󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅟󠅠󠅕󠅞󠅑󠅙󠄭󠄭󠄠󠄞󠄡󠄠󠄞󠄡󠄥󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅣󠅙󠅜󠅕󠅢󠅟󠄭󠄭󠄠󠄞󠄧󠄞󠄤󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅑󠅪󠅥󠅢󠅕󠄭󠄭󠄠󠄞󠄥󠄞󠄣󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅤󠅥󠅢󠅞󠄝󠅔󠅕󠅤󠅕󠅓󠅤󠅟󠅢󠄭󠄭󠄠󠄞󠄣󠄞󠄦󠅠󠅩󠅤󠅘󠅟󠅞󠄝󠅔󠅟󠅤󠅕󠅞󠅦󠄭󠄭󠄡󠄞󠄠󠄞󠄡󠅢󠅕󠅡󠅥󠅕󠅣󠅤󠅣󠄭󠄭󠄢󠄞󠄣󠄢󠄞󠄣󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅑󠅗󠅕󠅞󠅤󠅣󠄭󠄭󠄠󠄞󠄡󠄢󠄞󠄨󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅑󠅠󠅙󠄭󠄭󠄠󠄞󠄨󠄞󠄡󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅟󠅠󠅕󠅞󠅑󠅙󠄭󠄭󠄠󠄞󠄡󠄠󠄞󠄡󠄥󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅣󠅙󠅜󠅕󠅢󠅟󠄭󠄭󠄠󠄞󠄧󠄞󠄤󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅑󠅪󠅥󠅢󠅕󠄭󠄭󠄠󠄞󠄥󠄞󠄣󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅤󠅥󠅢󠅞󠄝󠅔󠅕󠅤󠅕󠅓󠅤󠅟󠅢󠄭󠄭󠄠󠄞󠄣󠄞󠄦󠅠󠅩󠅤󠅘󠅟󠅞󠄝󠅔󠅟󠅤󠅕󠅞󠅦󠄭󠄭󠄡󠄞󠄠󠄞󠄡󠅢󠅕󠅡󠅥󠅕󠅣󠅤󠅣󠄭󠄭󠄢󠄞󠄣󠄢󠄞󠄣󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅑󠅗󠅕󠅞󠅤󠅣󠄭󠄭󠄠󠄞󠄡󠄢󠄞󠄨󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅑󠅠󠅙󠄭󠄭󠄠󠄞󠄨󠄞󠄡󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅟󠅠󠅕󠅞󠅑󠅙󠄭󠄭󠄠󠄞󠄡󠄠󠄞󠄡󠄥󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅣󠅙󠅜󠅕󠅢󠅟󠄭󠄭󠄠󠄞󠄧󠄞󠄤󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅑󠅪󠅥󠅢󠅕󠄭󠄭󠄠󠄞󠄥󠄞󠄣󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅤󠅥󠅢󠅞󠄝󠅔󠅕󠅤󠅕󠅓󠅤󠅟󠅢󠄭󠄭󠄠󠄞󠄣󠄞󠄦󠅠󠅩󠅤󠅘󠅟󠅞󠄝󠅔󠅟󠅤󠅕󠅞󠅦󠄭󠄭󠄡󠄞󠄠󠄞󠄡󠅢󠅕󠅡󠅥󠅕󠅣󠅤󠅣󠄭󠄭󠄢󠄞󠄣󠄢󠄞󠄣󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅑󠅗󠅕󠅞󠅤󠅣󠄭󠄭󠄠󠄞󠄡󠄢󠄞󠄨󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅑󠅠󠅙󠄭󠄭󠄠󠄞󠄨󠄞󠄡󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅟󠅠󠅕󠅞󠅑󠅙󠄭󠄭󠄠󠄞󠄡󠄠󠄞󠄡󠄥󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅣󠅙󠅜󠅕󠅢󠅟󠄭󠄭󠄠󠄞󠄧󠄞󠄤󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅑󠅪󠅥󠅢󠅕󠄭󠄭󠄠󠄞󠄥󠄞󠄣󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅤󠅥󠅢󠅞󠄝󠅔󠅕󠅤󠅕󠅓󠅤󠅟󠅢󠄭󠄭󠄠󠄞󠄣󠄞󠄦󠅠󠅩󠅤󠅘󠅟󠅞󠄝󠅔󠅟󠅤󠅕󠅞󠅦󠄭󠄭󠄡󠄞󠄠󠄞󠄡󠅢󠅕󠅡󠅥󠅕󠅣󠅤󠅣󠄭󠄭󠄢󠄞󠄣󠄢󠄞󠄣󠅣󠅖󠅣󠅖󠅣󠅕󠅖󠅗󠅣󠅢󠅗󠄐󠅣󠅢󠅗󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅑󠅗󠅕󠅞󠅤󠅣󠄭󠄭󠄠󠄞󠄡󠄢󠄞󠄨󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅑󠅠󠅙󠄭󠄭󠄠󠄞󠄨󠄞󠄡󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅟󠅠󠅕󠅞󠅑󠅙󠄭󠄭󠄠󠄞󠄡󠄠󠄞󠄡󠄥󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅣󠅙󠅜󠅕󠅢󠅟󠄭󠄭󠄠󠄞󠄧󠄞󠄤󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅑󠅪󠅥󠅢󠅕󠄭󠄭󠄠󠄞󠄥󠄞󠄣󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅤󠅥󠅢󠅞󠄝󠅔󠅕󠅤󠅕󠅓󠅤󠅟󠅢󠄭󠄭󠄠󠄞󠄣󠄞󠄦󠅠󠅩󠅤󠅘󠅟󠅞󠄝󠅔󠅟󠅤󠅕󠅞󠅦󠄭󠄭󠄡󠄞󠄠󠄞󠄡󠅢󠅕󠅡󠅥󠅕󠅣󠅤󠅣󠄭󠄭󠄢󠄞󠄣󠄢󠄞󠄣󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅑󠅗󠅕󠅞󠅤󠅣󠄭󠄭󠄠󠄞󠄡󠄢󠄞󠄨󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅑󠅠󠅙󠄭󠄭󠄠󠄞󠄨󠄞󠄡󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅟󠅠󠅕󠅞󠅑󠅙󠄭󠄭󠄠󠄞󠄡󠄠󠄞󠄡󠄥󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅣󠅙󠅜󠅕󠅢󠅟󠄭󠄭󠄠󠄞󠄧󠄞󠄤󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅑󠅪󠅥󠅢󠅕󠄭󠄭󠄠󠄞󠄥󠄞󠄣󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅤󠅥󠅢󠅞󠄝󠅔󠅕󠅤󠅕󠅓󠅤󠅟󠅢󠄭󠄭󠄠󠄞󠄣󠄞󠄦󠅠󠅩󠅤󠅘󠅟󠅞󠄝󠅔󠅟󠅤󠅕󠅞󠅦󠄭󠄭󠄡󠄞󠄠󠄞󠄡󠅢󠅕󠅡󠅥󠅕󠅣󠅤󠅣󠄭󠄭󠄢󠄞󠄣󠄢󠄞󠄣󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅑󠅗󠅕󠅞󠅤󠅣󠄭󠄭󠄠󠄞󠄡󠄢󠄞󠄨󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅑󠅠󠅙󠄭󠄭󠄠󠄞󠄨󠄞󠄡󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅟󠅠󠅕󠅞󠅑󠅙󠄭󠄭󠄠󠄞󠄡󠄠󠄞󠄡󠄥󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅣󠅙󠅜󠅕󠅢󠅟󠄭󠄭󠄠󠄞󠄧󠄞󠄤󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅑󠅪󠅥󠅢󠅕󠄭󠄭󠄠󠄞󠄥󠄞󠄣󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅤󠅥󠅢󠅞󠄝󠅔󠅕󠅤󠅕󠅓󠅤󠅟󠅢󠄭󠄭󠄠󠄞󠄣󠄞󠄦󠅠󠅩󠅤󠅘󠅟󠅞󠄝󠅔󠅟󠅤󠅕󠅞󠅦󠄭󠄭󠄡󠄞󠄠󠄞󠄡󠅢󠅕󠅡󠅥󠅕󠅣󠅤󠅣󠄭󠄭󠄢󠄞󠄣󠄢󠄞󠄣󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅑󠅗󠅕󠅞󠅤󠅣󠄭󠄭󠄠󠄞󠄡󠄢󠄞󠄨󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅑󠅠󠅙󠄭󠄭󠄠󠄞󠄨󠄞󠄡󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅟󠅠󠅕󠅞󠅑󠅙󠄭󠄭󠄠󠄞󠄡󠄠󠄞󠄡󠄥󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅣󠅙󠅜󠅕󠅢󠅟󠄭󠄭󠄠󠄞󠄧󠄞󠄤󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅑󠅪󠅥󠅢󠅕󠄭󠄭󠄠󠄞󠄥󠄞󠄣󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅤󠅥󠅢󠅞󠄝󠅔󠅕󠅤󠅕󠅓󠅤󠅟󠅢󠄭󠄭󠄠󠄞󠄣󠄞󠄦󠅠󠅩󠅤󠅘󠅟󠅞󠄝󠅔󠅟󠅤󠅕󠅞󠅦󠄭󠄭󠄡󠄞󠄠󠄞󠄡󠅢󠅕󠅡󠅥󠅕󠅣󠅤󠅣󠄭󠄭󠄢󠄞󠄣󠄢󠄞󠄣󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅑󠅗󠅕󠅞󠅤󠅣󠄭󠄭󠄠󠄞󠄡󠄢󠄞󠄨󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅑󠅠󠅙󠄭󠄭󠄠󠄞󠄨󠄞󠄡󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅟󠅠󠅕󠅞󠅑󠅙󠄭󠄭󠄠󠄞󠄡󠄠󠄞󠄡󠄥󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅣󠅙󠅜󠅕󠅢󠅟󠄭󠄭󠄠󠄞󠄧󠄞󠄤󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅑󠅪󠅥󠅢󠅕󠄭󠄭󠄠󠄞󠄥󠄞󠄣󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙
󠅞󠅣󠄝󠅤󠅥󠅢󠅞󠄝󠅔󠅕󠅤󠅕󠅓󠅤󠅟󠅢󠄭󠄭󠄠󠄞󠄣󠄞󠄦󠅠󠅩󠅤󠅘󠅟󠅞󠄝󠅔󠅟󠅤󠅕󠅞󠅦󠄭󠄭󠄡󠄞󠄠󠄞󠄡󠅢󠅕󠅡󠅥󠅕󠅣󠅤󠅣󠄭󠄭󠄢󠄞󠄣󠄢󠄞󠄣󠅦󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅑󠅗󠅕󠅞󠅤󠅣󠄭󠄭󠄠󠄞󠄡󠄢󠄞󠄨󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅑󠅠󠅙󠄭󠄭󠄠󠄞󠄨󠄞󠄡󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅟󠅠󠅕󠅞󠅑󠅙󠄭󠄭󠄠󠄞󠄡󠄠󠄞󠄡󠄥󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅣󠅙󠅜󠅕󠅢󠅟󠄭󠄭󠄠󠄞󠄧󠄞󠄤󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅑󠅪󠅥󠅢󠅕󠄭󠄭󠄠󠄞󠄥󠄞󠄣󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅤󠅥󠅢󠅞󠄝󠅔󠅕󠅤󠅕󠅓󠅤󠅟󠅢󠄭󠄭󠄠󠄞󠄣󠄞󠄦󠅠󠅩󠅤󠅘󠅟󠅞󠄝󠅔󠅟󠅤󠅕󠅞󠅦󠄭󠄭󠄡󠄞󠄠󠄞󠄡󠅢󠅕󠅡󠅥󠅕󠅣󠅤󠅣󠄭󠄭󠄢󠄞󠄣󠄢󠄞󠄣󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅑󠅗󠅕󠅞󠅤󠅣󠄭󠄭󠄠󠄞󠄡󠄢󠄞󠄨󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅑󠅠󠅙󠄭󠄭󠄠󠄞󠄨󠄞󠄡󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅟󠅠󠅕󠅞󠅑󠅙󠄭󠄭󠄠󠄞󠄡󠄠󠄞󠄡󠄥󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅣󠅙󠅜󠅕󠅢󠅟󠄭󠄭󠄠󠄞󠄧󠄞󠄤󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅑󠅪󠅥󠅢󠅕󠄭󠄭󠄠󠄞󠄥󠄞󠄣󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅤󠅥󠅢󠅞󠄝󠅔󠅕󠅤󠅕󠅓󠅤󠅟󠅢󠄭󠄭󠄠󠄞󠄣󠄞󠄦󠅠󠅩󠅤󠅘󠅟󠅞󠄝󠅔󠅟󠅤󠅕󠅞󠅦󠄭󠄭󠄡󠄞󠄠󠄞󠄡󠅢󠅕󠅡󠅥󠅕󠅣󠅤󠅣󠄭󠄭󠄢󠄞󠄣󠄢󠄞󠄣󠅦󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅑󠅗󠅕󠅞󠅤󠅣󠄭󠄭󠄠󠄞󠄡󠄢󠄞󠄨󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅑󠅠󠅙󠄭󠄭󠄠󠄞󠄨󠄞󠄡󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅟󠅠󠅕󠅞󠅑󠅙󠄭󠄭󠄠󠄞󠄡󠄠󠄞󠄡󠄥󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅣󠅙󠅜󠅕󠅢󠅟󠄭󠄭󠄠󠄞󠄧󠄞󠄤󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅑󠅪󠅥󠅢󠅕󠄭󠄭󠄠󠄞󠄥󠄞󠄣󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅤󠅥󠅢󠅞󠄝󠅔󠅕󠅤󠅕󠅓󠅤󠅟󠅢󠄭󠄭󠄠󠄞󠄣󠄞󠄦󠅠󠅩󠅤󠅘󠅟󠅞󠄝󠅔󠅟󠅤󠅕󠅞󠅦󠄭󠄭󠄡󠄞󠄠󠄞󠄡󠅢󠅕󠅡󠅥󠅕󠅣󠅤󠅣󠄭󠄭󠄢󠄞󠄣󠄢󠄞󠄣󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅑󠅗󠅕󠅞󠅤󠅣󠄭󠄭󠄠󠄞󠄡󠄢󠄞󠄨󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅑󠅠󠅙󠄭󠄭󠄠󠄞󠄨󠄞󠄡󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅟󠅠󠅕󠅞󠅑󠅙󠄭󠄭󠄠󠄞󠄡󠄠󠄞󠄡󠄥󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅣󠅙󠅜󠅕󠅢󠅟󠄭󠄭󠄠󠄞󠄧󠄞󠄤󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅑󠅪󠅥󠅢󠅕󠄭󠄭󠄠󠄞󠄥󠄞󠄣󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅤󠅥󠅢󠅞󠄝󠅔󠅕󠅤󠅕󠅓󠅤󠅟󠅢󠄭󠄭󠄠󠄞󠄣󠄞󠄦󠅠󠅩󠅤󠅘󠅟󠅞󠄝󠅔󠅟󠅤󠅕󠅞󠅦󠄭󠄭󠄡󠄞󠄠󠄞󠄡󠅢󠅕󠅡󠅥󠅕󠅣󠅤󠅣󠄭󠄭󠄢󠄞󠄣󠄢󠄞󠄣󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅑󠅗󠅕󠅞󠅤󠅣󠄭󠄭󠄠󠄞󠄡󠄢󠄞󠄨󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅑󠅠󠅙󠄭󠄭󠄠󠄞󠄨󠄞󠄡󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅟󠅠󠅕󠅞󠅑󠅙󠄭󠄭󠄠󠄞󠄡󠄠󠄞󠄡󠄥󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅣󠅙󠅜󠅕󠅢󠅟󠄭󠄭󠄠󠄞󠄧󠄞󠄤󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅑󠅪󠅥󠅢󠅕󠄭󠄭󠄠󠄞󠄥󠄞󠄣󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅤󠅥󠅢󠅞󠄝󠅔󠅕󠅤󠅕󠅓󠅤󠅟󠅢󠄭󠄭󠄠󠄞󠄣󠄞󠄦󠅠󠅩󠅤󠅘󠅟󠅞󠄝󠅔󠅟󠅤󠅕󠅞󠅦󠄭󠄭󠄡󠄞󠄠󠄞󠄡󠅢󠅕󠅡󠅥󠅕󠅣󠅤󠅣󠄭󠄭󠄢󠄞󠄣󠄢󠄞󠄣󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅑󠅗󠅕󠅞󠅤󠅣󠄭󠄭󠄠󠄞󠄡󠄢󠄞󠄨󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅑󠅠󠅙󠄭󠄭󠄠󠄞󠄨󠄞󠄡󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅟󠅠󠅕󠅞󠅑󠅙󠄭󠄭󠄠󠄞󠄡󠄠󠄞󠄡󠄥󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅣󠅙󠅜󠅕󠅢󠅟󠄭󠄭󠄠󠄞󠄧󠄞󠄤󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅑󠅪󠅥󠅢󠅕󠄭󠄭󠄠󠄞󠄥󠄞󠄣󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅤󠅥󠅢󠅞󠄝󠅔󠅕󠅤󠅕󠅓󠅤󠅟󠅢󠄭󠄭󠄠󠄞󠄣󠄞󠄦󠅠󠅩󠅤󠅘󠅟󠅞󠄝󠅔󠅟󠅤󠅕󠅞󠅦󠄭󠄭󠄡󠄞󠄠󠄞󠄡󠅢󠅕󠅡󠅥󠅕󠅣󠅤󠅣󠄭󠄭󠄢󠄞󠄣󠄢󠄞󠄣󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅑󠅗󠅕󠅞󠅤󠅣󠄭󠄭󠄠󠄞󠄡󠄢󠄞󠄨󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅑󠅠󠅙󠄭󠄭󠄠󠄞󠄨󠄞󠄡󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅟󠅠󠅕󠅞󠅑󠅙󠄭󠄭󠄠󠄞󠄡󠄠󠄞󠄡󠄥󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅣󠅙󠅜󠅕󠅢󠅟󠄭󠄭󠄠󠄞󠄧󠄞󠄤󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅑󠅪󠅥󠅢󠅕󠄭󠄭󠄠󠄞󠄥󠄞󠄣󠅜󠅙󠅦󠅕󠅛󠅙󠅤󠄝󠅠󠅜󠅥󠅗󠅙󠅞󠅣󠄝󠅤󠅥󠅢󠅞󠄝󠅔󠅕󠅤󠅕󠅓󠅤󠅟󠅢󠄭󠄭󠄠󠄞󠄣󠄞󠄦󠅠󠅩󠅤󠅘󠅟󠅞󠄝󠅔󠅟󠅤󠅕󠅞󠅦󠄭󠄭󠄡󠄞󠄠󠄞󠄡󠅢󠅕󠅡󠅥󠅕󠅣󠅤󠅣󠄭󠄭󠄢󠄞󠄣󠄢󠄞󠄣 " and that space, are about 3500 characters. Copying only the "c" above (not this one) will keep some of the hidden text, but not all. Nevertheless, while I knew that this is possible, it still breaks a lot of assumptions around text.
Edit: the text field for editing this post is so large that I need to scroll down to the update button. This will be a fun toy for creating very hard-to-find bugs in many tools.
FranchuFranchu
You could store UTF-8 encoded data inside the hidden bytestring. If some of the UTF-8 encoded smuggled characters are variation selector characters, you can smuggle text inside the smuggled text. Smuggled data can be nested arbitrarily deep.
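A rough Python sketch of that nesting, assuming the mapping of byte n to VS-(n+1) (U+FE00..U+FE0F for bytes 0-15, U+E0100..U+E01EF for bytes 16-255). The helper names `to_selectors` and `from_selectors` are hypothetical.

```python
def to_selectors(payload: bytes) -> str:
    """Turn each byte into one variation selector (assumed mapping, see above)."""
    return "".join(
        chr(0xFE00 + b) if b < 16 else chr(0xE0100 + b - 16) for b in payload
    )


def from_selectors(text: str) -> bytes:
    """Recover the bytes from every variation selector in `text`."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if 0xFE00 <= cp <= 0xFE0F:
            out.append(cp - 0xFE00)
        elif 0xE0100 <= cp <= 0xE01EF:
            out.append(cp - 0xE0100 + 16)
    return bytes(out)


# Level 1: hide "secret" behind the letter "a".
inner = "a" + to_selectors("secret".encode("utf-8"))
# Level 2: hide the whole level-1 string (UTF-8 encoded) behind the letter "b".
outer = "b" + to_selectors(inner.encode("utf-8"))

recovered_inner = from_selectors(outer).decode("utf-8")
recovered_secret = from_selectors(recovered_inner).decode("utf-8")
print(recovered_secret)  # secret
```

Each extra level grows the payload roughly fourfold, since a supplementary-plane selector takes four bytes in UTF-8.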
petee
It's fun that you can encode encoded emoji into a new one
HeikoBehrens
FWIW, we considered this technique back at Pebble to make notifications more actionable and even filed a patent for that (sorry!) https://patents.justia.com/patent/9411785
Back then on iOS via ANCS, the watches wouldn't receive much more than the textual payload you'd see on the phone. We envisioned working with partners such as WhatsApp et al. to encode deep links/message ids into the message so one could respond directly from the watch.
riskable
Oh this is just the tip of the iceberg when it comes to abusing Unicode! You can use a similar technique to this to overflow the buffer on loads of systems that accept Unicode strings. Normally it just produces an error and/or a crash but sometimes you get lucky and it'll do all sorts of fun things! :)
I remember doing penetration testing waaaaaay back in the day (before Python 3 existed) and using mere diacritics to turn a single character into many bytes that would then overflow the buffer of a back-end web server. This only ever caused it to crash (and usually auto-restart) but I could definitely see how this could be used to exploit certain systems/software with enough fiddling.
zurfer
h󠅘󠅕󠅜󠅜󠅟󠄐󠅖󠅕󠅜󠅜󠅟󠅧󠄐󠅘󠅑󠅓󠅛󠅕󠅢󠄐󠄪󠄙a
kevinsync
StegCloak [0] is in the same ballpark and takes this idea a step further by encrypting the hidden payload via AES-256-CTR — pretty neat little trick
[0] https://github.com/KuroLabs/stegcloak
vessenes
I love the idea of using this for LLM output watermarking. It hits the sweet spot – will catch 99% of slop generators with no fuss, since they only copy and paste anyway, almost no impact on other core use cases.
I wonder how much you’d embed with each letter or token that’s output – userid, prompt ref, date, token number?
I also wonder how this is interpreted in a terminal. Really cool!
iNic
The tokenizer catches it: https://platform.openai.com/tokenizer.
remram
I'm not too surprised by this, but I'm annoyed that no amount of configuration made those bytes visible again in my editor. Only using hexdump revealed them.
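For anyone who would rather not reach for hexdump, a small Python sketch like the one below can make the selectors visible. The function name `reveal` and the `<VS-n>` tag format are made up for this example; the codepoint ranges are the two variation selector blocks (U+FE00..U+FE0F and U+E0100..U+E01EF).

```python
def reveal(text: str) -> str:
    """Replace each variation selector in `text` with a visible <VS-n> tag."""
    parts = []
    for ch in text:
        cp = ord(ch)
        if 0xFE00 <= cp <= 0xFE0F:
            parts.append(f"<VS-{cp - 0xFE00 + 1}>")    # VS-1 .. VS-16
        elif 0xE0100 <= cp <= 0xE01EF:
            parts.append(f"<VS-{cp - 0xE0100 + 17}>")  # VS-17 .. VS-256
        else:
            parts.append(ch)
    return "".join(parts)


print(reveal("g\ufe01"))  # g<VS-2>
```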
nonameiguess
More generally, you can use encoding formats that reserve uninterpreted byte sequences for future use to pass data that is only readable by receivers who know what you're doing, though note this is not a cryptographically secure scheme and any sort of statistical analysis can reveal what you're doing.
The png spec, for instance, allows you to include as many metadata chunks as you wish, and these may be used to hold data that cannot be used by any mainstream png reader. We used this in the Navy to embed geolocation and sensor origin data that was readable by specialized viewers that only the Navy had, but if you opened the file in a browser or common image viewer, it would either ignore or discard the unknown chunks.
albybisy
wow󠄹󠄐󠅑󠅝󠄐󠅞󠅟󠅤󠄐󠅃󠅑󠅤󠅟󠅣󠅘󠅙󠄐󠄾󠅑󠅛󠅑󠅝󠅟󠅤󠅟!
65
This would be useful as a fingerprinting technique for corporate/government leakers.
jaygreco
Interestingly, it’s also possible to encode _emoji_ inside emoji!
HanClinto
Even more than just simply watermarking LLM output, it seems like this could be a neat way to package logprobs data.
Basically, include probability information about every token generated to give a bit of transparency to the generation process. It's part of the OpenAI api spec, and many other engines (such as llama.cpp) support providing this information. Normally it's attached as a separate field, but there are neat ways to visualize it (such as mikupad [0]).
Probably a bad idea, but this still tickles my brain.
* [0]: https://github.com/lmg-anon/mikupad
fortran77
What's interesting is that even a "view source" shows nothing amiss, and if I do a copy/paste from the debug inspector view of "This sentence has a hidden message󠅟󠅘󠄐󠅝󠅩󠄜󠄐󠅩󠅟󠅥󠄐󠅖󠅟󠅥󠅞󠅔󠄐󠅤󠅘󠅕󠄐󠅘󠅙󠅔󠅔󠅕󠅞󠄐󠅝󠅕󠅣󠅣󠅑󠅗󠅕󠄐󠅙󠅞󠄐󠅤󠅘󠅕󠄐󠅤󠅕󠅨󠅤󠄑." it still shows up….
rexxars
For a real-world use case: Sanity used this trick[0] to encode Content Source Maps[1] into the actual text served on a webpage when it is in "preview mode". This allows an editor to easily trace some piece of content back to a potentially deep content structure just by clicking on the text/content in question.
It has its drawbacks/limitations – e.g. you want to prevent adding it for things that need to be parsed/used verbatim, like date/timestamps, urls, "ids" etc – but it's still a pretty fun trick.
[0] https://www.sanity.io/docs/stega
[1] https://github.com/sanity-io/content-source-maps
vzaliva
The title is a little misleading: "Note that the base character does not need to be an emoji – the treatment of variation selectors is the same with regular characters. It’s just more fun with emoji."
Using this approach with non-emoji characters makes it stealthier and even more disturbing.
ComputerGuru
This is cute but unnecessary – Unicode includes a massive range called PUA: the private use area. The codes in this range aren’t mapped to anything (and won’t be mapped to anything) and are for internal/custom use, not to be passed to external systems (for example, we use them in fish-shell to safely parse tokens into a string, turning an unescaped special character into just another Unicode code point in the string, but in the PUA area, then intercept that later in the pipeline).
You’re not supposed to expose them outside your api boundary but when you encounter them you are prescribed to pass them through as-is, and that’s what most systems and libraries do. It’s a clear potential exfiltration avenue, but given that most sane developers don’t know much more about Unicode other than “always use Unicode to avoid internationalization issues”, it’s often left wide open.
frontporch
you dont need 256 codepoints so you can neatly represent an octet (whatever that is), you just need 2 bits. you can just stack as many diacritical marks you want on any glyph. either the renderer allows practically unlimited or it allows 1/none. in either case that's a vuln. what would be really earth shattering is what i was hoping this article was: a way to just embed "; rm -rf ~/" into text without it being rendered. you also definitely dont need rust for this unless you want to exclude 90% of the programmer population.