Hi HN! I've found this visualization tool immensely helpful over the years for getting an intuition for how an LLM “sees” some piece of text, and with a bit of elbow grease decided to move all compute to client side so I could make it publicly available.
I've found it particularly useful for:
– Understanding exactly how repetition and patterns affect a small LM's ability to predict correctly
– Understanding different tokenization patterns and how they affect model output
– Getting a general sense of how “hard” different prediction tasks are for GPT-style models
Known problems (that I probably won't fix, since this was a kind of one-off project):
– Doesn't work well with Unicode grapheme clusters that are multiple GPT-2 tokens (e.g. emoji, smart quotes)
– No support for models other than GPT-2 (maybe later?)
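For anyone curious why emoji and smart quotes trip things up: GPT-2's tokenizer is a byte-level BPE, so text is first encoded to UTF-8 bytes, and a single on-screen character can span several bytes (and often several tokens). A rough stdlib-only sketch of the byte counts involved:

```python
# GPT-2's byte-level BPE operates on UTF-8 bytes, not characters.
# A single visible character can be several bytes, which is why
# emoji and smart quotes can end up as multiple GPT-2 tokens.
for ch in ["a", "\u201c", "\U0001f600"]:  # ASCII letter, left smart quote, emoji
    print(repr(ch), "->", len(ch.encode("utf-8")), "UTF-8 bytes")
```

(The exact token count depends on the BPE merges, but a multi-byte character can never collapse below whatever merges exist for its byte sequence.)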