Dummy’s Guide to Modern LLM Sampling by nkko

11 Comments

  • antonvs
    Posted May 4, 2025 at 5:11 pm

    This is great! “Sampling” covers much more than I expected.

  • blt
    Posted May 4, 2025 at 5:27 pm

    This is pretty interesting. I didn't realize so much manipulation was happening after the initial softmax temperature choice.

  • Der_Einzige
    Posted May 4, 2025 at 5:35 pm

    Related to this, our min_p paper was ranked #18 out of 12,000 submissions at ICLR and got an oral:

    https://iclr.cc/virtual/2025/oral/31888

    Our poster was popular:

    poster: https://iclr.cc/media/PosterPDFs/ICLR%202025/30358.png?t=174…

    oral presentation (watch me roast Yoshua Bengio on this topic and then have him be the first questioner; I'm the 2nd speaker, starting around the 19:30 mark. My slides for the presentation are there too and are really funny): https://iclr.cc/virtual/2025/session/31936

    paper: https://arxiv.org/abs/2407.01082

    As one of the min_p authors, I can confirm that top-n-sigma is currently the best general-purpose sampler by far. Also, temperature can and should be scaled far higher than it is today. Temps of 100 are totally fine with techniques like min_p and top-n-sigma.
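    Roughly, in NumPy terms (a minimal sketch; the function names and the parameter values p_base=0.1 and n=1.0 are illustrative, not tuned recommendations):

        import numpy as np

        def softmax(x):
            e = np.exp(x - x.max())
            return e / e.sum()

        def min_p(logits, p_base=0.1, temperature=1.0):
            # Keep tokens whose probability is at least p_base times the
            # top token's probability. The cutoff scales with the model's
            # confidence, which is why extreme temperatures stay coherent.
            probs = softmax(logits / temperature)
            probs = np.where(probs >= p_base * probs.max(), probs, 0.0)
            return probs / probs.sum()

        def top_n_sigma(logits, n=1.0, temperature=1.0):
            # Keep tokens whose raw (pre-temperature) logit is within n
            # standard deviations of the max logit. The surviving set is
            # independent of temperature, so even temperature 100 only
            # flattens the distribution over already-plausible tokens.
            mask = logits >= logits.max() - n * logits.std()
            return softmax(np.where(mask, logits / temperature, -np.inf))

        rng = np.random.default_rng(0)
        logits = rng.normal(size=32_000)   # stand-in for real model logits
        token = rng.choice(32_000, p=top_n_sigma(logits, temperature=100.0))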

    Also, the special case of top_k = 2 with ultra-high temperature (one thing the authors recommend against near the end) is very interesting in its own right. Doing it leads to spelling errors every ~10th word – but it also seems to have a certain creativity to it.
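    A sketch of that special case (a toy demonstration, not a recommendation from the guide): with only the top two logits kept, a huge temperature makes them nearly equiprobable, so every step is close to a coin flip between the model's first and second choice.

        import numpy as np

        def sample_top2_hot(logits, temperature=100.0,
                            rng=np.random.default_rng()):
            # Keep only the two highest logits. At temperature 100 their
            # scaled difference is tiny, so the runner-up wins almost half
            # the time -- hence the occasional misspelling, and the
            # constant forks away from the most likely continuation.
            top2 = np.argsort(logits)[-2:]
            scaled = logits[top2] / temperature
            p = np.exp(scaled - scaled.max())
            return rng.choice(top2, p=p / p.sum())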

  • orbital-decay
    Posted May 4, 2025 at 5:37 pm

    One thing not said here is that samplers have no access to the model's internal state. Sampling is basic math applied to the output distribution, which technically carries some semantics, but you can't decode it without being as smart as the model itself.

    Certain samplers described here, like repetition penalty or DRY, are just like this: the model could repeat itself in a myriad of ways, and the only way to prevent all of them is better training, not n-gram search or other classic NLP methods. This is basically trying to plug every hole with a finger. How many fingers do you have?
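    To make that concrete, here's a minimal sketch of the classic multiplicative repetition penalty (in the style of CTRL; the penalty value is a common default, not canonical). It only ever sees token ids, so a paraphrased repetition sails through untouched:

        import numpy as np

        def repetition_penalty(logits, generated_ids, penalty=1.2):
            # Penalize every token id that has already been generated by
            # dividing positive logits and multiplying negative ones.
            # Pure surface-form bookkeeping: a synonym or rephrasing has
            # a different id and is never touched.
            out = logits.copy()
            for tid in set(generated_ids):
                out[tid] = out[tid] / penalty if out[tid] > 0 else out[tid] * penalty
            return out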

    Hacking the autoregressive process has some low-hanging fruit, like Min-P, that can yield improvements and make certain nifty tricks possible, but if you're doing it to turn a bad model into a good one, you're doing it wrong.

  • mdp2021
    Posted May 4, 2025 at 6:03 pm

    When the attempt, though, is to have the LLM output an "idea", not just a "next token", selection over the logits vector should break that original idea… If the idea is complete, there should be no need to use sampling over the logits.

    The sampling, in this framework, should not happen near the output level ("what will the next spoken word be").

  • jlcases
    Posted May 4, 2025 at 6:18 pm

    [dead]

  • neuroelectron
    Posted May 4, 2025 at 6:34 pm

    Love this. The way everything is mapped out and explained simply really opens up opportunities for trying new things, and shows where you can do that effectively.

    For instance, why not use whole words as tokens? Make a "robot" with a limited "robot dialect." Yes, there'd be no capacity for new or rare words, but you could modify the training data and input data to translate those words into the existing vocabulary. Now you have a much smaller mapping that's literally robot-like, and it gives the user an expectation of what kinds of questions the robot can answer well, like C-3PO.
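    As a toy illustration of the idea (everything here is hypothetical): a closed word-level vocabulary plus a translation table that rewrites out-of-vocabulary words into the robot dialect before they reach the model.

        # Hypothetical "robot dialect": whole words as tokens, with rare
        # or unknown words translated into the closed vocabulary up front.
        VOCAB = {"<unk>": 0, "yes": 1, "no": 2, "the": 3, "door": 4,
                 "is": 5, "open": 6, "closed": 7}
        TRANSLATE = {"affirmative": "yes", "negative": "no", "ajar": "open"}

        def encode(text: str) -> list[int]:
            ids = []
            for word in text.lower().split():
                word = TRANSLATE.get(word, word)   # rare word -> dialect
                ids.append(VOCAB.get(word, VOCAB["<unk>"]))
            return ids

        print(encode("Affirmative the door is ajar"))   # [1, 3, 4, 5, 6]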

  • simonw
    Posted May 4, 2025 at 8:02 pm

    This is a really useful document – the explanations are very clear and it covers a lot of ground.

    Anyone know who wrote it? It's not credited, and it's published on a free Markdown pastebin.

    The section on DRY – "repetition penalties" – was interesting to me. I often want LLMs to deliberately output exact copies of their input. When summarizing a long conversation, for example, I tend to ask for exact quotes that are most illustrative of the points being made. These are easy to fact-check later by searching for them in the source material.

    It seems to me that the DRY penalty would run counter to my goal there.
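    A rough sketch of DRY-style matching (the parameter values are common defaults, not canonical) shows why: to the sampler, an exact quote from the prompt is indistinguishable from unwanted repetition, and the penalty grows with the length of the match.

        import numpy as np

        def dry_penalty(logits, ctx, multiplier=0.8, base=1.75,
                        allowed_len=2):
            # For each earlier position, measure how long a suffix of the
            # context matches the text ending there, then penalize the
            # token that continued the earlier occurrence. A verbatim
            # quote is just a long repeated n-gram to this logic, so it
            # is penalized hardest.
            out = logits.copy()
            n = len(ctx)
            for i in range(1, n):          # ctx[i] once followed ctx[:i]
                k = 0
                while k < i and k < n - 1 and ctx[i - 1 - k] == ctx[n - 1 - k]:
                    k += 1                 # length of the matching suffix
                if k >= allowed_len:
                    out[ctx[i]] -= multiplier * base ** (k - allowed_len)
            return out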

  • tomhowls
    Posted May 4, 2025 at 8:51 pm

    [flagged]

  • smcleod
    Posted May 4, 2025 at 9:05 pm

    I had a go at writing a bit of a sampling guide for Ollama/llama.cpp as well recently, open to any feedback / corrections – https://smcleod.net/2025/04/comprehensive-guide-to-llm-sampl…

  • ltbarcly3
    Posted May 4, 2025 at 9:19 pm

    Calling things "modern" when they're updates to techniques for technologies invented only a few years ago is borderline illiterate. Modern as opposed to what, classical LLM sampling?
