
Dummy’s Guide to Modern LLM Sampling by nkko
Intro Knowledge
Large Language Models (LLMs) work by taking a piece of text (e.g. a user prompt) and calculating the next word, or, in more technical terms, the next token. LLMs have a vocabulary, or dictionary, of valid tokens, and reference it during both training and inference (the process of generating text). More on that below; first, you need to understand why we use tokens (sub-words) instead of words or letters. Before that, here is a short glossary of technical terms that aren’t explained in depth in the sections below:
Short Glossary
Logits: The raw, unnormalized scores output by the model for each token in its vocabulary. Higher logits indicate tokens the model considers more likely to come next.
Softmax: A mathematical function that converts logits into a proper probability distribution – values between 0 and 1 that sum to 1.
Entropy: A measure of uncertainty or randomness in a probability distribution. Higher entropy means the model is less certain about which token should come next.
Perplexity: Related to entropy, perplexity measures how “surprised” the model is by the text. Lower perplexity indicates higher confidence.
n-gram: A contiguous sequence of n tokens. For example, “once upon a” is a 3-gram.
Context window (or sequence length): The maximum number of tokens an LLM can process at once, including both the prompt and generated output.
Probability distribution: A function that assigns probabilities to all possible outcomes (tokens) such that they sum to 1. Think of it like percentages: 1% is 0.01, 50% is 0.5, and 100% is 1.0.
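For reference, the softmax function mentioned above maps a vector of logits z = (z_1, …, z_n) to a probability distribution:

$$P(i) = \mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{n} e^{z_j}}$$

Each P(i) lies between 0 and 1 and all of them sum to 1, which is exactly the probability-distribution property described above.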
Why tokens?
Your first instinct might be to use a vocabulary of words or letters for an LLM. Instead, we use sub-words: some common words are preserved whole in the vocabulary (e.g. `the` or `apple` might each be a single token because of how common they are in English), while others are fragmented into common sub-words (e.g. `bi-fur-cat-ion`).
Why is this? There are several very good reasons:
Why not letters?
Many reasons. To name a few: LLMs have a limited context window (the number of tokens they can process at once). With character-level tokenization, even a moderate amount of text would lead to sequence length explosion (too many tokens for too little text). The word `tokenization` would be 12 tokens instead of, for example, 2 or 3 in a sub-word system. Longer sequences also require much more computation for self-attention. More importantly, the model would need to learn higher-level patterns spanning far more positions: understanding that `t-h-e` represents a single concept requires connecting information across three positions instead of one. It also pushes meaningful relationships further apart; related concepts might be dozens or hundreds of positions away from each other.
Why not whole words?
A pure word-level tokenization would require a vocabulary spanning every possible word in the English language, and if we’re handling multiple languages, many times that. This would make our embedding matrix unreasonably large and expensive. It would also struggle with new or rare words: when a model encounters a word not in its vocabulary, it typically replaces it with an “unknown” token, losing virtually all semantic information. Sub-word tokenization instead represents new words by combining existing sub-word tokens. For example, if we invent a new word called “grompuficious”, a sub-word tokenizer might represent it as `g-romp-u-ficious`, depending on the tokenizer.
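As a quick illustration (a sketch, not part of the original guide), here is how you could inspect the sub-word pieces a real BPE tokenizer produces for an unseen word, using OpenAI’s `tiktoken` library; the exact split depends entirely on the tokenizer and its vocabulary:

```python
import tiktoken  # pip install tiktoken

# Load a byte-pair-encoding tokenizer; any BPE tokenizer behaves similarly.
enc = tiktoken.get_encoding("cl100k_base")

word = "grompuficious"        # a made-up word, so it cannot be a single whole-word token
token_ids = enc.encode(word)  # list of integer token ids

# Decode each id individually to see which sub-word pieces were used.
pieces = [enc.decode([tid]) for tid in token_ids]
print(token_ids)  # several ids rather than one
print(pieces)     # the sub-word fragments the word was split into
```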
Another thing to mention is morphological awareness: many languages create words by combining morphemes (prefixes, roots, suffixes). For example, `unhappiness` can be broken into `un-happi-ness`; sub-word tokenization can naturally capture these relationships. It also allows cross-lingual transfer: for languages with complex morphology or compounding (e.g. German or Finnish, where words can be extremely long combinations of morphemes), sub-word tokenization handles long words by splitting them into pieces it already knows.
How are the sub-words chosen?
If a language model uses a new tokenizer, the development team may take a representative sample of their training data and train a tokenizer on it to find the most common sub-words in the dataset. They set a vocabulary size beforehand, and the tokenizer then finds enough sub-words to fill up that list.
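A minimal sketch of the idea behind byte-pair encoding (BPE), one common way such tokenizers are trained. This toy version repeatedly merges the most frequent adjacent pair of symbols until a target vocabulary size is reached; real tokenizer trainers (e.g. Hugging Face `tokenizers` or SentencePiece) are far more sophisticated and work on bytes and much larger corpora:

```python
from collections import Counter

def train_toy_bpe(corpus: list[str], vocab_size: int) -> set[str]:
    """Toy BPE trainer: merge the most frequent adjacent symbol pair until
    the vocabulary reaches vocab_size."""
    # Start with words split into individual characters.
    words = [list(word) for word in corpus]
    vocab = {ch for word in words for ch in word}

    while len(vocab) < vocab_size:
        # Count all adjacent symbol pairs across the corpus.
        pairs = Counter()
        for word in words:
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        # Merge the most frequent pair into a single new symbol.
        (a, b), _ = pairs.most_common(1)[0]
        merged = a + b
        vocab.add(merged)
        for word in words:
            i = 0
            while i < len(word) - 1:
                if word[i] == a and word[i + 1] == b:
                    word[i:i + 2] = [merged]
                else:
                    i += 1
    return vocab

print(sorted(train_toy_bpe(["low", "lower", "lowest", "newest"], vocab_size=15)))
```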
How does the model generate text?
During training, the model sees many terabytes’ worth of text and builds an internal probability map for tokens. For example, after it has seen that the tokens for `How are` are usually followed by the tokens for `you?`, it will learn that to be the most probable next set of tokens. Once this map has been built to a satisfactory degree, training is stopped and a checkpoint is released to the public (or kept private and served from an API, e.g. by OpenAI). During inference, the user provides the LLM with text, and the LLM, based on the probabilities it learned during training, decides what token comes next. However, it does not consider just one token: it considers every token in its vocabulary, assigns a probability score to each, and (depending on your sampler) may simply output the most probable token, i.e. the one with the highest score. That would make for rather boring output (unless you need determinism), so this is where sampling comes in.
From Tokens to Text: How LLMs Generate Content
Now that we understand how LLMs break down and represent text using tokens, let’s explore how they actually generate content. The process of text generation in LLMs involves two key steps:
- Prediction: For each position, the model calculates the probability distribution over all possible next tokens in its vocabulary.
- Selection: The model must choose one token from this distribution to add to the growing text.
The first step is fixed – determined by the model’s parameters after training. However, the second step – token selection – is where sampling occurs. While we could simply always choose the most likely token (known as “greedy” sampling), this tends to produce repetitive, deterministic text. Sampling introduces controlled randomness to make outputs more varied.
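A minimal sketch of these two steps, assuming we already have the model’s logits for the next position as a NumPy array (the model call itself is omitted); it contrasts greedy selection with sampling from the softmax distribution:

```python
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    """Convert raw logits into a probability distribution."""
    z = logits - logits.max()   # subtract max for numerical stability
    exp = np.exp(z)
    return exp / exp.sum()

rng = np.random.default_rng(0)
logits = np.array([2.0, 1.5, 0.3, -1.0])   # toy scores for a 4-token vocabulary

probs = softmax(logits)                     # step 1: prediction -> distribution

greedy_token = int(np.argmax(probs))                   # step 2a: greedy selection
sampled_token = int(rng.choice(len(probs), p=probs))   # step 2b: random sampling

print(probs, greedy_token, sampled_token)
```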
Sampling
As explained above, LLMs pick the most probable token to generate. Sampling is the practice of introducing controlled randomness into that choice. With pure “greedy” sampling, the model would pick the #1 option every time, but that’s boring! We use sampling methods like temperature, penalties, or truncation to allow for a bit of creative variation. This document goes through every popular sampling method and explains how each of them works, from both a simple-to-understand and a technical perspective.
Notes on Algorithm Presentations
Throughout this document, algorithms are presented in pseudo-code format that combines mathematical notation with programming concepts. Here are some guidelines to help interpret these:
Notation Guide
- L: the logits tensor (raw scores output by the model)
- P: probabilities (after applying softmax to logits)
- ←: assignment operation (equivalent to `=` in programming)
- ∑: summation
- |x|: either absolute value or length/size of `x`, depending on context
- x[i]: accessing the `i`-th element of `x`
- ∨: logical OR operation
- ¬: logical NOT operation
- ∞: infinity (often used to mask out tokens by setting logits to negative infinity)
- argmax(x): returns the index of the maximum value in `x`
- ∈: “element of” (e.g., `x ∈ X` means `x` is an element of set `X`)
Implementation Considerations
The algorithms provided are written for clarity rather than optimization. Production implementations would typically:
- Vectorize operations where possible for efficiency
- Handle edge cases and numerical stability issues (though parts that need this have occasionally been highlighted in the algorithms below)
- Incorporate batch processing for multiple sequences, if necessary for the framework
- Cache intermediate results where beneficial
Temperature
Think of this as the “creativity knob” on your LLM. At low temperatures (close to 0), the model becomes very cautious and predictable – it almost always picks the most likely next word. It’s like ordering the same dish at your favourite restaurant every time because you know you’ll like it (or maybe you don’t know any better). At higher temperatures (like 0.7-1.0), the model gets very creative and willing to take chances. It may choose the 3rd or 4th most likely word instead of always the top choice. This makes text more varied and interesting, but also increases the chance of errors. Very high temperatures (above 1.0) make the model wild and unpredictable, unless you use it in conjunction with other sampling methods (e.g. min-p) to rein it in.
Technical:
Temperature works by directly manipulating the probability distribution over the vocabulary. The model produces logits (unnormalized scores) for each token in the vocabulary, which are then divided by the temperature value. When temperature is less than 1, this makes high logits relatively higher and low logits relatively lower, giving us a more peaked distribution where the highest-scoring tokens become even more likely. When temperature exceeds 1, this flattens the distribution, making the probability gap between high- and low-scoring tokens smaller, thus increasing randomness. After applying temperature, the modified logits are converted to a probability distribution (using softmax) and a token is randomly sampled from this distribution. The mathematical effect is that temperature `T` transforms probabilities by essentially raising each probability to the power of `1/T` before renormalizing.
Algorithm
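A minimal sketch of how temperature sampling might be implemented, assuming the logits are a NumPy array:

```python
import numpy as np

def sample_with_temperature(logits: np.ndarray, temperature: float,
                            rng: np.random.Generator) -> int:
    """Scale logits by 1/temperature, apply softmax, then sample one token id."""
    if temperature <= 0:
        return int(np.argmax(logits))      # T -> 0 degenerates to greedy selection
    scaled = logits / temperature          # T < 1 sharpens, T > 1 flattens
    scaled -= scaled.max()                 # numerical stability before exp
    probs = np.exp(scaled)
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))

rng = np.random.default_rng(0)
logits = np.array([2.0, 1.5, 0.3, -1.0])
print(sample_with_temperature(logits, temperature=0.7, rng=rng))
```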
Presence Penalty
This discourages the model from repeating any token that has appeared before, regardless of how many times it’s been used. Think of it like a party host who wants to make sure everyone gets a turn to speak. If Tim has already spoken once, he gets slightly discouraged from speaking again, whether he spoke once or ten times before. This is generally not recommended, since better penalty strategies exist (see: DRY).
Technical:
Presence Penalty works by applying a fixed penalty to any token that has appeared in the generated text. We first identify which tokens have been used before, using an output mask (which is True for tokens that appear at least once), and then subtract the presence penalty value from the logits of those tokens. This makes previously used tokens less likely to be selected again, regardless of how frequently they’ve appeared. The penalty is applied uniformly to any token that has been used at least once.
Algorithm
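A minimal sketch of this, assuming `logits` is a NumPy array over the vocabulary and `generated_ids` is the list of token ids produced so far:

```python
import numpy as np

def apply_presence_penalty(logits: np.ndarray, generated_ids: list[int],
                           presence_penalty: float) -> np.ndarray:
    """Subtract a fixed penalty from every token that has appeared at least once."""
    penalized = logits.copy()
    # Boolean mask over the vocabulary: True where the token was already used.
    mask = np.zeros(len(logits), dtype=bool)
    mask[list(set(generated_ids))] = True
    penalized[mask] -= presence_penalty    # same penalty regardless of count
    return penalized

logits = np.array([2.0, 1.5, 0.3, -1.0])
print(apply_presence_penalty(logits, generated_ids=[0, 0, 2], presence_penalty=0.5))
```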
Frequency Penalty
Discourages tokens based on how many times they’ve already been used. This is simply Presence Penalty but with the number of occurrences being taken into account. The more frequently a word has appeared, the less likely it will appear again.
Technical:
The frequency penalty multiplies the count of each token’s previous occurrences by the penalty value and subtracts this from that token’s logit score. We track how many times each token has appeared in the generated output; if a token has appeared three times, its logit is reduced by `3 × (frequency penalty)`. This creates a progressive penalty that increases with each repeated use of a token.
Algorithm
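A minimal sketch under the same assumptions as above (`logits` over the vocabulary, `generated_ids` the tokens produced so far):

```python
import numpy as np
from collections import Counter

def apply_frequency_penalty(logits: np.ndarray, generated_ids: list[int],
                            frequency_penalty: float) -> np.ndarray:
    """Subtract penalty * (occurrence count) from each previously used token."""
    penalized = logits.copy()
    for token_id, count in Counter(generated_ids).items():
        penalized[token_id] -= frequency_penalty * count   # grows with repetition
    return penalized

logits = np.array([2.0, 1.5, 0.3, -1.0])
# Token 0 appeared three times, token 2 once: logit 0 drops by 1.5, logit 2 by 0.5.
print(apply_frequency_penalty(logits, generated_ids=[0, 0, 0, 2], frequency_penalty=0.5))
```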
11 Comments
antonvs
This is great! “Sampling” covers much more than I expected.
blt
This is pretty interesting. I didn't realize so much manipulation was happening after the initial softmax temperature choice.
Der_Einzige
Related to this, our min_p paper was ranked #18 out of 12000 submissions at ICLR and got an oral:
https://iclr.cc/virtual/2025/oral/31888
Our poster was popular:
poster: https://iclr.cc/media/PosterPDFs/ICLR%202025/30358.png?t=174…
oral presentation (watch me roast yoshua bengio on this topic and then have him be the first questioner, 2nd speaker starting around 19:30 min mark. My slides for the presentation are there too and really funny.): https://iclr.cc/virtual/2025/session/31936
paper: https://arxiv.org/abs/2407.01082
As one of the min_p authors, I can confirm that Top N sigma is currently the best general purpose sampler by far. Also, temperature can and should be scaled far higher than it is today. Temps of 100 are totally fine with techniques like min_p and top N sigma.
Also, the special case of top_k = 2 with ultra high temperature (one thing authors recommend against near the end) is very interesting in its own right. Doing it leads to spelling errors every ~10th word – but also seems to have a certain creativity to it that's quite interesting.
orbital-decay
One thing not said here is that samplers have no access to model's internal state. It's basic math applied to the output distribution, which technically carries some semantics but you can't decode it without being as smart as the model itself.
Certain samplers described here like repetition penalty or DRY are just like this – the model could repeat itself in a myriad of ways, the only way to prevent all of them is better training, not n-gram search or other classic NLP methods. This is basically trying to plug every hole with a finger. How many fingers do you have?
Hacking the autoregressive process has some low-hanging fruit like Min-P that can make some improvement and certain nifty tricks possible, but if you're doing it to turn a bad model into a good one, you're doing it wrong.
mdp2021
When the attempt, though, is to have the LLM output an "idea", not just a "next token", the selection over the logits vector should break that original idea… If the idea is complete, there should be no need to use sampling over the logits.
The sampling, in this framework, should not happen near the output level ("what will the next spoke word be").
neuroelectron
Love this and the way everything is mapped out and explained simply really opens up the opportunity for trying new things, and where you can do that effectively.
For instance, why not use whole words as tokens? Make a "robot" with a limited "robot dialect." Yes, no capacity for new words or rare words, but you could modify the training data and input data to translate those words into the existing vocabulary. Now you have a much smaller mapping that's literally robot-like and kind of gives the user an expectation of what kind of answers the robot can answer well, like C-3PO.
simonw
This is a really useful document – the explanations are very clear and it covers a lot of ground.
Anyone know who wrote it? It's not credited and it's published on a free Markdown pastebin.
The section on DRY – "repetition penalties" – was interesting to me. I often want LLMs to deliberately output exact copies of their input. When summarizing a long conversation for example I tend to ask for exact quotes that are most illustrative of the points being made. These are easy to fact check later by searching for them in the source material.
The DRY penalty seems to me that it would run counter to my goal there.
smcleod
I had a go at writing a bit of a sampling guide for Ollama/llama.cpp as well recently, open to any feedback / corrections – https://smcleod.net/2025/04/comprehensive-guide-to-llm-sampl…
ltbarcly3
Calling things modern that are updates to techniques to use technologies only invented a few years ago is borderline illiterate. Modern vs what, classical LLM sampling?