This content originally appeared on DEV Community and was authored by Mikhail Berkov
This is a cross-post; you can find the original article on my Medium.
When generating text with AI, controlling randomness is key to balancing creativity and coherence. This article explains top-k and top-p (nucleus) sampling — two popular techniques that shape output quality and diversity. With clear Python examples and tuning tips, you’ll learn how to apply these methods to get results that match your goals.
Top-k Sampling
So far, we have covered greedy sampling and probabilistic sampling.
Greedy sampling is deterministic and always picks the most likely token.
Probabilistic sampling is non-deterministic and draws a token at random from the distribution, optionally reshaped by the temperature parameter.
Sometimes, we want a middle ground: sampling probabilistically while constraining the selection to avoid low-quality tokens.
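As a quick refresher, both earlier strategies can be sketched in a few lines. Note that the power-of-1/T rescaling below is one common way to express temperature directly on probabilities (it is equivalent to dividing logits by T before the softmax); real implementations usually work on logits:

```python
import random

probabilities = [
    {"token": "Apple", "prob": 0.5},
    {"token": "Banana", "prob": 0.3},
    {"token": "Cherry", "prob": 0.2},
]

def sample_greedy(probabilities):
    # Deterministic: always return the single most likely token.
    return max(probabilities, key=lambda item: item["prob"])

def sample_probabilistic(probabilities, temperature=1.0):
    # Rescale each probability by the exponent 1/temperature:
    # T < 1 sharpens the distribution, T > 1 flattens it.
    weights = [item["prob"] ** (1 / temperature) for item in probabilities]
    return random.choices(probabilities, weights=weights, k=1)[0]

print(sample_greedy(probabilities)["token"])  # always Apple
```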
In top-k sampling, we consider only the top k most probable tokens and then sample from this restricted set:
import random

def sample_top_k(probabilities, k):
    # Keep only the k most probable tokens.
    top_k_probabilities = sorted(probabilities, key=lambda item: item["prob"], reverse=True)[:k]
    # Sample from the restricted set, weighted by probability.
    return random.choices(top_k_probabilities, weights=[item["prob"] for item in top_k_probabilities], k=1)[0]
Let’s use this function in a simple example:
from collections import defaultdict

probabilities = [
    {"token": "Apple", "prob": 0.5},
    {"token": "Banana", "prob": 0.3},
    {"token": "Cherry", "prob": 0.1},
    {"token": "Durian", "prob": 0.05},
    {"token": "Elderberry", "prob": 0.05},
]
counts = defaultdict(int)
for _ in range(1000):
    counts[sample_top_k(probabilities, k=3)["token"]] += 1
print(counts)
This will output something like:
{'Cherry': 110, 'Banana': 312, 'Apple': 578}
Note that we only select from the top 3 tokens—everything else is ignored.
The parameter k is a hyperparameter that you can tune for your task.
The higher k is, the more diverse the output will be.
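To see the two extremes, compare k = 1, which collapses to greedy sampling, with k equal to the vocabulary size, which is plain probabilistic sampling. A small sketch, repeating the distribution and function from above so the snippet runs on its own:

```python
import random
from collections import defaultdict

probabilities = [
    {"token": "Apple", "prob": 0.5},
    {"token": "Banana", "prob": 0.3},
    {"token": "Cherry", "prob": 0.1},
    {"token": "Durian", "prob": 0.05},
    {"token": "Elderberry", "prob": 0.05},
]

def sample_top_k(probabilities, k):
    top_k = sorted(probabilities, key=lambda item: item["prob"], reverse=True)[:k]
    return random.choices(top_k, weights=[item["prob"] for item in top_k], k=1)[0]

for k in (1, 5):
    counts = defaultdict(int)
    for _ in range(1000):
        counts[sample_top_k(probabilities, k=k)["token"]] += 1
    # k=1: only 'Apple' ever appears; k=5: every token can appear.
    print(k, dict(counts))
```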
Top-p Sampling
Top-k sampling is a simple and effective way to limit the tokens considered.
However, since k is fixed, it can be problematic: in some cases, the top k tokens may capture 99% of the probability mass, while in others, only 30%.
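A quick way to see the problem is to compare how much probability mass a fixed k captures under a peaked versus a flat distribution. The two toy distributions below are made up for illustration:

```python
# Two hypothetical distributions over five tokens.
peaked = [0.7, 0.2, 0.05, 0.03, 0.02]  # model is confident
flat = [0.22, 0.21, 0.20, 0.19, 0.18]  # model is uncertain

def top_k_mass(probs, k):
    # Probability mass captured by the k most likely tokens.
    return sum(sorted(probs, reverse=True)[:k])

print(round(top_k_mass(peaked, 3), 2))  # 0.95 -- almost everything
print(round(top_k_mass(flat, 3), 2))    # 0.63 -- a large chunk is cut off
```

The same k=3 is a sensible cut-off in the first case and an aggressive one in the second.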
To address this, we can use top-p sampling (also known as nucleus sampling).
In top-p sampling, we include just enough tokens to capture a certain probability mass p.
We then sample from this set:
import random

def sample_top_p(probabilities, p):
    # Sort tokens from most to least probable.
    sorted_probabilities = sorted(probabilities, key=lambda item: item["prob"], reverse=True)
    top_p_probabilities = []
    cumulative_prob = 0
    for item in sorted_probabilities:
        top_p_probabilities.append(item)
        cumulative_prob += item["prob"]
        if cumulative_prob >= p:
            # Stop once the kept tokens cover probability mass p.
            break
    return random.choices(top_p_probabilities, weights=[item["prob"] for item in top_p_probabilities], k=1)[0]
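One detail worth noting: the loop appends a token before checking the threshold, so the kept set is never empty. Even p = 0 keeps the single most likely token and behaves like greedy sampling. A quick check, repeating the function so the snippet runs on its own:

```python
import random

probabilities = [
    {"token": "Apple", "prob": 0.5},
    {"token": "Banana", "prob": 0.3},
    {"token": "Cherry", "prob": 0.2},
]

def sample_top_p(probabilities, p):
    sorted_probabilities = sorted(probabilities, key=lambda item: item["prob"], reverse=True)
    top_p_probabilities = []
    cumulative_prob = 0
    for item in sorted_probabilities:
        top_p_probabilities.append(item)
        cumulative_prob += item["prob"]
        if cumulative_prob >= p:
            break
    return random.choices(top_p_probabilities, weights=[item["prob"] for item in top_p_probabilities], k=1)[0]

# With p = 0 the first (most likely) token already satisfies the
# threshold, so the result is deterministic.
print(sample_top_p(probabilities, p=0)["token"])  # Apple
```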
Let’s use this function in a simple example:
from collections import defaultdict

probabilities = [
    {"token": "Apple", "prob": 0.5},
    {"token": "Banana", "prob": 0.3},
    {"token": "Cherry", "prob": 0.1},
    {"token": "Durian", "prob": 0.05},
    {"token": "Elderberry", "prob": 0.05},
]
counts = defaultdict(int)
for _ in range(1000):
    counts[sample_top_p(probabilities, p=0.9)["token"]] += 1
print(counts)
Here, we keep the smallest set of the most probable tokens whose cumulative probability meets or exceeds p=0.9.
This means that the tokens “Apple”, “Banana” and “Cherry” are included, while “Durian” and “Elderberry” are not.
We can see this in the output:
{'Banana': 356, 'Apple': 531, 'Cherry': 113}
Let’s see what happens if we set p=0.8:
counts = defaultdict(int)
for _ in range(1000):
    counts[sample_top_p(probabilities, p=0.8)["token"]] += 1
print(counts)
This will output something like:
{'Apple': 624, 'Banana': 376}
In this case, only the “Apple” and “Banana” tokens are sampled because their cumulative probability already reaches p=0.8.
As with k, p is a tunable hyperparameter.
The higher p is, the more diverse the output will be.
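You can make the effect of p visible by printing which tokens survive the cut-off for a few values, using the same five-token distribution as above:

```python
probabilities = [
    {"token": "Apple", "prob": 0.5},
    {"token": "Banana", "prob": 0.3},
    {"token": "Cherry", "prob": 0.1},
    {"token": "Durian", "prob": 0.05},
    {"token": "Elderberry", "prob": 0.05},
]

def top_p_set(probabilities, p):
    # Tokens that survive the nucleus cut-off for a given p.
    sorted_probs = sorted(probabilities, key=lambda item: item["prob"], reverse=True)
    kept, cumulative = [], 0.0
    for item in sorted_probs:
        kept.append(item["token"])
        cumulative += item["prob"]
        if cumulative >= p:
            break
    return kept

for p in (0.5, 0.8, 0.9, 1.0):
    print(p, top_p_set(probabilities, p))
# 0.5 ['Apple']
# 0.8 ['Apple', 'Banana']
# 0.9 ['Apple', 'Banana', 'Cherry']
# 1.0 ['Apple', 'Banana', 'Cherry', 'Durian', 'Elderberry']
```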
In practice, top-p sampling is often preferred over top-k because it’s adaptive—it dynamically includes enough high-probability tokens to capture most of the probability mass.
You can specify the value of p using the top_p parameter in the OpenAI API:
import os
import requests

response = requests.post(
    "https://api.openai.com/v1/chat/completions",
    headers={
        "Authorization": f"Bearer {os.getenv('OPENAI_API_KEY')}",
        "Content-Type": "application/json",
    },
    json={
        "model": "gpt-4o",
        "messages": [
            {"role": "user", "content": "How are you?"}
        ],
        "top_p": 0.9,
    },
)
response_json = response.json()
content = response_json["choices"][0]["message"]["content"]
print(content)
It is generally recommended to specify either the temperature or the top_p parameter, but not both.
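The reason for this advice is that both knobs reshape the same distribution, so their effects compound in ways that are hard to reason about. Here is a sketch of one common sampling pipeline (temperature first, then the nucleus cut; the exact order can differ between implementations):

```python
import random

def sample(probabilities, temperature=1.0, top_p=1.0):
    # 1. Temperature: raise each probability to 1/T and renormalise
    #    (equivalent to dividing logits by T before the softmax).
    scaled = [
        {"token": item["token"], "prob": item["prob"] ** (1 / temperature)}
        for item in probabilities
    ]
    total = sum(item["prob"] for item in scaled)
    for item in scaled:
        item["prob"] /= total
    # 2. Nucleus cut: keep the smallest set of tokens whose
    #    cumulative (rescaled) probability reaches top_p.
    scaled.sort(key=lambda item: item["prob"], reverse=True)
    kept, cumulative = [], 0.0
    for item in scaled:
        kept.append(item)
        cumulative += item["prob"]
        if cumulative >= top_p:
            break
    # 3. Sample from what is left.
    return random.choices(kept, weights=[item["prob"] for item in kept], k=1)[0]

probabilities = [
    {"token": "Apple", "prob": 0.5},
    {"token": "Banana", "prob": 0.3},
    {"token": "Cherry", "prob": 0.2},
]
print(sample(probabilities, temperature=0.7, top_p=0.9)["token"])
```

Lowering the temperature sharpens the distribution, which in turn shrinks the nucleus for the same top_p, so changing one parameter silently changes the effect of the other.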
If you found this helpful, hit Follow to get more dev insights in your feed!