
Temperature and Sampling: Controlling LLM Output Creativity

Master the sampling parameters that control LLM behavior — temperature, top-p, top-k, frequency penalty, and presence penalty — with practical examples showing when to use each.

How LLMs Choose Their Words

When an LLM generates text, it does not produce words directly. At each step, it computes a probability distribution over its entire vocabulary — typically 50,000 to 100,000 tokens. The model assigns a probability to every possible next token, and then it samples from that distribution. The sampling parameters you set control how that sampling happens, which in turn controls the character of the output.

This is the most practical lever you have for controlling LLM behavior without changing the prompt itself.
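The loop described above can be sketched in a few lines of NumPy. The vocabulary and logits here are made up for illustration — a real model scores tens of thousands of tokens at every step:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a model's output: raw scores over a tiny vocabulary
vocab = ["the", "a", "this", "my", "that"]
logits = np.array([5.0, 3.0, 1.0, 0.5, 0.1])

# Softmax converts raw scores into a probability distribution
probs = np.exp(logits - logits.max())
probs /= probs.sum()

# The next token is sampled from that distribution, not chosen greedily
next_token = rng.choice(vocab, p=probs)
print(next_token)
```

Every sampling parameter discussed below is a different way of reshaping or truncating `probs` before that final `choice` call.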

Temperature: The Master Dial

Temperature scales the logits (raw scores) before they are converted to probabilities via the softmax function. It is the single most important sampling parameter.

import numpy as np

def softmax_with_temperature(logits, temperature=1.0):
    """
    Apply temperature to logits before softmax.

    temperature < 1.0: sharper distribution (more deterministic)
    temperature = 1.0: default behavior
    temperature > 1.0: flatter distribution (more random)
    """
    scaled_logits = logits / temperature
    exp_logits = np.exp(scaled_logits - np.max(scaled_logits))
    return exp_logits / np.sum(exp_logits)

# Example: model raw logits for 5 candidate tokens
logits = np.array([5.0, 3.0, 1.0, 0.5, 0.1])
tokens = ["the", "a", "this", "my", "that"]

for temp in [0.1, 0.5, 1.0, 1.5, 2.0]:
    probs = softmax_with_temperature(logits, temperature=temp)
    print(f"Temperature {temp}:")
    for token, prob in zip(tokens, probs):
        bar = "#" * int(prob * 50)
        print(f"  {token:6s} {prob:.4f} {bar}")
    print()

At temperature 0.1, the highest-probability token gets almost all the weight — the output becomes nearly deterministic. At temperature 2.0, the probabilities are spread more evenly, and the model frequently picks less-likely tokens.

Temperature 0 is a special case. Most APIs treat it as greedy decoding — always pick the highest-probability token. This makes the output effectively deterministic (the same input almost always produces the same output; the FAQ below covers why it is not a hard guarantee):

from openai import OpenAI

client = OpenAI()

# Deterministic output: always produces the same response
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What is 2 + 2?"}],
    temperature=0,  # Greedy decoding — deterministic
)
print(response.choices[0].message.content)

# Creative output: varies between runs
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Write a haiku about programming."}],
    temperature=1.2,  # Higher creativity
)
print(response.choices[0].message.content)

Top-p (Nucleus Sampling): Dynamic Vocabulary Filtering

Top-p sampling, also called nucleus sampling, takes a different approach. Instead of scaling all probabilities, it only considers the smallest set of tokens whose cumulative probability exceeds the threshold p:

def top_p_sampling(logits, p=0.9):
    """
    Nucleus sampling: only consider the top tokens
    whose cumulative probability exceeds p.
    """
    probs = softmax_with_temperature(logits, temperature=1.0)

    # Sort tokens by probability (descending)
    sorted_indices = np.argsort(probs)[::-1]
    sorted_probs = probs[sorted_indices]

    # Find the cutoff where cumulative probability exceeds p
    cumulative = np.cumsum(sorted_probs)
    cutoff_index = np.searchsorted(cumulative, p) + 1

    # Zero out tokens below the cutoff
    allowed_indices = sorted_indices[:cutoff_index]
    filtered_probs = np.zeros_like(probs)
    filtered_probs[allowed_indices] = probs[allowed_indices]

    # Re-normalize
    filtered_probs /= np.sum(filtered_probs)

    return filtered_probs

logits = np.array([5.0, 3.0, 1.0, 0.5, 0.1, -1.0, -2.0])
tokens = ["the", "a", "this", "my", "that", "our", "their"]

for p in [0.5, 0.9, 0.95]:
    probs = top_p_sampling(logits, p=p)
    active = [(t, pr) for t, pr in zip(tokens, probs) if pr > 0.001]
    print(f"top_p={p}: {len(active)} tokens considered: {active}")

The advantage of top-p over temperature is adaptability. When the model is confident (one token has 95% probability), top-p=0.9 keeps only that token. When the model is uncertain (many tokens with similar probabilities), it lets more through. Temperature applies the same scaling regardless of the distribution shape.
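This adaptability is easy to demonstrate. The helper below is a standalone illustration (not part of any API) that just counts how many tokens survive the nucleus cutoff, using two hand-picked distributions:

```python
import numpy as np

def nucleus_cutoff(probs, p):
    """Count how many tokens survive top-p filtering."""
    sorted_probs = np.sort(probs)[::-1]
    return int(np.searchsorted(np.cumsum(sorted_probs), p) + 1)

# Confident: one token dominates, so top-p discards everything else
confident = np.array([0.95, 0.02, 0.01, 0.01, 0.01])
# Uncertain: probability is spread evenly, so top-p lets most tokens through
uncertain = np.array([0.22, 0.21, 0.20, 0.19, 0.18])

print(nucleus_cutoff(confident, 0.9))  # -> 1
print(nucleus_cutoff(uncertain, 0.9))  # -> 5
```

The same `top_p=0.9` keeps one token in the first case and all five in the second — the filter tightens and loosens with the model's own confidence.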

Top-k Sampling: Fixed Vocabulary Cutoff

Top-k is the simplest filtering strategy: keep the k highest-probability tokens, discard the rest:


def top_k_sampling(logits, k=10):
    """Only consider the top k tokens."""
    probs = softmax_with_temperature(logits, temperature=1.0)

    # Find indices of top k tokens
    top_k_indices = np.argsort(probs)[-k:]

    # Zero out everything else
    filtered_probs = np.zeros_like(probs)
    filtered_probs[top_k_indices] = probs[top_k_indices]
    filtered_probs /= np.sum(filtered_probs)

    return filtered_probs

Top-k is less commonly used with modern APIs because it does not adapt to the confidence level. With k=50, the model considers 50 tokens whether it is very confident or very uncertain. Top-p is generally preferred for this reason.
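The contrast with top-p shows up directly if you run top-k on the same confident and uncertain distributions (a standalone sketch with made-up numbers):

```python
import numpy as np

def top_k_filter(probs, k):
    """Keep the k highest-probability tokens and renormalize."""
    cutoff = np.sort(probs)[-k]              # k-th largest probability
    filtered = np.where(probs >= cutoff, probs, 0.0)
    return filtered / filtered.sum()

confident = np.array([0.95, 0.025, 0.015, 0.006, 0.004])
uncertain = np.array([0.22, 0.21, 0.20, 0.19, 0.18])

# k=3 keeps exactly three tokens in both cases, even when one token
# already holds 95% of the probability mass
print(np.count_nonzero(top_k_filter(confident, 3)))  # -> 3
print(np.count_nonzero(top_k_filter(uncertain, 3)))  # -> 3
```

Top-p would keep one token in the first case and many in the second; top-k keeps three in both.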

Frequency and Presence Penalties

These parameters address repetition, one of the most common LLM failure modes:

# Frequency penalty: reduces probability proportional to how many times
# a token has already appeared. Higher values = less repetition.
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Write a paragraph about the ocean."}],
    frequency_penalty=0.5,  # Range: -2.0 to 2.0
)

# Presence penalty: reduces probability of any token that has appeared at all,
# regardless of how many times. Encourages topic diversity.
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Write a paragraph about the ocean."}],
    presence_penalty=0.5,  # Range: -2.0 to 2.0
)

The difference is subtle but important. Frequency penalty penalizes tokens more each time they appear — saying "ocean" three times gets penalized more than saying it once. Presence penalty applies a flat penalty once a token has appeared at all. Use frequency penalty to reduce repetitive phrases within a response and presence penalty to encourage the model to explore new topics.
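OpenAI's API reference documents the formula both penalties use: each token's logit is reduced by `count * frequency_penalty + (count > 0) * presence_penalty`, where `count` is how often the token has appeared so far. A toy sketch of that adjustment (`apply_penalties` is an illustrative helper, not an actual API function):

```python
import numpy as np
from collections import Counter

def apply_penalties(logits, generated_tokens, token_ids,
                    frequency_penalty=0.0, presence_penalty=0.0):
    """Reduce each logit by count * freq_penalty + (count > 0) * pres_penalty."""
    counts = Counter(generated_tokens)
    adjusted = logits.copy()
    for i, token in enumerate(token_ids):
        c = counts[token]
        adjusted[i] -= c * frequency_penalty + (c > 0) * presence_penalty
    return adjusted

tokens = ["ocean", "wave", "tide"]
logits = np.array([2.0, 1.0, 0.5])
history = ["ocean", "ocean", "ocean", "wave"]  # "ocean" used 3x, "wave" once

# Frequency penalty scales with count: "ocean" is hit 3x harder than "wave"
print(apply_penalties(logits, history, tokens, frequency_penalty=0.5))
# Presence penalty is flat: "ocean" and "wave" take the same -0.5 hit
print(apply_penalties(logits, history, tokens, presence_penalty=0.5))
```

With the frequency penalty, "ocean" drops from 2.0 to 0.5 while "wave" only drops to 0.5 from 1.0; with the presence penalty, both lose exactly 0.5 regardless of count.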

Practical Parameter Recommendations

Different use cases call for different parameter combinations:

# Factual Q&A: deterministic, focused
factual_params = {
    "temperature": 0,
    "top_p": 1.0,
    "frequency_penalty": 0,
    "presence_penalty": 0,
}

# Code generation: low temperature, slight penalty for repetition
code_params = {
    "temperature": 0.2,
    "top_p": 0.95,
    "frequency_penalty": 0.1,
    "presence_penalty": 0,
}

# Creative writing: higher temperature, topic diversity
creative_params = {
    "temperature": 0.9,
    "top_p": 0.95,
    "frequency_penalty": 0.3,
    "presence_penalty": 0.5,
}

# Brainstorming: high temperature, strong diversity
brainstorm_params = {
    "temperature": 1.2,
    "top_p": 0.9,
    "frequency_penalty": 0.5,
    "presence_penalty": 0.8,
}

# Data extraction / classification: fully deterministic
extraction_params = {
    "temperature": 0,
    "top_p": 1.0,
    "frequency_penalty": 0,
    "presence_penalty": 0,
}

The Interaction Between Temperature and Top-p

A common mistake is setting both temperature and top-p to extreme values simultaneously. They interact in ways that can produce unexpected results:

# GOOD: Use one or the other as your primary control
# Option A: Temperature-based control
{"temperature": 0.3, "top_p": 1.0}   # top_p=1.0 means no filtering

# Option B: top_p-based control
{"temperature": 1.0, "top_p": 0.5}   # temperature=1.0 means no scaling

# AVOID: Both aggressive simultaneously
{"temperature": 0.2, "top_p": 0.5}   # Double restriction — very rigid
{"temperature": 1.5, "top_p": 0.99}  # Temperature adds randomness that top_p barely filters

OpenAI's documentation recommends adjusting either temperature or top-p, but not both. In practice, temperature is the more intuitive control for most developers.
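Open-source samplers (Hugging Face's transformers, for example) typically apply temperature scaling first and then top-p filtering on the already-scaled distribution; hosted APIs may order things differently. A sketch of that combined pipeline shows why the double restriction is so rigid:

```python
import numpy as np

def sample_token(logits, temperature=1.0, top_p=1.0, rng=None):
    """Temperature scaling first, then nucleus filtering, then sampling."""
    rng = rng or np.random.default_rng()
    scaled = logits / max(temperature, 1e-6)   # avoid division by zero
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()

    # Nucleus filter on the already-scaled distribution
    order = np.argsort(probs)[::-1]
    cutoff = np.searchsorted(np.cumsum(probs[order]), top_p) + 1
    keep = order[:cutoff]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    filtered /= filtered.sum()

    return int(rng.choice(len(logits), p=filtered))

logits = np.array([5.0, 3.0, 1.0, 0.5, 0.1])
# Double restriction: temperature 0.2 sharpens the distribution so much
# that top_p=0.5 then keeps only the single top token
idx = sample_token(logits, temperature=0.2, top_p=0.5)
print(idx)  # -> 0, every time
```

At temperature 0.2 the top token already holds essentially all the probability mass, so the nucleus cutoff at 0.5 leaves nothing else to sample from — the result is greedy decoding by accident.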

FAQ

What temperature should I use for a production chatbot?

For most production chatbots, start with temperature 0.7 and top_p 1.0. This produces natural-sounding responses with enough variation to avoid feeling robotic, while staying focused enough to be reliable. For customer service bots where accuracy matters more than creativity, drop to 0.3. For creative applications like story generation, go up to 0.9 or 1.0. Always test with real user queries before committing to a value.

Why does temperature 0 sometimes give different outputs?

Floating-point arithmetic on GPUs is not perfectly deterministic across different hardware configurations. Even with temperature 0, tiny numerical differences can cause a different token to be selected when two tokens have very similar probabilities. OpenAI provides a seed parameter that improves determinism but does not guarantee it. For applications requiring exact reproducibility, cache the responses rather than relying on deterministic generation.

Can I change sampling parameters mid-conversation?

Yes. Sampling parameters are set per API call, not per conversation. You can use temperature 0 for a factual lookup, then switch to temperature 0.8 for a creative follow-up. This is a useful technique for multi-step agents that need different modes for different tasks — structured data extraction with temperature 0 followed by user-facing summary generation with temperature 0.7.


#Temperature #Sampling #LLM #PromptEngineering #APIParameters #AgenticAI #LearnAI #AIEngineering


CallSphere Team
