
Debugging LLM Responses: When the Model Says Something Wrong or Unexpected

Learn systematic techniques for diagnosing why an LLM produces incorrect or surprising outputs, including prompt debugging, temperature tuning, few-shot correction, and structured output analysis.

The Model Said What?

Every developer building AI agents hits the same wall: the model returns something confidently wrong, hallucinates data that does not exist, or ignores a clear instruction. The instinct is to rewrite the entire prompt from scratch. That is almost never the right first step.

Debugging LLM responses requires the same discipline as debugging traditional software. You isolate the problem, form a hypothesis, test it, and iterate. The difference is that LLMs are stochastic — the same input can produce different outputs — so your debugging toolkit needs to account for non-determinism.

Step 1: Capture the Full Request and Response

Before you change anything, log the exact request that produced the bad output. This means the system prompt, user message, conversation history, tool definitions, and all model parameters:

import json
import openai
from datetime import datetime

class LLMDebugger:
    def __init__(self, client: openai.AsyncOpenAI):
        self.client = client
        self.debug_log = []

    async def chat(self, messages, model="gpt-4o", temperature=1.0, **kwargs):
        request_payload = {
            "model": model,
            "messages": messages,
            "temperature": temperature,
            **kwargs,
        }

        # Capture full request
        debug_entry = {
            "timestamp": datetime.utcnow().isoformat(),
            "request": request_payload,
        }

        response = await self.client.chat.completions.create(**request_payload)

        # Capture full response
        debug_entry["response"] = {
            "content": response.choices[0].message.content,
            "finish_reason": response.choices[0].finish_reason,
            "usage": {
                "prompt_tokens": response.usage.prompt_tokens,
                "completion_tokens": response.usage.completion_tokens,
            },
        }
        self.debug_log.append(debug_entry)
        return response

    def dump_last(self):
        if self.debug_log:
            print(json.dumps(self.debug_log[-1], indent=2))

With the full request captured, you can replay it to see if the problem is deterministic or intermittent.
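A replay harness can automate that check. The sketch below assumes a hypothetical `send` callable, an async wrapper around `client.chat.completions.create(**payload)` that returns just the message content:

```python
import asyncio

async def replay_request(send, request_payload, runs=5):
    """Replay a captured request several times and report whether the
    outputs agree. `send` is any async callable(**payload) -> str content;
    in practice it would wrap client.chat.completions.create (hypothetical)."""
    outputs = [await send(**request_payload) for _ in range(runs)]
    unique = set(outputs)
    return {
        "deterministic": len(unique) == 1,
        "distinct_outputs": len(unique),
        "outputs": outputs,
    }
```

If `deterministic` is True, the bad output is baked into the prompt and parameters; if not, suspect sampling variance and move to Step 2.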

Step 2: Check Temperature and Sampling

Temperature is the most common hidden cause of inconsistent behavior. A temperature of 1.0 introduces significant randomness. For agent tasks that require precision — tool selection, data extraction, classification — lower the temperature:

# High temperature: creative but unpredictable
response = await client.chat.completions.create(
    model="gpt-4o",
    messages=messages,
    temperature=1.0,  # Too high for structured tasks
)

# Low temperature: deterministic and precise
response = await client.chat.completions.create(
    model="gpt-4o",
    messages=messages,
    temperature=0.1,  # Suitable for tool calls and extraction
)

Run the same prompt 10 times at your current temperature. If the bad output appears in only 2 of 10 runs, the issue is sampling variance, not a prompt flaw.
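That repetition test is easy to script. In this sketch, `call_fn` is a hypothetical async wrapper that sends the messages and returns the response content:

```python
import asyncio
from collections import Counter

async def sample_outputs(call_fn, messages, n=10):
    """Run the same request n times and tally distinct outputs.
    `call_fn` is any async callable(messages) -> str content, e.g. a
    wrapper around client.chat.completions.create (hypothetical)."""
    results = await asyncio.gather(*(call_fn(messages) for _ in range(n)))
    return Counter(results)
```

A Counter with one dominant output and a few outliers points at sampling variance; a Counter with many distinct entries points at an underspecified prompt.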


Step 3: Isolate the Prompt Section

When the full prompt is long, identify which section is causing the issue. Comment out sections systematically:

import re

def build_diagnostic_prompts(full_system_prompt: str, user_message: str):
    """Generate minimal prompt variants to isolate the problem.

    Assumes the system prompt is organized into markdown "## " sections.
    """
    # Split at "## " headings, keeping each heading attached to its section
    sections = re.split(r"(?m)^(?=## )", full_system_prompt)
    variants = []

    for i, section in enumerate(sections):
        # Remove one section at a time
        reduced = "".join(s for j, s in enumerate(sections) if j != i)
        variants.append({
            "removed_section": i,
            "section_preview": section[:80],
            "messages": [
                {"role": "system", "content": reduced},
                {"role": "user", "content": user_message},
            ],
        })
    return variants

If removing a section fixes the problem, that section contains a conflicting or confusing instruction.
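Running the variants can be automated too. In this sketch, `send` and `check` are hypothetical stand-ins for your async API wrapper and your output validator:

```python
import asyncio

async def run_variants(send, variants, check):
    """Test each reduced prompt variant and flag which section removal
    fixes the issue. `send` is an async callable(messages) -> str content
    and `check` returns True when the output looks correct (both
    hypothetical)."""
    findings = []
    for variant in variants:
        output = await send(variant["messages"])
        findings.append({
            "removed_section": variant["removed_section"],
            "section_preview": variant["section_preview"],
            "fixed": check(output),
        })
    return findings
```

Any entry with `"fixed": True` names a section worth rewriting.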

Step 4: Add Few-Shot Examples

When the model consistently misinterprets an instruction, few-shot examples are more effective than adding more explanation. Show the model what you want:

system_prompt = """You are a support agent. Extract the issue category.

Example input: "My payment was charged twice"
Example output: {"category": "billing", "urgency": "high"}

Example input: "How do I change my password?"
Example output: {"category": "account", "urgency": "low"}

Always respond with valid JSON only."""

Few-shot examples anchor the model to a specific output pattern. Two or three examples are usually sufficient.
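Even with good examples, verify the output before trusting it. Models sometimes wrap their JSON in markdown fences despite a "JSON only" instruction, so a minimal parser sketch that tolerates that failure mode looks like:

```python
import json

def parse_json_output(raw: str):
    """Validate a model response that should be JSON only. Strips a
    markdown code fence if the model added one, then parses; returns
    None when the output is not valid JSON."""
    text = raw.strip()
    if text.startswith("```"):
        # Drop an opening fence like ```json and the trailing fence
        text = text.split("\n", 1)[1] if "\n" in text else text
        text = text.rsplit("```", 1)[0]
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None
```

A None result is a signal to retry the request or tighten the few-shot examples.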

FAQ

How do I debug a hallucinated tool call where the model invents a tool that does not exist?

Check that your tool definitions include clear, distinct descriptions. Models hallucinate tool names when existing tool descriptions are vague or overlap. Reduce temperature to 0.1 for tool selection and verify that the tools array in your request contains all expected entries. If the model still invents tools, add a system instruction explicitly stating it must only use the tools provided.
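A minimal guard against hallucinated tool names, assuming OpenAI-style tool definitions of the form `{"function": {"name": ...}}`:

```python
def validate_tool_call(tool_call_name: str, tools: list) -> bool:
    """Check a model's tool call against the tools actually sent in the
    request; reject hallucinated names before executing anything."""
    known = {t["function"]["name"] for t in tools}
    return tool_call_name in known
```

Rejected calls can be fed back to the model with an error message listing the valid tool names.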

Should I always use temperature 0 for deterministic behavior?

Temperature 0 makes the output nearly deterministic but not perfectly so — there can be minor variations due to floating-point arithmetic across different hardware. Use temperature 0 or 0.1 for tasks requiring precision such as classification, extraction, and tool selection. Reserve higher temperatures for creative tasks like content generation where variety is desirable.

How many few-shot examples should I include to fix a recurring output format issue?

Two to three examples are usually enough to anchor the model to a specific format. More than five examples increase token usage without proportional improvement. Place examples near the beginning of the system prompt where they receive the most attention from the model.


#Debugging #LLM #PromptEngineering #AIAgents #Troubleshooting #AgenticAI #LearnAI #AIEngineering
