
Debugging Tool Call Failures: Tracing Why Agent Tools Return Errors or Wrong Results

Master techniques for diagnosing tool call failures in AI agents, from call logging and parameter inspection to mock execution and replay testing for reliable tool integrations.

Tools Are the Hands of Your Agent

AI agents do not just generate text — they act. They call APIs, query databases, read files, and execute business logic through tool functions. When a tool call fails, the agent either retries blindly, hallucinates a result, or gives up entirely. None of these outcomes are acceptable in production.

Debugging tool call failures requires visibility into what the model requested, what parameters it sent, and what the tool function actually received and returned.

Building a Tool Call Interceptor

The first step is to wrap your tool execution with comprehensive logging. This interceptor captures every detail of the tool call lifecycle:

import json
import time
import traceback
from typing import Any, Callable
from dataclasses import dataclass, field

@dataclass
class ToolCallRecord:
    tool_name: str
    arguments: dict
    result: Any = None
    error: str | None = None
    duration_ms: float = 0
    timestamp: float = field(default_factory=time.time)

class ToolDebugger:
    def __init__(self):
        self.call_history: list[ToolCallRecord] = []

    def wrap(self, tool_fn: Callable, tool_name: str) -> Callable:
        async def wrapper(**kwargs):
            record = ToolCallRecord(
                tool_name=tool_name,
                arguments=kwargs,
            )
            start = time.perf_counter()
            try:
                result = await tool_fn(**kwargs)
                record.result = result
                record.duration_ms = (time.perf_counter() - start) * 1000
                return result
            except Exception as e:
                record.error = f"{type(e).__name__}: {e}"
                record.duration_ms = (time.perf_counter() - start) * 1000
                raise
            finally:
                self.call_history.append(record)
        return wrapper

    def print_history(self):
        for i, rec in enumerate(self.call_history):
            status = "OK" if rec.error is None else f"FAIL: {rec.error}"
            print(f"[{i}] {rec.tool_name} ({rec.duration_ms:.0f}ms) -> {status}")
            print(f"    Args: {json.dumps(rec.arguments, indent=2, default=str)}")

Inspecting Parameter Mismatches

The most common tool call failure is a parameter mismatch. The model sends arguments that do not match what the function expects. This happens when tool descriptions are ambiguous:

from agents import function_tool

# Bad: ambiguous parameter name
@function_tool
def search_orders(query: str) -> str:
    """Search customer orders."""
    # Model might send a natural language query OR an order ID
    pass

# Good: explicit parameters with clear types
@function_tool
def search_orders(
    customer_email: str,
    status: str = "all",
    limit: int = 10,
) -> str:
    """Search orders by customer email.

    Args:
        customer_email: The customer email address to search for.
        status: Filter by status. One of: all, pending, shipped, delivered.
        limit: Maximum number of results to return. Default 10.
    """
    pass

When parameter mismatches occur, compare what the model sent against your function signature. Log the raw tool_calls from the API response:


async def inspect_tool_calls(response):
    for choice in response.choices:
        msg = choice.message
        if msg.tool_calls:
            for tc in msg.tool_calls:
                print(f"Tool: {tc.function.name}")
                print(f"Raw args: {tc.function.arguments}")
                try:
                    parsed = json.loads(tc.function.arguments)
                    print(f"Parsed: {json.dumps(parsed, indent=2)}")
                except json.JSONDecodeError as e:
                    print(f"INVALID JSON: {e}")
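Beyond eyeballing the raw arguments, the comparison against the function signature can be automated with the standard-library `inspect` module. A minimal sketch (the `validate_args` helper is illustrative, not part of any SDK; the `search_orders` stub mirrors the signature from the example above):

```python
import inspect

def validate_args(tool_fn, arguments: dict) -> list[str]:
    """Return a list of mismatch descriptions (empty means the arguments fit)."""
    sig = inspect.signature(tool_fn)
    problems = []
    for name in arguments:
        if name not in sig.parameters:
            problems.append(f"unexpected parameter: {name}")
    for name, param in sig.parameters.items():
        if param.default is inspect.Parameter.empty and name not in arguments:
            problems.append(f"missing required parameter: {name}")
    return problems

def search_orders(customer_email: str, status: str = "all", limit: int = 10) -> str:
    ...

print(validate_args(search_orders, {"query": "pending orders"}))
# ['unexpected parameter: query', 'missing required parameter: customer_email']
```

Running this check before dispatching the call lets you return a precise error message to the model instead of a raw `TypeError`.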

Replay Testing

Once you have captured a failed tool call, replay it in isolation to confirm the root cause:

class ToolReplayTester:
    def __init__(self, debugger: ToolDebugger):
        self.debugger = debugger

    async def replay(self, index: int, tool_registry: dict):
        record = self.debugger.call_history[index]
        tool_fn = tool_registry.get(record.tool_name)
        if not tool_fn:
            print(f"Tool '{record.tool_name}' not found in registry")
            return

        print(f"Replaying: {record.tool_name}")
        print(f"With args: {json.dumps(record.arguments, indent=2)}")
        try:
            result = await tool_fn(**record.arguments)
            print(f"Result: {result}")
        except Exception as e:
            print(f"Error: {e}")
            traceback.print_exc()

Mock Execution for Isolation

When a tool depends on external services, create mock versions that return controlled data. This isolates whether the failure is in your tool logic or the external dependency:

def create_mock_tool(tool_name: str, mock_response: Any):
    async def mock_fn(**kwargs):
        print(f"[MOCK] {tool_name} called with: {kwargs}")
        return mock_response
    return mock_fn

# Replace real tools with mocks for debugging
tool_registry = {
    "search_orders": create_mock_tool(
        "search_orders",
        {"orders": [{"id": "123", "status": "shipped"}]},
    ),
    "send_email": create_mock_tool(
        "send_email",
        {"sent": True, "message_id": "mock-001"},
    ),
}
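Mocks can also simulate failures rather than successes, which exercises the agent's error-handling path without touching a real service. A sketch along the same lines as `create_mock_tool` (the helper name and the simulated `TimeoutError` are illustrative):

```python
import asyncio

def create_failing_mock(tool_name: str, exc: Exception):
    """A mock tool that always raises, to exercise the agent's error path."""
    async def mock_fn(**kwargs):
        print(f"[MOCK] {tool_name} raising {type(exc).__name__}")
        raise exc
    return mock_fn

flaky_search = create_failing_mock("search_orders", TimeoutError("upstream timeout"))

async def main():
    try:
        await flaky_search(customer_email="jane@example.com")
    except TimeoutError as e:
        print(f"Agent error path triggered: {e}")

asyncio.run(main())
```

Swapping a failing mock into the registry is a quick way to confirm the agent surfaces a useful error message instead of hallucinating a result.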

FAQ

Why does the model sometimes send invalid JSON in tool call arguments?

This typically happens with older or smaller models when tool schemas are complex. Use strict mode in your function definitions if your API supports it, which forces the model to produce valid JSON matching your schema. Also simplify parameter types — avoid deeply nested objects when flat parameters work.
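With the OpenAI Chat Completions API, for example, strict mode looks roughly like the definition below. Note that strict mode requires `"additionalProperties": false` and every property listed in `required` (consult your SDK version's documentation for the exact shape):

```python
tool_def = {
    "type": "function",
    "function": {
        "name": "search_orders",
        "description": "Search orders by customer email.",
        "strict": True,  # forces arguments to conform to the schema exactly
        "parameters": {
            "type": "object",
            "properties": {
                "customer_email": {"type": "string"},
                "status": {
                    "type": "string",
                    "enum": ["all", "pending", "shipped", "delivered"],
                },
            },
            "required": ["customer_email", "status"],  # strict mode: list every property
            "additionalProperties": False,  # strict mode: required
        },
    },
}
```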

How do I handle the case where the model calls a tool with correct parameters but the tool returns unexpected results?

Add assertion-style checks inside your tool functions that validate the result before returning it. Log both the input parameters and the raw result from any external API your tool calls. This creates an audit trail that shows exactly where the data transformation went wrong.
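A hypothetical sketch of this pattern, validating the shape of an assumed upstream response before the tool returns it (the expected keys here are illustrative):

```python
from typing import Any

def validate_order_result(raw: dict[str, Any]) -> dict[str, Any]:
    """Check the shape of an upstream API response before returning it to the agent."""
    if "orders" not in raw:
        raise ValueError(f"upstream response missing 'orders' key: {raw!r}")
    for order in raw["orders"]:
        if "id" not in order or "status" not in order:
            raise ValueError(f"malformed order record: {order!r}")
    return raw

# Passes through a well-formed response unchanged
ok = validate_order_result({"orders": [{"id": "123", "status": "shipped"}]})
```

Failing fast with a descriptive `ValueError` here is what makes the audit trail useful: the error names the exact record that broke the contract.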

Should I let the agent retry failed tool calls automatically?

Yes, but with limits. Allow one or two retries for transient failures like network timeouts. For parameter errors, return a clear error message describing what went wrong so the model can self-correct its arguments. Never allow unlimited retries as this wastes tokens and can cause infinite loops.
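A minimal bounded-retry wrapper along these lines (the function name, backoff schedule, and choice of exceptions are illustrative, not prescriptive):

```python
import asyncio

async def call_with_retries(tool_fn, max_retries: int = 2, **kwargs):
    """Retry transient failures with backoff; surface other errors as text the model can act on."""
    for attempt in range(max_retries + 1):
        try:
            return await tool_fn(**kwargs)
        except TimeoutError:
            if attempt == max_retries:
                return "Error: tool timed out after retries"
            await asyncio.sleep(0.5 * (attempt + 1))  # simple linear backoff
        except TypeError as e:
            # Parameter errors will not fix themselves on retry --
            # tell the model what went wrong so it can self-correct
            return f"Error: invalid arguments -- {e}"
```

Returning error strings instead of raising keeps the conversation alive: the model sees the message as the tool result and can adjust its next call.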


#Debugging #ToolCalling #AIAgents #Testing #Troubleshooting #AgenticAI #LearnAI #AIEngineering


CallSphere Team

Expert insights on AI voice agents and customer communication automation.
