Testing Tool Execution: Verifying Agent Tool Calls and Side Effects

Why Tool Testing Deserves Its Own Strategy

AI agents that call tools interact with the real world — databases, APIs, file systems, payment processors. A bug in tool execution can send wrong emails, delete wrong records, or charge wrong amounts. Unlike text generation errors that are merely embarrassing, tool execution errors have real consequences.

Testing tool execution means verifying three things: the agent calls the right tool, passes the correct parameters, and your code handles the tool's response (or failure) correctly.

Building Testable Tool Interfaces

Design tools with a clean interface that separates the tool definition from its implementation.

from typing import Protocol, Any
from dataclasses import dataclass, field

class ToolExecutor(Protocol):
    def execute(self, name: str, arguments: dict) -> Any: ...

@dataclass
class MockToolExecutor:
    """Records tool calls and returns predetermined responses."""
    responses: dict[str, Any] = field(default_factory=dict)
    call_log: list[dict] = field(default_factory=list)

    def execute(self, name: str, arguments: dict) -> Any:
        self.call_log.append({"name": name, "arguments": arguments})
        if name in self.responses:
            response = self.responses[name]
            if callable(response):
                return response(arguments)
            return response
        raise ValueError(f"No mock response configured for tool: {name}")

Injecting the executor through the constructor makes it trivial to swap the real implementation for the mock in tests.

Verifying Tool Selection

Test that the agent picks the correct tool for a given user request.

See AI Voice Agents Handle Real Calls

Book a free demo or calculate how much you can save with AI voice automation.

Book a Demo ROI Calculator

import pytest
from my_agent.core import Agent
from my_agent.tools import MockToolExecutor

@pytest.fixture
def mock_tools():
    return MockToolExecutor(responses={
        "search_orders": [{"id": 1, "status": "shipped"}],
        "cancel_order": {"success": True},
        "get_weather": {"temp": 72, "condition": "sunny"},
    })

def test_order_query_uses_search_tool(mock_tools):
    agent = Agent(tool_executor=mock_tools)
    agent.run("Where is my order #12345?")

    assert len(mock_tools.call_log) >= 1
    tool_names = [c["name"] for c in mock_tools.call_log]
    assert "search_orders" in tool_names

def test_weather_query_does_not_touch_orders(mock_tools):
    agent = Agent(tool_executor=mock_tools)
    agent.run("What is the weather in Chicago?")

    tool_names = [c["name"] for c in mock_tools.call_log]
    assert "search_orders" not in tool_names
    assert "get_weather" in tool_names

Parameter Assertion Patterns

Verify that the agent extracts and passes correct parameters from the user's message.

def test_search_passes_correct_order_id(mock_tools):
    agent = Agent(tool_executor=mock_tools)
    agent.run("Check the status of order #98765")

    search_calls = [c for c in mock_tools.call_log if c["name"] == "search_orders"]
    assert len(search_calls) == 1
    args = search_calls[0]["arguments"]
    assert args["order_id"] == "98765" or args.get("query") == "98765"

def test_date_range_parsing(mock_tools):
    agent = Agent(tool_executor=mock_tools)
    agent.run("Show me all orders from last week")

    search_calls = [c for c in mock_tools.call_log if c["name"] == "search_orders"]
    args = search_calls[0]["arguments"]
    assert "start_date" in args, "Agent should extract a date range"
    assert "end_date" in args

Testing Side Effects Safely

For tools that modify state, use a spy pattern to verify the call would happen without actually executing it.

@dataclass
class SpyToolExecutor:
    """Like MockToolExecutor but also tracks which calls were 'destructive'."""
    responses: dict[str, Any] = field(default_factory=dict)
    call_log: list[dict] = field(default_factory=list)
    destructive_tools: set = field(default_factory=lambda: {
        "cancel_order", "delete_record", "send_email", "charge_payment"
    })

    def execute(self, name: str, arguments: dict) -> Any:
        entry = {
            "name": name,
            "arguments": arguments,
            "destructive": name in self.destructive_tools,
        }
        self.call_log.append(entry)
        return self.responses.get(name, {"success": True})

    @property
    def destructive_calls(self) -> list[dict]:
        return [c for c in self.call_log if c["destructive"]]

def test_cancellation_requires_confirmation(mock_tools):
    """Ensure destructive actions are not taken without confirmation."""
    spy = SpyToolExecutor(responses={"cancel_order": {"success": True}})
    agent = Agent(tool_executor=spy, require_confirmation=True)

    result = agent.run("Cancel order #123")

    # Agent should ask for confirmation, not immediately cancel
    assert len(spy.destructive_calls) == 0
    assert "confirm" in result.lower() or "sure" in result.lower()

Testing Tool Error Handling

Verify your agent handles tool failures gracefully.

def test_agent_handles_tool_timeout(mock_tools):
    mock_tools.responses["search_orders"] = TimeoutError("API timeout")
    agent = Agent(tool_executor=mock_tools)

    result = agent.run("Find my order #123")

    assert "error" in result.lower() or "try again" in result.lower()
    assert "traceback" not in result.lower()  # No leaked internals

def test_agent_handles_tool_returning_empty(mock_tools):
    mock_tools.responses["search_orders"] = []
    agent = Agent(tool_executor=mock_tools)

    result = agent.run("Find order #999999")

    assert "not found" in result.lower() or "no results" in result.lower()

FAQ

How do I test tools that call external APIs?

Use the mock executor pattern shown above for unit tests. For integration tests, use a sandbox or staging environment of the external API. Many services (Stripe, Twilio) provide test modes specifically for this purpose.

Should I test tool execution order in multi-tool chains?

Yes, when order matters. For example, an agent should search before canceling. Assert on the order of entries in call_log. When order does not matter (parallel lookups), only verify that all expected tools were called.

How do I test tools that return large or complex payloads?

Create fixture files with realistic payloads and load them as mock responses. Test that your agent correctly extracts the relevant fields from complex nested structures rather than asserting on the entire payload.

#ToolExecution #AIAgents #Testing #Pytest #Mocking #Python #AgenticAI #LearnAI #AIEngineering

Testing Tool Execution: Verifying Agent Tool Calls and Side Effects

Why Tool Testing Deserves Its Own Strategy

Building Testable Tool Interfaces

Verifying Tool Selection

Parameter Assertion Patterns

Testing Side Effects Safely

Testing Tool Error Handling

FAQ

How do I test tools that call external APIs?

Should I test tool execution order in multi-tool chains?

How do I test tools that return large or complex payloads?

Try CallSphere AI Voice Agents

Related Articles

WebArena and Real-World Web Agent Benchmarks: How We Measure Browser Agent Performance

Taking Screenshots and Recording Videos with Playwright for AI Analysis

Playwright Selectors Deep Dive: CSS, XPath, Text, and Role-Based Element Finding