Testing Tool Execution: Verifying Agent Tool Calls and Side Effects
Learn how to test AI agent tool execution with tool mocking, call verification, parameter assertions, and side effect tracking using pytest in Python.
Why Tool Testing Deserves Its Own Strategy
AI agents that call tools interact with the real world — databases, APIs, file systems, payment processors. A bug in tool execution can send wrong emails, delete wrong records, or charge wrong amounts. Unlike text generation errors that are merely embarrassing, tool execution errors have real consequences.
Testing tool execution means verifying three things: the agent calls the right tool, passes the correct parameters, and your code handles the tool's response (or failure) correctly.
Building Testable Tool Interfaces
Design tools with a clean interface that separates the tool definition from its implementation.
from typing import Protocol, Any
from dataclasses import dataclass, field
class ToolExecutor(Protocol):
def execute(self, name: str, arguments: dict) -> Any: ...
@dataclass
class MockToolExecutor:
"""Records tool calls and returns predetermined responses."""
responses: dict[str, Any] = field(default_factory=dict)
call_log: list[dict] = field(default_factory=list)
def execute(self, name: str, arguments: dict) -> Any:
self.call_log.append({"name": name, "arguments": arguments})
if name in self.responses:
response = self.responses[name]
if callable(response):
return response(arguments)
return response
raise ValueError(f"No mock response configured for tool: {name}")
Injecting the executor through the constructor makes it trivial to swap the real implementation for the mock in tests.
Verifying Tool Selection
Test that the agent picks the correct tool for a given user request.
See AI Voice Agents Handle Real Calls
Book a free demo or calculate how much you can save with AI voice automation.
import pytest
from my_agent.core import Agent
from my_agent.tools import MockToolExecutor
@pytest.fixture
def mock_tools():
return MockToolExecutor(responses={
"search_orders": [{"id": 1, "status": "shipped"}],
"cancel_order": {"success": True},
"get_weather": {"temp": 72, "condition": "sunny"},
})
def test_order_query_uses_search_tool(mock_tools):
agent = Agent(tool_executor=mock_tools)
agent.run("Where is my order #12345?")
assert len(mock_tools.call_log) >= 1
tool_names = [c["name"] for c in mock_tools.call_log]
assert "search_orders" in tool_names
def test_weather_query_does_not_touch_orders(mock_tools):
agent = Agent(tool_executor=mock_tools)
agent.run("What is the weather in Chicago?")
tool_names = [c["name"] for c in mock_tools.call_log]
assert "search_orders" not in tool_names
assert "get_weather" in tool_names
Parameter Assertion Patterns
Verify that the agent extracts and passes correct parameters from the user's message.
def test_search_passes_correct_order_id(mock_tools):
agent = Agent(tool_executor=mock_tools)
agent.run("Check the status of order #98765")
search_calls = [c for c in mock_tools.call_log if c["name"] == "search_orders"]
assert len(search_calls) == 1
args = search_calls[0]["arguments"]
assert args["order_id"] == "98765" or args.get("query") == "98765"
def test_date_range_parsing(mock_tools):
agent = Agent(tool_executor=mock_tools)
agent.run("Show me all orders from last week")
search_calls = [c for c in mock_tools.call_log if c["name"] == "search_orders"]
args = search_calls[0]["arguments"]
assert "start_date" in args, "Agent should extract a date range"
assert "end_date" in args
Testing Side Effects Safely
For tools that modify state, use a spy pattern to verify the call would happen without actually executing it.
@dataclass
class SpyToolExecutor:
"""Like MockToolExecutor but also tracks which calls were 'destructive'."""
responses: dict[str, Any] = field(default_factory=dict)
call_log: list[dict] = field(default_factory=list)
destructive_tools: set = field(default_factory=lambda: {
"cancel_order", "delete_record", "send_email", "charge_payment"
})
def execute(self, name: str, arguments: dict) -> Any:
entry = {
"name": name,
"arguments": arguments,
"destructive": name in self.destructive_tools,
}
self.call_log.append(entry)
return self.responses.get(name, {"success": True})
@property
def destructive_calls(self) -> list[dict]:
return [c for c in self.call_log if c["destructive"]]
def test_cancellation_requires_confirmation(mock_tools):
"""Ensure destructive actions are not taken without confirmation."""
spy = SpyToolExecutor(responses={"cancel_order": {"success": True}})
agent = Agent(tool_executor=spy, require_confirmation=True)
result = agent.run("Cancel order #123")
# Agent should ask for confirmation, not immediately cancel
assert len(spy.destructive_calls) == 0
assert "confirm" in result.lower() or "sure" in result.lower()
Testing Tool Error Handling
Verify your agent handles tool failures gracefully.
def test_agent_handles_tool_timeout(mock_tools):
mock_tools.responses["search_orders"] = TimeoutError("API timeout")
agent = Agent(tool_executor=mock_tools)
result = agent.run("Find my order #123")
assert "error" in result.lower() or "try again" in result.lower()
assert "traceback" not in result.lower() # No leaked internals
def test_agent_handles_tool_returning_empty(mock_tools):
mock_tools.responses["search_orders"] = []
agent = Agent(tool_executor=mock_tools)
result = agent.run("Find order #999999")
assert "not found" in result.lower() or "no results" in result.lower()
FAQ
How do I test tools that call external APIs?
Use the mock executor pattern shown above for unit tests. For integration tests, use a sandbox or staging environment of the external API. Many services (Stripe, Twilio) provide test modes specifically for this purpose.
Should I test tool execution order in multi-tool chains?
Yes, when order matters. For example, an agent should search before canceling. Assert on the order of entries in call_log. When order does not matter (parallel lookups), only verify that all expected tools were called.
How do I test tools that return large or complex payloads?
Create fixture files with realistic payloads and load them as mock responses. Test that your agent correctly extracts the relevant fields from complex nested structures rather than asserting on the entire payload.
#ToolExecution #AIAgents #Testing #Pytest #Mocking #Python #AgenticAI #LearnAI #AIEngineering
CallSphere Team
Expert insights on AI voice agents and customer communication automation.
Try CallSphere AI Voice Agents
See how AI voice agents work for your industry. Live demo available -- no signup required.