
Unit Testing AI Agents: Mocking LLM Calls for Fast, Deterministic Tests

Learn how to mock LLM API calls in your AI agent tests using FakeLLM objects, response fixtures, and assertion patterns for fast, deterministic, cost-free unit tests.

Why Unit Testing Agents Requires Special Patterns

AI agents depend on LLM calls that are non-deterministic, slow, and expensive. A single GPT-4 call takes 2-10 seconds and bills for every token, making it impractical to run hundreds of tests on every commit. Unit tests must be fast, free, and repeatable, which means you need a strategy for replacing real LLM calls with controlled substitutes.

The core challenge is that LLM outputs vary between calls even with temperature=0. Your tests need to verify your agent's logic — tool selection, state management, output parsing — without coupling to the exact wording an LLM produces.

Strategy 1: FakeLLM Classes

Create a drop-in replacement for your LLM client that returns predetermined responses.

from dataclasses import dataclass, field
from typing import Any

@dataclass
class FakeLLM:
    """A deterministic LLM replacement for unit tests."""
    responses: list[str] = field(default_factory=list)
    call_log: list[dict] = field(default_factory=list)
    _call_index: int = 0

    def chat(self, messages: list[dict], **kwargs) -> dict:
        self.call_log.append({"messages": messages, **kwargs})
        if self._call_index >= len(self.responses):
            raise AssertionError(
                f"FakeLLM ran out of responses after {self._call_index} calls"
            )
        response = self.responses[self._call_index]
        self._call_index += 1
        return {"role": "assistant", "content": response}

This pattern lets you pre-load a sequence of responses and later inspect exactly what your agent sent to the LLM.
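For example, a test can pre-load two responses, drive two calls, and then assert on both the returned content and the recorded calls. A minimal copy of the FakeLLM class is included so the snippet runs standalone:

```python
from dataclasses import dataclass, field

@dataclass
class FakeLLM:
    """Minimal copy of the FakeLLM pattern above, so this snippet is self-contained."""
    responses: list[str] = field(default_factory=list)
    call_log: list[dict] = field(default_factory=list)
    _call_index: int = 0

    def chat(self, messages: list[dict], **kwargs) -> dict:
        self.call_log.append({"messages": messages, **kwargs})
        response = self.responses[self._call_index]
        self._call_index += 1
        return {"role": "assistant", "content": response}

# Pre-load the sequence of responses the agent will receive.
fake = FakeLLM(responses=["First answer", "Second answer"])

reply = fake.chat([{"role": "user", "content": "hello"}], temperature=0)
assert reply["content"] == "First answer"

reply = fake.chat([{"role": "user", "content": "follow-up"}])
assert reply["content"] == "Second answer"

# The call log records exactly what was sent, including kwargs.
assert len(fake.call_log) == 2
assert fake.call_log[0]["temperature"] == 0
```

In a real suite you would pass `fake` into your agent's constructor in place of the production client, run the agent, and then assert on `fake.call_log`.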

Strategy 2: Response Fixtures with pytest

Store realistic LLM responses as fixtures so multiple tests can share them.


import pytest
import json
from pathlib import Path

@pytest.fixture
def tool_call_response():
    """Fixture simulating an LLM response that invokes a tool."""
    return {
        "role": "assistant",
        "content": None,
        "tool_calls": [
            {
                "id": "call_abc123",
                "type": "function",
                "function": {
                    "name": "search_database",
                    "arguments": json.dumps({"query": "open tickets", "limit": 10}),
                },
            }
        ],
    }

@pytest.fixture
def fixture_dir():
    return Path(__file__).parent / "fixtures"

def load_fixture(fixture_dir: Path, name: str) -> dict:
    return json.loads((fixture_dir / f"{name}.json").read_text())

Storing fixtures as JSON files in a tests/fixtures/ directory keeps tests clean and makes it easy to update expected responses when your prompts change.
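A quick self-contained sketch of that layout, using a temporary directory in place of tests/fixtures/ (the load_fixture helper is repeated so the snippet runs on its own):

```python
import json
import tempfile
from pathlib import Path

def load_fixture(fixture_dir: Path, name: str) -> dict:
    """Load a stored LLM response by name from the fixtures directory."""
    return json.loads((fixture_dir / f"{name}.json").read_text())

with tempfile.TemporaryDirectory() as d:
    fixture_dir = Path(d)

    # Write a fixture the way you would store it under tests/fixtures/.
    tool_call = {
        "role": "assistant",
        "content": None,
        "tool_calls": [{
            "id": "call_abc123",
            "type": "function",
            "function": {
                "name": "search_database",
                "arguments": json.dumps({"query": "open tickets", "limit": 10}),
            },
        }],
    }
    (fixture_dir / "tool_call.json").write_text(json.dumps(tool_call))

    # A test would load it by name and feed it to the agent under test.
    response = load_fixture(fixture_dir, "tool_call")
    assert response["tool_calls"][0]["function"]["name"] == "search_database"
```

When a prompt change alters the expected response shape, you update one JSON file instead of editing every test that consumes it.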

Strategy 3: Patching with unittest.mock

Use unittest.mock.patch to intercept LLM calls at the boundary.

from unittest.mock import patch, MagicMock
from my_agent.core import Agent

def test_agent_extracts_entities():
    fake_response = MagicMock()
    fake_response.choices = [
        MagicMock(message=MagicMock(
            content='{"entities": ["Acme Corp", "Jane Doe"]}',
            tool_calls=None,
        ))
    ]

    with patch("my_agent.core.openai_client.chat.completions.create") as mock_create:
        mock_create.return_value = fake_response
        agent = Agent()
        result = agent.extract_entities("Contact Jane Doe at Acme Corp")

    assert result == ["Acme Corp", "Jane Doe"]
    mock_create.assert_called_once()
    call_args = mock_create.call_args
    assert any("extract" in str(m) for m in call_args.kwargs["messages"])

Assertion Patterns for Agent Tests

Focus your assertions on what your code controls, not on LLM output text.

def test_agent_selects_correct_tool(fake_llm):
    """Verify the agent passes the right tools to the LLM."""
    fake_llm.responses = ['{"action": "search", "query": "test"}']
    agent = Agent(llm=fake_llm)

    agent.run("Find recent orders")

    call = fake_llm.call_log[0]
    tool_names = [t["function"]["name"] for t in call["tools"]]
    assert "search_orders" in tool_names
    assert "delete_account" not in tool_names  # safety check

def test_agent_retries_on_parse_failure(fake_llm):
    """Verify retry logic when LLM returns malformed JSON."""
    fake_llm.responses = ["not json", '{"action": "search"}']
    agent = Agent(llm=fake_llm, max_retries=2)

    result = agent.run("Find orders")

    assert len(fake_llm.call_log) == 2  # retried once
    assert result["action"] == "search"

FAQ

How do I handle streaming responses in unit tests?

Create an async generator fixture that yields predetermined chunks. Replace the streaming client method with this generator using patch. This lets you test your chunk-assembly logic without a real stream.
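A minimal sketch of that idea, assuming your agent has some chunk-assembly function (here a hypothetical assemble_stream) that you want to exercise without opening a real stream:

```python
import asyncio

async def fake_stream(chunks: list[str]):
    """Async generator standing in for a streaming LLM client."""
    for chunk in chunks:
        yield chunk

async def assemble_stream(stream) -> str:
    """Hypothetical chunk-assembly logic under test: concatenate chunk text."""
    parts = []
    async for chunk in stream:
        parts.append(chunk)
    return "".join(parts)

def test_streaming_assembly():
    result = asyncio.run(assemble_stream(fake_stream(["Hel", "lo ", "world"])))
    assert result == "Hello world"

test_streaming_assembly()
```

In a real suite you would patch the client's streaming method to return fake_stream(...) and run the test under pytest (with pytest-asyncio, or via asyncio.run as above).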

Should I use temperature=0 instead of mocking?

Setting temperature=0 reduces variance but does not eliminate it — model updates can still change outputs. It also still costs tokens and takes seconds per call. Use temperature=0 for integration tests, but always mock for unit tests.

How many response fixtures should I maintain?

Keep a small, representative set: one normal response, one tool-call response, one refusal, one malformed response, and one empty response. Five to ten fixtures cover most agent logic paths without becoming a maintenance burden.


#UnitTesting #AIAgents #Mocking #Pytest #Python #Testing #AgenticAI #LearnAI #AIEngineering


CallSphere Team
