
Integration Testing for AI Agent Connections: Mocking External Services and Verifying Flows

Learn how to write robust integration tests for AI agent integrations using mock servers, VCR-style recording, fixture-based testing patterns, and CI pipeline configuration to verify external service connections without hitting live APIs.

Why Integration Testing Matters for AI Agents

AI agents that connect to external services — Slack, GitHub, Stripe, Notion — have integration surfaces that unit tests cannot cover. A unit test might verify that your agent formats a Jira ticket correctly, but it cannot verify that the Jira API accepts that format, that your authentication works, or that webhook signatures validate properly. Integration tests close this gap by testing the full request-response cycle against realistic service behavior.

The challenge is testing against external APIs without making real API calls in CI, which would be slow, flaky, and expensive. The solution: mock servers and recorded interactions.

Setting Up Mock Servers with Respx

respx is a mocking library that intercepts outgoing httpx requests and returns predefined responses, which makes it ideal for testing agents built on httpx-based API clients.

import pytest
import respx
import httpx
from your_agent.github_client import GitHubClient

@pytest.fixture
def github_client():
    return GitHubClient(token="test-token-fake")

@respx.mock
@pytest.mark.asyncio
async def test_create_issue_comment(github_client):
    # Mock the GitHub API endpoint
    route = respx.post(
        "https://api.github.com/repos/owner/repo/issues/42/comments"
    ).mock(return_value=httpx.Response(
        201,
        json={
            "id": 123456,
            "body": "AI Triage: This is a bug",
            "created_at": "2026-03-17T10:00:00Z",
        },
    ))

    result = await github_client.create_issue_comment(
        owner="owner",
        repo="repo",
        issue_number=42,
        body="AI Triage: This is a bug",
    )

    assert result["id"] == 123456
    assert route.called
    # Verify the request body
    sent_body = route.calls[0].request.content
    assert b"AI Triage" in sent_body

@respx.mock
@pytest.mark.asyncio
async def test_handles_github_rate_limit(github_client):
    respx.post(
        "https://api.github.com/repos/owner/repo/issues/1/comments"
    ).mock(return_value=httpx.Response(
        429,
        headers={"Retry-After": "60"},
        json={"message": "API rate limit exceeded"},
    ))

    with pytest.raises(httpx.HTTPStatusError) as exc_info:
        await github_client.create_issue_comment(
            "owner", "repo", 1, "test"
        )
    assert exc_info.value.response.status_code == 429

VCR-Style Recording with pytest-recording

VCR records real API responses and replays them in subsequent test runs. This gives you realistic test data without the manual effort of writing fixtures.

# Install: pip install pytest-recording vcrpy
import pytest

@pytest.mark.vcr()
@pytest.mark.asyncio
async def test_fetch_pull_request_diff(github_client):
    """First run makes a real API call and records the response.
    Subsequent runs replay the recorded response."""
    diff = await github_client.get_pull_request_diff(
        owner="your-org",
        repo="your-repo",
        pr_number=100,
    )

    assert "diff --git" in diff
    assert len(diff) > 0

# Configure VCR in conftest.py
@pytest.fixture(scope="module")
def vcr_config():
    return {
        "filter_headers": [
            "authorization",  # Strip auth tokens from recordings
            "x-api-key",
        ],
        "filter_query_parameters": ["api_key"],
        "record_mode": "once",  # Record once, replay forever
        "cassette_library_dir": "tests/cassettes",
        "decode_compressed_response": True,
    }

Cassette files (YAML recordings) are committed to your repository so CI can replay them without API access.
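For orientation, a cassette stores each request/response pair as structured YAML. A simplified sketch of the shape vcrpy writes (fields abridged, values illustrative):

```yaml
interactions:
- request:
    method: GET
    uri: https://api.github.com/repos/your-org/your-repo/pulls/100
    body: null
    headers:
      accept: [application/vnd.github.v3.diff]
  response:
    status: {code: 200, message: OK}
    headers:
      content-type: [text/plain; charset=utf-8]
    body: {string: "diff --git a/app.py b/app.py\n..."}
version: 1
```

Because the `filter_headers` config above strips `authorization` before writing, committed cassettes stay free of credentials.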

Testing Webhook Signature Verification

Webhook handlers must verify signatures. Test both valid and invalid signatures to ensure security.
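Before testing the handler, it helps to see the verification logic under test. A minimal sketch of the server-side check (the function name and wiring are illustrative; your `webhook_hub` implementation may differ):

```python
import hashlib
import hmac

def verify_github_signature(payload: bytes, secret: str, signature_header: str) -> bool:
    """Recompute HMAC-SHA256 over the raw request body and compare it
    against the X-Hub-Signature-256 header value."""
    expected = "sha256=" + hmac.new(
        secret.encode(), payload, hashlib.sha256
    ).hexdigest()
    # compare_digest runs in constant time, avoiding timing side channels
    return hmac.compare_digest(expected, signature_header)
```

A handler that returns 401 when this check fails is exactly what the invalid-signature test below exercises.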


import hmac
import hashlib
import json
from fastapi.testclient import TestClient
from your_agent.webhook_hub import app

client = TestClient(app)

def generate_github_signature(payload: bytes, secret: str) -> str:
    return "sha256=" + hmac.new(
        secret.encode(), payload, hashlib.sha256
    ).hexdigest()

def test_valid_github_webhook():
    payload = json.dumps({
        "action": "opened",
        "issue": {"number": 1, "title": "Test", "body": "Bug report"},
        "sender": {"login": "testuser"},
        "repository": {"name": "repo", "owner": {"login": "owner"}},
    }).encode()

    signature = generate_github_signature(payload, "gh-secret")

    response = client.post(
        "/webhooks/github",
        content=payload,
        headers={
            "Content-Type": "application/json",
            "X-Hub-Signature-256": signature,
            "X-GitHub-Event": "issues",
        },
    )
    assert response.status_code == 200
    assert response.json()["status"] == "accepted"

def test_invalid_signature_rejected():
    payload = b'{"test": true}'
    response = client.post(
        "/webhooks/github",
        content=payload,
        headers={
            "Content-Type": "application/json",
            "X-Hub-Signature-256": "sha256=invalid",
            "X-GitHub-Event": "ping",
        },
    )
    assert response.status_code == 401

Testing the Full Agent Flow

End-to-end tests verify the complete chain: webhook received, event normalized, agent processes, action taken.

# json, httpx, respx, and pytest imports as in the earlier examples;
# the handler's module path here is illustrative
from your_agent.issue_triage import handle_issue_event

@respx.mock
@pytest.mark.asyncio
async def test_issue_triage_full_flow():
    # Mock the AI agent's LLM call
    respx.post("https://api.openai.com/v1/chat/completions").mock(
        return_value=httpx.Response(200, json={
            "choices": [{
                "message": {
                    "content": json.dumps({
                        "labels": ["bug", "high-priority"],
                        "priority": "P1",
                        "comment": "This appears to be a critical bug.",
                    })
                }
            }]
        })
    )

    # Mock the GitHub label and comment APIs
    label_route = respx.post(
        "https://api.github.com/repos/owner/repo/issues/5/labels"
    ).mock(return_value=httpx.Response(200, json=[]))

    comment_route = respx.post(
        "https://api.github.com/repos/owner/repo/issues/5/comments"
    ).mock(return_value=httpx.Response(201, json={"id": 999}))

    # Simulate the webhook
    payload = {
        "action": "opened",
        "issue": {
            "number": 5,
            "title": "App crashes on login",
            "body": "After the latest update the app crashes.",
        },
        "sender": {"login": "reporter"},
        "repository": {
            "name": "repo",
            "owner": {"login": "owner"},
        },
    }

    await handle_issue_event(payload)

    assert label_route.called
    assert comment_route.called
    comment_body = json.loads(comment_route.calls[0].request.content)
    assert "P1" in comment_body["body"]

CI Pipeline Configuration

Configure your CI to run integration tests with proper environment setup.

# .github/workflows/integration-tests.yml
name: Integration Tests

on:
  push:
    branches: [main]
  pull_request:

jobs:
  integration:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install -e '.[test]'
      - run: pytest tests/integration/ -v --tb=short
        env:
          TESTING: "true"
          WEBHOOK_SECRET: test-secret

The key principles: never use real API keys in CI, commit VCR cassettes alongside tests, and separate integration tests from unit tests so they can run on different schedules.
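One common way to keep the suites separate is a pytest marker registered in `pyproject.toml` (marker name and paths are illustrative):

```toml
[tool.pytest.ini_options]
markers = [
    "integration: tests that exercise mocked external services",
]
# Fast CI job runs only unit tests:
#   pytest -m "not integration"
# Integration suite runs on its own schedule:
#   pytest -m integration tests/integration/
```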

FAQ

When should I use mock servers versus VCR recordings?

Use mock servers (respx, responses) when you need precise control over edge cases — rate limits, timeouts, malformed responses, and error codes. Use VCR recordings when you want to capture realistic API behavior including complex response structures and headers. Many teams use both: VCR for happy-path tests and mocks for error-case tests.

How do I keep VCR cassettes from becoming stale?

Set up a scheduled CI job (weekly or monthly) that runs tests in "record" mode against the real APIs using a test account. This refreshes the cassettes and catches API changes early. Also configure cassette expiration so tests fail loudly if a recording is older than a set threshold, prompting a re-record.

Should I test the actual LLM responses or mock them?

Mock LLM responses for deterministic integration tests. Real LLM calls are non-deterministic, slow, and expensive — they make tests flaky. Mock the LLM with fixed responses that represent the structured output your agent expects, then test that your code correctly processes those outputs into API calls. Test the LLM integration separately with a small set of evaluation tests that run on a less frequent schedule.
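A mocked LLM response only needs to be a fixed structured payload, plus a validation step before your code turns it into API calls. A sketch (the shape mirrors the triage example above; the parser name is illustrative):

```python
import json

# Deterministic stand-in for the LLM's structured triage output
FAKE_TRIAGE_JSON = json.dumps({
    "labels": ["bug", "high-priority"],
    "priority": "P1",
    "comment": "This appears to be a critical bug.",
})

def parse_triage(raw: str) -> dict:
    """Validate the structured output before acting on it, so malformed
    LLM responses fail fast instead of producing bad API calls."""
    data = json.loads(raw)
    missing = {"labels", "priority", "comment"} - data.keys()
    if missing:
        raise ValueError(f"LLM output missing keys: {missing}")
    return data
```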



CallSphere Team
