CI/CD for AI Agents: Automated Testing, Deployment, and Rollback Strategies
Learn how to build CI/CD pipelines for AI agents with prompt regression tests, tool integration tests, canary deployments, and automated rollback on quality degradation.
Why Traditional CI/CD Breaks for AI Agents
Traditional CI/CD pipelines test deterministic software: given the same input, the code produces the same output. Run the tests, check the assertions, deploy if green. AI agents break this model in three fundamental ways.
First, agent outputs are non-deterministic. The same prompt can produce different responses across runs, even at temperature zero, due to floating-point non-determinism in GPU inference. Your test assertions cannot be exact string matches.
Second, agents have more failure modes than traditional software. A code bug produces an error. An agent bug produces a confident, plausible, wrong answer. Your tests must evaluate quality, not just correctness.
Third, agent behavior depends on components outside your codebase: model versions, retrieval indexes, external API responses, and tool function behavior. A deployment that changes none of your code can still break your agent if the underlying model was updated.
Building CI/CD for agents means rethinking what "testing" means, what "deployment" means, and what "rollback" means.
The Agent Testing Pyramid
Just as traditional software has unit tests, integration tests, and end-to-end tests, agents need a testing pyramid with three layers: tool unit tests, agent integration tests, and evaluation benchmarks.
Tool unit tests verify that each tool function works correctly in isolation. These are traditional deterministic tests — give the tool an input, check the output. They run fast and catch most regressions.
Agent integration tests verify that the agent calls the right tools with the right parameters for a given user input. These are semi-deterministic — you assert on tool-call behavior, not on the final text output.
Evaluation benchmarks measure the end-to-end quality of the agent's responses against a curated dataset. These are statistical — you track aggregate metrics like accuracy, groundedness, and relevance, and you alert on regressions beyond a threshold.
# Layer 1: Tool unit tests (deterministic)
import pytest

from agent.tools import search_knowledge_base, create_ticket


@pytest.mark.asyncio
async def test_search_knowledge_base_returns_results():
    """Tool returns structured results for a valid query."""
    results = await search_knowledge_base(query="password reset", max_results=3)
    assert len(results) <= 3
    assert all("title" in r and "content" in r for r in results)
    assert all(isinstance(r["relevance_score"], float) for r in results)


@pytest.mark.asyncio
async def test_search_knowledge_base_empty_query():
    """Tool returns empty list for empty query, not an error."""
    results = await search_knowledge_base(query="", max_results=3)
    assert results == []


@pytest.mark.asyncio
async def test_create_ticket_validates_priority():
    """Tool rejects invalid priority values."""
    with pytest.raises(ValueError, match="priority must be one of"):
        await create_ticket(
            customer_id="cust_123",
            summary="Test issue",
            priority="super_urgent",  # Invalid
        )
# Layer 2: Agent integration tests (semi-deterministic)
# build_test_agent() is a test helper (not shown) that wires the agent
# with test doubles for external services.
@pytest.mark.asyncio
async def test_agent_calls_search_for_how_to_question():
    """Agent should use search tool when user asks a how-to question."""
    agent = build_test_agent()
    response = await agent.run("How do I reset my password?")

    # Assert the agent called the right tool
    tool_calls = response.get_tool_calls()
    assert len(tool_calls) >= 1
    assert any(tc.name == "search_knowledge_base" for tc in tool_calls)

    # Assert the search query is relevant (not an exact match)
    search_call = next(tc for tc in tool_calls if tc.name == "search_knowledge_base")
    assert "password" in search_call.arguments["query"].lower()


@pytest.mark.asyncio
async def test_agent_creates_ticket_for_bug_report():
    """Agent should create a ticket when user reports a bug."""
    agent = build_test_agent()
    response = await agent.run(
        "I found a bug: the export button crashes when I have more than 100 rows"
    )

    tool_calls = response.get_tool_calls()
    ticket_calls = [tc for tc in tool_calls if tc.name == "create_ticket"]
    assert len(ticket_calls) == 1
    assert ticket_calls[0].arguments["priority"] in ["medium", "high"]


@pytest.mark.asyncio
async def test_agent_does_not_create_ticket_for_faq():
    """Agent should NOT create a ticket for a simple FAQ question."""
    agent = build_test_agent()
    response = await agent.run("What are your business hours?")

    tool_calls = response.get_tool_calls()
    ticket_calls = [tc for tc in tool_calls if tc.name == "create_ticket"]
    assert len(ticket_calls) == 0  # No ticket for FAQ questions
Evaluation Benchmarks: The Quality Gate
Evaluation benchmarks are the most important and least intuitive part of agent CI/CD. You build a dataset of 50-200 test cases, each with a user input, expected tool calls, reference answer, and quality criteria. The pipeline runs the agent against this dataset and computes aggregate metrics.
# Layer 3: Evaluation benchmark pipeline
import json
from dataclasses import dataclass
from pathlib import Path


@dataclass
class EvalCase:
    id: str
    user_input: str
    expected_tools: list[str]     # Tool names the agent should call
    reference_answer: str         # Ground truth for comparison
    required_facts: list[str]     # Facts that must appear in the response
    forbidden_content: list[str]  # Content that must NOT appear


@dataclass
class EvalResult:
    case_id: str
    tool_call_accuracy: float  # Did the agent call the right tools?
    factual_coverage: float    # What fraction of required facts appeared?
    safety_pass: bool          # No forbidden content present?
    groundedness_score: float  # Is the response supported by tool results?
    relevance_score: float     # Does the response address the question?


def load_eval_dataset(path: str) -> list[EvalCase]:
    data = json.loads(Path(path).read_text())
    return [EvalCase(**case) for case in data]


async def run_evaluation(agent, dataset: list[EvalCase]) -> dict[str, float]:
    """Run the agent against all eval cases and compute aggregate metrics."""
    results: list[EvalResult] = []

    for case in dataset:
        response = await agent.run(case.user_input)
        tool_calls = response.get_tool_calls()

        # Tool call accuracy
        called_tools = {tc.name for tc in tool_calls}
        expected_tools = set(case.expected_tools)
        tool_accuracy = len(called_tools & expected_tools) / max(len(expected_tools), 1)

        # Factual coverage
        response_text = response.text.lower()
        facts_found = sum(1 for fact in case.required_facts if fact.lower() in response_text)
        fact_coverage = facts_found / max(len(case.required_facts), 1)

        # Safety check
        safety_pass = not any(
            forbidden.lower() in response_text
            for forbidden in case.forbidden_content
        )

        # LLM-as-judge for groundedness and relevance
        groundedness = await llm_judge_groundedness(response.text, tool_calls)
        relevance = await llm_judge_relevance(response.text, case.user_input)

        results.append(EvalResult(
            case_id=case.id,
            tool_call_accuracy=tool_accuracy,
            factual_coverage=fact_coverage,
            safety_pass=safety_pass,
            groundedness_score=groundedness,
            relevance_score=relevance,
        ))

    # Aggregate metrics
    n = len(results)
    return {
        "tool_call_accuracy": sum(r.tool_call_accuracy for r in results) / n,
        "factual_coverage": sum(r.factual_coverage for r in results) / n,
        "safety_pass_rate": sum(1 for r in results if r.safety_pass) / n,
        "groundedness": sum(r.groundedness_score for r in results) / n,
        "relevance": sum(r.relevance_score for r in results) / n,
    }
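To make the dataset format concrete, here is one illustrative eval case in the JSON shape that load_eval_dataset expects. The scenario, facts, and wording are hypothetical examples, not real benchmark data.

```python
# One illustrative case from the eval dataset consumed by load_eval_dataset.
# The scenario and values are made up for illustration.
import json

case_json = """
{
  "id": "case_password_reset_001",
  "user_input": "How do I reset my password?",
  "expected_tools": ["search_knowledge_base"],
  "reference_answer": "Open Settings > Security, click 'Reset password', and a reset link is emailed within a few minutes.",
  "required_facts": ["Settings", "reset link"],
  "forbidden_content": ["cannot help with that"]
}
"""

case = json.loads(case_json)
# The keys line up with the EvalCase dataclass fields, so EvalCase(**case) works.
assert set(case) == {
    "id", "user_input", "expected_tools",
    "reference_answer", "required_facts", "forbidden_content",
}
```

Each case is a distinct scenario; the required_facts and forbidden_content lists drive the factual coverage and safety checks respectively.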
The CI/CD Pipeline Configuration
With the three test layers defined, the pipeline ties them together. Tool tests run on every commit. Integration tests run on every pull request. Evaluation benchmarks run before every production deployment.
# .github/workflows/agent-ci-cd.yaml
name: Agent CI/CD Pipeline

on:
  push:
    branches: [main, develop]
  pull_request:
    branches: [main]

env:
  AGENT_MODEL: gemini-2.0-pro
  EVAL_DATASET: tests/eval/benchmark_v3.json

jobs:
  tool-unit-tests:
    name: Tool Unit Tests
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install -r requirements.txt -r requirements-test.txt
      - run: pytest tests/tools/ -v --tb=short
        env:
          DATABASE_URL: ${{ secrets.TEST_DATABASE_URL }}

  agent-integration-tests:
    name: Agent Integration Tests
    needs: tool-unit-tests
    runs-on: ubuntu-latest
    if: github.event_name == 'pull_request'
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install -r requirements.txt -r requirements-test.txt
      - run: pytest tests/agent/ -v --tb=short -x
        env:
          LLM_API_KEY: ${{ secrets.LLM_API_KEY }}

  evaluation-benchmark:
    name: Evaluation Benchmark
    needs: [tool-unit-tests, agent-integration-tests]
    runs-on: ubuntu-latest
    # Integration tests only run on pull requests, so they are skipped on a
    # direct push to main. Run the benchmark as long as unit tests passed and
    # integration tests did not fail.
    if: always() && github.ref == 'refs/heads/main' && needs.tool-unit-tests.result == 'success' && needs.agent-integration-tests.result != 'failure'
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install -r requirements.txt -r requirements-test.txt
      - name: Run evaluation benchmark
        id: eval
        run: |
          python -m agent.evaluate \
            --dataset ${{ env.EVAL_DATASET }} \
            --output results.json \
            --model ${{ env.AGENT_MODEL }}
        env:
          LLM_API_KEY: ${{ secrets.LLM_API_KEY }}
      - name: Check quality gates
        run: |
          python scripts/check_quality_gates.py \
            --results results.json \
            --min-tool-accuracy 0.85 \
            --min-factual-coverage 0.80 \
            --min-safety-rate 0.99 \
            --min-groundedness 0.80 \
            --min-relevance 0.80
      - name: Compare with baseline
        run: |
          python scripts/compare_with_baseline.py \
            --current results.json \
            --baseline baselines/production.json \
            --max-regression 0.05
      - name: Upload eval results for later jobs
        uses: actions/upload-artifact@v4
        with:
          name: eval-results
          path: results.json

  deploy-canary:
    name: Canary Deployment
    needs: evaluation-benchmark
    runs-on: ubuntu-latest
    if: github.ref == 'refs/heads/main'
    steps:
      - uses: actions/checkout@v4
      - name: Deploy canary (10% traffic)
        run: |
          kubectl set image deployment/agent-canary \
            agent=agent-image:${{ github.sha }}
          kubectl scale deployment/agent-canary --replicas=1
      - name: Monitor canary for 30 minutes
        run: |
          python scripts/monitor_canary.py \
            --duration 1800 \
            --metrics-endpoint ${{ secrets.METRICS_URL }} \
            --error-threshold 0.05 \
            --latency-p99-threshold 5000

  promote-or-rollback:
    name: Promote or Rollback
    needs: deploy-canary
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Download eval results
        uses: actions/download-artifact@v4
        with:
          name: eval-results
      - name: Check canary health
        id: health
        run: python scripts/check_canary_health.py --output health.json
      - name: Promote to production
        if: steps.health.outputs.healthy == 'true'
        run: |
          kubectl set image deployment/agent-production \
            agent=agent-image:${{ github.sha }}
          kubectl rollout status deployment/agent-production --timeout=300s
          # Update the baseline for future comparisons (commit or publish this
          # file so the next run compares against it)
          cp results.json baselines/production.json
      - name: Rollback canary
        if: steps.health.outputs.healthy == 'false'
        run: |
          kubectl rollout undo deployment/agent-canary
          echo "::error::Canary deployment failed health checks. Rolled back."
          exit 1
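The workflow shells out to helper scripts that are not shown. As a reference point, here is a minimal sketch of what scripts/check_quality_gates.py could look like. It assumes results.json contains the aggregate metrics dictionary produced by run_evaluation, and the flag names mirror the workflow step; treat it as an illustration, not a drop-in implementation.

```python
# scripts/check_quality_gates.py (hypothetical sketch): exit non-zero if any
# aggregate metric in results.json falls below its gate, failing the build.
import argparse
import json
import sys

# Maps each CLI gate flag to the metric key written by run_evaluation
GATES = {
    "min_tool_accuracy": "tool_call_accuracy",
    "min_factual_coverage": "factual_coverage",
    "min_safety_rate": "safety_pass_rate",
    "min_groundedness": "groundedness",
    "min_relevance": "relevance",
}


def failed_gates(results: dict[str, float], thresholds: dict[str, float]) -> list[str]:
    """Return human-readable gate failures; an empty list means all gates passed."""
    return [
        f"{metric}={results[metric]:.3f} is below the {thresholds[flag]:.2f} gate"
        for flag, metric in GATES.items()
        if results[metric] < thresholds[flag]
    ]


def main() -> int:
    parser = argparse.ArgumentParser()
    parser.add_argument("--results", required=True)
    for flag in GATES:
        parser.add_argument(f"--{flag.replace('_', '-')}", type=float, required=True)
    args = vars(parser.parse_args())

    with open(args["results"]) as f:
        results = json.load(f)

    failures = failed_gates(results, {flag: args[flag] for flag in GATES})
    for failure in failures:
        print(f"QUALITY GATE FAILED: {failure}")
    return 1 if failures else 0


if __name__ == "__main__":
    sys.exit(main())
```

compare_with_baseline.py follows the same shape, except it reads two results files and checks each metric against the baseline minus the allowed regression.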
Canary Deployments and Automated Rollback
Canary deployments are critical for agents because agent failures are often subtle. A broken agent does not return HTTP 500 — it returns a polite, confident, wrong answer. You cannot detect this with standard health checks. Instead, you need quality-aware canary monitoring.
The canary monitor tracks three signal types: error rates (explicit failures), latency percentiles (degraded performance), and quality scores (evaluated by a judge model on a sample of live traffic). If any signal crosses its threshold during the canary window, the pipeline automatically rolls back.
# Canary monitoring with quality-aware rollback
import asyncio
from datetime import datetime, timedelta, timezone

import httpx


async def monitor_canary(
    metrics_url: str,
    duration_seconds: int,
    error_threshold: float = 0.05,
    latency_p99_threshold_ms: float = 5000,
    quality_threshold: float = 0.75,
    check_interval: int = 60,
) -> bool:
    """
    Monitor canary deployment health. Returns True if healthy, False if rollback needed.
    """
    end_time = datetime.now(timezone.utc) + timedelta(seconds=duration_seconds)

    async with httpx.AsyncClient() as client:
        while datetime.now(timezone.utc) < end_time:
            # Fetch metrics from Prometheus/Grafana
            now = datetime.now(timezone.utc)
            metrics = await client.get(f"{metrics_url}/api/v1/query_range", params={
                "query": "agent_canary_metrics",
                "start": (now - timedelta(minutes=5)).isoformat(),
                "end": now.isoformat(),
                "step": "30s",
            })
            data = metrics.json()

            # extract_metric (implementation omitted) pulls the latest value
            # for a named series out of the Prometheus response
            error_rate = extract_metric(data, "error_rate")
            latency_p99 = extract_metric(data, "latency_p99_ms")
            quality_score = extract_metric(data, "quality_score_avg")

            print(f"[{now.isoformat()}] "
                  f"errors={error_rate:.3f} "
                  f"p99={latency_p99:.0f}ms "
                  f"quality={quality_score:.3f}")

            if error_rate > error_threshold:
                print(f"ERROR RATE {error_rate:.3f} exceeds threshold {error_threshold}")
                return False
            if latency_p99 > latency_p99_threshold_ms:
                print(f"LATENCY P99 {latency_p99:.0f}ms exceeds threshold {latency_p99_threshold_ms}ms")
                return False
            # A zero quality score means no judged samples yet, not bad quality
            if 0 < quality_score < quality_threshold:
                print(f"QUALITY SCORE {quality_score:.3f} below threshold {quality_threshold}")
                return False

            await asyncio.sleep(check_interval)

    print("Canary monitoring completed successfully")
    return True
Prompt Versioning and Regression Testing
Prompt changes are the most common source of agent regressions. A small change in wording can dramatically alter tool-calling behavior or response quality. Treat prompts as code: version them, review them in pull requests, and run regression tests before merging.
Store prompts in version-controlled files with metadata: a semantic version number, a changelog, and the evaluation benchmark results at the time of the last change. This creates a complete history of prompt evolution and its impact on quality.
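For example, a version-controlled prompt file carrying this metadata might look like the following. The layout and field names are illustrative, not a standard format.

```yaml
# prompts/support_agent.yaml -- illustrative layout, not a standard format
version: 2.3.0
changelog:
  - "2.3.0: tightened ticket-creation criteria to reduce false positives"
  - "2.2.0: added escalation instructions for billing disputes"
benchmark_results:  # metrics from the eval run at the time of the last change
  tool_call_accuracy: 0.91
  factual_coverage: 0.87
  safety_pass_rate: 1.00
  groundedness: 0.84
  relevance: 0.88
system_prompt: |
  You are a customer support agent. Use search_knowledge_base to answer
  how-to questions. Create a ticket only for bug reports and account issues
  you cannot resolve directly.
```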
The regression test compares the new prompt version against the current production prompt on the same evaluation dataset. If any metric drops by more than the allowed regression threshold (typically 3-5%), the pull request is blocked.
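The core of that regression check can be sketched in a few lines. The metric dictionaries are assumed to be the aggregate results produced by an evaluation run; the function name is hypothetical.

```python
# Hypothetical sketch of the prompt regression gate: compare the candidate
# prompt's eval metrics against the production prompt's on the same dataset,
# and block the PR if any metric regresses beyond the allowed threshold.
def find_regressions(
    candidate: dict[str, float],
    production: dict[str, float],
    max_regression: float = 0.05,
) -> list[str]:
    """Return the metrics that dropped more than max_regression versus production."""
    return [
        metric
        for metric, prod_score in production.items()
        if candidate.get(metric, 0.0) < prod_score - max_regression
    ]


prod = {"tool_call_accuracy": 0.91, "groundedness": 0.86}
cand = {"tool_call_accuracy": 0.90, "groundedness": 0.78}
print(find_regressions(cand, prod))  # ['groundedness']
```

A small drop in one metric (0.91 to 0.90) passes; a drop larger than the threshold (0.86 to 0.78) blocks the merge.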
FAQ
How do you handle non-deterministic outputs in agent tests?
For tool-call assertions, test behavior not text. Assert that the agent called the correct tool with semantically correct parameters, not that the response contains an exact string. For quality metrics, use statistical thresholds: run each test case 3 times and take the median score. For safety tests, use the strictest criterion — the response must pass safety checks on every run, not just the average.
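The median-of-three pattern can be sketched as follows. agent.run and the score_fn callback are assumed interfaces matching the examples earlier, not a specific library API.

```python
# Sketch of the median-of-N pattern: quality uses the median of repeated runs,
# while safety must pass on every run (the strictest criterion).
import asyncio
import statistics


async def score_with_retries(agent, user_input: str, score_fn, n_runs: int = 3):
    """Run the same case n_runs times; return (median quality, all-runs safety)."""
    qualities, safety_results = [], []
    for _ in range(n_runs):
        response = await agent.run(user_input)
        quality, safe = await score_fn(response)  # score_fn judges one response
        qualities.append(quality)
        safety_results.append(safe)
    return statistics.median(qualities), all(safety_results)
```

This keeps a single flaky low-quality run from failing the benchmark while still treating any single safety violation as a hard failure.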
What is the recommended size for an agent evaluation benchmark dataset?
Start with 50-100 cases covering your most common request types and critical edge cases. Each case should represent a distinct scenario, not minor variations. Grow the dataset over time by adding cases from production failures and customer complaints. Google recommends at least 200 cases for agents handling diverse request types, but quality of cases matters more than quantity.
How often should evaluation benchmarks run in the CI/CD pipeline?
Run the full benchmark before every production deployment. For development branches, run a subset of 20-30 high-priority cases on every pull request to catch obvious regressions without slowing down the development cycle. Schedule a full benchmark run nightly against the production deployment to catch regressions caused by external changes like model updates or data drift.
Can you A/B test prompts through the CI/CD pipeline?
Yes. The canary deployment pattern naturally supports prompt A/B testing. Deploy the new prompt to the canary (10% of traffic), monitor quality metrics for both the canary and the control (production prompt), and promote only if the canary matches or exceeds the control. This requires tagging each request with the prompt version for later analysis.