A/B Testing AI Agents: Comparing Prompts, Models, and Configurations in Production
Implement rigorous A/B testing for AI agents to compare prompts, models, and configurations in production with proper experiment design, traffic splitting, statistical significance, and safe rollout strategies.
Why A/B Testing Agents Is Harder Than A/B Testing Buttons
A/B testing a button color is straightforward: show variant A to half the users, variant B to the other half, measure click-through rate, and compute statistical significance. A/B testing AI agents introduces three complications. LLM outputs are non-deterministic: the same prompt and model can produce different responses on successive calls. Success metrics are multidimensional: a prompt that improves accuracy might increase latency or cost. And the feedback loop is slow: you need enough conversations to detect meaningful differences.
Despite these challenges, A/B testing is the only reliable way to know whether a prompt change, model switch, or configuration adjustment actually improves agent performance in production with real users.
Designing the Experiment Framework
Start with a configuration system that defines experiments and assigns users to variants deterministically.
from dataclasses import dataclass, field
import hashlib
from typing import Any


@dataclass
class Variant:
    name: str
    weight: float  # Traffic allocation (0.0 to 1.0)
    config: dict = field(default_factory=dict)


@dataclass
class Experiment:
    id: str
    name: str
    variants: list[Variant]
    enabled: bool = True
    sticky: bool = True  # Same user always gets same variant

    def assign_variant(self, user_id: str) -> Variant:
        """Deterministic variant assignment based on user ID."""
        hash_input = f"{self.id}:{user_id}"
        hash_value = int(hashlib.sha256(hash_input.encode()).hexdigest(), 16)
        bucket = (hash_value % 10000) / 10000.0

        cumulative = 0.0
        for variant in self.variants:
            cumulative += variant.weight
            if bucket < cumulative:
                return variant
        return self.variants[-1]  # Fallback to last variant


# Define an experiment
prompt_experiment = Experiment(
    id="exp_prompt_v2_march",
    name="Support agent prompt v2",
    variants=[
        Variant(
            name="control",
            weight=0.5,
            config={"system_prompt": "You are a helpful support agent..."},
        ),
        Variant(
            name="treatment",
            weight=0.5,
            config={"system_prompt": "You are an expert support agent. Always start by confirming the user's issue..."},
        ),
    ],
)
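As a quick sanity check on the hashing scheme above, the sketch below re-implements the bucketing step as standalone functions and confirms two properties: the same user always lands in the same variant, and a 50/50 weighting produces a roughly even split. The function names `bucket_for` and `assign` and the synthetic user IDs are assumptions made for illustration, not part of the framework itself.

```python
import hashlib
from collections import Counter


def bucket_for(experiment_id: str, user_id: str) -> float:
    """Map (experiment, user) to a stable bucket in [0.0, 1.0)."""
    digest = hashlib.sha256(f"{experiment_id}:{user_id}".encode()).hexdigest()
    return (int(digest, 16) % 10000) / 10000.0


def assign(experiment_id: str, user_id: str, weights: dict[str, float]) -> str:
    """Walk the cumulative weights, mirroring Experiment.assign_variant."""
    bucket = bucket_for(experiment_id, user_id)
    names = list(weights)
    cumulative = 0.0
    for name in names:
        cumulative += weights[name]
        if bucket < cumulative:
            return name
    return names[-1]  # Fallback if weights sum to slightly less than 1.0


weights = {"control": 0.5, "treatment": 0.5}
users = [f"user-{i}" for i in range(10_000)]  # Synthetic IDs, illustration only

counts = Counter(assign("exp_prompt_v2_march", u, weights) for u in users)
control_share = counts["control"] / len(users)
print(f"control share: {control_share:.3f}")
```

Because the experiment ID is part of the hash input, a user's bucket in one experiment is independent of their bucket in another, which keeps concurrent experiments from sharing the same split.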
Integrating Experiments into the Agent
Apply the assigned variant's configuration before running the agent, and tag all metrics and events with the experiment and variant.
class ExperimentManager:
    def __init__(self):
        self.experiments: dict[str, Experiment] = {}
        self.assignments: dict[str, dict[str, str]] = {}  # user_id -> {exp_id: variant_name}

    def register(self, experiment: Experiment):
        self.experiments[experiment.id] = experiment

    def get_variant(self, experiment_id: str, user_id: str) -> Variant | None:
        exp = self.experiments.get(experiment_id)
        if not exp or not exp.enabled:
            return None
        return exp.assign_variant(user_id)

    def get_active_assignments(self, user_id: str) -> dict[str, Variant]:
        return {
            exp_id: exp.assign_variant(user_id)
            for exp_id, exp in self.experiments.items()
            if exp.enabled
        }


experiments = ExperimentManager()
experiments.register(prompt_experiment)


# Note: `agent`, `DEFAULT_SYSTEM_PROMPT`, and `record_conversation_metrics`
# are assumed to be defined elsewhere in your application.
async def run_agent_with_experiments(user_message: str, user_id: str, conversation_id: str):
    # Get variant assignment
    variant = experiments.get_variant("exp_prompt_v2_march", user_id)

    if variant:
        system_prompt = variant.config["system_prompt"]
        experiment_tags = {
            "experiment_id": "exp_prompt_v2_march",
            "variant": variant.name,
        }
    else:
        system_prompt = DEFAULT_SYSTEM_PROMPT
        experiment_tags = {}

    # Run the agent with the variant's config
    response = await agent.run(
        user_message,
        system_prompt=system_prompt,
    )

    # Record metrics tagged with experiment info
    await record_conversation_metrics(
        conversation_id=conversation_id,
        user_id=user_id,
        response=response,
        **experiment_tags,
    )
    return response
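The `record_conversation_metrics` call above is left undefined. One way to back it during development is a small in-memory store that groups conversation outcomes by experiment and variant, then reports per-variant averages. The names `ConversationOutcome` and `InMemoryMetricsStore`, their fields, and the sample numbers are all hypothetical; a production system would more likely write to a database or analytics pipeline.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class ConversationOutcome:
    completed: bool
    turns: int
    latency_ms: float
    cost_usd: float


class InMemoryMetricsStore:
    """Hypothetical stand-in for a real metrics backend."""

    def __init__(self) -> None:
        # (experiment_id, variant_name) -> list of outcomes
        self._outcomes: dict[tuple[str, str], list[ConversationOutcome]] = defaultdict(list)

    def record(self, experiment_id: str, variant: str, outcome: ConversationOutcome) -> None:
        self._outcomes[(experiment_id, variant)].append(outcome)

    def summarize(self, experiment_id: str, variant: str) -> dict:
        """Aggregate recorded outcomes into per-variant averages."""
        rows = self._outcomes.get((experiment_id, variant), [])
        n = len(rows)
        if n == 0:
            return {"variant_name": variant, "sample_size": 0}
        return {
            "variant_name": variant,
            "sample_size": n,
            "completion_rate": sum(r.completed for r in rows) / n,
            "avg_turns": sum(r.turns for r in rows) / n,
            "avg_latency_ms": sum(r.latency_ms for r in rows) / n,
            "avg_cost_usd": sum(r.cost_usd for r in rows) / n,
        }


store = InMemoryMetricsStore()
sample = [  # Made-up outcomes, illustration only
    ConversationOutcome(True, 4, 100.0, 0.01),
    ConversationOutcome(True, 6, 200.0, 0.02),
    ConversationOutcome(False, 8, 300.0, 0.03),
    ConversationOutcome(True, 2, 400.0, 0.04),
]
for outcome in sample:
    store.record("exp_prompt_v2_march", "treatment", outcome)

summary = store.summarize("exp_prompt_v2_march", "treatment")
print(summary)
```

Keying every record by both experiment ID and variant name is what makes the later comparison possible: untagged conversations cannot be attributed to either arm.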
Collecting and Comparing Metrics
Collect the same metrics for both variants, then test whether the difference between them is statistically significant.
import math
from dataclasses import dataclass


@dataclass
class VariantMetrics:
    variant_name: str
    sample_size: int
    completion_rate: float
    avg_turns: float
    avg_satisfaction: float
    avg_latency_ms: float
    avg_cost_usd: float


def compute_significance(control: VariantMetrics, treatment: VariantMetrics) -> dict:
    """Compute statistical significance for completion rate difference."""
    p1 = control.completion_rate
    p2 = treatment.completion_rate
    n1 = control.sample_size
    n2 = treatment.sample_size

    if n1 == 0 or n2 == 0:
        return {"significant": False, "reason": "insufficient data"}

    # Pooled proportion for two-proportion z-test
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))

    if se == 0:
        return {"significant": False, "reason": "zero variance"}

    z_score = (p2 - p1) / se

    # Two-sided test at 95% confidence: significant when |z| > 1.96
    significant = abs(z_score) > 1.96

    return {
        "significant": significant,
        "z_score": round(z_score, 3),
        "control_rate": round(p1, 4),
        "treatment_rate": round(p2, 4),
        "absolute_diff": round(p2 - p1, 4),
        "relative_lift": round((p2 - p1) / p1 * 100, 2) if p1 > 0 else None,
        "control_n": n1,
        "treatment_n": n2,
    }
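A significance flag alone says nothing about how large the effect might be. As a complement to the z-test above, the sketch below computes a confidence interval for the difference in completion rates, using the unpooled standard error (a Wald interval), along with a two-sided p-value for a given z-score. The function names and the sample figures (70% vs. 75% with 1,000 conversations each) are assumptions chosen for illustration.

```python
import math
from statistics import NormalDist


def diff_confidence_interval(
    p1: float, n1: int, p2: float, n2: int, confidence: float = 0.95
) -> tuple[float, float]:
    """Wald interval for (p2 - p1) using the unpooled standard error."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    diff = p2 - p1
    return diff - z * se, diff + z * se


def two_sided_p_value(z_score: float) -> float:
    """Probability of a |z| at least this large under the null hypothesis."""
    return 2 * (1 - NormalDist().cdf(abs(z_score)))


# Hypothetical results: control 70%, treatment 75%, 1,000 conversations each
low, high = diff_confidence_interval(0.70, 1000, 0.75, 1000)
print(f"95% CI for the absolute lift: [{low:.4f}, {high:.4f}]")
```

An interval that excludes zero agrees with a significant test result, and its width shows how precisely the lift has been estimated, which matters when weighing a small quality gain against higher latency or cost.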
Sample Size Planning
Before starting an experiment, estimate how many conversations you need to detect a meaningful difference.
import math
from statistics import NormalDist


def required_sample_size(
    baseline_rate: float,
    minimum_detectable_effect: float,
    alpha: float = 0.05,
    power: float = 0.80,
) -> int:
    """Calculate required sample size per variant (two-sided test)."""
    # z-scores for alpha and power, computed for any alpha/power value
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # 0.84 for power = 0.80

    p1 = baseline_rate
    p2 = baseline_rate + minimum_detectable_effect
    p_avg = (p1 + p2) / 2

    numerator = (z_alpha * math.sqrt(2 * p_avg * (1 - p_avg)) +
                 z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    denominator = (p2 - p1) ** 2

    return math.ceil(numerator / denominator)


# Example: detect a 5-percentage-point improvement (70% -> 75%) in completion rate
n = required_sample_size(0.70, 0.05)
# Returns about 1,250 conversations per variant
Safe Rollout After an Experiment Concludes
When a variant wins, roll it out gradually rather than flipping a switch for all users.
class GradualRollout:
    def __init__(self, experiment_id: str, winning_variant: str):
        self.experiment_id = experiment_id
        self.winning_variant = winning_variant
        self.rollout_percentage = 0.0  # Start at 0%

    def set_rollout(self, percentage: float):
        self.rollout_percentage = min(1.0, max(0.0, percentage))

    def should_use_new_config(self, user_id: str) -> bool:
        hash_input = f"rollout:{self.experiment_id}:{user_id}"
        hash_value = int(hashlib.sha256(hash_input.encode()).hexdigest(), 16)
        bucket = (hash_value % 10000) / 10000.0
        return bucket < self.rollout_percentage


# Rollout schedule:
# Day 1: 10%, Day 2: 25%, Day 3: 50%, Day 5: 100%
rollout = GradualRollout("exp_prompt_v2_march", "treatment")
rollout.set_rollout(0.10)
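The schedule in the comment above can also be expressed as data. The sketch below does that and checks a useful property of hash-based bucketing: because a user is included whenever their bucket falls below the current fraction, anyone who received the new configuration at 10% still receives it at 25%. The names `ROLLOUT_SCHEDULE`, `fraction_for_day`, and `in_rollout` are hypothetical, and the user IDs are synthetic.

```python
import hashlib

# Schedule from the comment above: (first day in effect, fraction of users)
ROLLOUT_SCHEDULE = [(1, 0.10), (2, 0.25), (3, 0.50), (5, 1.00)]


def fraction_for_day(day: int) -> float:
    """Return the rollout fraction in effect on a given day (0.0 before day 1)."""
    fraction = 0.0
    for start_day, value in ROLLOUT_SCHEDULE:
        if day >= start_day:
            fraction = value
    return fraction


def in_rollout(experiment_id: str, user_id: str, fraction: float) -> bool:
    """Same bucketing rule as GradualRollout.should_use_new_config."""
    digest = hashlib.sha256(f"rollout:{experiment_id}:{user_id}".encode()).hexdigest()
    bucket = (int(digest, 16) % 10000) / 10000.0
    return bucket < fraction


users = [f"user-{i}" for i in range(1000)]  # Synthetic IDs, illustration only
day1 = {u for u in users if in_rollout("exp_prompt_v2_march", u, fraction_for_day(1))}
day2 = {u for u in users if in_rollout("exp_prompt_v2_march", u, fraction_for_day(2))}

# Users only ever move from the old config to the new one, never back
print(len(day1), len(day2), day1 <= day2)
```

This stability means a user's experience does not flip back and forth as the percentage increases, and lowering the fraction to roll back removes the most recently added users first.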
FAQ
How long should I run an A/B test on an AI agent?
Run until you reach the sample size you planned for, with a minimum of 7 days to capture day-of-week effects. For most agent deployments, 2-4 weeks provides enough data. Never stop an experiment early because the results look promising: checking repeatedly and stopping at the first significant result inflates the false positive rate. Set the duration upfront based on your traffic volume and minimum detectable effect.
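To turn a required sample size into a planned duration, divide the total conversations needed by the daily traffic that enters the experiment, then apply the 7-day floor. The sketch below assumes equal traffic weights across variants; the function name, its parameters, and the traffic figures are hypothetical.

```python
import math


def estimated_duration_days(
    required_per_variant: int,
    daily_conversations: float,
    num_variants: int = 2,
    traffic_fraction: float = 1.0,
    min_days: int = 7,
) -> int:
    """Days needed to fill every variant, never fewer than min_days.

    Assumes traffic is split evenly across variants.
    """
    if daily_conversations <= 0:
        raise ValueError("daily_conversations must be positive")
    if not 0 < traffic_fraction <= 1:
        raise ValueError("traffic_fraction must be in (0, 1]")

    total_needed = required_per_variant * num_variants
    eligible_per_day = daily_conversations * traffic_fraction
    days = math.ceil(total_needed / eligible_per_day)
    return max(days, min_days)


# Hypothetical traffic: 1,250 conversations per variant at 200 conversations per day
print(estimated_duration_days(1250, 200))
```

If the estimate comes out far longer than four weeks, the usual options are to accept a larger minimum detectable effect or to route more traffic into the experiment.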
Can I A/B test different LLM models against each other?
Yes, and this is one of the highest-value experiments you can run. Configure one variant with GPT-4o and another with Claude Sonnet, keeping the prompt identical. Compare quality, latency, and cost together. Be aware that the same prompt often performs differently across models, so if the model switch loses, try adapting the prompt for the new model before concluding it is inferior.
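One way to keep a model comparison clean is to build both variant configs from a shared base and verify that the model is the only key that differs. In this sketch the model identifiers `model-a` and `model-b` are placeholders rather than real API model names, and the `temperature` setting and the `differing_keys` helper are likewise assumptions for illustration.

```python
# Shared settings for both arms; only the model should change
BASE_CONFIG = {
    "system_prompt": "You are a helpful support agent...",
    "temperature": 0.2,  # Hypothetical setting
}

control_config = {**BASE_CONFIG, "model": "model-a"}    # Placeholder identifier
treatment_config = {**BASE_CONFIG, "model": "model-b"}  # Placeholder identifier


def differing_keys(a: dict, b: dict) -> set[str]:
    """Return the keys whose values differ between two configs."""
    return {key for key in a.keys() | b.keys() if a.get(key) != b.get(key)}


changed = differing_keys(control_config, treatment_config)
print(changed)
```

Running a check like this when the experiment is registered catches accidental confounds, such as a prompt tweak slipping into the same variant as the model change.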
How do I handle experiments that affect multiple interacting agents?
Assign the variant at the conversation level, not the agent level. If a triage agent hands off to a specialist, both should use the same experiment variant. Pass the variant assignment as part of the handoff context. This prevents confounding where a user gets the new triage prompt but the old specialist prompt, which would make results uninterpretable.
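A minimal sketch of that handoff follows, assuming the variant is assigned once at the start of the conversation and carried in a context object. The `HandoffContext` class, the `prompt_for` helper, and the prompt strings are hypothetical and do not correspond to any particular agent framework.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HandoffContext:
    conversation_id: str
    experiment_id: str
    variant_name: str  # Assigned once, at the start of the conversation


# Hypothetical per-variant prompts for each agent role
PROMPTS = {
    "control": {
        "triage": "Route the user to the right team.",
        "specialist": "Resolve the user's issue.",
    },
    "treatment": {
        "triage": "Confirm the issue, then route the user.",
        "specialist": "Confirm the issue, then resolve it.",
    },
}


def prompt_for(role: str, context: HandoffContext) -> str:
    """Every agent reads the variant from the context instead of re-assigning."""
    return PROMPTS[context.variant_name][role]


context = HandoffContext("conv-123", "exp_prompt_v2_march", "treatment")
triage_prompt = prompt_for("triage", context)
specialist_prompt = prompt_for("specialist", context)  # After the handoff
print(triage_prompt, "|", specialist_prompt)
```

Making the context immutable (`frozen=True`) guards against a downstream agent changing the variant partway through a conversation.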
CallSphere Team