
Agent A/B Testing: Comparing Model Versions, Prompts, and Architectures in Production

How to A/B test AI agents in production: traffic splitting, evaluation metrics, statistical significance, prompt version comparison, and architecture experiments.

Why A/B Testing Agents Is Different from A/B Testing Software

In traditional software A/B testing, you change a button color or page layout and measure click-through rates. The outcome is binary and easily measurable. Agent A/B testing is fundamentally harder for three reasons.

First, the outcome you care about — response quality — is subjective and multi-dimensional. An agent response can be factually correct but unhelpful, or helpful but poorly grounded in source material. You need multiple evaluation metrics, not one.

Second, variance is high. The same agent configuration produces different responses to the same input across runs. You need more samples to reach statistical significance than a typical UI experiment.

Third, the components you want to test interact in complex ways. Swapping the model affects tool-call behavior. Changing the prompt affects response format. Updating a retrieval index affects factual accuracy. These interactions make it hard to attribute improvements to a single change.

Despite these challenges, A/B testing is the only reliable way to make agent improvement decisions. Offline evaluation datasets do not capture the full distribution of real user queries, and intuition-based prompt changes often backfire in unexpected ways.

The Agent Experimentation Framework

A production-grade agent A/B testing system needs four components: traffic splitting, evaluation pipeline, metrics collection, and statistical analysis.

# agent_experiment.py — Core experimentation framework
import hashlib
from dataclasses import dataclass, field
from typing import Any
from datetime import datetime, timezone

@dataclass
class ExperimentVariant:
    variant_id: str
    name: str
    description: str
    config: dict[str, Any]  # Agent configuration overrides
    traffic_percentage: float  # 0.0 to 1.0

@dataclass
class Experiment:
    experiment_id: str
    name: str
    description: str
    variants: list[ExperimentVariant]
    start_date: datetime
    end_date: datetime | None = None
    status: str = "running"  # running, paused, completed
    min_samples_per_variant: int = 200
    metrics: list[str] = field(default_factory=lambda: [
        "user_satisfaction",
        "tool_call_accuracy",
        "response_groundedness",
        "response_relevance",
        "resolution_rate",
        "cost_per_interaction",
        "latency_p95",
    ])


class ExperimentRouter:
    """Route requests to experiment variants using consistent hashing."""

    def __init__(self, experiments: list[Experiment]):
        self.experiments = {e.experiment_id: e for e in experiments}

    def assign_variant(
        self, experiment_id: str, user_id: str
    ) -> ExperimentVariant | None:
        """
        Deterministically assign a user to a variant using consistent hashing.
        The same user always gets the same variant for a given experiment.
        """
        experiment = self.experiments.get(experiment_id)
        if not experiment or experiment.status != "running":
            return None

        # Consistent hash: same user_id always maps to same variant
        hash_input = f"{experiment_id}:{user_id}"
        hash_value = int(hashlib.sha256(hash_input.encode()).hexdigest(), 16)
        bucket = (hash_value % 10000) / 10000.0  # in [0.0, 1.0)

        cumulative = 0.0
        for variant in experiment.variants:
            cumulative += variant.traffic_percentage
            if bucket < cumulative:
                return variant

        return experiment.variants[-1]  # Fallback to last variant


# Example: A/B test comparing two prompt versions
prompt_experiment = Experiment(
    experiment_id="exp-prompt-v3-vs-v4",
    name="System Prompt V3 vs V4",
    description="Testing whether adding explicit tool-call instructions improves accuracy",
    start_date=datetime(2026, 3, 20, tzinfo=timezone.utc),
    variants=[
        ExperimentVariant(
            variant_id="control",
            name="Prompt V3 (current production)",
            description="Current system prompt without explicit tool instructions",
            config={"system_prompt_version": "v3"},
            traffic_percentage=0.5,
        ),
        ExperimentVariant(
            variant_id="treatment",
            name="Prompt V4 (with tool instructions)",
            description="Updated prompt with explicit 'use tool X when...' instructions",
            config={"system_prompt_version": "v4"},
            traffic_percentage=0.5,
        ),
    ],
)

Traffic Splitting Strategies

There are three traffic splitting strategies for agent experiments: user-level, session-level, and request-level. Each has tradeoffs.

User-level splitting (recommended for most cases): Each user is permanently assigned to a variant for the duration of the experiment. This prevents within-user inconsistency — a customer does not experience different agent behaviors on different visits. Use consistent hashing on the user ID.

Session-level splitting: Each new conversation session is randomly assigned to a variant, but all messages within a session use the same variant. This generates data faster than user-level splitting but introduces within-user inconsistency.


Request-level splitting: Each individual request is independently assigned. This is the fastest way to generate data but produces a confusing user experience and is only appropriate for internal or batch-processing agents.
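
Consistent hashing is what makes user-level splitting both deterministic and approximately balanced. A self-contained sketch of the same bucketing scheme `ExperimentRouter` uses, with the example experiment ID from above:

```python
# Deterministic user-level assignment via consistent hashing.
# Self-contained sketch of the bucketing used by ExperimentRouter.
import hashlib

def bucket_for(experiment_id: str, user_id: str) -> float:
    """Map (experiment, user) to a stable bucket in [0.0, 1.0)."""
    digest = hashlib.sha256(f"{experiment_id}:{user_id}".encode()).hexdigest()
    return (int(digest, 16) % 10000) / 10000.0

def assign(experiment_id: str, user_id: str) -> str:
    # 50/50 split: bucket below 0.5 -> control, otherwise treatment
    return "control" if bucket_for(experiment_id, user_id) < 0.5 else "treatment"

# The same user always lands in the same variant...
assert assign("exp-prompt-v3-vs-v4", "user-42") == assign("exp-prompt-v3-vs-v4", "user-42")

# ...and across many users the split approaches the configured 50/50.
assignments = [assign("exp-prompt-v3-vs-v4", f"user-{i}") for i in range(10_000)]
control_share = assignments.count("control") / len(assignments)
print(f"control share: {control_share:.3f}")  # close to 0.500
```

Because the hash is a pure function of experiment ID and user ID, no assignment table needs to be stored: any service can recompute a user's variant on the fly.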

# Agent middleware that applies experiment configuration.
# get_authenticated_user_id, get_active_experiments, and `router` are
# application-specific helpers assumed to be defined elsewhere.
from fastapi import Request

async def experiment_middleware(request: Request):
    """Apply experiment configuration to the agent for this request."""
    user_id = get_authenticated_user_id(request)
    active_experiments = await get_active_experiments()

    variant_assignments = {}
    agent_config_overrides = {}

    for experiment in active_experiments:
        variant = router.assign_variant(experiment.experiment_id, user_id)
        if variant:
            variant_assignments[experiment.experiment_id] = variant.variant_id
            agent_config_overrides.update(variant.config)

    # Store assignments for metrics collection
    request.state.experiment_variants = variant_assignments
    request.state.agent_config = agent_config_overrides

    return variant_assignments


async def run_agent_with_experiment(
    user_input: str,
    request: Request,
) -> dict:
    """Run the agent with experiment-specific configuration."""
    config = request.state.agent_config

    # Build agent with experiment overrides
    agent = build_agent(
        system_prompt=load_prompt(config.get("system_prompt_version", "production")),
        model=config.get("model_id", DEFAULT_MODEL),
        tools=load_tools(config.get("tool_set", "default")),
        temperature=config.get("temperature", 0.1),
    )

    response = await agent.run(user_input)

    # Record experiment data
    await record_experiment_observation(
        experiment_variants=request.state.experiment_variants,
        user_input=user_input,
        response=response,
        agent_config=config,
    )

    return response

Evaluation Metrics for Agent Experiments

Agent experiments require multiple metrics evaluated at different time scales. Immediate metrics are computed per-request. Session metrics are computed per-conversation. Business metrics are computed over days or weeks.

# Metrics computation for agent experiments
import random
from dataclasses import dataclass

@dataclass
class ImmediateMetrics:
    """Computed per request, available in real time."""
    latency_ms: float
    token_count_input: int
    token_count_output: int
    cost_usd: float
    tool_calls_count: int
    tool_call_errors: int
    model_id: str

@dataclass
class QualityMetrics:
    """Computed asynchronously via LLM-as-judge."""
    groundedness: float     # 0-1: is the response grounded in tool results?
    relevance: float        # 0-1: does the response address the user's question?
    helpfulness: float      # 0-1: is the response actionable and complete?
    safety: float           # 0-1: does the response comply with policies?

@dataclass
class SessionMetrics:
    """Computed at session end."""
    turns_to_resolution: int
    resolved: bool
    escalated: bool
    user_satisfaction: float | None  # From post-conversation survey (1-5)


async def compute_quality_metrics_sample(
    observations: list[dict],
    sample_rate: float = 0.1,
) -> list[QualityMetrics]:
    """
    Evaluate a random sample of observations using LLM-as-judge.
    Sampling reduces evaluation cost while maintaining statistical power.
    """
    sample_size = max(1, int(len(observations) * sample_rate))
    sample = random.sample(observations, sample_size)

    results = []
    for obs in sample:
        metrics = await evaluate_with_judge(
            user_input=obs["user_input"],
            agent_response=obs["response_text"],
            tool_results=obs["tool_results"],
            reference_sources=obs["retrieved_documents"],
        )
        results.append(metrics)

    return results

Statistical Analysis for Agent Experiments

Agent A/B tests require careful statistical analysis because the metrics are continuous (not binary) and high-variance. Use the Welch t-test for comparing means and the Mann-Whitney U test as a non-parametric alternative when distributions are skewed.

# Statistical analysis for agent A/B tests
import numpy as np
from scipy import stats
from dataclasses import dataclass

@dataclass
class ExperimentResult:
    metric_name: str
    control_mean: float
    control_std: float
    control_n: int
    treatment_mean: float
    treatment_std: float
    treatment_n: int
    absolute_diff: float
    relative_diff_pct: float
    p_value: float
    confidence_interval: tuple[float, float]
    significant: bool
    power: float

def analyze_experiment(
    control_values: list[float],
    treatment_values: list[float],
    metric_name: str,
    alpha: float = 0.05,
    minimum_detectable_effect: float = 0.05,
) -> ExperimentResult:
    """Run statistical analysis comparing control vs treatment."""
    control = np.array(control_values)
    treatment = np.array(treatment_values)

    control_mean = float(np.mean(control))
    treatment_mean = float(np.mean(treatment))
    control_std = float(np.std(control, ddof=1))
    treatment_std = float(np.std(treatment, ddof=1))

    absolute_diff = treatment_mean - control_mean
    relative_diff = (absolute_diff / control_mean * 100) if control_mean != 0 else 0

    # Welch's t-test (does not assume equal variances)
    t_stat, p_value = stats.ttest_ind(control, treatment, equal_var=False)

    # 95% confidence interval for the difference
    se = np.sqrt(control_std**2 / len(control) + treatment_std**2 / len(treatment))
    ci_low = absolute_diff - 1.96 * se
    ci_high = absolute_diff + 1.96 * se

    # Compute statistical power
    pooled_std = np.sqrt((control_std**2 + treatment_std**2) / 2)
    effect_size = abs(absolute_diff) / pooled_std if pooled_std > 0 else 0

    from statsmodels.stats.power import TTestIndPower
    power_analysis = TTestIndPower()
    power = power_analysis.solve_power(
        effect_size=effect_size,
        nobs1=len(control),
        ratio=len(treatment) / len(control),
        alpha=alpha,
    ) if effect_size > 0 else 0

    return ExperimentResult(
        metric_name=metric_name,
        control_mean=control_mean,
        control_std=control_std,
        control_n=len(control),
        treatment_mean=treatment_mean,
        treatment_std=treatment_std,
        treatment_n=len(treatment),
        absolute_diff=absolute_diff,
        relative_diff_pct=relative_diff,
        p_value=float(p_value),
        confidence_interval=(float(ci_low), float(ci_high)),
        significant=p_value < alpha,
        power=float(power),
    )


def generate_experiment_report(
    experiment: Experiment,
    metric_results: list[ExperimentResult],
) -> str:
    """Generate a human-readable experiment report."""
    lines = [
        f"# Experiment Report: {experiment.name}",
        f"ID: {experiment.experiment_id}",
        f"Start: {experiment.start_date.isoformat()}",
        "",
        "## Results by Metric",
        "",
    ]

    for result in metric_results:
        status = "SIGNIFICANT" if result.significant else "NOT SIGNIFICANT"
        direction = "improvement" if result.absolute_diff > 0 else "degradation"

        lines.extend([
            f"### {result.metric_name}",
            f"- Control: {result.control_mean:.4f} (n={result.control_n})",
            f"- Treatment: {result.treatment_mean:.4f} (n={result.treatment_n})",
            f"- Difference: {result.absolute_diff:+.4f} ({result.relative_diff_pct:+.1f}%)",
            f"- p-value: {result.p_value:.4f} [{status}]",
            f"- 95% CI: [{result.confidence_interval[0]:.4f}, {result.confidence_interval[1]:.4f}]",
            f"- Power: {result.power:.2f}",
            f"- Direction: {direction}",
            "",
        ])

    return "\n".join(lines)
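
When a metric is heavily skewed — latency almost always is — comparing means with the Welch t-test can mislead. A minimal sketch of the Mann-Whitney U alternative mentioned above, run on simulated log-normal latency samples (the distribution parameters are invented for illustration):

```python
# Non-parametric comparison for a skewed metric using scipy's
# Mann-Whitney U test, as an alternative to the Welch t-test.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Simulated per-request latencies in ms: log-normal, i.e. right-skewed
control_latency = rng.lognormal(mean=6.0, sigma=0.5, size=400)
treatment_latency = rng.lognormal(mean=5.9, sigma=0.5, size=400)

# Tests whether one distribution is stochastically greater than the other,
# without assuming normality
u_stat, p_value = stats.mannwhitneyu(
    control_latency, treatment_latency, alternative="two-sided"
)
print(f"U={u_stat:.0f}, p={p_value:.4f}")
```

In practice, run both tests: if the t-test and the Mann-Whitney U test disagree, the metric's distribution is telling you something, and comparing medians or percentiles directly is usually more informative than either p-value.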

Common Experiment Types

Prompt comparison: The most common experiment. Keep the model and tools constant, change only the system prompt. This isolates the impact of prompt engineering. Run for 500-1,000 observations per variant for reliable results.

Model comparison: Keep the prompt and tools constant, change the model. This is useful when evaluating whether a cheaper model can match the quality of a more expensive one. Watch for changes in tool-calling patterns — different models have different tool-call behaviors even with identical prompts.
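
For illustration, a model-comparison experiment might be configured like this, reusing the `Experiment` and `ExperimentVariant` dataclasses defined earlier; the model IDs are placeholders, not real model names:

```python
# Sketch of a model-comparison experiment configuration.
# Model IDs are placeholders; only model_id changes between variants.
from datetime import datetime, timezone

model_experiment = Experiment(
    experiment_id="exp-model-large-vs-small",
    name="Large vs Small Model",
    description="Can the cheaper model match quality with identical prompt and tools?",
    start_date=datetime(2026, 4, 1, tzinfo=timezone.utc),
    min_samples_per_variant=500,  # model swaps shift tool-call behavior; collect more data
    variants=[
        ExperimentVariant(
            variant_id="control",
            name="Current large model",
            description="Production model, unchanged prompt and tools",
            config={"model_id": "large-model-2026-01"},
            traffic_percentage=0.5,
        ),
        ExperimentVariant(
            variant_id="treatment",
            name="Candidate small model",
            description="Cheaper model, identical prompt and tools",
            config={"model_id": "small-model-2026-02"},
            traffic_percentage=0.5,
        ),
    ],
)
```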

Architecture comparison: Test fundamentally different agent designs — for example, single-agent vs. multi-agent, or RAG vs. fine-tuned. These experiments require larger sample sizes because the variance between architectures is higher, and they often affect multiple metrics in different directions (one architecture may be faster but less accurate).

Retrieval strategy comparison: Keep the agent constant, change the retrieval backend. For example, compare keyword search vs. semantic search, or test different chunk sizes and overlap settings. These experiments often have the largest impact on groundedness and factual accuracy.

Guardrails and Early Stopping

Production experiments need safety guardrails. If the treatment variant causes a spike in error rates, customer complaints, or escalations, the experiment should automatically pause before reaching statistical significance.

# Experiment guardrails with automatic early stopping
import numpy as np

async def check_guardrails(
    experiment_id: str,
    variant_id: str,
    observations: list[dict],
) -> tuple[bool, str]:
    """
    Check if an experiment variant has violated safety guardrails.
    Returns (should_pause, reason).
    """
    if len(observations) < 50:
        return False, "Not enough observations for guardrail check"

    recent = observations[-100:]  # Check last 100 observations

    # Guardrail 1: Error rate
    error_count = sum(1 for obs in recent if obs.get("status") == "error")
    error_rate = error_count / len(recent)
    if error_rate > 0.10:
        return True, f"Error rate {error_rate:.1%} exceeds 10% threshold"

    # Guardrail 2: Escalation rate
    escalated = sum(1 for obs in recent if obs.get("escalated", False))
    escalation_rate = escalated / len(recent)
    if escalation_rate > 0.25:
        return True, f"Escalation rate {escalation_rate:.1%} exceeds 25% threshold"

    # Guardrail 3: Quality score floor
    quality_scores = [obs["quality_score"] for obs in recent if "quality_score" in obs]
    if quality_scores and np.mean(quality_scores) < 0.50:
        return True, f"Average quality score {np.mean(quality_scores):.2f} below 0.50 floor"

    # Guardrail 4: Cost anomaly
    costs = [obs["cost_usd"] for obs in recent if "cost_usd" in obs]
    if costs:
        avg_cost = np.mean(costs)
        baseline_cost = await get_baseline_cost(experiment_id)
        if avg_cost > baseline_cost * 3:
            return True, f"Average cost ${avg_cost:.4f} is 3x baseline ${baseline_cost:.4f}"

    return False, "All guardrails passed"

FAQ

How many observations do you need per variant for a reliable agent A/B test?

It depends on the metric and expected effect size. For binary metrics like resolution rate, use a standard power analysis — typically 500-1,000 observations per variant to detect a 5% change with 80% power. For continuous metrics like quality scores, 200-400 observations per variant is usually sufficient because the effect sizes tend to be larger. Use a power calculator with your observed variance to plan the experiment duration.
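
The power calculation described above can be sketched with statsmodels, which the analysis code already uses; the quality-score numbers here are illustrative assumptions:

```python
# Pre-launch sample-size planning with a standard power analysis.
# Baseline numbers are assumptions for illustration.
import math
from statsmodels.stats.power import TTestIndPower

# Suppose the control quality score averages 0.72 with std 0.18, and we
# want to detect an absolute improvement of 0.03.
effect_size = 0.03 / 0.18  # Cohen's d ≈ 0.167

n_per_variant = TTestIndPower().solve_power(
    effect_size=effect_size,
    alpha=0.05,   # two-sided significance level
    power=0.80,   # 80% chance of detecting a true effect of this size
    ratio=1.0,    # equal traffic to control and treatment
)
print(f"required observations per variant: {math.ceil(n_per_variant)}")
```

Divide the required sample size by your daily traffic per variant to get the minimum experiment duration, then commit to it before launch.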

Can you run multiple agent experiments simultaneously?

Yes, but with caution. If experiments modify different components (one tests a new prompt, another tests a new retrieval strategy), they are orthogonal and can run simultaneously using factorial experiment design. If both experiments modify the same component, they will interfere with each other and should run sequentially. Use experiment tagging so you can filter results by the combination of active variants.
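
The consistent-hashing router above already gives you independent assignments across orthogonal experiments, because the experiment ID is part of the hash input: a user's variant in one experiment carries no information about their variant in another. A self-contained sketch of that property:

```python
# Demonstrating that per-experiment hashing yields independent
# assignments: all four variant combinations occur about equally often.
import hashlib
from collections import Counter

def assign(experiment_id: str, user_id: str) -> str:
    digest = hashlib.sha256(f"{experiment_id}:{user_id}".encode()).hexdigest()
    return "control" if (int(digest, 16) % 10000) / 10000.0 < 0.5 else "treatment"

combos = Counter(
    (assign("exp-prompt", f"user-{i}"), assign("exp-retrieval", f"user-{i}"))
    for i in range(10_000)
)
for combo, count in sorted(combos.items()):
    print(combo, count)  # each combination appears roughly 2,500 times
```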

How do you handle the cold-start problem when A/B testing agents with memory?

Agents that maintain conversation history or user preference memory create a cold-start bias — the control variant has accumulated memory from past interactions, while the treatment variant starts fresh. Handle this by either testing only on new users (eliminating the memory advantage), or by copying the existing memory state to the treatment variant at experiment start, or by running the experiment long enough that the treatment variant builds its own memory (typically 2-4 weeks).

What is the most common mistake in agent A/B testing?

Calling experiments too early. Agent metrics are high-variance, and it is tempting to declare a winner after 100 observations when the p-value happens to be below 0.05. Always set sample size requirements before the experiment starts and commit to running until that threshold is reached. Also, watch for the multiple comparisons problem — if you track 7 metrics and use p < 0.05, you expect at least one false positive by chance. Use Bonferroni correction or focus your decision on a single primary metric.
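
A minimal sketch of Bonferroni correction across the seven tracked metrics; the p-values here are invented for illustration:

```python
# Bonferroni correction: compare each raw p-value against
# alpha / number_of_tests instead of alpha itself.
alpha = 0.05
p_values = {
    "user_satisfaction": 0.041,
    "tool_call_accuracy": 0.003,
    "response_groundedness": 0.210,
    "response_relevance": 0.048,
    "resolution_rate": 0.012,
    "cost_per_interaction": 0.350,
    "latency_p95": 0.007,
}
corrected_alpha = alpha / len(p_values)  # 0.05 / 7 ≈ 0.00714

significant = {m: p for m, p in p_values.items() if p < corrected_alpha}
print(f"significant after correction: {sorted(significant)}")
# Only 2 of the 5 metrics that pass p < 0.05 uncorrected survive the correction.
```

Bonferroni is conservative; if it feels too strict, designating one primary metric up front and treating the rest as directional evidence is the simpler fix.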

Written by

CallSphere Team

Expert insights on AI voice agents and customer communication automation.
