Configuration Observability: Tracking Which Config Changes Impact Agent Performance
Build observability into your AI agent configuration pipeline. Learn change tracking, performance correlation analysis, anomaly detection, and automated rollback triggers.
The Missing Link: Config-to-Performance Correlation
Most teams track agent performance metrics (latency, error rate, task completion) and separately track configuration changes (who changed what, when). But very few connect the two. When performance degrades, the debugging conversation goes: "Did anyone change anything?" followed by frantic Slack messages. Configuration observability closes this gap by automatically correlating config changes with performance shifts.
The key principle is that every configuration change is an event that creates a "before" and "after" window. By comparing performance metrics in those windows, you can attribute performance changes to specific configuration modifications.
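As a minimal sketch of that windowing idea (the 24-hour width here is an arbitrary choice — pick one that matches your traffic patterns):

```python
from datetime import datetime, timedelta

def attribution_windows(change_time: datetime, hours: int = 24):
    """Return the (before, after) comparison windows around a config change."""
    return (
        (change_time - timedelta(hours=hours), change_time),
        (change_time, change_time + timedelta(hours=hours)),
    )

change = datetime(2024, 6, 1, 12, 0)
before, after = attribution_windows(change)
# `before` spans the 24 hours leading up to the change; `after` the 24 hours following it
```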
Change Event Model
Every configuration change generates a structured event that captures the full context of what changed.
```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Any, Optional


@dataclass
class ConfigChangeEvent:
    event_id: str
    agent_id: str
    timestamp: datetime
    changed_by: str
    change_type: str  # "prompt", "model", "temperature", "tools", "guardrails"
    field_path: str
    old_value: Any
    new_value: Any
    old_config_hash: str
    new_config_hash: str
    change_reason: Optional[str] = None
    tags: list[str] = field(default_factory=list)


class ChangeEventStore:
    def __init__(self):
        self._events: list[ConfigChangeEvent] = []

    def record(self, event: ConfigChangeEvent):
        self._events.append(event)

    def get_changes_in_window(
        self, agent_id: str, start: datetime, end: datetime
    ) -> list[ConfigChangeEvent]:
        return [
            e for e in self._events
            if e.agent_id == agent_id and start <= e.timestamp <= end
        ]

    def get_recent_changes(
        self, agent_id: str, limit: int = 10
    ) -> list[ConfigChangeEvent]:
        agent_events = [e for e in self._events if e.agent_id == agent_id]
        return sorted(
            agent_events, key=lambda e: e.timestamp, reverse=True
        )[:limit]
```
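The `old_config_hash` and `new_config_hash` fields need a stable digest of the full configuration. One way to derive it (a sketch — the key order normalization and the 12-character truncation are choices, not requirements):

```python
import hashlib
import json

def config_hash(config: dict) -> str:
    """Stable hash of a config dict; sort keys so dict ordering doesn't change the digest."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()[:12]

old = {"model": "m-large", "temperature": 0.7}
new = {"model": "m-large", "temperature": 0.3}
assert config_hash(old) != config_hash(new)
# key order doesn't matter thanks to sort_keys
assert config_hash({"temperature": 0.7, "model": "m-large"}) == config_hash(old)
```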
Performance Metrics Collector
Collect agent performance metrics with enough granularity to detect changes. Each metric point carries a config hash so you can group metrics by configuration version.
```python
from dataclasses import dataclass
import statistics
import time


@dataclass
class PerformanceMetric:
    agent_id: str
    config_hash: str
    timestamp: float
    metric_name: str
    metric_value: float
    session_id: str


class PerformanceCollector:
    def __init__(self):
        self._metrics: list[PerformanceMetric] = []

    def record(
        self,
        agent_id: str,
        config_hash: str,
        session_id: str,
        metrics: dict[str, float],
    ):
        now = time.time()
        for name, value in metrics.items():
            self._metrics.append(
                PerformanceMetric(
                    agent_id=agent_id,
                    config_hash=config_hash,
                    timestamp=now,
                    metric_name=name,
                    metric_value=value,
                    session_id=session_id,
                )
            )

    def get_metrics_by_hash(
        self, agent_id: str, config_hash: str, metric_name: str
    ) -> list[float]:
        return [
            m.metric_value
            for m in self._metrics
            if m.agent_id == agent_id
            and m.config_hash == config_hash
            and m.metric_name == metric_name
        ]

    def get_summary(
        self, agent_id: str, config_hash: str, metric_name: str
    ) -> dict:
        values = self.get_metrics_by_hash(agent_id, config_hash, metric_name)
        if not values:
            return {"count": 0}
        return {
            "count": len(values),
            "mean": statistics.mean(values),
            "median": statistics.median(values),
            "stdev": statistics.stdev(values) if len(values) > 1 else 0.0,
            "p95": sorted(values)[int(len(values) * 0.95)],
            "min": min(values),
            "max": max(values),
        }
```
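The mean/median/p95 trio matters because agent metrics are typically skewed. A quick standalone illustration on hypothetical per-turn latencies, using the same computations as `get_summary`:

```python
import statistics

# Nine typical latencies (ms) plus one 300 ms outlier
values = [120.0, 130.0, 125.0, 300.0, 128.0, 122.0, 127.0, 131.0, 126.0, 129.0]
summary = {
    "count": len(values),
    "mean": statistics.mean(values),
    "median": statistics.median(values),
    "p95": sorted(values)[int(len(values) * 0.95)],
}
# The single outlier pulls the mean (143.8) well above the median (127.5),
# while p95 (300.0) surfaces the tail the mean hides.
```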
Config-Performance Correlation Engine
The correlation engine compares performance metrics before and after each configuration change to determine its impact.
```python
import math
import statistics
from typing import NamedTuple


class ImpactAnalysis(NamedTuple):
    change_event: ConfigChangeEvent
    metric_name: str
    before_mean: float
    after_mean: float
    relative_change: float
    is_significant: bool
    p_value: float
    sample_sizes: tuple[int, int]
    verdict: str  # "improved", "degraded", "neutral", "insufficient_data"


class CorrelationEngine:
    def __init__(
        self,
        change_store: ChangeEventStore,
        perf_collector: PerformanceCollector,
    ):
        self._changes = change_store
        self._perf = perf_collector

    def analyze_change_impact(
        self,
        change_event: ConfigChangeEvent,
        metric_name: str,
        significance_threshold: float = 0.05,
    ) -> ImpactAnalysis:
        before_values = self._perf.get_metrics_by_hash(
            change_event.agent_id,
            change_event.old_config_hash,
            metric_name,
        )
        after_values = self._perf.get_metrics_by_hash(
            change_event.agent_id,
            change_event.new_config_hash,
            metric_name,
        )
        if len(before_values) < 5 or len(after_values) < 5:
            return ImpactAnalysis(
                change_event=change_event,
                metric_name=metric_name,
                before_mean=statistics.mean(before_values) if before_values else 0.0,
                after_mean=statistics.mean(after_values) if after_values else 0.0,
                relative_change=0.0,
                is_significant=False,
                p_value=1.0,
                sample_sizes=(len(before_values), len(after_values)),
                verdict="insufficient_data",
            )
        before_mean = statistics.mean(before_values)
        after_mean = statistics.mean(after_values)
        p_value = self._welch_t_test(before_values, after_values)
        relative_change = (
            (after_mean - before_mean) / before_mean
            if before_mean != 0 else 0.0
        )
        is_significant = p_value < significance_threshold
        # Verdicts assume a higher-is-better metric (e.g. task completion rate).
        # Invert the sign for lower-is-better metrics such as latency or error rate.
        if not is_significant:
            verdict = "neutral"
        elif relative_change > 0:
            verdict = "improved"
        else:
            verdict = "degraded"
        return ImpactAnalysis(
            change_event=change_event,
            metric_name=metric_name,
            before_mean=before_mean,
            after_mean=after_mean,
            relative_change=relative_change,
            is_significant=is_significant,
            p_value=p_value,
            sample_sizes=(len(before_values), len(after_values)),
            verdict=verdict,
        )

    def _welch_t_test(self, a: list[float], b: list[float]) -> float:
        # Welch's t statistic with a normal approximation to the p-value —
        # reasonable at the sample sizes gated above, conservative below them.
        n1, n2 = len(a), len(b)
        var1, var2 = statistics.variance(a), statistics.variance(b)
        se = math.sqrt(var1 / n1 + var2 / n2)
        if se == 0:
            return 1.0
        t_stat = abs(statistics.mean(a) - statistics.mean(b)) / se
        return 2 * (1 - 0.5 * (1 + math.erf(t_stat / math.sqrt(2))))
```
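A quick sanity check of the Welch approximation, extracted as a standalone function with hypothetical completion-rate samples: identical samples should yield p = 1.0 (no evidence of change), and a clearly shifted sample should yield a tiny p-value.

```python
import math
import statistics

def welch_p_value(a: list[float], b: list[float]) -> float:
    """Two-sided p-value via a normal approximation of Welch's t statistic."""
    se = math.sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))
    if se == 0:
        return 1.0
    t = abs(statistics.mean(a) - statistics.mean(b)) / se
    return 2 * (1 - 0.5 * (1 + math.erf(t / math.sqrt(2))))

same = [0.80, 0.82, 0.79, 0.81, 0.80, 0.83]
worse = [0.60, 0.62, 0.59, 0.61, 0.60, 0.63]
assert welch_p_value(same, same) == 1.0   # identical samples: no detectable change
assert welch_p_value(same, worse) < 0.01  # clear 0.2 shift: strongly significant
```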
Automated Rollback Triggers
When a configuration change causes a statistically significant degradation, trigger an automatic rollback and alert the team.
```python
@dataclass
class RollbackRule:
    metric_name: str
    max_degradation_percent: float  # e.g., 10.0 means 10% worse
    min_sample_size: int = 30
    cooldown_minutes: int = 60  # suppress repeat alerts for the same change


class AutoRollbackMonitor:
    def __init__(
        self,
        correlation_engine: CorrelationEngine,
        rules: list[RollbackRule],
    ):
        self._engine = correlation_engine
        self._rules = rules

    def evaluate(self, change_event: ConfigChangeEvent) -> dict:
        violations = []
        for rule in self._rules:
            analysis = self._engine.analyze_change_impact(
                change_event, rule.metric_name
            )
            total_samples = sum(analysis.sample_sizes)
            if total_samples < rule.min_sample_size:
                continue
            # relative_change is negative for a degradation, so flip the sign
            degradation = -analysis.relative_change * 100
            if (
                analysis.is_significant
                and analysis.verdict == "degraded"
                and degradation > rule.max_degradation_percent
            ):
                violations.append({
                    "rule": rule.metric_name,
                    "degradation_percent": round(degradation, 2),
                    "threshold_percent": rule.max_degradation_percent,
                    "p_value": round(analysis.p_value, 4),
                    "before_mean": round(analysis.before_mean, 4),
                    "after_mean": round(analysis.after_mean, 4),
                })
        should_rollback = len(violations) > 0
        return {
            "change_event_id": change_event.event_id,
            "should_rollback": should_rollback,
            "violations": violations,
            "checked_rules": len(self._rules),
        }
```
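The core threshold check reduces to a few lines. This standalone sketch (for a higher-is-better metric, with made-up completion-rate means) shows how a rule's degradation threshold maps to before/after means:

```python
def violates(before_mean: float, after_mean: float, max_degradation_pct: float) -> bool:
    """For a higher-is-better metric, flag when the drop exceeds the allowed percentage."""
    if before_mean == 0:
        return False
    degradation = (before_mean - after_mean) / before_mean * 100
    return degradation > max_degradation_pct

# Completion rate 0.90 -> 0.75 is a ~16.7% drop: exceeds a 10% rule, rollback
assert violates(0.90, 0.75, 10.0)
# Completion rate 0.90 -> 0.86 is a ~4.4% drop: within tolerance
assert not violates(0.90, 0.86, 10.0)
```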
Observability Dashboard Data
Provide an API endpoint that the dashboard queries to show the timeline of config changes overlaid with performance metrics.
```python
from fastapi import FastAPI

app = FastAPI()

# Shared instances — constructing fresh stores inside the handler would return
# an empty timeline on every request. In production, back these with a database.
change_store = ChangeEventStore()
perf_collector = PerformanceCollector()
engine = CorrelationEngine(change_store, perf_collector)


@app.get("/api/agents/{agent_id}/config-impact")
def get_config_impact_timeline(agent_id: str, metric: str = "task_completion_rate"):
    recent_changes = change_store.get_recent_changes(agent_id, limit=20)
    timeline = []
    for change in recent_changes:
        analysis = engine.analyze_change_impact(change, metric)
        timeline.append({
            "timestamp": change.timestamp.isoformat(),
            "changed_by": change.changed_by,
            "field": change.field_path,
            "change_type": change.change_type,
            "before_mean": round(analysis.before_mean, 4),
            "after_mean": round(analysis.after_mean, 4),
            "relative_change_pct": round(analysis.relative_change * 100, 2),
            "verdict": analysis.verdict,
            "significant": analysis.is_significant,
        })
    return {"agent_id": agent_id, "metric": metric, "timeline": timeline}
```
Building the Annotation Layer
The most valuable observability feature is annotations — markers on your performance graphs that show exactly when a config change happened. This transforms a mysterious performance dip into an explainable event.
```python
from typing import Any


class AnnotationBuilder:
    def build_annotations(
        self, changes: list[ConfigChangeEvent]
    ) -> list[dict]:
        return [
            {
                "time": change.timestamp.isoformat(),
                "title": f"Config: {change.field_path}",
                "description": (
                    f"{change.changed_by} changed {change.change_type} "
                    f"from {self._truncate(change.old_value)} "
                    f"to {self._truncate(change.new_value)}"
                ),
                "tags": change.tags,
                "severity": self._classify_severity(change),
            }
            for change in changes
        ]

    def _truncate(self, value: Any, max_len: int = 50) -> str:
        s = str(value)
        return s[:max_len] + "..." if len(s) > max_len else s

    def _classify_severity(self, change: ConfigChangeEvent) -> str:
        # Values must match the change_type strings used by ConfigChangeEvent
        high_risk = {"model", "prompt", "temperature"}
        if change.change_type in high_risk:
            return "high"
        return "low"
```
FAQ
How long should I keep performance data before and after a config change?
Keep at least 24 hours of data on each side of the change to account for daily usage patterns. For lower-traffic agents, extend this to 72 hours to accumulate enough samples for statistical significance. Archive raw metrics after 90 days but retain the aggregated impact analysis indefinitely — it forms a knowledge base of what kinds of changes help or hurt performance.
What metrics should I track for config-performance correlation?
Start with four core metrics: task completion rate (did the agent successfully help the user), average latency per turn, error rate (tool failures, API errors, guardrail blocks), and cost per conversation (token usage multiplied by model pricing). As you mature, add user satisfaction scores and escalation rates. Each metric tells a different story — a model change might improve completion rate but increase cost.
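The cost-per-conversation metric is a straight multiplication of token counts by your provider's rates. A sketch with hypothetical per-million-token prices (substitute your provider's actual rates):

```python
# Hypothetical prices per million tokens — replace with your provider's rates
PRICE_PER_M_INPUT = 3.00
PRICE_PER_M_OUTPUT = 15.00

def conversation_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one conversation from its token counts."""
    return (
        input_tokens / 1_000_000 * PRICE_PER_M_INPUT
        + output_tokens / 1_000_000 * PRICE_PER_M_OUTPUT
    )

cost = conversation_cost(input_tokens=12_000, output_tokens=2_500)
# 12k input ($0.036) + 2.5k output ($0.0375) = $0.0735 per conversation
```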
How do I prevent alert fatigue from the rollback monitor?
Set the minimum sample size threshold high enough that you only alert on statistically meaningful changes. Require at least 30 observations per config version before evaluating. Use a cooldown period so the same change does not trigger multiple alerts. Group related alerts — if three metrics degrade simultaneously after one config change, send one alert with all three violations rather than three separate alerts.
#Observability #AIAgents #ConfigurationManagement #PerformanceMonitoring #Python #AgenticAI #LearnAI #AIEngineering
CallSphere Team
Expert insights on AI voice agents and customer communication automation.