Configuration Observability: Tracking Which Config Changes Impact Agent Performance
Build observability into your AI agent configuration pipeline. Learn change tracking, performance correlation analysis, anomaly detection, and automated rollback triggers.
The Missing Link: Config-to-Performance Correlation
Most teams track agent performance metrics (latency, error rate, task completion) and separately track configuration changes (who changed what, when). But very few connect the two. When performance degrades, the debugging conversation goes: "Did anyone change anything?" followed by frantic Slack messages. Configuration observability closes this gap by automatically correlating config changes with performance shifts.
The key principle is that every configuration change is an event that creates a "before" and "after" window. By comparing performance metrics in those windows, you can attribute performance changes to specific configuration modifications.
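As a minimal sketch of that windowing idea (the 24-hour width here is an arbitrary choice — pick one that matches your traffic patterns):

```python
from datetime import datetime, timedelta

def attribution_windows(change_time: datetime, hours: int = 24):
    """Return the (before, after) comparison windows around a config change."""
    return (
        (change_time - timedelta(hours=hours), change_time),
        (change_time, change_time + timedelta(hours=hours)),
    )

change = datetime(2024, 6, 1, 12, 0)
before, after = attribution_windows(change)
# `before` spans the 24 hours leading up to the change; `after` the 24 hours following it
```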
Change Event Model
Every configuration change generates a structured event that captures the full context of what changed.
```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Any, Optional


@dataclass
class ConfigChangeEvent:
    event_id: str
    agent_id: str
    timestamp: datetime
    changed_by: str
    change_type: str  # "prompt", "model", "temperature", "tools", "guardrails"
    field_path: str
    old_value: Any
    new_value: Any
    old_config_hash: str
    new_config_hash: str
    change_reason: Optional[str] = None
    tags: list[str] = field(default_factory=list)


class ChangeEventStore:
    def __init__(self):
        self._events: list[ConfigChangeEvent] = []

    def record(self, event: ConfigChangeEvent):
        self._events.append(event)

    def get_changes_in_window(
        self, agent_id: str, start: datetime, end: datetime
    ) -> list[ConfigChangeEvent]:
        return [
            e for e in self._events
            if e.agent_id == agent_id and start <= e.timestamp <= end
        ]

    def get_recent_changes(
        self, agent_id: str, limit: int = 10
    ) -> list[ConfigChangeEvent]:
        agent_events = [e for e in self._events if e.agent_id == agent_id]
        return sorted(
            agent_events, key=lambda e: e.timestamp, reverse=True
        )[:limit]
```
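The `old_config_hash` and `new_config_hash` fields need a stable digest of the full configuration. One way to derive it (a sketch — the key order normalization and the 12-character truncation are choices, not requirements):

```python
import hashlib
import json

def config_hash(config: dict) -> str:
    """Stable hash of a config dict; sort keys so dict ordering doesn't change the digest."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()[:12]

old = {"model": "m-large", "temperature": 0.7}
new = {"model": "m-large", "temperature": 0.3}
assert config_hash(old) != config_hash(new)
# key order doesn't matter thanks to sort_keys
assert config_hash({"temperature": 0.7, "model": "m-large"}) == config_hash(old)
```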
Performance Metrics Collector
Collect agent performance metrics with enough granularity to detect changes. Each metric point carries a config hash so you can group metrics by configuration version.
```python
from dataclasses import dataclass
import statistics
import time


@dataclass
class PerformanceMetric:
    agent_id: str
    config_hash: str
    timestamp: float
    metric_name: str
    metric_value: float
    session_id: str


class PerformanceCollector:
    def __init__(self):
        self._metrics: list[PerformanceMetric] = []

    def record(
        self,
        agent_id: str,
        config_hash: str,
        session_id: str,
        metrics: dict[str, float],
    ):
        now = time.time()
        for name, value in metrics.items():
            self._metrics.append(
                PerformanceMetric(
                    agent_id=agent_id,
                    config_hash=config_hash,
                    timestamp=now,
                    metric_name=name,
                    metric_value=value,
                    session_id=session_id,
                )
            )

    def get_metrics_by_hash(
        self, agent_id: str, config_hash: str, metric_name: str
    ) -> list[float]:
        return [
            m.metric_value
            for m in self._metrics
            if m.agent_id == agent_id
            and m.config_hash == config_hash
            and m.metric_name == metric_name
        ]

    def get_summary(
        self, agent_id: str, config_hash: str, metric_name: str
    ) -> dict:
        values = self.get_metrics_by_hash(agent_id, config_hash, metric_name)
        if not values:
            return {"count": 0}
        return {
            "count": len(values),
            "mean": statistics.mean(values),
            "median": statistics.median(values),
            "stdev": statistics.stdev(values) if len(values) > 1 else 0.0,
            "p95": sorted(values)[int(len(values) * 0.95)],
            "min": min(values),
            "max": max(values),
        }
```
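The mean/median/p95 trio matters because agent metrics are typically skewed. A quick standalone illustration on hypothetical per-turn latencies, using the same computations as `get_summary`:

```python
import statistics

# Nine typical latencies (ms) plus one 300 ms outlier
values = [120.0, 130.0, 125.0, 300.0, 128.0, 122.0, 127.0, 131.0, 126.0, 129.0]
summary = {
    "count": len(values),
    "mean": statistics.mean(values),
    "median": statistics.median(values),
    "p95": sorted(values)[int(len(values) * 0.95)],
}
# The single outlier pulls the mean (143.8) well above the median (127.5),
# while p95 (300.0) surfaces the tail the mean hides.
```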
Config-Performance Correlation Engine
The correlation engine compares performance metrics before and after each configuration change to determine its impact.
```python
import math
import statistics
from typing import NamedTuple


class ImpactAnalysis(NamedTuple):
    change_event: ConfigChangeEvent
    metric_name: str
    before_mean: float
    after_mean: float
    relative_change: float
    is_significant: bool
    p_value: float
    sample_sizes: tuple[int, int]
    verdict: str  # "improved", "degraded", "neutral", "insufficient_data"


class CorrelationEngine:
    def __init__(
        self,
        change_store: ChangeEventStore,
        perf_collector: PerformanceCollector,
    ):
        self._changes = change_store
        self._perf = perf_collector

    def analyze_change_impact(
        self,
        change_event: ConfigChangeEvent,
        metric_name: str,
        significance_threshold: float = 0.05,
    ) -> ImpactAnalysis:
        before_values = self._perf.get_metrics_by_hash(
            change_event.agent_id,
            change_event.old_config_hash,
            metric_name,
        )
        after_values = self._perf.get_metrics_by_hash(
            change_event.agent_id,
            change_event.new_config_hash,
            metric_name,
        )
        if len(before_values) < 5 or len(after_values) < 5:
            return ImpactAnalysis(
                change_event=change_event,
                metric_name=metric_name,
                before_mean=statistics.mean(before_values) if before_values else 0.0,
                after_mean=statistics.mean(after_values) if after_values else 0.0,
                relative_change=0.0,
                is_significant=False,
                p_value=1.0,
                sample_sizes=(len(before_values), len(after_values)),
                verdict="insufficient_data",
            )
        before_mean = statistics.mean(before_values)
        after_mean = statistics.mean(after_values)
        p_value = self._welch_t_test(before_values, after_values)
        relative_change = (
            (after_mean - before_mean) / before_mean
            if before_mean != 0 else 0.0
        )
        is_significant = p_value < significance_threshold
        # Verdicts assume a higher-is-better metric (e.g. task completion rate).
        # Invert the sign for lower-is-better metrics such as latency or error rate.
        if not is_significant:
            verdict = "neutral"
        elif relative_change > 0:
            verdict = "improved"
        else:
            verdict = "degraded"
        return ImpactAnalysis(
            change_event=change_event,
            metric_name=metric_name,
            before_mean=before_mean,
            after_mean=after_mean,
            relative_change=relative_change,
            is_significant=is_significant,
            p_value=p_value,
            sample_sizes=(len(before_values), len(after_values)),
            verdict=verdict,
        )

    def _welch_t_test(self, a: list[float], b: list[float]) -> float:
        # Welch's t statistic with a normal approximation to the p-value —
        # reasonable at the sample sizes gated above, conservative below them.
        n1, n2 = len(a), len(b)
        var1, var2 = statistics.variance(a), statistics.variance(b)
        se = math.sqrt(var1 / n1 + var2 / n2)
        if se == 0:
            return 1.0
        t_stat = abs(statistics.mean(a) - statistics.mean(b)) / se
        return 2 * (1 - 0.5 * (1 + math.erf(t_stat / math.sqrt(2))))
```
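A quick sanity check of the Welch approximation, extracted as a standalone function with hypothetical completion-rate samples: identical samples should yield p = 1.0 (no evidence of change), and a clearly shifted sample should yield a tiny p-value.

```python
import math
import statistics

def welch_p_value(a: list[float], b: list[float]) -> float:
    """Two-sided p-value via a normal approximation of Welch's t statistic."""
    se = math.sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))
    if se == 0:
        return 1.0
    t = abs(statistics.mean(a) - statistics.mean(b)) / se
    return 2 * (1 - 0.5 * (1 + math.erf(t / math.sqrt(2))))

same = [0.80, 0.82, 0.79, 0.81, 0.80, 0.83]
worse = [0.60, 0.62, 0.59, 0.61, 0.60, 0.63]
assert welch_p_value(same, same) == 1.0   # identical samples: no detectable change
assert welch_p_value(same, worse) < 0.01  # clear 0.2 shift: strongly significant
```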
Automated Rollback Triggers
When a configuration change causes a statistically significant degradation, trigger an automatic rollback and alert the team.
```python
@dataclass
class RollbackRule:
    metric_name: str
    max_degradation_percent: float  # e.g., 10.0 means 10% worse
    min_sample_size: int = 30
    cooldown_minutes: int = 60  # suppress repeat alerts for the same change


class AutoRollbackMonitor:
    def __init__(
        self,
        correlation_engine: CorrelationEngine,
        rules: list[RollbackRule],
    ):
        self._engine = correlation_engine
        self._rules = rules

    def evaluate(self, change_event: ConfigChangeEvent) -> dict:
        violations = []
        for rule in self._rules:
            analysis = self._engine.analyze_change_impact(
                change_event, rule.metric_name
            )
            total_samples = sum(analysis.sample_sizes)
            if total_samples < rule.min_sample_size:
                continue
            # relative_change is negative for a degradation, so flip the sign
            degradation = -analysis.relative_change * 100
            if (
                analysis.is_significant
                and analysis.verdict == "degraded"
                and degradation > rule.max_degradation_percent
            ):
                violations.append({
                    "rule": rule.metric_name,
                    "degradation_percent": round(degradation, 2),
                    "threshold_percent": rule.max_degradation_percent,
                    "p_value": round(analysis.p_value, 4),
                    "before_mean": round(analysis.before_mean, 4),
                    "after_mean": round(analysis.after_mean, 4),
                })
        should_rollback = len(violations) > 0
        return {
            "change_event_id": change_event.event_id,
            "should_rollback": should_rollback,
            "violations": violations,
            "checked_rules": len(self._rules),
        }
```
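The core threshold check reduces to a few lines. This standalone sketch (for a higher-is-better metric, with made-up completion-rate means) shows how a rule's degradation threshold maps to before/after means:

```python
def violates(before_mean: float, after_mean: float, max_degradation_pct: float) -> bool:
    """For a higher-is-better metric, flag when the drop exceeds the allowed percentage."""
    if before_mean == 0:
        return False
    degradation = (before_mean - after_mean) / before_mean * 100
    return degradation > max_degradation_pct

# Completion rate 0.90 -> 0.75 is a ~16.7% drop: exceeds a 10% rule, rollback
assert violates(0.90, 0.75, 10.0)
# Completion rate 0.90 -> 0.86 is a ~4.4% drop: within tolerance
assert not violates(0.90, 0.86, 10.0)
```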
Observability Dashboard Data
Provide an API endpoint that the dashboard queries to show the timeline of config changes overlaid with performance metrics.
```python
from fastapi import FastAPI

app = FastAPI()

# Shared instances — constructing fresh stores inside the handler would return
# an empty timeline on every request. In production, back these with a database.
change_store = ChangeEventStore()
perf_collector = PerformanceCollector()
engine = CorrelationEngine(change_store, perf_collector)


@app.get("/api/agents/{agent_id}/config-impact")
def get_config_impact_timeline(agent_id: str, metric: str = "task_completion_rate"):
    recent_changes = change_store.get_recent_changes(agent_id, limit=20)
    timeline = []
    for change in recent_changes:
        analysis = engine.analyze_change_impact(change, metric)
        timeline.append({
            "timestamp": change.timestamp.isoformat(),
            "changed_by": change.changed_by,
            "field": change.field_path,
            "change_type": change.change_type,
            "before_mean": round(analysis.before_mean, 4),
            "after_mean": round(analysis.after_mean, 4),
            "relative_change_pct": round(analysis.relative_change * 100, 2),
            "verdict": analysis.verdict,
            "significant": analysis.is_significant,
        })
    return {"agent_id": agent_id, "metric": metric, "timeline": timeline}
```
Building the Annotation Layer
The most valuable observability feature is annotations — markers on your performance graphs that show exactly when a config change happened. This transforms a mysterious performance dip into an explainable event.
```python
from typing import Any


class AnnotationBuilder:
    def build_annotations(
        self, changes: list[ConfigChangeEvent]
    ) -> list[dict]:
        return [
            {
                "time": change.timestamp.isoformat(),
                "title": f"Config: {change.field_path}",
                "description": (
                    f"{change.changed_by} changed {change.change_type} "
                    f"from {self._truncate(change.old_value)} "
                    f"to {self._truncate(change.new_value)}"
                ),
                "tags": change.tags,
                "severity": self._classify_severity(change),
            }
            for change in changes
        ]

    def _truncate(self, value: Any, max_len: int = 50) -> str:
        s = str(value)
        return s[:max_len] + "..." if len(s) > max_len else s

    def _classify_severity(self, change: ConfigChangeEvent) -> str:
        # Values must match the change_type strings used by ConfigChangeEvent
        high_risk = {"model", "prompt", "temperature"}
        if change.change_type in high_risk:
            return "high"
        return "low"
```
FAQ
How long should I keep performance data before and after a config change?
Keep at least 24 hours of data on each side of the change to account for daily usage patterns. For lower-traffic agents, extend this to 72 hours to accumulate enough samples for statistical significance. Archive raw metrics after 90 days but retain the aggregated impact analysis indefinitely — it forms a knowledge base of what kinds of changes help or hurt performance.
What metrics should I track for config-performance correlation?
Start with four core metrics: task completion rate (did the agent successfully help the user), average latency per turn, error rate (tool failures, API errors, guardrail blocks), and cost per conversation (token usage multiplied by model pricing). As you mature, add user satisfaction scores and escalation rates. Each metric tells a different story — a model change might improve completion rate but increase cost.
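The cost-per-conversation metric is a straight multiplication of token counts by your provider's rates. A sketch with hypothetical per-million-token prices (substitute your provider's actual rates):

```python
# Hypothetical prices per million tokens — replace with your provider's rates
PRICE_PER_M_INPUT = 3.00
PRICE_PER_M_OUTPUT = 15.00

def conversation_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one conversation from its token counts."""
    return (
        input_tokens / 1_000_000 * PRICE_PER_M_INPUT
        + output_tokens / 1_000_000 * PRICE_PER_M_OUTPUT
    )

cost = conversation_cost(input_tokens=12_000, output_tokens=2_500)
# 12k input ($0.036) + 2.5k output ($0.0375) = $0.0735 per conversation
```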
How do I prevent alert fatigue from the rollback monitor?
Set the minimum sample size threshold high enough that you only alert on statistically meaningful changes. Require at least 30 observations per config version before evaluating. Use a cooldown period so the same change does not trigger multiple alerts. Group related alerts — if three metrics degrade simultaneously after one config change, send one alert with all three violations rather than three separate alerts.
#Observability #AIAgents #ConfigurationManagement #PerformanceMonitoring #Python #AgenticAI #LearnAI #AIEngineering
CallSphere Team
Expert insights on AI voice agents and customer communication automation.