Agent Performance SLAs: Defining and Measuring Service Level Agreements
This guide covers how to define and measure Service Level Agreements for AI agent systems, with practical guidance on SLA definition, measurement methodology, automated reporting, and penalty handling for production agent deployments.
Why AI Agent SLAs Require New Thinking
A traditional SLA might promise 99.9% uptime and sub-200ms response times. These metrics are necessary but insufficient for AI agents. An agent can have 100% uptime and respond in 50ms while consistently giving wrong answers.
AI agent SLAs must cover four dimensions: availability, performance, correctness, and safety. Each dimension needs distinct measurement methodology and distinct penalty structures.
Defining Multi-Dimensional SLAs
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class SLADimension(Enum):
    AVAILABILITY = "availability"
    PERFORMANCE = "performance"
    CORRECTNESS = "correctness"
    SAFETY = "safety"

@dataclass
class SLADefinition:
    dimension: SLADimension
    metric_name: str
    target: float
    measurement_window: str  # "monthly", "weekly"
    measurement_method: str
    exclusions: list
    penalty_per_breach: Optional[str] = None

AGENT_SLAS = [
    SLADefinition(
        dimension=SLADimension.AVAILABILITY,
        metric_name="agent_uptime",
        target=0.999,
        measurement_window="monthly",
        measurement_method="1 - (minutes_of_downtime / total_minutes_in_month)",
        exclusions=["scheduled_maintenance", "llm_provider_outage"],
        penalty_per_breach="5% credit per 0.1% below target",
    ),
    SLADefinition(
        dimension=SLADimension.PERFORMANCE,
        metric_name="p95_task_completion_time",
        target=10.0,  # seconds
        measurement_window="monthly",
        measurement_method="95th percentile of task_completion_seconds",
        exclusions=["tasks_requiring_human_escalation"],
        penalty_per_breach="2% credit per second above target",
    ),
    SLADefinition(
        dimension=SLADimension.CORRECTNESS,
        metric_name="task_success_rate",
        target=0.90,
        measurement_window="monthly",
        measurement_method="successful_tasks / (successful_tasks + failed_tasks)",
        exclusions=["ambiguous_requests", "unsupported_task_types"],
        penalty_per_breach="10% credit per 5% below target",
    ),
    SLADefinition(
        dimension=SLADimension.SAFETY,
        metric_name="safety_incident_rate",
        target=0.0001,
        measurement_window="monthly",
        measurement_method="safety_incidents / total_interactions",
        exclusions=[],
        penalty_per_breach="Contract review triggered",
    ),
]
Safety has no exclusions — there is no acceptable excuse for a safety incident. The penalty is a contract review rather than a credit because safety breaches threaten the entire relationship, not just a billing period.
Measurement Methodology
Accurate SLA measurement requires careful instrumentation and clear definitions of what counts as a success or failure.
from datetime import datetime
from typing import Tuple

class SLAMeasurer:
    def __init__(self, metrics_store):
        self.metrics = metrics_store

    async def measure_availability(self, start: datetime,
                                   end: datetime) -> Tuple[float, dict]:
        """Measure availability excluding planned maintenance."""
        total_minutes = (end - start).total_seconds() / 60
        downtime_events = await self.metrics.query(
            metric="agent_health_status",
            start=start, end=end,
            filter={"status": "unhealthy"},
        )
        maintenance_windows = await self.metrics.query(
            metric="planned_maintenance",
            start=start, end=end,
        )
        raw_downtime = sum(e["duration_minutes"] for e in downtime_events)
        maintenance_time = sum(m["duration_minutes"] for m in maintenance_windows)
        excluded_downtime = sum(
            e["duration_minutes"] for e in downtime_events
            if e.get("cause") == "llm_provider_outage"
        )
        counted_downtime = raw_downtime - excluded_downtime
        effective_total = total_minutes - maintenance_time
        availability = 1 - (counted_downtime / effective_total) if effective_total > 0 else 1.0
        return availability, {
            "total_minutes": total_minutes,
            "raw_downtime_minutes": raw_downtime,
            "excluded_downtime_minutes": excluded_downtime,
            "maintenance_minutes": maintenance_time,
            "counted_downtime_minutes": counted_downtime,
            "availability": round(availability, 6),
        }

    async def measure_correctness(self, start: datetime,
                                  end: datetime) -> Tuple[float, dict]:
        """Measure task success rate with exclusions."""
        tasks = await self.metrics.query(
            metric="agent_task_results",
            start=start, end=end,
        )
        total = len(tasks)
        excluded = len([t for t in tasks if t.get("excluded", False)])
        counted = total - excluded
        successful = len([
            t for t in tasks
            if not t.get("excluded") and t["result"] == "success"
        ])
        rate = successful / counted if counted > 0 else 1.0
        return rate, {
            "total_tasks": total,
            "excluded_tasks": excluded,
            "counted_tasks": counted,
            "successful_tasks": successful,
            "success_rate": round(rate, 4),
        }
Exclusions must be clearly defined in the SLA contract and automatically tracked. A manual exclusion process creates disputes.
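One way to make exclusion tracking automatic is to tag each task at the moment it is classified, so the exclusion decision and its reason travel with the task record itself. A minimal sketch, with a hypothetical `TaskRecord` mirroring the `agent_task_results` entries used above (field names and the excluded task types are illustrative):

```python
from dataclasses import dataclass
from typing import Optional

# Task types excluded by contract; names are illustrative.
EXCLUDED_TASK_TYPES = {"ambiguous_request", "unsupported_task_type"}

@dataclass
class TaskRecord:
    task_id: str
    task_type: str
    result: str  # "success" or "failure"
    excluded: bool = False
    exclusion_reason: Optional[str] = None

def tag_exclusions(task: TaskRecord) -> TaskRecord:
    """Record the exclusion at classification time, not after a breach."""
    if task.task_type in EXCLUDED_TASK_TYPES:
        task.excluded = True
        task.exclusion_reason = task.task_type
    return task
```

Because the tag is written when the task is processed, the monthly report can simply filter on `excluded` instead of re-litigating each case after a breach.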
Automated SLA Reporting
class SLAReporter:
    def __init__(self, measurer: SLAMeasurer, sla_definitions: List[SLADefinition]):
        self.measurer = measurer
        self.slas = sla_definitions

    async def generate_monthly_report(self, year: int, month: int) -> dict:
        start = datetime(year, month, 1)
        if month == 12:
            end = datetime(year + 1, 1, 1)
        else:
            end = datetime(year, month + 1, 1)
        results = []
        total_penalty_percentage = 0.0
        for sla in self.slas:
            if sla.dimension == SLADimension.AVAILABILITY:
                value, details = await self.measurer.measure_availability(start, end)
            elif sla.dimension == SLADimension.CORRECTNESS:
                value, details = await self.measurer.measure_correctness(start, end)
            elif sla.dimension == SLADimension.PERFORMANCE:
                # measure_performance and measure_safety follow the same
                # pattern as the measurers shown above
                value, details = await self.measurer.measure_performance(start, end)
            else:
                value, details = await self.measurer.measure_safety(start, end)
            met = self._check_target(sla, value)
            penalty = self._calculate_penalty(sla, value) if not met else 0.0
            results.append({
                "dimension": sla.dimension.value,
                "metric": sla.metric_name,
                "target": sla.target,
                "actual": round(value, 4),
                "met": met,
                "penalty_percentage": penalty,
                "details": details,
            })
            total_penalty_percentage += penalty
        return {
            "period": f"{year}-{month:02d}",
            "generated_at": datetime.utcnow().isoformat(),
            "results": results,
            "overall_met": all(r["met"] for r in results),
            "total_penalty_percentage": min(total_penalty_percentage, 30),
        }

    def _check_target(self, sla: SLADefinition, value: float) -> bool:
        # Safety (incident rate) and performance (latency) are lower-is-better;
        # availability and correctness are higher-is-better.
        if sla.dimension in (SLADimension.SAFETY, SLADimension.PERFORMANCE):
            return value <= sla.target
        return value >= sla.target

    def _calculate_penalty(self, sla: SLADefinition, value: float) -> float:
        if sla.dimension == SLADimension.AVAILABILITY:
            gap = sla.target - value
            return round(gap / 0.001 * 5, 1)  # 5% credit per 0.1% below target
        elif sla.dimension == SLADimension.PERFORMANCE:
            gap = value - sla.target
            return round(gap * 2, 1)  # 2% credit per second above target
        elif sla.dimension == SLADimension.CORRECTNESS:
            gap = sla.target - value
            return round(gap / 0.05 * 10, 1)  # 10% credit per 5% below target
        return 0.0  # safety breaches trigger a contract review, not a credit
Cap total penalties at 30% to prevent a single catastrophic month from exceeding the contract value. Some organizations cap at the monthly fee.
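To make the penalty math concrete, here is the availability formula from `_calculate_penalty` applied to a sample month as a standalone function, with the 30% cap folded in:

```python
def availability_penalty(target: float, actual: float, cap: float = 30.0) -> float:
    """5% credit per 0.1% of availability below target, capped."""
    gap = max(target - actual, 0.0)
    return min(round(gap / 0.001 * 5, 1), cap)

# 99.65% measured against a 99.9% target: a 0.25% gap earns a 12.5% credit.
print(availability_penalty(0.999, 0.9965))  # 12.5
# A catastrophic month (99.0%) would compute to 45% but is capped at 30%.
print(availability_penalty(0.999, 0.99))    # 30.0
```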
SLA Review and Renegotiation
# sla-review-process.yaml
review_schedule:
  frequency: quarterly
  participants:
    - "engineering lead"
    - "product manager"
    - "customer success"
    - "client stakeholder"

review_agenda:
  - "SLA performance summary for the quarter"
  - "Root cause analysis for any breaches"
  - "Exclusion review — are exclusions fair and accurate?"
  - "Target adjustment proposals"
  - "New dimensions to add or remove"

adjustment_rules:
  - "Targets can only increase after 2 consecutive quarters of meeting them"
  - "Targets can decrease if a systemic issue is identified and documented"
  - "New dimensions require 1 month of measurement before SLA enforcement"
  - "Safety targets never decrease"
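The adjustment rules above can be enforced in code rather than left to meeting notes. A sketch of a validator, under the assumption that a higher target is stricter for availability and correctness, while the safety target is an incident rate that may only tighten (i.e. the allowed rate may never rise):

```python
def validate_target_change(dimension: str, current: float, proposed: float,
                           quarters_met: int,
                           systemic_issue_documented: bool) -> tuple[bool, str]:
    """Check a proposed SLA target change against the review policy."""
    if dimension == "safety":
        # "Safety targets never decrease" in stringency: the allowed
        # incident rate may never go up.
        if proposed > current:
            return False, "safety targets cannot be loosened"
        return True, "tightening safety is always allowed"
    if proposed > current:  # stricter target
        if quarters_met >= 2:
            return True, "met for 2+ consecutive quarters"
        return False, "requires 2 consecutive quarters of meeting the current target"
    if proposed < current:  # looser target
        if systemic_issue_documented:
            return True, "documented systemic issue"
        return False, "loosening requires a documented systemic issue"
    return True, "no change"
```

A performance target (where lower is stricter) would need the comparisons inverted; the policy logic stays the same.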
FAQ
How do I set initial SLA targets for a new AI agent system?
Run the agent in production for 30-60 days without SLA enforcement, measuring all proposed dimensions. Set initial targets at or slightly below the observed performance. This gives you a realistic baseline. Ratchet targets upward as the system matures and you gain confidence. Never start with aspirational targets — you will breach immediately and lose credibility.
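That baselining step can be as simple as taking a central statistic of the observed daily rates and subtracting a small margin. A sketch (the choice of median and a 2% margin are judgment calls, not a standard):

```python
import statistics

def initial_target(daily_success_rates: list[float], margin: float = 0.02) -> float:
    """Set the initial SLA target slightly below the observed median."""
    baseline = statistics.median(daily_success_rates)
    return round(baseline - margin, 3)

# 30-60 days of observed success rates would go here; five shown for brevity.
print(initial_target([0.94, 0.95, 0.93, 0.96, 0.95]))  # 0.93
```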
Should correctness SLAs exclude edge cases and ambiguous requests?
Yes, but define exclusions precisely in the contract. Use automated classification to tag requests as excluded — never rely on manual post-hoc exclusion decisions. Common exclusions include requests in unsupported languages, intentionally adversarial inputs, and tasks outside the agent's documented scope. Publish the exclusion criteria and track the exclusion rate as a separate metric.
How do I handle SLA breaches caused by third-party LLM providers?
Define "provider outage" exclusions in your SLA but do not make them a blanket excuse. You are responsible for building redundancy. If you have a single LLM provider and they go down for 4 hours, your SLA should absorb some of that downtime. The exclusion should only apply to outages beyond your architectural redundancy — for example, if all three of your configured LLM providers are down simultaneously.
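The "beyond your architectural redundancy" test can itself be automated. A sketch, assuming you record per-provider availability for each downtime window (provider names are illustrative):

```python
def outage_qualifies_for_exclusion(provider_status: dict[str, bool]) -> bool:
    """provider_status maps provider name -> whether it was available during
    the downtime window. The llm_provider_outage exclusion applies only when
    every configured provider was down simultaneously."""
    return not any(provider_status.values())

print(outage_qualifies_for_exclusion(
    {"provider_a": False, "provider_b": True}))   # False: failover was possible
print(outage_qualifies_for_exclusion(
    {"provider_a": False, "provider_b": False}))  # True: no redundancy remained
```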
CallSphere Team