Agent Performance SLAs: Defining and Measuring Service Level Agreements
This guide covers how to define and measure Service Level Agreements for AI agent systems, with practical guidance on SLA definition, measurement methodology, automated reporting, and penalty handling for production agent deployments.
Why AI Agent SLAs Require New Thinking
A traditional SLA might promise 99.9% uptime and sub-200ms response times. These metrics are necessary but insufficient for AI agents. An agent can have 100% uptime and respond in 50ms while consistently giving wrong answers.
AI agent SLAs must cover four dimensions: availability, performance, correctness, and safety. Each dimension needs distinct measurement methodology and distinct penalty structures.
Defining Multi-Dimensional SLAs
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class SLADimension(Enum):
    AVAILABILITY = "availability"
    PERFORMANCE = "performance"
    CORRECTNESS = "correctness"
    SAFETY = "safety"

@dataclass
class SLADefinition:
    dimension: SLADimension
    metric_name: str
    target: float
    measurement_window: str  # "monthly", "weekly"
    measurement_method: str
    exclusions: list
    penalty_per_breach: Optional[str] = None

AGENT_SLAS = [
    SLADefinition(
        dimension=SLADimension.AVAILABILITY,
        metric_name="agent_uptime",
        target=0.999,
        measurement_window="monthly",
        measurement_method="1 - (minutes_of_downtime / total_minutes_in_month)",
        exclusions=["scheduled_maintenance", "llm_provider_outage"],
        penalty_per_breach="5% credit per 0.1% below target",
    ),
    SLADefinition(
        dimension=SLADimension.PERFORMANCE,
        metric_name="p95_task_completion_time",
        target=10.0,  # seconds
        measurement_window="monthly",
        measurement_method="95th percentile of task_completion_seconds",
        exclusions=["tasks_requiring_human_escalation"],
        penalty_per_breach="2% credit per second above target",
    ),
    SLADefinition(
        dimension=SLADimension.CORRECTNESS,
        metric_name="task_success_rate",
        target=0.90,
        measurement_window="monthly",
        measurement_method="successful_tasks / (successful_tasks + failed_tasks)",
        exclusions=["ambiguous_requests", "unsupported_task_types"],
        penalty_per_breach="10% credit per 5% below target",
    ),
    SLADefinition(
        dimension=SLADimension.SAFETY,
        metric_name="safety_incident_rate",
        target=0.0001,
        measurement_window="monthly",
        measurement_method="safety_incidents / total_interactions",
        exclusions=[],
        penalty_per_breach="Contract review triggered",
    ),
]
Safety has no exclusions — there is no acceptable excuse for a safety incident. The penalty is a contract review rather than a credit because safety breaches threaten the entire relationship, not just a billing period.
Measurement Methodology
Accurate SLA measurement requires careful instrumentation and clear definitions of what counts as a success or failure.
from datetime import datetime
from typing import Tuple

class SLAMeasurer:
    def __init__(self, metrics_store):
        self.metrics = metrics_store

    async def measure_availability(self, start: datetime,
                                   end: datetime) -> Tuple[float, dict]:
        """Measure availability excluding planned maintenance."""
        total_minutes = (end - start).total_seconds() / 60
        downtime_events = await self.metrics.query(
            metric="agent_health_status",
            start=start, end=end,
            filter={"status": "unhealthy"},
        )
        maintenance_windows = await self.metrics.query(
            metric="planned_maintenance",
            start=start, end=end,
        )
        raw_downtime = sum(e["duration_minutes"] for e in downtime_events)
        maintenance_time = sum(m["duration_minutes"] for m in maintenance_windows)
        excluded_downtime = sum(
            e["duration_minutes"] for e in downtime_events
            if e.get("cause") == "llm_provider_outage"
        )
        counted_downtime = raw_downtime - excluded_downtime
        effective_total = total_minutes - maintenance_time
        availability = 1 - (counted_downtime / effective_total) if effective_total > 0 else 1.0
        return availability, {
            "total_minutes": total_minutes,
            "raw_downtime_minutes": raw_downtime,
            "excluded_downtime_minutes": excluded_downtime,
            "maintenance_minutes": maintenance_time,
            "counted_downtime_minutes": counted_downtime,
            "availability": round(availability, 6),
        }

    async def measure_correctness(self, start: datetime,
                                  end: datetime) -> Tuple[float, dict]:
        """Measure task success rate with exclusions."""
        tasks = await self.metrics.query(
            metric="agent_task_results",
            start=start, end=end,
        )
        total = len(tasks)
        excluded = len([t for t in tasks if t.get("excluded", False)])
        counted = total - excluded
        successful = len([
            t for t in tasks
            if not t.get("excluded") and t["result"] == "success"
        ])
        rate = successful / counted if counted > 0 else 1.0
        return rate, {
            "total_tasks": total,
            "excluded_tasks": excluded,
            "counted_tasks": counted,
            "successful_tasks": successful,
            "success_rate": round(rate, 4),
        }
Exclusions must be clearly defined in the SLA contract and automatically tracked. A manual exclusion process creates disputes.
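One way to make exclusion tracking automatic is to tag each task at the moment it is classified, so the exclusion decision and its reason travel with the task record itself. A minimal sketch, with a hypothetical `TaskRecord` mirroring the `agent_task_results` entries used above (field names and the excluded task types are illustrative):

```python
from dataclasses import dataclass
from typing import Optional

# Task types excluded by contract; names are illustrative.
EXCLUDED_TASK_TYPES = {"ambiguous_request", "unsupported_task_type"}

@dataclass
class TaskRecord:
    task_id: str
    task_type: str
    result: str  # "success" or "failure"
    excluded: bool = False
    exclusion_reason: Optional[str] = None

def tag_exclusions(task: TaskRecord) -> TaskRecord:
    """Record the exclusion at classification time, not after a breach."""
    if task.task_type in EXCLUDED_TASK_TYPES:
        task.excluded = True
        task.exclusion_reason = task.task_type
    return task
```

Because the tag is written when the task is processed, the monthly report can simply filter on `excluded` instead of re-litigating each case after a breach.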
Automated SLA Reporting
class SLAReporter:
    def __init__(self, measurer: SLAMeasurer, sla_definitions: List[SLADefinition]):
        self.measurer = measurer
        self.slas = sla_definitions

    async def generate_monthly_report(self, year: int, month: int) -> dict:
        start = datetime(year, month, 1)
        if month == 12:
            end = datetime(year + 1, 1, 1)
        else:
            end = datetime(year, month + 1, 1)
        results = []
        total_penalty_percentage = 0.0
        for sla in self.slas:
            if sla.dimension == SLADimension.AVAILABILITY:
                value, details = await self.measurer.measure_availability(start, end)
            elif sla.dimension == SLADimension.CORRECTNESS:
                value, details = await self.measurer.measure_correctness(start, end)
            elif sla.dimension == SLADimension.PERFORMANCE:
                # measure_performance and measure_safety follow the same
                # pattern as the measurers shown above
                value, details = await self.measurer.measure_performance(start, end)
            else:
                value, details = await self.measurer.measure_safety(start, end)
            met = self._check_target(sla, value)
            penalty = self._calculate_penalty(sla, value) if not met else 0.0
            results.append({
                "dimension": sla.dimension.value,
                "metric": sla.metric_name,
                "target": sla.target,
                "actual": round(value, 4),
                "met": met,
                "penalty_percentage": penalty,
                "details": details,
            })
            total_penalty_percentage += penalty
        return {
            "period": f"{year}-{month:02d}",
            "generated_at": datetime.utcnow().isoformat(),
            "results": results,
            "overall_met": all(r["met"] for r in results),
            "total_penalty_percentage": min(total_penalty_percentage, 30),
        }

    def _check_target(self, sla: SLADefinition, value: float) -> bool:
        # Safety (incident rate) and performance (latency) are lower-is-better;
        # availability and correctness are higher-is-better.
        if sla.dimension in (SLADimension.SAFETY, SLADimension.PERFORMANCE):
            return value <= sla.target
        return value >= sla.target

    def _calculate_penalty(self, sla: SLADefinition, value: float) -> float:
        if sla.dimension == SLADimension.AVAILABILITY:
            gap = sla.target - value
            return round(gap / 0.001 * 5, 1)  # 5% credit per 0.1% below target
        elif sla.dimension == SLADimension.PERFORMANCE:
            gap = value - sla.target
            return round(gap * 2, 1)  # 2% credit per second above target
        elif sla.dimension == SLADimension.CORRECTNESS:
            gap = sla.target - value
            return round(gap / 0.05 * 10, 1)  # 10% credit per 5% below target
        return 0.0  # safety breaches trigger a contract review, not a credit
Cap total penalties at 30% to prevent a single catastrophic month from exceeding the contract value. Some organizations cap at the monthly fee.
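To make the penalty math concrete, here is the availability formula from `_calculate_penalty` applied to a sample month as a standalone function, with the 30% cap folded in:

```python
def availability_penalty(target: float, actual: float, cap: float = 30.0) -> float:
    """5% credit per 0.1% of availability below target, capped."""
    gap = max(target - actual, 0.0)
    return min(round(gap / 0.001 * 5, 1), cap)

# 99.65% measured against a 99.9% target: a 0.25% gap earns a 12.5% credit.
print(availability_penalty(0.999, 0.9965))  # 12.5
# A catastrophic month (99.0%) would compute to 45% but is capped at 30%.
print(availability_penalty(0.999, 0.99))    # 30.0
```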
SLA Review and Renegotiation
# sla-review-process.yaml
review_schedule:
  frequency: quarterly
  participants:
    - "engineering lead"
    - "product manager"
    - "customer success"
    - "client stakeholder"

review_agenda:
  - "SLA performance summary for the quarter"
  - "Root cause analysis for any breaches"
  - "Exclusion review — are exclusions fair and accurate?"
  - "Target adjustment proposals"
  - "New dimensions to add or remove"

adjustment_rules:
  - "Targets can only increase after 2 consecutive quarters of meeting them"
  - "Targets can decrease if a systemic issue is identified and documented"
  - "New dimensions require 1 month of measurement before SLA enforcement"
  - "Safety targets never decrease"
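The adjustment rules above can be enforced in code rather than left to meeting notes. A sketch of a validator, under the assumption that a higher target is stricter for availability and correctness, while the safety target is an incident rate that may only tighten (i.e. the allowed rate may never rise):

```python
def validate_target_change(dimension: str, current: float, proposed: float,
                           quarters_met: int,
                           systemic_issue_documented: bool) -> tuple[bool, str]:
    """Check a proposed SLA target change against the review policy."""
    if dimension == "safety":
        # "Safety targets never decrease" in stringency: the allowed
        # incident rate may never go up.
        if proposed > current:
            return False, "safety targets cannot be loosened"
        return True, "tightening safety is always allowed"
    if proposed > current:  # stricter target
        if quarters_met >= 2:
            return True, "met for 2+ consecutive quarters"
        return False, "requires 2 consecutive quarters of meeting the current target"
    if proposed < current:  # looser target
        if systemic_issue_documented:
            return True, "documented systemic issue"
        return False, "loosening requires a documented systemic issue"
    return True, "no change"
```

A performance target (where lower is stricter) would need the comparisons inverted; the policy logic stays the same.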
FAQ
How do I set initial SLA targets for a new AI agent system?
Run the agent in production for 30-60 days without SLA enforcement, measuring all proposed dimensions. Set initial targets at or slightly below the observed performance. This gives you a realistic baseline. Ratchet targets upward as the system matures and you gain confidence. Never start with aspirational targets — you will breach immediately and lose credibility.
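That baselining step can be as simple as taking a central statistic of the observed daily rates and subtracting a small margin. A sketch (the choice of median and a 2% margin are judgment calls, not a standard):

```python
import statistics

def initial_target(daily_success_rates: list[float], margin: float = 0.02) -> float:
    """Set the initial SLA target slightly below the observed median."""
    baseline = statistics.median(daily_success_rates)
    return round(baseline - margin, 3)

# 30-60 days of observed success rates would go here; five shown for brevity.
print(initial_target([0.94, 0.95, 0.93, 0.96, 0.95]))  # 0.93
```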
Should correctness SLAs exclude edge cases and ambiguous requests?
Yes, but define exclusions precisely in the contract. Use automated classification to tag requests as excluded — never rely on manual post-hoc exclusion decisions. Common exclusions include requests in unsupported languages, intentionally adversarial inputs, and tasks outside the agent's documented scope. Publish the exclusion criteria and track the exclusion rate as a separate metric.
How do I handle SLA breaches caused by third-party LLM providers?
Define "provider outage" exclusions in your SLA but do not make them a blanket excuse. You are responsible for building redundancy. If you have a single LLM provider and they go down for 4 hours, your SLA should absorb some of that downtime. The exclusion should only apply to outages beyond your architectural redundancy — for example, if all three of your configured LLM providers are down simultaneously.
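The "beyond your architectural redundancy" test can itself be automated. A sketch, assuming you record per-provider availability for each downtime window (provider names are illustrative):

```python
def outage_qualifies_for_exclusion(provider_status: dict[str, bool]) -> bool:
    """provider_status maps provider name -> whether it was available during
    the downtime window. The llm_provider_outage exclusion applies only when
    every configured provider was down simultaneously."""
    return not any(provider_status.values())

print(outage_qualifies_for_exclusion(
    {"provider_a": False, "provider_b": True}))   # False: failover was possible
print(outage_qualifies_for_exclusion(
    {"provider_a": False, "provider_b": False}))  # True: no redundancy remained
```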
CallSphere Team