Platform Reliability: Building 99.9% Uptime for an AI Agent SaaS
Engineer 99.9% uptime for an AI agent platform through redundancy design, health checking, circuit breakers, graceful degradation, and chaos engineering practices that find failures before your customers do.
The Math of 99.9%
99.9% uptime sounds impressive until you do the math. It allows 8.76 hours of downtime per year, or 43.8 minutes per month. For an agent platform serving customer-facing chatbots, 43 minutes of downtime means 43 minutes where your customers' customers get error messages instead of answers. That is enough to lose enterprise accounts.
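The arithmetic generalizes to any SLA target; a throwaway helper (the function name is mine, not from any SRE library) makes it easy to sanity-check:

```python
def downtime_budget_minutes(availability: float, period_hours: float) -> float:
    """Allowed downtime, in minutes, for a given availability over a period."""
    return period_hours * 60 * (1 - availability)

# 99.9% over an 8,760-hour year and an average 730-hour month:
yearly = downtime_budget_minutes(0.999, 8760)   # ~525.6 minutes, i.e. ~8.76 hours
monthly = downtime_budget_minutes(0.999, 730)   # ~43.8 minutes
```

Run the same numbers for 99.99% and the monthly budget shrinks to about 4.4 minutes, which is why each extra nine costs an order of magnitude more engineering effort.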
The path to 99.9% is not about preventing all failures — it is about ensuring that no single failure takes down the entire system. Every component must be redundant, every dependency must have a fallback, and every failure mode must be detected and isolated within seconds.
Health Check System
Reliable systems start with reliable health checks. Shallow checks that return 200 OK without testing dependencies are useless. Deep health checks verify that the service can actually do its job:
```python
# health.py — Deep health check implementation
import asyncio
import time
from dataclasses import dataclass, field
from enum import Enum


class HealthStatus(str, Enum):
    HEALTHY = "healthy"
    DEGRADED = "degraded"
    UNHEALTHY = "unhealthy"


@dataclass
class ComponentHealth:
    name: str
    status: HealthStatus
    latency_ms: float
    message: str = ""


@dataclass
class SystemHealth:
    status: HealthStatus
    components: list[ComponentHealth] = field(default_factory=list)
    timestamp: float = field(default_factory=time.time)


class HealthChecker:
    def __init__(self, db, redis_client, llm_client):
        self.db = db
        self.redis = redis_client
        self.llm = llm_client

    async def check(self) -> SystemHealth:
        checks = await asyncio.gather(
            self._check_database(),
            self._check_redis(),
            self._check_llm_provider(),
            return_exceptions=True,
        )
        components = []
        for result in checks:
            if isinstance(result, Exception):
                components.append(ComponentHealth(
                    name="unknown", status=HealthStatus.UNHEALTHY,
                    latency_ms=0, message=str(result),
                ))
            else:
                components.append(result)
        unhealthy = sum(1 for c in components if c.status == HealthStatus.UNHEALTHY)
        degraded = sum(1 for c in components if c.status == HealthStatus.DEGRADED)
        if unhealthy > 0:
            overall = HealthStatus.UNHEALTHY
        elif degraded > 0:
            overall = HealthStatus.DEGRADED
        else:
            overall = HealthStatus.HEALTHY
        return SystemHealth(status=overall, components=components)

    async def _check_database(self) -> ComponentHealth:
        start = time.monotonic()
        try:
            await self.db.execute("SELECT 1")
            latency = (time.monotonic() - start) * 1000
            status = HealthStatus.HEALTHY if latency < 100 else HealthStatus.DEGRADED
            return ComponentHealth("database", status, latency)
        except Exception as e:
            return ComponentHealth("database", HealthStatus.UNHEALTHY, 0, str(e))

    async def _check_redis(self) -> ComponentHealth:
        start = time.monotonic()
        try:
            await self.redis.ping()
            latency = (time.monotonic() - start) * 1000
            status = HealthStatus.HEALTHY if latency < 50 else HealthStatus.DEGRADED
            return ComponentHealth("redis", status, latency)
        except Exception as e:
            return ComponentHealth("redis", HealthStatus.UNHEALTHY, 0, str(e))

    async def _check_llm_provider(self) -> ComponentHealth:
        start = time.monotonic()
        try:
            # Minimal completion to verify API connectivity
            response = await self.llm.completions.create(
                model="gpt-4o-mini", messages=[{"role": "user", "content": "ping"}],
                max_tokens=1,
            )
            latency = (time.monotonic() - start) * 1000
            status = HealthStatus.HEALTHY if latency < 2000 else HealthStatus.DEGRADED
            return ComponentHealth("llm_provider", status, latency)
        except Exception as e:
            return ComponentHealth("llm_provider", HealthStatus.UNHEALTHY, 0, str(e))
```
Circuit Breaker Pattern
When an LLM provider goes down, you do not want every request to wait 30 seconds for a timeout. A circuit breaker detects failure patterns and fails fast:
```python
# circuit_breaker.py — Circuit breaker for external dependencies
import time
from dataclasses import dataclass
from enum import Enum


class CircuitState(str, Enum):
    CLOSED = "closed"          # Normal operation
    OPEN = "open"              # Failing fast, not sending requests
    HALF_OPEN = "half_open"    # Testing if the service recovered


@dataclass
class CircuitBreaker:
    name: str
    failure_threshold: int = 5
    recovery_timeout: float = 30.0  # seconds
    half_open_max_calls: int = 3
    _state: CircuitState = CircuitState.CLOSED
    _failure_count: int = 0
    _last_failure_time: float = 0
    _half_open_calls: int = 0

    @property
    def state(self) -> CircuitState:
        if self._state == CircuitState.OPEN:
            if time.monotonic() - self._last_failure_time > self.recovery_timeout:
                self._state = CircuitState.HALF_OPEN
                self._half_open_calls = 0
        return self._state

    def record_success(self):
        if self._state == CircuitState.HALF_OPEN:
            self._half_open_calls += 1
            if self._half_open_calls >= self.half_open_max_calls:
                self._state = CircuitState.CLOSED
                self._failure_count = 0
        else:
            self._failure_count = 0

    def record_failure(self):
        self._failure_count += 1
        self._last_failure_time = time.monotonic()
        if self._failure_count >= self.failure_threshold:
            self._state = CircuitState.OPEN

    def allow_request(self) -> bool:
        state = self.state
        if state == CircuitState.CLOSED:
            return True
        if state == CircuitState.HALF_OPEN:
            return True
        return False  # OPEN — fail fast
```
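Wiring the breaker into call sites takes only a few lines. A sketch (`call_with_breaker` is my helper name; it assumes the `allow_request`/`record_success`/`record_failure` interface above):

```python
async def call_with_breaker(breaker, fn, *args, **kwargs):
    """Run an async dependency call through a breaker: skip it when open."""
    if not breaker.allow_request():
        raise RuntimeError(f"{breaker.name}: circuit open, failing fast")
    try:
        result = await fn(*args, **kwargs)
    except Exception:
        breaker.record_failure()
        raise
    breaker.record_success()
    return result
```

In a real codebase you would raise a dedicated exception type rather than `RuntimeError`, so upstream handlers can distinguish "breaker open" from a genuine provider error.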
Multi-Provider LLM Failover
The highest-risk dependency for an agent platform is the LLM provider. If OpenAI goes down, your entire platform goes down — unless you have failover:
```python
# llm_failover.py — Multi-provider LLM failover

class AllProvidersFailedError(Exception):
    """Raised when every configured provider rejected the request."""


class LLMFailoverClient:
    def __init__(self, providers: list[dict]):
        self.providers = providers  # [{"name": "openai", "client": ..., "models": {...}}]
        self.breakers = {p["name"]: CircuitBreaker(name=p["name"]) for p in providers}

    async def complete(self, messages: list, model: str, **kwargs) -> dict:
        errors = []
        for provider in self.providers:
            breaker = self.breakers[provider["name"]]
            if not breaker.allow_request():
                errors.append(f"{provider['name']}: circuit open")
                continue
            mapped_model = provider["models"].get(model, model)
            try:
                result = await provider["client"].chat.completions.create(
                    model=mapped_model, messages=messages, **kwargs,
                )
                breaker.record_success()
                return {
                    "content": result.choices[0].message.content,
                    "provider": provider["name"],
                    "model": mapped_model,
                    "input_tokens": result.usage.prompt_tokens,
                    "output_tokens": result.usage.completion_tokens,
                }
            except Exception as e:
                breaker.record_failure()
                errors.append(f"{provider['name']}: {str(e)}")
        raise AllProvidersFailedError(
            f"All LLM providers failed: {'; '.join(errors)}"
        )


# Configuration
failover_client = LLMFailoverClient([
    {
        "name": "openai",
        "client": openai_client,
        "models": {"gpt-4o": "gpt-4o", "gpt-4o-mini": "gpt-4o-mini"},
    },
    {
        "name": "anthropic",
        "client": anthropic_client,
        "models": {"gpt-4o": "claude-sonnet-4-20250514", "gpt-4o-mini": "claude-haiku-4-20250414"},
    },
])
```
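One possible refinement: instead of always walking the provider list in fixed order, order providers by breaker state so one with an open circuit is only reached as a last resort. A sketch (`provider_order` is a hypothetical helper, not part of the client above; it takes the state strings the breaker exposes):

```python
def provider_order(providers: list[dict], states: dict[str, str]) -> list[dict]:
    """Order providers so closed circuits are tried first, then half-open,
    then open. `states` maps provider name -> "closed"/"half_open"/"open".
    sorted() is stable, so the configured order breaks ties."""
    rank = {"closed": 0, "half_open": 1, "open": 2}
    return sorted(providers, key=lambda p: rank[states[p["name"]]])
```

This keeps the primary provider first under normal operation but routes around it immediately once its breaker trips, instead of burning an `allow_request()` check per call.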
Graceful Degradation Strategy
When components fail, the system should degrade gracefully rather than crash entirely:
```python
# degradation.py — Graceful degradation policies

class DegradationPolicy:
    def __init__(self, health_checker: HealthChecker):
        self.health = health_checker

    async def get_capabilities(self) -> dict:
        health = await self.health.check()
        component_status = {c.name: c.status for c in health.components}
        return {
            "chat": component_status.get("llm_provider") != HealthStatus.UNHEALTHY,
            "streaming": component_status.get("llm_provider") == HealthStatus.HEALTHY,
            "conversation_history": component_status.get("database") != HealthStatus.UNHEALTHY,
            "analytics": component_status.get("database") == HealthStatus.HEALTHY,
            "caching": component_status.get("redis") != HealthStatus.UNHEALTHY,
            "real_time_usage": component_status.get("redis") == HealthStatus.HEALTHY,
        }
```
If Redis is down, the system still works — it just skips caching. If the database is degraded, analytics queries are disabled but chat continues using in-memory conversation state. This layered degradation keeps the core functionality running even when supporting services fail.
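A request handler can translate that capability map into concrete per-request behavior. A sketch (`response_plan` is a hypothetical helper, not part of the policy above):

```python
def response_plan(caps: dict) -> dict:
    """Decide how to serve a chat request from the capability map."""
    if not caps.get("chat", False):
        # Core capability gone: reject fast with a clear error
        return {"serve": False, "status": 503}
    return {
        "serve": True,
        "status": 200,
        "stream": caps.get("streaming", False),            # fall back to non-streaming
        "persist_history": caps.get("conversation_history", False),
        "use_cache": caps.get("caching", False),           # skip Redis when it is down
    }
```

The key property is that each feature degrades independently: a Redis outage flips `use_cache` to false without touching `serve`.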
FAQ
How do I implement chaos engineering without breaking production?
Start with game days in a staging environment. Use tools like Chaos Monkey or LitmusChaos to randomly kill pods, inject network latency, and simulate LLM provider outages. Once your team is comfortable with the failure modes, introduce controlled chaos in production during business hours with the team ready to intervene. Never run chaos experiments during peak traffic or outside business hours.
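A minimal in-process fault injector for staging might look like the following. This is illustrative only (the name and parameters are mine); tools like LitmusChaos do this at the infrastructure layer, but an application-level wrapper is a cheap way to start:

```python
import asyncio
import random


async def chaos_wrap(fn, *, failure_rate=0.05, max_extra_latency=0.5, rng=random):
    """Call an async dependency with injected latency and random failures.

    Gate this behind a config flag so it can never run in production
    unintentionally. `rng` is injectable for deterministic tests.
    """
    await asyncio.sleep(rng.uniform(0, max_extra_latency))
    if rng.random() < failure_rate:
        raise ConnectionError("chaos: injected dependency failure")
    return await fn()
```

Running your integration suite with `chaos_wrap` around the database and LLM clients quickly reveals which code paths lack timeouts and retries.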
What monitoring and alerting should I set up for 99.9% uptime?
Monitor four golden signals: latency (P50, P95, P99 response times), traffic (requests per second), errors (error rate by status code), and saturation (CPU, memory, connection pool usage). Set alerts on error rate exceeding 1% for 5 minutes and P95 latency exceeding 5 seconds for 10 minutes. Use PagerDuty or Opsgenie for on-call rotation. Dashboard these in Grafana with a 30-day uptime counter visible to the entire team.
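Alongside threshold alerts, many teams also track the error budget implied by the SLO, since it turns "are we reliable enough?" into a single number. The arithmetic, as a sketch (function name is mine):

```python
def error_budget_remaining(total_requests: int, failed_requests: int,
                           slo: float = 0.999) -> float:
    """Fraction of the error budget still unspent over a window.

    1.0 = untouched, 0.0 = exhausted, negative = SLO already breached.
    """
    budget = total_requests * (1 - slo)  # failures the SLO permits
    if budget == 0:
        return 1.0
    return 1 - failed_requests / budget
```

For a 99.9% SLO over one million requests, the budget is 1,000 failures; at 250 failures you have 75% of the budget left, and a fast burn rate is itself worth paging on.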
How do I handle planned maintenance without counting against my uptime SLA?
Schedule maintenance windows in advance and communicate them to customers 72 hours ahead. Use blue-green deployments so that most updates require zero downtime. For database migrations that require downtime, run them during the lowest-traffic window and keep the maintenance window under 15 minutes. Your SLA should explicitly exclude pre-announced maintenance windows.
#Reliability #SRE #Uptime #AIAgents #Infrastructure #AgenticAI #LearnAI #AIEngineering
CallSphere Team
Expert insights on AI voice agents and customer communication automation.