Platform Reliability: Building 99.9% Uptime for an AI Agent SaaS
Engineer 99.9% uptime for an AI agent platform through redundancy design, health checking, circuit breakers, graceful degradation, and chaos engineering practices that find failures before your customers do.
The Math of 99.9%
99.9% uptime sounds impressive until you do the math. It allows 8.76 hours of downtime per year, or 43.8 minutes per month. For an agent platform serving customer-facing chatbots, 43 minutes of downtime means 43 minutes where your customers' customers get error messages instead of answers. That is enough to lose enterprise accounts.
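The arithmetic generalizes to any SLA target; a throwaway helper (the function name is mine, not from any SRE library) makes it easy to sanity-check:

```python
def downtime_budget_minutes(availability: float, period_hours: float) -> float:
    """Allowed downtime, in minutes, for a given availability over a period."""
    return period_hours * 60 * (1 - availability)

# 99.9% over an 8,760-hour year and an average 730-hour month:
yearly = downtime_budget_minutes(0.999, 8760)   # ~525.6 minutes, i.e. ~8.76 hours
monthly = downtime_budget_minutes(0.999, 730)   # ~43.8 minutes
```

Run the same numbers for 99.99% and the monthly budget shrinks to about 4.4 minutes, which is why each extra nine costs an order of magnitude more engineering effort.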
The path to 99.9% is not about preventing all failures — it is about ensuring that no single failure takes down the entire system. Every component must be redundant, every dependency must have a fallback, and every failure mode must be detected and isolated within seconds.
Health Check System
Reliable systems start with reliable health checks. Shallow checks that return 200 OK without testing dependencies are useless. Deep health checks verify that the service can actually do its job:
```python
# health.py — Deep health check implementation
import asyncio
import time
from dataclasses import dataclass, field
from enum import Enum


class HealthStatus(str, Enum):
    HEALTHY = "healthy"
    DEGRADED = "degraded"
    UNHEALTHY = "unhealthy"


@dataclass
class ComponentHealth:
    name: str
    status: HealthStatus
    latency_ms: float
    message: str = ""


@dataclass
class SystemHealth:
    status: HealthStatus
    components: list[ComponentHealth] = field(default_factory=list)
    timestamp: float = field(default_factory=time.time)


class HealthChecker:
    def __init__(self, db, redis_client, llm_client):
        self.db = db
        self.redis = redis_client
        self.llm = llm_client

    async def check(self) -> SystemHealth:
        checks = await asyncio.gather(
            self._check_database(),
            self._check_redis(),
            self._check_llm_provider(),
            return_exceptions=True,
        )
        components = []
        for result in checks:
            if isinstance(result, Exception):
                components.append(ComponentHealth(
                    name="unknown", status=HealthStatus.UNHEALTHY,
                    latency_ms=0, message=str(result),
                ))
            else:
                components.append(result)
        unhealthy = sum(1 for c in components if c.status == HealthStatus.UNHEALTHY)
        degraded = sum(1 for c in components if c.status == HealthStatus.DEGRADED)
        if unhealthy > 0:
            overall = HealthStatus.UNHEALTHY
        elif degraded > 0:
            overall = HealthStatus.DEGRADED
        else:
            overall = HealthStatus.HEALTHY
        return SystemHealth(status=overall, components=components)

    async def _check_database(self) -> ComponentHealth:
        start = time.monotonic()
        try:
            await self.db.execute("SELECT 1")
            latency = (time.monotonic() - start) * 1000
            status = HealthStatus.HEALTHY if latency < 100 else HealthStatus.DEGRADED
            return ComponentHealth("database", status, latency)
        except Exception as e:
            return ComponentHealth("database", HealthStatus.UNHEALTHY, 0, str(e))

    async def _check_redis(self) -> ComponentHealth:
        start = time.monotonic()
        try:
            await self.redis.ping()
            latency = (time.monotonic() - start) * 1000
            status = HealthStatus.HEALTHY if latency < 50 else HealthStatus.DEGRADED
            return ComponentHealth("redis", status, latency)
        except Exception as e:
            return ComponentHealth("redis", HealthStatus.UNHEALTHY, 0, str(e))

    async def _check_llm_provider(self) -> ComponentHealth:
        start = time.monotonic()
        try:
            # Minimal completion to verify API connectivity
            response = await self.llm.completions.create(
                model="gpt-4o-mini", messages=[{"role": "user", "content": "ping"}],
                max_tokens=1,
            )
            latency = (time.monotonic() - start) * 1000
            status = HealthStatus.HEALTHY if latency < 2000 else HealthStatus.DEGRADED
            return ComponentHealth("llm_provider", status, latency)
        except Exception as e:
            return ComponentHealth("llm_provider", HealthStatus.UNHEALTHY, 0, str(e))
```
Circuit Breaker Pattern
When an LLM provider goes down, you do not want every request to wait 30 seconds for a timeout. A circuit breaker detects failure patterns and fails fast:
```python
# circuit_breaker.py — Circuit breaker for external dependencies
import time
from dataclasses import dataclass
from enum import Enum


class CircuitState(str, Enum):
    CLOSED = "closed"          # Normal operation
    OPEN = "open"              # Failing fast, not sending requests
    HALF_OPEN = "half_open"    # Testing if the service recovered


@dataclass
class CircuitBreaker:
    name: str
    failure_threshold: int = 5
    recovery_timeout: float = 30.0  # seconds
    half_open_max_calls: int = 3
    _state: CircuitState = CircuitState.CLOSED
    _failure_count: int = 0
    _last_failure_time: float = 0
    _half_open_calls: int = 0

    @property
    def state(self) -> CircuitState:
        if self._state == CircuitState.OPEN:
            if time.monotonic() - self._last_failure_time > self.recovery_timeout:
                self._state = CircuitState.HALF_OPEN
                self._half_open_calls = 0
        return self._state

    def record_success(self):
        if self._state == CircuitState.HALF_OPEN:
            self._half_open_calls += 1
            if self._half_open_calls >= self.half_open_max_calls:
                self._state = CircuitState.CLOSED
                self._failure_count = 0
        else:
            self._failure_count = 0

    def record_failure(self):
        self._failure_count += 1
        self._last_failure_time = time.monotonic()
        if self._failure_count >= self.failure_threshold:
            self._state = CircuitState.OPEN

    def allow_request(self) -> bool:
        state = self.state
        if state == CircuitState.CLOSED:
            return True
        if state == CircuitState.HALF_OPEN:
            return True
        return False  # OPEN — fail fast
```
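Wiring the breaker into call sites takes only a few lines. A sketch (`call_with_breaker` is my helper name; it assumes the `allow_request`/`record_success`/`record_failure` interface above):

```python
async def call_with_breaker(breaker, fn, *args, **kwargs):
    """Run an async dependency call through a breaker: skip it when open."""
    if not breaker.allow_request():
        raise RuntimeError(f"{breaker.name}: circuit open, failing fast")
    try:
        result = await fn(*args, **kwargs)
    except Exception:
        breaker.record_failure()
        raise
    breaker.record_success()
    return result
```

In a real codebase you would raise a dedicated exception type rather than `RuntimeError`, so upstream handlers can distinguish "breaker open" from a genuine provider error.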
Multi-Provider LLM Failover
The highest-risk dependency for an agent platform is the LLM provider. If OpenAI goes down, your entire platform goes down — unless you have failover:
```python
# llm_failover.py — Multi-provider LLM failover

class AllProvidersFailedError(Exception):
    """Raised when every configured provider rejected the request."""


class LLMFailoverClient:
    def __init__(self, providers: list[dict]):
        self.providers = providers  # [{"name": "openai", "client": ..., "models": {...}}]
        self.breakers = {p["name"]: CircuitBreaker(name=p["name"]) for p in providers}

    async def complete(self, messages: list, model: str, **kwargs) -> dict:
        errors = []
        for provider in self.providers:
            breaker = self.breakers[provider["name"]]
            if not breaker.allow_request():
                errors.append(f"{provider['name']}: circuit open")
                continue
            mapped_model = provider["models"].get(model, model)
            try:
                result = await provider["client"].chat.completions.create(
                    model=mapped_model, messages=messages, **kwargs,
                )
                breaker.record_success()
                return {
                    "content": result.choices[0].message.content,
                    "provider": provider["name"],
                    "model": mapped_model,
                    "input_tokens": result.usage.prompt_tokens,
                    "output_tokens": result.usage.completion_tokens,
                }
            except Exception as e:
                breaker.record_failure()
                errors.append(f"{provider['name']}: {str(e)}")
        raise AllProvidersFailedError(
            f"All LLM providers failed: {'; '.join(errors)}"
        )


# Configuration
failover_client = LLMFailoverClient([
    {
        "name": "openai",
        "client": openai_client,
        "models": {"gpt-4o": "gpt-4o", "gpt-4o-mini": "gpt-4o-mini"},
    },
    {
        "name": "anthropic",
        "client": anthropic_client,
        "models": {"gpt-4o": "claude-sonnet-4-20250514", "gpt-4o-mini": "claude-haiku-4-20250414"},
    },
])
```
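One possible refinement: instead of always walking the provider list in fixed order, order providers by breaker state so one with an open circuit is only reached as a last resort. A sketch (`provider_order` is a hypothetical helper, not part of the client above; it takes the state strings the breaker exposes):

```python
def provider_order(providers: list[dict], states: dict[str, str]) -> list[dict]:
    """Order providers so closed circuits are tried first, then half-open,
    then open. `states` maps provider name -> "closed"/"half_open"/"open".
    sorted() is stable, so the configured order breaks ties."""
    rank = {"closed": 0, "half_open": 1, "open": 2}
    return sorted(providers, key=lambda p: rank[states[p["name"]]])
```

This keeps the primary provider first under normal operation but routes around it immediately once its breaker trips, instead of burning an `allow_request()` check per call.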
Graceful Degradation Strategy
When components fail, the system should degrade gracefully rather than crash entirely:
```python
# degradation.py — Graceful degradation policies

class DegradationPolicy:
    def __init__(self, health_checker: HealthChecker):
        self.health = health_checker

    async def get_capabilities(self) -> dict:
        health = await self.health.check()
        component_status = {c.name: c.status for c in health.components}
        return {
            "chat": component_status.get("llm_provider") != HealthStatus.UNHEALTHY,
            "streaming": component_status.get("llm_provider") == HealthStatus.HEALTHY,
            "conversation_history": component_status.get("database") != HealthStatus.UNHEALTHY,
            "analytics": component_status.get("database") == HealthStatus.HEALTHY,
            "caching": component_status.get("redis") != HealthStatus.UNHEALTHY,
            "real_time_usage": component_status.get("redis") == HealthStatus.HEALTHY,
        }
```
If Redis is down, the system still works — it just skips caching. If the database is degraded, analytics queries are disabled but chat continues using in-memory conversation state. This layered degradation keeps the core functionality running even when supporting services fail.
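A request handler can translate that capability map into concrete per-request behavior. A sketch (`response_plan` is a hypothetical helper, not part of the policy above):

```python
def response_plan(caps: dict) -> dict:
    """Decide how to serve a chat request from the capability map."""
    if not caps.get("chat", False):
        # Core capability gone: reject fast with a clear error
        return {"serve": False, "status": 503}
    return {
        "serve": True,
        "status": 200,
        "stream": caps.get("streaming", False),            # fall back to non-streaming
        "persist_history": caps.get("conversation_history", False),
        "use_cache": caps.get("caching", False),           # skip Redis when it is down
    }
```

The key property is that each feature degrades independently: a Redis outage flips `use_cache` to false without touching `serve`.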
FAQ
How do I implement chaos engineering without breaking production?
Start with game days in a staging environment. Use tools like Chaos Monkey or LitmusChaos to randomly kill pods, inject network latency, and simulate LLM provider outages. Once your team is comfortable with the failure modes, introduce controlled chaos in production during business hours with the team ready to intervene. Never run chaos experiments during peak traffic or outside business hours.
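A minimal in-process fault injector for staging might look like the following. This is illustrative only (the name and parameters are mine); tools like LitmusChaos do this at the infrastructure layer, but an application-level wrapper is a cheap way to start:

```python
import asyncio
import random


async def chaos_wrap(fn, *, failure_rate=0.05, max_extra_latency=0.5, rng=random):
    """Call an async dependency with injected latency and random failures.

    Gate this behind a config flag so it can never run in production
    unintentionally. `rng` is injectable for deterministic tests.
    """
    await asyncio.sleep(rng.uniform(0, max_extra_latency))
    if rng.random() < failure_rate:
        raise ConnectionError("chaos: injected dependency failure")
    return await fn()
```

Running your integration suite with `chaos_wrap` around the database and LLM clients quickly reveals which code paths lack timeouts and retries.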
What monitoring and alerting should I set up for 99.9% uptime?
Monitor four golden signals: latency (P50, P95, P99 response times), traffic (requests per second), errors (error rate by status code), and saturation (CPU, memory, connection pool usage). Set alerts on error rate exceeding 1% for 5 minutes and P95 latency exceeding 5 seconds for 10 minutes. Use PagerDuty or Opsgenie for on-call rotation. Dashboard these in Grafana with a 30-day uptime counter visible to the entire team.
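Alongside threshold alerts, many teams also track the error budget implied by the SLO, since it turns "are we reliable enough?" into a single number. The arithmetic, as a sketch (function name is mine):

```python
def error_budget_remaining(total_requests: int, failed_requests: int,
                           slo: float = 0.999) -> float:
    """Fraction of the error budget still unspent over a window.

    1.0 = untouched, 0.0 = exhausted, negative = SLO already breached.
    """
    budget = total_requests * (1 - slo)  # failures the SLO permits
    if budget == 0:
        return 1.0
    return 1 - failed_requests / budget
```

For a 99.9% SLO over one million requests, the budget is 1,000 failures; at 250 failures you have 75% of the budget left, and a fast burn rate is itself worth paging on.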
How do I handle planned maintenance without counting against my uptime SLA?
Schedule maintenance windows in advance and communicate them to customers 72 hours ahead. Use blue-green deployments so that most updates require zero downtime. For database migrations that require downtime, run them during the lowest-traffic window and keep the maintenance window under 15 minutes. Your SLA should explicitly exclude pre-announced maintenance windows.
#Reliability #SRE #Uptime #AIAgents #Infrastructure #AgenticAI #LearnAI #AIEngineering
CallSphere Team
Expert insights on AI voice agents and customer communication automation.