Building Self-Healing Agent Infrastructure: Auto-Recovery and Auto-Scaling
Build self-healing AI agent infrastructure with health checks, automated recovery procedures, restart policies, and intelligent scaling rules that keep your agents running without manual intervention.
The Cost of Manual Agent Recovery
In production, AI agents fail in ways that are hard to predict. An agent might enter an infinite tool-calling loop, exhaust its context window, lose database connectivity, or hang waiting for a rate-limited LLM response. Without self-healing infrastructure, each failure requires an engineer to diagnose and restart the system manually.
Self-healing infrastructure detects problems automatically and applies corrective actions without human intervention. For AI agents, this means intelligent health checks, graduated recovery procedures, and scaling rules that respond to real-time conditions.
Multi-Layer Health Checks
A simple HTTP ping is not sufficient for AI agents. You need health checks at multiple layers to distinguish between "the process is alive" and "the agent is functioning correctly."
import asyncio
import time
from dataclasses import dataclass
from enum import Enum


class HealthStatus(Enum):
    HEALTHY = "healthy"
    DEGRADED = "degraded"
    UNHEALTHY = "unhealthy"


@dataclass
class HealthCheckResult:
    status: HealthStatus
    latency_ms: float
    details: dict


class AgentHealthChecker:
    def __init__(self, agent, llm_client, db_pool, tool_registry):
        self.agent = agent
        self.llm_client = llm_client
        self.db_pool = db_pool
        self.tool_registry = tool_registry

    async def check_liveness(self) -> HealthCheckResult:
        """Is the agent process alive and responsive?"""
        start = time.monotonic()
        try:
            await asyncio.wait_for(self.agent.ping(), timeout=5.0)
            return HealthCheckResult(
                status=HealthStatus.HEALTHY,
                latency_ms=(time.monotonic() - start) * 1000,
                details={"ping": "ok"},
            )
        except Exception as e:  # includes asyncio.TimeoutError
            return HealthCheckResult(
                status=HealthStatus.UNHEALTHY,
                latency_ms=(time.monotonic() - start) * 1000,
                # repr() keeps the exception type; str() of a timeout is empty
                details={"error": repr(e)},
            )

    async def check_readiness(self) -> HealthCheckResult:
        """Can the agent actually process requests?"""
        start = time.monotonic()
        checks = {}

        # Check LLM connectivity
        try:
            await asyncio.wait_for(
                self.llm_client.complete("Say OK", max_tokens=5),
                timeout=10.0,
            )
            checks["llm"] = "ok"
        except Exception as e:
            checks["llm"] = f"failed: {e!r}"

        # Check database
        try:
            await asyncio.wait_for(
                self.db_pool.execute("SELECT 1"), timeout=5.0
            )
            checks["database"] = "ok"
        except Exception as e:
            checks["database"] = f"failed: {e!r}"

        # Check critical tools
        for tool_name in self.tool_registry.critical_tools:
            try:
                available = await self.tool_registry.verify(tool_name)
                checks[f"tool_{tool_name}"] = "ok" if available else "unavailable"
            except Exception as e:
                checks[f"tool_{tool_name}"] = f"failed: {e!r}"

        # No LLM means UNHEALTHY; any other failed dependency means DEGRADED
        failed = [k for k, v in checks.items() if v != "ok"]
        if not failed:
            status = HealthStatus.HEALTHY
        elif "llm" in failed:
            status = HealthStatus.UNHEALTHY
        else:
            status = HealthStatus.DEGRADED

        return HealthCheckResult(
            status=status,
            latency_ms=(time.monotonic() - start) * 1000,
            details=checks,
        )
The readiness check verifies the entire dependency chain. An agent that is alive but cannot reach its LLM provider should not receive traffic.
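To turn these results into something a probe can act on, the health endpoint has to translate a status into an HTTP response. Kubernetes HTTP probes treat any status code from 200 to 399 as success, so one possible mapping is sketched below. The function name to_probe_response and the serve_when_degraded flag are hypothetical, and whether a DEGRADED agent should keep receiving traffic is a design decision the article does not settle; the types are repeated only to keep the sketch self-contained.

```python
import json
from dataclasses import dataclass
from enum import Enum


class HealthStatus(Enum):
    HEALTHY = "healthy"
    DEGRADED = "degraded"
    UNHEALTHY = "unhealthy"


@dataclass
class HealthCheckResult:
    status: HealthStatus
    latency_ms: float
    details: dict


def to_probe_response(
    result: HealthCheckResult, serve_when_degraded: bool = True
) -> tuple[int, str]:
    """Map a health result to (HTTP status code, JSON body).

    Assumption: DEGRADED agents keep serving traffic unless the caller
    opts out; UNHEALTHY always returns 503 so the probe fails.
    """
    if result.status is HealthStatus.HEALTHY:
        code = 200
    elif result.status is HealthStatus.DEGRADED and serve_when_degraded:
        code = 200
    else:
        code = 503
    body = json.dumps(
        {
            "status": result.status.value,
            "latency_ms": round(result.latency_ms, 1),
            "details": result.details,
        }
    )
    return code, body
```

Returning the per-dependency details in the body keeps the probe decision binary while still letting an engineer see which dependency failed when they query the endpoint by hand.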
Automated Recovery Procedures
Recovery actions should be graduated — start with the least disruptive action and escalate only if the problem persists.
class RecoveryManager:
    def __init__(self, agent_pool, metrics, notifier):
        self.agent_pool = agent_pool
        self.metrics = metrics
        self.notifier = notifier
        # Consecutive unhealthy observations per agent
        self.failure_counts = {}

    async def handle_unhealthy_agent(self, agent_id: str):
        count = self.failure_counts.get(agent_id, 0) + 1
        self.failure_counts[agent_id] = count

        if count <= 2:
            # Level 1: Soft restart: clear context and retry
            await self.agent_pool.clear_context(agent_id)
            await self.agent_pool.reassign_pending_tasks(agent_id)
            self.metrics.increment("recovery.soft_restart")
        elif count <= 5:
            # Level 2: Hard restart: kill and recreate the agent
            await self.agent_pool.terminate(agent_id)
            await self.agent_pool.spawn_replacement(agent_id)
            self.metrics.increment("recovery.hard_restart")
            await self.notifier.send(
                severity="warning",
                message=f"Hard restarted agent {agent_id} (failure #{count})",
            )
        else:
            # Level 3: Quarantine: remove from pool, alert humans
            await self.agent_pool.quarantine(agent_id)
            self.metrics.increment("recovery.quarantine")
            await self.notifier.send(
                severity="critical",
                message=f"Quarantined agent {agent_id} after {count} failures. Manual review required.",
            )

    async def run_recovery_loop(self, interval_seconds: int = 30):
        while True:
            for agent_id in self.agent_pool.active_agent_ids():
                health = await self.agent_pool.check_health(agent_id)
                if health.status == HealthStatus.UNHEALTHY:
                    await self.handle_unhealthy_agent(agent_id)
                elif health.status == HealthStatus.HEALTHY:
                    # A healthy observation resets the escalation ladder
                    self.failure_counts.pop(agent_id, None)
            await asyncio.sleep(interval_seconds)
The graduated approach prevents a transient LLM timeout from triggering a full agent restart. Only persistent failures escalate to quarantine.
Kubernetes Configuration for Self-Healing Agents
# agent-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ai-agent-pool
spec:
  replicas: 3
  # apps/v1 requires a selector that matches the pod template labels.
  # The label app: ai-agent is an assumed placeholder.
  selector:
    matchLabels:
      app: ai-agent
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  template:
    metadata:
      labels:
        app: ai-agent
    spec:
      containers:
        - name: agent
          image: ai-agent:latest
          resources:
            requests:
              memory: "512Mi"
              cpu: "250m"
            limits:
              memory: "1Gi"
              cpu: "1000m"
          livenessProbe:
            httpGet:
              path: /health/liveness
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 15
            failureThreshold: 3
          readinessProbe:
            httpGet:
              path: /health/readiness
              port: 8080
            initialDelaySeconds: 20
            periodSeconds: 10
            failureThreshold: 2
          startupProbe:
            httpGet:
              path: /health/startup
              port: 8080
            failureThreshold: 30
            periodSeconds: 5
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ai-agent-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ai-agent-pool
  minReplicas: 2
  maxReplicas: 15
  # Pods metrics come from the custom metrics API, so a metrics adapter
  # must expose agent_active_tasks and agent_queue_depth to the cluster.
  metrics:
    - type: Pods
      pods:
        metric:
          name: agent_active_tasks
        target:
          type: AverageValue
          averageValue: "5"
    - type: Pods
      pods:
        metric:
          name: agent_queue_depth
        target:
          type: AverageValue
          averageValue: "10"
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Pods
          value: 2
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Pods
          value: 1
          periodSeconds: 120
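The startup probe allows up to 150 seconds (30 failures × 5-second period) for the agent to load models and warm caches; Kubernetes holds off the liveness and readiness probes until it succeeds. The asymmetric scale-up/scale-down behavior prevents flapping: the autoscaler may add up to 2 pods per 60 seconds after a 60-second stabilization window, but it removes at most 1 pod per 120 seconds, and only after its recommendations have stayed lower for 300 seconds.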
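The arithmetic behind these settings can be checked with a few lines of Python. The sketch below computes the time budget of each probe and approximates how the autoscaler picks a replica count from per-pod averages, using the documented rule desired = ceil(current × observed average / target average) and taking the largest proposal across metrics. It is a simplification: the real controller also applies a tolerance band, handles unready pods, and enforces the behavior policies, none of which are modeled here. The function names are hypothetical.

```python
import math


def probe_budget_seconds(period_seconds: int, failure_threshold: int) -> int:
    """Approximate time a probe may keep failing before Kubernetes acts."""
    return period_seconds * failure_threshold


def desired_replicas(
    current_replicas: int,
    metric_averages: dict,
    targets: dict,
    min_replicas: int,
    max_replicas: int,
) -> int:
    """Simplified HPA estimate for Pods metrics with AverageValue targets.

    Each metric proposes ceil(current * observed / target); the largest
    proposal wins and is clamped to [min_replicas, max_replicas].
    """
    proposals = [
        math.ceil(current_replicas * metric_averages[name] / target)
        for name, target in targets.items()
    ]
    return max(min_replicas, min(max_replicas, max(proposals)))


# Targets taken from the manifest above
TARGETS = {"agent_active_tasks": 5, "agent_queue_depth": 10}

startup_budget = probe_budget_seconds(period_seconds=5, failure_threshold=30)
liveness_budget = probe_budget_seconds(period_seconds=15, failure_threshold=3)

# 3 pods averaging 8 active tasks and 12 queued items each
replicas = desired_replicas(
    current_replicas=3,
    metric_averages={"agent_active_tasks": 8, "agent_queue_depth": 12},
    targets=TARGETS,
    min_replicas=2,
    max_replicas=15,
)
```

In this example the active-task metric proposes 5 replicas and the queue-depth metric proposes 4, so the estimate is 5. The liveness budget works out to 45 seconds, which is how long a hung container can go unnoticed before it is restarted.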
FAQ
How do I prevent self-healing from masking underlying issues?
Every automated recovery action must emit metrics and alerts. Track recovery frequency per agent — if an agent is being soft-restarted 20 times per hour, the self-healing is working but something is fundamentally broken. Set thresholds on recovery rates that trigger human investigation.
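One way to put a threshold on recovery rates is a sliding-window counter per agent. The sketch below is a minimal illustration, assuming recoveries are recorded in process; the class name RecoveryRateTracker, its defaults, and the injected clock are all hypothetical, and a production setup might express the same rule as an alert on the recovery metrics instead.

```python
import time
from collections import defaultdict, deque


class RecoveryRateTracker:
    """Counts recovery actions per agent inside a sliding time window."""

    def __init__(self, max_per_window=20, window_seconds=3600.0, clock=time.monotonic):
        self.max_per_window = max_per_window
        self.window_seconds = window_seconds
        self._clock = clock  # injectable so the window can be tested
        self._events = defaultdict(deque)

    def record(self, agent_id: str) -> bool:
        """Record one recovery; return True if a human should investigate."""
        now = self._clock()
        events = self._events[agent_id]
        events.append(now)
        # Drop recoveries that have aged out of the window
        cutoff = now - self.window_seconds
        while events and events[0] <= cutoff:
            events.popleft()
        return len(events) >= self.max_per_window
```

Calling record(agent_id) next to each metrics.increment call in the recovery manager would let the same code path that heals the agent also decide when healing has become a symptom.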
What is the right health check interval for AI agents?
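Use 10-15 second intervals for liveness checks and 30-60 seconds for readiness checks that do real work. Readiness checks that call the LLM cost tokens and add latency, so do not run them on every probe. Consider serving a cached readiness status that refreshes the LLM check only every 5 minutes while the other dependency checks run more frequently. A probe with a short period, such as the 10-second readiness probe in the manifest above, then only reads the cached result.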
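A cached readiness status can be as small as a wrapper that remembers the last result for a fixed time. The sketch below assumes the expensive check is an async callable; the class name CachedCheck and the injected clock are hypothetical. If the wrapped check raises, nothing is cached and the exception propagates to the caller.

```python
import asyncio
import time


class CachedCheck:
    """Wraps an expensive async check and reuses its result for ttl_seconds."""

    def __init__(self, check, ttl_seconds: float, clock=time.monotonic):
        self._check = check
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested
        self._value = None
        self._expires_at = None
        # Serializes refreshes so concurrent probes trigger only one real check
        self._lock = asyncio.Lock()

    async def get(self):
        async with self._lock:
            now = self._clock()
            if self._expires_at is None or now >= self._expires_at:
                self._value = await self._check()
                self._expires_at = now + self._ttl
            return self._value
```

For example, the readiness handler could hold one CachedCheck with ttl_seconds=300 around the LLM call and a second one with a shorter TTL around the database and tool checks, then combine the two results on each probe. Creating the instances inside the running event loop avoids loop-binding issues on older Python versions.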
Should I use Kubernetes liveness probes or application-level health management?
Use both. Kubernetes probes handle process-level failures — crashes, OOM kills, and unresponsive containers. Application-level health management handles agent-specific issues — stuck reasoning loops, context overflow, and tool failures. Kubernetes is your safety net; application-level management is your first line of defense.
CallSphere Team