
Health Checks and Readiness Probes for AI Agent Services

Design robust health check and readiness probe endpoints for AI agent services that verify dependencies, enable graceful startup and shutdown, and integrate with container orchestrators.

Why AI Agents Need Specialized Health Checks

A traditional web service health check verifies the process is running and can respond to HTTP requests. AI agent services have additional failure modes: the LLM API key may be expired, a vector database for RAG may be unreachable, a tool endpoint may be down, or the model may be returning degraded responses due to rate limiting. A simple "200 OK" does not capture these nuances.

Kubernetes uses two distinct probes — liveness and readiness — that serve different purposes. Getting these right prevents cascading failures and ensures traffic only reaches agent pods that can actually serve requests.

Liveness vs. Readiness: Understanding the Difference

Liveness probes answer: "Is this process stuck?" If a liveness probe fails, Kubernetes kills and restarts the pod. Use this to detect deadlocks, infinite loops, or corrupted state.

Readiness probes answer: "Can this pod handle traffic right now?" If readiness fails, Kubernetes removes the pod from the Service load balancer but does not restart it. Use this during startup while loading models, or when a downstream dependency is temporarily unavailable.

# app/health.py
from fastapi import APIRouter, Response
from datetime import datetime, timezone

router = APIRouter(tags=["Health"])

startup_complete = False
startup_time: datetime | None = None

@router.get("/healthz")
async def liveness():
    """Liveness probe - is the process alive and responsive?"""
    return {"status": "alive", "timestamp": datetime.now(timezone.utc).isoformat()}

@router.get("/readyz")
async def readiness(response: Response):
    """Readiness probe - can this pod serve traffic?"""
    if not startup_complete:
        response.status_code = 503
        return {"status": "starting", "detail": "Initialization in progress"}

    checks = await run_dependency_checks()

    # Hard failures take the pod out of rotation; "degraded" keeps it in
    # rotation with reduced capability (see the FAQ below).
    if any(c["status"] == "down" for c in checks.values()):
        response.status_code = 503
        return {"status": "not_ready", "checks": checks}

    if all(c["status"] == "ok" for c in checks.values()):
        return {"status": "ready", "checks": checks}

    return {"status": "degraded", "checks": checks}

Comprehensive Dependency Checks

Check every dependency your agent needs to function:

# app/health.py
import httpx
import redis.asyncio as aioredis
from app.config import settings

async def run_dependency_checks() -> dict:
    checks = {}

    # Check Redis (session store) - bound the connection so a hung Redis
    # cannot stall the whole probe
    try:
        r = aioredis.from_url(settings.redis_url, socket_connect_timeout=3, socket_timeout=3)
        await r.ping()
        await r.aclose()
        checks["redis"] = {"status": "ok"}
    except Exception as e:
        checks["redis"] = {"status": "down", "error": str(e)}

    # Check LLM API reachability
    try:
        async with httpx.AsyncClient(timeout=5) as client:
            resp = await client.get(
                "https://api.openai.com/v1/models",
                headers={"Authorization": f"Bearer {settings.openai_api_key}"},
            )
            if resp.status_code == 200:
                checks["llm_api"] = {"status": "ok"}
            elif resp.status_code == 401:
                checks["llm_api"] = {"status": "down", "error": "Invalid API key"}
            else:
                checks["llm_api"] = {"status": "degraded", "error": f"HTTP {resp.status_code}"}
    except httpx.TimeoutException:
        checks["llm_api"] = {"status": "degraded", "error": "Timeout"}
    except Exception as e:
        checks["llm_api"] = {"status": "down", "error": str(e)}

    return checks

Kubernetes Probe Configuration

Configure probes in your Deployment manifest with appropriate timing:


# k8s/deployment.yaml
containers:
  - name: agent
    image: agent-service:1.0.0
    ports:
      - containerPort: 8000
    livenessProbe:
      httpGet:
        path: /healthz
        port: 8000
      initialDelaySeconds: 15
      periodSeconds: 20
      timeoutSeconds: 3
      failureThreshold: 3
    readinessProbe:
      httpGet:
        path: /readyz
        port: 8000
      initialDelaySeconds: 5
      periodSeconds: 10
      timeoutSeconds: 5
      failureThreshold: 2
    startupProbe:
      httpGet:
        path: /healthz
        port: 8000
      initialDelaySeconds: 5
      periodSeconds: 5
      failureThreshold: 12

The startup probe gives slow-starting agents up to 60 seconds (12 failures x 5 second period) to initialize before liveness checks begin. This prevents Kubernetes from kill-looping an agent that takes 30 seconds to load a local model.

Graceful Startup with FastAPI Lifespan

Use lifespan events to mark the service ready only after all dependencies are initialized:

# app/main.py
from contextlib import asynccontextmanager
from datetime import datetime, timezone
from fastapi import FastAPI
import app.health as health_module

@asynccontextmanager
async def lifespan(app: FastAPI):
    # Startup: initialize dependencies before accepting traffic
    await initialize_redis_pool()
    await warm_up_agent()  # Optional: pre-load models
    health_module.startup_complete = True
    health_module.startup_time = datetime.now(timezone.utc)
    print("Agent service ready to accept traffic")

    yield

    # Shutdown: Clean up
    health_module.startup_complete = False
    await close_redis_pool()
    print("Agent service shut down cleanly")

app = FastAPI(lifespan=lifespan)
app.include_router(health_module.router)

Graceful Shutdown for In-Flight Requests

When Kubernetes sends SIGTERM, finish active agent requests before exiting:

import signal

import app.health as health_module

active_requests: set[str] = set()

async def track_request(request_id: str):
    """Dependency that tracks in-flight requests (use with Depends)."""
    active_requests.add(request_id)
    try:
        yield
    finally:
        active_requests.discard(request_id)

def handle_sigterm(signum, frame):
    # Fail the readiness probe so Kubernetes stops routing new traffic,
    # then let in-flight requests finish within the grace period
    print(f"SIGTERM received, {len(active_requests)} requests in flight")
    health_module.startup_complete = False

signal.signal(signal.SIGTERM, handle_sigterm)

Set terminationGracePeriodSeconds in your Deployment to match your maximum expected agent response time:

spec:
  terminationGracePeriodSeconds: 90
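The SIGTERM handler above only flips the readiness flag; before the process actually exits you typically also wait for the tracked request set to empty. A minimal drain helper might look like this (a sketch; the `drain` function and its defaults are not from the code above):

```python
import asyncio
import time

async def drain(active: set, max_wait: float = 60.0, poll: float = 0.2) -> bool:
    """Poll until all tracked requests finish or max_wait elapses.
    Returns True if the set drained fully."""
    deadline = time.monotonic() + max_wait
    while active and time.monotonic() < deadline:
        await asyncio.sleep(poll)
    return not active
```

You would call it in the lifespan shutdown path, e.g. `await drain(active_requests, max_wait=85)`, keeping `max_wait` safely under the 90-second grace period.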

FAQ

Should the liveness probe check downstream dependencies like the LLM API?

No. Liveness probes should only check whether the process itself is healthy. If your liveness probe depends on an external API and that API has an outage, Kubernetes will restart all your pods — making the situation worse. Put dependency checks in the readiness probe, which removes the pod from traffic without killing it.

How do I handle partial degradation where some tools work but others do not?

Return a 200 from the readiness probe with a "degraded" status. Your agent code reads the health status and disables unavailable tools at runtime rather than rejecting all requests. This gives users a reduced-capability agent instead of a complete outage. Log degraded dependencies prominently so your monitoring catches it.
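As a sketch of that pattern (the tool-to-dependency mapping shown here is hypothetical):

```python
def enabled_tools(tool_deps: dict[str, str], checks: dict) -> list[str]:
    """Return the tools whose backing dependency reported 'ok'.

    tool_deps maps a tool name to the dependency check it relies on,
    e.g. {"search_docs": "vector_db", "remember": "redis"}.
    """
    return [
        tool for tool, dep in tool_deps.items()
        if checks.get(dep, {}).get("status") == "ok"
    ]
```

Feeding this list into the agent's tool registry on each request means a degraded dependency shrinks the toolset instead of failing every call.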

What timeout should I set for health check endpoints?

Keep health check endpoints fast — under 5 seconds total. Cache dependency check results for 10-15 seconds to avoid hammering downstream services with health check traffic. If your Redis check takes 3 seconds, something is wrong with Redis, and you should report it as degraded immediately rather than waiting for a longer timeout.


#HealthChecks #AIAgents #Kubernetes #Observability #FastAPI #AgenticAI #LearnAI #AIEngineering


CallSphere Team

Expert insights on AI voice agents and customer communication automation.
