Deploying AI Agents on Kubernetes: Scaling, Health Checks, and Resource Management
Technical guide to Kubernetes deployment for AI agents including container design, HPA scaling, readiness and liveness probes, GPU resource requests, and cost optimization.
Why Kubernetes for AI Agents
AI agents in production need the same operational guarantees as any critical service: high availability, automatic scaling, rolling deployments, health monitoring, and resource isolation. Kubernetes provides all of these out of the box, plus features that are particularly valuable for AI workloads: GPU scheduling, horizontal pod autoscaling based on custom metrics, and namespace-based isolation for multi-tenant agent deployments.
This guide covers the end-to-end process of deploying AI agents on Kubernetes, from container design through scaling and cost optimization.
Container Design for AI Agents
AI agent containers differ from typical web service containers in three ways: they often need ML libraries (which are large), they may require GPU drivers, and their startup time is longer due to model loading or embedding initialization.
```python
# agent_server.py — FastAPI server wrapping an AI agent
from contextlib import asynccontextmanager

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

# Global state initialized at startup
agent_system = None

@asynccontextmanager
async def lifespan(app: FastAPI):
    global agent_system
    # Startup: initialize agent, load models, connect to vector DB
    agent_system = await initialize_agent_system()
    yield
    # Shutdown: cleanup connections
    await agent_system.shutdown()

app = FastAPI(lifespan=lifespan)

class AgentRequest(BaseModel):
    message: str
    conversation_id: str | None = None
    user_id: str

class AgentResponse(BaseModel):
    response: str
    conversation_id: str
    tokens_used: int
    duration_ms: float

@app.post("/agent/run", response_model=AgentResponse)
async def run_agent(request: AgentRequest):
    if agent_system is None:
        raise HTTPException(503, "Agent system not initialized")
    result = await agent_system.handle(
        message=request.message,
        conversation_id=request.conversation_id,
        user_id=request.user_id,
    )
    return AgentResponse(
        response=result.output,
        conversation_id=result.conversation_id,
        tokens_used=result.tokens,
        duration_ms=result.duration_ms,
    )

@app.get("/healthz")
async def health():
    return {"status": "healthy"}

@app.get("/readyz")
async def ready():
    if agent_system is None or not agent_system.is_ready():
        raise HTTPException(503, "Not ready")
    return {"status": "ready"}
```
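The lifespan hook above calls an `initialize_agent_system` helper that the snippet leaves undefined. A minimal sketch of what it might look like — the `AgentSystem` class and everything inside it are hypothetical placeholders, not part of the original server:

```python
import asyncio

class AgentSystem:
    """Hypothetical container for the agent's long-lived resources."""

    def __init__(self):
        self._ready = False

    async def startup(self):
        # In a real server: load prompt templates, warm embedding caches,
        # open LLM / database / vector-store connections
        await asyncio.sleep(0)  # placeholder for real async initialization
        self._ready = True

    def is_ready(self) -> bool:
        return self._ready

    async def shutdown(self):
        # In a real server: drain and close connection pools cleanly
        self._ready = False

async def initialize_agent_system() -> AgentSystem:
    system = AgentSystem()
    await system.startup()
    return system
```

Whatever the real implementation does, `is_ready()` should only flip to true after every dependency the agent needs is actually usable, since the readiness probe relies on it.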
The Dockerfile should use multi-stage builds to keep the image size manageable:
```dockerfile
# Dockerfile
FROM python:3.12-slim AS builder
WORKDIR /build
COPY requirements.txt .
RUN pip install --no-cache-dir --prefix=/install -r requirements.txt

FROM python:3.12-slim
WORKDIR /app
COPY --from=builder /install /usr/local
COPY . .
EXPOSE 8000
CMD ["uvicorn", "agent_server:app", "--host", "0.0.0.0", "--port", "8000", "--workers", "2"]
```
Kubernetes Deployment Manifest
A production-grade deployment manifest for an AI agent includes resource requests and limits, health probes, anti-affinity rules, and proper environment variable management.
```yaml
# agent-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: billing-agent
  namespace: ai-agents
  labels:
    app: billing-agent
    tier: specialist
spec:
  replicas: 3
  selector:
    matchLabels:
      app: billing-agent
  template:
    metadata:
      labels:
        app: billing-agent
        tier: specialist
    spec:
      containers:
        - name: agent
          image: registry.example.com/billing-agent:v1.4.2
          ports:
            - containerPort: 8000
          resources:
            requests:
              cpu: "500m"
              memory: "512Mi"
            limits:
              cpu: "2000m"
              memory: "2Gi"
          env:
            - name: OPENAI_API_KEY
              valueFrom:
                secretKeyRef:
                  name: agent-secrets
                  key: openai-api-key
            - name: DATABASE_URL
              valueFrom:
                secretKeyRef:
                  name: agent-secrets
                  key: database-url
            - name: AGENT_MAX_TOKENS
              value: "4096"
            - name: AGENT_TIMEOUT_SECONDS
              value: "30"
          livenessProbe:
            httpGet:
              path: /healthz
              port: 8000
            initialDelaySeconds: 10
            periodSeconds: 15
            failureThreshold: 3
          readinessProbe:
            httpGet:
              path: /readyz
              port: 8000
            initialDelaySeconds: 20
            periodSeconds: 10
            failureThreshold: 2
          startupProbe:
            httpGet:
              path: /healthz
              port: 8000
            initialDelaySeconds: 5
            periodSeconds: 5
            failureThreshold: 30  # Allow up to 2.5 min startup
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchLabels:
                    app: billing-agent
                topologyKey: kubernetes.io/hostname
```
Key Configuration Decisions
Resource requests vs limits. CPU requests should reflect the baseline load (LLM calls are I/O-bound, not CPU-bound). Memory limits should account for peak usage including conversation context buffers. For agents that call LLM APIs (not running local models), 512Mi-2Gi memory is typical.
Startup probe. AI agents often take 15-60 seconds to initialize (loading embeddings, connecting to vector databases, warming caches). The startup probe prevents the liveness probe from killing pods during initialization. Set failureThreshold * periodSeconds to exceed your worst-case startup time.
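The sizing rule is simple arithmetic; the numbers below are the ones from the manifest above, with a 120-second worst case chosen purely as an example:

```python
def startup_budget_seconds(failure_threshold: int, period_seconds: int) -> int:
    # The startup probe gives a pod failureThreshold * periodSeconds
    # to come up before the kubelet gives up and restarts the container
    return failure_threshold * period_seconds

budget = startup_budget_seconds(failure_threshold=30, period_seconds=5)
worst_case_startup = 120  # example value: slowest observed cold start, seconds
assert budget > worst_case_startup  # 150s of budget covers a 120s startup
```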
Pod anti-affinity. Spread agent replicas across nodes so a single node failure cannot take out every replica at once. Use preferredDuringSchedulingIgnoredDuringExecution rather than the required variant so scheduling still succeeds in resource-constrained clusters.
Health Checks That Actually Work
The biggest mistake in AI agent health checks is making them too simple. A basic HTTP 200 from /healthz tells you the process is running, not that the agent can actually serve requests.
```python
import asyncio

@app.get("/readyz")
async def readiness_check():
    checks = {}
    # Check LLM API connectivity
    try:
        await asyncio.wait_for(
            agent_system.llm_client.ping(), timeout=5.0
        )
        checks["llm_api"] = "ok"
    except Exception as e:
        checks["llm_api"] = f"error: {str(e)}"
    # Check database connectivity
    try:
        await asyncio.wait_for(
            agent_system.db.execute("SELECT 1"), timeout=3.0
        )
        checks["database"] = "ok"
    except Exception as e:
        checks["database"] = f"error: {str(e)}"
    # Check vector store connectivity
    try:
        await asyncio.wait_for(
            agent_system.vector_store.health(), timeout=3.0
        )
        checks["vector_store"] = "ok"
    except Exception as e:
        checks["vector_store"] = f"error: {str(e)}"
    # Report current load (informational, not a failure condition)
    current_load = agent_system.active_requests
    max_load = agent_system.max_concurrent_requests
    checks["load"] = f"{current_load}/{max_load}"
    all_ok = all(
        v == "ok" for k, v in checks.items() if k != "load"
    )
    if not all_ok:
        raise HTTPException(
            status_code=503,
            detail={"status": "not_ready", "checks": checks},
        )
    return {"status": "ready", "checks": checks}
```
Liveness probes should be lightweight and check only if the process is healthy (not deadlocked, not out of memory). Do not include external dependency checks in liveness probes — a database outage should not cause pod restarts.
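One process-local liveness signal that touches no external dependency is event-loop responsiveness: if the loop is starved by blocking calls or runaway CPU work, a zero-second sleep takes noticeably longer than zero. A sketch of the measurement; the 1.0-second threshold in the comment is an assumption to tune, not a recommendation from this guide:

```python
import asyncio
import time

async def event_loop_lag() -> float:
    """Return how long a zero-second sleep actually took.

    A large value means the event loop is blocked or starved —
    a useful dependency-free liveness signal.
    """
    start = time.monotonic()
    await asyncio.sleep(0)
    return time.monotonic() - start

# Wiring it into the FastAPI app from earlier might look like:
#
# @app.get("/healthz")
# async def liveness_check():
#     lag = await event_loop_lag()
#     if lag > 1.0:  # threshold is an assumption; tune per workload
#         raise HTTPException(503, f"event loop lagging: {lag:.2f}s")
#     return {"status": "healthy"}
```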
Readiness probes should verify the agent can serve requests: LLM API accessible, database connected, vector store reachable. Failing readiness removes the pod from the service endpoint without restarting it.
Horizontal Pod Autoscaling
AI agents have a unique scaling profile. CPU usage is low (most time is spent waiting for LLM API responses), but concurrent request capacity is limited by memory and connection pools. Custom metrics provide better scaling signals than CPU.
```yaml
# hpa.yaml — Scale based on active requests per pod
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: billing-agent-hpa
  namespace: ai-agents
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: billing-agent
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Pods
      pods:
        metric:
          name: agent_active_requests
        target:
          type: AverageValue
          averageValue: "8"  # Scale up when avg exceeds 8 per pod
    - type: Pods
      pods:
        metric:
          name: agent_request_queue_depth
        target:
          type: AverageValue
          averageValue: "5"
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 30
      policies:
        - type: Pods
          value: 4
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300  # Wait 5 min before scaling down
      policies:
        - type: Pods
          value: 2
          periodSeconds: 120
```
Expose custom metrics from your agent server using a Prometheus client:
```python
from prometheus_client import Gauge, Histogram, make_asgi_app

active_requests = Gauge(
    "agent_active_requests",
    "Number of currently active agent requests",
)
request_queue_depth = Gauge(
    "agent_request_queue_depth",
    "Number of requests waiting in queue",
)
request_duration = Histogram(
    "agent_request_duration_seconds",
    "Agent request duration",
    buckets=[0.5, 1, 2, 5, 10, 30, 60, 120],
)

# Mount Prometheus metrics endpoint
metrics_app = make_asgi_app()
app.mount("/metrics", metrics_app)
```
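The gauges only help the HPA if the request path actually updates them. One way to wire that in is a small async context manager around each request. The sketch below uses plain counters so it runs standalone; in the real server the marked lines would call `active_requests.inc()`, `active_requests.dec()`, and `request_duration.observe()` instead:

```python
import time
from contextlib import asynccontextmanager

class RequestTracker:
    """Standalone sketch of per-request metric bookkeeping."""

    def __init__(self):
        self.active = 0
        self.durations: list[float] = []

    @asynccontextmanager
    async def track(self):
        self.active += 1                 # active_requests.inc()
        start = time.monotonic()
        try:
            yield
        finally:
            self.active -= 1             # active_requests.dec()
            elapsed = time.monotonic() - start
            self.durations.append(elapsed)  # request_duration.observe(elapsed)

# Usage inside the /agent/run handler:
#     async with tracker.track():
#         result = await agent_system.handle(...)
```

Because the decrement sits in a `finally` block, the gauge stays accurate even when the agent call raises.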
Scaling Down Safely
AI agent requests can take 5-60 seconds. Scaling down too aggressively kills pods with in-flight requests. Configure a generous terminationGracePeriodSeconds and handle SIGTERM gracefully:
```python
import asyncio
import signal

async def drain_requests():
    logger.info("Received shutdown signal, draining requests...")
    agent_system.stop_accepting_requests()
    # Wait for in-flight requests to complete
    while agent_system.active_requests > 0:
        logger.info(
            f"Waiting for {agent_system.active_requests} "
            f"in-flight requests"
        )
        await asyncio.sleep(2)
    logger.info("All requests drained, shutting down")

def install_shutdown_handler():
    # signal.signal() cannot await a coroutine; register a callback on
    # the running event loop that schedules the drain as a task instead
    loop = asyncio.get_running_loop()
    loop.add_signal_handler(
        signal.SIGTERM, lambda: asyncio.create_task(drain_requests())
    )
```
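For the drain loop to matter, the pod spec has to give it time. A fragment for the deployment above; the 90-second value and the 5-second preStop pause are assumptions — size the grace period to your longest expected request plus headroom:

```yaml
spec:
  template:
    spec:
      terminationGracePeriodSeconds: 90
      containers:
        - name: agent
          lifecycle:
            preStop:
              exec:
                # Brief pause so the pod is removed from Service
                # endpoints before SIGTERM reaches the process
                command: ["sleep", "5"]
```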
GPU Resource Management
Agents running local models (not calling external APIs) need GPU resources. Kubernetes manages GPUs as extended resources.
```yaml
# GPU deployment for local model inference (container spec excerpt)
containers:
  - name: agent-with-local-model
    image: registry.example.com/local-inference-agent:v2.1
    resources:
      requests:
        cpu: "2000m"
        memory: "8Gi"
        nvidia.com/gpu: "1"
      limits:
        cpu: "4000m"
        memory: "16Gi"
        nvidia.com/gpu: "1"
```
For mixed workloads where some agents call APIs and others run local models, use node selectors or taints to schedule GPU-requiring pods only on GPU nodes:
```yaml
nodeSelector:
  gpu-type: "a100"
tolerations:
  - key: "nvidia.com/gpu"
    operator: "Exists"
    effect: "NoSchedule"
```
Cost Optimization Strategies
Kubernetes cost optimization for AI agents focuses on three areas: compute efficiency, LLM API spend, and infrastructure right-sizing.
Spot/preemptible nodes for non-critical agents. Evaluation runners, batch processing agents, and development environments can tolerate preemption. Save 60-80% on compute costs.
Request-based scaling over CPU-based scaling. Since AI agents are I/O-bound, CPU-based HPA under-scales during high load and over-scales during idle periods.
Pod disruption budgets prevent Kubernetes from evicting too many agent pods during node maintenance.
```yaml
# pdb.yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: billing-agent-pdb
  namespace: ai-agents
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: billing-agent
```
FAQ
How many uvicorn workers should an AI agent pod run?
For agents that primarily call external LLM APIs (I/O-bound), 2-4 workers per pod is typical. Each worker handles concurrent requests via asyncio, so the concurrency is workers * async_concurrency. For agents running local inference (CPU/GPU-bound), use 1 worker per GPU. Monitor memory usage per worker — each worker loads its own copy of any in-memory models or caches.
Should each agent type have its own deployment or share a deployment?
Each agent type should have its own deployment. This allows independent scaling (billing agents may need 10 replicas during invoice season while sales agents need 2), independent rollouts (update the billing agent without affecting other agents), and independent resource allocation. Share common infrastructure (databases, message queues) but not compute.
How do you handle LLM API rate limits across multiple pods?
Use a centralized rate limiter (Redis-based token bucket or sliding window) that all pods consult before making LLM API calls. Alternatively, divide your API rate limit by the number of pods and configure per-pod limits. The centralized approach is more efficient (it allows burst handling) but adds a dependency.
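The token-bucket half of that answer can be sketched in-process. A shared, multi-pod limiter would keep the same two pieces of state (current tokens and last-refill timestamp) in Redis and perform refill-plus-consume atomically, for example in a Lua script; the class and numbers below are a single-process illustration of the algorithm, not that implementation:

```python
import time

class TokenBucket:
    """Single-process sketch of the token-bucket algorithm.

    A multi-pod version would hold `tokens` and `last_refill` in Redis
    and do refill + consume atomically (e.g. via a Lua script).
    """

    def __init__(self, capacity: float, refill_per_second: float):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.tokens = capacity
        self.last_refill = time.monotonic()

    def try_consume(self, cost: float = 1.0) -> bool:
        # Refill based on elapsed time, capped at capacity
        now = time.monotonic()
        elapsed = now - self.last_refill
        self.tokens = min(
            self.capacity, self.tokens + elapsed * self.refill_per_second
        )
        self.last_refill = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False  # caller should back off or queue the LLM call

bucket = TokenBucket(capacity=60, refill_per_second=1.0)  # roughly 60 calls/min
```

The capacity parameter is what gives the bucket its burst behavior: a pod can spend the whole bucket at once after a quiet period, which a fixed per-pod quota cannot do.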
What is the minimum replica count for production agents?
Run at least 2 replicas for any agent handling production traffic. This ensures availability during pod restarts, deployments, and node failures. For critical agents (triage, payment processing), run 3+ replicas across multiple availability zones. A pod disruption budget of minAvailable: 2 ensures at least 2 pods are always running even during voluntary disruptions.
Written by
CallSphere Team
Expert insights on AI voice agents and customer communication automation.