Deploying AI Agents on Kubernetes: Scaling, Health Checks, and Resource Management
Technical guide to Kubernetes deployment for AI agents including container design, HPA scaling, readiness and liveness probes, GPU resource requests, and cost optimization.
Why Kubernetes for AI Agents
AI agents in production need the same operational guarantees as any critical service: high availability, automatic scaling, rolling deployments, health monitoring, and resource isolation. Kubernetes provides all of these out of the box, plus features that are particularly valuable for AI workloads: GPU scheduling, horizontal pod autoscaling based on custom metrics, and namespace-based isolation for multi-tenant agent deployments.
This guide covers the end-to-end process of deploying AI agents on Kubernetes, from container design through scaling and cost optimization.
Container Design for AI Agents
AI agent containers differ from typical web service containers in three ways: they often need ML libraries (which are large), they may require GPU drivers, and their startup time is longer due to model loading or embedding initialization.
```python
# agent_server.py — FastAPI server wrapping an AI agent
from contextlib import asynccontextmanager

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

# Global state initialized at startup
agent_system = None

@asynccontextmanager
async def lifespan(app: FastAPI):
    global agent_system
    # Startup: initialize agent, load models, connect to vector DB
    agent_system = await initialize_agent_system()
    yield
    # Shutdown: cleanup connections
    await agent_system.shutdown()

app = FastAPI(lifespan=lifespan)

class AgentRequest(BaseModel):
    message: str
    conversation_id: str | None = None
    user_id: str

class AgentResponse(BaseModel):
    response: str
    conversation_id: str
    tokens_used: int
    duration_ms: float

@app.post("/agent/run", response_model=AgentResponse)
async def run_agent(request: AgentRequest):
    if agent_system is None:
        raise HTTPException(503, "Agent system not initialized")
    result = await agent_system.handle(
        message=request.message,
        conversation_id=request.conversation_id,
        user_id=request.user_id,
    )
    return AgentResponse(
        response=result.output,
        conversation_id=result.conversation_id,
        tokens_used=result.tokens,
        duration_ms=result.duration_ms,
    )

@app.get("/healthz")
async def health():
    return {"status": "healthy"}

@app.get("/readyz")
async def ready():
    if agent_system is None or not agent_system.is_ready():
        raise HTTPException(503, "Not ready")
    return {"status": "ready"}
```
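The lifespan hook above calls an `initialize_agent_system` helper that the snippet leaves undefined. A minimal sketch of what it might look like — the `AgentSystem` class and everything inside it are hypothetical placeholders, not part of the original server:

```python
import asyncio

class AgentSystem:
    """Hypothetical container for the agent's long-lived resources."""

    def __init__(self):
        self._ready = False

    async def startup(self):
        # In a real server: load prompt templates, warm embedding caches,
        # open LLM / database / vector-store connections
        await asyncio.sleep(0)  # placeholder for real async initialization
        self._ready = True

    def is_ready(self) -> bool:
        return self._ready

    async def shutdown(self):
        # In a real server: drain and close connection pools cleanly
        self._ready = False

async def initialize_agent_system() -> AgentSystem:
    system = AgentSystem()
    await system.startup()
    return system
```

Whatever the real implementation does, `is_ready()` should only flip to true after every dependency the agent needs is actually usable, since the readiness probe relies on it.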
The Dockerfile should use multi-stage builds to keep the image size manageable:
```dockerfile
# Dockerfile
FROM python:3.12-slim AS builder
WORKDIR /build
COPY requirements.txt .
RUN pip install --no-cache-dir --prefix=/install -r requirements.txt

FROM python:3.12-slim
WORKDIR /app
COPY --from=builder /install /usr/local
COPY . .
EXPOSE 8000
CMD ["uvicorn", "agent_server:app", "--host", "0.0.0.0", "--port", "8000", "--workers", "2"]
```
Kubernetes Deployment Manifest
A production-grade deployment manifest for an AI agent includes resource requests and limits, health probes, anti-affinity rules, and proper environment variable management.
```yaml
# agent-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: billing-agent
  namespace: ai-agents
  labels:
    app: billing-agent
    tier: specialist
spec:
  replicas: 3
  selector:
    matchLabels:
      app: billing-agent
  template:
    metadata:
      labels:
        app: billing-agent
        tier: specialist
    spec:
      containers:
        - name: agent
          image: registry.example.com/billing-agent:v1.4.2
          ports:
            - containerPort: 8000
          resources:
            requests:
              cpu: "500m"
              memory: "512Mi"
            limits:
              cpu: "2000m"
              memory: "2Gi"
          env:
            - name: OPENAI_API_KEY
              valueFrom:
                secretKeyRef:
                  name: agent-secrets
                  key: openai-api-key
            - name: DATABASE_URL
              valueFrom:
                secretKeyRef:
                  name: agent-secrets
                  key: database-url
            - name: AGENT_MAX_TOKENS
              value: "4096"
            - name: AGENT_TIMEOUT_SECONDS
              value: "30"
          livenessProbe:
            httpGet:
              path: /healthz
              port: 8000
            initialDelaySeconds: 10
            periodSeconds: 15
            failureThreshold: 3
          readinessProbe:
            httpGet:
              path: /readyz
              port: 8000
            initialDelaySeconds: 20
            periodSeconds: 10
            failureThreshold: 2
          startupProbe:
            httpGet:
              path: /healthz
              port: 8000
            initialDelaySeconds: 5
            periodSeconds: 5
            failureThreshold: 30  # Allow up to 2.5 min startup
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchLabels:
                    app: billing-agent
                topologyKey: kubernetes.io/hostname
```
Key Configuration Decisions
Resource requests vs limits. CPU requests should reflect the baseline load (LLM calls are I/O-bound, not CPU-bound). Memory limits should account for peak usage including conversation context buffers. For agents that call LLM APIs (not running local models), 512Mi-2Gi memory is typical.
Startup probe. AI agents often take 15-60 seconds to initialize (loading embeddings, connecting to vector databases, warming caches). The startup probe prevents the liveness probe from killing pods during initialization. Set failureThreshold * periodSeconds to exceed your worst-case startup time.
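The sizing rule is simple arithmetic; the numbers below are the ones from the manifest above, with a 120-second worst case chosen purely as an example:

```python
def startup_budget_seconds(failure_threshold: int, period_seconds: int) -> int:
    # The startup probe gives a pod failureThreshold * periodSeconds
    # to come up before the kubelet gives up and restarts the container
    return failure_threshold * period_seconds

budget = startup_budget_seconds(failure_threshold=30, period_seconds=5)
worst_case_startup = 120  # example value: slowest observed cold start, seconds
assert budget > worst_case_startup  # 150s of budget covers a 120s startup
```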
Pod anti-affinity. Spread agent replicas across nodes so a single node failure cannot take out every replica at once. Use preferredDuringSchedulingIgnoredDuringExecution rather than the required variant so scheduling still succeeds in resource-constrained clusters.
Health Checks That Actually Work
The biggest mistake in AI agent health checks is making them too simple. A basic HTTP 200 from /healthz tells you the process is running, not that the agent can actually serve requests.
```python
import asyncio

@app.get("/readyz")
async def readiness_check():
    checks = {}
    # Check LLM API connectivity
    try:
        await asyncio.wait_for(
            agent_system.llm_client.ping(), timeout=5.0
        )
        checks["llm_api"] = "ok"
    except Exception as e:
        checks["llm_api"] = f"error: {str(e)}"
    # Check database connectivity
    try:
        await asyncio.wait_for(
            agent_system.db.execute("SELECT 1"), timeout=3.0
        )
        checks["database"] = "ok"
    except Exception as e:
        checks["database"] = f"error: {str(e)}"
    # Check vector store connectivity
    try:
        await asyncio.wait_for(
            agent_system.vector_store.health(), timeout=3.0
        )
        checks["vector_store"] = "ok"
    except Exception as e:
        checks["vector_store"] = f"error: {str(e)}"
    # Report current load (informational, not a failure condition)
    current_load = agent_system.active_requests
    max_load = agent_system.max_concurrent_requests
    checks["load"] = f"{current_load}/{max_load}"
    all_ok = all(
        v == "ok" for k, v in checks.items() if k != "load"
    )
    if not all_ok:
        raise HTTPException(
            status_code=503,
            detail={"status": "not_ready", "checks": checks},
        )
    return {"status": "ready", "checks": checks}
```
Liveness probes should be lightweight and check only if the process is healthy (not deadlocked, not out of memory). Do not include external dependency checks in liveness probes — a database outage should not cause pod restarts.
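One process-local liveness signal that touches no external dependency is event-loop responsiveness: if the loop is starved by blocking calls or runaway CPU work, a zero-second sleep takes noticeably longer than zero. A sketch of the measurement; the 1.0-second threshold in the comment is an assumption to tune, not a recommendation from this guide:

```python
import asyncio
import time

async def event_loop_lag() -> float:
    """Return how long a zero-second sleep actually took.

    A large value means the event loop is blocked or starved —
    a useful dependency-free liveness signal.
    """
    start = time.monotonic()
    await asyncio.sleep(0)
    return time.monotonic() - start

# Wiring it into the FastAPI app from earlier might look like:
#
# @app.get("/healthz")
# async def liveness_check():
#     lag = await event_loop_lag()
#     if lag > 1.0:  # threshold is an assumption; tune per workload
#         raise HTTPException(503, f"event loop lagging: {lag:.2f}s")
#     return {"status": "healthy"}
```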
Readiness probes should verify the agent can serve requests: LLM API accessible, database connected, vector store reachable. Failing readiness removes the pod from the service endpoint without restarting it.
Horizontal Pod Autoscaling
AI agents have a unique scaling profile. CPU usage is low (most time is spent waiting for LLM API responses), but concurrent request capacity is limited by memory and connection pools. Custom metrics provide better scaling signals than CPU.
```yaml
# hpa.yaml — Scale based on active requests per pod
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: billing-agent-hpa
  namespace: ai-agents
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: billing-agent
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Pods
      pods:
        metric:
          name: agent_active_requests
        target:
          type: AverageValue
          averageValue: "8"  # Scale up when avg exceeds 8 per pod
    - type: Pods
      pods:
        metric:
          name: agent_request_queue_depth
        target:
          type: AverageValue
          averageValue: "5"
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 30
      policies:
        - type: Pods
          value: 4
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300  # Wait 5 min before scaling down
      policies:
        - type: Pods
          value: 2
          periodSeconds: 120
```
Expose custom metrics from your agent server using a Prometheus client:
```python
from prometheus_client import Gauge, Histogram, make_asgi_app

active_requests = Gauge(
    "agent_active_requests",
    "Number of currently active agent requests",
)
request_queue_depth = Gauge(
    "agent_request_queue_depth",
    "Number of requests waiting in queue",
)
request_duration = Histogram(
    "agent_request_duration_seconds",
    "Agent request duration",
    buckets=[0.5, 1, 2, 5, 10, 30, 60, 120],
)

# Mount Prometheus metrics endpoint
metrics_app = make_asgi_app()
app.mount("/metrics", metrics_app)
```
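The gauges only help the HPA if the request path actually updates them. One way to wire that in is a small async context manager around each request. The sketch below uses plain counters so it runs standalone; in the real server the marked lines would call `active_requests.inc()`, `active_requests.dec()`, and `request_duration.observe()` instead:

```python
import time
from contextlib import asynccontextmanager

class RequestTracker:
    """Standalone sketch of per-request metric bookkeeping."""

    def __init__(self):
        self.active = 0
        self.durations: list[float] = []

    @asynccontextmanager
    async def track(self):
        self.active += 1                 # active_requests.inc()
        start = time.monotonic()
        try:
            yield
        finally:
            self.active -= 1             # active_requests.dec()
            elapsed = time.monotonic() - start
            self.durations.append(elapsed)  # request_duration.observe(elapsed)

# Usage inside the /agent/run handler:
#     async with tracker.track():
#         result = await agent_system.handle(...)
```

Because the decrement sits in a `finally` block, the gauge stays accurate even when the agent call raises.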
Scaling Down Safely
AI agent requests can take 5-60 seconds. Scaling down too aggressively kills pods with in-flight requests. Configure a generous terminationGracePeriodSeconds and handle SIGTERM gracefully:
```python
import asyncio
import signal

async def drain_requests():
    logger.info("Received shutdown signal, draining requests...")
    agent_system.stop_accepting_requests()
    # Wait for in-flight requests to complete
    while agent_system.active_requests > 0:
        logger.info(
            f"Waiting for {agent_system.active_requests} "
            f"in-flight requests"
        )
        await asyncio.sleep(2)
    logger.info("All requests drained, shutting down")

def install_shutdown_handler():
    # signal.signal() cannot await a coroutine; register a callback on
    # the running event loop that schedules the drain as a task instead
    loop = asyncio.get_running_loop()
    loop.add_signal_handler(
        signal.SIGTERM, lambda: asyncio.create_task(drain_requests())
    )
```
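For the drain loop to matter, the pod spec has to give it time. A fragment for the deployment above; the 90-second value and the 5-second preStop pause are assumptions — size the grace period to your longest expected request plus headroom:

```yaml
spec:
  template:
    spec:
      terminationGracePeriodSeconds: 90
      containers:
        - name: agent
          lifecycle:
            preStop:
              exec:
                # Brief pause so the pod is removed from Service
                # endpoints before SIGTERM reaches the process
                command: ["sleep", "5"]
```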
GPU Resource Management
Agents running local models (not calling external APIs) need GPU resources. Kubernetes manages GPUs as extended resources.
```yaml
# GPU deployment for local model inference (container spec excerpt)
containers:
  - name: agent-with-local-model
    image: registry.example.com/local-inference-agent:v2.1
    resources:
      requests:
        cpu: "2000m"
        memory: "8Gi"
        nvidia.com/gpu: "1"
      limits:
        cpu: "4000m"
        memory: "16Gi"
        nvidia.com/gpu: "1"
```
For mixed workloads where some agents call APIs and others run local models, use node selectors or taints to schedule GPU-requiring pods only on GPU nodes:
```yaml
nodeSelector:
  gpu-type: "a100"
tolerations:
  - key: "nvidia.com/gpu"
    operator: "Exists"
    effect: "NoSchedule"
```
Cost Optimization Strategies
Kubernetes cost optimization for AI agents focuses on three areas: compute efficiency, LLM API spend, and infrastructure right-sizing.
Spot/preemptible nodes for non-critical agents. Evaluation runners, batch processing agents, and development environments can tolerate preemption. Save 60-80% on compute costs.
Request-based scaling over CPU-based scaling. Since AI agents are I/O-bound, CPU-based HPA under-scales during high load and over-scales during idle periods.
Pod disruption budgets prevent Kubernetes from evicting too many agent pods during node maintenance.
```yaml
# pdb.yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: billing-agent-pdb
  namespace: ai-agents
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: billing-agent
```
FAQ
How many uvicorn workers should an AI agent pod run?
For agents that primarily call external LLM APIs (I/O-bound), 2-4 workers per pod is typical. Each worker handles concurrent requests via asyncio, so the concurrency is workers * async_concurrency. For agents running local inference (CPU/GPU-bound), use 1 worker per GPU. Monitor memory usage per worker — each worker loads its own copy of any in-memory models or caches.
Should each agent type have its own deployment or share a deployment?
Each agent type should have its own deployment. This allows independent scaling (billing agents may need 10 replicas during invoice season while sales agents need 2), independent rollouts (update the billing agent without affecting other agents), and independent resource allocation. Share common infrastructure (databases, message queues) but not compute.
How do you handle LLM API rate limits across multiple pods?
Use a centralized rate limiter (Redis-based token bucket or sliding window) that all pods consult before making LLM API calls. Alternatively, divide your API rate limit by the number of pods and configure per-pod limits. The centralized approach is more efficient (it allows burst handling) but adds a dependency.
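The token-bucket half of that answer can be sketched in-process. A shared, multi-pod limiter would keep the same two pieces of state (current tokens and last-refill timestamp) in Redis and perform refill-plus-consume atomically, for example in a Lua script; the class and numbers below are a single-process illustration of the algorithm, not that implementation:

```python
import time

class TokenBucket:
    """Single-process sketch of the token-bucket algorithm.

    A multi-pod version would hold `tokens` and `last_refill` in Redis
    and do refill + consume atomically (e.g. via a Lua script).
    """

    def __init__(self, capacity: float, refill_per_second: float):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.tokens = capacity
        self.last_refill = time.monotonic()

    def try_consume(self, cost: float = 1.0) -> bool:
        # Refill based on elapsed time, capped at capacity
        now = time.monotonic()
        elapsed = now - self.last_refill
        self.tokens = min(
            self.capacity, self.tokens + elapsed * self.refill_per_second
        )
        self.last_refill = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False  # caller should back off or queue the LLM call

bucket = TokenBucket(capacity=60, refill_per_second=1.0)  # roughly 60 calls/min
```

The capacity parameter is what gives the bucket its burst behavior: a pod can spend the whole bucket at once after a quiet period, which a fixed per-pod quota cannot do.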
What is the minimum replica count for production agents?
Run at least 2 replicas for any agent handling production traffic. This ensures availability during pod restarts, deployments, and node failures. For critical agents (triage, payment processing), run 3+ replicas across multiple availability zones. A pod disruption budget of minAvailable: 2 ensures at least 2 pods are always running even during voluntary disruptions.
Written by
CallSphere Team
Expert insights on AI voice agents and customer communication automation.