Deploying AI Agents on Kubernetes: Production Architecture
A hands-on guide to deploying AI agent systems on Kubernetes, covering pod design, autoscaling based on queue depth, secrets management, health checks, network policies, and monitoring for LLM-powered production services.
Why Kubernetes for AI Agents
AI agent systems have unique deployment requirements: they make long-running API calls (30-120 seconds), consume variable memory depending on context window size, need access to external secrets (API keys), and benefit from horizontal scaling based on queue depth rather than CPU utilization. Kubernetes handles all of these requirements with its declarative resource management, autoscaling primitives, and secret management.
Architecture Overview
A production AI agent deployment on Kubernetes typically has four components:
        Ingress (nginx/traefik)
                 |
            API Gateway
          /      |       \
  Agent API  Worker Pods  Vector DB
      |          |            |
    Redis    Redis Queue   Qdrant/
   (cache)   (task queue)  Weaviate
- Agent API: Handles HTTP requests, enqueues tasks, returns results
- Worker Pods: Process agent tasks from the queue (LLM calls, tool execution)
- Vector DB: Serves retrieval queries for RAG pipelines
- Redis: Shared cache and task queue
Pod Design for AI Agents
The Agent Worker Pod
apiVersion: apps/v1
kind: Deployment
metadata:
  name: agent-worker
  namespace: ai-agents
spec:
  replicas: 3
  selector:
    matchLabels:
      app: agent-worker
  template:
    metadata:
      labels:
        app: agent-worker
    spec:
      containers:
      - name: worker
        image: myregistry/agent-worker:v1.2.0
        resources:
          requests:
            memory: "512Mi"
            cpu: "250m"
          limits:
            memory: "2Gi"
            cpu: "1000m"
        env:
        - name: ANTHROPIC_API_KEY
          valueFrom:
            secretKeyRef:
              name: llm-api-keys
              key: anthropic-key
        - name: REDIS_URL
          value: "redis://redis-master:6379/0"
        - name: WORKER_CONCURRENCY
          value: "4"
        - name: MAX_CONTEXT_TOKENS
          value: "100000"
        livenessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 10
          periodSeconds: 30
          timeoutSeconds: 10
        readinessProbe:
          httpGet:
            path: /ready
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 10
        startupProbe:
          httpGet:
            path: /health
            port: 8080
          failureThreshold: 30
          periodSeconds: 2
      terminationGracePeriodSeconds: 120
Key design decisions:
- Memory limits at 2Gi: Agent workers need memory for conversation context, tool results, and parsed documents. 2Gi handles most workloads.
- terminationGracePeriodSeconds: 120: Agent tasks can run for minutes. Give pods enough time to finish current work before shutdown.
- Startup probe with high failure threshold: The worker may need time to load models or establish connections.
Health Check Implementation
import logging

from fastapi import FastAPI
from fastapi.responses import JSONResponse

logger = logging.getLogger("agent-worker")
app = FastAPI()

worker_healthy = True
worker_ready = False

@app.get("/health")
async def health():
    # FastAPI treats a (body, status) tuple as a plain 200 response, so an
    # explicit JSONResponse is needed for the kubelet to see the 503.
    if not worker_healthy:
        return JSONResponse(status_code=503, content={"status": "unhealthy"})
    return {"status": "healthy"}

@app.get("/ready")
async def ready():
    if not worker_ready:
        return JSONResponse(status_code=503, content={"status": "not ready"})
    return {
        "status": "ready",
        "active_tasks": task_counter.value,      # app-specific task counter
        "queue_depth": await get_queue_depth(),  # sketched just below
    }

@app.on_event("startup")
async def startup():
    global worker_ready
    # Verify LLM API and Redis connectivity; on failure the pod stays
    # unready (readiness 503) rather than being restarted by liveness.
    try:
        await test_llm_connection()
        await test_redis_connection()
        worker_ready = True
    except Exception as e:
        logger.error("startup_failed: %s", e)
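The /ready payload leans on two helpers the article does not define. A minimal sketch with redis-py's asyncio client (the helper names and queue key are assumptions, chosen to match the KEDA trigger configured below):

import redis.asyncio as redis

# Hypothetical helpers backing /ready and the startup check above.
redis_client = redis.from_url("redis://redis-master:6379/0")

async def get_queue_depth() -> int:
    # LLEN of the task list = tasks waiting for a worker; this is the
    # same list KEDA watches for autoscaling.
    return await redis_client.llen("agent:task_queue")

async def test_redis_connection() -> None:
    # Raises on failure, which keeps the pod unready at startup.
    await redis_client.ping()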
Autoscaling AI Agent Workers
Standard CPU-based autoscaling does not work for AI agents. Workers spend most of their time waiting for LLM API responses (I/O bound), so CPU stays low even when the system is overloaded. Scale based on queue depth instead.
KEDA (Kubernetes Event-Driven Autoscaling)
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: agent-worker-scaler
  namespace: ai-agents
spec:
  scaleTargetRef:
    name: agent-worker
  minReplicaCount: 2
  maxReplicaCount: 20
  cooldownPeriod: 300
  pollingInterval: 15
  triggers:
  - type: redis
    metadata:
      address: redis-master:6379
      listName: agent:task_queue
      listLength: "5"            # scale up when >5 queued tasks per replica
      activationListLength: "1"
  - type: prometheus
    metadata:
      serverAddress: http://prometheus:9090
      # agent_task_duration_seconds is a histogram, so the P95 comes from
      # histogram_quantile over its buckets; "> bool 30" yields 0 or 1,
      # which is what the threshold of 1 expects.
      query: |
        histogram_quantile(0.95, sum(rate(agent_task_duration_seconds_bucket[5m])) by (le)) > bool 30
      threshold: "1"
This configuration scales workers when:
- The Redis task queue exceeds 5 items per worker (primary trigger)
- The P95 task duration exceeds 30 seconds (indicating overload)
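On the producer side, the Agent API simply pushes serialized tasks onto the same Redis list KEDA watches. A minimal sketch (the task shape and the enqueue_task name are assumptions):

import json
import uuid

import redis.asyncio as redis

redis_client = redis.from_url("redis://redis-master:6379/0")

async def enqueue_task(model: str, payload: dict) -> str:
    # RPUSH to the tail; workers pop from the head, and KEDA scales on
    # the LLEN of this list.
    task_id = str(uuid.uuid4())
    await redis_client.rpush(
        "agent:task_queue",
        json.dumps({"id": task_id, "model": model, "payload": payload}),
    )
    return task_id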
Scaling Considerations
| Factor | Recommendation |
|---|---|
| Min replicas | 2 (high availability) |
| Max replicas | Based on LLM API rate limits |
| Scale-up speed | Aggressive (15s polling) |
| Scale-down speed | Conservative (300s cooldown) |
| Tasks per worker | 3-5 concurrent (I/O bound) |
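The max-replica cap deserves a quick back-of-envelope check against your provider's rate limit. All numbers below are illustrative, and the sketch assumes one LLM call per task:

# Illustrative numbers -- substitute your provider's real limits.
rate_limit_rpm = 1000        # provider requests/minute
avg_task_duration_s = 45     # observed mean task time
concurrency_per_worker = 4   # WORKER_CONCURRENCY from the Deployment

# Each concurrent slot finishes ~60/duration tasks per minute, so one
# worker issues concurrency * (60 / duration) requests/minute.
requests_per_worker_per_min = concurrency_per_worker * (60 / avg_task_duration_s)
max_safe_replicas = int(rate_limit_rpm / requests_per_worker_per_min)
print(max_safe_replicas)  # ~187, so the configured cap of 20 is comfortably safe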
Secrets Management
Never put API keys directly in manifests as plain environment-variable values. Use Kubernetes Secrets, ideally synced from an external store by an external secrets operator:
# Using External Secrets Operator with AWS Secrets Manager
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: llm-api-keys
  namespace: ai-agents
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: aws-secrets-store
    kind: ClusterSecretStore
  target:
    name: llm-api-keys
    creationPolicy: Owner
  data:
  - secretKey: anthropic-key
    remoteRef:
      key: /production/ai-agents/anthropic-api-key
  - secretKey: openai-key
    remoteRef:
      key: /production/ai-agents/openai-api-key
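The workers consume these keys through the secretKeyRef shown in the Deployment. As one way to implement the test_llm_connection startup check from the health-check section, a cheap one-token request with the Anthropic SDK verifies the key end to end (the model name below is a placeholder):

import anthropic

# AsyncAnthropic reads ANTHROPIC_API_KEY from the environment, which the
# Deployment populates from the llm-api-keys Secret.
client = anthropic.AsyncAnthropic()

async def test_llm_connection() -> None:
    # A minimal one-token request; raises (keeping the pod unready) if
    # the key is missing, revoked, or the API is unreachable.
    await client.messages.create(
        model="claude-3-5-sonnet-20241022",  # placeholder; use your model
        max_tokens=1,
        messages=[{"role": "user", "content": "ping"}],
    )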
Persistent Storage for Agent State
Agents that maintain conversation history or checkpoint state need persistent storage:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: agent-checkpoints
  namespace: ai-agents
spec:
  accessModes:
  - ReadWriteMany          # multiple workers need access
  storageClassName: efs-sc # EFS (or another RWX-capable class) for shared access
  resources:
    requests:
      storage: 50Gi
For most production systems, using Redis or PostgreSQL for agent state is preferable to filesystem storage:
# Redis for agent state and caching
apiVersion: apps/v1
kind: Deployment
metadata:
  name: redis-master
  namespace: ai-agents
spec:
  replicas: 1
  selector:
    matchLabels:
      app: redis
  template:
    metadata:
      labels:
        app: redis  # must match spec.selector or the Deployment is rejected
    spec:
      containers:
      - name: redis
        image: redis:7-alpine
        command: ["redis-server", "--maxmemory", "1gb",
                  "--maxmemory-policy", "allkeys-lru",
                  "--appendonly", "yes"]
        resources:
          requests:
            memory: "1Gi"
            cpu: "250m"
          limits:
            memory: "1.5Gi"
        volumeMounts:
        - name: redis-data
          mountPath: /data
      volumes:
      - name: redis-data
        persistentVolumeClaim:
          claimName: redis-pvc
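On the application side, checkpointing conversation state to this Redis instance stays simple. A sketch under an assumed key scheme and TTL:

import json

import redis.asyncio as redis

redis_client = redis.from_url("redis://redis-master:6379/0")

async def save_checkpoint(conversation_id: str, state: dict) -> None:
    # One key per conversation; the one-hour TTL keeps abandoned sessions
    # from crowding the allkeys-lru memory limit configured above.
    await redis_client.set(
        f"agent:checkpoint:{conversation_id}", json.dumps(state), ex=3600
    )

async def load_checkpoint(conversation_id: str) -> dict | None:
    raw = await redis_client.get(f"agent:checkpoint:{conversation_id}")
    return json.loads(raw) if raw else None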
Network Policies
Restrict agent pods to only communicate with necessary services:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: agent-worker-policy
  namespace: ai-agents
spec:
  podSelector:
    matchLabels:
      app: agent-worker
  policyTypes:
  - Ingress
  - Egress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: agent-api
    ports:
    - port: 8080
  egress:
  # Allow Redis
  - to:
    - podSelector:
        matchLabels:
          app: redis
    ports:
    - port: 6379
  # Allow external LLM APIs (HTTPS)
  - to:
    - ipBlock:
        cidr: 0.0.0.0/0
    ports:
    - port: 443
      protocol: TCP
  # Allow DNS (an empty "to" matches all destinations)
  - to: []
    ports:
    - port: 53
      protocol: UDP
    - port: 53
      protocol: TCP
Monitoring and Alerting
Deploy a Prometheus ServiceMonitor (it selects the Service fronting the worker pods) and Grafana dashboards:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: agent-worker-monitor
  namespace: ai-agents
spec:
  selector:
    matchLabels:
      app: agent-worker   # label on the Service, not the pods
  endpoints:
  - port: metrics         # named port on that Service
    interval: 15s
    path: /metrics
Key metrics to expose from your agent workers:
from prometheus_client import Counter, Histogram, Gauge

# Task metrics
tasks_processed = Counter("agent_tasks_total", "Total tasks processed",
                          ["status", "model"])
task_duration = Histogram("agent_task_duration_seconds", "Task duration",
                          buckets=[1, 5, 10, 30, 60, 120, 300])
active_tasks = Gauge("agent_active_tasks", "Currently running tasks")

# LLM metrics
llm_requests = Counter("llm_requests_total", "LLM API calls",
                       ["model", "status"])
llm_tokens = Counter("llm_tokens_total", "Tokens used",
                     ["model", "direction"])  # direction: input/output
llm_latency = Histogram("llm_request_duration_seconds", "LLM call latency",
                        ["model"])

# Cost metrics
llm_cost = Counter("llm_cost_dollars_total", "Estimated LLM cost",
                   ["model"])
Graceful Shutdown
When Kubernetes terminates a pod (during scaling, updates, or node drain), the worker must finish its current task:
import asyncio
import logging
import signal

logger = logging.getLogger("agent-worker")
shutdown_event = asyncio.Event()

def handle_shutdown(signum, frame):
    # SIGTERM arrives at the start of the terminationGracePeriodSeconds
    # window; stop pulling new tasks but let the current one finish.
    logger.info("Received shutdown signal, finishing current tasks...")
    shutdown_event.set()

signal.signal(signal.SIGTERM, handle_shutdown)

async def worker_loop():
    # Uses the metrics defined above (active_tasks, tasks_processed) and
    # app-specific queue helpers (get_task_from_queue, process_task,
    # requeue_task).
    while not shutdown_event.is_set():
        task = await get_task_from_queue(timeout=5)
        if task is None:
            continue
        active_tasks.inc()
        try:
            await process_task(task)
            tasks_processed.labels(status="success", model=task.model).inc()
        except Exception:
            tasks_processed.labels(status="error", model=task.model).inc()
            await requeue_task(task)  # put it back for another worker
        finally:
            active_tasks.dec()
    logger.info("Worker shutdown complete")
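get_task_from_queue and requeue_task are left to the application. A minimal Redis-list sketch (the Task shape and helper names are assumptions; the blocking BLPOP with a short timeout is what lets the loop re-check shutdown_event every few seconds):

import json
from dataclasses import asdict, dataclass

import redis.asyncio as redis

redis_client = redis.from_url("redis://redis-master:6379/0")

@dataclass
class Task:
    id: str
    model: str
    payload: dict

async def get_task_from_queue(timeout: int = 5) -> Task | None:
    # BLPOP blocks up to `timeout` seconds, then returns None so the
    # worker loop can notice a pending shutdown.
    item = await redis_client.blpop("agent:task_queue", timeout=timeout)
    if item is None:
        return None
    _key, raw = item
    return Task(**json.loads(raw))

async def requeue_task(task: Task) -> None:
    # Push to the head so a failed task is retried promptly.
    await redis_client.lpush("agent:task_queue", json.dumps(asdict(task)))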
Key Takeaways
Deploying AI agents on Kubernetes is fundamentally about adapting Kubernetes primitives to the unique characteristics of LLM workloads: I/O-bound processing, long task durations, variable memory usage, and queue-based scaling. The patterns covered here -- KEDA-based autoscaling, generous termination grace periods, queue-depth triggers, and LLM-specific health checks -- form the foundation of a production-ready deployment.