Deploying AI Agents on Kubernetes: Production Architecture
A hands-on guide to deploying AI agent systems on Kubernetes, covering pod design, autoscaling based on queue depth, secrets management, health checks, network policies, and monitoring for LLM-powered production services.
Why Kubernetes for AI Agents
AI agent systems have unique deployment requirements: they make long-running API calls (30-120 seconds), consume variable memory depending on context window size, need access to external secrets (API keys), and benefit from horizontal scaling based on queue depth rather than CPU utilization. Kubernetes handles all of these requirements with its declarative resource management, autoscaling primitives, and secret management.
Architecture Overview
A production AI agent deployment on Kubernetes typically has four components:
        Ingress (nginx/traefik)
                 |
            API Gateway
          /      |       \
  Agent API  Worker Pods  Vector DB
      |          |            |
    Redis    Redis Queue   Qdrant/
   (cache)   (task queue)  Weaviate
- Agent API: Handles HTTP requests, enqueues tasks, returns results
- Worker Pods: Process agent tasks from the queue (LLM calls, tool execution)
- Vector DB: Serves retrieval queries for RAG pipelines
- Redis: Shared cache and task queue
Pod Design for AI Agents
The Agent Worker Pod
apiVersion: apps/v1
kind: Deployment
metadata:
  name: agent-worker
  namespace: ai-agents
spec:
  replicas: 3
  selector:
    matchLabels:
      app: agent-worker
  template:
    metadata:
      labels:
        app: agent-worker
    spec:
      containers:
      - name: worker
        image: myregistry/agent-worker:v1.2.0
        resources:
          requests:
            memory: "512Mi"
            cpu: "250m"
          limits:
            memory: "2Gi"
            cpu: "1000m"
        env:
        - name: ANTHROPIC_API_KEY
          valueFrom:
            secretKeyRef:
              name: llm-api-keys
              key: anthropic-key
        - name: REDIS_URL
          value: "redis://redis-master:6379/0"
        - name: WORKER_CONCURRENCY
          value: "4"
        - name: MAX_CONTEXT_TOKENS
          value: "100000"
        livenessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 10
          periodSeconds: 30
          timeoutSeconds: 10
        readinessProbe:
          httpGet:
            path: /ready
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 10
        startupProbe:
          httpGet:
            path: /health
            port: 8080
          failureThreshold: 30
          periodSeconds: 2
      terminationGracePeriodSeconds: 120
Key design decisions:
- Memory limits at 2Gi: Agent workers need memory for conversation context, tool results, and parsed documents. 2Gi handles most workloads.
- terminationGracePeriodSeconds: 120: Agent tasks can run for minutes. Give pods enough time to finish current work before shutdown.
- Startup probe with high failure threshold: The worker may need time to load models or establish connections.
Health Check Implementation
import logging

from fastapi import FastAPI
from fastapi.responses import JSONResponse

logger = logging.getLogger("agent-worker")
app = FastAPI()

worker_healthy = True
worker_ready = False

@app.get("/health")
async def health():
    # FastAPI treats a (body, status) tuple as a plain 200 response, so an
    # explicit JSONResponse is needed for the kubelet to see the 503.
    if not worker_healthy:
        return JSONResponse(status_code=503, content={"status": "unhealthy"})
    return {"status": "healthy"}

@app.get("/ready")
async def ready():
    if not worker_ready:
        return JSONResponse(status_code=503, content={"status": "not ready"})
    return {
        "status": "ready",
        "active_tasks": task_counter.value,      # app-specific task counter
        "queue_depth": await get_queue_depth(),  # sketched just below
    }

@app.on_event("startup")
async def startup():
    global worker_ready
    # Verify LLM API and Redis connectivity; on failure the pod stays
    # unready (readiness 503) rather than being restarted by liveness.
    try:
        await test_llm_connection()
        await test_redis_connection()
        worker_ready = True
    except Exception as e:
        logger.error("startup_failed: %s", e)
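The /ready payload leans on two helpers the article does not define. A minimal sketch with redis-py's asyncio client (the helper names and queue key are assumptions, chosen to match the KEDA trigger configured below):

import redis.asyncio as redis

# Hypothetical helpers backing /ready and the startup check above.
redis_client = redis.from_url("redis://redis-master:6379/0")

async def get_queue_depth() -> int:
    # LLEN of the task list = tasks waiting for a worker; this is the
    # same list KEDA watches for autoscaling.
    return await redis_client.llen("agent:task_queue")

async def test_redis_connection() -> None:
    # Raises on failure, which keeps the pod unready at startup.
    await redis_client.ping()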
Autoscaling AI Agent Workers
Standard CPU-based autoscaling does not work for AI agents. Workers spend most of their time waiting for LLM API responses (I/O bound), so CPU stays low even when the system is overloaded. Scale based on queue depth instead.
KEDA (Kubernetes Event-Driven Autoscaling)
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: agent-worker-scaler
  namespace: ai-agents
spec:
  scaleTargetRef:
    name: agent-worker
  minReplicaCount: 2
  maxReplicaCount: 20
  cooldownPeriod: 300
  pollingInterval: 15
  triggers:
  - type: redis
    metadata:
      address: redis-master:6379
      listName: agent:task_queue
      listLength: "5"            # scale up when >5 queued tasks per replica
      activationListLength: "1"
  - type: prometheus
    metadata:
      serverAddress: http://prometheus:9090
      # agent_task_duration_seconds is a histogram, so the P95 comes from
      # histogram_quantile over its buckets; "> bool 30" yields 0 or 1,
      # which is what the threshold of 1 expects.
      query: |
        histogram_quantile(0.95, sum(rate(agent_task_duration_seconds_bucket[5m])) by (le)) > bool 30
      threshold: "1"
This configuration scales workers when:
- The Redis task queue exceeds 5 items per worker (primary trigger)
- The P95 task duration exceeds 30 seconds (indicating overload)
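On the producer side, the Agent API simply pushes serialized tasks onto the same Redis list KEDA watches. A minimal sketch (the task shape and the enqueue_task name are assumptions):

import json
import uuid

import redis.asyncio as redis

redis_client = redis.from_url("redis://redis-master:6379/0")

async def enqueue_task(model: str, payload: dict) -> str:
    # RPUSH to the tail; workers pop from the head, and KEDA scales on
    # the LLEN of this list.
    task_id = str(uuid.uuid4())
    await redis_client.rpush(
        "agent:task_queue",
        json.dumps({"id": task_id, "model": model, "payload": payload}),
    )
    return task_id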
Scaling Considerations
| Factor | Recommendation |
|---|---|
| Min replicas | 2 (high availability) |
| Max replicas | Based on LLM API rate limits |
| Scale-up speed | Aggressive (15s polling) |
| Scale-down speed | Conservative (300s cooldown) |
| Tasks per worker | 3-5 concurrent (I/O bound) |
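The max-replica cap deserves a quick back-of-envelope check against your provider's rate limit. All numbers below are illustrative, and the sketch assumes one LLM call per task:

# Illustrative numbers -- substitute your provider's real limits.
rate_limit_rpm = 1000        # provider requests/minute
avg_task_duration_s = 45     # observed mean task time
concurrency_per_worker = 4   # WORKER_CONCURRENCY from the Deployment

# Each concurrent slot finishes ~60/duration tasks per minute, so one
# worker issues concurrency * (60 / duration) requests/minute.
requests_per_worker_per_min = concurrency_per_worker * (60 / avg_task_duration_s)
max_safe_replicas = int(rate_limit_rpm / requests_per_worker_per_min)
print(max_safe_replicas)  # ~187, so the configured cap of 20 is comfortably safe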
Secrets Management
Never put API keys directly in manifests as plain environment-variable values. Use Kubernetes Secrets, ideally synced from an external store by an external secrets operator:
# Using External Secrets Operator with AWS Secrets Manager
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: llm-api-keys
  namespace: ai-agents
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: aws-secrets-store
    kind: ClusterSecretStore
  target:
    name: llm-api-keys
    creationPolicy: Owner
  data:
  - secretKey: anthropic-key
    remoteRef:
      key: /production/ai-agents/anthropic-api-key
  - secretKey: openai-key
    remoteRef:
      key: /production/ai-agents/openai-api-key
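The workers consume these keys through the secretKeyRef shown in the Deployment. As one way to implement the test_llm_connection startup check from the health-check section, a cheap one-token request with the Anthropic SDK verifies the key end to end (the model name below is a placeholder):

import anthropic

# AsyncAnthropic reads ANTHROPIC_API_KEY from the environment, which the
# Deployment populates from the llm-api-keys Secret.
client = anthropic.AsyncAnthropic()

async def test_llm_connection() -> None:
    # A minimal one-token request; raises (keeping the pod unready) if
    # the key is missing, revoked, or the API is unreachable.
    await client.messages.create(
        model="claude-3-5-sonnet-20241022",  # placeholder; use your model
        max_tokens=1,
        messages=[{"role": "user", "content": "ping"}],
    )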
Persistent Storage for Agent State
Agents that maintain conversation history or checkpoint state need persistent storage:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: agent-checkpoints
  namespace: ai-agents
spec:
  accessModes:
  - ReadWriteMany          # multiple workers need access
  storageClassName: efs-sc # EFS (or another RWX-capable class) for shared access
  resources:
    requests:
      storage: 50Gi
For most production systems, using Redis or PostgreSQL for agent state is preferable to filesystem storage:
# Redis for agent state and caching
apiVersion: apps/v1
kind: Deployment
metadata:
  name: redis-master
  namespace: ai-agents
spec:
  replicas: 1
  selector:
    matchLabels:
      app: redis
  template:
    metadata:
      labels:
        app: redis  # must match spec.selector or the Deployment is rejected
    spec:
      containers:
      - name: redis
        image: redis:7-alpine
        command: ["redis-server", "--maxmemory", "1gb",
                  "--maxmemory-policy", "allkeys-lru",
                  "--appendonly", "yes"]
        resources:
          requests:
            memory: "1Gi"
            cpu: "250m"
          limits:
            memory: "1.5Gi"
        volumeMounts:
        - name: redis-data
          mountPath: /data
      volumes:
      - name: redis-data
        persistentVolumeClaim:
          claimName: redis-pvc
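On the application side, checkpointing conversation state to this Redis instance stays simple. A sketch under an assumed key scheme and TTL:

import json

import redis.asyncio as redis

redis_client = redis.from_url("redis://redis-master:6379/0")

async def save_checkpoint(conversation_id: str, state: dict) -> None:
    # One key per conversation; the one-hour TTL keeps abandoned sessions
    # from crowding the allkeys-lru memory limit configured above.
    await redis_client.set(
        f"agent:checkpoint:{conversation_id}", json.dumps(state), ex=3600
    )

async def load_checkpoint(conversation_id: str) -> dict | None:
    raw = await redis_client.get(f"agent:checkpoint:{conversation_id}")
    return json.loads(raw) if raw else None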
Network Policies
Restrict agent pods to only communicate with necessary services:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: agent-worker-policy
  namespace: ai-agents
spec:
  podSelector:
    matchLabels:
      app: agent-worker
  policyTypes:
  - Ingress
  - Egress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: agent-api
    ports:
    - port: 8080
  egress:
  # Allow Redis
  - to:
    - podSelector:
        matchLabels:
          app: redis
    ports:
    - port: 6379
  # Allow external LLM APIs (HTTPS)
  - to:
    - ipBlock:
        cidr: 0.0.0.0/0
    ports:
    - port: 443
      protocol: TCP
  # Allow DNS (an empty "to" matches all destinations)
  - to: []
    ports:
    - port: 53
      protocol: UDP
    - port: 53
      protocol: TCP
Monitoring and Alerting
Deploy a Prometheus ServiceMonitor (it selects the Service fronting the worker pods) and Grafana dashboards:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: agent-worker-monitor
  namespace: ai-agents
spec:
  selector:
    matchLabels:
      app: agent-worker   # label on the Service, not the pods
  endpoints:
  - port: metrics         # named port on that Service
    interval: 15s
    path: /metrics
Key metrics to expose from your agent workers:
from prometheus_client import Counter, Histogram, Gauge

# Task metrics
tasks_processed = Counter("agent_tasks_total", "Total tasks processed",
                          ["status", "model"])
task_duration = Histogram("agent_task_duration_seconds", "Task duration",
                          buckets=[1, 5, 10, 30, 60, 120, 300])
active_tasks = Gauge("agent_active_tasks", "Currently running tasks")

# LLM metrics
llm_requests = Counter("llm_requests_total", "LLM API calls",
                       ["model", "status"])
llm_tokens = Counter("llm_tokens_total", "Tokens used",
                     ["model", "direction"])  # direction: input/output
llm_latency = Histogram("llm_request_duration_seconds", "LLM call latency",
                        ["model"])

# Cost metrics
llm_cost = Counter("llm_cost_dollars_total", "Estimated LLM cost",
                   ["model"])
Graceful Shutdown
When Kubernetes terminates a pod (during scaling, updates, or node drain), the worker must finish its current task:
import asyncio
import logging
import signal

logger = logging.getLogger("agent-worker")
shutdown_event = asyncio.Event()

def handle_shutdown(signum, frame):
    # SIGTERM arrives at the start of the terminationGracePeriodSeconds
    # window; stop pulling new tasks but let the current one finish.
    logger.info("Received shutdown signal, finishing current tasks...")
    shutdown_event.set()

signal.signal(signal.SIGTERM, handle_shutdown)

async def worker_loop():
    # Uses the metrics defined above (active_tasks, tasks_processed) and
    # app-specific queue helpers (get_task_from_queue, process_task,
    # requeue_task).
    while not shutdown_event.is_set():
        task = await get_task_from_queue(timeout=5)
        if task is None:
            continue
        active_tasks.inc()
        try:
            await process_task(task)
            tasks_processed.labels(status="success", model=task.model).inc()
        except Exception:
            tasks_processed.labels(status="error", model=task.model).inc()
            await requeue_task(task)  # put it back for another worker
        finally:
            active_tasks.dec()
    logger.info("Worker shutdown complete")
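get_task_from_queue and requeue_task are left to the application. A minimal Redis-list sketch (the Task shape and helper names are assumptions; the blocking BLPOP with a short timeout is what lets the loop re-check shutdown_event every few seconds):

import json
from dataclasses import asdict, dataclass

import redis.asyncio as redis

redis_client = redis.from_url("redis://redis-master:6379/0")

@dataclass
class Task:
    id: str
    model: str
    payload: dict

async def get_task_from_queue(timeout: int = 5) -> Task | None:
    # BLPOP blocks up to `timeout` seconds, then returns None so the
    # worker loop can notice a pending shutdown.
    item = await redis_client.blpop("agent:task_queue", timeout=timeout)
    if item is None:
        return None
    _key, raw = item
    return Task(**json.loads(raw))

async def requeue_task(task: Task) -> None:
    # Push to the head so a failed task is retried promptly.
    await redis_client.lpush("agent:task_queue", json.dumps(asdict(task)))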
Key Takeaways
Deploying AI agents on Kubernetes is fundamentally about adapting Kubernetes primitives to the unique characteristics of LLM workloads: I/O-bound processing, long task durations, variable memory usage, and queue-based scaling. The patterns covered here -- KEDA-based autoscaling, generous termination grace periods, queue-depth triggers, and LLM-specific health checks -- form the foundation of a production-ready deployment.