Kubernetes for AI Agents: Scaling Agent Workloads with K8s

Deploy and scale AI agent services on Kubernetes with Deployments, Services, Horizontal Pod Autoscalers, resource limits, and health checks for production-grade reliability.

Why Kubernetes for AI Agent Workloads

A single FastAPI container running your AI agent handles one user well. But production workloads demand horizontal scaling, automatic recovery from crashes, rolling updates without downtime, and resource isolation. Kubernetes provides all of this through declarative configuration — you describe the desired state, and K8s continuously reconciles reality to match.

AI agents present unique scaling challenges: requests are long-running (seconds to minutes per LLM call), memory usage spikes with large context windows, and traffic patterns are bursty. Kubernetes gives you the primitives to handle all of these.

The Deployment Manifest

A Deployment defines how many replicas of your agent pod to run and how to update them:

# k8s/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: agent-service
  namespace: ai-agents
  labels:
    app: agent-service
spec:
  replicas: 3
  selector:
    matchLabels:
      app: agent-service
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
      maxSurge: 1
  template:
    metadata:
      labels:
        app: agent-service
    spec:
      containers:
        - name: agent
          image: registry.example.com/agent-service:1.0.0
          ports:
            - containerPort: 8000
          env:
            - name: OPENAI_API_KEY
              valueFrom:
                secretKeyRef:
                  name: agent-secrets
                  key: openai-api-key
            - name: AGENT_MODEL
              value: "gpt-4o"
          resources:
            requests:
              cpu: "250m"
              memory: "512Mi"
            limits:
              cpu: "1000m"
              memory: "1Gi"
          livenessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 10
            periodSeconds: 15
            timeoutSeconds: 5
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 5
            periodSeconds: 10

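Key decisions here: resource requests are what the scheduler uses to place pods, so each replica is guaranteed at least 250m of CPU and 512Mi of memory on its node. Limits prevent a single agent from consuming all node resources during a large context window request: a container that exceeds its memory limit is OOM-killed and restarted, while one that hits its CPU limit is throttled.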
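To make the scheduling arithmetic concrete, here is a minimal Python sketch that estimates how many agent pods fit on one node when only requests are counted. The helper names and the node size in the usage note are assumptions for illustration; a real scheduler works from allocatable capacity and also accounts for every other pod on the node.

```python
# Binary suffixes used by Kubernetes memory quantities (subset, for illustration).
_MEMORY_UNITS = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30}


def cpu_to_millicores(quantity: str) -> int:
    """Convert a CPU quantity such as "250m" or "1" to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return round(float(quantity) * 1000)


def memory_to_bytes(quantity: str) -> int:
    """Convert a memory quantity such as "512Mi" or "1Gi" to bytes."""
    for suffix, factor in _MEMORY_UNITS.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)  # no suffix: plain byte count


def pods_per_node(node_cpu: str, node_memory: str,
                  req_cpu: str, req_memory: str) -> int:
    """Pods that fit on one otherwise empty node, limited by the scarcer resource."""
    by_cpu = cpu_to_millicores(node_cpu) // cpu_to_millicores(req_cpu)
    by_memory = memory_to_bytes(node_memory) // memory_to_bytes(req_memory)
    return min(by_cpu, by_memory)
```

With the requests from the manifest, `pods_per_node("4", "16Gi", "250m", "512Mi")` is bounded by CPU rather than memory on a hypothetical node with 4 CPUs and 16Gi allocatable.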

Service for Internal Traffic

A Service gives your agent pods a stable DNS name and load balances traffic:

# k8s/service.yaml
apiVersion: v1
kind: Service
metadata:
  name: agent-service
  namespace: ai-agents
spec:
  selector:
    app: agent-service
  ports:
    - port: 80
      targetPort: 8000
      protocol: TCP
  type: ClusterIP

Other services in the cluster reach the agent at http://agent-service.ai-agents.svc.cluster.local.
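As a sketch of how another in-cluster service might build that address and probe the agent, the snippet below uses only the Python standard library. The function names are hypothetical, and it assumes the cluster uses the default `cluster.local` domain.

```python
import urllib.error
import urllib.request


def service_url(name: str, namespace: str, port: int = 80, path: str = "/") -> str:
    """Build the in-cluster URL for a Service (assumes the cluster.local domain)."""
    host = f"{name}.{namespace}.svc.cluster.local"
    authority = host if port == 80 else f"{host}:{port}"
    return f"http://{authority}{path}"


def fetch_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status code for a GET request, including 4xx/5xx responses."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as exc:
        # urlopen raises on error statuses; the code is still what we want.
        return exc.code
```

From a pod in the same cluster, `fetch_status(service_url("agent-service", "ai-agents", path="/health"))` would go through the Service on port 80 and reach a pod on port 8000.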

Secrets Management

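Store your API keys in Kubernetes Secrets, not in Deployment manifests. The ai-agents namespace must exist before the Secret can be created in it, so run the kubectl create namespace step from Applying the Manifests first: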

kubectl create secret generic agent-secrets \
  --namespace ai-agents \
  --from-literal=openai-api-key=sk-proj-your-key-here

Reference them in your Deployment with valueFrom.secretKeyRef as shown above.
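On the application side, the secret arrives as an ordinary environment variable. The following is a minimal sketch of reading it at startup and failing fast when it is missing; `load_settings`, `MissingSecretError`, and `redact` are hypothetical names, and falling back to "gpt-4o" when AGENT_MODEL is unset is an assumption that mirrors the Deployment above.

```python
import os


class MissingSecretError(RuntimeError):
    """Raised at startup when a required secret is not in the environment."""


def load_settings(environ=os.environ) -> dict:
    """Read agent configuration from the environment injected by the Deployment."""
    api_key = environ.get("OPENAI_API_KEY", "").strip()
    if not api_key:
        raise MissingSecretError(
            "OPENAI_API_KEY is not set; check the agent-secrets Secret"
        )
    return {
        "openai_api_key": api_key,
        # Assumption: default to the model named in the Deployment manifest.
        "agent_model": environ.get("AGENT_MODEL", "gpt-4o"),
    }


def redact(secret: str, visible: int = 4) -> str:
    """Return a log-safe form that shows only the last few characters."""
    if len(secret) <= visible:
        return "*" * len(secret)
    return "*" * (len(secret) - visible) + secret[-visible:]
```

Failing at startup means a misconfigured pod never passes its probes, so the rollout stalls instead of serving requests that would fail on the first LLM call.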

Horizontal Pod Autoscaler

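Scale pods automatically with a HorizontalPodAutoscaler. The manifest below scales on CPU utilization, measured as a percentage of each pod's CPU request; autoscaling/v2 also supports custom metrics, which can be a better signal for I/O-bound agents whose CPU stays low while requests queue up: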

# k8s/hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: agent-service-hpa
  namespace: ai-agents
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: agent-service
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 30
      policies:
        - type: Pods
          value: 2
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300

The scaleUp policy adds at most 2 pods every 60 seconds, and scaleDown.stabilizationWindowSeconds: 300 prevents thrashing: agent traffic is bursty, and you do not want Kubernetes removing pods only to recreate them a minute later.
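The replica count the autoscaler aims for follows the formula in the Kubernetes documentation: desired = ceil(current replicas × current metric / target metric), skipped when the ratio is within a tolerance of 1.0 (0.1 by default). The Python sketch below reproduces that arithmetic with the values from this manifest as defaults. It is a simplification: it ignores stabilization windows, unready pods, and missing metrics.

```python
import math


def desired_replicas(current_replicas: int, current_utilization: float,
                     target_utilization: float = 60, min_replicas: int = 2,
                     max_replicas: int = 10, tolerance: float = 0.1) -> int:
    """Replica count the HPA formula would recommend for one evaluation."""
    ratio = current_utilization / target_utilization
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to target: no change
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


def cap_scale_up(current_replicas: int, desired: int, max_added_pods: int = 2) -> int:
    """Apply the `type: Pods, value: 2` scale-up policy to a single period."""
    return min(desired, current_replicas + max_added_pods)
```

For example, three pods averaging 90% of their CPU request give a ratio of 1.5, so the formula asks for ceil(4.5) = 5 replicas, which the scale-up policy allows in one 60-second period.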

Health Check Endpoint Design

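Your /health endpoint should verify the dependencies a pod needs in order to serve traffic. Note that the Deployment above points both the liveness and readiness probes at /health, so a Redis outage would fail the liveness probe and restart otherwise healthy pods; a common refinement is a separate liveness path that checks only the process itself. A dependency-checking handler looks like this: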

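import os

import redis.asyncio as redis  # assumption: redis-py >= 4.2, which ships the async client
from fastapi import FastAPI
from fastapi.responses import JSONResponse

app = FastAPI()
# Assumption: REDIS_URL is a hypothetical variable; the article does not show how redis_client is built.
redis_client = redis.from_url(os.environ.get("REDIS_URL", "redis://localhost:6379"))


@app.get("/health")
async def health():
    checks = {}
    try:
        # PING is a cheap round trip that proves Redis is reachable
        await redis_client.ping()
        checks["redis"] = "ok"
    except Exception:
        checks["redis"] = "down"

    # Any failed dependency turns the response into a 503 so the probe fails
    overall = "ok" if all(v == "ok" for v in checks.values()) else "degraded"
    status_code = 200 if overall == "ok" else 503
    return JSONResponse(
        content={"status": overall, "checks": checks},
        status_code=status_code,
    )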
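One detail worth guarding against: the readiness probe in the Deployment does not set timeoutSeconds, so the Kubernetes default of 1 second applies, and a dependency that hangs would fail the probe by timing out rather than by returning 503. The sketch below, using only asyncio, shows one way to bound each check and run several concurrently. `run_checks` and the 0.5 second default are assumptions, not part of the original service.

```python
import asyncio
from typing import Awaitable, Callable, Dict, Tuple

Check = Callable[[], Awaitable[object]]


async def run_checks(checks: Dict[str, Check],
                     timeout: float = 0.5) -> Tuple[int, dict]:
    """Run all dependency checks concurrently, each bounded by `timeout` seconds."""

    async def run_one(check: Check) -> str:
        try:
            await asyncio.wait_for(check(), timeout)
            return "ok"
        except Exception:  # covers timeouts as well as connection errors
            return "down"

    names = list(checks)
    results = await asyncio.gather(*(run_one(checks[name]) for name in names))
    report = dict(zip(names, results))
    overall = "ok" if all(v == "ok" for v in report.values()) else "degraded"
    status_code = 200 if overall == "ok" else 503
    return status_code, {"status": overall, "checks": report}
```

Inside the handler this could be called as `await run_checks({"redis": redis_client.ping})`, with the returned status code and body passed to JSONResponse.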

Applying the Manifests

kubectl create namespace ai-agents

kubectl apply -f k8s/deployment.yaml
kubectl apply -f k8s/service.yaml
kubectl apply -f k8s/hpa.yaml

# Watch the rollout
kubectl rollout status deployment/agent-service -n ai-agents

# Check pods
kubectl get pods -n ai-agents -l app=agent-service

FAQ

How should I set resource limits for AI agent pods?

Start by profiling your agent under realistic load. Most Python-based agents with FastAPI use 200-500 MB of RAM at baseline. Set memory requests at your p50 usage and limits at your p99. For CPU, LLM-backed agents are I/O-bound, so a 250m-500m CPU request is typically sufficient. Monitor with kubectl top pods (which requires metrics-server to be installed in the cluster) and adjust based on actual usage patterns.
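The p50/p99 rule can be turned into numbers once you have memory samples from profiling. This is a minimal sketch using the nearest-rank percentile definition; `percentile` and `suggest_memory` are hypothetical helpers, and the samples are expected in MiB as whole numbers.

```python
from typing import Dict, Sequence


def percentile(samples: Sequence[int], pct: int) -> int:
    """Nearest-rank percentile: the smallest sample with pct% of values at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    # Integer ceiling of pct * n / 100, avoiding floating-point rounding.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def suggest_memory(samples_mib: Sequence[int]) -> Dict[str, str]:
    """Memory request at p50 and limit at p99, formatted as Kubernetes quantities."""
    return {
        "requests": f"{percentile(samples_mib, 50)}Mi",
        "limits": f"{percentile(samples_mib, 99)}Mi",
    }
```

With few samples the p99 is simply the maximum observed value, so collect enough data under peak load before treating the result as a limit.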

What happens to in-flight agent requests during a rolling update?

Kubernetes sends a SIGTERM to the old pod and waits for terminationGracePeriodSeconds (default 30 seconds) before forcefully killing it. Handle SIGTERM in your FastAPI app by completing in-flight requests and rejecting new ones. Set the grace period longer than your maximum expected agent response time to prevent dropped requests.
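The "complete in-flight requests, reject new ones" behavior can be sketched as a small gate object. `DrainGate` is a hypothetical helper built on asyncio, not a FastAPI feature, and ASGI servers such as Uvicorn already install their own SIGTERM handling, so in practice this logic may belong in the server's shutdown hooks rather than in a custom signal handler.

```python
import asyncio


class DrainGate:
    """Tracks in-flight requests and refuses new ones once draining starts.

    Create the gate inside the running event loop (for example at app startup).
    """

    def __init__(self) -> None:
        self._in_flight = 0
        self._draining = False
        self._idle = asyncio.Event()
        self._idle.set()  # nothing in flight yet

    def start_draining(self) -> None:
        """Call when SIGTERM arrives: stop admitting new requests."""
        self._draining = True

    def try_enter(self) -> bool:
        """Return False when the request should be rejected (for example with 503)."""
        if self._draining:
            return False
        self._in_flight += 1
        self._idle.clear()
        return True

    def leave(self) -> None:
        """Mark one admitted request as finished."""
        self._in_flight -= 1
        if self._in_flight == 0:
            self._idle.set()

    async def wait_idle(self, timeout: float) -> bool:
        """Wait for in-flight requests to finish; False if the timeout expires first."""
        try:
            await asyncio.wait_for(self._idle.wait(), timeout)
            return True
        except asyncio.TimeoutError:
            return False
```

The timeout passed to `wait_idle` should stay below terminationGracePeriodSeconds, since Kubernetes force-kills the container once the grace period ends regardless of what the process is doing.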

Should I use one pod per agent type or multiplex agents in a single pod?

For most teams, a single service that handles all agent types is simpler to operate. Route to different agent behaviors via a request parameter. Only split into separate Deployments when agent types have significantly different resource profiles — for example, a coding agent that needs 4 GB of RAM versus a simple Q&A agent that needs 512 MB.
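Routing by request parameter can be as simple as a lookup table from agent type to handler. The sketch below uses placeholder handlers; `qa_agent`, `coding_agent`, and `dispatch` are hypothetical names, and a real handler would call the LLM instead of echoing the prompt.

```python
from typing import Callable, Dict

Handler = Callable[[str], str]


def qa_agent(prompt: str) -> str:
    """Placeholder for a simple Q&A agent."""
    return f"[qa] {prompt}"


def coding_agent(prompt: str) -> str:
    """Placeholder for a heavier coding agent."""
    return f"[coding] {prompt}"


# One service, several behaviors: the request's agent type selects the handler.
AGENTS: Dict[str, Handler] = {"qa": qa_agent, "coding": coding_agent}


def dispatch(agent_type: str, prompt: str) -> str:
    """Route a request to the handler registered for its agent type."""
    try:
        handler = AGENTS[agent_type]
    except KeyError:
        raise ValueError(f"unknown agent type: {agent_type!r}") from None
    return handler(prompt)
```

If one handler later needs its own Deployment, the table entry can be replaced by a client that forwards to the new Service, leaving callers unchanged.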



CallSphere Team
