Kubernetes for AI Agents: Scaling Agent Workloads with K8s
Deploy and scale AI agent services on Kubernetes with Deployments, Services, Horizontal Pod Autoscalers, resource limits, and health checks for production-grade reliability.
Why Kubernetes for AI Agent Workloads
A single FastAPI container running your AI agent handles one user well. But production workloads demand horizontal scaling, automatic recovery from crashes, rolling updates without downtime, and resource isolation. Kubernetes provides all of this through declarative configuration — you describe the desired state, and K8s continuously reconciles reality to match.
AI agents present unique scaling challenges: requests are long-running (seconds to minutes per LLM call), memory usage spikes with large context windows, and traffic patterns are bursty. Kubernetes gives you the primitives to handle all of these.
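To make the effect of long-running requests concrete, the following is a rough sizing sketch based on Little's law (requests in flight = arrival rate × average latency). The function name, the per-pod concurrency figure, and the 30% headroom are illustrative assumptions, not values from this article; measure your own agent before relying on any of them.

```python
import math


def replicas_needed(requests_per_second: float,
                    avg_latency_seconds: float,
                    concurrency_per_pod: int,
                    headroom: float = 0.3) -> int:
    """Estimate pod count from Little's law: in-flight = arrival rate * latency."""
    in_flight = requests_per_second * avg_latency_seconds
    # Keep part of each pod's capacity free so a burst does not saturate it at once.
    usable_per_pod = concurrency_per_pod * (1.0 - headroom)
    return max(1, math.ceil(in_flight / usable_per_pod))


# 5 requests/s at 20 s each means roughly 100 requests in flight at any moment.
estimate = replicas_needed(5, 20, concurrency_per_pod=50)
```

Even a modest request rate turns into a large number of concurrent requests when each call takes tens of seconds, which is why replica counts for agents are driven by latency as much as by traffic volume.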
The Deployment Manifest
A Deployment defines how many replicas of your agent pod to run and how to update them:
# k8s/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: agent-service
  namespace: ai-agents
  labels:
    app: agent-service
spec:
  replicas: 3
  selector:
    matchLabels:
      app: agent-service
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
      maxSurge: 1
  template:
    metadata:
      labels:
        app: agent-service
    spec:
      containers:
        - name: agent
          image: registry.example.com/agent-service:1.0.0
          ports:
            - containerPort: 8000
          env:
            - name: OPENAI_API_KEY
              valueFrom:
                secretKeyRef:
                  name: agent-secrets
                  key: openai-api-key
            - name: AGENT_MODEL
              value: "gpt-4o"
          resources:
            requests:
              cpu: "250m"
              memory: "512Mi"
            limits:
              cpu: "1000m"
              memory: "1Gi"
          livenessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 10
            periodSeconds: 15
            timeoutSeconds: 5
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 5
            periodSeconds: 10
Key decisions here: resource requests guarantee minimum allocation so the scheduler places pods intelligently. Limits prevent a single agent from consuming all node resources during a large context window request.
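As a rough illustration of how requests drive scheduling, the sketch below parses the quantity strings used in the manifest and estimates how many agent pods fit on one node by requests alone. The helper names and the node size in the example are hypothetical, and the calculation ignores system-reserved capacity and other workloads on the node, so treat the result as an upper bound.

```python
def parse_cpu_millicores(quantity: str) -> int:
    """Convert a Kubernetes CPU quantity ("250m" or "1") to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return round(float(quantity) * 1000)


_MEMORY_UNITS = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30}


def parse_memory_bytes(quantity: str) -> int:
    """Convert a binary-suffixed memory quantity ("512Mi", "1Gi") to bytes."""
    for suffix, factor in _MEMORY_UNITS.items():
        if quantity.endswith(suffix):
            return int(quantity[:-len(suffix)]) * factor
    return int(quantity)  # no suffix: plain bytes


def pods_per_node(node_cpu: str, node_memory: str,
                  request_cpu: str, request_memory: str) -> int:
    """Pods that fit by requests alone; the scarcer resource decides."""
    by_cpu = parse_cpu_millicores(node_cpu) // parse_cpu_millicores(request_cpu)
    by_memory = parse_memory_bytes(node_memory) // parse_memory_bytes(request_memory)
    return min(by_cpu, by_memory)


# Hypothetical node with 4 CPUs and 16Gi allocatable, using the requests above.
fit = pods_per_node("4", "16Gi", "250m", "512Mi")
```

With these example numbers CPU is the binding constraint (16 pods by CPU versus 32 by memory), so raising the CPU request would reduce pod density before memory ever became a factor.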
Service for Internal Traffic
A Service gives your agent pods a stable DNS name and load balances traffic:
# k8s/service.yaml
apiVersion: v1
kind: Service
metadata:
  name: agent-service
  namespace: ai-agents
spec:
  selector:
    app: agent-service
  ports:
    - port: 80
      targetPort: 8000
      protocol: TCP
  type: ClusterIP
Other services in the cluster reach the agent at http://agent-service.ai-agents.svc.cluster.local.
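A caller inside the cluster might build that DNS name and send a request along these lines. This is a minimal sketch using only the standard library; the /run path and the {"prompt": ...} payload are hypothetical, since the article does not define the agent's API, and the generous timeout reflects the long-running nature of agent calls.

```python
import json
import urllib.request


def service_url(service: str, namespace: str, port: int = 80,
                cluster_domain: str = "cluster.local") -> str:
    """Build the in-cluster DNS URL for a Service."""
    host = f"{service}.{namespace}.svc.{cluster_domain}"
    return f"http://{host}" if port == 80 else f"http://{host}:{port}"


def call_agent(prompt: str, timeout_seconds: float = 120.0) -> dict:
    """POST a prompt to the agent Service (endpoint and payload are assumed)."""
    url = service_url("agent-service", "ai-agents") + "/run"  # hypothetical path
    body = json.dumps({"prompt": prompt}).encode("utf-8")
    request = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout_seconds) as response:
        return json.load(response)
```

Callers in the same namespace can also use the short name agent-service; the fully qualified form works from any namespace.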
Secrets Management
Store your API keys in Kubernetes Secrets, not in Deployment manifests:
kubectl create secret generic agent-secrets \
  --namespace ai-agents \
  --from-literal=openai-api-key=sk-proj-your-key-here
Reference them in your Deployment with valueFrom.secretKeyRef as shown above.
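On the application side, the injected value arrives as an ordinary environment variable. One possible pattern, sketched here with hypothetical helper names, is to read it once at startup and fail fast if it is missing, so a misconfigured Secret shows up as an immediate, clearly logged crash instead of a failure on the first LLM call.

```python
import os
from typing import Mapping, Optional


class MissingSecretError(RuntimeError):
    """Raised when a required secret is absent from the environment."""


def load_required_env(name: str, environ: Optional[Mapping[str, str]] = None) -> str:
    """Return a required environment variable or raise a descriptive error."""
    source = os.environ if environ is None else environ
    value = source.get(name, "").strip()
    if not value:
        raise MissingSecretError(
            f"{name} is not set; check the agent-secrets Secret and its key name"
        )
    return value


def redact(secret: str, visible: int = 4) -> str:
    """Mask a secret for logging, keeping only a short prefix."""
    if len(secret) <= visible:
        return "*" * len(secret)
    return secret[:visible] + "*" * (len(secret) - visible)
```

At startup this would be called as load_required_env("OPENAI_API_KEY"), with redact used for any log line that needs to confirm which key was loaded.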
Horizontal Pod Autoscaler
Scale pods automatically based on CPU utilization or custom metrics:
# k8s/hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: agent-service-hpa
  namespace: ai-agents
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: agent-service
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 30
      policies:
        - type: Pods
          value: 2
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
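The scaleDown.stabilizationWindowSeconds: 300 setting prevents thrashing: agent traffic is bursty, and you do not want Kubernetes removing pods only to recreate them a minute later. It matches the Kubernetes default, so stating it explicitly mainly documents the intent. Keep in mind that LLM-backed agents spend most of their time waiting on I/O, so CPU utilization can stay low even when pods are saturated; if that happens, a custom metric such as in-flight requests per pod is a better scaling signal.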
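To build intuition for how this manifest behaves, here is a simplified model of a single autoscaler evaluation. It uses the documented core formula, desired = ceil(currentReplicas × currentMetric / targetMetric), with the default 10% tolerance, then applies the manifest's limit of 2 added pods per period and the min/max bounds. It deliberately ignores stabilization windows, pod readiness, and missing metrics, so it is a teaching sketch and not a reimplementation of the controller.

```python
import math


def desired_replicas(current_replicas: int,
                     current_utilization: float,
                     target_utilization: float = 60.0,
                     min_replicas: int = 2,
                     max_replicas: int = 10,
                     max_pods_added: int = 2,
                     tolerance: float = 0.1) -> int:
    """Model one HPA evaluation for a CPU utilization target."""
    ratio = current_utilization / target_utilization
    # Within the tolerance band the autoscaler leaves the replica count alone.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    # scaleUp policy from the manifest: at most 2 pods added per period.
    if desired > current_replicas:
        desired = min(desired, current_replicas + max_pods_added)
    return max(min_replicas, min(max_replicas, desired))


scale_up = desired_replicas(3, 90.0)      # ratio 1.5, ceil(4.5) = 5
capped = desired_replicas(3, 120.0)       # wants 6, policy caps the step at 5
unchanged = desired_replicas(3, 63.0)     # ratio 1.05 is inside the tolerance
scale_down = desired_replicas(4, 20.0)    # ceil(1.33) = 2, the minimum
```

The capped case shows why a sharp burst takes several periods to absorb: the policy trades reaction speed for protection against over-provisioning on a momentary spike.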
Health Check Endpoint Design
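Your /health endpoint should verify the critical dependencies the agent needs in order to serve traffic. Because the Deployment above points both the liveness and readiness probes at this path, a Redis outage would also fail liveness and restart pods that a restart cannot fix; if that is a concern, give the liveness probe a lighter path that skips dependency checks.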
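import os

import redis.asyncio as redis
from fastapi import FastAPI
from fastapi.responses import JSONResponse

app = FastAPI()
# REDIS_URL and its default are assumptions; point this at your own Redis instance.
redis_client = redis.from_url(os.environ.get("REDIS_URL", "redis://localhost:6379"))

@app.get("/health")
async def health():
    checks = {}
    try:
        # A failed ping is reported as "down" instead of surfacing as a 500 error.
        await redis_client.ping()
        checks["redis"] = "ok"
    except Exception:
        checks["redis"] = "down"
    # Any failing dependency turns the response into a 503 so the probe fails.
    overall = "ok" if all(v == "ok" for v in checks.values()) else "degraded"
    status_code = 200 if overall == "ok" else 503
    return JSONResponse(
        content={"status": overall, "checks": checks},
        status_code=status_code,
    )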
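One detail worth guarding against: the readiness probe in the Deployment does not set timeoutSeconds, so the Kubernetes default of 1 second applies, and a dependency that hangs would make the probe time out without ever reporting which check was slow. The sketch below, with a hypothetical run_checks helper and an assumed 0.5 second budget, runs all checks concurrently and bounds each one.

```python
import asyncio
from typing import Awaitable, Callable, Dict

Check = Callable[[], Awaitable[object]]


async def run_checks(checks: Dict[str, Check],
                     timeout_seconds: float = 0.5) -> Dict[str, str]:
    """Run dependency checks concurrently, bounding each with a timeout."""

    async def run_one(check: Check) -> str:
        try:
            await asyncio.wait_for(check(), timeout=timeout_seconds)
            return "ok"
        except asyncio.TimeoutError:
            return "timeout"
        except Exception:
            return "down"

    names = list(checks)
    results = await asyncio.gather(*(run_one(checks[name]) for name in names))
    return dict(zip(names, results))
```

Inside the handler this could be used as checks = await run_checks({"redis": redis_client.ping}), keeping the rest of the status logic unchanged.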
Applying the Manifests
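# Create the namespace first: the agent-secrets Secret must already exist in it
# before the pods can start
kubectl create namespace ai-agents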
kubectl apply -f k8s/deployment.yaml
kubectl apply -f k8s/service.yaml
kubectl apply -f k8s/hpa.yaml
# Watch the rollout
kubectl rollout status deployment/agent-service -n ai-agents
# Check pods
kubectl get pods -n ai-agents -l app=agent-service
FAQ
How should I set resource limits for AI agent pods?
Start by profiling your agent under realistic load. Most Python-based agents with FastAPI use 200-500 MB of RAM at baseline. Set memory requests at your p50 usage and limits at your p99. For CPU, LLM-backed agents are I/O-bound, so 250m-500m CPU request is typically sufficient. Monitor with kubectl top pods and adjust based on actual usage patterns.
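The p50/p99 rule can be turned into numbers with a few lines of code. This sketch assumes you have already collected per-pod memory samples in MiB under realistic load (for example by recording kubectl top pods output over time); the function name is hypothetical.

```python
import math
import statistics
from typing import Dict, Sequence


def suggest_memory(samples_mib: Sequence[float]) -> Dict[str, int]:
    """Suggest a memory request (p50) and limit (p99) from observed usage."""
    # quantiles(n=100) returns 99 cut points; index 49 is p50 and index 98 is p99.
    cuts = statistics.quantiles(samples_mib, n=100, method="inclusive")
    return {
        "request_mib": math.ceil(cuts[49]),
        "limit_mib": math.ceil(cuts[98]),
    }
```

A limit set exactly at p99 means roughly one sample in a hundred would exceed it, and exceeding a memory limit gets the container killed, so some teams add a margin on top once they have seen how spiky their largest context windows are.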
What happens to in-flight agent requests during a rolling update?
Kubernetes sends a SIGTERM to the old pod and waits for terminationGracePeriodSeconds (default 30 seconds) before forcefully killing it. Handle SIGTERM in your FastAPI app by completing in-flight requests and rejecting new ones. Set the grace period longer than your maximum expected agent response time to prevent dropped requests.
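The drain logic described above can be sketched independently of any web framework. The class below is a hypothetical helper, not part of FastAPI: it counts in-flight requests, rejects new ones once shutdown begins, and waits up to a grace period for the count to reach zero. How it is wired to SIGTERM depends on your ASGI server; on Unix, loop.add_signal_handler(signal.SIGTERM, tracker.begin_shutdown) is one option.

```python
import asyncio
import time
from contextlib import contextmanager


class ShuttingDown(RuntimeError):
    """Raised for requests that arrive after shutdown has started."""


class InFlightTracker:
    def __init__(self) -> None:
        self.in_flight = 0
        self.draining = False

    @contextmanager
    def track(self):
        """Wrap each request; the web layer would map ShuttingDown to a 503."""
        if self.draining:
            raise ShuttingDown("pod is terminating")
        self.in_flight += 1
        try:
            yield
        finally:
            self.in_flight -= 1

    def begin_shutdown(self) -> None:
        """Call from the SIGTERM handler to stop accepting new work."""
        self.draining = True

    async def drain(self, grace_seconds: float, poll_seconds: float = 0.05) -> bool:
        """Wait for in-flight requests; return True if all finished in time."""
        deadline = time.monotonic() + grace_seconds
        while self.in_flight > 0 and time.monotonic() < deadline:
            await asyncio.sleep(poll_seconds)
        return self.in_flight == 0
```

The grace_seconds passed to drain should stay below the pod's terminationGracePeriodSeconds, since Kubernetes kills the container when that period expires regardless of what the application is doing.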
Should I use one pod per agent type or multiplex agents in a single pod?
For most teams, a single service that handles all agent types is simpler to operate. Route to different agent behaviors via a request parameter. Only split into separate Deployments when agent types have significantly different resource profiles — for example, a coding agent that needs 4 GB of RAM versus a simple Q&A agent that needs 512 MB.
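Routing by request parameter can be as simple as a dictionary lookup. In this sketch the agent names and handler bodies are placeholders for whatever agent logic your service actually runs.

```python
from typing import Callable, Dict

AgentHandler = Callable[[str], str]


def qa_agent(message: str) -> str:
    # Placeholder: a real handler would call the LLM with a Q&A prompt.
    return f"[qa] {message}"


def coding_agent(message: str) -> str:
    # Placeholder: a real handler would run the coding agent's tool loop.
    return f"[coding] {message}"


AGENTS: Dict[str, AgentHandler] = {
    "qa": qa_agent,
    "coding": coding_agent,
}


def route(agent_type: str, message: str) -> str:
    """Dispatch a request to the handler registered for agent_type."""
    try:
        handler = AGENTS[agent_type]
    except KeyError:
        known = ", ".join(sorted(AGENTS))
        raise ValueError(f"unknown agent_type {agent_type!r}; expected one of: {known}") from None
    return handler(message)
```

If one agent type later needs its own Deployment, the same registry key can become the routing key at the ingress or gateway layer, so callers do not need to change.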
CallSphere Team