Kubernetes Fundamentals for AI Engineers: Pods, Deployments, and Services
Master the core Kubernetes concepts every AI engineer needs — Pods, Deployments, and Services, plus the ReplicaSets that Deployments manage — with practical YAML manifests for deploying AI agent workloads to production clusters.
Why Kubernetes Matters for AI Agents
AI agents are not simple request-response APIs. They maintain conversation state, call external tools, spawn sub-agents, and consume unpredictable amounts of GPU and memory. Running them on a single server works for demos, but production demands orchestration — automatic restarts, scaling, rolling updates, and service discovery. Kubernetes provides all of this as a declarative platform.
This guide covers the three foundational Kubernetes resources you need to deploy any AI agent: Pods, Deployments, and Services.
Pods: The Smallest Deployable Unit
A Pod is one or more containers that share networking and storage. For an AI agent, a Pod typically contains the agent process itself and optionally a sidecar for logging or metrics.
```yaml
# ai-agent-pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: ai-agent
  labels:
    app: ai-agent
    tier: inference
spec:
  containers:
    - name: agent
      image: myregistry/ai-agent:1.0.0
      ports:
        - containerPort: 8000
      resources:
        requests:
          memory: "512Mi"
          cpu: "250m"
        limits:
          memory: "2Gi"
          cpu: "1000m"
      env:
        - name: OPENAI_API_KEY
          valueFrom:
            secretKeyRef:
              name: ai-secrets
              key: openai-api-key
      readinessProbe:
        httpGet:
          path: /health
          port: 8000
        initialDelaySeconds: 10
        periodSeconds: 5
```
The resources block is critical for AI workloads. Without memory limits, a single agent loading a large model can starve other Pods on the node. The readinessProbe prevents Kubernetes from routing traffic to an agent that is still loading its model weights.
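The manifest above reads OPENAI_API_KEY from a Secret named ai-secrets, which must exist in the cluster before the Pod starts. One way to create it is a Secret manifest like this sketch (the key value is a placeholder):

```yaml
# ai-secrets.yaml (apply before the Pod; stringData values are encoded for you)
apiVersion: v1
kind: Secret
metadata:
  name: ai-secrets
type: Opaque
stringData:
  openai-api-key: "replace-with-your-key"
```

In practice you may prefer `kubectl create secret generic ai-secrets --from-literal=openai-api-key=...` so the key never lands in version control.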
Deployments: Declarative Replica Management
You should never create Pods directly in production. A Deployment manages a set of identical Pod replicas and handles rolling updates.
```yaml
# ai-agent-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ai-agent
  namespace: ai-agents
spec:
  replicas: 3
  selector:
    matchLabels:
      app: ai-agent
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  template:
    metadata:
      labels:
        app: ai-agent
        version: v1.0.0
    spec:
      containers:
        - name: agent
          image: myregistry/ai-agent:1.0.0
          ports:
            - containerPort: 8000
          resources:
            requests:
              memory: "512Mi"
              cpu: "250m"
            limits:
              memory: "2Gi"
              cpu: "1000m"
```
Setting maxUnavailable: 0 ensures zero downtime during updates. Kubernetes creates new Pods first, waits for them to pass readiness checks, then terminates old Pods one at a time. With replicas: 3 and maxSurge: 1, a rollout runs at most four Pods at once and never drops below three ready Pods.
Services: Stable Network Endpoints
Pods get ephemeral IP addresses that change on restart. A Service provides a stable DNS name and load balances across healthy Pod replicas.
```yaml
# ai-agent-service.yaml
apiVersion: v1
kind: Service
metadata:
  name: ai-agent-svc
  namespace: ai-agents
spec:
  type: ClusterIP
  selector:
    app: ai-agent
  ports:
    - port: 80
      targetPort: 8000
      protocol: TCP
```
Other services in the cluster reach your agent at ai-agent-svc.ai-agents.svc.cluster.local. For external access, use a LoadBalancer type or an Ingress resource with TLS termination.
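As a sketch of how another in-cluster service would address the agent, the snippet below builds a request against the Service DNS name on port 80, which the Service forwards to containerPort 8000. The helper names here are illustrative, not part of any Kubernetes client library:

```python
import json
import urllib.request


def service_url(service: str, namespace: str, port: int = 80) -> str:
    """Build the in-cluster DNS URL that a Kubernetes Service exposes."""
    return f"http://{service}.{namespace}.svc.cluster.local:{port}"


def build_invoke_request(base_url: str, payload: dict) -> urllib.request.Request:
    """Prepare a POST to the agent's /agent/invoke endpoint."""
    return urllib.request.Request(
        base_url + "/agent/invoke",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Inside the cluster, this request is load balanced across ready replicas:
req = build_invoke_request(service_url("ai-agent-svc", "ai-agents"), {"query": "hello"})
print(req.full_url)  # http://ai-agent-svc.ai-agents.svc.cluster.local:80/agent/invoke
```

Because the Service only routes to Pods that pass their readiness probe, callers never see a replica that is still loading model weights.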
Connecting From Python
Your Python application code does not change when moving to Kubernetes. The agent process simply listens on the configured port.
```python
from fastapi import FastAPI
import os

app = FastAPI()


@app.get("/health")
async def health():
    return {"status": "healthy"}


@app.post("/agent/invoke")
async def invoke_agent(request: dict):
    # Agent logic here — Kubernetes handles scaling
    api_key = os.environ["OPENAI_API_KEY"]
    return {"response": "Agent processed your request"}
```
FAQ
When should I use a StatefulSet instead of a Deployment for AI agents?
Use a StatefulSet when your agent requires stable network identifiers or persistent storage that must survive Pod rescheduling. For example, agents that maintain a local vector store or checkpoint their conversation history to disk benefit from StatefulSets. Stateless agents that fetch all context from external databases should use Deployments.
How many replicas should I run for an AI agent Deployment?
Start with at least two replicas for high availability. Monitor CPU and memory utilization under realistic load, then scale. A common pattern is to set replicas to three for production and combine this with a Horizontal Pod Autoscaler that adjusts between two and ten replicas based on request latency or queue depth.
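The two-to-ten pattern described above could look like the following sketch. It uses the stock CPU utilization metric; scaling on request latency or queue depth requires a custom or external metrics adapter, and the target value here is illustrative:

```yaml
# ai-agent-hpa.yaml (illustrative values)
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ai-agent
  namespace: ai-agents
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ai-agent
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```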
Can I run GPU workloads in Kubernetes Pods?
Yes. Install the NVIDIA device plugin for Kubernetes, then request GPUs in your resource spec with nvidia.com/gpu: 1 under limits. Kubernetes schedules the Pod only on nodes that have available GPUs. This is essential for agents that run local inference with models like Llama or Mistral.
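Applied to the earlier Deployment, the GPU request is a small addition to the container's resources block. The memory and CPU values below are illustrative, and the nvidia.com/gpu resource name comes from the NVIDIA device plugin:

```yaml
resources:
  requests:
    memory: "8Gi"
    cpu: "2000m"
  limits:
    memory: "16Gi"
    cpu: "4000m"
    nvidia.com/gpu: 1
```

GPUs are non-overcommittable extended resources, so specifying the limit is enough: Kubernetes treats it as the request as well.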
CallSphere Team
Expert insights on AI voice agents and customer communication automation.