Kubernetes Fundamentals for AI Engineers: Pods, Deployments, and Services
Master the core Kubernetes concepts every AI engineer needs — Pods, Deployments, and Services, plus the ReplicaSets that Deployments manage — with practical YAML manifests for deploying AI agent workloads to production clusters.
Why Kubernetes Matters for AI Agents
AI agents are not simple request-response APIs. They maintain conversation state, call external tools, spawn sub-agents, and consume unpredictable amounts of GPU and memory. Running them on a single server works for demos, but production demands orchestration — automatic restarts, scaling, rolling updates, and service discovery. Kubernetes provides all of this as a declarative platform.
This guide covers the three foundational Kubernetes resources you need to deploy any AI agent: Pods, Deployments, and Services.
Pods: The Smallest Deployable Unit
A Pod is one or more containers that share networking and storage. For an AI agent, a Pod typically contains the agent process itself and optionally a sidecar for logging or metrics.
```yaml
# ai-agent-pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: ai-agent
  labels:
    app: ai-agent
    tier: inference
spec:
  containers:
    - name: agent
      image: myregistry/ai-agent:1.0.0
      ports:
        - containerPort: 8000
      resources:
        requests:
          memory: "512Mi"
          cpu: "250m"
        limits:
          memory: "2Gi"
          cpu: "1000m"
      env:
        - name: OPENAI_API_KEY
          valueFrom:
            secretKeyRef:
              name: ai-secrets
              key: openai-api-key
      readinessProbe:
        httpGet:
          path: /health
          port: 8000
        initialDelaySeconds: 10
        periodSeconds: 5
```
The resources block is critical for AI workloads. Without memory limits, a single agent loading a large model can starve other Pods on the node. The readinessProbe prevents Kubernetes from routing traffic to an agent that is still loading its model weights.
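The manifest above reads OPENAI_API_KEY from a Secret named ai-secrets, which must exist in the cluster before the Pod starts. One way to create it is a Secret manifest like this sketch (the key value is a placeholder):

```yaml
# ai-secrets.yaml (apply before the Pod; stringData values are encoded for you)
apiVersion: v1
kind: Secret
metadata:
  name: ai-secrets
type: Opaque
stringData:
  openai-api-key: "replace-with-your-key"
```

In practice you may prefer `kubectl create secret generic ai-secrets --from-literal=openai-api-key=...` so the key never lands in version control.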
Deployments: Declarative Replica Management
You should never create Pods directly in production. A Deployment manages a set of identical Pod replicas and handles rolling updates.
```yaml
# ai-agent-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ai-agent
  namespace: ai-agents
spec:
  replicas: 3
  selector:
    matchLabels:
      app: ai-agent
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  template:
    metadata:
      labels:
        app: ai-agent
        version: v1.0.0
    spec:
      containers:
        - name: agent
          image: myregistry/ai-agent:1.0.0
          ports:
            - containerPort: 8000
          resources:
            requests:
              memory: "512Mi"
              cpu: "250m"
            limits:
              memory: "2Gi"
              cpu: "1000m"
```
Setting maxUnavailable: 0 ensures zero downtime during updates. Kubernetes creates new Pods first, waits for them to pass readiness checks, then terminates old Pods one at a time. With replicas: 3 and maxSurge: 1, a rollout runs at most four Pods at once and never drops below three ready Pods.
Services: Stable Network Endpoints
Pods get ephemeral IP addresses that change on restart. A Service provides a stable DNS name and load balances across healthy Pod replicas.
```yaml
# ai-agent-service.yaml
apiVersion: v1
kind: Service
metadata:
  name: ai-agent-svc
  namespace: ai-agents
spec:
  type: ClusterIP
  selector:
    app: ai-agent
  ports:
    - port: 80
      targetPort: 8000
      protocol: TCP
```
Other services in the cluster reach your agent at ai-agent-svc.ai-agents.svc.cluster.local. For external access, use a LoadBalancer type or an Ingress resource with TLS termination.
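As a sketch of how another in-cluster service would address the agent, the snippet below builds a request against the Service DNS name on port 80, which the Service forwards to containerPort 8000. The helper names here are illustrative, not part of any Kubernetes client library:

```python
import json
import urllib.request


def service_url(service: str, namespace: str, port: int = 80) -> str:
    """Build the in-cluster DNS URL that a Kubernetes Service exposes."""
    return f"http://{service}.{namespace}.svc.cluster.local:{port}"


def build_invoke_request(base_url: str, payload: dict) -> urllib.request.Request:
    """Prepare a POST to the agent's /agent/invoke endpoint."""
    return urllib.request.Request(
        base_url + "/agent/invoke",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Inside the cluster, this request is load balanced across ready replicas:
req = build_invoke_request(service_url("ai-agent-svc", "ai-agents"), {"query": "hello"})
print(req.full_url)  # http://ai-agent-svc.ai-agents.svc.cluster.local:80/agent/invoke
```

Because the Service only routes to Pods that pass their readiness probe, callers never see a replica that is still loading model weights.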
Connecting From Python
Your Python application code does not change when moving to Kubernetes. The agent process simply listens on the configured port.
```python
from fastapi import FastAPI
import os

app = FastAPI()


@app.get("/health")
async def health():
    return {"status": "healthy"}


@app.post("/agent/invoke")
async def invoke_agent(request: dict):
    # Agent logic here — Kubernetes handles scaling
    api_key = os.environ["OPENAI_API_KEY"]
    return {"response": "Agent processed your request"}
```
FAQ
When should I use a StatefulSet instead of a Deployment for AI agents?
Use a StatefulSet when your agent requires stable network identifiers or persistent storage that must survive Pod rescheduling. For example, agents that maintain a local vector store or checkpoint their conversation history to disk benefit from StatefulSets. Stateless agents that fetch all context from external databases should use Deployments.
How many replicas should I run for an AI agent Deployment?
Start with at least two replicas for high availability. Monitor CPU and memory utilization under realistic load, then scale. A common pattern is to set replicas to three for production and combine this with a Horizontal Pod Autoscaler that adjusts between two and ten replicas based on request latency or queue depth.
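The two-to-ten pattern described above could look like the following sketch. It uses the stock CPU utilization metric; scaling on request latency or queue depth requires a custom or external metrics adapter, and the target value here is illustrative:

```yaml
# ai-agent-hpa.yaml (illustrative values)
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ai-agent
  namespace: ai-agents
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ai-agent
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```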
Can I run GPU workloads in Kubernetes Pods?
Yes. Install the NVIDIA device plugin for Kubernetes, then request GPUs in your resource spec with nvidia.com/gpu: 1 under limits. Kubernetes schedules the Pod only on nodes that have available GPUs. This is essential for agents that run local inference with models like Llama or Mistral.
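Applied to the earlier Deployment, the GPU request is a small addition to the container's resources block. The memory and CPU values below are illustrative, and the nvidia.com/gpu resource name comes from the NVIDIA device plugin:

```yaml
resources:
  requests:
    memory: "8Gi"
    cpu: "2000m"
  limits:
    memory: "16Gi"
    cpu: "4000m"
    nvidia.com/gpu: 1
```

GPUs are non-overcommittable extended resources, so specifying the limit is enough: Kubernetes treats it as the request as well.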
CallSphere Team
Expert insights on AI voice agents and customer communication automation.