Agentic AI Microservices Architecture: Kubernetes Deployment Patterns
Learn proven Kubernetes deployment patterns for agentic AI microservices including pod design, service mesh, HPA scaling, and health checks for LLM agents.
Why Kubernetes Is the Default Platform for Multi-Agent Systems
Deploying a single LLM-powered service is straightforward. Deploying a multi-agent system where a triage agent, specialist agents, tool-execution workers, and memory services all need to communicate, scale independently, and recover from failures — that is an infrastructure problem that demands Kubernetes.
At CallSphere, we deploy multi-agent systems across 6 verticals, and every production deployment runs on Kubernetes. The orchestration primitives that K8s provides — pods, services, deployments, horizontal pod autoscalers, and network policies — map naturally onto the components of an agentic AI architecture.
This guide covers the deployment patterns we have validated in production, including pod design strategies, service mesh configuration for agent-to-agent communication, autoscaling for LLM workloads, resource management, and health checking for AI agents.
Pod Design Patterns for Agentic AI
The Sidecar Pattern: Shared Context Injection
The sidecar pattern attaches a helper container alongside your main agent container in the same pod. Both containers share the same network namespace and can communicate over localhost.
A common use case is injecting conversation context or RAG retrieval results into the agent container without coupling the retrieval logic to the agent code.
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: specialist-agent
  namespace: agentic-ai
spec:
  replicas: 3
  selector:
    matchLabels:
      app: specialist-agent
  template:
    metadata:
      labels:
        app: specialist-agent
    spec:
      containers:
      - name: agent
        image: registry.example.com/specialist-agent:v2.4.1
        ports:
        - containerPort: 8080
        env:
        - name: CONTEXT_SERVICE_URL
          value: "http://localhost:9090"
        resources:
          requests:
            cpu: "500m"
            memory: "1Gi"
          limits:
            cpu: "2"
            memory: "4Gi"
      - name: context-sidecar
        image: registry.example.com/rag-retriever:v1.2.0
        ports:
        - containerPort: 9090
        resources:
          requests:
            cpu: "250m"
            memory: "512Mi"
          limits:
            cpu: "1"
            memory: "2Gi"
```
The agent container calls the sidecar on localhost:9090 to fetch relevant documents before constructing its LLM prompt. This keeps the agent image lean and the retrieval logic independently deployable.
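As a minimal sketch of the agent side, prompt construction can take the retrieval call as an injectable function. The endpoint path and response shape are assumptions for illustration, not part of the manifest above:

```python
def build_prompt(query: str, fetch_context) -> str:
    """Assemble an LLM prompt from sidecar-retrieved context.

    fetch_context is any callable returning a list of document strings;
    in the pod it would wrap an HTTP call to the sidecar on
    http://localhost:9090 (hypothetical /retrieve endpoint).
    """
    docs = fetch_context(query)
    context = "\n\n".join(docs)
    return f"Context:\n{context}\n\nUser question: {query}"
```

Injecting the fetcher keeps the agent code testable without a running sidecar, which is the point of the pattern: the retrieval implementation can change without touching this function.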
The Ambassador Pattern: External API Abstraction
When your agents call multiple LLM providers — OpenAI, Anthropic, a self-hosted model — the ambassador pattern places a proxy container in the pod that handles provider routing, retry logic, and API key rotation.
```yaml
containers:
- name: agent
  image: registry.example.com/triage-agent:v3.0.0
  env:
  - name: LLM_ENDPOINT
    value: "http://localhost:7070/v1/chat/completions"
- name: llm-ambassador
  image: registry.example.com/llm-router:v1.5.0
  ports:
  - containerPort: 7070
  env:
  - name: PRIMARY_PROVIDER
    value: "anthropic"
  - name: FALLBACK_PROVIDER
    value: "openai"
```
The agent sees a single endpoint. The ambassador handles failover, load distribution across providers, and response normalization.
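The heart of that failover behavior can be sketched as an ordered provider list. The function names and error handling below are illustrative assumptions, not the llm-router image's actual API:

```python
def complete_with_failover(prompt: str, providers):
    """Try each (name, call_fn) provider in order; return the first success.

    providers mirrors PRIMARY_PROVIDER/FALLBACK_PROVIDER, e.g.
    [("anthropic", call_anthropic), ("openai", call_openai)].
    """
    errors = {}
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # provider outage, rate limit, timeout
            errors[name] = str(exc)
    raise RuntimeError(f"all providers failed: {errors}")
```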
The Init Container Pattern: Agent Configuration Loading
Init containers run before your main containers start. Use them to load agent system prompts, tool definitions, or guardrail configurations from a config store.
```yaml
initContainers:
- name: load-agent-config
  image: registry.example.com/config-loader:v1.0.0
  command: ["sh", "-c", "wget -O /config/system-prompt.txt $PROMPT_URL && wget -O /config/tools.json $TOOLS_URL"]
  volumeMounts:
  - name: agent-config
    mountPath: /config
containers:
- name: agent
  image: registry.example.com/specialist-agent:v2.4.1
  volumeMounts:
  - name: agent-config
    mountPath: /config
    readOnly: true
volumes:
- name: agent-config
  emptyDir: {}
```
Service Mesh for Agent-to-Agent Communication
In a multi-agent architecture, agents hand off conversations to each other, request tool executions, and share state. A service mesh like Istio or Linkerd adds observability, mutual TLS, traffic management, and retry policies to these inter-agent calls without modifying application code.
Key Service Mesh Benefits for Agent Systems
- Mutual TLS (mTLS): Encrypt all agent-to-agent traffic automatically. Critical when agents exchange PII or sensitive business context.
- Retries with budgets: LLM calls are inherently unreliable. Configure retry policies with exponential backoff at the mesh level.
- Traffic splitting: Route a percentage of conversations to a new agent version for canary testing.
- Circuit breaking: If a specialist agent is overloaded, the mesh can short-circuit requests rather than letting the queue build up.
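As one example of mesh-level retries, an Istio VirtualService can attach a retry policy to an agent route without any change to agent code. The host, attempt count, and timeout below are illustrative values, not tested recommendations:

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: tool-executor-retries
  namespace: agentic-ai
spec:
  hosts:
  - tool-executor
  http:
  - route:
    - destination:
        host: tool-executor
    retries:
      attempts: 3
      perTryTimeout: 10s
      retryOn: 5xx,reset,connect-failure
```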
Istio VirtualService for Agent Canary Deployment
```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: billing-agent-routing
  namespace: agentic-ai
spec:
  hosts:
  - billing-agent.agentic-ai.svc.cluster.local
  http:
  - route:
    - destination:
        host: billing-agent
        subset: stable
      weight: 90
    - destination:
        host: billing-agent
        subset: canary
      weight: 10
```
This sends 10% of traffic to the canary version of the billing agent, letting you validate prompt changes or model upgrades before full rollout.
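The stable and canary subsets referenced by the VirtualService must be defined in a companion DestinationRule. A minimal sketch, assuming the two agent versions are distinguished by a version pod label:

```yaml
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: billing-agent-subsets
  namespace: agentic-ai
spec:
  host: billing-agent
  subsets:
  - name: stable
    labels:
      version: v1
  - name: canary
    labels:
      version: v2
```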
Horizontal Pod Autoscaling for LLM Workloads
Standard CPU-based HPA does not work well for LLM agent workloads. The bottleneck is rarely CPU — it is waiting for LLM API responses and managing concurrent conversations. You need custom metrics.
Custom Metrics HPA Configuration
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: triage-agent-hpa
  namespace: agentic-ai
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: triage-agent
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Pods
    pods:
      metric:
        name: active_conversations
      target:
        type: AverageValue
        averageValue: "15"
  - type: Pods
    pods:
      metric:
        name: llm_request_queue_depth
      target:
        type: AverageValue
        averageValue: "5"
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 30
      policies:
      - type: Pods
        value: 4
        periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Pods
        value: 1
        periodSeconds: 120
```
Key design decisions in this configuration:
- Scale on active conversations, not CPU. Each conversation holds state, so this metric directly reflects load.
- Fast scale-up (30s stabilization, add 4 pods per minute) because LLM workloads can spike quickly when a marketing campaign drives traffic.
- Slow scale-down (5 minute stabilization, remove 1 pod at a time) to avoid killing pods mid-conversation.
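For the HPA to see active_conversations at all, something must serve it through the custom metrics API. If you scrape the metric with Prometheus and expose it via prometheus-adapter, a discovery rule along these lines would map the gauge onto pods (the 2-minute averaging window is an assumption, not a recommendation from this deployment):

```yaml
rules:
- seriesQuery: 'active_conversations{namespace!="",pod!=""}'
  resources:
    overrides:
      namespace: {resource: "namespace"}
      pod: {resource: "pod"}
  metricsQuery: 'avg_over_time(active_conversations{<<.LabelMatchers>>}[2m])'
```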
Resource Quotas and Limit Ranges
LLM agent workloads have unpredictable memory profiles. A single complex multi-turn conversation can consume significantly more memory than a simple query. Set resource quotas at the namespace level and limit ranges per pod.
```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: agentic-ai-quota
  namespace: agentic-ai
spec:
  hard:
    requests.cpu: "40"
    requests.memory: "80Gi"
    limits.cpu: "80"
    limits.memory: "160Gi"
    pods: "100"
---
apiVersion: v1
kind: LimitRange
metadata:
  name: agent-limits
  namespace: agentic-ai
spec:
  limits:
  - type: Container
    default:
      cpu: "1"
      memory: "2Gi"
    defaultRequest:
      cpu: "250m"
      memory: "512Mi"
    max:
      cpu: "4"
      memory: "8Gi"
```
Health Checks for AI Agents
Standard HTTP liveness probes are insufficient for AI agents. An agent can return 200 on a health endpoint while its LLM connection is broken, its tool registry is stale, or its conversation state store is unreachable.
Deep Health Check Implementation
Design your agent health endpoint to verify all critical dependencies:
```yaml
livenessProbe:
  httpGet:
    path: /health/live
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 15
  failureThreshold: 3
readinessProbe:
  httpGet:
    path: /health/ready
    port: 8080
  initialDelaySeconds: 20
  periodSeconds: 10
  failureThreshold: 2
startupProbe:
  httpGet:
    path: /health/startup
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5
  failureThreshold: 30
```
The startup probe is critical for agent containers that need to load large system prompts, initialize tool registries, or warm up embedding caches. A failureThreshold of 30 with a 5-second period gives the agent up to 2.5 minutes to start before Kubernetes kills it.
Your /health/ready endpoint should check:
- LLM provider connectivity (lightweight completion test)
- State store reachability (Redis or PostgreSQL ping)
- Tool registry loaded (expected tool count matches)
- Memory service accessible (vector DB connection)
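A minimal sketch of the aggregation behind /health/ready, assuming each dependency check is a small zero-argument callable returning truthy on success (the check names are illustrative):

```python
def readiness(checks):
    """Run named dependency checks; ready only if every one passes.

    checks: dict mapping a name (e.g. "llm", "redis", "tools") to a
    zero-argument callable. Exceptions count as failures so a broken
    client library cannot crash the probe handler itself.
    """
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False
    return all(results.values()), results
```

The HTTP handler would return 200 when the first element is True and 503 otherwise, with the per-check results in the body for debugging. Give each check its own short timeout so one slow dependency cannot push the whole probe past the readiness probe's period.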
Network Policies for Agent Isolation
Not every agent should talk to every other agent. Use Kubernetes NetworkPolicies to enforce the communication topology of your multi-agent system.
```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: specialist-agent-policy
  namespace: agentic-ai
spec:
  podSelector:
    matchLabels:
      role: specialist-agent
  policyTypes:
  - Ingress
  - Egress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          role: triage-agent
    ports:
    - port: 8080
  egress:
  - to:
    - podSelector:
        matchLabels:
          role: tool-executor
    ports:
    - port: 8080
  - to:
    - namespaceSelector: {}
      podSelector:
        matchLabels:
          app: redis
    ports:
    - port: 6379
```
This policy ensures specialist agents can only receive traffic from the triage agent and can only call tool executors and Redis. No direct internet access, no cross-agent chatter outside the defined topology. One caveat: because Egress is in policyTypes, all other outbound traffic is denied, including DNS, so in practice you also need an egress rule allowing port 53 (UDP and TCP) to kube-dns.
Production Deployment Checklist
Before deploying a multi-agent system to production on Kubernetes, verify these items:
- Pod Disruption Budgets configured so rolling updates never take all agent replicas offline simultaneously
- Anti-affinity rules spread agent pods across nodes to survive node failures
- Secrets management via Kubernetes Secrets or an external vault for LLM API keys
- Persistent volume claims for any agent that maintains local state or caches
- RBAC policies limiting which service accounts can modify agent deployments
- Resource requests and limits set on every container to prevent noisy-neighbor problems
- Graceful shutdown handlers that drain active conversations before pod termination
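The first checklist item, for example, is a small manifest per agent deployment. A minimal sketch of a PodDisruptionBudget for the triage agent (the 50% floor is an illustrative choice):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: triage-agent-pdb
  namespace: agentic-ai
spec:
  minAvailable: 50%
  selector:
    matchLabels:
      app: triage-agent
```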
Frequently Asked Questions
How many agent replicas should I run per deployment?
Start with a minimum of 2 replicas for high availability and let the HPA scale from there. For latency-sensitive triage agents that handle initial user contact, consider a minimum of 3. Monitor the active_conversations metric for two weeks to establish a baseline before tuning.
Should I use one pod per agent type or combine agents in a single pod?
Use one pod per agent type. Combining agents in a single pod creates scaling coupling — if your billing agent needs more capacity but your scheduling agent does not, you waste resources. The only exception is tightly coupled agent-sidecar pairs like the context injection pattern described above.
Is a service mesh overkill for a small multi-agent system?
If you have fewer than 5 agent services, a service mesh adds operational complexity that may not be justified. Start with standard Kubernetes Services and add a mesh when you need canary deployments, mTLS, or advanced traffic management. Linkerd is lighter weight than Istio if you want to start small.
How do I handle long-running agent conversations during rolling updates?
Configure a terminationGracePeriodSeconds of at least 120 seconds on agent pods. Implement a SIGTERM handler in your agent code that stops accepting new conversations, waits for active ones to complete or checkpoint, then exits. Combine this with a PodDisruptionBudget to ensure at least 50% of replicas remain available during updates.
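A sketch of such a SIGTERM handler, assuming the agent tracks its active conversation count in-process (the class and method names are illustrative):

```python
import signal
import threading

class GracefulShutdown:
    """Stop accepting new conversations on SIGTERM, then signal
    drained once the last active conversation finishes or checkpoints."""

    def __init__(self):
        self.accepting = True
        self.active = 0
        self.drained = threading.Event()
        signal.signal(signal.SIGTERM, self._on_sigterm)

    def _on_sigterm(self, signum, frame):
        self.accepting = False  # readiness probe should now fail too
        if self.active == 0:
            self.drained.set()

    def conversation_started(self):
        self.active += 1

    def conversation_finished(self):
        self.active -= 1
        if not self.accepting and self.active == 0:
            self.drained.set()
```

On shutdown the main loop waits on drained.wait(timeout) before exiting, with the timeout kept safely under terminationGracePeriodSeconds so the pod exits cleanly rather than being SIGKILLed.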
What monitoring should I have before going to production?
At minimum: request latency per agent (p50, p95, p99), active conversation count, LLM API error rate, token consumption per request, and pod restart count. Set alerts on error rate exceeding 5% and p99 latency exceeding your SLA threshold. Grafana dashboards with these metrics give your on-call team the visibility they need.
CallSphere Team
Expert insights on AI voice agents and customer communication automation.