
Agentic AI Microservices Architecture: Kubernetes Deployment Patterns

Learn proven Kubernetes deployment patterns for agentic AI microservices including pod design, service mesh, HPA scaling, and health checks for LLM agents.

Why Kubernetes Is the Default Platform for Multi-Agent Systems

Deploying a single LLM-powered service is straightforward. Deploying a multi-agent system where a triage agent, specialist agents, tool-execution workers, and memory services all need to communicate, scale independently, and recover from failures — that is an infrastructure problem that demands Kubernetes.

At CallSphere, we deploy multi-agent systems across 6 verticals, and every production deployment runs on Kubernetes. The orchestration primitives that K8s provides — pods, services, deployments, horizontal pod autoscalers, and network policies — map naturally onto the components of an agentic AI architecture.

This guide covers the deployment patterns we have validated in production, including pod design strategies, service mesh configuration for agent-to-agent communication, autoscaling for LLM workloads, resource management, and health checking for AI agents.

Pod Design Patterns for Agentic AI

The Sidecar Pattern: Shared Context Injection

The sidecar pattern attaches a helper container alongside your main agent container in the same pod. Both containers share the same network namespace and can communicate over localhost.

A common use case is injecting conversation context or RAG retrieval results into the agent container without coupling the retrieval logic to the agent code.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: specialist-agent
  namespace: agentic-ai
spec:
  replicas: 3
  selector:
    matchLabels:
      app: specialist-agent
  template:
    metadata:
      labels:
        app: specialist-agent
    spec:
      containers:
        - name: agent
          image: registry.example.com/specialist-agent:v2.4.1
          ports:
            - containerPort: 8080
          env:
            - name: CONTEXT_SERVICE_URL
              value: "http://localhost:9090"
          resources:
            requests:
              cpu: "500m"
              memory: "1Gi"
            limits:
              cpu: "2"
              memory: "4Gi"
        - name: context-sidecar
          image: registry.example.com/rag-retriever:v1.2.0
          ports:
            - containerPort: 9090
          resources:
            requests:
              cpu: "250m"
              memory: "512Mi"
            limits:
              cpu: "1"
              memory: "2Gi"

The agent container calls the sidecar on localhost:9090 to fetch relevant documents before constructing its LLM prompt. This keeps the agent image lean and the retrieval logic independently deployable.
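As a sketch of the agent side of this pattern, the retrieval call and prompt assembly might look like the following Python. The `/retrieve` endpoint and its response shape are assumptions for illustration, not part of any specific sidecar image:

```python
import json
import urllib.request

# The sidecar shares the pod's network namespace, so localhost works.
SIDECAR_URL = "http://localhost:9090"

def fetch_context(query: str, top_k: int = 4) -> list[str]:
    """Ask the RAG sidecar for relevant documents (endpoint name is illustrative)."""
    req = urllib.request.Request(
        f"{SIDECAR_URL}/retrieve",
        data=json.dumps({"query": query, "top_k": top_k}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=5) as resp:
        return json.load(resp)["documents"]

def build_prompt(question: str, documents: list[str]) -> str:
    """Prepend retrieved context to the user question before the LLM call."""
    context = "\n\n".join(f"[doc {i + 1}] {d}" for i, d in enumerate(documents))
    return f"Context:\n{context}\n\nQuestion: {question}"
```

Because retrieval lives behind a local HTTP boundary, swapping the retriever image never requires rebuilding the agent.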

The Ambassador Pattern: External API Abstraction

When your agents call multiple LLM providers — OpenAI, Anthropic, a self-hosted model — the ambassador pattern places a proxy container in the pod that handles provider routing, retry logic, and API key rotation.

      containers:
        - name: agent
          image: registry.example.com/triage-agent:v3.0.0
          env:
            - name: LLM_ENDPOINT
              value: "http://localhost:7070/v1/chat/completions"
        - name: llm-ambassador
          image: registry.example.com/llm-router:v1.5.0
          ports:
            - containerPort: 7070
          env:
            - name: PRIMARY_PROVIDER
              value: "anthropic"
            - name: FALLBACK_PROVIDER
              value: "openai"

The agent sees a single endpoint. The ambassador handles failover, load distribution across providers, and response normalization.
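The ambassador's core failover loop can be sketched in a few lines of Python. The provider callables and retry count here are illustrative, not the actual llm-router implementation:

```python
from collections.abc import Callable

def complete_with_failover(
    prompt: str,
    primary: Callable[[str], str],
    fallback: Callable[[str], str],
    max_retries: int = 2,
) -> str:
    """Try the primary provider with retries, then fall back to the secondary."""
    last_error: Exception | None = None
    for provider in (primary, fallback):
        for _ in range(max_retries):
            try:
                return provider(prompt)
            except Exception as exc:  # in production, catch provider-specific errors
                last_error = exc
    raise RuntimeError("all LLM providers failed") from last_error
```

A real ambassador would add per-provider timeouts, backoff between attempts, and response normalization so the agent sees one schema regardless of provider.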

The Init Container Pattern: Agent Configuration Loading

Init containers run before your main containers start. Use them to load agent system prompts, tool definitions, or guardrail configurations from a config store.

      initContainers:
        - name: load-agent-config
          image: registry.example.com/config-loader:v1.0.0
          command: ["sh", "-c", "wget -O /config/system-prompt.txt $PROMPT_URL && wget -O /config/tools.json $TOOLS_URL"]
          env:
            # Illustrative values; point these at your own config store.
            - name: PROMPT_URL
              value: "http://config-store.agentic-ai.svc.cluster.local/prompts/specialist"
            - name: TOOLS_URL
              value: "http://config-store.agentic-ai.svc.cluster.local/tools/specialist"
          volumeMounts:
            - name: agent-config
              mountPath: /config
      containers:
        - name: agent
          image: registry.example.com/specialist-agent:v2.4.1
          volumeMounts:
            - name: agent-config
              mountPath: /config
              readOnly: true
      volumes:
        - name: agent-config
          emptyDir: {}

Service Mesh for Agent-to-Agent Communication

In a multi-agent architecture, agents hand off conversations to each other, request tool executions, and share state. A service mesh like Istio or Linkerd adds observability, mutual TLS, traffic management, and retry policies to these inter-agent calls without modifying application code.

Key Service Mesh Benefits for Agent Systems

  • Mutual TLS (mTLS): Encrypt all agent-to-agent traffic automatically. Critical when agents exchange PII or sensitive business context.
  • Retries with budgets: LLM calls are inherently unreliable. Configure retry policies with exponential backoff at the mesh level.
  • Traffic splitting: Route a percentage of conversations to a new agent version for canary testing.
  • Circuit breaking: If a specialist agent is overloaded, the mesh can short-circuit requests rather than letting the queue build up.
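As one example of mesh-level retries, an Istio VirtualService can attach a retry policy to every call to a tool-executor service. The names and values below are illustrative:

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: tool-executor-retries
  namespace: agentic-ai
spec:
  hosts:
    - tool-executor.agentic-ai.svc.cluster.local
  http:
    - route:
        - destination:
            host: tool-executor
      retries:
        attempts: 3
        perTryTimeout: 10s
        retryOn: 5xx,reset,connect-failure
```

The agent code never sees these retries; the sidecar proxy applies them transparently on every call.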

Istio VirtualService for Agent Canary Deployment

apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: billing-agent-routing
  namespace: agentic-ai
spec:
  hosts:
    - billing-agent.agentic-ai.svc.cluster.local
  http:
    - route:
        - destination:
            host: billing-agent
            subset: stable
          weight: 90
        - destination:
            host: billing-agent
            subset: canary
          weight: 10

This sends 10% of traffic to the canary version of the billing agent, letting you validate prompt changes or model upgrades before full rollout.
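Note that the stable and canary subsets referenced above must be defined by a DestinationRule. This sketch assumes the two Deployments label their pods with a version label:

```yaml
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: billing-agent-subsets
  namespace: agentic-ai
spec:
  host: billing-agent
  subsets:
    - name: stable
      labels:
        version: stable
    - name: canary
      labels:
        version: canary
```

Shifting traffic from 10% to 100% then only requires editing the VirtualService weights, not redeploying either version.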


Horizontal Pod Autoscaling for LLM Workloads

Standard CPU-based HPA does not work well for LLM agent workloads. The bottleneck is rarely CPU — it is waiting for LLM API responses and managing concurrent conversations. You need custom metrics, which in practice means running a metrics pipeline such as Prometheus plus an adapter (for example, prometheus-adapter) that serves them through the custom.metrics.k8s.io API.

Custom Metrics HPA Configuration

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: triage-agent-hpa
  namespace: agentic-ai
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: triage-agent
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Pods
      pods:
        metric:
          name: active_conversations
        target:
          type: AverageValue
          averageValue: "15"
    - type: Pods
      pods:
        metric:
          name: llm_request_queue_depth
        target:
          type: AverageValue
          averageValue: "5"
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 30
      policies:
        - type: Pods
          value: 4
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Pods
          value: 1
          periodSeconds: 120

Key design decisions in this configuration:

  • Scale on active conversations, not CPU. Each conversation holds state, so this metric directly reflects load.
  • Fast scale-up (30s stabilization, add 4 pods per minute) because LLM workloads can spike quickly when a marketing campaign drives traffic.
  • Slow scale-down (5 minute stabilization, remove 1 pod at a time) to avoid killing pods mid-conversation.
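For the HPA to see active_conversations at all, each agent pod has to expose it in Prometheus text exposition format on a scrape endpoint. The stdlib-only Python sketch below shows the shape of that endpoint; the port and metric wiring are illustrative, and in practice you would likely use a Prometheus client library instead:

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# Incremented/decremented by the conversation manager (not shown).
active_conversations = 0
_lock = threading.Lock()

def render_metrics() -> str:
    """Prometheus text exposition format for the gauge the HPA scales on."""
    with _lock:
        value = active_conversations
    return (
        "# HELP active_conversations Conversations currently held by this pod\n"
        "# TYPE active_conversations gauge\n"
        f"active_conversations {value}\n"
    )

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/metrics":
            body = render_metrics().encode()
            self.send_response(200)
            self.send_header("Content-Type", "text/plain; version=0.0.4")
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()

# HTTPServer(("0.0.0.0", 9102), MetricsHandler).serve_forever()  # scrape port is illustrative
```

Prometheus scrapes this endpoint, and the metrics adapter translates the gauge into the per-pod metric the HPA consumes.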

Resource Quotas and Limit Ranges

LLM agent workloads have unpredictable memory profiles. A single complex multi-turn conversation can consume significantly more memory than a simple query. Set resource quotas at the namespace level and limit ranges per pod.

apiVersion: v1
kind: ResourceQuota
metadata:
  name: agentic-ai-quota
  namespace: agentic-ai
spec:
  hard:
    requests.cpu: "40"
    requests.memory: "80Gi"
    limits.cpu: "80"
    limits.memory: "160Gi"
    pods: "100"
---
apiVersion: v1
kind: LimitRange
metadata:
  name: agent-limits
  namespace: agentic-ai
spec:
  limits:
    - type: Container
      default:
        cpu: "1"
        memory: "2Gi"
      defaultRequest:
        cpu: "250m"
        memory: "512Mi"
      max:
        cpu: "4"
        memory: "8Gi"

Health Checks for AI Agents

Standard HTTP liveness probes are insufficient for AI agents. An agent can return 200 on a health endpoint while its LLM connection is broken, its tool registry is stale, or its conversation state store is unreachable.

Deep Health Check Implementation

Design your agent health endpoint to verify all critical dependencies:

        livenessProbe:
          httpGet:
            path: /health/live
            port: 8080
          initialDelaySeconds: 10
          periodSeconds: 15
          failureThreshold: 3
        readinessProbe:
          httpGet:
            path: /health/ready
            port: 8080
          initialDelaySeconds: 20
          periodSeconds: 10
          failureThreshold: 2
        startupProbe:
          httpGet:
            path: /health/startup
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 5
          failureThreshold: 30

The startup probe is critical for agent containers that need to load large system prompts, initialize tool registries, or warm up embedding caches. A failureThreshold of 30 with a 5-second period gives the agent up to 2.5 minutes to start before Kubernetes kills it.

Your /health/ready endpoint should check:

  1. LLM provider connectivity (lightweight completion test)
  2. State store reachability (Redis or PostgreSQL ping)
  3. Tool registry loaded (expected tool count matches)
  4. Memory service accessible (vector DB connection)
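A readiness handler that aggregates these checks can be sketched as follows; the check callables are placeholders for your real dependency probes:

```python
from collections.abc import Callable

def readiness(checks: dict[str, Callable[[], bool]]) -> tuple[int, dict[str, str]]:
    """Run every dependency check; return 200 only if all pass, else 503."""
    results: dict[str, str] = {}
    healthy = True
    for name, check in checks.items():
        try:
            ok = check()
        except Exception:
            # A crashing check (e.g. connection refused) counts as a failure.
            ok = False
        results[name] = "ok" if ok else "failed"
        healthy = healthy and ok
    return (200 if healthy else 503), results
```

Returning the per-dependency detail in the response body makes probe failures diagnosable from kubectl describe output and logs, instead of a bare 503.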

Network Policies for Agent Isolation

Not every agent should talk to every other agent. Use Kubernetes NetworkPolicies to enforce the communication topology of your multi-agent system.

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: specialist-agent-policy
  namespace: agentic-ai
spec:
  podSelector:
    matchLabels:
      role: specialist-agent
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              role: triage-agent
      ports:
        - port: 8080
  egress:
    - to:
        - podSelector:
            matchLabels:
              role: tool-executor
      ports:
        - port: 8080
    - to:
        - namespaceSelector: {}
          podSelector:
            matchLabels:
              app: redis
      ports:
        - port: 6379
    # Allow DNS so the agent can resolve service names; without this rule,
    # the egress policy also blocks lookups to kube-dns.
    - to:
        - namespaceSelector: {}
          podSelector:
            matchLabels:
              k8s-app: kube-dns
      ports:
        - port: 53
          protocol: UDP
        - port: 53
          protocol: TCP

This policy ensures specialist agents can only receive traffic from the triage agent and can only reach tool executors, Redis, and cluster DNS. No direct internet access, no cross-agent chatter outside the defined topology.

Production Deployment Checklist

Before deploying a multi-agent system to production on Kubernetes, verify these items:

  • Pod Disruption Budgets configured so rolling updates never take all agent replicas offline simultaneously
  • Anti-affinity rules spread agent pods across nodes to survive node failures
  • Secrets management via Kubernetes Secrets or an external vault for LLM API keys
  • Persistent volume claims for any agent that maintains local state or caches
  • RBAC policies limiting which service accounts can modify agent deployments
  • Resource requests and limits set on every container to prevent noisy-neighbor problems
  • Graceful shutdown handlers that drain active conversations before pod termination

Frequently Asked Questions

How many agent replicas should I run per deployment?

Start with a minimum of 2 replicas for high availability and let the HPA scale from there. For latency-sensitive triage agents that handle initial user contact, consider a minimum of 3. Monitor the active_conversations metric for two weeks to establish a baseline before tuning.

Should I use one pod per agent type or combine agents in a single pod?

Use one pod per agent type. Combining agents in a single pod creates scaling coupling — if your billing agent needs more capacity but your scheduling agent does not, you waste resources. The only exception is tightly coupled agent-sidecar pairs like the context injection pattern described above.

Is a service mesh overkill for a small multi-agent system?

If you have fewer than 5 agent services, a service mesh adds operational complexity that may not be justified. Start with standard Kubernetes Services and add a mesh when you need canary deployments, mTLS, or advanced traffic management. Linkerd is lighter weight than Istio if you want to start small.

How do I handle long-running agent conversations during rolling updates?

Configure a terminationGracePeriodSeconds of at least 120 seconds on agent pods. Implement a SIGTERM handler in your agent code that stops accepting new conversations, waits for active ones to complete or checkpoint, then exits. Combine this with a PodDisruptionBudget to ensure at least 50% of replicas remain available during updates.
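A minimal sketch of such a SIGTERM drain handler in Python; the timeout and wiring are illustrative:

```python
import signal
import threading

class ConversationDrainer:
    """Stop accepting new conversations on SIGTERM, then wait for active ones."""

    def __init__(self):
        self.accepting = True
        self._active = 0
        self._lock = threading.Lock()
        self._idle = threading.Event()
        self._idle.set()

    def start_conversation(self) -> bool:
        with self._lock:
            if not self.accepting:
                return False  # the readiness probe should also start failing here
            self._active += 1
            self._idle.clear()
            return True

    def end_conversation(self):
        with self._lock:
            self._active -= 1
            if self._active == 0:
                self._idle.set()

    def drain(self, signum=None, frame=None, timeout: float = 110.0):
        """SIGTERM handler; keep timeout under terminationGracePeriodSeconds."""
        self.accepting = False
        self._idle.wait(timeout)  # block until active conversations complete

drainer = ConversationDrainer()
signal.signal(signal.SIGTERM, drainer.drain)
```

In a real agent, drain would also checkpoint any conversations that cannot finish within the grace period so a replacement pod can resume them.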

What monitoring should I have before going to production?

At minimum: request latency per agent (p50, p95, p99), active conversation count, LLM API error rate, token consumption per request, and pod restart count. Set alerts on error rate exceeding 5% and p99 latency exceeding your SLA threshold. Grafana dashboards with these metrics give your on-call team the visibility they need.

CallSphere Team

Expert insights on AI voice agents and customer communication automation.

Try CallSphere AI Voice Agents

See how AI voice agents work for your industry. Live demo available -- no signup required.