
Horizontal Pod Autoscaling for AI Agents: Scaling Based on Custom Metrics

Configure Kubernetes Horizontal Pod Autoscaler for AI agent workloads using CPU, memory, and custom metrics. Learn KEDA integration and scale-to-zero patterns for cost optimization.

Why AI Agents Need Autoscaling

AI agent workloads are inherently bursty. A customer support agent might handle 10 requests per minute during quiet hours and 500 during a product launch. Running enough replicas for peak load wastes money during idle periods. Running too few causes timeouts and dropped requests. Horizontal Pod Autoscaling (HPA) dynamically adjusts replica count based on observed metrics.

Basic HPA with CPU Metrics

The simplest HPA scales based on average CPU utilization across all Pods:

# ai-agent-hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ai-agent-hpa
  namespace: ai-agents
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ai-agent
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 30
      policies:
        - type: Pods
          value: 4
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Pods
          value: 1
          periodSeconds: 120

The behavior section is critical for AI agents. Scale-up is aggressive — add up to four Pods per minute when load spikes. Scale-down is conservative — remove one Pod every two minutes with a five-minute stabilization window to avoid flapping during variable traffic.
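The HPA controller's core formula is documented as desiredReplicas = ceil(currentReplicas × currentMetricValue / targetMetricValue). A quick sketch of that arithmetic (ignoring the behavior policies, which cap how fast the result is applied):

```python
import math

def desired_replicas(current_replicas: int,
                     observed_utilization: float,
                     target_utilization: float) -> int:
    """HPA core formula: desired = ceil(current * (observed / target))."""
    return math.ceil(current_replicas * observed_utilization / target_utilization)

# 4 Pods averaging 90% CPU against a 60% target -> scale to 6
print(desired_replicas(4, 90, 60))  # 6
```

With the scaleUp policy above, even if the formula recommends jumping from 4 to 12 replicas, Kubernetes adds at most 4 Pods per 60-second period.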

Custom Metrics with Prometheus

CPU utilization is a poor proxy for AI agent load. A better metric is request queue depth or average response latency. Export custom metrics from your agent:

from prometheus_client import Histogram, Gauge, start_http_server

# Track active agent sessions
active_sessions = Gauge(
    "ai_agent_active_sessions",
    "Number of active agent sessions"
)

# Track response latency
response_latency = Histogram(
    "ai_agent_response_seconds",
    "Time to generate agent response",
    buckets=[0.5, 1.0, 2.0, 5.0, 10.0, 30.0]
)

# Start metrics server on a separate port
start_http_server(9090)
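In a request handler, the gauge and histogram might be wired in like this (handle_request and its body are illustrative stand-ins for your agent's entry point):

```python
from prometheus_client import Histogram, Gauge

# Same metrics as above, shown with their update calls
active_sessions = Gauge(
    "ai_agent_active_sessions",
    "Number of active agent sessions",
)
response_latency = Histogram(
    "ai_agent_response_seconds",
    "Time to generate agent response",
    buckets=[0.5, 1.0, 2.0, 5.0, 10.0, 30.0],
)

def handle_request(prompt: str) -> str:
    active_sessions.inc()                 # session opens
    try:
        with response_latency.time():     # records elapsed time into the histogram
            return f"response to {prompt}"  # placeholder for the real agent call
    finally:
        active_sessions.dec()             # session closes even if the call raises
```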

Configure HPA to use the custom metric via the Prometheus adapter:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ai-agent-hpa-custom
  namespace: ai-agents
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ai-agent
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Pods
      pods:
        metric:
          name: ai_agent_active_sessions
        target:
          type: AverageValue
          averageValue: "10"

This configuration maintains an average of 10 active sessions per Pod. When sessions increase, Kubernetes adds replicas. When sessions drop, it removes them.
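For the Pods metric above to exist, the Prometheus adapter needs a discovery rule that maps the exported series onto Pods. A hedged sketch of such a rule in the adapter's config (query window and label names depend on your deployment):

```yaml
rules:
  - seriesQuery: 'ai_agent_active_sessions{namespace!="",pod!=""}'
    resources:
      overrides:
        namespace: {resource: "namespace"}
        pod: {resource: "pod"}
    name:
      matches: "ai_agent_active_sessions"
      as: "ai_agent_active_sessions"
    metricsQuery: 'avg_over_time(<<.Series>>{<<.LabelMatchers>>}[2m])'
```

Verify the metric is visible to the HPA with `kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1" | jq .` before relying on it for scaling.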

KEDA: Event-Driven Autoscaling

KEDA (Kubernetes Event-Driven Autoscaling) extends HPA with scalers for queues, databases, and external services. It also supports scale-to-zero, which standard HPA does not.


Install KEDA:

helm repo add kedacore https://kedacore.github.io/charts
helm install keda kedacore/keda --namespace keda --create-namespace

Create a ScaledObject that scales based on a Redis queue:

# ai-agent-keda.yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: ai-agent-scaler
  namespace: ai-agents
spec:
  scaleTargetRef:
    name: ai-agent
  pollingInterval: 10
  cooldownPeriod: 300
  minReplicaCount: 0
  maxReplicaCount: 30
  triggers:
    - type: redis
      metadata:
        address: redis-host:6379
        listName: agent-task-queue
        listLength: "5"
        activationListLength: "1"

With minReplicaCount: 0, the Deployment scales to zero Pods when the queue is empty and scales back up once the queue length exceeds activationListLength. This saves significant cost for agents that handle periodic batch workloads.
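The two thresholds play different roles: activationListLength only decides whether to leave zero, while listLength drives the replica count once active. A simplified model of that behavior:

```python
import math

def keda_replicas(queue_length: int, list_length: int = 5,
                  activation: int = 1, max_replicas: int = 30) -> int:
    """Simplified model of the KEDA Redis scaler: stay at zero until the
    queue exceeds the activation threshold, then target list_length
    items per Pod, capped at max_replicas."""
    if queue_length <= activation:
        return 0
    return min(max_replicas, math.ceil(queue_length / list_length))
```

So a queue of 2 wakes a single Pod, a queue of 100 requests 20 Pods, and the cap of 30 only bites under extreme backlogs. The real controller also applies cooldownPeriod before returning to zero.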

Scale-to-Zero Pattern for AI Agents

Scale-to-zero works well for batch agents but requires careful handling of cold starts:

import asyncio
import signal

class GracefulAgent:
    def __init__(self):
        self.running = True
        signal.signal(signal.SIGTERM, self._shutdown)

    def _shutdown(self, signum, frame):
        self.running = False

    async def process_queue(self):
        """Process tasks until shutdown signal."""
        while self.running:
            task = await self.fetch_from_queue(timeout=5)
            if task:
                await self.handle_task(task)

    async def fetch_from_queue(self, timeout: int):
        # Redis BRPOP with timeout
        pass

    async def handle_task(self, task: dict):
        # Agent processing logic
        pass
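A minimal runnable variant of the same pattern, substituting an asyncio.Queue for Redis so the shutdown flow can be exercised locally (names and timings are illustrative):

```python
import asyncio

class QueueAgent:
    """Drains a local queue until the running flag is cleared."""

    def __init__(self):
        self.running = True
        self.queue: asyncio.Queue = asyncio.Queue()
        self.handled: list = []

    async def process_queue(self):
        while self.running:
            task = await self.fetch_from_queue(timeout=0.1)
            if task:
                await self.handle_task(task)

    async def fetch_from_queue(self, timeout: float):
        try:
            return await asyncio.wait_for(self.queue.get(), timeout)
        except asyncio.TimeoutError:
            return None  # no work this interval; loop re-checks the running flag

    async def handle_task(self, task: dict):
        self.handled.append(task["id"])

async def main():
    agent = QueueAgent()
    await agent.queue.put({"id": 1})
    await agent.queue.put({"id": 2})
    worker = asyncio.create_task(agent.process_queue())
    await asyncio.sleep(0.3)   # let the worker drain the queue
    agent.running = False      # stands in for the SIGTERM handler
    await worker               # worker exits cleanly after its current fetch
    return agent.handled

print(asyncio.run(main()))  # -> [1, 2]
```

The short fetch timeout is what makes shutdown responsive: the loop can only observe the cleared flag between fetches, so terminationGracePeriodSeconds on the Pod should comfortably exceed one timeout plus the longest handle_task.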

FAQ

What metrics should I use for autoscaling AI agents?

Avoid relying solely on CPU. The best metrics depend on your agent type. For synchronous request-response agents, use request latency (p95) or concurrent connections. For queue-based agents, use queue depth divided by processing rate. For WebSocket-based conversational agents, use active session count. Combine multiple metrics — Kubernetes scales to the highest recommendation from any single metric.

How do I prevent autoscaling from causing cost overruns?

Set hard maxReplicas limits, implement resource quotas at the namespace level, and configure PodDisruptionBudgets. Use cloud provider billing alerts as a safety net. With KEDA, cooldownPeriod controls how long the scaler waits after the last trigger activity before scaling back to zero, which prevents rapid scale-to-zero and scale-up churn that can multiply cold starts unnecessarily.

What is the cold start time for a scaled-to-zero AI agent?

Cold start includes container image pull, application startup, model loading, and the time for readiness probes to pass. For a well-optimized AI agent image without local models, expect 5 to 15 seconds. Pre-pulled images on nodes reduce this to 2 to 5 seconds. If cold start latency is unacceptable, set minReplicaCount: 1 to keep one warm replica.


#Kubernetes #Autoscaling #KEDA #AIAgents #CostOptimization #AgenticAI #LearnAI #AIEngineering

CallSphere Team
