Horizontal Pod Autoscaling for AI Agents: Scaling Based on Custom Metrics
Configure Kubernetes Horizontal Pod Autoscaler for AI agent workloads using CPU, memory, and custom metrics. Learn KEDA integration and scale-to-zero patterns for cost optimization.
Why AI Agents Need Autoscaling
AI agent workloads are inherently bursty. A customer support agent might handle 10 requests per minute during quiet hours and 500 during a product launch. Running enough replicas for peak load wastes money during idle periods. Running too few causes timeouts and dropped requests. Horizontal Pod Autoscaling (HPA) dynamically adjusts replica count based on observed metrics.
Basic HPA with CPU Metrics
The simplest HPA scales based on average CPU utilization across all Pods:
# ai-agent-hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ai-agent-hpa
  namespace: ai-agents
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ai-agent
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 30
      policies:
        - type: Pods
          value: 4
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Pods
          value: 1
          periodSeconds: 120
The behavior section is critical for AI agents. Scale-up is aggressive — add up to four Pods per minute when load spikes. Scale-down is conservative — remove one Pod every two minutes with a five-minute stabilization window to avoid flapping during variable traffic.
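Under the hood, HPA converges on a replica count using a simple formula: desiredReplicas = ceil(currentReplicas × currentMetricValue / targetMetricValue), clamped to the min/max bounds. A minimal sketch of that calculation, using the limits from the manifest above:

```python
import math

def desired_replicas(current_replicas: int, current_metric: float,
                     target_metric: float, min_replicas: int = 2,
                     max_replicas: int = 20) -> int:
    """Core HPA formula: ceil(current * observed / target), clamped to bounds."""
    raw = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, raw))

# CPU at 90% with a 60% target on 4 Pods -> scale to 6
print(desired_replicas(4, 90, 60))
```

Note that the behavior policies then rate-limit how fast the controller is allowed to move toward that target, which is why the formula alone never adds more than four Pods per minute here.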
Custom Metrics with Prometheus
CPU utilization is a poor proxy for AI agent load. A better metric is request queue depth or average response latency. Export custom metrics from your agent:
from prometheus_client import Histogram, Gauge, start_http_server

# Track active agent sessions
active_sessions = Gauge(
    "ai_agent_active_sessions",
    "Number of active agent sessions",
)

# Track response latency
response_latency = Histogram(
    "ai_agent_response_seconds",
    "Time to generate agent response",
    buckets=[0.5, 1.0, 2.0, 5.0, 10.0, 30.0],
)

# Start metrics server on a separate port
start_http_server(9090)
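Prometheus also needs to discover and scrape port 9090. A sketch of the relevant Deployment fragment, assuming your Prometheus scrape configuration honors the common prometheus.io annotations (the port numbers and container name here are illustrative):

```yaml
# Fragment of the ai-agent Deployment Pod template
spec:
  template:
    metadata:
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "9090"
    spec:
      containers:
        - name: ai-agent
          ports:
            - containerPort: 8080  # application traffic
            - containerPort: 9090  # metrics endpoint
```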
Configure HPA to use the custom metric via the Prometheus adapter:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ai-agent-hpa-custom
  namespace: ai-agents
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ai-agent
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Pods
      pods:
        metric:
          name: ai_agent_active_sessions
        target:
          type: AverageValue
          averageValue: "10"
This configuration maintains an average of 10 active sessions per Pod. When sessions increase, Kubernetes adds replicas. When sessions drop, it removes them.
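The Prometheus adapter only exposes metrics it has an explicit rule for, so the HPA above will report "unknown metric" until one is added. A sketch of an adapter rule mapping ai_agent_active_sessions onto Pods, assuming the series carries the standard namespace and pod labels (adjust the label names to your setup):

```yaml
# prometheus-adapter config fragment
rules:
  - seriesQuery: 'ai_agent_active_sessions{namespace!="",pod!=""}'
    resources:
      overrides:
        namespace: {resource: "namespace"}
        pod: {resource: "pod"}
    metricsQuery: 'avg_over_time(<<.Series>>{<<.LabelMatchers>>}[2m])'
```

Averaging over a two-minute window smooths momentary session spikes before they reach the HPA controller.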
KEDA: Event-Driven Autoscaling
KEDA (Kubernetes Event-Driven Autoscaling) extends HPA with scalers for queues, databases, and external services. It also supports scale-to-zero, which standard HPA does not.
Install KEDA:
helm repo add kedacore https://kedacore.github.io/charts
helm install keda kedacore/keda --namespace keda --create-namespace
Create a ScaledObject that scales based on a Redis queue:
# ai-agent-keda.yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: ai-agent-scaler
  namespace: ai-agents
spec:
  scaleTargetRef:
    name: ai-agent
  pollingInterval: 10
  cooldownPeriod: 300
  minReplicaCount: 0
  maxReplicaCount: 30
  triggers:
    - type: redis
      metadata:
        address: redis-host:6379
        listName: agent-task-queue
        listLength: "5"
        activationListLength: "1"
With minReplicaCount: 0, the Deployment scales to zero Pods when the queue is empty, and activates when at least one message appears. This saves significant cost for agents that handle periodic batch workloads.
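The arithmetic KEDA applies here can be approximated as ceil(queueLength / listLength), gated by the activation threshold and clamped to the replica bounds. A rough sketch (the real scaling is done by KEDA's metrics adapter; this only illustrates the math):

```python
import math

def keda_replicas(queue_length: int, list_length: int = 5,
                  activation: int = 1, min_replicas: int = 0,
                  max_replicas: int = 30) -> int:
    """Approximate KEDA target: ceil(queue / listLength), with activation gate."""
    if queue_length < activation:
        return min_replicas  # below activation threshold -> scale to zero
    raw = math.ceil(queue_length / list_length)
    return max(1, min(max_replicas, raw))

print(keda_replicas(0))   # empty queue -> 0 replicas
print(keda_replicas(23))  # ceil(23 / 5) -> 5 replicas
```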
Scale-to-Zero Pattern for AI Agents
Scale-to-zero works well for batch agents but requires careful handling of cold starts:
import asyncio
import signal

class GracefulAgent:
    def __init__(self):
        self.running = True
        signal.signal(signal.SIGTERM, self._shutdown)

    def _shutdown(self, signum, frame):
        # Stop accepting new work; the in-flight task finishes before exit
        self.running = False

    async def process_queue(self):
        """Process tasks until shutdown signal."""
        while self.running:
            task = await self.fetch_from_queue(timeout=5)
            if task:
                await self.handle_task(task)

    async def fetch_from_queue(self, timeout: int):
        # Redis BRPOP with timeout
        pass

    async def handle_task(self, task: dict):
        # Agent processing logic
        pass
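The drain loop above can be exercised without Redis. A self-contained sketch that substitutes an in-memory asyncio.Queue for BRPOP, with a queue timeout standing in for the SIGTERM flag:

```python
import asyncio

async def drain_demo() -> list:
    """Drain an in-memory queue, stopping when no work arrives in time.
    Stands in for the Redis-backed loop; asyncio.Queue.get with a timeout
    plays the role of BRPOP, and the timeout plays the role of SIGTERM."""
    queue: asyncio.Queue = asyncio.Queue()
    for i in range(3):
        queue.put_nowait({"task_id": i})

    running = True
    handled = []
    while running:
        try:
            task = await asyncio.wait_for(queue.get(), timeout=0.1)
            handled.append(task["task_id"])  # "process" the task
        except asyncio.TimeoutError:
            running = False  # no more work: exit the loop cleanly
    return handled

print(asyncio.run(drain_demo()))  # [0, 1, 2]
```

In production, pair this pattern with a terminationGracePeriodSeconds long enough for the in-flight task to complete after SIGTERM.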
FAQ
What metrics should I use for autoscaling AI agents?
Avoid relying solely on CPU. The best metrics depend on your agent type. For synchronous request-response agents, use request latency (p95) or concurrent connections. For queue-based agents, use queue depth divided by processing rate. For WebSocket-based conversational agents, use active session count. Combine multiple metrics — Kubernetes scales to the highest recommendation from any single metric.
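For queue-based agents, "queue depth divided by processing rate" becomes a concrete replica target once you pick an acceptable drain time. A sketch of that arithmetic (the drain-time target is an assumption you tune, not something HPA computes for you):

```python
import math

def replicas_for_queue(queue_depth: int, tasks_per_pod_per_sec: float,
                       target_drain_seconds: float = 60.0) -> int:
    """Replicas needed to drain the current backlog within the target window."""
    capacity_per_pod = tasks_per_pod_per_sec * target_drain_seconds
    return max(1, math.ceil(queue_depth / capacity_per_pod))

# 600 queued tasks, 0.5 tasks/s per Pod, drain within 60 s -> 20 Pods
print(replicas_for_queue(600, 0.5))
```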
How do I prevent autoscaling from causing cost overruns?
Set hard maxReplicas limits, implement resource quotas at the namespace level, and configure PodDisruptionBudgets. Use cloud provider billing alerts as a safety net. With KEDA, the cooldownPeriod keeps the workload running for a fixed window after the last active trigger, preventing rapid scale-to-zero and scale-back-up oscillation that churns Pods unnecessarily.
What is the cold start time for a scaled-to-zero AI agent?
Cold start includes container image pull time, application startup, model loading, and passing readiness checks. For a well-optimized AI agent image without local models, expect 5 to 15 seconds. Pre-pulled images on nodes reduce this to 2 to 5 seconds. If cold start latency is unacceptable, set minReplicaCount: 1 to keep one warm replica.
#Kubernetes #Autoscaling #KEDA #AIAgents #CostOptimization #AgenticAI #LearnAI #AIEngineering
CallSphere Team
Expert insights on AI voice agents and customer communication automation.