Service Discovery for AI Agent Microservices: Consul, Kubernetes DNS, and Eureka
Implement service discovery for AI agent microservices using Kubernetes DNS, Consul, and Eureka. Learn health checking, load balancing, and failover strategies that keep agent systems resilient.
The Service Discovery Problem in Agent Systems
In a monolithic agent, every component is reachable through a function call. When you decompose into microservices, the conversation manager needs to find the RAG service, the tool execution engine, and the memory store. These services may have multiple replicas, they may restart and get new IP addresses, and new instances may spin up during load spikes.
Hardcoding IP addresses or hostnames in configuration files breaks the moment a pod restarts. Service discovery is the mechanism that lets services find each other dynamically.
Kubernetes DNS: The Zero-Config Option
If your agent system runs on Kubernetes, you get service discovery out of the box. Every Kubernetes Service object creates a DNS entry that other pods can resolve:
```yaml
# rag-service.yaml
apiVersion: v1
kind: Service
metadata:
  name: rag-retrieval
  namespace: agent-system
spec:
  selector:
    app: rag-retrieval
  ports:
    - port: 8002
      targetPort: 8002
  type: ClusterIP
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: rag-retrieval
  namespace: agent-system
spec:
  replicas: 3
  selector:
    matchLabels:
      app: rag-retrieval
  template:
    metadata:
      labels:
        app: rag-retrieval
    spec:
      containers:
        - name: app
          image: agent-system/rag-retrieval:v2.1
          ports:
            - containerPort: 8002
          readinessProbe:
            httpGet:
              path: /health
              port: 8002
            initialDelaySeconds: 5
            periodSeconds: 10
          livenessProbe:
            httpGet:
              path: /health
              port: 8002
            initialDelaySeconds: 15
            periodSeconds: 20
```
Any pod in the agent-system namespace can reach the RAG service at http://rag-retrieval:8002. Kubernetes automatically load-balances across the 3 replicas. The readiness probe ensures that traffic only reaches pods that are actually ready to serve requests.
In the conversation manager's configuration, the service URL is simply a Kubernetes DNS name:
```python
import os

import httpx


class ServiceConfig:
    RAG_SERVICE_URL = os.getenv(
        "RAG_SERVICE_URL", "http://rag-retrieval:8002"
    )
    TOOL_SERVICE_URL = os.getenv(
        "TOOL_SERVICE_URL", "http://tool-execution:8001"
    )
    MEMORY_SERVICE_URL = os.getenv(
        "MEMORY_SERVICE_URL", "http://memory-service:8003"
    )


class ServiceClient:
    def __init__(self, config: ServiceConfig):
        self.config = config
        self._client = httpx.AsyncClient(timeout=10.0)

    async def retrieve_context(self, query: str, top_k: int = 5):
        resp = await self._client.post(
            f"{self.config.RAG_SERVICE_URL}/retrieve",
            json={"query": query, "top_k": top_k},
        )
        resp.raise_for_status()
        return resp.json()
```
Health Checking Patterns
Health checks are the foundation of service discovery. A service that registers itself but cannot serve requests is worse than a service that is not registered at all. Implement two health check endpoints:
```python
from datetime import datetime

from fastapi import FastAPI
from fastapi.responses import JSONResponse

app = FastAPI()
startup_time = datetime.utcnow()
is_ready = False


@app.get("/health/live")
async def liveness():
    """Am I running? Returns 200 if the process is alive."""
    return {
        "status": "alive",
        "uptime_seconds": (datetime.utcnow() - startup_time).total_seconds(),
    }


@app.get("/health/ready")
async def readiness():
    """Can I serve traffic? Checks startup state and all dependencies."""
    if not is_ready:
        return JSONResponse(
            status_code=503,
            content={"status": "not_ready", "checks": {"startup": "incomplete"}},
        )
    checks = {}
    try:
        await vector_store.ping()  # module-level dependency clients
        checks["vector_store"] = "ok"
    except Exception:
        checks["vector_store"] = "failed"
    try:
        await embedding_model.ping()
        checks["embedding_model"] = "ok"
    except Exception:
        checks["embedding_model"] = "failed"
    all_healthy = all(v == "ok" for v in checks.values())
    if not all_healthy:
        return JSONResponse(
            status_code=503,
            content={"status": "not_ready", "checks": checks},
        )
    return {"status": "ready", "checks": checks}


@app.on_event("startup")
async def on_startup():
    global is_ready
    await vector_store.connect()
    await embedding_model.load()
    is_ready = True
```
The liveness probe tells Kubernetes whether to restart the pod. The readiness probe tells Kubernetes whether to send traffic to it. A pod that has a healthy process but a disconnected database should fail readiness (removing it from the load balancer) without failing liveness (which would restart it unnecessarily).
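With split endpoints like `/health/live` and `/health/ready`, the Deployment's probes should point at different paths rather than a single `/health`. A sketch of the container-spec fragment, reusing the port and timings from the earlier example:

```yaml
livenessProbe:
  httpGet:
    path: /health/live
    port: 8002
  initialDelaySeconds: 15
  periodSeconds: 20
readinessProbe:
  httpGet:
    path: /health/ready
    port: 8002
  initialDelaySeconds: 5
  periodSeconds: 10
```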
Consul for Multi-Environment Discovery
When your agent services span multiple environments — some on Kubernetes, some on bare-metal GPU servers, some in a different cloud — Consul provides service discovery that works across boundaries:
```python
import consul


class ConsulServiceRegistry:
    def __init__(self, host: str = "consul-server", port: int = 8500):
        self.client = consul.Consul(host=host, port=port)

    def register(
        self,
        service_name: str,
        service_id: str,
        address: str,
        port: int,
        tags: list[str] | None = None,
    ):
        self.client.agent.service.register(
            name=service_name,
            service_id=service_id,
            address=address,
            port=port,
            tags=tags or [],
            check=consul.Check.http(
                f"http://{address}:{port}/health/ready",
                interval="10s",
                timeout="5s",
                deregister="30s",
            ),
        )

    def discover(self, service_name: str) -> list[dict]:
        # Only return instances whose health check is currently passing
        _, services = self.client.health.service(
            service_name, passing=True
        )
        return [
            {
                "address": svc["Service"]["Address"],
                "port": svc["Service"]["Port"],
                "tags": svc["Service"]["Tags"],
            }
            for svc in services
        ]
```
Client-Side Load Balancing
With service discovery returning multiple healthy instances, implement client-side load balancing for smarter routing:
```python
import httpx


class LoadBalancedClient:
    def __init__(self, registry: ConsulServiceRegistry, service: str):
        self.registry = registry
        self.service = service
        self._instances: list[dict] = []
        self._index = 0

    async def refresh_instances(self):
        self._instances = self.registry.discover(self.service)

    def next_instance(self) -> dict:
        if not self._instances:
            raise RuntimeError(f"No healthy instances for {self.service}")
        # Round-robin selection
        instance = self._instances[self._index % len(self._instances)]
        self._index += 1
        return instance

    async def call(self, path: str, payload: dict) -> dict:
        instance = self.next_instance()
        url = f"http://{instance['address']}:{instance['port']}{path}"
        async with httpx.AsyncClient() as client:
            resp = await client.post(url, json=payload, timeout=10.0)
            resp.raise_for_status()
            return resp.json()
```
FAQ
Is Kubernetes DNS sufficient, or do I need Consul?
Kubernetes DNS is sufficient if all your agent services run within a single Kubernetes cluster. It requires zero configuration and integrates natively with Kubernetes health checks. Add Consul only if your services span multiple clusters, include non-Kubernetes workloads (like GPU servers running outside the cluster), or you need advanced features like service mesh, key-value configuration, or multi-datacenter discovery.
How often should health checks run for AI agent services?
Every 10 seconds for readiness checks and every 20 seconds for liveness checks is a good default. AI services that load large models during startup should use a longer initialDelaySeconds (30-60 seconds) to avoid being killed before they finish loading. For latency-sensitive agent systems, consider reducing readiness check intervals to 5 seconds.
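Kubernetes also has a dedicated mechanism for slow starts: a startupProbe suppresses the liveness and readiness probes until it succeeds, which avoids inflating initialDelaySeconds. A sketch, assuming the same /health endpoint and port as the earlier Deployment:

```yaml
startupProbe:
  httpGet:
    path: /health
    port: 8002
  failureThreshold: 30   # allow up to 30 * 5s = 150s for model loading
  periodSeconds: 5
```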
What happens when a service has zero healthy instances?
The calling service should implement a circuit breaker pattern. After a threshold of consecutive failures (e.g., 5), the circuit opens and the caller immediately returns an error instead of waiting for timeouts. This prevents cascading failures where one unhealthy service causes all upstream services to block on network timeouts.
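The threshold behavior described above can be sketched as a small state machine. Names and timings here are illustrative, not a specific library's API:

```python
import time


class CircuitBreaker:
    """Open after N consecutive failures; allow a trial call after a cooldown."""

    def __init__(self, failure_threshold: int = 5, reset_timeout: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._failures = 0
        self._opened_at: float | None = None

    def allow_request(self) -> bool:
        if self._opened_at is None:
            return True  # closed: traffic flows normally
        if time.monotonic() - self._opened_at >= self.reset_timeout:
            return True  # half-open: let one trial request through
        return False     # open: fail fast instead of waiting on timeouts

    def record_success(self):
        self._failures = 0
        self._opened_at = None

    def record_failure(self):
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._opened_at = time.monotonic()
```

The caller checks allow_request() before each call and records the outcome; while the breaker is open, it returns an error or degraded response immediately instead of blocking on a timeout.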
CallSphere Team