
Service Discovery for AI Agent Microservices: Consul, Kubernetes DNS, and Eureka

Implement service discovery for AI agent microservices using Kubernetes DNS, Consul, and Eureka. Learn health checking, load balancing, and failover strategies that keep agent systems resilient.

The Service Discovery Problem in Agent Systems

In a monolithic agent, every component is reachable through a function call. When you decompose into microservices, the conversation manager needs to find the RAG service, the tool execution engine, and the memory store. These services may have multiple replicas, they may restart and get new IP addresses, and new instances may spin up during load spikes.

Hardcoding IP addresses or hostnames in configuration files breaks the moment a pod restarts. Service discovery is the mechanism that lets services find each other dynamically.

Kubernetes DNS: The Zero-Config Option

If your agent system runs on Kubernetes, you get service discovery out of the box. Every Kubernetes Service object creates a DNS entry that other pods can resolve:

# rag-service.yaml
apiVersion: v1
kind: Service
metadata:
  name: rag-retrieval
  namespace: agent-system
spec:
  selector:
    app: rag-retrieval
  ports:
    - port: 8002
      targetPort: 8002
  type: ClusterIP
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: rag-retrieval
  namespace: agent-system
spec:
  replicas: 3
  selector:
    matchLabels:
      app: rag-retrieval
  template:
    metadata:
      labels:
        app: rag-retrieval
    spec:
      containers:
        - name: app
          image: agent-system/rag-retrieval:v2.1
          ports:
            - containerPort: 8002
          readinessProbe:
            httpGet:
              path: /health/ready
              port: 8002
            initialDelaySeconds: 5
            periodSeconds: 10
          livenessProbe:
            httpGet:
              path: /health/live
              port: 8002
            initialDelaySeconds: 15
            periodSeconds: 20

Any pod in the agent-system namespace can reach the RAG service at http://rag-retrieval:8002; pods in other namespaces use the fully qualified name http://rag-retrieval.agent-system.svc.cluster.local:8002. Kubernetes automatically load-balances across the 3 replicas. The readiness probe ensures that traffic only reaches pods that are actually ready to serve requests.

In the conversation manager's configuration, the service URL is simply a Kubernetes DNS name:

import os

import httpx

class ServiceConfig:
    RAG_SERVICE_URL = os.getenv(
        "RAG_SERVICE_URL", "http://rag-retrieval:8002"
    )
    TOOL_SERVICE_URL = os.getenv(
        "TOOL_SERVICE_URL", "http://tool-execution:8001"
    )
    MEMORY_SERVICE_URL = os.getenv(
        "MEMORY_SERVICE_URL", "http://memory-service:8003"
    )

class ServiceClient:
    def __init__(self, config: ServiceConfig):
        self.config = config
        self._client = httpx.AsyncClient(timeout=10.0)

    async def retrieve_context(self, query: str, top_k: int = 5):
        resp = await self._client.post(
            f"{self.config.RAG_SERVICE_URL}/retrieve",
            json={"query": query, "top_k": top_k},
        )
        resp.raise_for_status()
        return resp.json()

Health Checking Patterns

Health checks are the foundation of service discovery. A service that registers itself but cannot serve requests is worse than a service that is not registered at all. Implement two health check endpoints:

from datetime import datetime

from fastapi import FastAPI
from fastapi.responses import JSONResponse

app = FastAPI()

# vector_store and embedding_model are assumed to be module-level clients
startup_time = datetime.utcnow()
is_ready = False

@app.get("/health/live")
async def liveness():
    """Am I running? Returns 200 if the process is alive."""
    return {"status": "alive", "uptime_seconds": (
        datetime.utcnow() - startup_time
    ).total_seconds()}

@app.get("/health/ready")
async def readiness():
    """Can I serve traffic? Checks all dependencies."""
    if not is_ready:
        return JSONResponse(
            status_code=503,
            content={"status": "not_ready", "checks": {"startup": "incomplete"}},
        )

    checks = {}
    try:
        await vector_store.ping()
        checks["vector_store"] = "ok"
    except Exception:
        checks["vector_store"] = "failed"

    try:
        await embedding_model.ping()
        checks["embedding_model"] = "ok"
    except Exception:
        checks["embedding_model"] = "failed"

    all_healthy = all(v == "ok" for v in checks.values())
    if not all_healthy:
        return JSONResponse(
            status_code=503,
            content={"status": "not_ready", "checks": checks},
        )
    return {"status": "ready", "checks": checks}

@app.on_event("startup")
async def on_startup():
    global is_ready
    await vector_store.connect()
    await embedding_model.load()
    is_ready = True

The liveness probe tells Kubernetes whether to restart the pod. The readiness probe tells Kubernetes whether to send traffic to it. A pod that has a healthy process but a disconnected database should fail readiness (removing it from the load balancer) without failing liveness (which would restart it unnecessarily).

Consul for Multi-Environment Discovery

When your agent services span multiple environments — some on Kubernetes, some on bare-metal GPU servers, some in a different cloud — Consul provides service discovery that works across boundaries:

import consul

class ConsulServiceRegistry:
    def __init__(self, host: str = "consul-server", port: int = 8500):
        self.client = consul.Consul(host=host, port=port)

    def register(
        self,
        service_name: str,
        service_id: str,
        address: str,
        port: int,
        tags: list[str] | None = None,
    ):
        self.client.agent.service.register(
            name=service_name,
            service_id=service_id,
            address=address,
            port=port,
            tags=tags or [],
            check=consul.Check.http(
                f"http://{address}:{port}/health/ready",
                interval="10s",
                timeout="5s",
                deregister="30s",
            ),
        )

    def discover(self, service_name: str) -> list[dict]:
        _, services = self.client.health.service(
            service_name, passing=True
        )
        return [
            {
                "address": svc["Service"]["Address"],
                "port": svc["Service"]["Port"],
                "tags": svc["Service"]["Tags"],
            }
            for svc in services
        ]
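
The registry above leaves open how each replica picks its own service_id and address at startup. A minimal sketch of that step, assuming the ConsulServiceRegistry shown above (the helper name and the agent-system tag are illustrative, not part of the Consul API):

```python
import socket
import uuid

def build_registration(service_name: str, port: int) -> dict:
    """Assemble keyword arguments for ConsulServiceRegistry.register().

    A unique service_id lets multiple replicas of the same service
    register side by side without overwriting each other's entries.
    """
    try:
        address = socket.gethostbyname(socket.gethostname())
    except OSError:
        address = "127.0.0.1"
    return {
        "service_name": service_name,
        "service_id": f"{service_name}-{uuid.uuid4().hex[:8]}",
        "address": address,
        "port": port,
        "tags": ["agent-system"],
    }

# Usage at service startup:
# registry = ConsulServiceRegistry()
# registry.register(**build_registration("rag-retrieval", 8002))
```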

Client-Side Load Balancing

With service discovery returning multiple healthy instances, implement client-side load balancing for smarter routing:

import httpx

class LoadBalancedClient:
    def __init__(self, registry: ConsulServiceRegistry, service: str):
        self.registry = registry
        self.service = service
        self._instances: list[dict] = []
        self._index = 0

    async def refresh_instances(self):
        self._instances = self.registry.discover(self.service)

    def next_instance(self) -> dict:
        if not self._instances:
            raise RuntimeError(f"No healthy instances for {self.service}")
        # Round-robin selection
        instance = self._instances[self._index % len(self._instances)]
        self._index += 1
        return instance

    async def call(self, path: str, payload: dict) -> dict:
        instance = self.next_instance()
        url = f"http://{instance['address']}:{instance['port']}{path}"
        async with httpx.AsyncClient() as client:
            resp = await client.post(url, json=payload, timeout=10.0)
            resp.raise_for_status()
            return resp.json()
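
Round-robin alone still fails a request when the chosen instance happens to be down between health checks. One way to layer failover on top is to try each instance once before giving up; this sketch injects the actual request as an async `send` callable (e.g. an httpx POST) so the failover logic stays independent of the HTTP library — the class name and callable signature are illustrative:

```python
from typing import Awaitable, Callable

class FailoverClient:
    """Try each healthy instance in round-robin order and raise only
    after all of them have failed."""

    def __init__(
        self,
        instances: list[dict],
        send: Callable[[dict, str, dict], Awaitable[dict]],
    ):
        self._instances = instances
        self._send = send
        self._index = 0

    def _next_instance(self) -> dict:
        instance = self._instances[self._index % len(self._instances)]
        self._index += 1
        return instance

    async def call(self, path: str, payload: dict) -> dict:
        last_error: Exception | None = None
        # Try at most one full rotation through the instance list.
        for _ in range(len(self._instances)):
            instance = self._next_instance()
            try:
                return await self._send(instance, path, payload)
            except Exception as exc:
                last_error = exc  # note the failure, move to the next instance
        raise RuntimeError(
            f"All {len(self._instances)} instances failed"
        ) from last_error
```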

FAQ

Is Kubernetes DNS sufficient, or do I need Consul?

Kubernetes DNS is sufficient if all your agent services run within a single Kubernetes cluster. It requires zero configuration and integrates natively with Kubernetes health checks. Add Consul only if your services span multiple clusters, include non-Kubernetes workloads (like GPU servers running outside the cluster), or you need advanced features like service mesh, key-value configuration, or multi-datacenter discovery.

How often should health checks run for AI agent services?

Every 10 seconds for readiness checks and every 20 seconds for liveness checks is a good default. AI services that load large models during startup should use a longer initialDelaySeconds (30-60 seconds) to avoid being killed before they finish loading. For latency-sensitive agent systems, consider reducing readiness check intervals to 5 seconds.
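
These defaults translate into probe settings like the following sketch; the delay and interval values are the illustrative numbers from the answer above, tuned for a service that loads a large model at startup:

```yaml
readinessProbe:
  httpGet:
    path: /health/ready
    port: 8002
  initialDelaySeconds: 45   # wait out model loading before the first check
  periodSeconds: 5          # tighter interval for latency-sensitive agents
livenessProbe:
  httpGet:
    path: /health/live
    port: 8002
  initialDelaySeconds: 60
  periodSeconds: 20
```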

What happens when a service has zero healthy instances?

The calling service should implement a circuit breaker pattern. After a threshold of consecutive failures (e.g., 5), the circuit opens and the caller immediately returns an error instead of waiting for timeouts. This prevents cascading failures where one unhealthy service causes all upstream services to block on network timeouts.
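
The breaker logic described above fits in a few lines; this sketch uses the threshold from the answer as its default and adds the usual half-open state, where one trial request is allowed through after a reset timeout — the class name and parameter names are illustrative, not a library API:

```python
import time

class CircuitBreaker:
    """Fail fast after `threshold` consecutive failures; allow one trial
    request through after `reset_timeout` seconds (half-open state)."""

    def __init__(self, threshold: int = 5, reset_timeout: float = 30.0):
        self.threshold = threshold
        self.reset_timeout = reset_timeout
        self._failures = 0
        self._opened_at: float | None = None

    def allow_request(self) -> bool:
        if self._opened_at is None:
            return True  # circuit closed: normal operation
        if time.monotonic() - self._opened_at >= self.reset_timeout:
            return True  # half-open: let one trial request through
        return False  # open: reject immediately instead of waiting on timeouts

    def record_success(self) -> None:
        self._failures = 0
        self._opened_at = None  # close the circuit again

    def record_failure(self) -> None:
        self._failures += 1
        if self._failures >= self.threshold:
            self._opened_at = time.monotonic()  # trip the circuit
```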


#ServiceDiscovery #Kubernetes #Consul #Microservices #AgenticAI #LearnAI #AIEngineering
