Learn Agentic AI

Kubernetes Operators for AI Agents: Custom Controllers for Agent Lifecycle Management

Build a Kubernetes Operator for AI agent lifecycle management using Custom Resource Definitions, reconciliation loops, and status management to automate agent provisioning and scaling.

What Is a Kubernetes Operator?

A Kubernetes Operator extends the Kubernetes API with custom resources and controllers that encode domain-specific operational knowledge. Instead of manually creating Deployments, Services, ConfigMaps, and HPAs for each AI agent, you define an AIAgent custom resource and let the Operator reconcile all the underlying infrastructure automatically.

This transforms agent deployment from "create six YAML files and apply them in the right order" to "declare what agent you want and let the Operator handle the rest."

Custom Resource Definition (CRD)

First, define what an AIAgent resource looks like:

# crd-aiagent.yaml
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: aiagents.ai.example.com
spec:
  group: ai.example.com
  versions:
    - name: v1alpha1
      served: true
      storage: true
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
              required: ["model", "replicas"]
              properties:
                model:
                  type: string
                  description: "LLM model to use"
                replicas:
                  type: integer
                  minimum: 1
                  maximum: 100
                temperature:
                  type: number
                  default: 0.7
                maxTokens:
                  type: integer
                  default: 4096
                image:
                  type: string
                tools:
                  type: array
                  items:
                    type: string
                autoscaling:
                  type: object
                  properties:
                    enabled:
                      type: boolean
                      default: false
                    minReplicas:
                      type: integer
                    maxReplicas:
                      type: integer
            status:
              type: object
              properties:
                phase:
                  type: string
                readyReplicas:
                  type: integer
                lastUpdated:
                  type: string
                conditions:
                  type: array
                  items:
                    type: object
                    properties:
                      type:
                        type: string
                      status:
                        type: string
                      message:
                        type: string
      subresources:
        status: {}
      additionalPrinterColumns:
        - name: Model
          type: string
          jsonPath: .spec.model
        - name: Replicas
          type: integer
          jsonPath: .spec.replicas
        - name: Phase
          type: string
          jsonPath: .status.phase
  scope: Namespaced
  names:
    plural: aiagents
    singular: aiagent
    kind: AIAgent
    shortNames:
      - aia

Apply the CRD with kubectl apply -f crd-aiagent.yaml, and you can then create AIAgent resources:

# my-support-agent.yaml
apiVersion: ai.example.com/v1alpha1
kind: AIAgent
metadata:
  name: support-agent
  namespace: ai-agents
spec:
  model: "gpt-4o"
  replicas: 3
  temperature: 0.5
  maxTokens: 2048
  image: "myregistry/support-agent:2.0.0"
  tools:
    - "knowledge-base-search"
    - "ticket-creator"
    - "calendar-lookup"
  autoscaling:
    enabled: true
    minReplicas: 2
    maxReplicas: 15

Building the Operator in Python with Kopf

Kopf is a Python framework for building Kubernetes Operators. It handles watch streams, retry logic, and status updates.


# operator.py
import kopf
import kubernetes
from kubernetes import client, config


@kopf.on.startup()
def configure(settings, **kwargs):
    """Load in-cluster credentials when deployed; fall back to kubeconfig locally."""
    try:
        config.load_incluster_config()
    except config.ConfigException:
        config.load_kube_config()

@kopf.on.create("ai.example.com", "v1alpha1", "aiagents")
def create_agent(spec, name, namespace, patch, logger, **kwargs):
    """Reconcile when a new AIAgent is created.

    The handler is a plain (sync) function: the official kubernetes client
    is blocking, and Kopf runs sync handlers in a thread pool so they do
    not stall the operator's event loop.
    """
    logger.info(f"Creating AI agent: {name}")

    apps_v1 = client.AppsV1Api()
    core_v1 = client.CoreV1Api()

    # Create ConfigMap with agent settings
    configmap = client.V1ConfigMap(
        metadata=client.V1ObjectMeta(
            name=f"{name}-config",
            namespace=namespace,
        ),
        data={
            "MODEL_NAME": spec.get("model", "gpt-4o"),
            "TEMPERATURE": str(spec.get("temperature", 0.7)),
            "MAX_TOKENS": str(spec.get("maxTokens", 4096)),
            "TOOLS": ",".join(spec.get("tools", [])),
        },
    )
    kopf.adopt(configmap)
    core_v1.create_namespaced_config_map(namespace, configmap)

    # Create Deployment
    deployment = build_deployment(name, namespace, spec)
    kopf.adopt(deployment)
    apps_v1.create_namespaced_deployment(namespace, deployment)

    # Create Service
    service = build_service(name, namespace, spec)
    kopf.adopt(service)
    core_v1.create_namespaced_service(namespace, service)

    # Write initial status through the patch object. A plain return value
    # would be stored under status.create_agent, where the CRD's structural
    # schema would prune it -- not under the top-level status fields.
    patch.status["phase"] = "Running"
    patch.status["readyReplicas"] = 0


def build_deployment(name: str, namespace: str, spec: dict):
    """Build a Deployment object from AIAgent spec."""
    return client.V1Deployment(
        metadata=client.V1ObjectMeta(
            name=name,
            namespace=namespace,
        ),
        spec=client.V1DeploymentSpec(
            replicas=spec.get("replicas", 1),
            selector=client.V1LabelSelector(
                match_labels={"aiagent": name}
            ),
            template=client.V1PodTemplateSpec(
                metadata=client.V1ObjectMeta(
                    labels={"aiagent": name}
                ),
                spec=client.V1PodSpec(
                    containers=[
                        client.V1Container(
                            name="agent",
                            image=spec["image"],
                            ports=[client.V1ContainerPort(
                                container_port=8000
                            )],
                            env_from=[
                                client.V1EnvFromSource(
                                    config_map_ref=client.V1ConfigMapEnvSource(
                                        name=f"{name}-config"
                                    )
                                )
                            ],
                        )
                    ]
                ),
            ),
        ),
    )


def build_service(name: str, namespace: str, spec: dict):
    """Build a ClusterIP Service in front of the agent pods."""
    return client.V1Service(
        metadata=client.V1ObjectMeta(
            name=f"{name}-svc",
            namespace=namespace,
        ),
        spec=client.V1ServiceSpec(
            selector={"aiagent": name},
            ports=[client.V1ServicePort(
                port=80, target_port=8000
            )],
        ),
    )
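
The CRD accepts an autoscaling block, but the handlers above never act on it. A sketch of the missing piece might build an HPA manifest from the spec. build_hpa_manifest is a hypothetical helper name, and the 70% CPU target is an assumed default; plain dicts are used so the sketch is client-agnostic:

```python
def build_hpa_manifest(name: str, namespace: str, autoscaling: dict) -> dict:
    """Build an autoscaling/v2 HPA manifest targeting the agent's Deployment."""
    return {
        "apiVersion": "autoscaling/v2",
        "kind": "HorizontalPodAutoscaler",
        "metadata": {"name": f"{name}-hpa", "namespace": namespace},
        "spec": {
            "scaleTargetRef": {
                "apiVersion": "apps/v1",
                "kind": "Deployment",
                "name": name,  # must match the Deployment created above
            },
            "minReplicas": autoscaling.get("minReplicas", 1),
            "maxReplicas": autoscaling.get("maxReplicas", 10),
            "metrics": [{
                "type": "Resource",
                "resource": {
                    "name": "cpu",
                    # Assumed default; could be surfaced in the CRD schema instead
                    "target": {"type": "Utilization", "averageUtilization": 70},
                },
            }],
        },
    }
```

In the create handler you would then check spec.get("autoscaling", {}).get("enabled") and submit the manifest via AutoscalingV2Api (or CustomObjectsApi), adopting it like the other children.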

Handling Updates with the Reconciliation Loop

When someone changes the AIAgent spec, the Operator detects the diff and updates resources:

@kopf.on.update("ai.example.com", "v1alpha1", "aiagents")
def update_agent(spec, name, namespace, diff, patch, logger, **kwargs):
    """Reconcile when an AIAgent spec changes."""
    apps_v1 = client.AppsV1Api()
    core_v1 = client.CoreV1Api()

    # Kopf diff entries are 4-tuples: (operation, field path, old, new)
    for op, field, old_val, new_val in diff:
        logger.info(f"{op}: {field} changed from {old_val} to {new_val}")

    # Update ConfigMap (note: envFrom only picks this up when pods restart)
    configmap_patch = {
        "data": {
            "MODEL_NAME": spec.get("model", "gpt-4o"),
            "TEMPERATURE": str(spec.get("temperature", 0.7)),
            "MAX_TOKENS": str(spec.get("maxTokens", 4096)),
            "TOOLS": ",".join(spec.get("tools", [])),
        }
    }
    core_v1.patch_namespaced_config_map(
        f"{name}-config", namespace, configmap_patch
    )

    # Update Deployment replicas and image; a changed pod template
    # triggers a rolling restart, which also reloads the ConfigMap env
    deployment_patch = {
        "spec": {
            "replicas": spec.get("replicas", 1),
            "template": {
                "spec": {
                    "containers": [{
                        "name": "agent",
                        "image": spec["image"],
                    }]
                }
            }
        }
    }
    apps_v1.patch_namespaced_deployment(
        name, namespace, deployment_patch
    )

    patch.status["phase"] = "Updating"
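
The create and update handlers both spell out the ConfigMap data mapping. Factoring it into one helper (agent_config_data is a hypothetical name) keeps the two paths from drifting:

```python
def agent_config_data(spec: dict) -> dict:
    """Single source of truth for the agent's ConfigMap payload.

    ConfigMap values must be strings, so numeric spec fields are
    stringified here, in one place, for both create and update.
    """
    return {
        "MODEL_NAME": spec.get("model", "gpt-4o"),
        "TEMPERATURE": str(spec.get("temperature", 0.7)),
        "MAX_TOKENS": str(spec.get("maxTokens", 4096)),
        "TOOLS": ",".join(spec.get("tools", [])),
    }
```

Both handlers would then pass agent_config_data(spec) as the ConfigMap's data, so a new field only needs to be added once.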

Status Management

Update the custom resource status to reflect the actual state:

from datetime import datetime, timezone

@kopf.timer("ai.example.com", "v1alpha1", "aiagents", interval=30)
def monitor_agent(spec, name, namespace, patch, logger, **kwargs):
    """Periodically check agent health and update status."""
    apps_v1 = client.AppsV1Api()

    try:
        deployment = apps_v1.read_namespaced_deployment(name, namespace)
        ready = deployment.status.ready_replicas or 0
        desired = deployment.spec.replicas

        phase = "Running" if ready == desired else "Scaling"

        patch.status["readyReplicas"] = ready
        patch.status["phase"] = phase
        patch.status["lastUpdated"] = datetime.now(timezone.utc).isoformat()
    except kubernetes.client.exceptions.ApiException as e:
        patch.status["phase"] = "Error"
        logger.error(f"Failed to read deployment: {e}")
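
The CRD also declares a status.conditions array that the timer never populates. A small helper matching that schema (make_condition is a hypothetical name; the type/status/message shape comes from the CRD above) could fill it in:

```python
def make_condition(cond_type: str, ok: bool, message: str) -> dict:
    """Build one status.conditions entry in the shape the CRD schema declares."""
    return {
        "type": cond_type,
        # Kubernetes conditions use the strings "True"/"False", not booleans
        "status": "True" if ok else "False",
        "message": message,
    }
```

Inside the timer you would then set, for example, patch.status["conditions"] = [make_condition("Available", ready == desired, f"{ready}/{desired} replicas ready")].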

Using the Operator

Once deployed, managing agents becomes declarative:

# Create an agent
kubectl apply -f my-support-agent.yaml

# List all agents
kubectl get aiagents -n ai-agents

# Scale an agent (edit the spec)
kubectl patch aiagent support-agent -n ai-agents \
  --type=merge -p '{"spec": {"replicas": 5}}'

# Delete an agent (cleans up all child resources)
kubectl delete aiagent support-agent -n ai-agents

FAQ

When should I build an Operator versus using Helm charts?

Use Helm when your deployment is a one-time packaging problem — you need to template and parameterize YAML. Build an Operator when you need ongoing lifecycle management — automatic scaling adjustments, health monitoring, backup scheduling, or coordinated multi-resource updates that respond to runtime conditions. Operators encode operational knowledge that Helm charts cannot express.

How do I test a Kubernetes Operator locally?

Use kind (Kubernetes in Docker) or minikube to run a local cluster. Kopf supports running outside the cluster with kopf run operator.py, which connects to your current kubeconfig context. Write integration tests that create custom resources and assert that the expected child resources appear. Use pytest with the kubernetes client library to verify Deployment, Service, and ConfigMap creation.

What happens to child resources when the custom resource is deleted?

When you call kopf.adopt() on child resources, Kubernetes sets owner references. Deleting the parent AIAgent triggers garbage collection of all owned Deployments, Services, and ConfigMaps automatically. This prevents orphaned resources. Without adoption, you must handle cleanup manually in a @kopf.on.delete handler.
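
For intuition, here is roughly what adoption amounts to at the manifest level. This is a dict-based illustration of the owner-reference mechanics, not kopf's actual implementation (add_owner_reference is a hypothetical helper):

```python
def add_owner_reference(child: dict, owner: dict) -> dict:
    """Attach an ownerReference so Kubernetes garbage-collects the child
    when the owning resource is deleted (roughly what kopf.adopt does)."""
    ref = {
        "apiVersion": owner["apiVersion"],
        "kind": owner["kind"],
        "name": owner["metadata"]["name"],
        "uid": owner["metadata"]["uid"],  # GC matches on UID, not just name
        "controller": True,
        "blockOwnerDeletion": True,
    }
    child.setdefault("metadata", {}).setdefault("ownerReferences", []).append(ref)
    return child
```

Because garbage collection keys on the owner's UID, recreating an AIAgent with the same name does not re-claim old children; they are collected along with the original owner.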


#KubernetesOperators #CRD #AIAgents #CustomControllers #Automation #AgenticAI #LearnAI #AIEngineering
