Kubernetes Operators for AI Agents: Custom Controllers for Agent Lifecycle Management
Build a Kubernetes Operator for AI agent lifecycle management using Custom Resource Definitions, reconciliation loops, and status management to automate agent provisioning and scaling.
What Is a Kubernetes Operator?
A Kubernetes Operator extends the Kubernetes API with custom resources and controllers that encode domain-specific operational knowledge. Instead of manually creating Deployments, Services, ConfigMaps, and HPAs for each AI agent, you define an AIAgent custom resource and let the Operator reconcile all the underlying infrastructure automatically.
This transforms agent deployment from "create six YAML files and apply them in the right order" to "declare what agent you want and let the Operator handle the rest."
Custom Resource Definition (CRD)
First, define what an AIAgent resource looks like:
# crd-aiagent.yaml
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: aiagents.ai.example.com
spec:
  group: ai.example.com
  versions:
    - name: v1alpha1
      served: true
      storage: true
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
              required: ["model", "replicas"]
              properties:
                model:
                  type: string
                  description: "LLM model to use"
                replicas:
                  type: integer
                  minimum: 1
                  maximum: 100
                temperature:
                  type: number
                  default: 0.7
                maxTokens:
                  type: integer
                  default: 4096
                image:
                  type: string
                tools:
                  type: array
                  items:
                    type: string
                autoscaling:
                  type: object
                  properties:
                    enabled:
                      type: boolean
                      default: false
                    minReplicas:
                      type: integer
                    maxReplicas:
                      type: integer
            status:
              type: object
              properties:
                phase:
                  type: string
                readyReplicas:
                  type: integer
                lastUpdated:
                  type: string
                conditions:
                  type: array
                  items:
                    type: object
                    properties:
                      type:
                        type: string
                      status:
                        type: string
                      message:
                        type: string
      subresources:
        status: {}
      additionalPrinterColumns:
        - name: Model
          type: string
          jsonPath: .spec.model
        - name: Replicas
          type: integer
          jsonPath: .spec.replicas
        - name: Phase
          type: string
          jsonPath: .status.phase
  scope: Namespaced
  names:
    plural: aiagents
    singular: aiagent
    kind: AIAgent
    shortNames:
      - aia
Apply the CRD, and you can then create AIAgent resources:
# my-support-agent.yaml
apiVersion: ai.example.com/v1alpha1
kind: AIAgent
metadata:
  name: support-agent
  namespace: ai-agents
spec:
  model: "gpt-4o"
  replicas: 3
  temperature: 0.5
  maxTokens: 2048
  image: "myregistry/support-agent:2.0.0"
  tools:
    - "knowledge-base-search"
    - "ticket-creator"
    - "calendar-lookup"
  autoscaling:
    enabled: true
    minReplicas: 2
    maxReplicas: 15
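To make the schema's validation rules concrete: the API server rejects any AIAgent whose spec violates them at admission time. A pure-Python sketch of the same checks, useful for pre-flight validation in CI (the `validate_agent_spec` helper is illustrative, not part of the Operator):

```python
def validate_agent_spec(spec: dict) -> list[str]:
    """Mirror the CRD's OpenAPI rules: required fields plus replica bounds."""
    errors = []
    for field in ("model", "replicas"):  # required: ["model", "replicas"]
        if field not in spec:
            errors.append(f"missing required field: {field}")
    replicas = spec.get("replicas")
    if isinstance(replicas, int) and not (1 <= replicas <= 100):
        errors.append("replicas must be between 1 and 100")
    return errors


# A valid spec produces no errors; an invalid one lists each violation.
assert validate_agent_spec({"model": "gpt-4o", "replicas": 3}) == []
assert validate_agent_spec({"replicas": 500}) != []
```

In the cluster itself this logic lives in the API server's schema validation, so the Operator never sees resources that violate the CRD.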
Building the Operator in Python with Kopf
Kopf is a Python framework for building Kubernetes Operators. It handles watch streams, retry logic, and status updates.
# operator.py
import kopf
import kubernetes
from kubernetes import client


@kopf.on.startup()
def configure(logger, **kwargs):
    """Load credentials: in-cluster config when deployed, kubeconfig locally."""
    try:
        kubernetes.config.load_incluster_config()
    except kubernetes.config.ConfigException:
        kubernetes.config.load_kube_config()


@kopf.on.create("ai.example.com", "v1alpha1", "aiagents")
async def create_agent(spec, name, namespace, logger, **kwargs):
    """Reconcile when a new AIAgent is created."""
    logger.info(f"Creating AI agent: {name}")
    apps_v1 = client.AppsV1Api()
    core_v1 = client.CoreV1Api()

    # Create ConfigMap with agent settings
    configmap = client.V1ConfigMap(
        metadata=client.V1ObjectMeta(
            name=f"{name}-config",
            namespace=namespace,
        ),
        data={
            "MODEL_NAME": spec.get("model", "gpt-4o"),
            "TEMPERATURE": str(spec.get("temperature", 0.7)),
            "MAX_TOKENS": str(spec.get("maxTokens", 4096)),
            "TOOLS": ",".join(spec.get("tools", [])),
        },
    )
    kopf.adopt(configmap)  # set owner reference for garbage collection
    core_v1.create_namespaced_config_map(namespace, configmap)

    # Create Deployment
    deployment = build_deployment(name, namespace, spec)
    kopf.adopt(deployment)
    apps_v1.create_namespaced_deployment(namespace, deployment)

    # Create Service
    service = build_service(name, namespace, spec)
    kopf.adopt(service)
    core_v1.create_namespaced_service(namespace, service)

    return {"phase": "Running", "readyReplicas": 0}
def build_deployment(name: str, namespace: str, spec: dict):
    """Build a Deployment object from an AIAgent spec."""
    return client.V1Deployment(
        metadata=client.V1ObjectMeta(
            name=name,
            namespace=namespace,
        ),
        spec=client.V1DeploymentSpec(
            replicas=spec.get("replicas", 1),
            selector=client.V1LabelSelector(
                match_labels={"aiagent": name}
            ),
            template=client.V1PodTemplateSpec(
                metadata=client.V1ObjectMeta(
                    labels={"aiagent": name}
                ),
                spec=client.V1PodSpec(
                    containers=[
                        client.V1Container(
                            name="agent",
                            image=spec["image"],
                            ports=[client.V1ContainerPort(
                                container_port=8000
                            )],
                            env_from=[
                                client.V1EnvFromSource(
                                    config_map_ref=client.V1ConfigMapEnvSource(
                                        name=f"{name}-config"
                                    )
                                )
                            ],
                        )
                    ]
                ),
            ),
        ),
    )
def build_service(name: str, namespace: str, spec: dict):
    """Build a Service that fronts the agent pods."""
    return client.V1Service(
        metadata=client.V1ObjectMeta(
            name=f"{name}-svc",
            namespace=namespace,
        ),
        spec=client.V1ServiceSpec(
            selector={"aiagent": name},
            ports=[client.V1ServicePort(
                port=80, target_port=8000
            )],
        ),
    )
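The CRD also defines an autoscaling block, which the create handler could translate into a HorizontalPodAutoscaler alongside the Deployment and Service. A minimal sketch, assuming the autoscaling/v2 API and a hypothetical `build_hpa_manifest` helper; the 80% CPU target and the fallback bounds are illustrative defaults, not values from the article:

```python
def build_hpa_manifest(name: str, namespace: str, autoscaling: dict) -> dict:
    """Translate spec.autoscaling into an autoscaling/v2 HPA manifest."""
    return {
        "apiVersion": "autoscaling/v2",
        "kind": "HorizontalPodAutoscaler",
        "metadata": {"name": f"{name}-hpa", "namespace": namespace},
        "spec": {
            "scaleTargetRef": {
                "apiVersion": "apps/v1",
                "kind": "Deployment",
                "name": name,  # scale the Deployment the Operator created
            },
            "minReplicas": autoscaling.get("minReplicas", 1),
            "maxReplicas": autoscaling.get("maxReplicas", 10),
            "metrics": [{
                "type": "Resource",
                "resource": {
                    "name": "cpu",
                    "target": {"type": "Utilization", "averageUtilization": 80},
                },
            }],
        },
    }


hpa = build_hpa_manifest("support-agent", "ai-agents",
                         {"enabled": True, "minReplicas": 2, "maxReplicas": 15})
```

Building the manifest as a plain dict keeps it usable with either the typed client or the dynamic client; in the create handler you would only build it when `spec.autoscaling.enabled` is true, and adopt it like the other child resources.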
Handling Updates with the Reconciliation Loop
When someone changes the AIAgent spec, the Operator detects the diff and updates resources:
@kopf.on.update("ai.example.com", "v1alpha1", "aiagents")
async def update_agent(spec, name, namespace, diff, logger, **kwargs):
    """Reconcile when an AIAgent spec changes."""
    apps_v1 = client.AppsV1Api()
    core_v1 = client.CoreV1Api()

    # Kopf diff entries are (operation, field_path, old, new) tuples
    for op, field, old_val, new_val in diff:
        logger.info(f"Field changed ({op}): {'.'.join(field)} "
                    f"from {old_val} to {new_val}")

    # Update ConfigMap
    configmap_patch = {
        "data": {
            "MODEL_NAME": spec.get("model", "gpt-4o"),
            "TEMPERATURE": str(spec.get("temperature", 0.7)),
            "MAX_TOKENS": str(spec.get("maxTokens", 4096)),
        }
    }
    core_v1.patch_namespaced_config_map(
        f"{name}-config", namespace, configmap_patch
    )

    # Update Deployment replicas and image
    deployment_patch = {
        "spec": {
            "replicas": spec.get("replicas", 1),
            "template": {
                "spec": {
                    "containers": [{
                        "name": "agent",
                        "image": spec["image"],
                    }]
                }
            }
        }
    }
    apps_v1.patch_namespaced_deployment(
        name, namespace, deployment_patch
    )

    return {"phase": "Updating"}
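One caveat with this handler: the pods consume the ConfigMap through envFrom, so patching the ConfigMap alone does not restart running pods or refresh their environment. A common workaround is to stamp a hash of the config into the pod template, so any config change alters the template and triggers a rollout. A sketch, where `config_checksum` and `rollout_patch` are hypothetical helpers and the annotation key is an arbitrary choice:

```python
import hashlib
import json


def config_checksum(config_data: dict) -> str:
    """Stable short hash of the agent's ConfigMap data."""
    canonical = json.dumps(config_data, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()[:16]


def rollout_patch(config_data: dict) -> dict:
    """Deployment patch that changes whenever the config data changes,
    forcing the Deployment controller to roll the pods."""
    return {
        "spec": {
            "template": {
                "metadata": {
                    "annotations": {
                        "ai.example.com/config-checksum": config_checksum(config_data)
                    }
                }
            }
        }
    }
```

Applying `rollout_patch(...)` together with the ConfigMap patch in the update handler keeps the running pods' environment in sync with the spec.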
Status Management
Update the custom resource status to reflect the actual state:
from datetime import datetime, timezone


@kopf.timer("ai.example.com", "v1alpha1", "aiagents", interval=30)
async def monitor_agent(spec, name, namespace, patch, logger, **kwargs):
    """Periodically check agent health and update status."""
    apps_v1 = client.AppsV1Api()
    try:
        deployment = apps_v1.read_namespaced_deployment(name, namespace)
        ready = deployment.status.ready_replicas or 0
        desired = deployment.spec.replicas
        phase = "Running" if ready == desired else "Scaling"

        patch.status["readyReplicas"] = ready
        patch.status["phase"] = phase
        patch.status["lastUpdated"] = datetime.now(timezone.utc).isoformat()
    except kubernetes.client.exceptions.ApiException as e:
        patch.status["phase"] = "Error"
        logger.error(f"Failed to read deployment: {e}")
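The phase calculation inside the timer is easy to factor into a pure function, which keeps the status logic unit-testable without a cluster. A sketch (`compute_status` is an illustrative helper, not part of Kopf; it also fills the conditions array the CRD's status schema defines):

```python
def compute_status(ready: int, desired: int) -> dict:
    """Pure status calculation mirroring the timer handler above."""
    phase = "Running" if ready == desired else "Scaling"
    return {
        "phase": phase,
        "readyReplicas": ready,
        "conditions": [{
            "type": "Ready",
            "status": "True" if phase == "Running" else "False",
            "message": f"{ready}/{desired} replicas ready",
        }],
    }


assert compute_status(3, 3)["phase"] == "Running"
assert compute_status(1, 3)["phase"] == "Scaling"
```

The timer then becomes a thin wrapper: read the Deployment, call `compute_status`, and copy the result into `patch.status`.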
Using the Operator
Once deployed, managing agents becomes declarative:
# Create an agent
kubectl apply -f my-support-agent.yaml
# List all agents
kubectl get aiagents -n ai-agents
# Scale an agent (edit the spec)
kubectl patch aiagent support-agent -n ai-agents \
--type=merge -p '{"spec": {"replicas": 5}}'
# Delete an agent (cleans up all child resources)
kubectl delete aiagent support-agent -n ai-agents
FAQ
When should I build an Operator versus using Helm charts?
Use Helm when your deployment is a one-time packaging problem — you need to template and parameterize YAML. Build an Operator when you need ongoing lifecycle management — automatic scaling adjustments, health monitoring, backup scheduling, or coordinated multi-resource updates that respond to runtime conditions. Operators encode operational knowledge that Helm charts cannot express.
How do I test a Kubernetes Operator locally?
Use kind (Kubernetes in Docker) or minikube to run a local cluster. Kopf supports running outside the cluster with kopf run operator.py which connects to your kubeconfig context. Write integration tests that create custom resources and assert the expected child resources appear. Use pytest with the kubernetes client library to verify Deployment, Service, and ConfigMap creation.
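A unit-level version of that pattern needs no cluster at all: extract the spec-to-resource mapping into a pure function and assert on its output. A sketch using plain asserts, which pytest would collect directly (`build_configmap_data` is a hypothetical helper mirroring the create handler's ConfigMap mapping):

```python
def build_configmap_data(spec: dict) -> dict:
    """Same spec-to-env mapping the create handler uses for the ConfigMap."""
    return {
        "MODEL_NAME": spec.get("model", "gpt-4o"),
        "TEMPERATURE": str(spec.get("temperature", 0.7)),
        "MAX_TOKENS": str(spec.get("maxTokens", 4096)),
        "TOOLS": ",".join(spec.get("tools", [])),
    }


def test_configmap_data_applies_defaults():
    # Omitted fields fall back to the CRD's documented defaults
    data = build_configmap_data({"model": "gpt-4o", "replicas": 3})
    assert data["TEMPERATURE"] == "0.7"
    assert data["MAX_TOKENS"] == "4096"
    assert data["TOOLS"] == ""


test_configmap_data_applies_defaults()
```

Integration tests against kind then only need to cover the wiring, not the mapping logic.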
What happens to child resources when the custom resource is deleted?
When you call kopf.adopt() on child resources, Kubernetes sets owner references. Deleting the parent AIAgent triggers garbage collection of all owned Deployments, Services, and ConfigMaps automatically. This prevents orphaned resources. Without adoption, you must handle cleanup manually in a @kopf.on.delete handler.
#KubernetesOperators #CRD #AIAgents #CustomControllers #Automation #AgenticAI #LearnAI #AIEngineering
CallSphere Team
Expert insights on AI voice agents and customer communication automation.