
Runbooks for AI Agent Operations: Documenting Procedures for Common Issues

Learn how to create effective operational runbooks for AI agent systems, covering runbook design principles, step-by-step troubleshooting procedures, automation opportunities, and knowledge transfer practices.

Why Runbooks Are Critical for Agent Operations

AI agent systems fail in domain-specific ways that generic operations experience cannot cover. When an agent starts hallucinating tool calls at 3 AM, the on-call engineer needs specific, tested procedures — not general troubleshooting instincts.

Runbooks bridge the gap between the team that built the agent and the team that operates it. They encode expert knowledge into repeatable procedures that any qualified operator can follow under pressure.

Runbook Design Principles

Effective runbooks are structured, testable, and maintained as code.

from dataclasses import dataclass
from typing import List, Optional
from enum import Enum

class Severity(Enum):
    SEV1 = "sev1"
    SEV2 = "sev2"
    SEV3 = "sev3"

@dataclass
class RunbookStep:
    description: str
    command: Optional[str] = None
    expected_output: Optional[str] = None
    if_unexpected: Optional[str] = None  # what to do if output differs
    automated: bool = False

@dataclass
class Runbook:
    title: str
    alert_name: str
    severity: Severity
    symptoms: List[str]
    prerequisites: List[str]
    steps: List[RunbookStep]
    escalation: str
    last_tested: str
    owner: str

    def validate(self) -> List[str]:
        """Check runbook quality."""
        issues = []
        if not self.symptoms:
            issues.append("Missing symptom descriptions")
        if not self.escalation:
            issues.append("Missing escalation path")
        for i, step in enumerate(self.steps):
            if step.command and not step.expected_output:
                issues.append(f"Step {i+1} has command but no expected output")
            if step.command and not step.if_unexpected:
                issues.append(f"Step {i+1} missing guidance for unexpected output")
        return issues

Every step with a command must document what the output should look like. Without expected outputs, operators cannot tell if a diagnostic step revealed the problem or not.

Example: Agent Stuck in Reasoning Loop

This is one of the most common AI agent failures — the agent repeatedly calls the LLM without converging on a final answer.
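A loop like this can also be caught programmatically before the alert fires. The sketch below is illustrative rather than part of the runbook: it assumes the agent emits a stream of string event tags (such as `llm_call_start` and `final_answer`, mirroring the log markers used later in this article) and flags any sliding window with too many LLM calls and no convergence.

```python
from collections import deque

def detect_reasoning_loop(events, window=50, max_llm_calls=15):
    """Return True if, within a sliding window of `window` events,
    the agent issued more than `max_llm_calls` LLM calls without
    emitting a final answer."""
    recent = deque(maxlen=window)
    for event in events:
        recent.append(event)
        if event == "final_answer":
            recent.clear()  # task converged; reset the window
        elif recent.count("llm_call_start") > max_llm_calls:
            return True
    return False

# A healthy task converges; a stuck one just keeps calling the LLM.
healthy = ["llm_call_start", "tool_call", "llm_call_start", "final_answer"]
stuck = ["llm_call_start", "tool_call"] * 20
print(detect_reasoning_loop(healthy))  # False
print(detect_reasoning_loop(stuck))    # True
```

Wiring a check like this into the agent runtime lets you cancel a runaway task proactively instead of waiting for the on-call page.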


# runbook-stuck-reasoning-loop.yaml
title: "Agent Stuck in Reasoning Loop"
alert_name: "agent_llm_calls_excessive"
severity: sev2
symptoms:
  - "Alert: agent_llm_calls per task > 15 (threshold: 10)"
  - "Agent task duration exceeds 120 seconds"
  - "LLM token consumption spiking for specific agent instance"

prerequisites:
  - "Access to agent monitoring dashboard"
  - "kubectl access to agent namespace"
  - "Access to agent log aggregation (Grafana/Loki)"

steps:
  - description: "Identify the stuck agent instance"
    command: "kubectl get pods -n agents -l app=ai-agent --sort-by=.status.startTime"
    expected_output: "List of pods with STATUS Running. Look for pods with high RESTARTS or long AGE."
    if_unexpected: "If no pods are running, escalate to Sev1 — full agent outage."

  - description: "Check agent logs for loop pattern"
    command: >
      kubectl logs -n agents <pod-name> --tail=100 |
      grep -c 'llm_call_start'
    expected_output: "Number of recent LLM calls. If > 20 in last 100 lines, confirms loop."
    if_unexpected: "If LLM calls are normal, check tool call patterns instead."

  - description: "Inspect the current task context"
    command: >
      curl -s http://<agent-pod-ip>:8080/debug/current-task |
      python3 -m json.tool
    expected_output: "JSON showing current task, conversation history, and tool calls."
    if_unexpected: "If endpoint returns 500, agent process may be deadlocked."

  - description: "Force-terminate the stuck task"
    command: "curl -X POST http://<agent-pod-ip>:8080/admin/cancel-task/<task-id>"
    expected_output: '{"status": "cancelled", "task_id": "<task-id>"}'
    if_unexpected: "If cancel fails, proceed to pod restart."

  - description: "Restart the agent pod if task cancellation failed"
    command: "kubectl delete pod -n agents <pod-name>"
    expected_output: "Pod deleted, replacement scheduled by deployment controller."
    if_unexpected: "If no replacement pod is scheduled within 2 minutes, check deployment events with 'kubectl describe deployment' and escalate."

  - description: "Verify recovery"
    command: "kubectl get pods -n agents -l app=ai-agent"
    expected_output: "All pods in Running state with 0 recent restarts."
    if_unexpected: "If replacement pods crash-loop, escalate to Sev1 and page the AI platform team."

escalation: "If loop recurs within 1 hour, escalate to AI team lead. May indicate a prompt regression or model behavior change."
owner: "ai-platform-team"
last_tested: "2026-03-01"
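A runbook file like the one above can be checked mechanically against the same rules as `Runbook.validate()` earlier. This sketch works on the plain dict you would get from a YAML parser (the document is inlined here for a self-contained example; in practice you would load the file with PyYAML or similar):

```python
def lint_runbook(doc: dict) -> list:
    """Flag quality problems in a parsed runbook document."""
    issues = []
    for key in ("symptoms", "escalation", "owner", "last_tested"):
        if not doc.get(key):
            issues.append(f"Missing required field: {key}")
    for i, step in enumerate(doc.get("steps", []), start=1):
        if step.get("command") and not step.get("expected_output"):
            issues.append(f"Step {i} has command but no expected output")
        if step.get("command") and not step.get("if_unexpected"):
            issues.append(f"Step {i} missing guidance for unexpected output")
    return issues

runbook = {
    "title": "Agent Stuck in Reasoning Loop",
    "symptoms": ["agent_llm_calls per task > 15"],
    "escalation": "Escalate to AI team lead",
    "owner": "ai-platform-team",
    "last_tested": "2026-03-01",
    "steps": [
        {"description": "Verify recovery",
         "command": "kubectl get pods -n agents -l app=ai-agent"},
    ],
}
for issue in lint_runbook(runbook):
    print(issue)
# Step 1 has command but no expected output
# Step 1 missing guidance for unexpected output
```

Running a lint pass like this in CI keeps incomplete runbooks from merging in the first place.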

Automating Runbook Steps

Many runbook steps can be partially or fully automated. The goal is not to replace the operator but to reduce time-to-resolution.

import subprocess
import json

class RunbookAutomator:
    def __init__(self, k8s_namespace: str, notifier):
        # notifier: any callable that posts a message to the team's
        # chat or paging channel, e.g. notifier("pod restarted")
        self.namespace = k8s_namespace
        self.notifier = notifier

    def diagnose_stuck_agent(self, pod_name: str) -> dict:
        """Automated diagnosis for a stuck reasoning loop."""
        diagnosis = {}

        # Step 1: Get pod status
        result = subprocess.run(
            ["kubectl", "get", "pod", pod_name, "-n", self.namespace, "-o", "json"],
            capture_output=True, text=True,
        )
        pod_info = json.loads(result.stdout)
        diagnosis["restarts"] = pod_info["status"]["containerStatuses"][0]["restartCount"]
        diagnosis["phase"] = pod_info["status"]["phase"]

        # Step 2: Count recent LLM calls from logs
        result = subprocess.run(
            ["kubectl", "logs", pod_name, "-n", self.namespace, "--tail=200"],
            capture_output=True, text=True,
        )
        llm_calls = result.stdout.count("llm_call_start")
        diagnosis["recent_llm_calls"] = llm_calls
        diagnosis["likely_stuck"] = llm_calls > 30

        # Step 3: Get current task info
        try:
            result = subprocess.run(
                ["kubectl", "exec", pod_name, "-n", self.namespace, "--",
                 "curl", "-s", "http://localhost:8080/debug/current-task"],
                capture_output=True, text=True, timeout=10,
            )
            diagnosis["current_task"] = json.loads(result.stdout)
        except (subprocess.TimeoutExpired, json.JSONDecodeError):
            diagnosis["current_task"] = "unreachable"

        return diagnosis

    def auto_remediate(self, pod_name: str, diagnosis: dict) -> str:
        if diagnosis.get("current_task") == "unreachable":
            # Process is deadlocked; restart the pod and let the
            # deployment controller reschedule it
            subprocess.run(
                ["kubectl", "delete", "pod", pod_name, "-n", self.namespace],
            )
            self.notifier(f"Auto-remediation: restarted deadlocked pod {pod_name}")
            return "pod_restarted"

        if diagnosis.get("likely_stuck"):
            # Try graceful task cancellation before anything destructive
            task_id = diagnosis["current_task"].get("task_id")
            if task_id:
                subprocess.run(
                    ["kubectl", "exec", pod_name, "-n", self.namespace, "--",
                     "curl", "-X", "POST",
                     f"http://localhost:8080/admin/cancel-task/{task_id}"],
                )
                self.notifier(
                    f"Auto-remediation: cancelled task {task_id} on {pod_name}"
                )
                return "task_cancelled"

        return "no_action_needed"

Automated remediation should always log what it did and notify the team. Silent auto-fixes hide systemic problems.
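One lightweight way to enforce this is to route every remediation through an audit wrapper. This is a sketch, independent of the automator above; `notify` and `audit_log` stand in for whatever chat integration and log sink you actually use:

```python
import json
import time

def audited_remediation(action_name, remediate, notify, audit_log):
    """Run a remediation callable, record what happened, and notify
    the team, so no auto-fix happens silently."""
    started = time.time()
    result = remediate()
    audit_log.append({
        "action": action_name,
        "result": result,
        "duration_s": round(time.time() - started, 2),
        "timestamp": started,
    })
    notify("Auto-remediation ran: "
           + json.dumps({"action": action_name, "result": result}))
    return result

# Example with stub integrations: the log and notifications capture
# exactly what the automation did.
log, messages = [], []
outcome = audited_remediation(
    "cancel_stuck_task",
    remediate=lambda: "task_cancelled",
    notify=messages.append,
    audit_log=log,
)
print(outcome)           # task_cancelled
print(log[0]["action"])  # cancel_stuck_task
```

Because the wrapper owns logging and notification, individual remediation functions cannot forget to announce themselves.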

Knowledge Transfer and Runbook Maintenance

Runbooks rot quickly if not maintained. Establish a review cadence.

# runbook-maintenance-schedule.yaml
maintenance:
  review_cadence: "monthly"
  testing_cadence: "quarterly"
  owner_rotation: true

  review_checklist:
    - "Are all commands still valid? (API endpoints, kubectl contexts)"
    - "Are expected outputs still accurate?"
    - "Has the alert threshold changed?"
    - "Have new failure modes been discovered since last review?"
    - "Are escalation contacts still current?"

  new_engineer_onboarding:
    - "Walk through each Sev1 runbook hands-on"
    - "Run a simulated incident using staging environment"
    - "Shadow an on-call shift before taking primary"

FAQ

How detailed should runbook steps be?

Detailed enough that an engineer who has never seen the system before can follow them at 3 AM while sleep-deprived. Include exact commands, expected outputs, and what to do when the output is unexpected. Avoid vague instructions like "check if the agent is working" — instead write "run this command and verify the output contains status: healthy."

Should runbooks be stored as code or in a wiki?

Store them as code in your repository, version-controlled alongside the system they describe. Wiki-based runbooks drift from reality because they are not updated during code changes. When runbooks live in the same repo, pull request reviewers can flag when a code change should trigger a runbook update.
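That reviewer habit can be made mechanical with a small CI check over a pull request's changed-file list. The `agents/` and `runbooks/` path prefixes below are assumptions; adapt them to your repository layout:

```python
def runbook_update_suggested(changed_files,
                             code_prefix="agents/",
                             runbook_prefix="runbooks/"):
    """Return True if agent code changed in this PR but no runbook did,
    prompting reviewers to ask whether a runbook needs updating."""
    code_changed = any(f.startswith(code_prefix) for f in changed_files)
    runbook_changed = any(f.startswith(runbook_prefix) for f in changed_files)
    return code_changed and not runbook_changed

# Code change with no runbook change: flag it for the reviewer.
print(runbook_update_suggested(
    ["agents/planner.py", "tests/test_planner.py"]))            # True
# Code and runbook changed together: nothing to flag.
print(runbook_update_suggested(
    ["agents/planner.py", "runbooks/stuck-loop.yaml"]))         # False
```

A check like this should warn rather than block, since plenty of code changes legitimately need no runbook update.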

How do I prioritize which runbooks to write first?

Start with the incidents that have already happened. Review your last 3 months of incidents and write runbooks for the top 5 most frequent issues. Then write runbooks for the highest-severity potential failures, even if they have not occurred yet. A Sev1 runbook you never use is better than a Sev1 incident with no runbook.


#Runbooks #AIAgents #Operations #IncidentResponse #Documentation #AgenticAI #LearnAI #AIEngineering

CallSphere Team