Building an Incident Response Agent: Automated Triage, Diagnosis, and Remediation

Why Incident Response Needs an Agent

Traditional incident response relies on a human being woken at 3 AM, reading an alert, opening a runbook, copying commands, and deciding whether the fix worked. Every step adds latency and room for human error. An AI incident response agent can compress the routine part of this cycle from minutes to seconds by automating triage, diagnosis, and first-pass remediation while keeping humans in the loop for high-risk actions.

The core loop is simple: ingest the alert, classify its severity, run diagnostics, attempt a fix, escalate if needed, and document everything.

Architecture Overview

An incident response agent sits between your alerting system (PagerDuty, Opsgenie, Prometheus Alertmanager) and your infrastructure. It receives webhook payloads, enriches them with context, and decides what to do.

# alert-webhook-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: alertmanager-config
data:
  alertmanager.yml: |
    receivers:
      - name: incident-agent
        webhook_configs:
          - url: "http://incident-agent:8080/api/alerts"
            send_resolved: true
    route:
      receiver: incident-agent
      group_by: ['alertname', 'namespace']
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 4h

Building the Alert Ingestion Layer

The agent needs to normalize alerts from different sources into a common format before it can reason about them.

from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import Optional

class Severity(Enum):
    CRITICAL = "critical"
    HIGH = "high"
    MEDIUM = "medium"
    LOW = "low"

@dataclass
class NormalizedAlert:
    alert_id: str
    source: str  # "prometheus", "pagerduty", "cloudwatch"
    title: str
    description: str
    severity: Severity
    service: str
    namespace: str
    labels: dict = field(default_factory=dict)
    # Timezone-aware UTC timestamp; datetime.utcnow() is deprecated since Python 3.12.
    started_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    raw_payload: dict = field(default_factory=dict)

def normalize_prometheus_alert(payload: dict) -> NormalizedAlert:
    """Convert Prometheus Alertmanager webhook to normalized format.

    Only the first alert in the payload is normalized; a grouped webhook
    can carry several entries in payload["alerts"].
    """
    alert = payload["alerts"][0]
    labels = alert.get("labels", {})

    # Severity is a free-form label in Alertmanager; map the common values.
    severity_map = {
        "critical": Severity.CRITICAL,
        "warning": Severity.HIGH,
        "info": Severity.LOW,
    }

    return NormalizedAlert(
        alert_id=alert["fingerprint"],
        source="prometheus",
        title=labels.get("alertname", "Unknown Alert"),
        description=alert.get("annotations", {}).get("summary", ""),
        # A missing severity label is treated as "info"; an unrecognized value falls back to MEDIUM.
        severity=severity_map.get(labels.get("severity", "info"), Severity.MEDIUM),
        service=labels.get("service", labels.get("job", "unknown")),
        namespace=labels.get("namespace", "default"),
        labels=labels,
        raw_payload=payload,
    )
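
Because the route groups by alertname and namespace, one webhook can carry several alerts, and with `send_resolved: true` some of them may already be resolved. The sketch below shows one way to walk the whole `alerts` list and read each alert's `startsAt` timestamp instead of defaulting to the current time. It returns plain dicts to stay self-contained; `parse_rfc3339` and `normalize_all` are hypothetical helper names, and the fraction trimming is there because Alertmanager timestamps can carry more sub-second digits than older Python versions parse.

```python
import re
from datetime import datetime, timezone


def parse_rfc3339(ts: str) -> datetime:
    """Parse an RFC 3339 timestamp, trimming or padding the fraction to 6 digits."""
    ts = ts.replace("Z", "+00:00")
    ts = re.sub(r"\.(\d+)", lambda m: "." + m.group(1)[:6].ljust(6, "0"), ts)
    return datetime.fromisoformat(ts)


def normalize_all(payload: dict) -> list[dict]:
    """Return one normalized dict per alert in the webhook payload."""
    normalized = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        started = alert.get("startsAt")
        normalized.append({
            "alert_id": alert.get("fingerprint", ""),
            "title": labels.get("alertname", "Unknown Alert"),
            # "firing" or "resolved"; resolved alerts should not trigger fixes.
            "status": alert.get("status", "firing"),
            "started_at": (
                parse_rfc3339(started) if started else datetime.now(timezone.utc)
            ),
        })
    return normalized
```

A caller would typically filter on `status == "firing"` before handing anything to triage.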

The Triage and Diagnosis Engine

The agent uses an LLM to classify the alert and select the right diagnostic runbook. This is where the AI reasoning happens.

import openai
import json

TRIAGE_PROMPT = """You are an SRE incident triage agent. Given the alert below,
determine:
1. The likely root cause category (one of: resource_exhaustion, network,
   application_crash, certificate_expiry, disk_pressure, database, config_drift)
2. The diagnostic commands to run (return as a list)
3. Whether automated remediation is safe (true/false)
4. The escalation urgency (immediate, 15min, 1hr, next_business_day)

Alert: {alert_title}
Description: {alert_description}
Service: {service}
Severity: {severity}
Labels: {labels}

Respond in JSON with keys: root_cause_category, diagnostic_commands,
safe_to_auto_remediate, escalation_urgency, reasoning.
"""

async def triage_alert(alert: NormalizedAlert) -> dict:
    client = openai.AsyncOpenAI()
    response = await client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You are a senior SRE."},
            {"role": "user", "content": TRIAGE_PROMPT.format(
                alert_title=alert.title,
                alert_description=alert.description,
                service=alert.service,
                severity=alert.severity.value,
                labels=json.dumps(alert.labels),
            )},
        ],
        response_format={"type": "json_object"},
        temperature=0.1,
    )
    return json.loads(response.choices[0].message.content)
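
JSON mode guarantees syntactically valid JSON, not that the keys or values match what the prompt asked for. A validation step between triage and remediation is therefore worth having. The sketch below is one possible shape: `validate_triage` is a hypothetical helper, the allowed values are copied from the prompt above, and the fallback policy (escalate immediately, never auto-remediate) is an assumption.

```python
ALLOWED_CATEGORIES = {
    "resource_exhaustion", "network", "application_crash",
    "certificate_expiry", "disk_pressure", "database", "config_drift",
}
ALLOWED_URGENCY = {"immediate", "15min", "1hr", "next_business_day"}


def validate_triage(raw: dict) -> dict:
    """Return a sanitized triage dict, or a safe fallback if anything is off."""
    category = raw.get("root_cause_category")
    urgency = raw.get("escalation_urgency")
    commands = raw.get("diagnostic_commands")
    safe = raw.get("safe_to_auto_remediate")

    valid = (
        isinstance(category, str) and category in ALLOWED_CATEGORIES
        and isinstance(urgency, str) and urgency in ALLOWED_URGENCY
        and isinstance(commands, list)
        and all(isinstance(c, str) for c in commands)
        and isinstance(safe, bool)  # rejects the string "true"
    )
    if not valid:
        # Fail closed: no automation, page a human right away.
        return {
            "root_cause_category": "unknown",
            "diagnostic_commands": [],
            "safe_to_auto_remediate": False,
            "escalation_urgency": "immediate",
            "reasoning": "Triage output failed validation; escalating.",
        }
    return {
        "root_cause_category": category,
        "diagnostic_commands": commands,
        "safe_to_auto_remediate": safe,
        "escalation_urgency": urgency,
        "reasoning": str(raw.get("reasoning", "")),
    }
```

Wrapping the call as `validate_triage(await triage_alert(alert))` means downstream code can index the result without guarding against missing keys.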

Automated Remediation with Safety Gates

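The critical design principle: never auto-remediate without a safety gate. The example below gates on the triage verdict, the alert severity, and an allow-list of remediation commands. Blast-radius and time-of-day checks belong in the same place, before any command runs.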

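import asyncio
import shlex
import subprocess
from typing import Optional

# Allow-list of remediation command templates, keyed by root cause category.
SAFE_REMEDIATIONS = {
    "resource_exhaustion": "kubectl rollout restart deployment/{service} -n {namespace}",
    "disk_pressure": "kubectl exec -n {namespace} deploy/{service} -- find /tmp -mtime +7 -delete",
    # Only helps if a controller such as cert-manager reissues the secret;
    # otherwise this removes the certificate outright.
    "certificate_expiry": "kubectl delete secret {service}-tls -n {namespace}",
}

async def attempt_remediation(
    alert: NormalizedAlert,
    triage: dict,
) -> Optional[str]:
    """Run an allow-listed fix, or return None to signal escalation to a human."""
    # Treat a missing or non-boolean flag as "not safe".
    if triage.get("safe_to_auto_remediate") is not True:
        return None

    if alert.severity == Severity.CRITICAL:
        # Critical alerts always need human approval first
        return None

    category = triage.get("root_cause_category")
    template = SAFE_REMEDIATIONS.get(category) if isinstance(category, str) else None
    if not template:
        return None

    # Quote label-derived values so a crafted label cannot inject extra arguments.
    command = template.format(
        service=shlex.quote(alert.service),
        namespace=shlex.quote(alert.namespace),
    )
    # subprocess.run blocks, so run it in a worker thread to keep the event loop free.
    result = await asyncio.to_thread(
        subprocess.run,
        shlex.split(command), capture_output=True, text=True, timeout=60,
    )
    return (
        f"Executed: {command}\nExit code: {result.returncode}\n"
        f"Output: {result.stdout}\nErrors: {result.stderr}"
    )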
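
A time-of-day check can sit alongside the severity check. The sketch below is one possible gate that returns both a decision and a reason, so the reason can go straight into the incident timeline. The peak-traffic window of 09:00 to 18:00 UTC and the names `in_window` and `remediation_allowed` are assumptions made for illustration; the right window depends on your traffic pattern.

```python
from datetime import datetime, time, timezone

# Hypothetical policy: no automated changes during peak traffic (UTC).
PEAK_START = time(9, 0)
PEAK_END = time(18, 0)


def in_window(now: datetime, start: time, end: time) -> bool:
    """True if now's wall-clock time is in [start, end), wrapping past midnight."""
    current = now.time()
    if start <= end:
        return start <= current < end
    return current >= start or current < end


def remediation_allowed(
    now: datetime, severity: str, safe_flag: bool
) -> tuple[bool, str]:
    """Return (allowed, reason) so that every decision can be logged."""
    if not safe_flag:
        return False, "triage marked remediation unsafe"
    if severity == "critical":
        return False, "critical alerts need human approval"
    if in_window(now, PEAK_START, PEAK_END):
        return False, "inside peak-traffic window"
    return True, "all gates passed"
```

Passing `now` in as a parameter, rather than reading the clock inside the function, keeps the gate easy to unit test.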

Post-Incident Report Generation

After every incident, the agent generates a structured report for the team.

async def generate_postincident_report(
    alert: NormalizedAlert,
    triage: dict,
    remediation_result: Optional[str],
    timeline: list[dict],
) -> str:
    client = openai.AsyncOpenAI()
    response = await client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": f"""Generate a post-incident report:

Alert: {alert.title} ({alert.severity.value})
Service: {alert.service}
Root Cause Category: {triage['root_cause_category']}
Reasoning: {triage['reasoning']}
Auto-remediation Applied: {remediation_result or 'None (escalated to human)'}
Timeline: {json.dumps(timeline, default=str)}

Format as markdown with: Summary, Timeline, Root Cause, Remediation, Action Items."""
        }],
    )
    return response.choices[0].message.content
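
The report function expects a `timeline` list but the article does not show how it is built. One simple approach, sketched here under the assumption that each step of the loop appends an event as it happens, is a small recorder class. `IncidentTimeline` and its field names are hypothetical.

```python
import json
from datetime import datetime, timezone
from typing import Optional


class IncidentTimeline:
    """Append-only list of timestamped events for one incident."""

    def __init__(self) -> None:
        self._events: list[dict] = []

    def record(
        self, step: str, detail: str = "", at: Optional[datetime] = None
    ) -> None:
        """Add an event; `at` defaults to the current UTC time."""
        self._events.append({
            "at": at or datetime.now(timezone.utc),
            "step": step,
            "detail": detail,
        })

    def as_list(self) -> list[dict]:
        """Return a copy suitable for json.dumps(..., default=str)."""
        return [dict(event) for event in self._events]


# Usage: record each stage as it completes, then serialize for the report.
timeline = IncidentTimeline()
timeline.record("alert_received", "HighCPU on checkout")
timeline.record("triage_complete", "resource_exhaustion")
serialized = json.dumps(timeline.as_list(), default=str)
```

The `default=str` argument matches the call in the report function and turns the `datetime` values into readable strings.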

FAQ

How do I prevent the agent from causing more damage during remediation?

Implement a blast radius limiter. Track which services the agent has touched in the last hour. If it has already remediated the same service twice, force escalation to a human. Also keep all remediations behind a dry-run mode that you enable first in staging.
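
As a rough illustration of that limiter, the sketch below keeps a sliding one-hour window of remediation timestamps per service and refuses a third attempt inside the window. `BlastRadiusLimiter` is a hypothetical name, and the state lives in memory, which is an assumption: an agent that runs as several replicas or restarts often would need a shared store instead.

```python
import time
from collections import defaultdict, deque
from typing import Optional


class BlastRadiusLimiter:
    """Allow at most `max_per_service` remediations per service per window."""

    def __init__(self, max_per_service: int = 2, window_seconds: float = 3600.0):
        self.max_per_service = max_per_service
        self.window_seconds = window_seconds
        self._history: dict[str, deque] = defaultdict(deque)

    def allow(self, service: str, now: Optional[float] = None) -> bool:
        """Return True and record the attempt, or False to force escalation."""
        now = time.monotonic() if now is None else now
        history = self._history[service]
        # Drop attempts that have aged out of the window.
        while history and now - history[0] >= self.window_seconds:
            history.popleft()
        if len(history) >= self.max_per_service:
            return False
        history.append(now)
        return True


limiter = BlastRadiusLimiter()
limiter.allow("checkout", now=0.0)      # first fix: allowed
limiter.allow("checkout", now=600.0)    # second fix: allowed
limiter.allow("checkout", now=1200.0)   # third within the hour: refused
```

A refused attempt is not recorded, so the window reopens once the oldest successful attempt is an hour old.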

Should the agent handle alert storms where hundreds of alerts fire at once?

Yes, but with deduplication and grouping. Use the Alertmanager group_by configuration to batch related alerts. The agent should deduplicate by fingerprint and prioritize the highest-severity alert in each group rather than processing them individually.
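
One way to express that inside the agent is sketched below: drop repeated fingerprints, then keep only the most severe alert for each (alertname, namespace) pair, mirroring the `group_by` setting. The function name `pick_representatives` is hypothetical, and the input is assumed to be a list of dicts with the same field names as `NormalizedAlert`.

```python
# Lower rank means more severe.
SEVERITY_RANK = {"critical": 0, "high": 1, "medium": 2, "low": 3}


def _rank(alert: dict) -> int:
    """Unknown severities sort after every known one."""
    return SEVERITY_RANK.get(alert.get("severity"), len(SEVERITY_RANK))


def pick_representatives(alerts: list[dict]) -> list[dict]:
    """Dedupe by fingerprint, then keep the most severe alert per group."""
    seen_ids: set[str] = set()
    best: dict[tuple, dict] = {}
    for alert in alerts:
        if alert["alert_id"] in seen_ids:
            continue  # same fingerprint delivered again
        seen_ids.add(alert["alert_id"])
        key = (alert["title"], alert["namespace"])
        current = best.get(key)
        if current is None or _rank(alert) < _rank(current):
            best[key] = alert
    return list(best.values())
```

During a storm this reduces the number of LLM calls from one per alert to one per group, which also keeps triage cost and latency bounded.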

What monitoring should I put on the incident response agent itself?

Treat it like any critical service. Monitor its webhook endpoint latency, LLM API error rates, remediation success/failure ratios, and escalation counts. Set up a separate alert path that bypasses the agent so you get notified if the agent itself goes down.
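
As a starting point, the agent can track a few counters and a heartbeat that an independent checker reads. The sketch below keeps them in memory purely for illustration; `AgentHealth`, its field names, and the five-minute staleness threshold are assumptions, and in practice these values would be exported to your metrics system.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class AgentHealth:
    remediations_ok: int = 0
    remediations_failed: int = 0
    escalations: int = 0
    last_heartbeat: Optional[datetime] = None

    def record_remediation(self, success: bool) -> None:
        if success:
            self.remediations_ok += 1
        else:
            self.remediations_failed += 1

    def beat(self, now: datetime) -> None:
        """Called by the agent's main loop to prove it is still alive."""
        self.last_heartbeat = now

    def success_ratio(self) -> Optional[float]:
        """Fraction of remediations that succeeded, or None if none ran."""
        total = self.remediations_ok + self.remediations_failed
        return self.remediations_ok / total if total else None

    def is_stale(self, now: datetime, max_age: timedelta = timedelta(minutes=5)) -> bool:
        """True if no heartbeat was ever seen or the last one is too old."""
        return self.last_heartbeat is None or now - self.last_heartbeat > max_age
```

The staleness check is the piece that should run outside the agent, on the alert path that bypasses it.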

