
Building a Monitoring Alert Agent: Responding to Infrastructure Events Automatically

Build an AI agent that ingests monitoring alerts, classifies severity, executes runbook steps automatically, and escalates critical issues to on-call engineers.

Why Monitoring Alerts Need AI Agents

On-call engineers are drowning in alerts. The average production system generates hundreds of alerts daily, and most of them are noise — transient spikes, known issues, or low-severity warnings that resolve on their own. Engineers spend more time triaging alerts than fixing problems.

An AI monitoring agent changes this dynamic. It receives every alert from your monitoring stack (Prometheus, Datadog, PagerDuty), classifies severity using historical context, attempts automated remediation for known issues, and only escalates to humans when the problem genuinely requires human judgment. The agent acts as a first-responder that handles the routine so engineers can focus on the complex.

Alert Ingestion Endpoint

Most monitoring tools support webhook notifications. Build a single endpoint that normalizes alerts from different sources into a common format.

from fastapi import FastAPI, Request, BackgroundTasks
from pydantic import BaseModel
from datetime import datetime
from openai import AsyncOpenAI

app = FastAPI()
llm = AsyncOpenAI()


class NormalizedAlert(BaseModel):
    source: str  # "prometheus", "datadog", "pagerduty"
    alert_name: str
    severity: str  # "critical", "warning", "info"
    message: str
    labels: dict
    timestamp: datetime
    raw_payload: dict


def normalize_prometheus_alert(payload: dict) -> list[NormalizedAlert]:
    alerts = []
    for alert in payload.get("alerts", []):
        alerts.append(NormalizedAlert(
            source="prometheus",
            alert_name=alert["labels"].get("alertname", "unknown"),
            severity=alert["labels"].get("severity", "warning"),
            message=alert.get("annotations", {}).get("summary", ""),
            labels=alert.get("labels", {}),
            timestamp=datetime.fromisoformat(
                alert["startsAt"].replace("Z", "+00:00")
            ),
            raw_payload=alert,
        ))
    return alerts


@app.post("/alerts/{source}")
async def receive_alert(
    source: str, request: Request, background_tasks: BackgroundTasks
):
    payload = await request.json()

    normalizers = {
        "prometheus": normalize_prometheus_alert,
        "datadog": normalize_datadog_alert,
        "pagerduty": normalize_pagerduty_alert,
    }
    normalizer = normalizers.get(source)
    if not normalizer:
        return {"status": "unknown_source"}

    alerts = normalizer(payload)
    for alert in alerts:
        background_tasks.add_task(process_alert, alert)

    return {"status": "accepted", "alert_count": len(alerts)}
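The endpoint above also references `normalize_datadog_alert` and `normalize_pagerduty_alert`, which follow the same pattern as the Prometheus normalizer. Here is a sketch of the Datadog version. Datadog webhook payloads are template-defined, so the field names (`title`, `priority`, `body`, `tags`, `date`) are assumptions to match against your own webhook template; a dataclass stands in for the pydantic model so the sketch runs on its own.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class NormalizedAlert:  # stand-in for the pydantic model defined earlier
    source: str
    alert_name: str
    severity: str
    message: str
    labels: dict
    timestamp: datetime
    raw_payload: dict


def normalize_datadog_alert(payload: dict) -> list[NormalizedAlert]:
    # Map assumed Datadog priorities onto the common severity scale.
    severity_map = {"P1": "critical", "P2": "warning"}
    # Datadog tags arrive as "key:value" strings; split them into a dict.
    labels = dict(
        tag.split(":", 1) for tag in payload.get("tags", []) if ":" in tag
    )
    return [NormalizedAlert(
        source="datadog",
        alert_name=payload.get("title", "unknown"),
        severity=severity_map.get(payload.get("priority", ""), "info"),
        message=payload.get("body", ""),
        labels=labels,
        timestamp=datetime.fromtimestamp(
            int(payload.get("date", 0)) / 1000, tz=timezone.utc
        ),
        raw_payload=payload,
    )]
```

A PagerDuty normalizer would be structured the same way, pulling from that service's webhook schema instead.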

Severity Classification with AI

The monitoring tool's severity is a starting point, but the agent should reclassify based on broader context — time of day, affected services, and recent deployment history.

async def classify_alert_severity(alert: NormalizedAlert) -> dict:
    recent_deploys = await get_recent_deployments(hours=4)
    similar_alerts = await get_similar_recent_alerts(alert.alert_name, hours=1)
    current_hour = datetime.utcnow().hour

    prompt = f"""Classify this infrastructure alert.

Alert: {alert.alert_name}
Original Severity: {alert.severity}
Message: {alert.message}
Labels: {alert.labels}
Time: {alert.timestamp} (current hour UTC: {current_hour})
Similar alerts in last hour: {len(similar_alerts)}
Recent deployments: {[d['service'] for d in recent_deploys]}

Assess the alert and respond with:
EFFECTIVE_SEVERITY: [critical/high/medium/low/noise]
LIKELY_CAUSE: [one sentence]
IS_DEPLOYMENT_RELATED: [yes/no]
AUTO_REMEDIATION_POSSIBLE: [yes/no]
RECOMMENDED_ACTION: [description]"""

    response = await llm.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return parse_classification(response.choices[0].message.content)
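The `parse_classification` helper isn't shown above; a minimal sketch that turns the prompt's `KEY: value` response format into the lowercase dict keys (`effective_severity`, `auto_remediation_possible`, and so on) used later in the pipeline:

```python
EXPECTED_KEYS = {
    "effective_severity", "likely_cause", "is_deployment_related",
    "auto_remediation_possible", "recommended_action",
}


def parse_classification(text: str) -> dict:
    # Parse "KEY: value" lines from the model's response into a dict,
    # tolerating extra prose lines and bracketed values like "[high]".
    result = {}
    for line in text.splitlines():
        if ":" not in line:
            continue
        key, _, value = line.partition(":")
        key = key.strip().lower()
        if key in EXPECTED_KEYS:
            result[key] = value.strip().strip("[]")
    return result
```

Because LLM output is not guaranteed to follow the format, production code should also handle missing keys, for example by defaulting `effective_severity` to the alert's original severity.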

Automated Runbook Execution

For known issues with documented remediation steps, the agent can execute runbook actions automatically.


import asyncio

RUNBOOKS = {
    "HighMemoryUsage": {
        "description": "Memory usage above 90%",
        "auto_remediate": True,
        "steps": [
            {"action": "identify_process", "cmd": "ps aux --sort=-%mem | head -5"},
            # Requires root; gate behind approval in production
            {"action": "clear_cache", "cmd": "sync; echo 3 > /proc/sys/vm/drop_caches"},
            {"action": "restart_if_needed", "service": "app-server"},
        ],
    },
    "DiskSpaceLow": {
        "description": "Disk usage above 85%",
        "auto_remediate": True,
        "steps": [
            {"action": "find_large_files", "cmd": "find /var/log -size +100M -type f"},
            {"action": "rotate_logs", "cmd": "logrotate -f /etc/logrotate.conf"},
        ],
    },
}


async def execute_runbook(alert_name: str, labels: dict) -> dict:
    runbook = RUNBOOKS.get(alert_name)
    if not runbook or not runbook["auto_remediate"]:
        return {"executed": False, "reason": "No auto-remediation runbook"}

    results = []
    for step in runbook["steps"]:
        if "cmd" in step:
            proc = await asyncio.create_subprocess_shell(
                step["cmd"],
                stdout=asyncio.subprocess.PIPE,
                stderr=asyncio.subprocess.PIPE,
            )
            stdout, stderr = await proc.communicate()
            results.append({
                "action": step["action"],
                "exit_code": proc.returncode,
                "output": stdout.decode()[:500],
            })
        else:
            # Steps without a "cmd" (like restart_if_needed above) need
            # dedicated handlers; record them as skipped rather than
            # silently dropping them.
            results.append({"action": step["action"], "skipped": True})

    return {"executed": True, "steps": results}
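The `restart_if_needed` step carries a `service` key rather than a `cmd`, so it needs its own handler. A minimal sketch assuming systemd (`systemctl` is an assumption; substitute your init system, or something like `kubectl rollout restart` on Kubernetes):

```python
import asyncio
import shlex


async def restart_service(service: str) -> dict:
    # Shell out to systemd; quote the unit name so values derived from
    # alert labels can't inject extra shell syntax.
    proc = await asyncio.create_subprocess_shell(
        f"systemctl restart {shlex.quote(service)}",
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )
    stdout, stderr = await proc.communicate()
    return {
        "action": "restart_service",
        "service": service,
        "exit_code": proc.returncode,
        "output": (stdout or stderr).decode()[:500],
    }
```

`execute_runbook` can then dispatch to this in an `elif "service" in step:` branch instead of skipping the step.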

Alert Processing Pipeline

Tie everything together in a processing pipeline that classifies, attempts remediation, and escalates when necessary.

async def process_alert(alert: NormalizedAlert):
    classification = await classify_alert_severity(alert)

    if classification["effective_severity"] == "noise":
        await log_suppressed_alert(alert, classification)
        return

    runbook_result = None
    if classification.get("auto_remediation_possible") == "yes":
        runbook_result = await execute_runbook(alert.alert_name, alert.labels)

    if runbook_result and runbook_result["executed"]:
        summary = await summarize_remediation(alert, runbook_result)
        await send_slack_notification(
            channel="#ops-automated",
            message=f"Auto-remediated: {alert.alert_name}\n{summary}",
        )
        return

    if classification["effective_severity"] in ("critical", "high"):
        await escalate_to_oncall(alert, classification)
    else:
        await send_slack_notification(
            channel="#ops-alerts",
            message=format_alert_message(alert, classification),
        )


async def escalate_to_oncall(alert: NormalizedAlert, classification: dict):
    oncall = await get_current_oncall_engineer()
    context = await gather_incident_context(alert)

    prompt = f"""Write a concise incident summary for the on-call engineer.

Alert: {alert.alert_name}
Severity: {classification['effective_severity']}
Likely Cause: {classification['likely_cause']}
Context: {context}

Include: what is happening, what is affected, and suggested first steps."""

    response = await llm.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )

    await page_engineer(
        engineer=oncall,
        title=f"[{classification['effective_severity'].upper()}] {alert.alert_name}",
        body=response.choices[0].message.content,
    )

FAQ

How do I prevent alert storms from overwhelming the agent?

Implement alert grouping and rate limiting. Group alerts with the same name and similar labels into a single incident within a time window (e.g., 5 minutes). Use a token bucket or sliding window counter to cap the number of alerts processed per minute per alert type.
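A minimal sketch of both ideas combined, where the class name, window, and per-minute cap are illustrative choices:

```python
import time


class AlertGrouper:
    """Collapse repeats of the same alert into one incident per window,
    and cap how many alerts per minute each alert type may process."""

    def __init__(self, window_seconds: float = 300.0, max_per_minute: int = 30):
        self.window_seconds = window_seconds
        self.max_per_minute = max_per_minute
        self._first_seen: dict[str, float] = {}
        self._minute_counts: dict[str, tuple[int, int]] = {}

    def should_process(self, alert_name: str, labels: dict) -> bool:
        # Group by name plus sorted labels so hosts stay distinct.
        key = alert_name + "|" + ",".join(
            sorted(f"{k}={v}" for k, v in labels.items())
        )
        now = time.monotonic()
        # Grouping: only the first alert in the window opens an incident.
        first = self._first_seen.get(key)
        if first is not None and now - first < self.window_seconds:
            return False
        self._first_seen[key] = now
        # Rate limit: sliding minute counter per alert type.
        minute = int(now // 60)
        last_minute, count = self._minute_counts.get(alert_name, (minute, 0))
        if last_minute != minute:
            count = 0
        if count >= self.max_per_minute:
            return False
        self._minute_counts[alert_name] = (minute, count + 1)
        return True
```

The webhook handler would call `should_process` before queuing `process_alert`, so storms never reach the LLM at all.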

Is it safe to let an AI agent execute remediation commands?

Only for well-tested, idempotent operations with clear safety boundaries. Never give the agent root access or the ability to delete data. Use a whitelist of allowed commands, run them in isolated environments when possible, and always log every command executed. Require human approval for any action that could cause data loss.
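A sketch of such a whitelist check, where the allowed binaries and forbidden tokens are illustrative and should be tuned to your runbooks:

```python
import shlex

ALLOWED_COMMANDS = {
    "ps", "df", "du", "find", "head", "tail", "grep",
    "logrotate", "systemctl", "journalctl",
}
FORBIDDEN_TOKENS = {"rm", "dd", "mkfs", ">", ">>", "sudo"}


def is_command_allowed(cmd: str) -> bool:
    """Allow only commands where every pipeline segment starts with an
    allow-listed binary and no destructive token appears anywhere."""
    try:
        tokens = shlex.split(cmd)
    except ValueError:
        return False  # unparseable quoting: reject outright
    if any(tok in FORBIDDEN_TOKENS for tok in tokens):
        return False
    segment_start = True
    for tok in tokens:
        if segment_start:
            # Compare the basename so "/usr/bin/ps" also matches.
            if tok.split("/")[-1] not in ALLOWED_COMMANDS:
                return False
            segment_start = False
        if tok in ("|", ";", "&&", "||"):
            segment_start = True
    return True
```

Note that this check rejects the `clear_cache` step from the earlier runbook, which is the intended behavior: anything writing to `/proc` should go through human approval.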

How do I measure whether the agent is actually reducing on-call burden?

Track three metrics: mean time to acknowledge (MTTA), mean time to resolve (MTTR), and the percentage of alerts auto-resolved versus escalated. Compare these before and after deploying the agent. A well-tuned agent should reduce MTTA to near zero for auto-remediated issues and cut escalations by 40-60%.


#InfrastructureMonitoring #DevOps #AIAgents #Alerting #IncidentResponse #AgenticAI #LearnAI #AIEngineering


CallSphere Team

Expert insights on AI voice agents and customer communication automation.
