
AI Agent for Infrastructure Monitoring: Anomaly Detection and Auto-Remediation

Build an AI agent that continuously ingests infrastructure metrics, detects anomalies using statistical and ML methods, and triggers automated remediation with human approval gates.

Beyond Threshold-Based Alerting

Traditional monitoring fires an alert when a metric crosses a static threshold: CPU above 90%, memory above 85%, disk above 80%. This approach generates noise. A CPU spike to 92% during a batch job at 2 AM is normal. The same spike at 2 PM, when traffic is low and no batch job explains it, is concerning. An AI monitoring agent learns what "normal" looks like for each metric at each time of day and raises alerts only when the pattern breaks.

Metrics Ingestion Pipeline

The agent pulls metrics from Prometheus using PromQL and stores them in a time-series buffer for analysis.

import httpx
import numpy as np
from datetime import datetime, timedelta
from dataclasses import dataclass

@dataclass
class MetricSeries:
    name: str
    labels: dict
    timestamps: list[float]
    values: list[float]

class PrometheusClient:
    def __init__(self, base_url: str = "http://prometheus:9090"):
        self.base_url = base_url
        self.client = httpx.AsyncClient(timeout=30)

    async def query_range(
        self, query: str, hours_back: int = 24, step: str = "5m"
    ) -> MetricSeries | None:
        end = datetime.utcnow()
        start = end - timedelta(hours=hours_back)
        resp = await self.client.get(
            f"{self.base_url}/api/v1/query_range",
            params={
                "query": query,
                "start": start.isoformat() + "Z",
                "end": end.isoformat() + "Z",
                "step": step,
            },
        )
        resp.raise_for_status()
        results = resp.json()["data"]["result"]
        if not results:  # query matched no series
            return None
        data = results[0]
        timestamps = [float(v[0]) for v in data["values"]]
        values = [float(v[1]) for v in data["values"]]
        return MetricSeries(
            name=query,
            labels=data["metric"],
            timestamps=timestamps,
            values=values,
        )

MONITORED_QUERIES = [
    'rate(container_cpu_usage_seconds_total{namespace="production"}[5m])',
    'container_memory_usage_bytes{namespace="production"}',
    'rate(http_requests_total{namespace="production"}[5m])',
    'histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))',
]
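Before wiring this into the loop, it helps to see the shape of what `/api/v1/query_range` returns. The sketch below runs the same parsing logic as `query_range` against a canned response (the field layout follows Prometheus's matrix result format; the label values are made up for illustration):

```python
# Minimal sketch of the parsing step in query_range, run against a
# canned response instead of a live Prometheus server.
sample = {
    "data": {
        "result": [
            {
                "metric": {"namespace": "production", "pod": "api-7f9c"},
                "values": [
                    [1700000000, "0.42"],
                    [1700000300, "0.45"],
                    [1700000600, "0.91"],
                ],
            }
        ]
    }
}

data = sample["data"]["result"][0]
timestamps = [float(v[0]) for v in data["values"]]
values = [float(v[1]) for v in data["values"]]  # Prometheus returns values as strings

print(timestamps)  # [1700000000.0, 1700000300.0, 1700000600.0]
print(values)      # [0.42, 0.45, 0.91]
```

Note that sample values arrive as strings, which is why both comprehensions cast to `float`.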

Anomaly Detection with Z-Score and Seasonal Decomposition

The agent combines simple statistical methods with time-aware baselines. Z-score catches sudden spikes, while a seasonal baseline (comparing each point against the same time of day in previous periods) handles expected daily or weekly patterns.

from scipy import stats
from collections import defaultdict

class AnomalyDetector:
    def __init__(self, z_threshold: float = 3.0):
        self.z_threshold = z_threshold
        self.baselines: dict[str, list[float]] = defaultdict(list)

    def detect_zscore_anomalies(self, series: MetricSeries) -> list[dict]:
        values = np.array(series.values)
        if len(values) < 10:
            return []

        z_scores = np.abs(stats.zscore(values))
        anomalies = []
        for i, z in enumerate(z_scores):
            if z > self.z_threshold:
                anomalies.append({
                    "timestamp": series.timestamps[i],
                    "value": series.values[i],
                    "z_score": float(z),
                    "method": "zscore",
                    "metric": series.name,
                })
        return anomalies

    def detect_seasonal_anomalies(
        self, series: MetricSeries, period_hours: int = 24
    ) -> list[dict]:
        """Compare current values against same time-of-day from previous periods."""
        values = np.array(series.values)
        timestamps = np.array(series.timestamps)
        samples_per_period = period_hours * 12  # 5min intervals

        if len(values) < samples_per_period * 2:
            return []

        anomalies = []
        for i in range(samples_per_period, len(values)):
            historical_idx = range(
                i % samples_per_period, i, samples_per_period
            )
            historical = values[list(historical_idx)]
            if len(historical) < 3:
                continue

            mean = np.mean(historical)
            std = np.std(historical)
            if std == 0:
                continue

            deviation = abs(values[i] - mean) / std
            if deviation > self.z_threshold:
                anomalies.append({
                    "timestamp": timestamps[i],
                    "value": float(values[i]),
                    "expected": float(mean),
                    "deviation": float(deviation),
                    "method": "seasonal",
                    "metric": series.name,
                })
        return anomalies
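To see the z-score rule in action without a live metrics feed, the illustrative snippet below runs the same scipy computation on a synthetic series: a flat baseline with one injected spike. Only the spike should clear the threshold of 3.0:

```python
# Illustrative check of the z-score rule on synthetic data:
# 50 flat points with one injected spike at index 25.
import numpy as np
from scipy import stats

values = [100.0] * 50
values[25] = 500.0  # injected spike

z_scores = np.abs(stats.zscore(np.array(values)))
flagged = [i for i, z in enumerate(z_scores) if z > 3.0]
print(flagged)  # only the spike index clears the threshold
```

The spike's z-score works out to 7.0 here, well past the threshold, while every baseline point sits near 0.14.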

Human Approval Gate for Remediation

When the agent detects an anomaly, it proposes a remediation and waits for approval on critical actions.


import asyncio
import httpx
from enum import Enum

class ApprovalStatus(Enum):
    PENDING = "pending"
    APPROVED = "approved"
    DENIED = "denied"
    AUTO_APPROVED = "auto_approved"

class ApprovalGate:
    def __init__(self, slack_webhook: str, auto_approve_severity: str = "low"):
        self.slack_webhook = slack_webhook
        self.auto_approve_severity = auto_approve_severity
        self.pending: dict[str, asyncio.Event] = {}
        self.decisions: dict[str, ApprovalStatus] = {}

    async def request_approval(
        self, action_id: str, description: str, severity: str
    ) -> ApprovalStatus:
        if severity == self.auto_approve_severity:
            return ApprovalStatus.AUTO_APPROVED

        event = asyncio.Event()
        self.pending[action_id] = event

        await self._send_slack_message(
            f"*Remediation Approval Required*\n"
            f"Action: {description}\n"
            f"Severity: {severity}\n"
            f"Reply with: /approve {action_id} or /deny {action_id}"
        )

        try:
            await asyncio.wait_for(event.wait(), timeout=300)
            return self.decisions.get(action_id, ApprovalStatus.DENIED)
        except asyncio.TimeoutError:
            return ApprovalStatus.DENIED
        finally:
            self.pending.pop(action_id, None)

    def resolve(self, action_id: str, approved: bool):
        """Called by the Slack slash-command handler to record a decision."""
        self.decisions[action_id] = (
            ApprovalStatus.APPROVED if approved else ApprovalStatus.DENIED
        )
        if action_id in self.pending:
            self.pending[action_id].set()

    async def _send_slack_message(self, text: str):
        async with httpx.AsyncClient() as client:
            await client.post(self.slack_webhook, json={"text": text})
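The approval round-trip is easier to reason about in isolation. The standalone sketch below simulates it with a minimal stand-in (the gate and the Slack handler are mocked here so the example runs on its own): one task waits on an `asyncio.Event` with a timeout, another plays the operator who approves:

```python
# Standalone sketch of the approval round-trip. One task waits for a
# decision; a fake Slack handler records it and sets the event.
import asyncio
from enum import Enum

class Status(Enum):
    APPROVED = "approved"
    DENIED = "denied"

async def main() -> Status:
    event = asyncio.Event()
    decisions: dict[str, Status] = {}

    async def wait_for_decision(action_id: str) -> Status:
        try:
            await asyncio.wait_for(event.wait(), timeout=5)
            return decisions.get(action_id, Status.DENIED)
        except asyncio.TimeoutError:
            return Status.DENIED

    async def fake_slack_handler(action_id: str):
        await asyncio.sleep(0.01)  # operator thinks it over
        decisions[action_id] = Status.APPROVED
        event.set()

    result, _ = await asyncio.gather(
        wait_for_decision("restart-api-123"),
        fake_slack_handler("restart-api-123"),
    )
    return result

print(asyncio.run(main()))  # Status.APPROVED
```

The same pattern scales to many pending actions by keeping one event per `action_id`, which is exactly what the gate's `pending` dict does.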

The Monitoring Agent Loop

Tie everything together in a continuous loop that polls metrics, runs both detectors, and requests approval on a fixed interval.

async def monitoring_loop(interval_seconds: int = 300):
    prom = PrometheusClient()
    detector = AnomalyDetector(z_threshold=3.0)
    gate = ApprovalGate(slack_webhook="https://hooks.slack.com/...")

    while True:
        for query in MONITORED_QUERIES:
            series = await prom.query_range(query, hours_back=48)
            if series is None:
                continue
            anomalies = detector.detect_zscore_anomalies(series)
            anomalies += detector.detect_seasonal_anomalies(series)

            # Only act on anomalies from the current cycle; the rest of
            # the 48h window exists to give the detectors a baseline.
            cutoff = series.timestamps[-1] - interval_seconds
            recent = [a for a in anomalies if a["timestamp"] >= cutoff]

            for anomaly in recent:
                score = anomaly.get("z_score", anomaly.get("deviation", 0))
                severity = "high" if score > 5 else "medium"
                status = await gate.request_approval(
                    action_id=f"{query}-{anomaly['timestamp']}",
                    description=f"Scale up pods for {query}",
                    severity=severity,
                )
                if status in (ApprovalStatus.APPROVED, ApprovalStatus.AUTO_APPROVED):
                    await execute_remediation(query, anomaly)

        await asyncio.sleep(interval_seconds)
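The loop calls `execute_remediation`, which is not defined above. One hypothetical implementation (an illustrative helper, not part of the article's code) maps the anomalous query to a `kubectl scale` command; it is shown here in dry-run form, building and returning the command rather than executing it:

```python
# Hypothetical sketch of the execute_remediation helper referenced in
# the loop. Dry-run by default: builds the kubectl command and returns
# it instead of running it.
def build_remediation_command(
    query: str, anomaly: dict, replicas: int = 5, dry_run: bool = True
) -> list[str]:
    # Assumed mapping from metric query to deployment; a real agent
    # would derive the target from the series labels instead.
    deployment = "api" if "http_requests" in query else "worker"
    cmd = [
        "kubectl", "scale", f"deployment/{deployment}",
        f"--replicas={replicas}", "-n", "production",
    ]
    if not dry_run:
        import subprocess
        subprocess.run(cmd, check=True)
    return cmd

cmd = build_remediation_command(
    'rate(http_requests_total{namespace="production"}[5m])',
    {"z_score": 6.2},
)
print(" ".join(cmd))
```

Keeping the command-building pure makes it trivial to log and audit exactly what the agent would have done before flipping `dry_run` off.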

FAQ

How do I avoid false positives from noisy metrics?

Use a sliding window to require multiple consecutive anomalous data points before triggering. A single spike is noise. Three consecutive 5-minute intervals above the threshold is a real problem. Also tune your z-score threshold per metric since some are naturally more variable than others.
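The consecutive-interval debounce described above can be sketched in a few lines (function name and window size are illustrative):

```python
# Sketch of the "N consecutive anomalous intervals" debounce: only
# alert when the most recent N flags are all anomalous.
def is_sustained(anomaly_flags: list[bool], n_consecutive: int = 3) -> bool:
    if len(anomaly_flags) < n_consecutive:
        return False
    return all(anomaly_flags[-n_consecutive:])

print(is_sustained([False, True, True, True]))  # True: three in a row
print(is_sustained([True, False, True, True]))  # False: the run is broken
```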

Should the agent train its own ML model or use statistical methods?

Start with statistical methods like z-score and seasonal decomposition. They are interpretable, require no training data, and work well for most infrastructure metrics. Graduate to ML models (isolation forest, LSTM autoencoders) only when you have metrics with complex non-linear patterns that statistical methods miss.
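If you do graduate to an ML method, isolation forest is the gentlest first step. The sketch below (assuming scikit-learn is installed) fits one on synthetic data with two injected outliers; with more features per row, such as hour of day or rate of change, this is where it starts to beat plain z-scores:

```python
# Sketch of the isolation-forest upgrade path on synthetic data:
# 200 normal observations plus two injected outliers.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
normal = rng.normal(loc=100, scale=5, size=(200, 1))
spikes = np.array([[180.0], [210.0]])
X = np.vstack([normal, spikes])

model = IsolationForest(contamination=0.01, random_state=42)
labels = model.fit_predict(X)  # -1 = anomaly, 1 = normal
print(np.where(labels == -1)[0])  # indices flagged as anomalous
```

Unlike the z-score detector, the forest gives no interpretable "how far off is this point" number out of the box, which is one reason to keep the statistical detectors running alongside it.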

How do I handle the cold-start problem when there is no historical data?

Fall back to static thresholds for the first 48-72 hours of data collection. Once the agent has enough history, automatically switch to anomaly detection. Log a warning when operating in cold-start mode so the team knows alerting quality may be lower.
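The cold-start fallback amounts to a mode switch on sample count. A minimal sketch, assuming the 5-minute sampling interval used throughout this article (names are illustrative):

```python
# Sketch of the cold-start fallback: use static thresholds until
# enough history has accumulated, then switch to anomaly detection.
def pick_detection_mode(n_samples: int, min_hours: int = 48) -> str:
    samples_needed = min_hours * 12  # 12 five-minute samples per hour
    if n_samples < samples_needed:
        return "static_threshold"  # log a cold-start warning here
    return "anomaly_detection"

print(pick_detection_mode(100))      # static_threshold
print(pick_detection_mode(48 * 12))  # anomaly_detection
```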


#InfrastructureMonitoring #AnomalyDetection #DevOps #SRE #Python #AgenticAI #LearnAI #AIEngineering
