
AI Agent for Infrastructure Monitoring: Anomaly Detection and Auto-Remediation

Build an AI agent that continuously ingests infrastructure metrics, detects anomalies using statistical and ML methods, and triggers automated remediation with human approval gates.

Beyond Threshold-Based Alerting

Traditional monitoring fires an alert when a metric crosses a static threshold: CPU above 90%, memory above 85%, disk above 80%. This approach generates noise. A CPU spike to 92% during a batch job at 2 AM is normal. The same spike at 2 PM, when traffic is low and no batch job explains it, is concerning. An AI monitoring agent learns what "normal" looks like for each metric at each time of day and raises alerts only when the pattern breaks.

Metrics Ingestion Pipeline

The agent pulls metrics from Prometheus using PromQL and stores them in a time-series buffer for analysis.

import httpx
import numpy as np
from datetime import datetime, timedelta
from dataclasses import dataclass

@dataclass
class MetricSeries:
    name: str
    labels: dict
    timestamps: list[float]
    values: list[float]

class PrometheusClient:
    def __init__(self, base_url: str = "http://prometheus:9090"):
        self.base_url = base_url
        self.client = httpx.AsyncClient(timeout=30)

    async def query_range(
        self, query: str, hours_back: int = 24, step: str = "5m"
    ) -> MetricSeries | None:
        end = datetime.utcnow()
        start = end - timedelta(hours=hours_back)
        resp = await self.client.get(
            f"{self.base_url}/api/v1/query_range",
            params={
                "query": query,
                "start": start.isoformat() + "Z",
                "end": end.isoformat() + "Z",
                "step": step,
            },
        )
        resp.raise_for_status()
        results = resp.json()["data"]["result"]
        if not results:  # query matched no series
            return None
        data = results[0]
        timestamps = [float(v[0]) for v in data["values"]]
        values = [float(v[1]) for v in data["values"]]
        return MetricSeries(
            name=query,
            labels=data["metric"],
            timestamps=timestamps,
            values=values,
        )

MONITORED_QUERIES = [
    'rate(container_cpu_usage_seconds_total{namespace="production"}[5m])',
    'container_memory_usage_bytes{namespace="production"}',
    'rate(http_requests_total{namespace="production"}[5m])',
    'histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))',
]
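Before wiring this into the loop, it helps to see the shape of what `/api/v1/query_range` returns. The sketch below runs the same parsing logic as `query_range` against a canned response (the field layout follows Prometheus's matrix result format; the label values are made up for illustration):

```python
# Minimal sketch of the parsing step in query_range, run against a
# canned response instead of a live Prometheus server.
sample = {
    "data": {
        "result": [
            {
                "metric": {"namespace": "production", "pod": "api-7f9c"},
                "values": [
                    [1700000000, "0.42"],
                    [1700000300, "0.45"],
                    [1700000600, "0.91"],
                ],
            }
        ]
    }
}

data = sample["data"]["result"][0]
timestamps = [float(v[0]) for v in data["values"]]
values = [float(v[1]) for v in data["values"]]  # Prometheus returns values as strings

print(timestamps)  # [1700000000.0, 1700000300.0, 1700000600.0]
print(values)      # [0.42, 0.45, 0.91]
```

Note that sample values arrive as strings, which is why both comprehensions cast to `float`.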

Anomaly Detection with Z-Score and Seasonal Decomposition

The agent combines simple statistical methods with time-aware baselines. Z-score catches sudden spikes, while a seasonal baseline (comparing each point against the same time of day in previous periods) handles expected daily or weekly patterns.

from scipy import stats
from collections import defaultdict

class AnomalyDetector:
    def __init__(self, z_threshold: float = 3.0):
        self.z_threshold = z_threshold
        self.baselines: dict[str, list[float]] = defaultdict(list)

    def detect_zscore_anomalies(self, series: MetricSeries) -> list[dict]:
        values = np.array(series.values)
        if len(values) < 10:
            return []

        z_scores = np.abs(stats.zscore(values))
        anomalies = []
        for i, z in enumerate(z_scores):
            if z > self.z_threshold:
                anomalies.append({
                    "timestamp": series.timestamps[i],
                    "value": series.values[i],
                    "z_score": float(z),
                    "method": "zscore",
                    "metric": series.name,
                })
        return anomalies

    def detect_seasonal_anomalies(
        self, series: MetricSeries, period_hours: int = 24
    ) -> list[dict]:
        """Compare current values against same time-of-day from previous periods."""
        values = np.array(series.values)
        timestamps = np.array(series.timestamps)
        samples_per_period = period_hours * 12  # 5min intervals

        if len(values) < samples_per_period * 2:
            return []

        anomalies = []
        for i in range(samples_per_period, len(values)):
            historical_idx = range(
                i % samples_per_period, i, samples_per_period
            )
            historical = values[list(historical_idx)]
            if len(historical) < 3:
                continue

            mean = np.mean(historical)
            std = np.std(historical)
            if std == 0:
                continue

            deviation = abs(values[i] - mean) / std
            if deviation > self.z_threshold:
                anomalies.append({
                    "timestamp": timestamps[i],
                    "value": float(values[i]),
                    "expected": float(mean),
                    "deviation": float(deviation),
                    "method": "seasonal",
                    "metric": series.name,
                })
        return anomalies
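To see the z-score rule in action without a live metrics feed, the illustrative snippet below runs the same scipy computation on a synthetic series: a flat baseline with one injected spike. Only the spike should clear the threshold of 3.0:

```python
# Illustrative check of the z-score rule on synthetic data:
# 50 flat points with one injected spike at index 25.
import numpy as np
from scipy import stats

values = [100.0] * 50
values[25] = 500.0  # injected spike

z_scores = np.abs(stats.zscore(np.array(values)))
flagged = [i for i, z in enumerate(z_scores) if z > 3.0]
print(flagged)  # only the spike index clears the threshold
```

The spike's z-score works out to 7.0 here, well past the threshold, while every baseline point sits near 0.14.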

Human Approval Gate for Remediation

When the agent detects an anomaly, it proposes a remediation and waits for approval on critical actions.


import asyncio
import httpx
from enum import Enum

class ApprovalStatus(Enum):
    PENDING = "pending"
    APPROVED = "approved"
    DENIED = "denied"
    AUTO_APPROVED = "auto_approved"

class ApprovalGate:
    def __init__(self, slack_webhook: str, auto_approve_severity: str = "low"):
        self.slack_webhook = slack_webhook
        self.auto_approve_severity = auto_approve_severity
        self.pending: dict[str, asyncio.Event] = {}
        self.decisions: dict[str, ApprovalStatus] = {}

    async def request_approval(
        self, action_id: str, description: str, severity: str
    ) -> ApprovalStatus:
        if severity == self.auto_approve_severity:
            return ApprovalStatus.AUTO_APPROVED

        event = asyncio.Event()
        self.pending[action_id] = event

        await self._send_slack_message(
            f"*Remediation Approval Required*\n"
            f"Action: {description}\n"
            f"Severity: {severity}\n"
            f"Reply with: /approve {action_id} or /deny {action_id}"
        )

        try:
            await asyncio.wait_for(event.wait(), timeout=300)
            return self.decisions.get(action_id, ApprovalStatus.DENIED)
        except asyncio.TimeoutError:
            return ApprovalStatus.DENIED
        finally:
            self.pending.pop(action_id, None)

    def resolve(self, action_id: str, approved: bool):
        """Called by the Slack slash-command handler to record a decision."""
        self.decisions[action_id] = (
            ApprovalStatus.APPROVED if approved else ApprovalStatus.DENIED
        )
        if action_id in self.pending:
            self.pending[action_id].set()

    async def _send_slack_message(self, text: str):
        async with httpx.AsyncClient() as client:
            await client.post(self.slack_webhook, json={"text": text})
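The approval round-trip is easier to reason about in isolation. The standalone sketch below simulates it with a minimal stand-in (the gate and the Slack handler are mocked here so the example runs on its own): one task waits on an `asyncio.Event` with a timeout, another plays the operator who approves:

```python
# Standalone sketch of the approval round-trip. One task waits for a
# decision; a fake Slack handler records it and sets the event.
import asyncio
from enum import Enum

class Status(Enum):
    APPROVED = "approved"
    DENIED = "denied"

async def main() -> Status:
    event = asyncio.Event()
    decisions: dict[str, Status] = {}

    async def wait_for_decision(action_id: str) -> Status:
        try:
            await asyncio.wait_for(event.wait(), timeout=5)
            return decisions.get(action_id, Status.DENIED)
        except asyncio.TimeoutError:
            return Status.DENIED

    async def fake_slack_handler(action_id: str):
        await asyncio.sleep(0.01)  # operator thinks it over
        decisions[action_id] = Status.APPROVED
        event.set()

    result, _ = await asyncio.gather(
        wait_for_decision("restart-api-123"),
        fake_slack_handler("restart-api-123"),
    )
    return result

print(asyncio.run(main()))  # Status.APPROVED
```

The same pattern scales to many pending actions by keeping one event per `action_id`, which is exactly what the gate's `pending` dict does.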

The Monitoring Agent Loop

Tie everything together in a continuous loop that polls metrics, runs both detectors, and requests approval on a fixed interval.

async def monitoring_loop(interval_seconds: int = 300):
    prom = PrometheusClient()
    detector = AnomalyDetector(z_threshold=3.0)
    gate = ApprovalGate(slack_webhook="https://hooks.slack.com/...")

    while True:
        for query in MONITORED_QUERIES:
            series = await prom.query_range(query, hours_back=48)
            if series is None:
                continue
            anomalies = detector.detect_zscore_anomalies(series)
            anomalies += detector.detect_seasonal_anomalies(series)

            # Only act on anomalies from the current cycle; the rest of
            # the 48h window exists to give the detectors a baseline.
            cutoff = series.timestamps[-1] - interval_seconds
            recent = [a for a in anomalies if a["timestamp"] >= cutoff]

            for anomaly in recent:
                score = anomaly.get("z_score", anomaly.get("deviation", 0))
                severity = "high" if score > 5 else "medium"
                status = await gate.request_approval(
                    action_id=f"{query}-{anomaly['timestamp']}",
                    description=f"Scale up pods for {query}",
                    severity=severity,
                )
                if status in (ApprovalStatus.APPROVED, ApprovalStatus.AUTO_APPROVED):
                    await execute_remediation(query, anomaly)

        await asyncio.sleep(interval_seconds)
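The loop calls `execute_remediation`, which is not defined above. One hypothetical implementation (an illustrative helper, not part of the article's code) maps the anomalous query to a `kubectl scale` command; it is shown here in dry-run form, building and returning the command rather than executing it:

```python
# Hypothetical sketch of the execute_remediation helper referenced in
# the loop. Dry-run by default: builds the kubectl command and returns
# it instead of running it.
def build_remediation_command(
    query: str, anomaly: dict, replicas: int = 5, dry_run: bool = True
) -> list[str]:
    # Assumed mapping from metric query to deployment; a real agent
    # would derive the target from the series labels instead.
    deployment = "api" if "http_requests" in query else "worker"
    cmd = [
        "kubectl", "scale", f"deployment/{deployment}",
        f"--replicas={replicas}", "-n", "production",
    ]
    if not dry_run:
        import subprocess
        subprocess.run(cmd, check=True)
    return cmd

cmd = build_remediation_command(
    'rate(http_requests_total{namespace="production"}[5m])',
    {"z_score": 6.2},
)
print(" ".join(cmd))
```

Keeping the command-building pure makes it trivial to log and audit exactly what the agent would have done before flipping `dry_run` off.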

FAQ

How do I avoid false positives from noisy metrics?

Use a sliding window to require multiple consecutive anomalous data points before triggering. A single spike is noise. Three consecutive 5-minute intervals above the threshold is a real problem. Also tune your z-score threshold per metric since some are naturally more variable than others.
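The consecutive-interval debounce described above can be sketched in a few lines (function name and window size are illustrative):

```python
# Sketch of the "N consecutive anomalous intervals" debounce: only
# alert when the most recent N flags are all anomalous.
def is_sustained(anomaly_flags: list[bool], n_consecutive: int = 3) -> bool:
    if len(anomaly_flags) < n_consecutive:
        return False
    return all(anomaly_flags[-n_consecutive:])

print(is_sustained([False, True, True, True]))  # True: three in a row
print(is_sustained([True, False, True, True]))  # False: the run is broken
```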

Should the agent train its own ML model or use statistical methods?

Start with statistical methods like z-score and seasonal decomposition. They are interpretable, require no training data, and work well for most infrastructure metrics. Graduate to ML models (isolation forest, LSTM autoencoders) only when you have metrics with complex non-linear patterns that statistical methods miss.
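If you do graduate to an ML method, isolation forest is the gentlest first step. The sketch below (assuming scikit-learn is installed) fits one on synthetic data with two injected outliers; with more features per row, such as hour of day or rate of change, this is where it starts to beat plain z-scores:

```python
# Sketch of the isolation-forest upgrade path on synthetic data:
# 200 normal observations plus two injected outliers.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
normal = rng.normal(loc=100, scale=5, size=(200, 1))
spikes = np.array([[180.0], [210.0]])
X = np.vstack([normal, spikes])

model = IsolationForest(contamination=0.01, random_state=42)
labels = model.fit_predict(X)  # -1 = anomaly, 1 = normal
print(np.where(labels == -1)[0])  # indices flagged as anomalous
```

Unlike the z-score detector, the forest gives no interpretable "how far off is this point" number out of the box, which is one reason to keep the statistical detectors running alongside it.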

How do I handle the cold-start problem when there is no historical data?

Fall back to static thresholds for the first 48-72 hours of data collection. Once the agent has enough history, automatically switch to anomaly detection. Log a warning when operating in cold-start mode so the team knows alerting quality may be lower.
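The cold-start fallback amounts to a mode switch on sample count. A minimal sketch, assuming the 5-minute sampling interval used throughout this article (names are illustrative):

```python
# Sketch of the cold-start fallback: use static thresholds until
# enough history has accumulated, then switch to anomaly detection.
def pick_detection_mode(n_samples: int, min_hours: int = 48) -> str:
    samples_needed = min_hours * 12  # 12 five-minute samples per hour
    if n_samples < samples_needed:
        return "static_threshold"  # log a cold-start warning here
    return "anomaly_detection"

print(pick_detection_mode(100))      # static_threshold
print(pick_detection_mode(48 * 12))  # anomaly_detection
```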


#InfrastructureMonitoring #AnomalyDetection #DevOps #SRE #Python #AgenticAI #LearnAI #AIEngineering
