AI Agent for Infrastructure Monitoring: Anomaly Detection and Auto-Remediation
Build an AI agent that continuously ingests infrastructure metrics, detects anomalies using statistical and ML methods, and triggers automated remediation with human approval gates.
Beyond Threshold-Based Alerting
Traditional monitoring fires an alert when a metric crosses a static threshold: CPU above 90%, memory above 85%, disk above 80%. This approach generates noise. A CPU spike to 92% during a batch job at 2 AM is normal; the same spike at 2 PM, with no batch job running and traffic low, is worth waking someone up for. An AI monitoring agent learns what "normal" looks like for each metric at each time of day and raises alerts only when the pattern breaks.
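To make the idea concrete, here is a minimal sketch with made-up numbers: the same 92% reading is measured against each hour's own historical baseline rather than one global threshold, so it scores as routine at 2 AM and as a large deviation at 2 PM.

```python
import statistics

# Hypothetical CPU readings (%) observed at the same hour on previous days.
baseline_2am = [88, 91, 90, 89, 92]   # nightly batch job: high CPU is normal
baseline_2pm = [35, 40, 38, 36, 41]   # quiet afternoon: high CPU is not

def deviation(value: float, history: list[float]) -> float:
    """How many standard deviations the value sits from its baseline."""
    mean = statistics.mean(history)
    std = statistics.stdev(history)
    return abs(value - mean) / std

print(round(deviation(92, baseline_2am), 1))  # within the nightly pattern
print(round(deviation(92, baseline_2pm), 1))  # far outside the afternoon pattern
```

A static 90% threshold would fire in both cases; the baseline comparison fires only in the second.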
Metrics Ingestion Pipeline
The agent pulls metrics from Prometheus using PromQL and stores them in a time-series buffer for analysis.
```python
import httpx
import numpy as np
from datetime import datetime, timedelta, timezone
from dataclasses import dataclass


@dataclass
class MetricSeries:
    name: str
    labels: dict
    timestamps: list[float]
    values: list[float]


class PrometheusClient:
    def __init__(self, base_url: str = "http://prometheus:9090"):
        self.base_url = base_url
        self.client = httpx.AsyncClient(timeout=30)

    async def query_range(
        self, query: str, hours_back: int = 24, step: str = "5m"
    ) -> MetricSeries:
        end = datetime.now(timezone.utc)
        start = end - timedelta(hours=hours_back)
        resp = await self.client.get(
            f"{self.base_url}/api/v1/query_range",
            params={
                "query": query,
                # Prometheus accepts unix timestamps for start/end.
                "start": start.timestamp(),
                "end": end.timestamp(),
                "step": step,
            },
        )
        resp.raise_for_status()
        results = resp.json()["data"]["result"]
        if not results:
            raise ValueError(f"Query returned no series: {query}")
        data = results[0]
        timestamps = [float(v[0]) for v in data["values"]]
        values = [float(v[1]) for v in data["values"]]
        return MetricSeries(
            name=query,
            labels=data["metric"],
            timestamps=timestamps,
            values=values,
        )


MONITORED_QUERIES = [
    'rate(container_cpu_usage_seconds_total{namespace="production"}[5m])',
    'container_memory_usage_bytes{namespace="production"}',
    'rate(http_requests_total{namespace="production"}[5m])',
    'histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))',
]
```
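For reference, the `/api/v1/query_range` response shape the client parses looks like this (values fabricated): each series carries a `metric` label set and a `values` matrix where column 0 is the unix timestamp and column 1 is the sample encoded as a string.

```python
# Shape of a Prometheus /api/v1/query_range response (values fabricated).
sample_response = {
    "status": "success",
    "data": {
        "resultType": "matrix",
        "result": [
            {
                "metric": {"namespace": "production", "pod": "api-7f9c"},
                "values": [
                    [1700000000, "0.42"],
                    [1700000300, "0.45"],
                    [1700000600, "0.91"],
                ],
            }
        ],
    },
}

# The same parsing query_range performs: timestamps to float, string
# sample values to float.
data = sample_response["data"]["result"][0]
timestamps = [float(v[0]) for v in data["values"]]
values = [float(v[1]) for v in data["values"]]
print(timestamps[0], values[-1])
```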
Anomaly Detection with Z-Score and Seasonal Decomposition
The agent combines simple statistical methods with time-aware baselines. Z-score catches sudden spikes, while the seasonal check compares each point against the same time of day in previous periods, so expected daily or weekly patterns don't trigger alerts.
```python
from scipy import stats
from collections import defaultdict


class AnomalyDetector:
    def __init__(self, z_threshold: float = 3.0):
        self.z_threshold = z_threshold
        self.baselines: dict[str, list[float]] = defaultdict(list)

    def detect_zscore_anomalies(self, series: MetricSeries) -> list[dict]:
        values = np.array(series.values)
        if len(values) < 10:
            return []  # too few samples for a meaningful z-score
        z_scores = np.abs(stats.zscore(values))
        anomalies = []
        for i, z in enumerate(z_scores):
            if z > self.z_threshold:
                anomalies.append({
                    "timestamp": series.timestamps[i],
                    "value": series.values[i],
                    "z_score": float(z),
                    "method": "zscore",
                    "metric": series.name,
                })
        return anomalies

    def detect_seasonal_anomalies(
        self, series: MetricSeries, period_hours: int = 24
    ) -> list[dict]:
        """Compare current values against same time-of-day from previous periods."""
        values = np.array(series.values)
        timestamps = np.array(series.timestamps)
        samples_per_period = period_hours * 12  # 5-minute intervals
        if len(values) < samples_per_period * 2:
            return []  # need at least two full periods of history
        anomalies = []
        for i in range(samples_per_period, len(values)):
            # Indices of the same time-of-day slot in every earlier period.
            historical_idx = range(
                i % samples_per_period, i, samples_per_period
            )
            historical = values[list(historical_idx)]
            if len(historical) < 3:
                continue
            mean = np.mean(historical)
            std = np.std(historical)
            if std == 0:
                continue  # a perfectly flat baseline makes any change look infinite
            deviation = abs(values[i] - mean) / std
            if deviation > self.z_threshold:
                anomalies.append({
                    "timestamp": timestamps[i],
                    "value": float(values[i]),
                    "expected": float(mean),
                    "deviation": float(deviation),
                    "method": "seasonal",
                    "metric": series.name,
                })
        return anomalies
```
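A quick way to sanity-check the z-score path is to feed it a flat series with one injected spike; the spike should be the only point flagged. This standalone sketch mirrors the computation inside detect_zscore_anomalies on synthetic data:

```python
import numpy as np
from scipy import stats

# Flat baseline with mild noise, plus one injected spike at index 50.
rng = np.random.default_rng(42)
values = rng.normal(loc=100.0, scale=2.0, size=100)
values[50] = 150.0  # the anomaly

# Same computation detect_zscore_anomalies performs.
z_scores = np.abs(stats.zscore(values))
flagged = [i for i, z in enumerate(z_scores) if z > 3.0]
print(flagged)
```

Because the spike inflates the global standard deviation, a much smaller blip could slip under the 3-sigma line; that is one reason to pair z-score with the seasonal check.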
Human Approval Gate for Remediation
When the agent detects an anomaly, it proposes a remediation and waits for approval on critical actions.
```python
import asyncio
from enum import Enum


class ApprovalStatus(Enum):
    PENDING = "pending"
    APPROVED = "approved"
    DENIED = "denied"
    AUTO_APPROVED = "auto_approved"


class ApprovalGate:
    def __init__(self, slack_webhook: str, auto_approve_severity: str = "low"):
        self.slack_webhook = slack_webhook
        self.auto_approve_severity = auto_approve_severity
        self.pending: dict[str, asyncio.Event] = {}
        self.decisions: dict[str, ApprovalStatus] = {}

    async def request_approval(
        self, action_id: str, description: str, severity: str
    ) -> ApprovalStatus:
        if severity == self.auto_approve_severity:
            return ApprovalStatus.AUTO_APPROVED
        event = asyncio.Event()
        self.pending[action_id] = event
        await self._send_slack_message(
            f"*Remediation Approval Required*\n"
            f"Action: {description}\n"
            f"Severity: {severity}\n"
            f"Reply with: /approve {action_id} or /deny {action_id}"
        )
        try:
            # Deny by default if no human responds within five minutes.
            await asyncio.wait_for(event.wait(), timeout=300)
            return self.decisions.get(action_id, ApprovalStatus.DENIED)
        except asyncio.TimeoutError:
            return ApprovalStatus.DENIED
        finally:
            self.pending.pop(action_id, None)

    async def _send_slack_message(self, text: str):
        async with httpx.AsyncClient() as client:
            await client.post(self.slack_webhook, json={"text": text})
```
The Monitoring Agent Loop
Tie everything together in a continuous loop that re-checks every monitored query on a fixed interval.
```python
async def monitoring_loop(interval_seconds: int = 300):
    prom = PrometheusClient()
    detector = AnomalyDetector(z_threshold=3.0)
    gate = ApprovalGate(slack_webhook="https://hooks.slack.com/...")
    while True:
        for query in MONITORED_QUERIES:
            try:
                series = await prom.query_range(query, hours_back=48)
            except (httpx.HTTPError, ValueError):
                continue  # one failed query should not kill the loop
            anomalies = detector.detect_zscore_anomalies(series)
            anomalies += detector.detect_seasonal_anomalies(series)
            for anomaly in anomalies:
                severity = "high" if anomaly.get("z_score", 0) > 5 else "medium"
                status = await gate.request_approval(
                    action_id=f"{query}-{anomaly['timestamp']}",
                    description=f"Scale up pods for {query}",
                    severity=severity,
                )
                if status in (ApprovalStatus.APPROVED, ApprovalStatus.AUTO_APPROVED):
                    await execute_remediation(query, anomaly)
        await asyncio.sleep(interval_seconds)
```
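The loop assumes an execute_remediation hook that the article leaves undefined, since real actions (Kubernetes scaling, service restarts) are deployment-specific. As an illustrative sketch, one shape is a dispatcher that routes the anomaly to a named action and defaults to dry-run as a safety net:

```python
import asyncio


# Hypothetical remediation hook: routes an anomaly to a named action.
# The action names and dry_run default are illustrative, not prescriptive.
async def execute_remediation(query: str, anomaly: dict, dry_run: bool = True) -> str:
    if "cpu" in query or "memory" in query:
        action = "scale_up_pods"
    elif "http_request_duration" in query:
        action = "restart_slow_pods"
    else:
        action = "page_oncall"
    if dry_run:
        return f"[dry-run] {action} for anomaly at {anomaly['timestamp']}"
    # A real implementation would call the Kubernetes API or a runbook here.
    raise NotImplementedError(f"wire {action} to your cluster API")
```

Shipping with dry_run=True lets you watch what the agent would have done for a week before granting it write access.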
FAQ
How do I avoid false positives from noisy metrics?
Use a sliding window to require multiple consecutive anomalous data points before triggering. A single spike is noise. Three consecutive 5-minute intervals above the threshold is a real problem. Also tune your z-score threshold per metric since some are naturally more variable than others.
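The "three consecutive intervals" rule can be sketched as a small filter over the detector output, assuming anomalies arrive sorted by timestamp at a fixed 5-minute step:

```python
def require_consecutive(
    anomalies: list[dict], min_run: int = 3, step_seconds: float = 300.0
) -> list[dict]:
    """Drop isolated spikes: keep only anomalies that belong to a run of at
    least min_run consecutive sampling intervals."""
    if not anomalies:
        return []
    runs: list[list[dict]] = []
    current = [anomalies[0]]
    for a in anomalies[1:]:
        if a["timestamp"] - current[-1]["timestamp"] <= step_seconds:
            current.append(a)  # continues the current run
        else:
            runs.append(current)  # gap found: close out the run
            current = [a]
    runs.append(current)
    return [a for run in runs for a in run if len(run) >= min_run]
```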
Should the agent train its own ML model or use statistical methods?
Start with statistical methods like z-score and seasonal decomposition. They are interpretable, require no training data, and work well for most infrastructure metrics. Graduate to ML models (isolation forest, LSTM autoencoders) only when you have metrics with complex non-linear patterns that statistical methods miss.
How do I handle the cold-start problem when there is no historical data?
Fall back to static thresholds for the first 48-72 hours of data collection. Once the agent has enough history, automatically switch to anomaly detection. Log a warning when operating in cold-start mode so the team knows alerting quality may be lower.
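One way to wire that fallback, sketched with illustrative defaults (the 48-hour cutoff comes from the answer above; the threshold value and function names are placeholders):

```python
import logging


def choose_alerts(values: list[float], history_hours: float,
                  static_threshold: float, detector_fn,
                  min_history_hours: float = 48.0) -> list[dict]:
    """Use static thresholds until enough history exists, then hand off
    to the anomaly detector; warn while in cold-start mode."""
    if history_hours < min_history_hours:
        logging.warning("cold-start mode: static thresholds in effect, "
                        "alert quality may be lower")
        return [{"value": v, "method": "static"}
                for v in values if v > static_threshold]
    return detector_fn(values)
```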
#InfrastructureMonitoring #AnomalyDetection #DevOps #SRE #Python #AgenticAI #LearnAI #AIEngineering
CallSphere Team