
Agent Monitoring with Prometheus and Grafana: Building AI-Specific Dashboards

Build production monitoring dashboards for AI agents that track response latency, tool-call success rates, token usage, cost per interaction, and SLA compliance.

Why Standard APM Is Not Enough for AI Agents

Your existing Prometheus and Grafana setup tracks HTTP request latency, error rates, CPU usage, and memory consumption. These metrics tell you whether your server is healthy but tell you nothing about whether your agent is performing well. An agent can return HTTP 200 with a perfectly formatted JSON response that contains completely wrong information. Standard application performance monitoring (APM) is blind to this failure mode.

Agent monitoring requires a new category of metrics that capture the AI-specific dimensions of system health: model inference time (separate from total latency), tool call success and failure rates, token consumption and cost, response quality scores, and conversation-level metrics like resolution rate and escalation rate.

This guide walks through instrumenting an AI agent application with Prometheus metrics and building Grafana dashboards that give you real-time visibility into agent behavior.

Instrumenting Your Agent with Prometheus Metrics

The first step is defining the metrics your agent will emit. Prometheus supports four metric types: counters (monotonically increasing), gauges (can go up and down), histograms (distributions of values in configurable buckets), and summaries (client-side quantiles). The examples below use counters, gauges, and histograms.

# agent_metrics.py — Prometheus metric definitions for AI agents
from prometheus_client import Counter, Histogram, Gauge

# ── Request-level metrics ──
AGENT_REQUESTS_TOTAL = Counter(
    "agent_requests_total",
    "Total number of agent requests",
    ["agent_name", "status"],  # status: success, error, timeout
)

AGENT_REQUEST_DURATION = Histogram(
    "agent_request_duration_seconds",
    "Total time to process an agent request (including all tool calls)",
    ["agent_name"],
    buckets=[0.5, 1.0, 2.0, 3.0, 5.0, 8.0, 13.0, 21.0, 30.0, 60.0],
)

# ── Model inference metrics ──
MODEL_INFERENCE_DURATION = Histogram(
    "model_inference_duration_seconds",
    "Time spent on LLM inference calls (excludes tool execution)",
    ["agent_name", "model_id"],
    buckets=[0.2, 0.5, 1.0, 2.0, 3.0, 5.0, 10.0],
)

MODEL_INFERENCE_CALLS = Counter(
    "model_inference_calls_total",
    "Total number of LLM inference calls per request",
    ["agent_name", "model_id"],
)

# ── Token metrics ──
TOKEN_USAGE = Counter(
    "agent_token_usage_total",
    "Total tokens consumed",
    ["agent_name", "model_id", "token_type"],  # token_type: input, output
)

# Note: prometheus_client appends a _total suffix to counters at exposition
# time, so this metric is scraped as agent_estimated_cost_dollars_total.
ESTIMATED_COST = Counter(
    "agent_estimated_cost_dollars",
    "Estimated cost of LLM usage in dollars",
    ["agent_name", "model_id"],
)

# ── Tool call metrics ──
TOOL_CALLS_TOTAL = Counter(
    "agent_tool_calls_total",
    "Total number of tool calls",
    ["agent_name", "tool_name", "status"],  # status: success, error, timeout
)

TOOL_CALL_DURATION = Histogram(
    "agent_tool_call_duration_seconds",
    "Duration of individual tool calls",
    ["agent_name", "tool_name"],
    buckets=[0.05, 0.1, 0.25, 0.5, 1.0, 2.0, 5.0, 10.0],
)

# ── Quality metrics (updated by async evaluation jobs) ──
AGENT_QUALITY_SCORE = Gauge(
    "agent_quality_score",
    "Rolling average quality score from evaluation sampling",
    ["agent_name", "metric_type"],  # metric_type: groundedness, relevance, safety
)

# ── Conversation metrics ──
CONVERSATION_TURNS = Histogram(
    "agent_conversation_turns",
    "Number of turns per conversation",
    ["agent_name"],
    buckets=[1, 2, 3, 5, 8, 13, 20],
)

ESCALATION_RATE = Gauge(
    "agent_escalation_rate",
    "Percentage of conversations escalated to humans (rolling 1h window)",
    ["agent_name"],
)

Wrapping Agent Execution with Metrics Collection

With metrics defined, instrument the agent's execution path. The key is to measure each phase independently: total request time, model inference time, and tool execution time. This lets you diagnose whether slowdowns come from the model, the tools, or the orchestration logic.

# agent_instrumented.py — Agent wrapper with Prometheus instrumentation
import time
from contextlib import asynccontextmanager
from agent_metrics import (
    AGENT_REQUESTS_TOTAL, AGENT_REQUEST_DURATION,
    MODEL_INFERENCE_DURATION, MODEL_INFERENCE_CALLS,
    TOKEN_USAGE, ESTIMATED_COST,
    TOOL_CALLS_TOTAL, TOOL_CALL_DURATION,
)

# Cost per token (example rates, adjust per model)
COST_PER_TOKEN = {
    "gemini-2.0-flash": {"input": 0.00000015, "output": 0.0000006},
    "gemini-2.0-pro": {"input": 0.00000125, "output": 0.000005},
    "gpt-4o": {"input": 0.0000025, "output": 0.00001},
}


@asynccontextmanager
async def track_model_call(agent_name: str, model_id: str):
    """Context manager to track model inference duration and token usage."""
    MODEL_INFERENCE_CALLS.labels(agent_name=agent_name, model_id=model_id).inc()
    start = time.perf_counter()
    result_holder = {"response": None}
    try:
        yield result_holder
    finally:
        # Record duration even if the model call raises
        duration = time.perf_counter() - start
        MODEL_INFERENCE_DURATION.labels(
            agent_name=agent_name, model_id=model_id
        ).observe(duration)

    # Record token usage if available
    response = result_holder.get("response")
    if response and hasattr(response, "usage"):
        input_tokens = response.usage.input_tokens
        output_tokens = response.usage.output_tokens
        TOKEN_USAGE.labels(
            agent_name=agent_name, model_id=model_id, token_type="input"
        ).inc(input_tokens)
        TOKEN_USAGE.labels(
            agent_name=agent_name, model_id=model_id, token_type="output"
        ).inc(output_tokens)

        # Estimate cost
        rates = COST_PER_TOKEN.get(model_id, {"input": 0, "output": 0})
        cost = input_tokens * rates["input"] + output_tokens * rates["output"]
        ESTIMATED_COST.labels(agent_name=agent_name, model_id=model_id).inc(cost)


async def execute_tool_with_metrics(
    agent_name: str, tool_name: str, tool_fn, arguments: dict
):
    """Execute a tool function and record metrics."""
    start = time.perf_counter()
    try:
        result = await tool_fn(**arguments)
        TOOL_CALLS_TOTAL.labels(
            agent_name=agent_name, tool_name=tool_name, status="success"
        ).inc()
        return result
    except TimeoutError:
        TOOL_CALLS_TOTAL.labels(
            agent_name=agent_name, tool_name=tool_name, status="timeout"
        ).inc()
        raise
    except Exception:
        TOOL_CALLS_TOTAL.labels(
            agent_name=agent_name, tool_name=tool_name, status="error"
        ).inc()
        raise
    finally:
        duration = time.perf_counter() - start
        TOOL_CALL_DURATION.labels(
            agent_name=agent_name, tool_name=tool_name
        ).observe(duration)


async def run_agent_with_metrics(agent, agent_name: str, user_input: str) -> str:
    """Full agent execution with comprehensive metrics."""
    start = time.perf_counter()
    status = "success"

    try:
        response = await agent.run(user_input)
        return response.text
    except TimeoutError:
        status = "timeout"
        raise
    except Exception:
        status = "error"
        raise
    finally:
        duration = time.perf_counter() - start
        AGENT_REQUESTS_TOTAL.labels(agent_name=agent_name, status=status).inc()
        AGENT_REQUEST_DURATION.labels(agent_name=agent_name).observe(duration)

Prometheus Configuration for Agent Scraping

Configure Prometheus to scrape agent metrics. If your agent runs as a FastAPI application, the prometheus_client library's built-in HTTP server or a Starlette middleware handles exposition.
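As a quick sanity check of what a scrape returns, here is a minimal sketch using prometheus_client directly; the registry, label values, and the FastAPI mount in the trailing comment are illustrative, not part of the article's application code.

```python
# exposition_check.py — sketch of what a scrape of /metrics returns
# (the registry and label values here are illustrative)
from prometheus_client import CollectorRegistry, Counter, generate_latest

registry = CollectorRegistry()
requests_total = Counter(
    "agent_requests_total",
    "Total number of agent requests",
    ["agent_name", "status"],
    registry=registry,
)
requests_total.labels(agent_name="support", status="success").inc()

# generate_latest renders the text exposition format Prometheus scrapes
payload = generate_latest(registry).decode()
print(payload)

# In a FastAPI/Starlette app, mount the bundled ASGI app instead:
#   from prometheus_client import make_asgi_app
#   app.mount("/metrics", make_asgi_app(registry))
```

The printed payload is exactly what the scrape targets below serve on /metrics.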

# prometheus.yml — Agent scrape configuration
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: "ai-agents"
    metrics_path: /metrics
    static_configs:
      - targets:
          - "agent-service:8000"    # Main agent application
        labels:
          environment: "production"
          team: "ai-platform"

    # Keep only agent- and model-related metrics from this job
    metric_relabel_configs:
      - source_labels: [__name__]
        regex: "(agent|model)_.*"
        action: keep

  - job_name: "agent-canary"
    metrics_path: /metrics
    static_configs:
      - targets:
          - "agent-canary:8000"
        labels:
          environment: "canary"
          team: "ai-platform"

Building the Grafana Dashboard

The Grafana dashboard for AI agents should have four sections: overview, model performance, tool performance, and cost tracking. Each section answers different operational questions.


Overview panel shows request volume, error rate, and P50/P95/P99 latency. These are the first panels you check during an incident.

Model performance shows inference latency by model, token usage trends, and inference call count per request (which reveals how many LLM round-trips the agent needs).

Tool performance shows per-tool success rates, latency distributions, and call volume. When a tool's error rate spikes, you know exactly which integration broke.

Cost tracking shows estimated cost per hour, per day, and per interaction. This is critical for budget management and for detecting cost anomalies (like a prompt change that quadruples token usage).

{
  "dashboard": {
    "title": "AI Agent Operations",
    "panels": [
      {
        "title": "Request Rate (per second)",
        "type": "timeseries",
        "targets": [
          {
            "expr": "sum(rate(agent_requests_total[5m])) by (agent_name, status)",
            "legendFormat": "{{agent_name}} - {{status}}"
          }
        ]
      },
      {
        "title": "Request Latency (P50 / P95 / P99)",
        "type": "timeseries",
        "targets": [
          {
            "expr": "histogram_quantile(0.50, sum(rate(agent_request_duration_seconds_bucket[5m])) by (le, agent_name))",
            "legendFormat": "{{agent_name}} P50"
          },
          {
            "expr": "histogram_quantile(0.95, sum(rate(agent_request_duration_seconds_bucket[5m])) by (le, agent_name))",
            "legendFormat": "{{agent_name}} P95"
          },
          {
            "expr": "histogram_quantile(0.99, sum(rate(agent_request_duration_seconds_bucket[5m])) by (le, agent_name))",
            "legendFormat": "{{agent_name}} P99"
          }
        ]
      },
      {
        "title": "Tool Call Success Rate",
        "type": "timeseries",
        "targets": [
          {
            "expr": "sum(rate(agent_tool_calls_total{status='success'}[5m])) by (tool_name) / sum(rate(agent_tool_calls_total[5m])) by (tool_name) * 100",
            "legendFormat": "{{tool_name}}"
          }
        ],
        "fieldConfig": {
          "defaults": { "unit": "percent", "min": 0, "max": 100 }
        }
      },
      {
        "title": "Estimated Cost ($/hour)",
        "type": "stat",
        "targets": [
          {
            "expr": "sum(rate(agent_estimated_cost_dollars[1h])) * 3600",
            "legendFormat": "Cost/Hour"
          }
        ],
        "fieldConfig": {
          "defaults": { "unit": "currencyUSD" }
        }
      },
      {
        "title": "Token Usage by Model",
        "type": "timeseries",
        "targets": [
          {
            "expr": "sum(rate(agent_token_usage_total[5m])) by (model_id, token_type) * 60",
            "legendFormat": "{{model_id}} {{token_type}}"
          }
        ]
      },
      {
        "title": "Agent Quality Score (Rolling)",
        "type": "gauge",
        "targets": [
          {
            "expr": "agent_quality_score{metric_type='groundedness'}",
            "legendFormat": "Groundedness"
          },
          {
            "expr": "agent_quality_score{metric_type='relevance'}",
            "legendFormat": "Relevance"
          }
        ],
        "fieldConfig": {
          "defaults": { "min": 0, "max": 1, "thresholds": {
            "steps": [
              { "value": 0, "color": "red" },
              { "value": 0.7, "color": "yellow" },
              { "value": 0.85, "color": "green" }
            ]
          }}
        }
      }
    ]
  }
}

Alerting Rules for Agent-Specific Failures

Standard alerts (high error rate, high latency) apply to agents. But agents also need quality-specific alerts that fire when the agent is technically healthy but producing poor results.

# prometheus-alert-rules.yml
groups:
  - name: ai-agent-alerts
    rules:
      - alert: AgentHighErrorRate
        expr: |
          sum(rate(agent_requests_total{status="error"}[5m])) by (agent_name)
          / sum(rate(agent_requests_total[5m])) by (agent_name) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Agent {{ $labels.agent_name }} error rate above 5%"

      - alert: AgentHighLatency
        expr: |
          histogram_quantile(0.95,
            sum(rate(agent_request_duration_seconds_bucket[5m])) by (le, agent_name)
          ) > 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Agent {{ $labels.agent_name }} P95 latency above 10s"

      - alert: ToolCallFailureSpike
        expr: |
          sum(rate(agent_tool_calls_total{status="error"}[5m])) by (tool_name)
          / sum(rate(agent_tool_calls_total[5m])) by (tool_name) > 0.1
        for: 3m
        labels:
          severity: critical
        annotations:
          summary: "Tool {{ $labels.tool_name }} failure rate above 10%"

      - alert: AgentQualityDegradation
        expr: agent_quality_score{metric_type="groundedness"} < 0.70
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Agent {{ $labels.agent_name }} groundedness score dropped below 0.70"

      - alert: AgentCostAnomaly
        expr: |
          sum(rate(agent_estimated_cost_dollars_total[1h])) * 3600
          > 2 * sum(rate(agent_estimated_cost_dollars_total[1h] offset 1d)) * 3600
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Agent cost per hour is 2x higher than same time yesterday"

FAQ

How do you measure agent quality in real time without slowing down responses?

Use asynchronous evaluation sampling. For every Nth request (for example, 1 in 20), send the agent's input and output to a background evaluation job that runs an LLM-as-judge assessment, then update the agent_quality_score gauge with the rolling average. This adds zero latency to the user-facing request and provides near-real-time quality visibility.
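This pattern can be sketched as follows; judge_fn stands in for your LLM-as-judge call and is a hypothetical callable, not a real library API.

```python
# quality_sampler.py — async evaluation sampling sketch; judge_fn is a
# hypothetical stand-in for an LLM-as-judge call
import asyncio
from collections import deque
from prometheus_client import CollectorRegistry, Gauge

registry = CollectorRegistry()
AGENT_QUALITY_SCORE = Gauge(
    "agent_quality_score",
    "Rolling average quality score from evaluation sampling",
    ["agent_name", "metric_type"],
    registry=registry,
)

class QualitySampler:
    """Evaluates every Nth request in the background and updates the gauge."""

    def __init__(self, agent_name: str, sample_every: int = 20, window: int = 50):
        self.agent_name = agent_name
        self.sample_every = sample_every
        self.count = 0
        self.scores = deque(maxlen=window)  # rolling window of judge scores

    async def maybe_evaluate(self, user_input: str, response: str, judge_fn):
        """Called on the request path; schedules evaluation without blocking."""
        self.count += 1
        if self.count % self.sample_every != 0:
            return
        asyncio.create_task(self._evaluate(user_input, response, judge_fn))

    async def _evaluate(self, user_input: str, response: str, judge_fn):
        score = await judge_fn(user_input, response)  # expected in [0.0, 1.0]
        self.scores.append(score)
        AGENT_QUALITY_SCORE.labels(
            agent_name=self.agent_name, metric_type="groundedness"
        ).set(sum(self.scores) / len(self.scores))
```

Because create_task schedules the judge call on the event loop, the user-facing response returns before the evaluation runs.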

How long should you retain agent metrics?

Keep high-resolution (15-second) metrics for 7 days, downsample to 1-minute resolution for 30 days, and 5-minute resolution for 90 days. Token usage and cost counters should be retained longer (180+ days) for budgeting and trend analysis. Use Prometheus's remote_write with a long-term storage backend like Thanos or Cortex for extended retention.
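A minimal remote_write stanza for that long-term leg might look like the following sketch; the Thanos receiver address is a placeholder for your own deployment.

```yaml
# prometheus.yml fragment — ship agent metrics to long-term storage
remote_write:
  - url: "http://thanos-receive:19291/api/v1/receive"   # placeholder endpoint
    write_relabel_configs:
      # Forward only agent/model metrics to keep long-term storage small
      - source_labels: [__name__]
        regex: "(agent|model)_.*"
        action: keep
```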

How do you handle multi-model agents in the dashboard?

Use the model_id label on all model-specific metrics. The Grafana dashboard should include a model_id variable selector so operators can filter to a specific model or view all models side by side. For model cascading setups, add a panel that shows the distribution of requests across models to verify the routing logic is working as intended.
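Using the same panel JSON convention as the dashboard above, a routing-distribution panel could be sketched as:

```json
{
  "title": "Inference Calls by Model",
  "type": "timeseries",
  "targets": [
    {
      "expr": "sum(rate(model_inference_calls_total[5m])) by (model_id)",
      "legendFormat": "{{model_id}}"
    }
  ]
}
```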

Can this monitoring setup detect prompt injection attacks?

Not directly, but it provides indirect signals. Prompt injection attempts often cause unusual tool-call patterns (calling tools the agent normally does not use), higher token usage (injected prompts are longer), and lower quality scores (the agent's response deviates from its normal behavior). Set up alerts on these anomalies and investigate when they co-occur.

Written by

CallSphere Team
