
Real-Time Agent Dashboards with Grafana: Visualizing Performance and Health Metrics

Learn how to set up Grafana dashboards for AI agent monitoring, configure data sources, design effective panels for latency, throughput, and error rates, and create alert rules that catch problems before users notice.

Why Grafana for Agent Monitoring

Grafana is the standard for operational dashboards because it connects to virtually any data source, renders time-series data beautifully, and provides a robust alerting engine. For AI agents, you need to visualize metrics that span multiple layers: API latency, token throughput, error rates, conversation volume, and model performance — often from different backends.

A single Grafana dashboard can pull from Prometheus for infrastructure metrics, PostgreSQL for business metrics, and Loki for log-based insights, presenting a unified view of agent health.
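As a concrete starting point, a stack like this can be sketched in Docker Compose. The image tags, ports, and volume paths below are assumptions for illustration, not a prescribed setup:

```yaml
# Minimal sketch: Grafana, Prometheus, and Loki side by side.
# Adjust images, versions, and volumes for your environment.
services:
  prometheus:
    image: prom/prometheus:latest
    ports: ["9090:9090"]
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
  loki:
    image: grafana/loki:latest
    ports: ["3100:3100"]
  grafana:
    image: grafana/grafana:latest
    ports: ["3000:3000"]
    volumes:
      - ./provisioning:/etc/grafana/provisioning
    depends_on: [prometheus, loki]
```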

Exporting Agent Metrics to Prometheus

The first step is instrumenting your agent code to export metrics in a format Grafana can consume. Prometheus is the most common metrics backend. Use the prometheus-client library to expose counters, histograms, and gauges.

from prometheus_client import (
    Counter, Histogram, Gauge, start_http_server
)

# Define metrics
CONVERSATION_TOTAL = Counter(
    "agent_conversations_total",
    "Total conversations started",
    ["agent_name"],
)

MESSAGE_LATENCY = Histogram(
    "agent_message_latency_seconds",
    "Time to generate agent response",
    ["agent_name", "model"],
    buckets=[0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0],
)

TOKEN_USAGE = Counter(
    "agent_tokens_total",
    "Total tokens consumed",
    ["agent_name", "model", "token_type"],
)

ACTIVE_CONVERSATIONS = Gauge(
    "agent_active_conversations",
    "Currently active conversations",
    ["agent_name"],
)

ERROR_TOTAL = Counter(
    "agent_errors_total",
    "Total errors encountered",
    ["agent_name", "error_type"],
)

# Start metrics server on port 8090
start_http_server(8090)
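Prometheus then needs a scrape job pointing at that port. A minimal scrape config might look like this (the job name and target host are placeholders):

```yaml
# prometheus.yml: scrape the agent's metrics endpoint every 15s
scrape_configs:
  - job_name: "ai-agents"
    scrape_interval: 15s
    static_configs:
      - targets: ["agent-host:8090"]
```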

Instrumenting the Agent Loop

Wrap your agent's message handling with metric recording. The key is to capture timing, token counts, and outcomes at every step.

import time

class InstrumentedAgent:
    def __init__(self, name: str, model: str = "gpt-4o"):
        self.name = name
        self.model = model

    async def _generate_response(self, user_message: str) -> dict:
        # Replace with your real model call. Must return a dict with
        # "content", "prompt_tokens", and "completion_tokens".
        raise NotImplementedError

    async def handle_message(
        self, conversation_id: str, user_message: str
    ) -> str:
        ACTIVE_CONVERSATIONS.labels(agent_name=self.name).inc()
        start_time = time.time()
        try:
            response = await self._generate_response(user_message)
            latency = time.time() - start_time
            MESSAGE_LATENCY.labels(
                agent_name=self.name, model=self.model
            ).observe(latency)
            TOKEN_USAGE.labels(
                agent_name=self.name,
                model=self.model,
                token_type="prompt",
            ).inc(response["prompt_tokens"])
            TOKEN_USAGE.labels(
                agent_name=self.name,
                model=self.model,
                token_type="completion",
            ).inc(response["completion_tokens"])
            return response["content"]
        except Exception as exc:
            ERROR_TOTAL.labels(
                agent_name=self.name,
                error_type=type(exc).__name__,
            ).inc()
            raise
        finally:
            ACTIVE_CONVERSATIONS.labels(agent_name=self.name).dec()
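To see the control flow in isolation, here is a library-free sketch of the same pattern: plain Python containers stand in for the Prometheus objects, but the try/except/finally shape is identical. Everything in this sketch is illustrative, not part of the code above.

```python
import asyncio
import time
from collections import defaultdict

# Plain stand-ins for the Prometheus metric objects.
METRICS = {"latency": [], "errors": defaultdict(int), "active": 0}

async def handle(message: str) -> str:
    METRICS["active"] += 1                # gauge.inc()
    start = time.perf_counter()
    try:
        if not message:
            raise ValueError("empty message")
        await asyncio.sleep(0)            # stands in for the model call
        METRICS["latency"].append(time.perf_counter() - start)  # observe()
        return f"echo: {message}"
    except Exception as exc:
        METRICS["errors"][type(exc).__name__] += 1  # error counter.inc()
        raise
    finally:
        METRICS["active"] -= 1            # gauge.dec()

print(asyncio.run(handle("hi")))  # echo: hi
```

Note that on failure the error counter increments and the gauge still decrements, but no latency sample is recorded, matching the instrumented class above.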

Grafana Data Source Configuration

Configure Prometheus as a data source in Grafana. If you also want to query business metrics from PostgreSQL, add it as a second data source.


# grafana_provisioning.py — generate provisioning YAML
import yaml

datasources = {
    "apiVersion": 1,
    "datasources": [
        {
            "name": "Prometheus",
            "type": "prometheus",
            "url": "http://prometheus:9090",
            "access": "proxy",
            "isDefault": True,
        },
        {
            "name": "PostgreSQL",
            "type": "postgres",
            "url": "postgres-host:5432",
            "database": "agent_analytics",
            "user": "grafana_reader",
            "jsonData": {"sslmode": "require"},
            "secureJsonData": {"password": "${GRAFANA_PG_PASSWORD}"},
        },
    ],
}

with open("/etc/grafana/provisioning/datasources/agents.yaml", "w") as f:
    yaml.dump(datasources, f)

Dashboard Panel Design

An effective agent dashboard has four sections: overview, performance, errors, and cost. Each section contains panels that answer specific operational questions.

# Dashboard JSON model generator
def create_agent_dashboard() -> dict:
    return {
        "dashboard": {
            "title": "AI Agent Operations",
            "panels": [
                {
                    "title": "Conversations per Minute",
                    "type": "timeseries",
                    "targets": [{
                        "expr": "rate(agent_conversations_total[5m]) * 60",
                        "legendFormat": "{{agent_name}}",
                    }],
                    "gridPos": {"h": 8, "w": 12, "x": 0, "y": 0},
                },
                {
                    "title": "P95 Response Latency",
                    "type": "timeseries",
                    "targets": [{
                        "expr": (
                            "histogram_quantile(0.95, "
                            "rate(agent_message_latency_seconds_bucket[5m]))"
                        ),
                        "legendFormat": "{{agent_name}}",
                    }],
                    "gridPos": {"h": 8, "w": 12, "x": 12, "y": 0},
                },
                {
                    "title": "Error Rate",
                    "type": "stat",
                    "targets": [{
                        "expr": (
                            "rate(agent_errors_total[5m]) / "
                            "rate(agent_conversations_total[5m]) * 100"
                        ),
                    }],
                    "gridPos": {"h": 4, "w": 6, "x": 0, "y": 8},
                },
                {
                    "title": "Active Conversations",
                    "type": "gauge",
                    "targets": [{
                        "expr": "agent_active_conversations",
                    }],
                    "gridPos": {"h": 4, "w": 6, "x": 6, "y": 8},
                },
            ],
        },
    }
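The JSON model above does nothing until it reaches Grafana. One option is Grafana's dashboard HTTP API (`POST /api/dashboards/db`). The sketch below builds the request with the standard library only; the base URL and token are placeholders:

```python
import json
import urllib.request

def build_push_request(
    dashboard: dict, base_url: str, token: str
) -> urllib.request.Request:
    # The API expects the model under "dashboard"; "overwrite" replaces
    # any existing dashboard with the same uid/title.
    payload = {"dashboard": dashboard["dashboard"], "overwrite": True}
    return urllib.request.Request(
        f"{base_url}/api/dashboards/db",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

# Sending it is then one call:
# urllib.request.urlopen(build_push_request(create_agent_dashboard(), base_url, token))
```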

Alert Rules

Dashboards are useless if nobody is looking at them. Alerts bridge the gap by notifying the team when metrics cross critical thresholds.

def create_alert_rules() -> list[dict]:
    return [
        {
            "name": "High Agent Latency",
            "condition": (
                "histogram_quantile(0.95, "
                "rate(agent_message_latency_seconds_bucket[5m])) > 5"
            ),
            "for": "5m",
            "severity": "warning",
            "message": "Agent P95 latency exceeds 5 seconds",
        },
        {
            "name": "Elevated Error Rate",
            "condition": (
                "rate(agent_errors_total[5m]) / "
                "rate(agent_conversations_total[5m]) > 0.05"
            ),
            "for": "3m",
            "severity": "critical",
            "message": "Agent error rate exceeds 5%",
        },
        {
            "name": "Token Budget Exceeded",
            "condition": (
                "increase(agent_tokens_total[1h]) > 1000000"
            ),
            "for": "0m",
            "severity": "warning",
            "message": "Agent consumed over 1M tokens in the past hour",
        },
    ]
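These dicts are a convenient internal shape, but neither Grafana nor Prometheus reads them directly. One way to use them is to map them onto Prometheus's alerting rule-file schema; the group name and label layout below are choices for illustration:

```python
def to_prometheus_rules(alerts: list[dict]) -> dict:
    """Map the alert dicts above onto Prometheus's rule-file schema."""
    return {
        "groups": [
            {
                "name": "agent-alerts",
                "rules": [
                    {
                        "alert": a["name"].replace(" ", ""),
                        "expr": a["condition"],
                        "for": a["for"],
                        "labels": {"severity": a["severity"]},
                        "annotations": {"summary": a["message"]},
                    }
                    for a in alerts
                ],
            }
        ]
    }
```

Dump the result with `yaml.dump` and point `rule_files` in `prometheus.yml` at the file.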

FAQ

Should I use Prometheus or push metrics directly to Grafana Cloud?

Prometheus works best if you already run Kubernetes or have infrastructure for scraping. For simpler setups, Grafana Cloud with the OpenTelemetry Collector lets you push metrics directly without managing Prometheus. The dashboards and PromQL queries work the same either way.

How long should I retain high-resolution metrics?

Keep 15-second resolution data for 7 days, 1-minute aggregations for 30 days, and 5-minute aggregations for 1 year. This balances storage costs with the ability to investigate recent incidents in detail and spot long-term trends. Configure Prometheus retention rules or use Thanos for long-term storage.
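Prometheus itself stores only one resolution, so the coarser tiers usually come from recording rules (or Thanos downsampling). A sketch of the retention flag plus one recording rule, with the rule name chosen for illustration:

```yaml
# Run Prometheus with: --storage.tsdb.retention.time=7d
# recording_rules.yml: pre-aggregate so long-range queries stay cheap
groups:
  - name: agent-aggregations
    interval: 1m
    rules:
      - record: agent:conversations:rate5m
        expr: sum by (agent_name) (rate(agent_conversations_total[5m]))
```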

What is the most important single panel for an agent dashboard?

The error rate panel. Token usage and latency are important for optimization, but errors directly impact user experience. A spike in errors means users are getting failed responses. Display error rate as a percentage with a threshold line at your SLA target (typically 1-2%) and configure an alert when it exceeds that threshold for more than 3 minutes.


#Grafana #Monitoring #Dashboards #Observability #AIAgents #AgenticAI #LearnAI #AIEngineering

CallSphere Team

Expert insights on AI voice agents and customer communication automation.
