Agentic AI Observability: OpenTelemetry, Grafana, and Custom Agent Metrics
Build a full observability stack for agentic AI with OpenTelemetry tracing, Grafana dashboards, custom agent metrics, and alerting strategies.
Why Traditional Observability Falls Short for Agentic AI
Standard application monitoring tracks HTTP status codes, request latency, and error rates. These metrics tell you when your web server is healthy. They tell you almost nothing about whether your AI agents are performing well.
An agent can return HTTP 200 on every request while producing hallucinated responses, calling the wrong tools, handing off to the wrong specialist, or burning through token budgets 10x faster than expected. Agent-specific observability requires tracking a fundamentally different set of signals: LLM response quality, tool execution success rates, handoff patterns, token consumption, conversation resolution rates, and multi-agent trace propagation.
At CallSphere, our observability stack processes millions of agent telemetry events daily across voice and chat channels. This guide covers the instrumentation patterns, custom metrics, and dashboarding strategies we rely on.
OpenTelemetry Instrumentation for Agent Systems
OpenTelemetry (OTel) provides a vendor-neutral standard for traces, metrics, and logs. For agentic AI, the key is creating meaningful spans that represent agent-level operations rather than just HTTP calls.
Setting Up the OTel SDK
from opentelemetry import trace, metrics
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.sdk.resources import Resource

resource = Resource.create({
    "service.name": "triage-agent",
    "service.version": "2.4.1",
    "deployment.environment": "production",
    "agent.type": "triage",
})

# Traces
trace_provider = TracerProvider(resource=resource)
trace_provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="otel-collector:4317"))
)
trace.set_tracer_provider(trace_provider)

# Metrics
metric_reader = PeriodicExportingMetricReader(
    OTLPMetricExporter(endpoint="otel-collector:4317"),
    export_interval_millis=15000,
)
meter_provider = MeterProvider(resource=resource, metric_readers=[metric_reader])
metrics.set_meter_provider(meter_provider)

tracer = trace.get_tracer("agentic-ai")
meter = metrics.get_meter("agentic-ai")
Creating Agent-Specific Spans
The critical insight is that each agent operation — receiving a message, thinking, calling a tool, generating a response, handing off — should be a distinct span within a conversation trace.
import time

# Define custom metrics
llm_token_counter = meter.create_counter(
    "agent.llm.tokens",
    description="Total LLM tokens consumed",
    unit="tokens",
)
llm_duration_histogram = meter.create_histogram(
    "agent.llm.duration",
    description="LLM API call duration",
    unit="ms",
)
tool_execution_counter = meter.create_counter(
    "agent.tool.executions",
    description="Tool execution count by tool and status",
)
handoff_counter = meter.create_counter(
    "agent.handoffs",
    description="Agent handoff count",
)
conversation_duration_histogram = meter.create_histogram(
    "agent.conversation.duration",
    description="Total conversation duration",
    unit="seconds",
)

class InstrumentedAgent:
    def __init__(self, agent_name: str):
        self.agent_name = agent_name

    async def process_message(self, conversation_id: str, message: str):
        with tracer.start_as_current_span(
            "agent.process_message",
            attributes={
                "agent.name": self.agent_name,
                "conversation.id": conversation_id,
                "message.length": len(message),
            },
        ) as span:
            # Step 1: Think (LLM call)
            with tracer.start_as_current_span("agent.llm_call") as llm_span:
                start = time.time()
                response = await self.call_llm(message)
                duration_ms = (time.time() - start) * 1000
                llm_span.set_attribute("llm.model", response.model)
                llm_span.set_attribute("llm.input_tokens", response.usage.input_tokens)
                llm_span.set_attribute("llm.output_tokens", response.usage.output_tokens)
                llm_token_counter.add(
                    response.usage.input_tokens + response.usage.output_tokens,
                    {"agent.name": self.agent_name, "model": response.model, "type": "total"},
                )
                llm_duration_histogram.record(
                    duration_ms,
                    {"agent.name": self.agent_name, "model": response.model},
                )

            # Step 2: Execute tools if needed
            if response.tool_calls:
                for tool_call in response.tool_calls:
                    with tracer.start_as_current_span(
                        "agent.tool_execution",
                        attributes={
                            "tool.name": tool_call.name,
                            "tool.input_size": len(str(tool_call.input)),
                        },
                    ) as tool_span:
                        try:
                            result = await self.execute_tool(tool_call)
                            tool_span.set_attribute("tool.status", "success")
                            tool_execution_counter.add(1, {
                                "agent.name": self.agent_name,
                                "tool.name": tool_call.name,
                                "status": "success",
                            })
                        except Exception as e:
                            tool_span.set_attribute("tool.status", "error")
                            tool_span.set_attribute("tool.error", str(e))
                            tool_execution_counter.add(1, {
                                "agent.name": self.agent_name,
                                "tool.name": tool_call.name,
                                "status": "error",
                            })

            # Step 3: Handoff if needed
            if response.handoff_target:
                with tracer.start_as_current_span(
                    "agent.handoff",
                    attributes={
                        "handoff.from": self.agent_name,
                        "handoff.to": response.handoff_target,
                        "handoff.reason": response.handoff_reason,
                    },
                ):
                    handoff_counter.add(1, {
                        "from_agent": self.agent_name,
                        "to_agent": response.handoff_target,
                    })
                    await self.handoff(response.handoff_target, conversation_id)

            return response
Distributed Tracing Across Agents
In a multi-agent system, a single user conversation may traverse the triage agent, a specialist agent, tool executors, and back. The trace must follow the entire journey.
Trace Context Propagation
When one agent hands off to another, propagate the trace context:
import json

from opentelemetry.context import attach, detach
from opentelemetry.propagate import inject, extract

async def handoff_to_agent(target_agent: str, conversation_id: str, context: dict):
    """Hand off conversation with trace context."""
    # Inject current trace context into headers
    headers = {}
    inject(headers)
    message = {
        "conversation_id": conversation_id,
        "context": context,
        "trace_headers": headers,  # Contains traceparent and tracestate
    }
    # nats_client is the agent's existing NATS connection
    await nats_client.publish(f"agents.{target_agent}.inbox", json.dumps(message).encode())

# On the receiving agent (a method of the agent class)
async def receive_handoff(self, message):
    """Receive handoff and continue the trace."""
    data = json.loads(message.data)
    # Extract trace context from the handoff message
    ctx = extract(data.get("trace_headers", {}))
    token = attach(ctx)
    try:
        with tracer.start_as_current_span(
            "agent.handle_handoff",
            attributes={"agent.name": self.agent_name},
        ):
            await self.process_conversation(data["conversation_id"], data["context"])
    finally:
        detach(token)
This produces traces that show the full conversation journey:
Trace: conv_abc123 (total: 12.4s)
|-- triage-agent.process_message (2.1s)
| |-- agent.llm_call (1.8s) [model: haiku, tokens: 450]
| |-- agent.handoff (0.1s) [to: billing-agent]
|
|-- billing-agent.handle_handoff (8.2s)
| |-- agent.llm_call (1.2s) [model: sonnet, tokens: 890]
| |-- agent.tool_execution (5.8s) [tool: lookup_invoice]
| |-- agent.llm_call (1.1s) [model: sonnet, tokens: 620]
|
|-- triage-agent.send_response (0.3s)
Custom Agent Metrics for Grafana Dashboards
Key Metrics to Track
Beyond standard infrastructure metrics, agentic AI systems need these agent-specific metrics:
| Metric | Type | Purpose |
|---|---|---|
| agent.llm.tokens | Counter | Token consumption by agent, model, direction |
| agent.llm.duration | Histogram | LLM API latency distribution |
| agent.tool.executions | Counter | Tool call count by tool name and status |
| agent.handoffs | Counter | Handoff frequency between agent pairs |
| agent.conversation.duration | Histogram | End-to-end conversation time |
| agent.conversation.turns | Histogram | Number of turns per conversation |
| agent.conversation.resolution | Counter | Resolved vs escalated vs abandoned |
| agent.errors | Counter | Agent errors by type (LLM timeout, tool failure, etc.) |
| agent.active_conversations | UpDownCounter | Current concurrent conversations per agent |
| agent.cost.usd | Counter | Estimated cost per agent per model |
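The instrumentation shown earlier covers tokens, LLM latency, tool executions, and handoffs. The conversation-level metrics in the table can be recorded from your conversation lifecycle; the sketch below assumes hypothetical on_conversation_start / on_conversation_end hooks, so wire these to wherever your framework signals those events.
# Conversation-level metrics not covered by the earlier instrumentation.
# on_conversation_start / on_conversation_end are hypothetical hooks.
conversation_turns_histogram = meter.create_histogram(
    "agent.conversation.turns",
    description="Number of turns per conversation",
)
resolution_counter = meter.create_counter(
    "agent.conversation.resolution",
    description="Conversation outcomes: resolved, escalated, or abandoned",
)
active_conversations = meter.create_up_down_counter(
    "agent.active_conversations",
    description="Current concurrent conversations per agent",
)

def on_conversation_start(agent_name: str):
    active_conversations.add(1, {"agent.name": agent_name})

def on_conversation_end(agent_name: str, outcome: str, turns: int, duration_s: float):
    active_conversations.add(-1, {"agent.name": agent_name})
    resolution_counter.add(1, {"agent.name": agent_name, "outcome": outcome})
    conversation_turns_histogram.record(turns, {"agent.name": agent_name})
    conversation_duration_histogram.record(duration_s, {"agent.name": agent_name})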
Grafana Dashboard Configuration
Configure Grafana to read from your Prometheus/Mimir backend that receives OTel metrics:
# Key panels from grafana/dashboards/agent-overview.json, shown as simplified YAML for readability
panels:
  - title: "Token Consumption (last 1h)"
    type: timeseries
    targets:
      - expr: "sum(rate(agent_llm_tokens_total[5m])) by (agent_name, model)"
        legendFormat: "{{agent_name}} / {{model}}"
  - title: "LLM Latency P95"
    type: stat
    targets:
      - expr: "histogram_quantile(0.95, sum(rate(agent_llm_duration_bucket[5m])) by (le, agent_name))"
        legendFormat: "{{agent_name}}"
  - title: "Tool Error Rate"
    type: gauge
    targets:
      - expr: >
          sum(rate(agent_tool_executions_total{status="error"}[5m])) by (tool_name)
          /
          sum(rate(agent_tool_executions_total[5m])) by (tool_name)
          * 100
  - title: "Handoff Sankey"
    type: nodeGraph
    targets:
      - expr: "sum(increase(agent_handoffs_total[1h])) by (from_agent, to_agent)"
  - title: "Active Conversations"
    type: timeseries
    targets:
      - expr: "sum(agent_active_conversations) by (agent_name)"
  - title: "Estimated Hourly Cost"
    type: stat
    targets:
      - expr: "sum(increase(agent_cost_usd_total[1h])) by (agent_name)"
Cost Estimation Metric
Track estimated cost as a first-class metric:
# Approximate per-1K-token rates; keep this table in sync with current provider pricing
COST_PER_1K_TOKENS = {
    "claude-3-5-haiku-20241022": {"input": 0.001, "output": 0.005},
    "claude-sonnet-4-20250514": {"input": 0.003, "output": 0.015},
    "claude-opus-4-20250514": {"input": 0.015, "output": 0.075},
}

cost_counter = meter.create_counter("agent.cost.usd", unit="USD", description="Estimated LLM cost")

def record_cost(model: str, input_tokens: int, output_tokens: int, agent_name: str):
    rates = COST_PER_1K_TOKENS.get(model, {"input": 0.01, "output": 0.03})
    cost = (input_tokens / 1000 * rates["input"]) + (output_tokens / 1000 * rates["output"])
    cost_counter.add(cost, {"agent.name": agent_name, "model": model})
Alerting Strategies for Agent Systems
Critical Alerts
Set up PagerDuty or Opsgenie alerts for these conditions:
# Prometheus alerting rules
groups:
  - name: agent-critical
    rules:
      - alert: AgentToolErrorRateHigh
        expr: >
          sum(rate(agent_tool_executions_total{status="error"}[5m])) by (tool_name)
          /
          sum(rate(agent_tool_executions_total[5m])) by (tool_name)
          > 0.1
        for: 3m
        labels:
          severity: critical
        annotations:
          summary: "Tool {{ $labels.tool_name }} error rate above 10%"
      - alert: AgentLLMLatencyP99High
        expr: >
          histogram_quantile(0.99, sum(rate(agent_llm_duration_bucket[5m])) by (le))
          > 30000
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "LLM P99 latency exceeds 30 seconds"
      - alert: AgentConversationQueueBacklog
        expr: agent_active_conversations > 100
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Agent {{ $labels.agent_name }} has 100+ active conversations"
      - alert: AgentTokenBudgetExceeded
        expr: >
          sum(increase(agent_cost_usd_total[1h])) > 500
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Hourly agent cost exceeded $500 budget"
Anomaly Detection Alerts
Beyond threshold-based alerts, use anomaly detection for metrics that vary by time of day (an example rule follows this list):
- Token consumption spikes: A prompt injection attack or infinite loop in agent reasoning will cause sudden token spikes.
- Handoff loops: If agent A hands to B and B immediately hands back to A, that is a handoff loop that burns tokens without progress.
- Conversation duration outliers: A conversation lasting 10x the median suggests an agent is stuck or the user is being bounced between agents.
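A simple way to catch time-of-day-aware spikes without a fixed threshold is to compare current throughput against the same window one week earlier. Here is a sketch of such a rule for token consumption, assuming your metrics backend retains at least a week of history; the 3x multiplier is an arbitrary starting point to tune.
- alert: AgentTokenConsumptionAnomaly
  expr: >
    sum(rate(agent_llm_tokens_total[15m])) by (agent_name)
    >
    3 * sum(rate(agent_llm_tokens_total[15m] offset 1w)) by (agent_name)
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "Agent {{ $labels.agent_name }} token consumption is 3x higher than the same time last week"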
Log Correlation with Traces
Structure your agent logs so they can be correlated with traces in Grafana:
import logging

from opentelemetry.trace import get_current_span

class AgentLogger:
    def __init__(self, agent_name: str):
        self.agent_name = agent_name
        self.logger = logging.getLogger(agent_name)

    def info(self, message: str, **kwargs):
        # get_current_span() returns an invalid (no-op) span when no span is active
        context = get_current_span().get_span_context()
        valid = context.is_valid
        self.logger.info(
            message,
            extra={
                "trace_id": format(context.trace_id, "032x") if valid else None,
                "span_id": format(context.span_id, "016x") if valid else None,
                "agent_name": self.agent_name,
                **kwargs,
            },
        )
With trace_id in every log line, clicking a trace in Grafana Tempo shows the correlated logs in Loki. This makes debugging agent behavior dramatically faster — you see the trace timeline and the detailed log messages side by side.
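For the trace_id passed via extra to actually appear in the shipped log line, the log handler needs a structured formatter. A minimal sketch using only the standard library follows; the field names are assumptions, so match them to whatever your Loki pipeline parses.
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render log records as JSON so Loki can index trace_id and span_id."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "ts": self.formatTime(record),
            "level": record.levelname,
            "message": record.getMessage(),
            # Fields injected via extra= by AgentLogger; None outside a span
            "trace_id": getattr(record, "trace_id", None),
            "span_id": getattr(record, "span_id", None),
            "agent_name": getattr(record, "agent_name", None),
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logging.getLogger("billing-agent").addHandler(handler)
In Grafana, a derived field on trace_id in the Loki data source links each log line back to its trace in Tempo, completing the two-way correlation.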
OpenTelemetry Collector Configuration
Deploy the OTel Collector as a DaemonSet in Kubernetes to receive telemetry from all agent pods and export to your backends:
apiVersion: v1
kind: ConfigMap
metadata:
  name: otel-collector-config
  namespace: agentic-ai
data:
  config.yaml: |
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
    processors:
      batch:
        timeout: 5s
        send_batch_size: 1024
      memory_limiter:
        check_interval: 1s
        limit_mib: 512
    exporters:
      prometheusremotewrite:
        endpoint: "http://mimir:9009/api/v1/push"
      otlp/tempo:
        endpoint: "tempo:4317"
        tls:
          insecure: true
      loki:
        endpoint: "http://loki:3100/loki/api/v1/push"
    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [memory_limiter, batch]
          exporters: [otlp/tempo]
        metrics:
          receivers: [otlp]
          processors: [memory_limiter, batch]
          exporters: [prometheusremotewrite]
        # Only needed if agents ship logs via OTLP; otherwise send logs to Loki with a log agent
        logs:
          receivers: [otlp]
          processors: [memory_limiter, batch]
          exporters: [loki]
Frequently Asked Questions
How much overhead does OpenTelemetry add to agent performance?
With the batch span processor and reasonable sampling, OTel adds less than 2ms of overhead per request. The bigger consideration is the cardinality of your metrics — avoid using conversation_id or user_id as metric labels since that creates unbounded cardinality. Use those as span attributes in traces instead.
Should I trace every LLM call or sample?
In production, trace 100% of conversations but sample LLM call details. Record the span for every LLM call (for latency tracking) but only attach full prompt and response content to 5-10% of traces (for debugging). Use a tail-based sampler that always captures traces with errors or high latency.
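For the tail-based part, here is a sketch of a sampling processor for the collector config shown earlier, assuming the otel-collector-contrib distribution (which ships the tail_sampling processor); traces with errors or high latency are always kept, the rest are sampled at 10%.
processors:
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: keep-errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: keep-slow-traces
        type: latency
        latency:
          threshold_ms: 10000
      - name: sample-the-rest
        type: probabilistic
        probabilistic:
          sampling_percentage: 10
Add tail_sampling to the traces pipeline's processors list after memory_limiter so the sampling decision happens before export.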
What is the best way to track conversation quality without human review?
Use LLM-as-a-judge: run a separate evaluation model that scores a sample of conversation transcripts on helpfulness, accuracy, and safety. Record these scores as metrics. Set alerts when average scores drop below your baseline. This is not as reliable as human review but scales to millions of conversations.
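A sketch of recording judge scores as metrics, assuming a hypothetical judge_conversation() evaluator that returns per-dimension scores between 0 and 1:
# Quality scores from an LLM-as-a-judge evaluator, recorded as a histogram
# so Grafana can plot averages and percentiles per agent and dimension.
# judge_conversation() is a hypothetical evaluator call.
quality_score_histogram = meter.create_histogram(
    "agent.conversation.quality_score",
    description="LLM-as-a-judge quality score per sampled conversation",
)

async def evaluate_conversation_sample(agent_name: str, transcript: str):
    scores = await judge_conversation(transcript)  # e.g. {"helpfulness": 0.9, "accuracy": 0.8}
    for dimension, score in scores.items():
        quality_score_histogram.record(
            score,
            {"agent.name": agent_name, "dimension": dimension},
        )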
How do I debug a specific conversation that a user reported as broken?
Search traces by conversation_id in Grafana Tempo. The trace shows every agent that handled the conversation, every LLM call with model and token counts, every tool execution with inputs and outputs, and every handoff with context. Correlated logs in Loki show detailed debug messages. This is faster than reading database records.
How much storage does full agent observability require?
With 1 million conversations per month, expect approximately 50GB for traces (with 10% detail sampling), 5GB for metrics (with proper aggregation), and 20GB for structured logs (with 30-day retention). The storage cost is a fraction of the LLM API cost and pays for itself on the first production incident you debug quickly.
CallSphere Team
Expert insights on AI voice agents and customer communication automation.