Agentic AI Observability: OpenTelemetry, Grafana, and Custom Agent Metrics
Build a full observability stack for agentic AI with OpenTelemetry tracing, Grafana dashboards, custom agent metrics, and alerting strategies.
Why Traditional Observability Falls Short for Agentic AI
Standard application monitoring tracks HTTP status codes, request latency, and error rates. These metrics tell you when your web server is healthy. They tell you almost nothing about whether your AI agents are performing well.
An agent can return HTTP 200 on every request while producing hallucinated responses, calling the wrong tools, handing off to the wrong specialist, or burning through token budgets 10x faster than expected. Agent-specific observability requires tracking a fundamentally different set of signals: LLM response quality, tool execution success rates, handoff patterns, token consumption, conversation resolution rates, and multi-agent trace propagation.
At CallSphere, our observability stack processes millions of agent telemetry events daily across voice and chat channels. This guide covers the instrumentation patterns, custom metrics, and dashboarding strategies we rely on.
OpenTelemetry Instrumentation for Agent Systems
OpenTelemetry (OTel) provides a vendor-neutral standard for traces, metrics, and logs. For agentic AI, the key is creating meaningful spans that represent agent-level operations rather than just HTTP calls.
Setting Up the OTel SDK
from opentelemetry import trace, metrics
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.sdk.resources import Resource

resource = Resource.create({
    "service.name": "triage-agent",
    "service.version": "2.4.1",
    "deployment.environment": "production",
    "agent.type": "triage",
})

# Traces
trace_provider = TracerProvider(resource=resource)
trace_provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="otel-collector:4317"))
)
trace.set_tracer_provider(trace_provider)

# Metrics
metric_reader = PeriodicExportingMetricReader(
    OTLPMetricExporter(endpoint="otel-collector:4317"),
    export_interval_millis=15000,
)
meter_provider = MeterProvider(resource=resource, metric_readers=[metric_reader])
metrics.set_meter_provider(meter_provider)

tracer = trace.get_tracer("agentic-ai")
meter = metrics.get_meter("agentic-ai")
Creating Agent-Specific Spans
The critical insight is that each agent operation — receiving a message, thinking, calling a tool, generating a response, handing off — should be a distinct span within a conversation trace.
import time

# Define custom metrics
llm_token_counter = meter.create_counter(
    "agent.llm.tokens",
    description="Total LLM tokens consumed",
    unit="tokens",
)
llm_duration_histogram = meter.create_histogram(
    "agent.llm.duration",
    description="LLM API call duration",
    unit="ms",
)
tool_execution_counter = meter.create_counter(
    "agent.tool.executions",
    description="Tool execution count by tool and status",
)
handoff_counter = meter.create_counter(
    "agent.handoffs",
    description="Agent handoff count",
)
conversation_duration_histogram = meter.create_histogram(
    "agent.conversation.duration",
    description="Total conversation duration",
    unit="seconds",
)

class InstrumentedAgent:
    def __init__(self, agent_name: str):
        self.agent_name = agent_name

    async def process_message(self, conversation_id: str, message: str):
        with tracer.start_as_current_span(
            "agent.process_message",
            attributes={
                "agent.name": self.agent_name,
                "conversation.id": conversation_id,
                "message.length": len(message),
            },
        ) as span:
            # Step 1: Think (LLM call)
            with tracer.start_as_current_span("agent.llm_call") as llm_span:
                start = time.time()
                response = await self.call_llm(message)
                duration_ms = (time.time() - start) * 1000
                llm_span.set_attribute("llm.model", response.model)
                llm_span.set_attribute("llm.input_tokens", response.usage.input_tokens)
                llm_span.set_attribute("llm.output_tokens", response.usage.output_tokens)
                llm_token_counter.add(
                    response.usage.input_tokens + response.usage.output_tokens,
                    {"agent.name": self.agent_name, "model": response.model, "type": "total"},
                )
                llm_duration_histogram.record(
                    duration_ms,
                    {"agent.name": self.agent_name, "model": response.model},
                )

            # Step 2: Execute tools if needed
            if response.tool_calls:
                for tool_call in response.tool_calls:
                    with tracer.start_as_current_span(
                        "agent.tool_execution",
                        attributes={
                            "tool.name": tool_call.name,
                            "tool.input_size": len(str(tool_call.input)),
                        },
                    ) as tool_span:
                        try:
                            result = await self.execute_tool(tool_call)
                            tool_span.set_attribute("tool.status", "success")
                            tool_execution_counter.add(1, {
                                "agent.name": self.agent_name,
                                "tool.name": tool_call.name,
                                "status": "success",
                            })
                        except Exception as e:
                            tool_span.set_attribute("tool.status", "error")
                            tool_span.set_attribute("tool.error", str(e))
                            tool_execution_counter.add(1, {
                                "agent.name": self.agent_name,
                                "tool.name": tool_call.name,
                                "status": "error",
                            })

            # Step 3: Handoff if needed
            if response.handoff_target:
                with tracer.start_as_current_span(
                    "agent.handoff",
                    attributes={
                        "handoff.from": self.agent_name,
                        "handoff.to": response.handoff_target,
                        "handoff.reason": response.handoff_reason,
                    },
                ):
                    handoff_counter.add(1, {
                        "from_agent": self.agent_name,
                        "to_agent": response.handoff_target,
                    })
                    await self.handoff(response.handoff_target, conversation_id)

            return response
Distributed Tracing Across Agents
In a multi-agent system, a single user conversation may traverse the triage agent, a specialist agent, tool executors, and back. The trace must follow the entire journey.
Trace Context Propagation
When one agent hands off to another, propagate the trace context:
import json

from opentelemetry.context import attach, detach
from opentelemetry.propagate import inject, extract

async def handoff_to_agent(target_agent: str, conversation_id: str, context: dict):
    """Hand off conversation with trace context."""
    # Inject current trace context into headers
    headers = {}
    inject(headers)
    message = {
        "conversation_id": conversation_id,
        "context": context,
        "trace_headers": headers,  # Contains traceparent and tracestate
    }
    # nats_client is the agent's existing NATS connection
    await nats_client.publish(f"agents.{target_agent}.inbox", json.dumps(message).encode())

# On the receiving agent (a method of the agent class)
async def receive_handoff(self, message):
    """Receive handoff and continue the trace."""
    data = json.loads(message.data)
    # Extract trace context from the handoff message
    ctx = extract(data.get("trace_headers", {}))
    token = attach(ctx)
    try:
        with tracer.start_as_current_span(
            "agent.handle_handoff",
            attributes={"agent.name": self.agent_name},
        ):
            await self.process_conversation(data["conversation_id"], data["context"])
    finally:
        detach(token)
This produces traces that show the full conversation journey:
Trace: conv_abc123 (total: 12.4s)
|-- triage-agent.process_message (2.1s)
| |-- agent.llm_call (1.8s) [model: haiku, tokens: 450]
| |-- agent.handoff (0.1s) [to: billing-agent]
|
|-- billing-agent.handle_handoff (8.2s)
| |-- agent.llm_call (1.2s) [model: sonnet, tokens: 890]
| |-- agent.tool_execution (5.8s) [tool: lookup_invoice]
| |-- agent.llm_call (1.1s) [model: sonnet, tokens: 620]
|
|-- triage-agent.send_response (0.3s)
Custom Agent Metrics for Grafana Dashboards
Key Metrics to Track
Beyond standard infrastructure metrics, agentic AI systems need these agent-specific metrics:
| Metric | Type | Purpose |
|---|---|---|
| agent.llm.tokens | Counter | Token consumption by agent, model, direction |
| agent.llm.duration | Histogram | LLM API latency distribution |
| agent.tool.executions | Counter | Tool call count by tool name and status |
| agent.handoffs | Counter | Handoff frequency between agent pairs |
| agent.conversation.duration | Histogram | End-to-end conversation time |
| agent.conversation.turns | Histogram | Number of turns per conversation |
| agent.conversation.resolution | Counter | Resolved vs escalated vs abandoned |
| agent.errors | Counter | Agent errors by type (LLM timeout, tool failure, etc.) |
| agent.active_conversations | UpDownCounter | Current concurrent conversations per agent |
| agent.cost.usd | Counter | Estimated cost per agent per model |
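The instrumentation shown earlier covers tokens, LLM latency, tool executions, and handoffs. The conversation-level metrics in the table can be recorded from your conversation lifecycle; the sketch below assumes hypothetical on_conversation_start / on_conversation_end hooks, so wire these to wherever your framework signals those events.
# Conversation-level metrics not covered by the earlier instrumentation.
# on_conversation_start / on_conversation_end are hypothetical hooks.
conversation_turns_histogram = meter.create_histogram(
    "agent.conversation.turns",
    description="Number of turns per conversation",
)
resolution_counter = meter.create_counter(
    "agent.conversation.resolution",
    description="Conversation outcomes: resolved, escalated, or abandoned",
)
active_conversations = meter.create_up_down_counter(
    "agent.active_conversations",
    description="Current concurrent conversations per agent",
)

def on_conversation_start(agent_name: str):
    active_conversations.add(1, {"agent.name": agent_name})

def on_conversation_end(agent_name: str, outcome: str, turns: int, duration_s: float):
    active_conversations.add(-1, {"agent.name": agent_name})
    resolution_counter.add(1, {"agent.name": agent_name, "outcome": outcome})
    conversation_turns_histogram.record(turns, {"agent.name": agent_name})
    conversation_duration_histogram.record(duration_s, {"agent.name": agent_name})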
Grafana Dashboard Configuration
Configure Grafana to read from your Prometheus/Mimir backend that receives OTel metrics:
# Key panels from grafana/dashboards/agent-overview.json, shown as simplified YAML for readability
panels:
  - title: "Token Consumption (last 1h)"
    type: timeseries
    targets:
      - expr: "sum(rate(agent_llm_tokens_total[5m])) by (agent_name, model)"
        legendFormat: "{{agent_name}} / {{model}}"
  - title: "LLM Latency P95"
    type: stat
    targets:
      - expr: "histogram_quantile(0.95, sum(rate(agent_llm_duration_bucket[5m])) by (le, agent_name))"
        legendFormat: "{{agent_name}}"
  - title: "Tool Error Rate"
    type: gauge
    targets:
      - expr: >
          sum(rate(agent_tool_executions_total{status="error"}[5m])) by (tool_name)
          /
          sum(rate(agent_tool_executions_total[5m])) by (tool_name)
          * 100
  - title: "Handoff Sankey"
    type: nodeGraph
    targets:
      - expr: "sum(increase(agent_handoffs_total[1h])) by (from_agent, to_agent)"
  - title: "Active Conversations"
    type: timeseries
    targets:
      - expr: "sum(agent_active_conversations) by (agent_name)"
  - title: "Estimated Hourly Cost"
    type: stat
    targets:
      - expr: "sum(increase(agent_cost_usd_total[1h])) by (agent_name)"
Cost Estimation Metric
Track estimated cost as a first-class metric:
# Approximate per-1K-token rates; keep this table in sync with current provider pricing
COST_PER_1K_TOKENS = {
    "claude-3-5-haiku-20241022": {"input": 0.001, "output": 0.005},
    "claude-sonnet-4-20250514": {"input": 0.003, "output": 0.015},
    "claude-opus-4-20250514": {"input": 0.015, "output": 0.075},
}

cost_counter = meter.create_counter("agent.cost.usd", unit="USD", description="Estimated LLM cost")

def record_cost(model: str, input_tokens: int, output_tokens: int, agent_name: str):
    rates = COST_PER_1K_TOKENS.get(model, {"input": 0.01, "output": 0.03})
    cost = (input_tokens / 1000 * rates["input"]) + (output_tokens / 1000 * rates["output"])
    cost_counter.add(cost, {"agent.name": agent_name, "model": model})
Alerting Strategies for Agent Systems
Critical Alerts
Set up PagerDuty or Opsgenie alerts for these conditions:
# Prometheus alerting rules
groups:
  - name: agent-critical
    rules:
      - alert: AgentToolErrorRateHigh
        expr: >
          sum(rate(agent_tool_executions_total{status="error"}[5m])) by (tool_name)
          /
          sum(rate(agent_tool_executions_total[5m])) by (tool_name)
          > 0.1
        for: 3m
        labels:
          severity: critical
        annotations:
          summary: "Tool {{ $labels.tool_name }} error rate above 10%"
      - alert: AgentLLMLatencyP99High
        expr: >
          histogram_quantile(0.99, sum(rate(agent_llm_duration_bucket[5m])) by (le))
          > 30000
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "LLM P99 latency exceeds 30 seconds"
      - alert: AgentConversationQueueBacklog
        expr: agent_active_conversations > 100
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Agent {{ $labels.agent_name }} has 100+ active conversations"
      - alert: AgentTokenBudgetExceeded
        expr: >
          sum(increase(agent_cost_usd_total[1h])) > 500
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Hourly agent cost exceeded $500 budget"
Anomaly Detection Alerts
Beyond threshold-based alerts, use anomaly detection for metrics that vary by time of day (an example rule follows this list):
- Token consumption spikes: A prompt injection attack or infinite loop in agent reasoning will cause sudden token spikes.
- Handoff loops: If agent A hands to B and B immediately hands back to A, that is a handoff loop that burns tokens without progress.
- Conversation duration outliers: A conversation lasting 10x the median suggests an agent is stuck or the user is being bounced between agents.
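A simple way to catch time-of-day-aware spikes without a fixed threshold is to compare current throughput against the same window one week earlier. Here is a sketch of such a rule for token consumption, assuming your metrics backend retains at least a week of history; the 3x multiplier is an arbitrary starting point to tune.
- alert: AgentTokenConsumptionAnomaly
  expr: >
    sum(rate(agent_llm_tokens_total[15m])) by (agent_name)
    >
    3 * sum(rate(agent_llm_tokens_total[15m] offset 1w)) by (agent_name)
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "Agent {{ $labels.agent_name }} token consumption is 3x higher than the same time last week"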
Log Correlation with Traces
Structure your agent logs so they can be correlated with traces in Grafana:
import logging

from opentelemetry.trace import get_current_span

class AgentLogger:
    def __init__(self, agent_name: str):
        self.agent_name = agent_name
        self.logger = logging.getLogger(agent_name)

    def info(self, message: str, **kwargs):
        # get_current_span() returns an invalid (no-op) span when no span is active
        context = get_current_span().get_span_context()
        valid = context.is_valid
        self.logger.info(
            message,
            extra={
                "trace_id": format(context.trace_id, "032x") if valid else None,
                "span_id": format(context.span_id, "016x") if valid else None,
                "agent_name": self.agent_name,
                **kwargs,
            },
        )
With trace_id in every log line, clicking a trace in Grafana Tempo shows the correlated logs in Loki. This makes debugging agent behavior dramatically faster — you see the trace timeline and the detailed log messages side by side.
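For the trace_id passed via extra to actually appear in the shipped log line, the log handler needs a structured formatter. A minimal sketch using only the standard library follows; the field names are assumptions, so match them to whatever your Loki pipeline parses.
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render log records as JSON so Loki can index trace_id and span_id."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "ts": self.formatTime(record),
            "level": record.levelname,
            "message": record.getMessage(),
            # Fields injected via extra= by AgentLogger; None outside a span
            "trace_id": getattr(record, "trace_id", None),
            "span_id": getattr(record, "span_id", None),
            "agent_name": getattr(record, "agent_name", None),
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logging.getLogger("billing-agent").addHandler(handler)
In Grafana, a derived field on trace_id in the Loki data source links each log line back to its trace in Tempo, completing the two-way correlation.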
OpenTelemetry Collector Configuration
Deploy the OTel Collector as a DaemonSet in Kubernetes to receive telemetry from all agent pods and export to your backends:
apiVersion: v1
kind: ConfigMap
metadata:
  name: otel-collector-config
  namespace: agentic-ai
data:
  config.yaml: |
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
    processors:
      batch:
        timeout: 5s
        send_batch_size: 1024
      memory_limiter:
        check_interval: 1s
        limit_mib: 512
    exporters:
      prometheusremotewrite:
        endpoint: "http://mimir:9009/api/v1/push"
      otlp/tempo:
        endpoint: "tempo:4317"
        tls:
          insecure: true
      loki:
        endpoint: "http://loki:3100/loki/api/v1/push"
    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [memory_limiter, batch]
          exporters: [otlp/tempo]
        metrics:
          receivers: [otlp]
          processors: [memory_limiter, batch]
          exporters: [prometheusremotewrite]
        # Only needed if agents ship logs via OTLP; otherwise send logs to Loki with a log agent
        logs:
          receivers: [otlp]
          processors: [memory_limiter, batch]
          exporters: [loki]
Frequently Asked Questions
How much overhead does OpenTelemetry add to agent performance?
With the batch span processor and reasonable sampling, OTel adds less than 2ms of overhead per request. The bigger consideration is the cardinality of your metrics — avoid using conversation_id or user_id as metric labels since that creates unbounded cardinality. Use those as span attributes in traces instead.
Should I trace every LLM call or sample?
In production, trace 100% of conversations but sample LLM call details. Record the span for every LLM call (for latency tracking) but only attach full prompt and response content to 5-10% of traces (for debugging). Use a tail-based sampler that always captures traces with errors or high latency.
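For the tail-based part, here is a sketch of a sampling processor for the collector config shown earlier, assuming the otel-collector-contrib distribution (which ships the tail_sampling processor); traces with errors or high latency are always kept, the rest are sampled at 10%.
processors:
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: keep-errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: keep-slow-traces
        type: latency
        latency:
          threshold_ms: 10000
      - name: sample-the-rest
        type: probabilistic
        probabilistic:
          sampling_percentage: 10
Add tail_sampling to the traces pipeline's processors list after memory_limiter so the sampling decision happens before export.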
What is the best way to track conversation quality without human review?
Use LLM-as-a-judge: run a separate evaluation model that scores a sample of conversation transcripts on helpfulness, accuracy, and safety. Record these scores as metrics. Set alerts when average scores drop below your baseline. This is not as reliable as human review but scales to millions of conversations.
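A sketch of recording judge scores as metrics, assuming a hypothetical judge_conversation() evaluator that returns per-dimension scores between 0 and 1:
# Quality scores from an LLM-as-a-judge evaluator, recorded as a histogram
# so Grafana can plot averages and percentiles per agent and dimension.
# judge_conversation() is a hypothetical evaluator call.
quality_score_histogram = meter.create_histogram(
    "agent.conversation.quality_score",
    description="LLM-as-a-judge quality score per sampled conversation",
)

async def evaluate_conversation_sample(agent_name: str, transcript: str):
    scores = await judge_conversation(transcript)  # e.g. {"helpfulness": 0.9, "accuracy": 0.8}
    for dimension, score in scores.items():
        quality_score_histogram.record(
            score,
            {"agent.name": agent_name, "dimension": dimension},
        )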
How do I debug a specific conversation that a user reported as broken?
Search traces by conversation_id in Grafana Tempo. The trace shows every agent that handled the conversation, every LLM call with model and token counts, every tool execution with inputs and outputs, and every handoff with context. Correlated logs in Loki show detailed debug messages. This is faster than reading database records.
How much storage does full agent observability require?
With 1 million conversations per month, expect approximately 50GB for traces (with 10% detail sampling), 5GB for metrics (with proper aggregation), and 20GB for structured logs (with 30-day retention). The storage cost is a fraction of the LLM API cost and pays for itself on the first production incident you debug quickly.
CallSphere Team
Expert insights on AI voice agents and customer communication automation.