AI Agent Observability: Tracing, Logging, and Monitoring with OpenTelemetry
Set up production observability for AI agents with distributed tracing across agent calls, structured logging, metrics dashboards, and alert patterns using OpenTelemetry.
Why Agent Observability Is Different from Traditional APM
Traditional application performance monitoring tracks HTTP requests through a call stack: request arrives, hits middleware, queries the database, returns a response. The flow is deterministic and the duration is measured in milliseconds.
AI agent execution is fundamentally different. An agent receives a prompt, reasons about it (often in multiple loops), calls tools, evaluates results, may call more tools, and eventually produces an output. The execution path is non-deterministic — the same input may produce different tool call sequences. Duration ranges from 500ms for a simple lookup to 3 minutes for a multi-step research task. And the most expensive resource is not CPU or memory but LLM API tokens.
Standard APM tools will tell you "this endpoint took 4.2 seconds." Agent observability must tell you: "This agent made 3 LLM calls, invoked 2 tools, consumed 12,400 tokens costing $0.037, and the second tool call failed with a timeout before the agent self-corrected."
Setting Up OpenTelemetry for AI Agents
OpenTelemetry (OTel) is the industry-standard observability framework. It provides three signals — traces, metrics, and logs — with vendor-neutral instrumentation that exports to any backend (Jaeger, Grafana Tempo, Datadog, Honeycomb).
from opentelemetry import trace, metrics
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import (
    PeriodicExportingMetricReader,
)
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import (
    OTLPSpanExporter,
)
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import (
    OTLPMetricExporter,
)

def setup_observability(service_name: str = "ai-agent-service"):
    # Traces
    trace_provider = TracerProvider()
    trace_provider.add_span_processor(
        BatchSpanProcessor(OTLPSpanExporter())
    )
    trace.set_tracer_provider(trace_provider)

    # Metrics
    metric_reader = PeriodicExportingMetricReader(
        OTLPMetricExporter(), export_interval_millis=10_000
    )
    meter_provider = MeterProvider(metric_readers=[metric_reader])
    metrics.set_meter_provider(meter_provider)

    return (
        trace.get_tracer(service_name),
        metrics.get_meter(service_name),
    )

tracer, meter = setup_observability()
Distributed Tracing Across Agent Calls
The core of agent observability is the trace. Each user request creates a root span, and every significant operation within the agent creates a child span. This produces a trace tree that shows exactly what happened, in what order, and how long each step took.
from opentelemetry import trace
from opentelemetry.trace import StatusCode
import time

tracer = trace.get_tracer("agent-service")

class TracedAgent:
    def __init__(self, name: str, model: str):
        self.name = name
        self.model = model

    async def run(self, user_message: str) -> str:
        with tracer.start_as_current_span(
            "agent.run",
            attributes={
                "agent.name": self.name,
                "agent.model": self.model,
                "input.length": len(user_message),
            },
        ) as span:
            try:
                # Step 1: LLM reasoning
                response = await self._call_llm(user_message)

                # Step 2: Tool calls (if any)
                tool_results = []
                for tool_call in response.get("tool_calls", []):
                    result = await self._execute_tool(tool_call)
                    tool_results.append(result)

                # Step 3: Final response
                if tool_results:
                    final = await self._call_llm_with_results(
                        user_message, tool_results
                    )
                else:
                    final = response["content"]

                span.set_attribute("output.length", len(final))
                span.set_status(StatusCode.OK)
                return final
            except Exception as e:
                span.set_status(StatusCode.ERROR, str(e))
                span.record_exception(e)
                raise

    async def _call_llm(self, prompt: str) -> dict:
        with tracer.start_as_current_span(
            "llm.call",
            attributes={
                "llm.model": self.model,
                # Rough estimate; replace with the provider's
                # reported token counts in real code.
                "llm.prompt_tokens": len(prompt) // 4,
            },
        ) as span:
            start = time.time()
            # Actual LLM call here
            result = {"content": "response", "tool_calls": []}
            duration = time.time() - start

            span.set_attribute("llm.duration_seconds", duration)
            span.set_attribute(
                "llm.completion_tokens",
                len(result["content"]) // 4,
            )
            span.set_attribute(
                "llm.total_tokens",
                len(prompt) // 4 + len(result["content"]) // 4,
            )
            return result

    async def _execute_tool(self, tool_call: dict) -> dict:
        with tracer.start_as_current_span(
            "tool.execute",
            attributes={
                "tool.name": tool_call["name"],
                "tool.input_size": len(str(tool_call.get("args", {}))),
            },
        ) as span:
            try:
                result = await self._run_tool(
                    tool_call["name"], tool_call.get("args", {})
                )
                span.set_attribute("tool.success", True)
                span.set_attribute(
                    "tool.output_size", len(str(result))
                )
                return result
            except Exception as e:
                span.set_attribute("tool.success", False)
                span.set_attribute("tool.error", str(e))
                span.set_status(StatusCode.ERROR, str(e))
                raise

    async def _run_tool(self, name: str, args: dict) -> dict:
        return {"result": f"Tool {name} executed"}

    async def _call_llm_with_results(self, prompt: str,
                                     results: list) -> str:
        return "Final response with tool results"
Each span in the trace carries structured attributes: the agent name, model used, token counts, tool names, success/failure status, and timing. When you view this trace in Jaeger or Grafana Tempo, you see the entire agent execution as a tree with timing bars for each operation.
Structured Logging for Agents
Logs complement traces by capturing detailed context that does not fit in span attributes. Use structured JSON logging with correlation IDs that link logs to traces.
import structlog
from opentelemetry import trace

def add_trace_context(logger, method_name, event_dict):
    span = trace.get_current_span()
    if span and span.is_recording():
        ctx = span.get_span_context()
        event_dict["trace_id"] = format(ctx.trace_id, "032x")
        event_dict["span_id"] = format(ctx.span_id, "016x")
    return event_dict

def setup_structured_logging():
    structlog.configure(
        processors=[
            structlog.contextvars.merge_contextvars,
            structlog.processors.add_log_level,
            structlog.processors.TimeStamper(fmt="iso"),
            add_trace_context,
            structlog.processors.JSONRenderer(),
        ],
        logger_factory=structlog.stdlib.LoggerFactory(),
    )

logger = structlog.get_logger()

# Usage in agent code
async def handle_agent_task(task_id: str, user_input: str):
    log = logger.bind(task_id=task_id)
    log.info("agent_task_started",
             input_length=len(user_input),
             agent="billing_specialist")

    # After LLM call
    log.info("llm_call_completed",
             model="gpt-4.1",
             prompt_tokens=1240,
             completion_tokens=380,
             duration_ms=1850,
             cost_usd=0.0124)

    # After tool call
    log.info("tool_executed",
             tool_name="lookup_invoice",
             success=True,
             duration_ms=45)

    # On error
    log.error("tool_execution_failed",
              tool_name="process_refund",
              error="connection_timeout",
              retry_attempt=2)
What to Log vs What to Trace
Trace: The structure and timing of execution (what happened in what order and how long it took). Use spans for LLM calls, tool executions, agent handoffs, and the overall request lifecycle.
Log: The details and context within each step (what the LLM was asked, what the tool returned, why a decision was made). Logs are searchable and filterable; traces show relationships.
Neither: Full prompt text and full LLM responses in production (too large, may contain PII). Store these in a separate audit system with appropriate access controls if needed for debugging.
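As a sketch of that last category, the log entry can carry a content hash and size instead of the raw text, while the full prompt/response pair goes to a separate audit store. The function name and field names here are illustrative, not a standard schema:

```python
import hashlib

def log_safe_summary(prompt: str, response: str) -> dict:
    """Summary attributes that are safe to log in place of full
    prompt/response text. Hashes support deduplication and lookup
    in a separate audit store; lengths support size monitoring."""
    return {
        "prompt_hash": hashlib.sha256(prompt.encode("utf-8")).hexdigest()[:16],
        "prompt_length": len(prompt),
        "response_hash": hashlib.sha256(response.encode("utf-8")).hexdigest()[:16],
        "response_length": len(response),
    }
```

The truncated SHA-256 prefix is stable across runs, so the same prompt always maps to the same hash, which makes "how often do we see this exact prompt?" a simple log query.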
Agent-Specific Metrics
Beyond traces and logs, agent systems need custom metrics that capture agent-specific behavior patterns.
from opentelemetry import metrics

meter = metrics.get_meter("agent-service")

# Token usage
token_counter = meter.create_counter(
    "agent.tokens.total",
    description="Total tokens consumed by agent LLM calls",
    unit="tokens",
)

# Cost tracking
cost_counter = meter.create_counter(
    "agent.cost.usd",
    description="Cumulative LLM API cost in USD",
    unit="usd",
)

# Agent latency
agent_duration = meter.create_histogram(
    "agent.task.duration",
    description="End-to-end agent task duration",
    unit="seconds",
)

# Tool success rate and latency
tool_calls = meter.create_counter(
    "agent.tool.calls",
    description="Number of tool invocations",
)
tool_duration = meter.create_histogram(
    "agent.tool.duration",
    description="Tool execution latency",
    unit="seconds",
)

# Escalation rate
escalations = meter.create_counter(
    "agent.escalations",
    description="Number of tasks escalated to supervisor or human",
)

# Usage in agent code
def record_llm_call(model: str, prompt_tokens: int,
                    completion_tokens: int, cost: float):
    # Record prompt and completion separately; sum them at query
    # time rather than also emitting a "total" series, which would
    # double-count when aggregating across the "type" label.
    token_counter.add(
        prompt_tokens, {"model": model, "type": "prompt"}
    )
    token_counter.add(
        completion_tokens, {"model": model, "type": "completion"}
    )
    cost_counter.add(cost, {"model": model})

def record_tool_call(tool_name: str, success: bool,
                     duration_s: float):
    tool_calls.add(1, {
        "tool": tool_name,
        "success": str(success),
    })
    tool_duration.record(duration_s, {"tool": tool_name})

def record_escalation(agent_name: str, reason: str):
    escalations.add(1, {
        "agent": agent_name,
        "reason": reason,
    })
Building Dashboards
The metrics above power four critical dashboards:
Agent Performance Dashboard — Shows task completion rate, average duration, error rate, and escalation rate per agent. This is the first dashboard your on-call team looks at when something goes wrong.
Token and Cost Dashboard — Tracks token consumption and cost per model, per agent, and per hour. Set alerts when hourly spend exceeds 2x the rolling average. This catches prompt injection attacks (which inflate token usage) and regression bugs (which increase LLM call counts).
Tool Health Dashboard — Monitors tool invocation counts, success rates, and latency. A failing external API shows up here before it cascades into agent errors.
Trace Explorer — A searchable interface for individual traces. Filter by agent name, duration, error status, or token count. Use this for debugging specific user-reported issues.
Alert Patterns for Production Agents
# Alert rule definitions (conceptual Prometheus/Grafana-style conditions)
ALERT_RULES = {
    "high_error_rate": {
        "condition": "rate(agent.tool.calls{success='False'}[5m]) "
                     "/ rate(agent.tool.calls[5m]) > 0.15",
        "severity": "critical",
        "action": "Page on-call, check tool dependencies",
    },
    "token_cost_spike": {
        "condition": "rate(agent.cost.usd[1h]) > "
                     "2 * avg_over_time(agent.cost.usd[7d])",
        "severity": "warning",
        "action": "Check for prompt injection or agent loops",
    },
    "high_latency": {
        "condition": "histogram_quantile(0.95, "
                     "agent.task.duration) > 30",
        "severity": "warning",
        "action": "Check LLM provider status, review tool latency",
    },
    "escalation_spike": {
        "condition": "rate(agent.escalations[15m]) > "
                     "3 * avg_over_time(agent.escalations[24h])",
        "severity": "warning",
        "action": "Check specialist agent health, review recent "
                  "model or prompt changes",
    },
}
The most important alert is the token cost spike. A runaway agent loop can burn through thousands of dollars in minutes. Always set a hard per-request token budget in your agent code as a circuit breaker, independent of the alert.
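A minimal sketch of that circuit breaker, assuming token counts are available after each LLM call (the class name and the default limit are illustrative, not from any library):

```python
class TokenBudgetExceeded(Exception):
    """Raised when a single request exceeds its token ceiling."""

class TokenBudget:
    """Hard per-request token ceiling enforced in agent code,
    independent of any alerting pipeline. Create one per request
    and charge it after every LLM call."""

    def __init__(self, max_tokens: int = 50_000):
        self.max_tokens = max_tokens
        self.used = 0

    def charge(self, tokens: int) -> None:
        # Count first, then check, so the exception message reports
        # the actual overshoot.
        self.used += tokens
        if self.used > self.max_tokens:
            raise TokenBudgetExceeded(
                f"request used {self.used} tokens "
                f"(budget {self.max_tokens})"
            )
```

In the agent loop, `budget.charge(prompt_tokens + completion_tokens)` after each call turns a runaway loop into a single failed request instead of an unbounded bill.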
Tracing Multi-Agent Handoffs
When agents hand off to other agents, the trace must follow the conversation across agent boundaries. Use OpenTelemetry context propagation to link spans across agents.
from opentelemetry import trace
from opentelemetry.context import attach, detach
from opentelemetry.trace.propagation.tracecontext import (
    TraceContextTextMapPropagator,
)

tracer = trace.get_tracer("agent-service")
propagator = TraceContextTextMapPropagator()

async def handoff_to_agent(target_agent, message: str,
                           context: dict):
    # Inject trace context into the handoff message
    carrier = {}
    propagator.inject(carrier)
    context["trace_carrier"] = carrier
    # Target agent extracts and continues the trace
    return await target_agent.handle_handoff(message, context)

# Method on the receiving agent class:
async def handle_handoff(self, message: str, context: dict):
    carrier = context.get("trace_carrier", {})
    ctx = propagator.extract(carrier)
    token = attach(ctx)
    try:
        with tracer.start_as_current_span(
            "agent.handoff.receive",
            attributes={
                "agent.name": self.name,
                "handoff.source": context.get("source_agent"),
            },
        ):
            return await self.run(message)
    finally:
        detach(token)
This ensures that a single trace spans the entire user journey, even if it crosses five different agents. In your trace viewer, you see the complete story: triage classified the request (200ms), billing specialist looked up the invoice (1.2s), and the supervisor approved the refund (800ms).
FAQ
What is the overhead of OpenTelemetry instrumentation?
Minimal when configured correctly. The BatchSpanProcessor buffers spans and exports them asynchronously, adding less than 1ms of overhead per span. Metric counters are lock-free atomic operations. The main cost is serialization and network export, which happens in background threads. In benchmarks, OTel adds less than 2% overhead to overall request latency.
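The knobs that control this overhead live on the processor itself. A configuration sketch, with values matching the Python SDK defaults at the time of writing (check your SDK version, as defaults can change):

```python
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import (
    BatchSpanProcessor,
    ConsoleSpanExporter,
)

provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(
        ConsoleSpanExporter(),          # swap for your OTLP exporter
        max_queue_size=2048,            # spans buffered before new ones are dropped
        schedule_delay_millis=5000,     # how often the background thread exports
        max_export_batch_size=512,      # spans sent per export call
        export_timeout_millis=30000,    # deadline for one export attempt
    )
)
```

Raising `max_queue_size` trades memory for fewer dropped spans under burst load; lowering `schedule_delay_millis` trades export frequency for fresher data in your backend.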
Should you log full LLM prompts and responses?
Not in production logs. Full prompts and completions can contain PII, are large (inflating log storage costs), and are rarely needed in real-time. Instead, log summary attributes: token counts, model used, whether tools were called, and a content hash for deduplication. Store full prompt/response pairs in a separate audit system with retention policies and access controls for post-incident investigation.
How do you trace agents that use streaming responses?
Create the span when the stream starts and end it when the stream completes. Record first-token latency and total-token latency as separate attributes. For agents that make decisions mid-stream (processing streaming tool call arguments), create child spans for each decision point within the stream.
What observability backend works best for AI agents?
Any OpenTelemetry-compatible backend works. Grafana Cloud (Tempo for traces, Loki for logs, Mimir for metrics) is popular for self-hosted stacks. Datadog and Honeycomb provide managed solutions with good AI-specific features. The key is choosing a backend that supports high-cardinality attributes (agent name, model, tool name) and long trace durations (minutes, not milliseconds).
Written by
CallSphere Team
Expert insights on AI voice agents and customer communication automation.