Technical Guides

Observability for AI Voice Agents: Distributed Tracing, Metrics, and Logs

A complete observability stack for AI voice agents — distributed tracing across STT/LLM/TTS, metrics, logs, and SLO dashboards.

The "it's slow sometimes" ticket

The worst voice-agent ticket you will ever get is "it's slow sometimes." Without proper observability you cannot tell if it was the carrier, the STT stage, the LLM first token, the tool call, or the TTS stream. With proper observability you can pull up one trace and see exactly which stage blew its budget.

This post walks through the observability stack CallSphere runs in production — distributed traces, RED metrics, structured logs, and SLO dashboards that fire alerts before customers notice.

per-call trace
  │
  ├── span: network_in
  ├── span: stt
  ├── span: llm_first_token
  ├── span: tool_call (repeated)
  ├── span: tts_first_frame
  └── span: network_out

Architecture overview

┌─────────────┐   OTLP   ┌─────────────┐
│ Voice edge  │────────► │ Collector   │
└─────────────┘          └──────┬──────┘
                                │
             ┌──────────────────┼──────────────────┐
             ▼                  ▼                  ▼
       ┌───────────┐     ┌───────────┐      ┌───────────┐
       │ Traces    │     │ Metrics   │      │ Logs      │
       │ (Tempo)   │     │ (Prom)    │      │ (Loki)    │
       └───────────┘     └───────────┘      └───────────┘
                                │
                                ▼
                         ┌───────────┐
                         │ Grafana   │
                         │ + alerts  │
                         └───────────┘

Prerequisites

  • OpenTelemetry SDK in your edge service.
  • A collector (OTel Collector).
  • Storage backends: Tempo/Jaeger for traces, Prometheus for metrics, Loki for logs.
  • Grafana for dashboards.
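For local experimentation, the whole stack can be sketched with Docker Compose. This is a minimal outline, not a production config — the image tags are the public defaults, and Tempo, Loki, and Prometheus each need their own config files, omitted here:

```yaml
services:
  otel-collector:
    image: otel/opentelemetry-collector-contrib
    ports: ["4317:4317"]   # OTLP gRPC in from the voice edge
  tempo:
    image: grafana/tempo
  prometheus:
    image: prom/prometheus
    ports: ["9090:9090"]
  loki:
    image: grafana/loki
  grafana:
    image: grafana/grafana
    ports: ["3000:3000"]
```

The collector fans out to the three backends; Grafana reads from all of them.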

Step-by-step walkthrough

1. Instrument spans per stage

import time

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="collector:4317", insecure=True))
)
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("voice-edge")

# stt(), llm_stream(), and current_call_id() are your pipeline's own helpers.
async def handle_turn(audio):
    with tracer.start_as_current_span("turn") as span:
        span.set_attribute("call_id", current_call_id())
        with tracer.start_as_current_span("stt") as s:
            text = await stt(audio)
            s.set_attribute("stt.chars", len(text))
        with tracer.start_as_current_span("llm") as s:
            first_token_at = None
            async for token in llm_stream(text):
                if first_token_at is None:
                    # Span.start_time is epoch nanoseconds in the OTel SDK,
                    # so measure in nanoseconds too.
                    first_token_at = time.time_ns()
                    s.set_attribute("llm.first_token_ms",
                                    (first_token_at - s.start_time) / 1e6)

2. Use the Call SID as the trace ID

The carrier Call SID is the one ID that everyone — ops, support, legal — agrees on. Derive the trace ID from it so you can paste a Call SID into Grafana and get the whole pipeline.

import hashlib

def trace_id_from_call_sid(sid: str) -> int:
    # Deterministic 128-bit trace ID derived from the carrier Call SID.
    return int.from_bytes(hashlib.sha256(sid.encode()).digest()[:16], "big")

3. Emit RED metrics

Rate, Errors, Duration — for every stage.

from prometheus_client import Counter, Histogram

STT_LAT = Histogram("stt_duration_seconds", "STT stage duration", buckets=[0.05, 0.1, 0.2, 0.5, 1, 2])
LLM_FT = Histogram("llm_first_token_seconds", "LLM first-token latency", buckets=[0.1, 0.2, 0.3, 0.5, 1])
ERRORS = Counter("stage_errors_total", "Errors by stage", ["stage"])
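Wiring these into the turn handler is then a matter of observing each stage. A small wrapper keeps it uniform (a sketch; the metric definitions are repeated so the snippet stands alone, and `timed_stage` is an illustrative helper, not part of `prometheus_client`):

```python
import time

from prometheus_client import Counter, Histogram

STT_LAT = Histogram("stt_duration_seconds", "STT stage duration",
                    buckets=[0.05, 0.1, 0.2, 0.5, 1, 2])
ERRORS = Counter("stage_errors_total", "Errors by stage", ["stage"])

def timed_stage(stage, hist, fn, *args):
    # Record duration on every call (success or failure) and
    # bump the per-stage error counter on exceptions.
    start = time.perf_counter()
    try:
        return fn(*args)
    except Exception:
        ERRORS.labels(stage=stage).inc()
        raise
    finally:
        hist.observe(time.perf_counter() - start)
```

Rate comes for free from `rate(stt_duration_seconds_count[5m])`, so the histogram plus the error counter covers all three RED signals per stage.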

4. Structured logs with trace context

import structlog
from opentelemetry import trace

log = structlog.get_logger()

# Stamp every log line with the active trace ID so logs and traces correlate.
tid = format(trace.get_current_span().get_span_context().trace_id, "032x")
log.info("call_end", call_id=sid, trace_id=tid, outcome="resolved", duration_sec=184)

5. Define SLOs

  • Turn latency p95 < 1.2s
  • STT error rate < 0.5%
  • LLM 5xx < 0.1%
  • Carrier answer rate > 99%

6. Build dashboards and burn-rate alerts

Use multi-window multi-burn-rate alerts so you catch fast and slow SLO burns before they become incidents.

groups:
  - name: voice-slo
    rules:
      - alert: HighTurnLatency
        expr: histogram_quantile(0.95, sum(rate(turn_duration_seconds_bucket[5m])) by (le)) > 1.2
        for: 5m
        labels: {severity: page}
        annotations: {summary: "Turn p95 latency over 1.2s"}
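The rule above is a single-window threshold. A multi-window multi-burn-rate pair for the STT error-rate SLO (0.5% budget) might look like the following sketch, using the standard 14.4x/6x burn rates over paired short and long windows; `stt_requests_total` is an assumed request counter:

```yaml
      - alert: STTErrorBudgetFastBurn
        # 14.4x burn exhausts a 30-day budget in ~2 days: page.
        expr: >
          (sum(rate(stage_errors_total{stage="stt"}[5m])) / sum(rate(stt_requests_total[5m])) > 14.4 * 0.005)
          and
          (sum(rate(stage_errors_total{stage="stt"}[1h])) / sum(rate(stt_requests_total[1h])) > 14.4 * 0.005)
        labels: {severity: page}
      - alert: STTErrorBudgetSlowBurn
        # 6x burn over longer windows: ticket, not page.
        expr: >
          (sum(rate(stage_errors_total{stage="stt"}[30m])) / sum(rate(stt_requests_total[30m])) > 6 * 0.005)
          and
          (sum(rate(stage_errors_total{stage="stt"}[6h])) / sum(rate(stt_requests_total[6h])) > 6 * 0.005)
        labels: {severity: ticket}
```

The short window makes the alert reset quickly once the burn stops; the long window keeps it from firing on momentary spikes.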

Production considerations

  • Sampling: sample 100% of errors, 10% of successes to control cost.
  • Cardinality: do not tag metrics with caller phone numbers.
  • Log volume: audio is not a log. Keep transcripts in a dedicated store.
  • Trace retention: 14 days is usually enough; longer for incident review.
  • Privacy: redact PII in spans and logs.
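The sampling bullet maps directly onto the OTel Collector's tail_sampling processor, which decides after the trace completes — so errors can always be kept. A fragment of the collector config (a sketch; `decision_wait` should exceed your longest turn):

```yaml
processors:
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: keep-all-errors
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: sample-successes
        type: probabilistic
        probabilistic: {sampling_percentage: 10}
```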

CallSphere's real implementation

CallSphere instruments its voice edge with OpenTelemetry and routes traces, metrics, and logs through a collector into Tempo, Prometheus, and Loki. Every call's Twilio SID is used as the trace root, so support tickets referencing a specific call SID pull up the full pipeline in one click. RED metrics exist for every stage of the STT → LLM → TTS pipeline powered by the OpenAI Realtime API (gpt-4o-realtime-preview-2025-06-03) at 24kHz PCM16 with server VAD.


Multi-window burn-rate alerts fire on turn latency, tool error rate, and guardrail rejection rate across all verticals — 14 healthcare tools, 10 real estate agents, 4 salon agents, 7 after-hours escalation tools, 10+ RAG IT helpdesk tools, and the 5-specialist ElevenLabs sales pod. A GPT-4o-mini post-call pipeline produces analytics that are also exported as metrics, so sentiment trends show up on the same dashboards as SRE metrics. CallSphere supports 57+ languages and maintains sub-second end-to-end latency, visible in Grafana at all times.

Common pitfalls

  • Metrics without traces: you know something is wrong but not where.
  • Unbounded label cardinality: Prometheus will fall over.
  • Logs without trace IDs: you cannot correlate.
  • Alerting on raw counts: you will page on random spikes.
  • No SLO: you cannot tell the difference between a blip and a burn.

FAQ

Should I use OpenTelemetry or a vendor SDK?

OpenTelemetry. It decouples you from any single vendor.

Is Grafana enough or do I need Honeycomb / Lightstep?

Grafana is enough for most teams. Honeycomb shines for exploratory trace analysis.

How do I correlate a caller complaint to a trace?

Caller number → recent calls table → Call SID → trace.

Should audio frames be traced?

No. Trace at the event level, not the frame level.

Can I use trace IDs for billing reconciliation?

Yes — join trace IDs to your call log and carrier CDRs.

Next steps

Want full-stack observability on your voice agent? Book a demo, explore the technology page, or see pricing.

#CallSphere #Observability #OpenTelemetry #VoiceAI #SLO #Tracing #AIVoiceAgents


Written by

CallSphere Team

Expert insights on AI voice agents and customer communication automation.

Try CallSphere AI Voice Agents

See how AI voice agents work for your industry. Live demo available -- no signup required.