Logging Best Practices for AI Agents: Structured Logs for Debugging and Audit
Implement structured logging for AI agent systems with correlation IDs, log levels, sensitive data redaction, and queryable JSON output that makes debugging production agent issues fast and audit-ready.
Why Standard Logging Falls Short for Agents
A typical web application logs a request, processes it, and logs a response. An AI agent might process a single user message through half a dozen steps: prompt construction, memory retrieval, LLM inference, tool calls, response validation, and memory storage. Each step can fail independently, and the failure modes are fundamentally different from traditional applications — an LLM might return a valid HTTP 200 response that contains completely wrong instructions for a tool call.
Standard print() statements or unstructured log lines make it nearly impossible to reconstruct what happened during a conversation. Structured logging with correlation IDs, consistent fields, and sensitive data redaction transforms your logs from a wall of text into a queryable debugging and audit system.
Setting Up Structured Logging with structlog
The structlog library produces JSON log lines with consistent fields that are easy to parse and query in log aggregation tools like Elasticsearch, Loki, or CloudWatch.
```python
import structlog
import uuid

structlog.configure(
    processors=[
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.add_log_level,
        structlog.processors.StackInfoRenderer(),
        structlog.processors.format_exc_info,
        structlog.processors.JSONRenderer(),
    ],
    wrapper_class=structlog.BoundLogger,
    context_class=dict,
    logger_factory=structlog.PrintLoggerFactory(),
)

def get_logger(agent_name: str, conversation_id: str | None = None):
    """Create a logger bound with agent context."""
    if conversation_id is None:
        conversation_id = str(uuid.uuid4())
    return structlog.get_logger().bind(
        agent_name=agent_name,
        conversation_id=conversation_id,
    )
```
Every log line produced by this logger automatically includes the agent name, conversation ID, timestamp, and log level — all as structured JSON fields.
Correlation IDs Across Agent Steps
A single conversation generates logs across multiple functions and sometimes multiple services. Bind a conversation ID at the start and pass the logger through each step so every log line is linked.
```python
async def handle_conversation(user_message: str, user_id: str):
    conversation_id = str(uuid.uuid4())
    log = get_logger("support-agent", conversation_id).bind(user_id=user_id)
    log.info("conversation_started", message_length=len(user_message))

    # Memory retrieval
    log.info("memory_retrieval_started")
    memories = await retrieve_memories(user_message)
    log.info("memory_retrieval_completed", results_count=len(memories))

    # LLM call
    log.info("llm_call_started", model="gpt-4o")
    response = await call_llm(user_message, memories)
    log.info(
        "llm_call_completed",
        model="gpt-4o",
        prompt_tokens=response.usage.prompt_tokens,
        completion_tokens=response.usage.completion_tokens,
        finish_reason=response.choices[0].finish_reason,
    )

    # Tool execution
    if response.tool_calls:
        for tool_call in response.tool_calls:
            log.info(
                "tool_call_started",
                tool_name=tool_call.function.name,
            )
            try:
                result = await execute_tool(tool_call)
                log.info("tool_call_completed", tool_name=tool_call.function.name)
            except Exception as e:
                log.error(
                    "tool_call_failed",
                    tool_name=tool_call.function.name,
                    error=str(e),
                )
                raise

    log.info("conversation_completed")
    return response.content
```
The resulting log output looks like this — every line shares the same conversation_id, making it trivial to filter in your log aggregation tool:
{"event": "conversation_started", "agent_name": "support-agent", "conversation_id": "a1b2c3d4...", "user_id": "user_789", "message_length": 142, "level": "info", "timestamp": "2026-03-17T10:30:00Z"}
{"event": "llm_call_completed", "agent_name": "support-agent", "conversation_id": "a1b2c3d4...", "model": "gpt-4o", "prompt_tokens": 1250, "completion_tokens": 340, "level": "info", "timestamp": "2026-03-17T10:30:02Z"}
Redacting Sensitive Data
Agent logs often contain user messages, PII, or API keys embedded in tool call arguments. Build a redaction processor that strips sensitive fields before they hit your log backend.
```python
import re

SENSITIVE_PATTERNS = {
    "email": re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"),
    "phone": re.compile(r"\b\d{3}[-.]?\d{3}[-.]?\d{4}\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "api_key": re.compile(r"(sk-|pk-|api[_-]?key[=:]\s*)[a-zA-Z0-9]{20,}"),
}

def redact_sensitive_data(logger, method_name, event_dict):
    """structlog processor that redacts PII from log values."""
    for key, value in event_dict.items():
        if isinstance(value, str):
            for pattern_name, pattern in SENSITIVE_PATTERNS.items():
                value = pattern.sub(f"[REDACTED_{pattern_name.upper()}]", value)
            event_dict[key] = value
    return event_dict

# Add to the structlog processors list before JSONRenderer
structlog.configure(
    processors=[
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.add_log_level,
        redact_sensitive_data,  # Runs before serialization
        structlog.processors.JSONRenderer(),
    ],
)
```
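Because the processor is a plain function, it can be sanity-checked in isolation by calling it the way structlog would — with a logger, method name, and event dict. This sketch restates two of the patterns above so it runs standalone; the sample message is illustrative:

```python
import re

SENSITIVE_PATTERNS = {
    "email": re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"),
    "phone": re.compile(r"\b\d{3}[-.]?\d{3}[-.]?\d{4}\b"),
}

def redact_sensitive_data(logger, method_name, event_dict):
    """structlog processor that redacts PII from log values."""
    for key, value in event_dict.items():
        if isinstance(value, str):
            for pattern_name, pattern in SENSITIVE_PATTERNS.items():
                value = pattern.sub(f"[REDACTED_{pattern_name.upper()}]", value)
            event_dict[key] = value
    return event_dict

event = {
    "event": "tool_call_started",
    "user_message": "Email me at jane@example.com or call 555-867-5309",
}
redacted = redact_sensitive_data(None, "info", event)
print(redacted["user_message"])
# → Email me at [REDACTED_EMAIL] or call [REDACTED_PHONE]
```

Non-string values pass through untouched, so counters and token totals stay queryable.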
Choosing Log Levels for Agent Events
Use consistent log levels across your agent codebase. A clear convention prevents important signals from being buried in noise.
| Level | When to Use |
|---|---|
| DEBUG | Prompt contents, full LLM responses, tool arguments |
| INFO | Step start/completion, token counts, conversation lifecycle |
| WARNING | Retries, fallback model usage, slow LLM responses |
| ERROR | Tool failures, LLM errors, validation failures |
| CRITICAL | Agent loop crashes, data corruption, auth failures |
In production, set the level to INFO and enable DEBUG only when actively investigating an issue. This keeps log volume manageable while preserving enough context for post-incident analysis.
FAQ
Should I log the full LLM prompt and response?
Log full prompts and responses at DEBUG level only. At INFO level, log metadata like token counts, model name, and finish reason. Full prompts can contain PII and consume significant storage — a single conversation might generate megabytes of prompt text. For audit scenarios, consider writing full prompts to a separate, access-controlled store with shorter retention.
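One way to implement that split — metadata at INFO, full text only at DEBUG — is a small helper like the sketch below. The function name and the short-hash field are illustrative, not a structlog API; the hash lets you spot identical prompts across conversations without storing their contents:

```python
import hashlib

def log_llm_request(log, prompt: str, model: str):
    """Log prompt metadata at INFO; the full text only at DEBUG."""
    log.info(
        "llm_request",
        model=model,
        prompt_chars=len(prompt),
        # Short digest for correlating identical prompts without storing them.
        prompt_sha256=hashlib.sha256(prompt.encode()).hexdigest()[:12],
    )
    log.debug("llm_request_prompt", prompt=prompt)
```

At the production INFO level the second call is filtered out, so the prompt text never reaches the log backend.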
How do I correlate logs across multiple agents in a multi-agent system?
Use two IDs: a conversation_id that is unique per user conversation and a trace_id that follows the request across agent handoffs. When your triage agent calls a specialist agent, pass both IDs in the request. This lets you filter by conversation to see the full user interaction or by trace to see the technical execution path.
What log aggregation tools work best for agent logs?
Any tool that supports structured JSON logs works well. Grafana Loki is lightweight and integrates directly with Grafana dashboards. Elasticsearch with Kibana provides powerful full-text search across log fields. For cloud-native setups, AWS CloudWatch Logs Insights or Google Cloud Logging both support JSON field queries natively.
#Logging #StructuredLogging #Debugging #Audit #AIAgents #AgenticAI #LearnAI #AIEngineering
CallSphere Team