Logging Best Practices for AI Agents: Structured Logs for Debugging and Audit
Implement structured logging for AI agent systems with correlation IDs, log levels, sensitive data redaction, and queryable JSON output that makes debugging production agent issues fast and audit-ready.
Why Standard Logging Falls Short for Agents
A typical web application logs a request, processes it, and logs a response. An AI agent might process a single user message through half a dozen steps: prompt construction, memory retrieval, LLM inference, tool calls, response validation, and memory storage. Each step can fail independently, and the failure modes are fundamentally different from traditional applications — an LLM might return a valid HTTP 200 response that contains completely wrong instructions for a tool call.
Standard print() statements or unstructured log lines make it nearly impossible to reconstruct what happened during a conversation. Structured logging with correlation IDs, consistent fields, and sensitive data redaction transforms your logs from a wall of text into a queryable debugging and audit system.
Setting Up Structured Logging with structlog
The structlog library produces JSON log lines with consistent fields that are easy to parse and query in log aggregation tools like Elasticsearch, Loki, or CloudWatch.
```python
import structlog
import uuid

structlog.configure(
    processors=[
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.add_log_level,
        structlog.processors.StackInfoRenderer(),
        structlog.processors.format_exc_info,
        structlog.processors.JSONRenderer(),
    ],
    wrapper_class=structlog.BoundLogger,
    context_class=dict,
    logger_factory=structlog.PrintLoggerFactory(),
)

def get_logger(agent_name: str, conversation_id: str | None = None):
    """Create a logger bound with agent context."""
    if conversation_id is None:
        conversation_id = str(uuid.uuid4())
    return structlog.get_logger().bind(
        agent_name=agent_name,
        conversation_id=conversation_id,
    )
```
Every log line produced by this logger automatically includes the agent name, conversation ID, timestamp, and log level — all as structured JSON fields.
Correlation IDs Across Agent Steps
A single conversation generates logs across multiple functions and sometimes multiple services. Bind a conversation ID at the start and pass the logger through each step so every log line is linked.
```python
async def handle_conversation(user_message: str, user_id: str):
    conversation_id = str(uuid.uuid4())
    log = get_logger("support-agent", conversation_id).bind(user_id=user_id)
    log.info("conversation_started", message_length=len(user_message))

    # Memory retrieval
    log.info("memory_retrieval_started")
    memories = await retrieve_memories(user_message)
    log.info("memory_retrieval_completed", results_count=len(memories))

    # LLM call
    log.info("llm_call_started", model="gpt-4o")
    response = await call_llm(user_message, memories)
    log.info(
        "llm_call_completed",
        model="gpt-4o",
        prompt_tokens=response.usage.prompt_tokens,
        completion_tokens=response.usage.completion_tokens,
        finish_reason=response.choices[0].finish_reason,
    )

    # Tool execution
    if response.tool_calls:
        for tool_call in response.tool_calls:
            log.info(
                "tool_call_started",
                tool_name=tool_call.function.name,
            )
            try:
                result = await execute_tool(tool_call)
                log.info("tool_call_completed", tool_name=tool_call.function.name)
            except Exception as e:
                log.error(
                    "tool_call_failed",
                    tool_name=tool_call.function.name,
                    error=str(e),
                )
                raise

    log.info("conversation_completed")
    return response.content
```
The resulting log output looks like this — every line shares the same conversation_id, making it trivial to filter in your log aggregation tool:
{"event": "conversation_started", "agent_name": "support-agent", "conversation_id": "a1b2c3d4...", "user_id": "user_789", "message_length": 142, "level": "info", "timestamp": "2026-03-17T10:30:00Z"}
{"event": "llm_call_completed", "agent_name": "support-agent", "conversation_id": "a1b2c3d4...", "model": "gpt-4o", "prompt_tokens": 1250, "completion_tokens": 340, "level": "info", "timestamp": "2026-03-17T10:30:02Z"}
Redacting Sensitive Data
Agent logs often contain user messages, PII, or API keys embedded in tool call arguments. Build a redaction processor that strips sensitive fields before they hit your log backend.
```python
import re

SENSITIVE_PATTERNS = {
    "email": re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"),
    "phone": re.compile(r"\b\d{3}[-.]?\d{3}[-.]?\d{4}\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "api_key": re.compile(r"(sk-|pk-|api[_-]?key[=:]\s*)[a-zA-Z0-9]{20,}"),
}

def redact_sensitive_data(logger, method_name, event_dict):
    """structlog processor that redacts PII from log values."""
    for key, value in event_dict.items():
        if isinstance(value, str):
            for pattern_name, pattern in SENSITIVE_PATTERNS.items():
                value = pattern.sub(f"[REDACTED_{pattern_name.upper()}]", value)
            event_dict[key] = value
    return event_dict

# Add to the structlog processors list before JSONRenderer
structlog.configure(
    processors=[
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.add_log_level,
        redact_sensitive_data,  # Runs before serialization
        structlog.processors.JSONRenderer(),
    ],
)
```
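Because the processor is a plain function, it can be sanity-checked in isolation by calling it the way structlog would — with a logger, method name, and event dict. This sketch restates two of the patterns above so it runs standalone; the sample message is illustrative:

```python
import re

SENSITIVE_PATTERNS = {
    "email": re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"),
    "phone": re.compile(r"\b\d{3}[-.]?\d{3}[-.]?\d{4}\b"),
}

def redact_sensitive_data(logger, method_name, event_dict):
    """structlog processor that redacts PII from log values."""
    for key, value in event_dict.items():
        if isinstance(value, str):
            for pattern_name, pattern in SENSITIVE_PATTERNS.items():
                value = pattern.sub(f"[REDACTED_{pattern_name.upper()}]", value)
            event_dict[key] = value
    return event_dict

event = {
    "event": "tool_call_started",
    "user_message": "Email me at jane@example.com or call 555-867-5309",
}
redacted = redact_sensitive_data(None, "info", event)
print(redacted["user_message"])
# → Email me at [REDACTED_EMAIL] or call [REDACTED_PHONE]
```

Non-string values pass through untouched, so counters and token totals stay queryable.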
Choosing Log Levels for Agent Events
Use consistent log levels across your agent codebase. A clear convention prevents important signals from being buried in noise.
| Level | When to Use |
|---|---|
| DEBUG | Prompt contents, full LLM responses, tool arguments |
| INFO | Step start/completion, token counts, conversation lifecycle |
| WARNING | Retries, fallback model usage, slow LLM responses |
| ERROR | Tool failures, LLM errors, validation failures |
| CRITICAL | Agent loop crashes, data corruption, auth failures |
In production, set the level to INFO and enable DEBUG only when actively investigating an issue. This keeps log volume manageable while preserving enough context for post-incident analysis.
FAQ
Should I log the full LLM prompt and response?
Log full prompts and responses at DEBUG level only. At INFO level, log metadata like token counts, model name, and finish reason. Full prompts can contain PII and consume significant storage — a single conversation might generate megabytes of prompt text. For audit scenarios, consider writing full prompts to a separate, access-controlled store with shorter retention.
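One way to implement that split — metadata at INFO, full text only at DEBUG — is a small helper like the sketch below. The function name and the short-hash field are illustrative, not a structlog API; the hash lets you spot identical prompts across conversations without storing their contents:

```python
import hashlib

def log_llm_request(log, prompt: str, model: str):
    """Log prompt metadata at INFO; the full text only at DEBUG."""
    log.info(
        "llm_request",
        model=model,
        prompt_chars=len(prompt),
        # Short digest for correlating identical prompts without storing them.
        prompt_sha256=hashlib.sha256(prompt.encode()).hexdigest()[:12],
    )
    log.debug("llm_request_prompt", prompt=prompt)
```

At the production INFO level the second call is filtered out, so the prompt text never reaches the log backend.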
How do I correlate logs across multiple agents in a multi-agent system?
Use two IDs: a conversation_id that is unique per user conversation and a trace_id that follows the request across agent handoffs. When your triage agent calls a specialist agent, pass both IDs in the request. This lets you filter by conversation to see the full user interaction or by trace to see the technical execution path.
What log aggregation tools work best for agent logs?
Any tool that supports structured JSON logs works well. Grafana Loki is lightweight and integrates directly with Grafana dashboards. Elasticsearch with Kibana provides powerful full-text search across log fields. For cloud-native setups, AWS CloudWatch Logs Insights or Google Cloud Logging both support JSON field queries natively.
#Logging #StructuredLogging #Debugging #Audit #AIAgents #AgenticAI #LearnAI #AIEngineering
CallSphere Team