Latency Profiling for AI Agents: Finding and Fixing Slow Steps in Agent Pipelines
Learn systematic approaches to profile AI agent latency, identify bottlenecks in multi-step pipelines, and apply targeted optimizations using timing decorators, flame charts, and parallel execution patterns.
Where Agent Latency Hides
Users perceive agent response time as a single number, but under the hood it is the sum of many sequential and parallel steps: prompt construction, embedding generation for retrieval, vector database search, LLM inference, tool execution, response parsing, and memory storage. A 4-second total response time might consist of 200ms for retrieval, 2800ms for LLM inference, 600ms for a database tool call, and 400ms for prompt assembly — but without profiling, you would never know the tool call is the second largest contributor.
Systematic latency profiling decomposes total response time into individual step durations, reveals which steps dominate, and guides you to optimizations that actually matter.
Building a Timing Decorator
A reusable decorator that logs the duration of every async function call gives you step-level visibility with minimal code changes.
import time
import functools

import structlog

logger = structlog.get_logger()


def timed(step_name: str | None = None):
    """Decorator that logs execution time for async functions."""
    def decorator(func):
        name = step_name or func.__name__

        @functools.wraps(func)
        async def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                result = await func(*args, **kwargs)
                duration_ms = (time.perf_counter() - start) * 1000
                logger.info(
                    "step_completed",
                    step=name,
                    duration_ms=round(duration_ms, 2),
                )
                return result
            except Exception as e:
                duration_ms = (time.perf_counter() - start) * 1000
                logger.error(
                    "step_failed",
                    step=name,
                    duration_ms=round(duration_ms, 2),
                    error=str(e),
                )
                raise
        return wrapper
    return decorator


# Usage
@timed("memory_retrieval")
async def retrieve_memories(query: str):
    embeddings = await generate_embedding(query)
    return await vector_db.search(embeddings, top_k=5)


@timed("llm_inference")
async def call_llm(messages: list):
    return await client.chat.completions.create(model="gpt-4o", messages=messages)
Collecting a Latency Breakdown
Build a profiler context manager that collects step timings for an entire conversation and produces a structured breakdown.
from dataclasses import dataclass, field
from contextlib import asynccontextmanager


@dataclass
class StepTiming:
    name: str
    duration_ms: float
    metadata: dict = field(default_factory=dict)


@dataclass
class ConversationProfile:
    conversation_id: str
    steps: list[StepTiming] = field(default_factory=list)

    @property
    def total_ms(self) -> float:
        return sum(s.duration_ms for s in self.steps)

    def breakdown(self) -> list[dict]:
        total = self.total_ms
        return [
            {
                "step": s.name,
                "duration_ms": round(s.duration_ms, 2),
                "percentage": round((s.duration_ms / total) * 100, 1) if total > 0 else 0,
            }
            for s in sorted(self.steps, key=lambda x: x.duration_ms, reverse=True)
        ]

    def record(self, name: str, duration_ms: float, **metadata):
        self.steps.append(StepTiming(name=name, duration_ms=duration_ms, metadata=metadata))


@asynccontextmanager
async def profile_step(profile: ConversationProfile, step_name: str):
    start = time.perf_counter()
    try:
        yield
    finally:
        # Record the duration even if the step raised, so failed
        # steps still show up in the breakdown
        duration_ms = (time.perf_counter() - start) * 1000
        profile.record(step_name, duration_ms)
Using the profiler in your agent:
async def run_agent_profiled(user_message: str, conversation_id: str):
    profile = ConversationProfile(conversation_id=conversation_id)

    async with profile_step(profile, "prompt_assembly"):
        prompt = build_prompt(user_message)

    async with profile_step(profile, "memory_retrieval"):
        memories = await retrieve_memories(user_message)

    async with profile_step(profile, "llm_inference"):
        response = await call_llm(prompt + memories)

    if response.tool_calls:
        async with profile_step(profile, "tool_execution"):
            results = await execute_tools(response.tool_calls)

    async with profile_step(profile, "response_formatting"):
        output = format_response(response)

    # Log the breakdown
    for step in profile.breakdown():
        logger.info("latency_breakdown", **step, conversation_id=conversation_id)
    logger.info("total_latency", total_ms=round(profile.total_ms, 2))

    return output
Identifying and Fixing Common Bottlenecks
Once you have profiling data, patterns emerge quickly. Here are the most common bottlenecks and their fixes.
Sequential steps that could be parallel. If memory retrieval and prompt assembly are independent, run them concurrently:
import asyncio


async def run_agent_optimized(user_message: str):
    # Run independent steps in parallel
    memories_task = asyncio.create_task(retrieve_memories(user_message))
    prompt_task = asyncio.create_task(build_prompt_async(user_message))
    memories, prompt = await asyncio.gather(memories_task, prompt_task)

    # Sequential: the LLM needs both prompt and memories
    response = await call_llm(prompt, memories)
    return response
Oversized prompts inflating LLM latency. LLM inference time scales with prompt length. If your prompt includes 20 retrieved documents when 5 would suffice, you are paying for latency you do not need. Profile the relationship between prompt token count and inference time to find the optimal retrieval count.
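One way to profile that relationship is to record a (prompt tokens, latency) sample for each inference call and bucket the samples by token count. A minimal sketch, where the helper names and the 1,000-token bucket size are illustrative choices, not part of any library:

```python
def record_inference_sample(samples: list[dict], token_count: int, duration_ms: float) -> None:
    """Append one (prompt size, latency) observation for later analysis."""
    samples.append({"prompt_tokens": token_count, "duration_ms": duration_ms})


def latency_by_token_bucket(samples: list[dict], bucket_size: int = 1000) -> dict[int, float]:
    """Average inference latency per prompt-token bucket, to spot where
    added context stops being worth the latency cost."""
    buckets: dict[int, list[float]] = {}
    for s in samples:
        bucket = (s["prompt_tokens"] // bucket_size) * bucket_size
        buckets.setdefault(bucket, []).append(s["duration_ms"])
    return {b: sum(v) / len(v) for b, v in sorted(buckets.items())}
```

If average latency jumps sharply between the 4,000-token and 8,000-token buckets while answer quality stays flat, that is your signal to retrieve fewer documents.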
Synchronous tool calls that could overlap. If the LLM requests two independent tool calls — say a database lookup and an API call — execute them concurrently:
async def execute_tools_parallel(tool_calls: list):
    tasks = [execute_single_tool(tc) for tc in tool_calls]
    results = await asyncio.gather(*tasks, return_exceptions=True)
    return [
        r if not isinstance(r, Exception) else {"error": str(r)}
        for r in results
    ]
FAQ
What is a good target for total agent response time?
For synchronous chat-style agents, aim for under 3 seconds total response time. Users perceive delays beyond 3 seconds as sluggish. For streaming responses, the time-to-first-token is more important — target under 500 milliseconds. For background agents processing tasks asynchronously, total time matters less than throughput.
How do I profile agents that use streaming responses?
Measure two metrics: time-to-first-token (TTFT) and total generation time. Wrap the stream to capture the timestamp of the first chunk and the last chunk. TTFT tells you how long the user waits before seeing output, while total time tells you the full cost of the request.
Should I store profiling data for every conversation in production?
Store aggregate percentile data (P50, P95, P99) for every conversation — this is lightweight and essential for trend analysis. Store full step-level breakdowns only for a sample (1-5% of conversations) or when latency exceeds a threshold. This balances observability with storage cost.
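The percentile aggregation itself needs nothing beyond the standard library. A sketch over a window of per-conversation total latencies (the function name is illustrative):

```python
import statistics


def latency_percentiles(durations_ms: list[float]) -> dict[str, float]:
    """P50/P95/P99 over a window of total latencies.
    statistics.quantiles(n=100) returns 99 cut points; indexes 49, 94,
    and 98 correspond to the 50th, 95th, and 99th percentiles."""
    q = statistics.quantiles(durations_ms, n=100, method="inclusive")
    return {"p50": q[49], "p95": q[94], "p99": q[98]}
```

Computing these over a sliding window (say, the last hour) rather than all time keeps the numbers responsive to regressions.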
CallSphere Team
Expert insights on AI voice agents and customer communication automation.