
Latency Profiling for AI Agents: Finding and Fixing Slow Steps in Agent Pipelines

Learn systematic approaches to profile AI agent latency, identify bottlenecks in multi-step pipelines, and apply targeted optimizations using timing decorators, flame charts, and parallel execution patterns.

Where Agent Latency Hides

Users perceive agent response time as a single number, but under the hood it is the sum of many sequential and parallel steps: prompt construction, embedding generation for retrieval, vector database search, LLM inference, tool execution, response parsing, and memory storage. A 4-second total response time might consist of 200ms for retrieval, 2800ms for LLM inference, 600ms for a database tool call, and 400ms for prompt assembly — but without profiling, you would never know the tool call is the second largest contributor.

Systematic latency profiling decomposes total response time into individual step durations, reveals which steps dominate, and guides you to optimizations that actually matter.

Building a Timing Decorator

A reusable decorator that logs the duration of every async function call gives you step-level visibility with minimal code changes.

import time
import functools
import structlog

logger = structlog.get_logger()

def timed(step_name: str | None = None):
    """Decorator that logs execution time for async functions."""
    def decorator(func):
        name = step_name or func.__name__

        @functools.wraps(func)
        async def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                result = await func(*args, **kwargs)
                duration_ms = (time.perf_counter() - start) * 1000
                logger.info(
                    "step_completed",
                    step=name,
                    duration_ms=round(duration_ms, 2),
                )
                return result
            except Exception as e:
                duration_ms = (time.perf_counter() - start) * 1000
                logger.error(
                    "step_failed",
                    step=name,
                    duration_ms=round(duration_ms, 2),
                    error=str(e),
                )
                raise

        return wrapper
    return decorator

# Usage
@timed("memory_retrieval")
async def retrieve_memories(query: str):
    embeddings = await generate_embedding(query)
    return await vector_db.search(embeddings, top_k=5)

@timed("llm_inference")
async def call_llm(messages: list):
    return await client.chat.completions.create(model="gpt-4o", messages=messages)

Collecting a Latency Breakdown

Build a profiler context manager that collects step timings for an entire conversation and produces a structured breakdown.

from dataclasses import dataclass, field
from contextlib import asynccontextmanager

@dataclass
class StepTiming:
    name: str
    duration_ms: float
    metadata: dict = field(default_factory=dict)

@dataclass
class ConversationProfile:
    conversation_id: str
    steps: list[StepTiming] = field(default_factory=list)

    @property
    def total_ms(self) -> float:
        return sum(s.duration_ms for s in self.steps)

    def breakdown(self) -> list[dict]:
        total = self.total_ms
        return [
            {
                "step": s.name,
                "duration_ms": round(s.duration_ms, 2),
                "percentage": round((s.duration_ms / total) * 100, 1) if total > 0 else 0,
            }
            for s in sorted(self.steps, key=lambda x: x.duration_ms, reverse=True)
        ]

    def record(self, name: str, duration_ms: float, **metadata):
        self.steps.append(StepTiming(name=name, duration_ms=duration_ms, metadata=metadata))


@asynccontextmanager
async def profile_step(profile: ConversationProfile, step_name: str):
    start = time.perf_counter()
    try:
        yield
    finally:
        # Record the duration even if the step raised, so failed
        # steps still appear in the breakdown.
        duration_ms = (time.perf_counter() - start) * 1000
        profile.record(step_name, duration_ms)

Using the profiler in your agent:

async def run_agent_profiled(user_message: str, conversation_id: str):
    profile = ConversationProfile(conversation_id=conversation_id)

    async with profile_step(profile, "prompt_assembly"):
        prompt = build_prompt(user_message)

    async with profile_step(profile, "memory_retrieval"):
        memories = await retrieve_memories(user_message)

    async with profile_step(profile, "llm_inference"):
        response = await call_llm(prompt + memories)

    if response.tool_calls:
        async with profile_step(profile, "tool_execution"):
            results = await execute_tools(response.tool_calls)

    async with profile_step(profile, "response_formatting"):
        output = format_response(response)

    # Log the breakdown
    for step in profile.breakdown():
        logger.info("latency_breakdown", **step, conversation_id=conversation_id)

    logger.info("total_latency", total_ms=round(profile.total_ms, 2))
    return output
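Step timings like these can also feed a flame chart. As a sketch (an illustrative helper, not part of the profiler above), the function below converts a sequential breakdown into Chrome trace-event JSON, which chrome://tracing or Perfetto can open. It assumes the steps ran back-to-back, so each start offset is the cumulative duration of the steps before it.

```python
import json

def to_trace_events(steps: list[dict], pid: int = 1, tid: int = 1) -> list[dict]:
    """Convert sequential step timings into Chrome trace "complete" (X)
    events. Trace timestamps and durations are in microseconds."""
    events = []
    cursor_us = 0.0
    for step in steps:
        dur_us = step["duration_ms"] * 1000
        events.append({
            "name": step["step"],
            "ph": "X",        # complete event: carries both ts and dur
            "ts": cursor_us,
            "dur": dur_us,
            "pid": pid,
            "tid": tid,
        })
        cursor_us += dur_us
    return events

# Hypothetical breakdown matching the example from the introduction
steps = [
    {"step": "prompt_assembly", "duration_ms": 400},
    {"step": "memory_retrieval", "duration_ms": 200},
    {"step": "llm_inference", "duration_ms": 2800},
    {"step": "tool_execution", "duration_ms": 600},
]
with open("trace.json", "w") as f:
    json.dump({"traceEvents": to_trace_events(steps)}, f)
```

If some steps overlap (parallel retrieval and prompt assembly, for instance), you would need to record real start timestamps in `StepTiming` rather than inferring them from cumulative durations.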

Identifying and Fixing Common Bottlenecks

Once you have profiling data, patterns emerge quickly. Here are the most common bottlenecks and their fixes.


Sequential steps that could be parallel. If memory retrieval and prompt assembly are independent, run them concurrently:

import asyncio

async def run_agent_optimized(user_message: str):
    # Run independent steps in parallel
    memories_task = asyncio.create_task(retrieve_memories(user_message))
    prompt_task = asyncio.create_task(build_prompt_async(user_message))

    memories, prompt = await asyncio.gather(memories_task, prompt_task)

    # Sequential: LLM needs both prompt and memories
    response = await call_llm(prompt, memories)
    return response

Oversized prompts inflating LLM latency. LLM inference time scales with prompt length. If your prompt includes 20 retrieved documents when 5 would suffice, you are paying for latency you do not need. Profile the relationship between prompt token count and inference time to find the optimal retrieval count.
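One way to profile that relationship is to collect (prompt token count, inference latency) pairs from your logs and bucket them by prompt size. The helper below is a minimal sketch using made-up sample data; in practice the token counts would come from your tokenizer or the API's usage field.

```python
from collections import defaultdict

def latency_by_prompt_size(samples: list[tuple[int, float]], bucket: int = 1000) -> dict:
    """Group (prompt_tokens, inference_ms) samples into token buckets
    and report the mean latency per bucket, exposing how inference
    latency scales with prompt length."""
    buckets: dict[int, list[float]] = defaultdict(list)
    for tokens, latency_ms in samples:
        buckets[(tokens // bucket) * bucket].append(latency_ms)
    return {
        f"{lo}-{lo + bucket - 1} tokens": round(sum(v) / len(v), 1)
        for lo, v in sorted(buckets.items())
    }

# Hypothetical samples pulled from production logs
samples = [(800, 1900.0), (950, 2000.0), (2400, 2700.0), (2600, 2900.0), (5200, 4100.0)]
print(latency_by_prompt_size(samples))
```

If the per-bucket averages climb steeply past a certain prompt size, that size is a natural cap for retrieval count.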

Synchronous tool calls that could overlap. If the LLM requests two independent tool calls — say a database lookup and an API call — execute them concurrently:

async def execute_tools_parallel(tool_calls: list):
    tasks = [execute_single_tool(tc) for tc in tool_calls]
    results = await asyncio.gather(*tasks, return_exceptions=True)
    return [
        r if not isinstance(r, Exception) else {"error": str(r)}
        for r in results
    ]

FAQ

What is a good target for total agent response time?

For synchronous chat-style agents, aim for under 3 seconds total response time. Users perceive delays beyond 3 seconds as sluggish. For streaming responses, the time-to-first-token is more important — target under 500 milliseconds. For background agents processing tasks asynchronously, total time matters less than throughput.

How do I profile agents that use streaming responses?

Measure two metrics: time-to-first-token (TTFT) and total generation time. Wrap the stream to capture the timestamp of the first chunk and the last chunk. TTFT tells you how long the user waits before seeing output, while total time tells you the full cost of the request.
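A minimal sketch of that wrapper, shown here consuming a fake async stream in place of a real LLM streaming API:

```python
import asyncio
import time

async def measure_stream(stream):
    """Consume an async token stream, recording time-to-first-token
    (TTFT) and total generation time alongside the assembled text."""
    start = time.perf_counter()
    ttft_ms = None
    chunks = []
    async for chunk in stream:
        if ttft_ms is None:
            # Timestamp of the first chunk = TTFT
            ttft_ms = (time.perf_counter() - start) * 1000
        chunks.append(chunk)
    total_ms = (time.perf_counter() - start) * 1000
    return "".join(chunks), {"ttft_ms": ttft_ms, "total_ms": total_ms}

# Stand-in for a real streaming response
async def fake_stream():
    for token in ["Hel", "lo", "!"]:
        await asyncio.sleep(0.01)
        yield token

text, timing = asyncio.run(measure_stream(fake_stream()))
```

With a real client you would pass the provider's stream object in directly; the wrapper only assumes it is an async iterable of text chunks.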

Should I store profiling data for every conversation in production?

Store aggregate percentile data (P50, P95, P99) for every conversation — this is lightweight and essential for trend analysis. Store full step-level breakdowns only for a sample (1-5% of conversations) or when latency exceeds a threshold. This balances observability with storage cost.
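Computing those percentiles needs nothing beyond the standard library. A sketch using `statistics.quantiles`, which with `n=100` returns the 99 cut points between the 1st and 99th percentiles:

```python
import statistics

def latency_percentiles(samples_ms: list[float]) -> dict:
    """Compute P50/P95/P99 from a list of per-conversation latencies.
    quantiles(n=100) returns 99 cut points; index i holds the
    (i + 1)th percentile."""
    cuts = statistics.quantiles(samples_ms, n=100)
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}

# Hypothetical latencies: 200 conversations from 100ms to 299ms
samples = [100.0 + i for i in range(200)]
print(latency_percentiles(samples))
```

In production you would compute these over a rolling window (per hour or per day) so regressions show up as a shift in P95/P99 rather than being buried in the mean.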


#Latency #Performance #Profiling #AIAgents #Optimization #AgenticAI #LearnAI #AIEngineering


CallSphere Team

Expert insights on AI voice agents and customer communication automation.
