
Python Performance Profiling for AI Applications: Finding Bottlenecks with cProfile and py-spy

Learn to identify and fix performance bottlenecks in AI applications using cProfile, py-spy, memory profiling, and optimization strategies for LLM pipelines and data processing.

Why Profile AI Applications

AI applications have a unique performance profile. Most of the wall-clock time is spent waiting for external API calls — LLM completions, embedding generation, vector database queries. But the CPU time between those calls matters too. Slow tokenization, inefficient prompt assembly, redundant data serialization, and memory-heavy document processing can add seconds of overhead per request that compound across thousands of daily interactions.

Profiling replaces guessing with measurement. You might assume the LLM call is the bottleneck, only to discover that your prompt template rendering takes 200ms because it re-parses Jinja templates on every call.
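One inexpensive fix for that particular trap is to parse each template once and cache the compiled object. A minimal sketch of the pattern, using the stdlib's string.Template as a stand-in (with Jinja, the analogous move is reusing a single Environment, which caches compiled templates):

```python
from functools import lru_cache
from string import Template  # stdlib stand-in; the same pattern applies to Jinja


@lru_cache(maxsize=None)
def get_template(source: str) -> Template:
    # Parse each unique template source once, then reuse the compiled object
    return Template(source)


def render_prompt(source: str, **values: str) -> str:
    return get_template(source).substitute(values)


render_prompt("Summarize: $text", text="Q1 earnings")
render_prompt("Summarize: $text", text="Q2 earnings")  # template reused, not re-parsed
```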

cProfile: Built-In Deterministic Profiling

cProfile is included in Python's standard library and measures exact call counts and cumulative time for every function.

import cProfile
import pstats
from io import StringIO

def profile_agent_pipeline():
    profiler = cProfile.Profile()
    profiler.enable()

    # Run the code you want to profile
    result = run_agent_pipeline(query="Analyze market trends")

    profiler.disable()

    # Sort by cumulative time and print top 20 functions
    stream = StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    stats.sort_stats("cumulative")
    stats.print_stats(20)
    print(stream.getvalue())

    return result

You can also profile from the command line without modifying code.

python -m cProfile -s cumulative agent_pipeline.py

# Save to a file for visualization
python -m cProfile -o profile_output.prof agent_pipeline.py

# View with snakeviz (interactive browser visualization)
pip install snakeviz
snakeviz profile_output.prof
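The saved stats file can also be inspected programmatically with pstats instead of snakeviz. A self-contained sketch (busy() is a stand-in workload):

```python
import cProfile
import pstats


def busy() -> int:
    return sum(i * i for i in range(100_000))


# Produce a stats file the same way `python -m cProfile -o ...` does
profiler = cProfile.Profile()
profiler.enable()
busy()
profiler.disable()
profiler.dump_stats("profile_output.prof")

# Load the saved stats and print the ten most expensive functions
stats = pstats.Stats("profile_output.prof")
stats.sort_stats("cumulative")
stats.print_stats(10)
```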

py-spy: Sampling Profiler for Production

cProfile adds overhead and requires code changes. py-spy attaches to a running Python process without any modification — perfect for profiling production AI services.

# Install py-spy
pip install py-spy

# Profile a running process by PID
py-spy top --pid 12345

# Record a flame graph
py-spy record -o flamegraph.svg --pid 12345 --duration 30

# Profile a specific script
py-spy record -o profile.svg -- python agent_server.py

Flame graphs visualize where time is spent. Wide bars represent functions that consume the most time. In AI applications, you typically see wide bars for HTTP client calls (LLM API), JSON serialization, and string operations during prompt assembly.

Profiling Async Code

Standard cProfile is awkward for async code: each coroutine resumption is recorded as a separate call, and time spent suspended at an await is not attributed to the coroutine, so I/O-heavy awaits look free. Use yappi with a wall-clock timer for async-aware profiling.

import yappi
import asyncio

async def profile_async_agent():
    yappi.set_clock_type("wall")  # wall time, not CPU time
    yappi.start()

    await run_async_agent_pipeline()

    yappi.stop()

    # Get function stats
    func_stats = yappi.get_func_stats()
    func_stats.sort("ttot", "desc")  # total time descending
    func_stats.print_all(columns={
        0: ("name", 60),
        1: ("ncall", 10),
        2: ("ttot", 10),
        3: ("tavg", 10),
    })

asyncio.run(profile_async_agent())

Memory Profiling

AI applications are memory-hungry. Document loaders, embedding vectors, and conversation histories can consume gigabytes. Use memray for detailed memory profiling.


# Install memray
pip install memray

# Profile memory usage
python -m memray run -o output.bin agent_pipeline.py

# Generate a flamegraph of memory allocations
memray flamegraph output.bin -o memory_flamegraph.html

# Summarize allocations as a tree in the terminal
memray tree output.bin

# Or watch memory live while the script runs
python -m memray run --live agent_pipeline.py

For per-line memory analysis, use memory_profiler.

from pathlib import Path

from memory_profiler import profile

@profile
def load_documents(directory: str) -> list[str]:
    documents = []
    for file_path in Path(directory).glob("*.txt"):
        content = file_path.read_text()
        documents.append(content)
    return documents

# Output shows memory usage per line; the increment lands where
# the allocation happens (reading the file), not where it is stored:
# Line #    Mem usage    Increment
#    7     45.2 MiB      0.0 MiB    documents = []
#    9    312.5 MiB    267.3 MiB    content = file_path.read_text()
#   10    312.5 MiB      0.0 MiB    documents.append(content)
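When the profiler points at a spike like this, streaming is often the fix: yield documents one at a time instead of accumulating the whole corpus. A sketch of the generator variant (the consumer shown in the comment is hypothetical):

```python
from pathlib import Path
from typing import Iterator


def iter_documents(directory: str) -> Iterator[str]:
    # Yield documents one at a time; peak memory is one file, not the corpus
    for file_path in sorted(Path(directory).glob("*.txt")):
        yield file_path.read_text()


# The caller processes each document and lets it be freed before the next:
# for doc in iter_documents("corpus/"):
#     index(doc)  # index() is a hypothetical consumer
```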

Common Bottlenecks and Fixes

Here are patterns that repeatedly show up when profiling AI applications.

Redundant serialization: Converting Pydantic models to dicts multiple times in the same request chain.

# Slow: serializes on every log call
log.info("processing", data=model.model_dump())
result = process(model.model_dump())
save(model.model_dump())

# Fast: serialize once and reuse
data = model.model_dump()
log.info("processing", data=data)
result = process(data)
save(data)

String concatenation in prompt building: Using + in loops creates new string objects each time.

# Slow: O(n^2) string building
prompt = ""
for msg in messages:
    prompt += f"{msg['role']}: {msg['content']}\n"

# Fast: join is O(n)
prompt = "\n".join(f"{msg['role']}: {msg['content']}" for msg in messages)

Sequential API calls that could be concurrent:

import asyncio

# Slow: sequential
result1 = await call_llm(prompt1)
result2 = await call_llm(prompt2)
result3 = await call_llm(prompt3)

# Fast: concurrent
result1, result2, result3 = await asyncio.gather(
    call_llm(prompt1),
    call_llm(prompt2),
    call_llm(prompt3),
)
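A quick self-contained check of the claim, with asyncio.sleep standing in for the network latency of call_llm (fake_llm_call is hypothetical):

```python
import asyncio
import time


async def fake_llm_call(prompt: str) -> str:
    await asyncio.sleep(0.1)  # stand-in for ~100ms of network latency
    return f"response:{prompt}"


async def main() -> float:
    start = time.perf_counter()
    await asyncio.gather(*(fake_llm_call(p) for p in ("a", "b", "c")))
    return time.perf_counter() - start


elapsed = asyncio.run(main())
print(f"3 concurrent 100ms calls: {elapsed:.2f}s")  # ~0.1s, not ~0.3s
```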

Benchmarking with timeit

For micro-benchmarks comparing two approaches, use timeit to get statistically reliable measurements.

import timeit

# Compare two prompt formatting approaches
setup = "messages = [{'role': 'user', 'content': 'hello'}] * 100"

time_concat = timeit.timeit(
    stmt='result = ""\nfor m in messages:\n    result += m["content"]',
    setup=setup,
    number=10_000,
)

time_join = timeit.timeit(
    stmt='"".join(m["content"] for m in messages)',
    setup=setup,
    number=10_000,
)

print(f"Concatenation: {time_concat:.3f}s")
print(f"Join: {time_join:.3f}s")

FAQ

When should I optimize Python code versus scaling infrastructure?

Profile first to identify the actual bottleneck. If 95% of request time is LLM API latency, optimizing Python code saves negligible time — scale by adding caching or request batching instead. If profiling shows significant time in your own code (prompt assembly, data processing, serialization), optimize the code. The general rule: optimize what the profiler shows, not what you assume.
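As a concrete example of the caching option, a minimal sketch with functools.lru_cache. Here call_llm is a hypothetical stand-in, and this only helps when repeated prompts are exact string matches:

```python
from functools import lru_cache

api_calls = 0


def call_llm(prompt: str) -> str:
    # Stand-in for a real LLM API call
    global api_calls
    api_calls += 1
    return f"answer to: {prompt}"


@lru_cache(maxsize=1024)
def cached_completion(prompt: str) -> str:
    return call_llm(prompt)


cached_completion("What is vector search?")
cached_completion("What is vector search?")  # cache hit: no second API call
```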

Does using async automatically make my AI application faster?

Only if your application spends time waiting on I/O. Async shines when you can issue multiple LLM calls, database queries, or API requests concurrently. If your pipeline is strictly sequential — each step depends on the previous result — async adds complexity without performance benefit. Profile the specific workload to decide.

How do I profile AI applications running in Docker or Kubernetes?

Use py-spy with the --pid flag against the container's Python process. For Kubernetes, exec into the pod and run py-spy directly. Alternatively, build profiling into your application behind a feature flag — expose a /debug/profile endpoint that runs cProfile for a configurable duration and returns the results. Disable this endpoint in production unless you need it.
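A framework-agnostic sketch of that handler's core logic. Route wiring is omitted, and PROFILING_ENABLED is an assumed feature flag, not part of any specific framework:

```python
import cProfile
import io
import pstats
import time

PROFILING_ENABLED = True  # hypothetical feature flag


def run_profile_snapshot(duration: float = 1.0) -> str:
    """Profile this process for `duration` seconds and return a text report."""
    if not PROFILING_ENABLED:
        return "profiling disabled"
    profiler = cProfile.Profile()
    profiler.enable()
    deadline = time.monotonic() + duration
    while time.monotonic() < deadline:
        sum(i * i for i in range(1_000))  # real request handling happens here
    profiler.disable()
    stream = io.StringIO()
    pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(10)
    return stream.getvalue()


report = run_profile_snapshot(duration=0.2)
```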


#Python #Performance #Profiling #Optimization #AgenticAI #LearnAI #AIEngineering


CallSphere Team

Expert insights on AI voice agents and customer communication automation.
