
OpenAI Agents SDK Performance Tuning: Reducing Latency and Token Usage in Production

Optimize your OpenAI Agents SDK deployments for production with techniques for connection reuse, prompt compression, tool result caching, parallel tool execution, and token budget management.

Where Agents Spend Time and Tokens

Before optimizing, you need to understand the cost profile of an agent run. There are three main sources of latency and token usage: model calls (the LLM inference itself), tool execution (network calls, database queries, computation), and conversation history (accumulated tokens from multi-turn sessions).

Each requires a different optimization strategy. This guide covers practical techniques for each category.

Connection Reuse and Client Management

Creating a new HTTP client for every model call adds 50-200ms of overhead for TLS handshake and connection setup. Reuse clients across requests.

from agents import Agent, Runner, set_default_openai_client
from openai import AsyncOpenAI
import httpx

# BAD: constructing a fresh client inside every request handler —
# each call pays for a new connection pool and TLS handshake
async def handle_slow(message: str):
    set_default_openai_client(AsyncOpenAI())
    result = await Runner.run(agent, input=message)
    return result.final_output

# GOOD: one shared client with connection pooling, created at startup
_shared_client = AsyncOpenAI(
    http_client=httpx.AsyncClient(
        limits=httpx.Limits(
            max_connections=50,
            max_keepalive_connections=20,
            keepalive_expiry=30,
        ),
        timeout=httpx.Timeout(30.0, connect=5.0),
    )
)

# Register the shared client once so every subsequent run reuses it
set_default_openai_client(_shared_client)

agent = Agent(
    name="fast_agent",
    instructions="You are a helpful assistant.",
)

Prompt Optimization: Fewer Tokens, Same Quality

Every token in your agent's instructions costs money and adds latency. Compress your prompts without losing clarity.

# VERBOSE: 89 tokens
verbose_instructions = """
You are a customer support agent for our company. Your role is to help
customers with their questions and concerns. You should always be polite,
professional, and helpful. When you don't know the answer to a question,
you should let the customer know that you will escalate their issue to
a senior support agent who can help them further.
"""

# COMPRESSED: 42 tokens — same behavior
compressed_instructions = """Customer support agent. Be polite and professional.
If unsure, escalate to senior support. Use tools to look up account info."""

# STRUCTURED: Clear format reduces ambiguity, saving re-prompt tokens
structured_instructions = """Role: Customer support agent
Behavior: Polite, professional, concise
Tools: Use search_account before answering account questions
Escalation: Hand off to senior_agent if issue is unresolved after 2 attempts
Format: Reply in 1-3 sentences unless user asks for detail"""

optimized_agent = Agent(
    name="support",
    instructions=structured_instructions,
)
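A rough characters-per-token heuristic (about four characters per token for English prose) is enough to sanity-check the compression without calling a tokenizer. Treat the estimator below as an approximation, not an exact count — use a real tokenizer like tiktoken when you need precise numbers.

```python
# Rough token estimate: ~4 characters per token for English prose.
# This is a heuristic, not a tokenizer.
def estimate_tokens(text: str) -> int:
    return max(1, len(text) // 4)

verbose = (
    "You are a customer support agent for our company. Your role is to help "
    "customers with their questions and concerns. You should always be polite, "
    "professional, and helpful."
)
compressed = "Customer support agent. Be polite and professional."

print(estimate_tokens(verbose), estimate_tokens(compressed))
```

Run this against your own instruction strings before and after compression to confirm the ratio actually improved.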

Tool Result Caching

If a tool returns the same data for the same inputs, cache it. This saves both tool execution time and the tokens spent on redundant tool calls.


from agents import function_tool
import hashlib
import json
import time


class ToolCache:
    def __init__(self, ttl_seconds: int = 300):
        self._cache: dict[str, tuple[str, float]] = {}
        self.ttl = ttl_seconds

    def get(self, key: str) -> str | None:
        if key in self._cache:
            value, timestamp = self._cache[key]
            if time.monotonic() - timestamp < self.ttl:
                return value
            del self._cache[key]
        return None

    def set(self, key: str, value: str):
        self._cache[key] = (value, time.monotonic())

    def make_key(self, tool_name: str, **kwargs) -> str:
        raw = json.dumps({"tool": tool_name, **kwargs}, sort_keys=True)
        return hashlib.sha256(raw.encode()).hexdigest()


cache = ToolCache(ttl_seconds=600)


@function_tool
async def get_product_info(product_id: str) -> str:
    """Get product information by ID."""
    cache_key = cache.make_key("get_product_info", product_id=product_id)
    cached = cache.get(cache_key)
    if cached is not None:  # distinguish a cached empty string from a miss
        return cached

    # Actual lookup (expensive). In production, reuse a pooled
    # module-level httpx client here instead of opening one per call.
    import httpx
    async with httpx.AsyncClient() as client:
        resp = await client.get(f"https://api.example.com/products/{product_id}")
        resp.raise_for_status()
        result = resp.text

    cache.set(cache_key, result)
    return result

Conversation History Trimming

Long conversations accumulate tokens fast. Trim history to keep costs under control.

from agents.items import TResponseInputItem


class ConversationTrimmer:
    def __init__(self, max_turns: int = 20, max_chars: int = 50000):
        self.max_turns = max_turns
        self.max_chars = max_chars

    def trim(self, history: list[TResponseInputItem]) -> list[TResponseInputItem]:
        # Keep system messages and the most recent turns
        system_msgs = [m for m in history if isinstance(m, dict) and m.get("role") == "system"]
        non_system = [m for m in history if not (isinstance(m, dict) and m.get("role") == "system")]

        # Keep last N turns
        trimmed = non_system[-self.max_turns * 2:]  # 2 items per turn (user + assistant)

        # Truncate if still too long
        result = system_msgs + trimmed
        total_chars = sum(len(str(m)) for m in result)

        while total_chars > self.max_chars and len(result) > len(system_msgs) + 2:
            result.pop(len(system_msgs))  # Remove oldest non-system message
            total_chars = sum(len(str(m)) for m in result)

        return result


trimmer = ConversationTrimmer(max_turns=15, max_chars=40000)
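To see what trimming keeps, here is a standalone check of the same last-N-turns slice against a synthetic history of plain role/content dicts. The trim_last_turns helper is hypothetical — it mirrors the core of ConversationTrimmer.trim without the character cap.

```python
def trim_last_turns(history: list[dict], max_turns: int) -> list[dict]:
    """Keep system messages plus the most recent max_turns user/assistant pairs."""
    system = [m for m in history if m.get("role") == "system"]
    rest = [m for m in history if m.get("role") != "system"]
    return system + rest[-max_turns * 2:]  # 2 items per turn


# Build a 30-turn synthetic conversation with one system message
history = [{"role": "system", "content": "Be concise."}]
for i in range(30):
    history.append({"role": "user", "content": f"question {i}"})
    history.append({"role": "assistant", "content": f"answer {i}"})

trimmed = trim_last_turns(history, max_turns=15)
print(len(trimmed))              # 31: system message + 15 full turns
print(trimmed[1]["content"])     # "question 15" — oldest surviving turn
```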

Parallel Tool Execution

When the agent calls multiple tools that are independent, execute them concurrently.

import asyncio
from agents import function_tool


@function_tool
async def get_user_orders(user_id: str) -> str:
    """Fetch user order history."""
    await asyncio.sleep(0.5)  # Simulates API call
    return f"3 orders for user {user_id}"


@function_tool
async def get_user_profile(user_id: str) -> str:
    """Fetch user profile."""
    await asyncio.sleep(0.3)  # Simulates API call
    return f"Profile for user {user_id}: Premium tier"


@function_tool
async def get_user_tickets(user_id: str) -> str:
    """Fetch user support tickets."""
    await asyncio.sleep(0.4)  # Simulates API call
    return f"2 open tickets for user {user_id}"

# The SDK handles parallel tool execution automatically when the
# model requests multiple tools in a single response. To encourage
# this, mention in agent instructions:

parallel_agent = Agent(
    name="support",
    instructions="""Customer support agent.
    When looking up user information, call get_user_profile,
    get_user_orders, and get_user_tickets simultaneously.""",
    tools=[get_user_orders, get_user_profile, get_user_tickets],
)
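Independent of the model's behavior, the win from concurrency is easy to demonstrate with asyncio.gather on stubbed lookups (delays scaled down from the examples above):

```python
import asyncio
import time


async def fetch(label: str, delay: float) -> str:
    await asyncio.sleep(delay)  # stands in for a network call
    return label


async def main():
    calls = [("orders", 0.05), ("profile", 0.03), ("tickets", 0.04)]

    # Sequential: total time is the sum of all delays (~0.12s)
    start = time.monotonic()
    for label, delay in calls:
        await fetch(label, delay)
    sequential = time.monotonic() - start

    # Parallel: total time is the longest single delay (~0.05s)
    start = time.monotonic()
    await asyncio.gather(*(fetch(l, d) for l, d in calls))
    parallel = time.monotonic() - start

    print(f"sequential: {sequential:.3f}s, parallel: {parallel:.3f}s")
    return sequential, parallel


sequential, parallel = asyncio.run(main())
```

This is the same speedup the SDK captures when the model emits several independent tool calls in one turn.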

Token Budget Management

Set hard limits on token usage per agent run to prevent cost overruns.

from agents import ModelSettings

budget_agent = Agent(
    name="budget_agent",
    instructions="Be concise. Answer in 2-3 sentences maximum.",
    model_settings=ModelSettings(
        max_tokens=500,         # Limit output tokens
        temperature=0.3,        # Lower temperature for more focused, consistent replies
    ),
)

FAQ

What is the biggest performance win for most agent systems?

Connection reuse and prompt compression together typically cut latency by 30-50%. Connection reuse eliminates TLS overhead on every model call, and shorter prompts reduce both input token costs and time-to-first-token. Start with these two before investing in more complex optimizations.

How do I measure token usage per agent run?

The SDK returns usage information in the RunResult. Access result.raw_responses to get token counts from each model call. Sum up input_tokens and output_tokens across all responses to get total usage for the run. Log these to your metrics system to track trends.
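A sketch of that aggregation, using stand-in objects in place of a real result.raw_responses. The usage.input_tokens / usage.output_tokens attribute names are assumptions to verify against your installed SDK version.

```python
from types import SimpleNamespace


def total_usage(raw_responses) -> tuple[int, int]:
    """Sum input and output tokens across all model calls in a run."""
    input_total = sum(r.usage.input_tokens for r in raw_responses)
    output_total = sum(r.usage.output_tokens for r in raw_responses)
    return input_total, output_total


# Stand-ins for the responses a real run would produce:
responses = [
    SimpleNamespace(usage=SimpleNamespace(input_tokens=320, output_tokens=45)),
    SimpleNamespace(usage=SimpleNamespace(input_tokens=410, output_tokens=80)),
]

print(total_usage(responses))  # (730, 125)
```

In production, call total_usage(result.raw_responses) after each run and emit the totals to your metrics system.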

Should I use a smaller model for simple tasks?

Yes. Route simple queries (greetings, FAQ answers, status checks) to faster, cheaper models like GPT-4o-mini while keeping complex reasoning on GPT-4o or Claude. Use the custom model provider pattern to dynamically select models based on task complexity detected by a lightweight classifier.
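A minimal version of that routing — the keyword patterns and model names below are illustrative, and a production classifier could itself be a cheap model call:

```python
SIMPLE_PATTERNS = ("hello", "hi", "thanks", "order status", "hours")


def pick_model(message: str) -> str:
    """Route obviously simple queries to a cheaper, faster model."""
    text = message.lower()
    if len(text) < 80 and any(p in text for p in SIMPLE_PATTERNS):
        return "gpt-4o-mini"
    return "gpt-4o"


print(pick_model("Hi, what are your hours?"))
print(pick_model("Compare the warranty terms across my last three orders"))
```

You would then pass the chosen name as the Agent's model parameter when constructing the agent for that request.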


#OpenAIAgentsSDK #Performance #Optimization #Latency #TokenUsage #Production #AgenticAI #LearnAI #AIEngineering

CallSphere Team

Expert insights on AI voice agents and customer communication automation.
