OpenAI Agents SDK Performance Tuning: Reducing Latency and Token Usage in Production
Optimize your OpenAI Agents SDK deployments for production with techniques for connection reuse, prompt compression, tool result caching, parallel tool execution, and token budget management.
Where Agents Spend Time and Tokens
Before optimizing, you need to understand the cost profile of an agent run. There are three main sources of latency and token usage: model calls (the LLM inference itself), tool execution (network calls, database queries, computation), and conversation history (accumulated tokens from multi-turn sessions).
Each requires a different optimization strategy. This guide covers practical techniques for each category.
Connection Reuse and Client Management
Creating a new HTTP client for every model call adds 50-200ms of overhead for TLS handshake and connection setup. Reuse clients across requests.
```python
from agents import Agent, Runner, set_default_openai_client
from openai import AsyncOpenAI
import httpx

# BAD: constructing a fresh AsyncOpenAI/httpx client inside every request
# handler pays TLS handshake and connection setup on each call.

# GOOD: one shared client with connection pooling, configured once at startup
_shared_client = AsyncOpenAI(
    http_client=httpx.AsyncClient(
        limits=httpx.Limits(
            max_connections=50,
            max_keepalive_connections=20,
            keepalive_expiry=30,
        ),
        timeout=httpx.Timeout(30.0, connect=5.0),
    )
)
set_default_openai_client(_shared_client)

agent = Agent(
    name="fast_agent",
    instructions="You are a helpful assistant.",
)

async def handle(message: str):
    # Every run now reuses the pooled connections of _shared_client
    result = await Runner.run(agent, input=message)
    return result.final_output
```
Prompt Optimization: Fewer Tokens, Same Quality
Every token in your agent's instructions costs money and adds latency. Compress your prompts without losing clarity.
```python
from agents import Agent

# VERBOSE: ~89 tokens
verbose_instructions = """
You are a customer support agent for our company. Your role is to help
customers with their questions and concerns. You should always be polite,
professional, and helpful. When you don't know the answer to a question,
you should let the customer know that you will escalate their issue to
a senior support agent who can help them further.
"""

# COMPRESSED: ~42 tokens, same behavior
compressed_instructions = """Customer support agent. Be polite and professional.
If unsure, escalate to senior support. Use tools to look up account info."""

# STRUCTURED: a clear format reduces ambiguity, saving re-prompt tokens
structured_instructions = """Role: Customer support agent
Behavior: Polite, professional, concise
Tools: Use search_account before answering account questions
Escalation: Hand off to senior_agent if issue is unresolved after 2 attempts
Format: Reply in 1-3 sentences unless user asks for detail"""

optimized_agent = Agent(
    name="support",
    instructions=structured_instructions,
)
```
Tool Result Caching
If a tool returns the same data for the same inputs, cache it. This saves both tool execution time and the tokens spent on redundant tool calls.
```python
from agents import function_tool
import hashlib
import httpx
import json
import time

class ToolCache:
    """In-memory TTL cache for tool results."""

    def __init__(self, ttl_seconds: int = 300):
        self._cache: dict[str, tuple[str, float]] = {}
        self.ttl = ttl_seconds

    def get(self, key: str) -> str | None:
        if key in self._cache:
            value, timestamp = self._cache[key]
            if time.monotonic() - timestamp < self.ttl:
                return value
            del self._cache[key]  # expired entry
        return None

    def set(self, key: str, value: str):
        self._cache[key] = (value, time.monotonic())

    def make_key(self, tool_name: str, **kwargs) -> str:
        raw = json.dumps({"tool": tool_name, **kwargs}, sort_keys=True)
        return hashlib.sha256(raw.encode()).hexdigest()

cache = ToolCache(ttl_seconds=600)

@function_tool
async def get_product_info(product_id: str) -> str:
    """Get product information by ID."""
    cache_key = cache.make_key("get_product_info", product_id=product_id)
    cached = cache.get(cache_key)
    if cached is not None:
        return cached
    # Actual lookup (expensive); in production, reuse a module-level
    # client here rather than opening one per call
    async with httpx.AsyncClient() as client:
        resp = await client.get(f"https://api.example.com/products/{product_id}")
        resp.raise_for_status()
        result = resp.text
    cache.set(cache_key, result)
    return result
```
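One detail worth testing: `sort_keys=True` in the key builder makes the cache key independent of keyword-argument order, so logically identical calls always hit the same entry. A standalone check of that property:

```python
import hashlib
import json

def make_key(tool_name: str, **kwargs) -> str:
    # sort_keys=True normalizes argument order before hashing
    raw = json.dumps({"tool": tool_name, **kwargs}, sort_keys=True)
    return hashlib.sha256(raw.encode()).hexdigest()

k1 = make_key("lookup", a=1, b=2)
k2 = make_key("lookup", b=2, a=1)   # same call, different kwarg order
k3 = make_key("lookup", a=2, b=2)   # different arguments
```

Without the normalization, `k1` and `k2` would hash differently and the cache hit rate would silently drop.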
Conversation History Trimming
Long conversations accumulate tokens fast. Trim history to keep costs under control.
```python
from agents.items import TResponseInputItem

class ConversationTrimmer:
    def __init__(self, max_turns: int = 20, max_chars: int = 50000):
        self.max_turns = max_turns
        self.max_chars = max_chars

    def trim(self, history: list[TResponseInputItem]) -> list[TResponseInputItem]:
        # Keep system messages and the most recent turns
        system_msgs = [m for m in history if isinstance(m, dict) and m.get("role") == "system"]
        non_system = [m for m in history if not (isinstance(m, dict) and m.get("role") == "system")]
        # Keep the last N turns (2 items per turn: user + assistant)
        trimmed = non_system[-self.max_turns * 2:]
        # Drop oldest non-system messages if still over the character budget
        result = system_msgs + trimmed
        total_chars = sum(len(str(m)) for m in result)
        while total_chars > self.max_chars and len(result) > len(system_msgs) + 2:
            result.pop(len(system_msgs))  # Remove oldest non-system message
            total_chars = sum(len(str(m)) for m in result)
        return result

trimmer = ConversationTrimmer(max_turns=15, max_chars=40000)
```
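A quick self-contained check of the keep-last-N logic (a condensed version of the trimmer above, operating on plain role/content dicts):

```python
def trim_history(history: list[dict], max_turns: int = 2) -> list[dict]:
    # Keep all system messages plus the last max_turns user/assistant exchanges
    system = [m for m in history if m.get("role") == "system"]
    rest = [m for m in history if m.get("role") != "system"]
    return system + rest[-max_turns * 2:]

history = [{"role": "system", "content": "Support agent"}]
for i in range(5):
    history.append({"role": "user", "content": f"q{i}"})
    history.append({"role": "assistant", "content": f"a{i}"})

# 11 messages in, 5 out: the system prompt plus the last two exchanges
trimmed = trim_history(history, max_turns=2)
```

The system prompt survives trimming while the oldest exchanges fall away, which is the invariant that matters for agent behavior.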
Parallel Tool Execution
When the agent calls multiple tools that are independent, execute them concurrently.
```python
import asyncio
from agents import Agent, function_tool

@function_tool
async def get_user_orders(user_id: str) -> str:
    """Fetch user order history."""
    await asyncio.sleep(0.5)  # Simulates API call
    return f"3 orders for user {user_id}"

@function_tool
async def get_user_profile(user_id: str) -> str:
    """Fetch user profile."""
    await asyncio.sleep(0.3)  # Simulates API call
    return f"Profile for user {user_id}: Premium tier"

@function_tool
async def get_user_tickets(user_id: str) -> str:
    """Fetch user support tickets."""
    await asyncio.sleep(0.4)  # Simulates API call
    return f"2 open tickets for user {user_id}"

# The SDK executes tool calls concurrently when the model requests
# multiple tools in a single response. To encourage that, say so
# explicitly in the agent instructions:
parallel_agent = Agent(
    name="support",
    instructions="""Customer support agent.
When looking up user information, call get_user_profile,
get_user_orders, and get_user_tickets simultaneously.""",
    tools=[get_user_orders, get_user_profile, get_user_tickets],
)
```
Token Budget Management
Set hard limits on token usage per agent run to prevent cost overruns.
```python
from agents import Agent, ModelSettings

budget_agent = Agent(
    name="budget_agent",
    instructions="Be concise. Answer in 2-3 sentences maximum.",
    model_settings=ModelSettings(
        max_tokens=500,   # Hard cap on output tokens per model call
        temperature=0.3,  # Lower temperature keeps answers focused and short
    ),
)
```
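Note that max_tokens caps each individual response, not the whole session. To cap a full multi-turn run you can track cumulative usage yourself; a minimal sketch, where the numbers passed to record() would come from each model response's reported usage:

```python
class TokenBudget:
    """Tracks cumulative token spend across the model calls of one session."""

    def __init__(self, limit: int):
        self.limit = limit
        self.used = 0

    def record(self, input_tokens: int, output_tokens: int) -> None:
        self.used += input_tokens + output_tokens

    @property
    def exhausted(self) -> bool:
        return self.used >= self.limit

budget = TokenBudget(limit=1000)
budget.record(input_tokens=600, output_tokens=300)  # 900 used so far
first_check = budget.exhausted                      # still under budget
budget.record(input_tokens=80, output_tokens=40)    # 1020 used
second_check = budget.exhausted                     # over budget: stop the loop
```

Check `exhausted` between turns and end the session (or switch to a terse wrap-up prompt) once the budget is spent.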
FAQ
What is the biggest performance win for most agent systems?
Connection reuse and prompt compression together typically cut latency by 30-50%. Connection reuse eliminates TLS overhead on every model call, and shorter prompts reduce both input token costs and time-to-first-token. Start with these two before investing in more complex optimizations.
How do I measure token usage per agent run?
The SDK returns usage information on the run result. Each entry in result.raw_responses corresponds to one model call and carries a usage object; sum input_tokens and output_tokens across all entries to get total usage for the run. Log these to your metrics system to track trends.
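The aggregation itself is a few lines. A sketch under the assumption that each response exposes a usage object with input_tokens and output_tokens fields, modeled here with a stand-in dataclass so it runs without the SDK:

```python
from dataclasses import dataclass

@dataclass
class Usage:
    # Stand-in for the usage object on each entry of result.raw_responses
    input_tokens: int
    output_tokens: int

def total_usage(usages: list[Usage]) -> tuple[int, int]:
    # Sum across every model call in the run
    return (
        sum(u.input_tokens for u in usages),
        sum(u.output_tokens for u in usages),
    )

# Three model calls in one agent run (e.g. two tool rounds + final answer)
run_usages = [Usage(1200, 80), Usage(1350, 40), Usage(1500, 120)]
total_in, total_out = total_usage(run_usages)
```

Note how input tokens grow with each call as tool results accumulate in context; that growth is what the trimming and caching techniques above are attacking.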
Should I use a smaller model for simple tasks?
Yes. Route simple queries (greetings, FAQ answers, status checks) to faster, cheaper models like GPT-4o-mini while keeping complex reasoning on GPT-4o or Claude. Use the custom model provider pattern to dynamically select models based on task complexity detected by a lightweight classifier.
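The "lightweight classifier" can start as a plain heuristic before you invest in a model-based router. A hypothetical sketch (the patterns and length threshold are illustrative, not tuned):

```python
# Keywords that suggest a simple query; substring matching is crude
# ("hi" matches "this") -- a real router would use a small classifier
SIMPLE_HINTS = ("hello", "hi,", "thanks", "status", "hours", "price")

def pick_model(message: str) -> str:
    """Heuristic router: short small-talk/FAQ queries go to the cheap model."""
    text = message.lower()
    if len(text) < 80 and any(hint in text for hint in SIMPLE_HINTS):
        return "gpt-4o-mini"  # fast, cheap path
    return "gpt-4o"           # full reasoning path

cheap = pick_model("Hi, what are your opening hours?")
full = pick_model("Walk me through migrating our billing data between plans.")
```

Even a crude router like this captures most of the savings if the easy queries dominate your traffic; measure the misroute rate before tightening it.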
CallSphere Team
Expert insights on AI voice agents and customer communication automation.