Reducing Time-to-First-Token in AI Agents: Connection Reuse, Warm Pools, and Prefetching
Learn how to minimize the delay between a user request and the first visible response from your AI agent by optimizing connections, DNS caching, request prefetching, and warm pool strategies.
What Is Time-to-First-Token and Why It Matters
Time-to-First-Token (TTFT) is the duration between when a user submits a request and when the first token of the AI response becomes visible. In conversational AI agents, TTFT directly shapes user perception of speed. A sub-second TTFT feels snappy. A 5-second TTFT feels broken, even if the total generation time is identical.
Most of the TTFT budget is not spent inside the LLM. It is consumed by network overhead: DNS resolution, TCP handshake, TLS negotiation, and HTTP request serialization. Optimizing these layers can shave 200-800ms off every single request.
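A back-of-envelope model makes the stakes concrete. The figures below are illustrative midpoints of the ranges discussed in this article, not measurements from any particular provider:

```python
# Illustrative per-request overhead on a cold HTTPS connection (ms);
# midpoints of the ranges cited in this article, not real measurements.
COLD_OVERHEAD_MS = {
    "dns_lookup": 50,
    "tcp_handshake": 60,
    "tls_negotiation": 150,
    "request_serialization": 10,
}

def connection_overhead_ms(turns: int, reuse: bool) -> int:
    """Cumulative connection overhead across a conversation."""
    per_connection = sum(COLD_OVERHEAD_MS.values())
    # With reuse, the handshake cost is paid once; without it, every turn.
    return per_connection if reuse else per_connection * turns
```

Without reuse, a ten-turn conversation pays roughly 2.7 seconds of pure connection overhead; with reuse, it pays the cost once.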
Connection Reuse with HTTP Keep-Alive
Every new HTTPS connection to an LLM provider requires a DNS lookup, TCP three-way handshake, and TLS negotiation. On a cold connection to OpenAI or Anthropic, this adds 150-400ms. Connection reuse eliminates this overhead for subsequent requests.
import httpx

# BAD: Creating a new client per request
async def slow_completion(prompt: str) -> str:
    async with httpx.AsyncClient() as client:
        response = await client.post(
            "https://api.openai.com/v1/chat/completions",
            headers={"Authorization": "Bearer sk-..."},
            json={"model": "gpt-4o", "messages": [{"role": "user", "content": prompt}]},
        )
        return response.json()["choices"][0]["message"]["content"]

# GOOD: Reuse a single client across all requests
class LLMClient:
    def __init__(self):
        self._client = httpx.AsyncClient(
            timeout=httpx.Timeout(30.0, connect=5.0),
            limits=httpx.Limits(
                max_connections=20,
                max_keepalive_connections=10,
                keepalive_expiry=120,
            ),
            http2=True,
        )

    async def completion(self, prompt: str) -> str:
        response = await self._client.post(
            "https://api.openai.com/v1/chat/completions",
            headers={"Authorization": "Bearer sk-..."},
            json={"model": "gpt-4o", "messages": [{"role": "user", "content": prompt}]},
        )
        return response.json()["choices"][0]["message"]["content"]

    async def close(self):
        await self._client.aclose()
The httpx.AsyncClient with http2=True enables multiplexed streams over a single connection, meaning multiple LLM calls share one TLS session. Note that http2=True requires the optional HTTP/2 dependency: pip install "httpx[http2]".
DNS Caching
DNS resolution adds 20-80ms per cold lookup, and Python does not cache DNS results by default. httpx has no built-in resolver cache either, but keep-alive connections sidestep repeated lookups, and a retrying transport smooths over transient resolution failures.
import httpx

# Configure a transport with retries; connection pooling is built in.
# Use the public httpx.AsyncHTTPTransport, not the private
# httpx._transports.default module.
transport = httpx.AsyncHTTPTransport(
    retries=2,
    http2=True,
)

client = httpx.AsyncClient(
    transport=transport,
    timeout=httpx.Timeout(30.0, connect=5.0),
)
At the infrastructure level, running a local DNS cache like dnsmasq or using systemd-resolved with caching enabled eliminates repeated lookups entirely.
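When you can't control the host's resolver, a small in-process cache is an alternative. This is a minimal sketch, assuming a fixed TTL is acceptable for your provider's records; it wraps the stdlib resolver that most Python HTTP clients, httpx included, ultimately call:

```python
import socket
import time

_DNS_TTL = 300.0  # seconds; an assumption -- tune to your DNS records
_dns_cache: dict = {}
_orig_getaddrinfo = socket.getaddrinfo

def cached_getaddrinfo(host, port, *args, **kwargs):
    """Drop-in replacement for socket.getaddrinfo with a TTL cache."""
    key = (host, port, args, tuple(sorted(kwargs.items())))
    entry = _dns_cache.get(key)
    now = time.monotonic()
    if entry is not None and now - entry[0] < _DNS_TTL:
        return entry[1]  # cache hit: skip the network lookup
    result = _orig_getaddrinfo(host, port, *args, **kwargs)
    _dns_cache[key] = (now, result)
    return result

# Install the cache process-wide
socket.getaddrinfo = cached_getaddrinfo
```

A fixed TTL ignores the records' real TTLs, so keep it short enough that a provider failover does not strand you on a stale address.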
Warm Pools: Pre-Establishing Connections
A warm pool pre-establishes connections before any user request arrives. When the first request comes in, the TCP and TLS handshake are already complete.
import asyncio
import httpx

class WarmLLMPool:
    def __init__(self, base_url: str, api_key: str, pool_size: int = 5):
        self.client = httpx.AsyncClient(
            base_url=base_url,
            headers={"Authorization": f"Bearer {api_key}"},
            limits=httpx.Limits(
                max_connections=pool_size,
                max_keepalive_connections=pool_size,
            ),
            http2=True,
            timeout=httpx.Timeout(30.0),
        )

    async def warm_up(self):
        """Pre-establish connections by sending lightweight requests."""
        tasks = [
            self.client.get("/v1/models")
            for _ in range(3)
        ]
        await asyncio.gather(*tasks, return_exceptions=True)

    async def complete(self, messages: list[dict]) -> str:
        response = await self.client.post(
            "/v1/chat/completions",
            json={"model": "gpt-4o", "messages": messages},
        )
        return response.json()["choices"][0]["message"]["content"]

# During application startup
pool = WarmLLMPool("https://api.openai.com", "sk-...")
await pool.warm_up()
Call warm_up() during your application's startup phase. In FastAPI this goes inside the lifespan handler; in Django, AppConfig.ready() is synchronous, so schedule the coroutine onto your event loop from there.
Request Prefetching for Predictable Workflows
When your agent follows predictable patterns — like always retrieving user context before generating a response — you can prefetch data while the user is still typing.
import asyncio

class PrefetchingAgent:
    def __init__(self, llm_client, user_store):
        self.llm = llm_client
        self.users = user_store
        self._prefetch_cache: dict[str, asyncio.Task] = {}

    async def on_typing_started(self, user_id: str):
        """Trigger prefetch when user starts typing."""
        if user_id not in self._prefetch_cache:
            self._prefetch_cache[user_id] = asyncio.create_task(
                self.users.get_context(user_id)
            )

    async def handle_message(self, user_id: str, message: str):
        # Retrieve prefetched context (already in flight or completed)
        task = self._prefetch_cache.pop(user_id, None)
        if task:
            context = await task
        else:
            context = await self.users.get_context(user_id)
        return await self.llm.completion(
            f"User context: {context}\nUser: {message}"
        )
This pattern overlaps network I/O with user think time, reducing perceived TTFT by the full duration of the prefetch.
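The overlap is easy to demonstrate in a toy simulation; the sleeps below stand in for real network I/O, and the durations are arbitrary:

```python
import asyncio
import time

async def fetch_context() -> str:
    # Stand-in for a 200 ms context lookup (simulated I/O)
    await asyncio.sleep(0.2)
    return "ctx"

async def without_prefetch() -> float:
    await asyncio.sleep(0.3)            # user typing ("think time")
    t0 = time.perf_counter()
    await fetch_context()               # lookup starts only on send
    return time.perf_counter() - t0     # perceived wait after send

async def with_prefetch() -> float:
    task = asyncio.create_task(fetch_context())  # start on typing event
    await asyncio.sleep(0.3)            # typing overlaps the fetch
    t0 = time.perf_counter()
    await task                          # already finished by now
    return time.perf_counter() - t0
```

Without prefetching the user waits the full 200 ms after hitting send; with it, the lookup completes during think time and the post-send wait is near zero.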
Measuring TTFT in Practice
Always measure TTFT from the client perspective, not server-side. Use structured logging to track each phase.
import time

async def timed_completion(client, messages):
    t_start = time.perf_counter()
    t_first_token = None
    # Stream the response so the first byte/token can be observed as it
    # arrives; a plain client.post() buffers the whole body first.
    async with client.stream(
        "POST",
        "/v1/chat/completions",
        json={"model": "gpt-4o", "messages": messages, "stream": True},
    ) as response:
        t_first_byte = time.perf_counter()  # headers received
        async for chunk in response.aiter_bytes():
            if t_first_token is None:
                t_first_token = time.perf_counter()
    return {
        "ttfb_ms": (t_first_byte - t_start) * 1000,
        "ttft_ms": (t_first_token - t_start) * 1000,
        "total_ms": (time.perf_counter() - t_start) * 1000,
    }
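Once those per-request dictionaries are logged, aggregate them as percentiles rather than averages, since a handful of cold connections can hide inside a mean. A minimal sketch using only the standard library:

```python
import statistics

def ttft_summary(samples_ms: list[float]) -> dict[str, float]:
    """Summarize client-side TTFT samples; p95 is what users notice."""
    cuts = statistics.quantiles(samples_ms, n=100)  # 99 percentile cut points
    return {
        "p50_ms": statistics.median(samples_ms),
        "p95_ms": cuts[94],  # 95th percentile
        "max_ms": max(samples_ms),
    }
```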
FAQ
How much latency does connection reuse actually save?
On a typical HTTPS connection to a major LLM provider, the cold connection overhead is 150-400ms (DNS + TCP + TLS). Connection reuse eliminates all of this for subsequent requests. Over a conversation with 10 turns, that saves 1.5-4 seconds of cumulative wait time.
Should I use HTTP/2 for LLM API calls?
Yes. HTTP/2 multiplexes multiple requests over a single TCP connection, which is valuable when your agent makes parallel tool calls or sends multiple completions simultaneously. Libraries like httpx support it natively with http2=True.
What is a good TTFT target for conversational AI agents?
Under 500ms is excellent, under 1 second is acceptable for most applications, and anything over 2 seconds will feel sluggish to users. These targets include network overhead but exclude the actual model inference time at the provider.
#Performance #TTFT #ConnectionPooling #Latency #Python #AgenticAI #LearnAI #AIEngineering
CallSphere Team
Expert insights on AI voice agents and customer communication automation.