
Reducing Time-to-First-Token in AI Agents: Connection Reuse, Warm Pools, and Prefetching

Learn how to minimize the delay between a user request and the first visible response from your AI agent by optimizing connections, DNS caching, request pipelining, and warm pool strategies.

What Is Time-to-First-Token and Why It Matters

Time-to-First-Token (TTFT) is the delay between the moment a user submits a request and the moment the first token of the AI response becomes visible. In conversational AI agents, TTFT directly shapes the user's perception of speed: a sub-second TTFT feels snappy, while a 5-second TTFT feels broken, even when the total generation time is identical.

For short prompts, much of the TTFT budget is not spent inside the LLM at all. It is consumed by network overhead: DNS resolution, the TCP handshake, TLS negotiation, and HTTP request serialization. Optimizing these layers can shave 200-800ms off every single request.

Connection Reuse with HTTP Keep-Alive

Every new HTTPS connection to an LLM provider requires a DNS lookup, TCP three-way handshake, and TLS negotiation. On a cold connection to OpenAI or Anthropic, this adds 150-400ms. Connection reuse eliminates this overhead for subsequent requests.

import httpx
import asyncio

# BAD: Creating a new client per request
async def slow_completion(prompt: str) -> str:
    async with httpx.AsyncClient() as client:
        response = await client.post(
            "https://api.openai.com/v1/chat/completions",
            headers={"Authorization": "Bearer sk-..."},
            json={"model": "gpt-4o", "messages": [{"role": "user", "content": prompt}]},
        )
        return response.json()["choices"][0]["message"]["content"]

# GOOD: Reuse a single client across all requests
class LLMClient:
    def __init__(self):
        self._client = httpx.AsyncClient(
            timeout=httpx.Timeout(30.0, connect=5.0),
            limits=httpx.Limits(
                max_connections=20,
                max_keepalive_connections=10,
                keepalive_expiry=120,
            ),
            http2=True,
        )

    async def completion(self, prompt: str) -> str:
        response = await self._client.post(
            "https://api.openai.com/v1/chat/completions",
            headers={"Authorization": "Bearer sk-..."},
            json={"model": "gpt-4o", "messages": [{"role": "user", "content": prompt}]},
        )
        return response.json()["choices"][0]["message"]["content"]

    async def close(self):
        await self._client.aclose()

Setting http2=True on httpx.AsyncClient enables multiplexed streams over a single connection, so multiple LLM calls share one TLS session. Note that HTTP/2 support is an optional dependency: install it with pip install 'httpx[http2]'.
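To see why sharing one client matters for fan-out, here is a minimal sketch with asyncio.sleep standing in for the network call (real calls through the pooled client above would multiplex over one HTTP/2 connection): concurrent requests overlap rather than queue.

```python
import asyncio
import time

async def fake_llm_call(delay: float) -> str:
    # Stand-in for client.post(...); real calls would share
    # one HTTP/2 connection on the pooled client
    await asyncio.sleep(delay)
    return "ok"

async def fan_out() -> float:
    start = time.perf_counter()
    # Five 100 ms "calls" issued concurrently
    results = await asyncio.gather(*(fake_llm_call(0.1) for _ in range(5)))
    assert all(r == "ok" for r in results)
    return time.perf_counter() - start

elapsed = asyncio.run(fan_out())
# Concurrent calls overlap: total wall time is close to 0.1 s, not 0.5 s
```

The same gather pattern applies directly to parallel tool calls or speculative completions against a shared LLMClient.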

DNS Caching

DNS resolution adds 20-80ms per cold lookup, and Python does not cache DNS results by default. Connection reuse already sidesteps most repeat lookups; beyond that, you can add a resolver cache that persists across requests, and configure the transport to retry transient connection failures.

import httpx

# Public transport API: retries cover transient connection
# failures, including DNS resolution errors
transport = httpx.AsyncHTTPTransport(
    retries=2,
    http2=True,
)

client = httpx.AsyncClient(
    transport=transport,
    timeout=httpx.Timeout(30.0, connect=5.0),
)
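As a minimal sketch of an in-process resolver cache, the standard library is enough: memoize socket.getaddrinfo with functools.lru_cache. This is deliberately simple; lru_cache has no TTL, so entries never expire, which a real deployment would need to handle.

```python
import functools
import socket

@functools.lru_cache(maxsize=256)
def cached_getaddrinfo(host: str, port: int):
    # First call pays the DNS lookup; repeats are served from memory
    return socket.getaddrinfo(host, port, proto=socket.IPPROTO_TCP)

cached_getaddrinfo("localhost", 443)
cached_getaddrinfo("localhost", 443)  # served from cache, no lookup
hits = cached_getaddrinfo.cache_info().hits
```

Plugging a custom resolver into httpx requires writing a custom transport, so for most deployments the OS-level caching described next is the simpler route.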

At the infrastructure level, running a local DNS cache like dnsmasq or using systemd-resolved with caching enabled eliminates repeated lookups entirely.

Warm Pools: Pre-Establishing Connections

A warm pool pre-establishes connections before any user request arrives. When the first request comes in, the TCP and TLS handshake are already complete.

import asyncio
import httpx

class WarmLLMPool:
    def __init__(self, base_url: str, api_key: str, pool_size: int = 5):
        self.client = httpx.AsyncClient(
            base_url=base_url,
            headers={"Authorization": f"Bearer {api_key}"},
            limits=httpx.Limits(
                max_connections=pool_size,
                max_keepalive_connections=pool_size,
            ),
            http2=True,
            timeout=httpx.Timeout(30.0),
        )

    async def warm_up(self):
        """Pre-establish connections by sending lightweight requests."""
        tasks = [
            self.client.get("/v1/models")
            for _ in range(3)
        ]
        await asyncio.gather(*tasks, return_exceptions=True)

    async def complete(self, messages: list[dict]) -> str:
        response = await self.client.post(
            "/v1/chat/completions",
            json={"model": "gpt-4o", "messages": messages},
        )
        return response.json()["choices"][0]["message"]["content"]

# During application startup (inside an async startup hook)
pool = WarmLLMPool("https://api.openai.com", "sk-...")
await pool.warm_up()

Call warm_up() during your application's startup phase: in FastAPI this belongs in the lifespan handler; in Django, schedule it from AppConfig.ready(), which is synchronous, so hand the coroutine to your event loop rather than awaiting it directly.
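The shape of that startup wiring is an async context manager: warm up before yielding, close on exit. The sketch below uses a stub pool so it runs offline without FastAPI installed; in a real app you would construct WarmLLMPool here and pass the function as FastAPI(lifespan=lifespan).

```python
import asyncio
from contextlib import asynccontextmanager

class StubPool:
    """Stand-in for WarmLLMPool so this sketch runs offline."""
    def __init__(self) -> None:
        self.warmed = False
    async def warm_up(self) -> None:
        self.warmed = True
    async def aclose(self) -> None:
        self.warmed = False

@asynccontextmanager
async def lifespan(app):
    # Startup: warm connections before the first request is served.
    # In FastAPI: app = FastAPI(lifespan=lifespan)
    app["llm_pool"] = pool = StubPool()
    await pool.warm_up()
    yield
    # Shutdown: release pooled connections cleanly
    await pool.aclose()

async def demo() -> bool:
    app: dict = {}
    async with lifespan(app):
        ready = app["llm_pool"].warmed  # True while serving
    return ready

ready = asyncio.run(demo())
```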

Request Prefetching for Predictable Workflows

When your agent follows predictable patterns — like always retrieving user context before generating a response — you can prefetch data while the user is still typing.

import asyncio

class PrefetchingAgent:
    def __init__(self, llm_client, user_store):
        self.llm = llm_client
        self.users = user_store
        self._prefetch_cache: dict[str, asyncio.Task] = {}

    async def on_typing_started(self, user_id: str):
        """Trigger prefetch when user starts typing."""
        if user_id not in self._prefetch_cache:
            self._prefetch_cache[user_id] = asyncio.create_task(
                self.users.get_context(user_id)
            )

    async def handle_message(self, user_id: str, message: str):
        # Retrieve prefetched context (already in flight or completed)
        task = self._prefetch_cache.pop(user_id, None)
        if task:
            context = await task
        else:
            context = await self.users.get_context(user_id)

        return await self.llm.completion(
            f"User context: {context}\nUser: {message}"
        )

This pattern overlaps network I/O with user think time, reducing perceived TTFT by the full duration of the prefetch.
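The overlap is easy to demonstrate in isolation. In this sketch (with asyncio.sleep standing in for both the context fetch and the user's typing time), the fetch completes during think time, so the agent waits almost nothing when the message finally arrives:

```python
import asyncio
import time

async def fetch_context(user_id: str) -> str:
    await asyncio.sleep(0.1)  # simulated 100 ms context lookup
    return f"context:{user_id}"

async def main() -> float:
    # Prefetch kicks off the moment the user starts typing
    task = asyncio.create_task(fetch_context("u1"))
    await asyncio.sleep(0.1)  # user think time overlaps the fetch
    start = time.perf_counter()
    context = await task  # already resolved: near-zero extra wait
    waited = time.perf_counter() - start
    assert context == "context:u1"
    return waited

waited = asyncio.run(main())
```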

Measuring TTFT in Practice

Always measure TTFT from the client perspective, not server-side. Use structured logging to track each phase.

import time

import httpx

async def timed_completion(client: httpx.AsyncClient, messages: list[dict]) -> dict:
    t_start = time.perf_counter()
    t_first_token = None

    # client.stream() is required for incremental reads; a plain
    # client.post() buffers the whole body before returning, which
    # would make every phase timestamp effectively identical
    async with client.stream(
        "POST",
        "/v1/chat/completions",
        json={"model": "gpt-4o", "messages": messages, "stream": True},
    ) as response:
        t_first_byte = time.perf_counter()  # response headers received
        async for chunk in response.aiter_bytes():
            if t_first_token is None and chunk:
                t_first_token = time.perf_counter()

    return {
        "ttfb_ms": (t_first_byte - t_start) * 1000,
        "ttft_ms": ((t_first_token or t_first_byte) - t_start) * 1000,
        "total_ms": (time.perf_counter() - t_start) * 1000,
    }

FAQ

How much latency does connection reuse actually save?

On a typical HTTPS connection to a major LLM provider, the cold connection overhead is 150-400ms (DNS + TCP + TLS). Connection reuse eliminates all of this for subsequent requests. Over a conversation with 10 turns, that saves 1.5-4 seconds of cumulative wait time.

Should I use HTTP/2 for LLM API calls?

Yes. HTTP/2 multiplexes multiple requests over a single TCP connection, which is valuable when your agent makes parallel tool calls or sends multiple completions simultaneously. Libraries like httpx support it natively with http2=True.

What is a good TTFT target for conversational AI agents?

Under 500ms is excellent, under 1 second is acceptable for most applications, and anything over 2 seconds will feel sluggish to users. These targets cover the full user-perceived delay: network overhead plus the provider's prefill time before the first token is emitted.


#Performance #TTFT #ConnectionPooling #Latency #Python #AgenticAI #LearnAI #AIEngineering


CallSphere Team

Expert insights on AI voice agents and customer communication automation.
