Connection Pooling for AI Applications: Reusing HTTP Connections Across LLM Calls
Learn to configure HTTP connection pooling with httpx and aiohttp for AI applications. Reduce latency, manage connection limits, and optimize DNS caching for LLM API calls.
Why Connection Pooling Matters for LLM Applications
Every HTTP request to an LLM API involves a TCP handshake (one round-trip), a TLS handshake (one to two more round-trips, depending on the TLS version), and possibly a DNS lookup. For a server 50ms away, that is up to 150ms of overhead before you send a single byte of your prompt. When your agent makes 20 LLM calls per user request, that overhead can add up to 3 seconds of pure connection setup.
Connection pooling eliminates this by reusing established TCP connections across multiple requests. Once the initial connection is established, subsequent requests skip the handshake entirely and start transmitting immediately.
httpx Connection Pool Configuration
httpx is a modern async HTTP client for Python that provides fine-grained control over connection pooling.
```python
import os

import httpx

# Read the API key from the environment (adjust the variable
# name for your provider) rather than hard-coding it
API_KEY = os.environ["OPENAI_API_KEY"]

# Configure a connection pool tuned for LLM API access
limits = httpx.Limits(
    max_connections=100,           # Total connections across all hosts
    max_keepalive_connections=20,  # Idle connections to keep alive
    keepalive_expiry=30.0,         # Seconds before an idle conn is closed
)

client = httpx.AsyncClient(
    limits=limits,
    timeout=httpx.Timeout(
        connect=5.0,   # Max time to establish a connection
        read=60.0,     # Max time to read the response (LLMs are slow)
        write=10.0,    # Max time to send the request
        pool=10.0,     # Max time waiting for an available connection
    ),
    http2=True,        # HTTP/2 multiplexes requests over a single conn
    headers={"Authorization": f"Bearer {API_KEY}"},
)
```
The critical parameters:
- max_connections controls how many simultaneous TCP connections the client maintains. Set this to match your concurrency level.
- max_keepalive_connections determines how many idle connections stay alive between bursts of requests.
- keepalive_expiry balances resource usage against reconnection overhead.
- http2 enables multiplexing multiple requests over a single connection, which is particularly effective for LLM APIs.
Lifecycle Management: Application-Scoped Clients
The most common mistake is creating a new client per request. Always scope the client to your application lifetime.
```python
import os
from contextlib import asynccontextmanager

import httpx
from fastapi import FastAPI

API_KEY = os.environ["OPENAI_API_KEY"]


class LLMService:
    """LLM service with connection pool lifecycle management."""

    def __init__(self):
        self._client: httpx.AsyncClient | None = None

    async def start(self):
        self._client = httpx.AsyncClient(
            limits=httpx.Limits(
                max_connections=50,
                max_keepalive_connections=10,
            ),
            # httpx.Timeout requires a default (or all four values)
            # alongside any per-phase overrides
            timeout=httpx.Timeout(10.0, connect=5.0, read=120.0),
            http2=True,
            base_url="https://api.openai.com/v1",
            headers={"Authorization": f"Bearer {API_KEY}"},
        )

    async def stop(self):
        if self._client:
            await self._client.aclose()

    async def complete(self, messages: list[dict]) -> str:
        if self._client is None:
            raise RuntimeError("LLMService.start() has not been called")
        response = await self._client.post(
            "/chat/completions",
            json={"model": "gpt-4o", "messages": messages},
        )
        response.raise_for_status()
        return response.json()["choices"][0]["message"]["content"]


llm_service = LLMService()


@asynccontextmanager
async def lifespan(app: FastAPI):
    await llm_service.start()
    yield
    await llm_service.stop()


app = FastAPI(lifespan=lifespan)
```
aiohttp Connection Pooling
aiohttp uses TCPConnector to manage connection pools. It offers additional options like DNS caching.
```python
import os

import aiohttp

API_KEY = os.environ["OPENAI_API_KEY"]


async def create_session() -> aiohttp.ClientSession:
    # Create the connector inside a running event loop, not at import time
    connector = aiohttp.TCPConnector(
        limit=100,                   # Max total connections
        limit_per_host=30,           # Max connections per host
        ttl_dns_cache=300,           # Cache DNS lookups for 5 minutes
        use_dns_cache=True,          # Enable DNS caching
        keepalive_timeout=30,        # Keep idle connections for 30s
        enable_cleanup_closed=True,  # Clean up closed connections
    )
    return aiohttp.ClientSession(
        connector=connector,
        timeout=aiohttp.ClientTimeout(
            total=120,     # Total request timeout
            connect=5,     # Connection establishment timeout
            sock_read=60,  # Socket read timeout
        ),
        headers={"Authorization": f"Bearer {API_KEY}"},
    )
```
DNS Caching
DNS resolution adds 5-50ms per request without caching. Both httpx and aiohttp can cache DNS lookups to eliminate this.
```python
# aiohttp has built-in DNS caching via TCPConnector
connector = aiohttp.TCPConnector(
    use_dns_cache=True,
    ttl_dns_cache=300,  # 5-minute cache TTL
)

# httpx has no explicit DNS cache. Because pooled connections are
# reused, DNS is only resolved when a new connection is opened, so
# keepalive reuse makes lookups rare in practice.
```
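To make the TTL behaviour concrete, here is a minimal sketch of the mechanism behind ttl_dns_cache: a lookup result is stored with a timestamp and reused until the TTL lapses. The TTLDNSCache class and its resolver parameter are illustrative, not part of either library.

```python
import time


class TTLDNSCache:
    """Illustrative TTL cache, mimicking what aiohttp does internally."""

    def __init__(self, resolver, ttl: float = 300.0):
        self._resolver = resolver  # e.g. a wrapper around socket.getaddrinfo
        self._ttl = ttl
        self._cache: dict[str, tuple[float, list]] = {}

    def resolve(self, host: str) -> list:
        now = time.monotonic()
        hit = self._cache.get(host)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]  # Cache hit: skip the 5-50ms lookup
        addresses = self._resolver(host)  # Cache miss: do the real query
        self._cache[host] = (now, addresses)
        return addresses
```

Within the TTL window, repeated calls for the same host never touch the resolver, which is exactly why a 5-minute TTL removes almost all DNS latency from a busy service.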
Monitoring Connection Pool Health
In production, monitor your pool to detect exhaustion and connection leaks.
```python
import logging
import os

import httpx

logger = logging.getLogger("llm_pool")

API_KEY = os.environ["OPENAI_API_KEY"]


class MonitoredLLMClient:
    def __init__(self, max_connections: int = 50):
        self._max = max_connections
        self._active = 0
        self._client = httpx.AsyncClient(
            limits=httpx.Limits(max_connections=max_connections),
            # A default value is required alongside per-phase overrides
            timeout=httpx.Timeout(10.0, connect=5.0, read=120.0),
            headers={"Authorization": f"Bearer {API_KEY}"},
        )

    async def request(self, messages: list[dict]) -> str:
        self._active += 1
        utilization = self._active / self._max
        if utilization > 0.8:
            logger.warning(
                f"Pool utilization high: {self._active}/{self._max} "
                f"({utilization:.0%})"
            )
        try:
            resp = await self._client.post(
                "https://api.openai.com/v1/chat/completions",
                json={"model": "gpt-4o", "messages": messages},
            )
            resp.raise_for_status()
            return resp.json()["choices"][0]["message"]["content"]
        finally:
            self._active -= 1
```
FAQ
How many max_connections should I set for LLM API calls?
Match it to your maximum expected concurrency. If your application handles 50 concurrent user requests and each makes 1-2 LLM calls, set max_connections to 50-100. Setting it too high wastes resources; too low causes requests to queue waiting for connections. Monitor pool utilization in production and adjust.
Should I use HTTP/2 for LLM API calls?
Yes, when the API supports it. HTTP/2 multiplexes multiple requests over a single TCP connection, reducing connection overhead dramatically. OpenAI and Anthropic APIs support HTTP/2. Enable it with http2=True in httpx (requires the h2 package installed).
What happens when the connection pool is exhausted?
Requests wait in a queue until a connection becomes available, up to the pool timeout. In httpx, this wait is bounded by the pool parameter of httpx.Timeout. If the timeout expires, an httpx.PoolTimeout exception is raised. Handle this by either increasing the pool size or implementing request queuing with backpressure.
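One way to add that backpressure is to cap in-flight calls below the pool size with an asyncio.Semaphore, so callers wait before a request is ever handed to the HTTP client and the pool is never oversubscribed. A sketch (the BackpressureGate name is illustrative):

```python
import asyncio


class BackpressureGate:
    """Caps in-flight calls so the connection pool is never oversubscribed."""

    def __init__(self, max_in_flight: int):
        self._sem = asyncio.Semaphore(max_in_flight)

    async def run(self, coro_fn, *args, **kwargs):
        # Callers queue here, before a connection is ever requested
        async with self._sem:
            return await coro_fn(*args, **kwargs)
```

Wrap each LLM call as `await gate.run(client.post, url, json=payload)` with max_in_flight set slightly below max_connections, and pool waits turn into orderly queuing instead of PoolTimeout exceptions.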
#Python #ConnectionPooling #Httpx #Aiohttp #Performance #AgenticAI #LearnAI #AIEngineering
CallSphere Team
Expert insights on AI voice agents and customer communication automation.