Connection Pooling for AI Applications: Reusing HTTP Connections Across LLM Calls
Learn to configure HTTP connection pooling with httpx and aiohttp for AI applications. Reduce latency, manage connection limits, and optimize DNS caching for LLM API calls.
Why Connection Pooling Matters for LLM Applications
Every HTTP request to an LLM API involves a TCP handshake (one round-trip), a TLS handshake (one to two more round-trips, depending on the TLS version), and possibly a DNS lookup. For a server 50ms away, that is up to 150ms of overhead before you send a single byte of your prompt. When your agent makes 20 LLM calls per user request, that overhead can add up to 3 seconds of pure connection setup.
Connection pooling eliminates this by reusing established TCP connections across multiple requests. Once the initial connection is established, subsequent requests skip the handshake entirely and start transmitting immediately.
httpx Connection Pool Configuration
httpx is a modern async HTTP client for Python that provides fine-grained control over connection pooling.
```python
import os

import httpx

# Read the API key from the environment (adjust the variable
# name for your provider) rather than hard-coding it
API_KEY = os.environ["OPENAI_API_KEY"]

# Configure a connection pool tuned for LLM API access
limits = httpx.Limits(
    max_connections=100,           # Total connections across all hosts
    max_keepalive_connections=20,  # Idle connections to keep alive
    keepalive_expiry=30.0,         # Seconds before an idle conn is closed
)

client = httpx.AsyncClient(
    limits=limits,
    timeout=httpx.Timeout(
        connect=5.0,   # Max time to establish a connection
        read=60.0,     # Max time to read the response (LLMs are slow)
        write=10.0,    # Max time to send the request
        pool=10.0,     # Max time waiting for an available connection
    ),
    http2=True,        # HTTP/2 multiplexes requests over a single conn
    headers={"Authorization": f"Bearer {API_KEY}"},
)
```
The critical parameters:
- max_connections controls how many simultaneous TCP connections the client maintains. Set this to match your concurrency level.
- max_keepalive_connections determines how many idle connections stay alive between bursts of requests.
- keepalive_expiry balances resource usage against reconnection overhead.
- http2 enables multiplexing multiple requests over a single connection, which is particularly effective for LLM APIs.
Lifecycle Management: Application-Scoped Clients
The most common mistake is creating a new client per request. Always scope the client to your application lifetime.
```python
import os
from contextlib import asynccontextmanager

import httpx
from fastapi import FastAPI

API_KEY = os.environ["OPENAI_API_KEY"]


class LLMService:
    """LLM service with connection pool lifecycle management."""

    def __init__(self):
        self._client: httpx.AsyncClient | None = None

    async def start(self):
        self._client = httpx.AsyncClient(
            limits=httpx.Limits(
                max_connections=50,
                max_keepalive_connections=10,
            ),
            # httpx.Timeout requires a default (or all four values)
            # alongside any per-phase overrides
            timeout=httpx.Timeout(10.0, connect=5.0, read=120.0),
            http2=True,
            base_url="https://api.openai.com/v1",
            headers={"Authorization": f"Bearer {API_KEY}"},
        )

    async def stop(self):
        if self._client:
            await self._client.aclose()

    async def complete(self, messages: list[dict]) -> str:
        if self._client is None:
            raise RuntimeError("LLMService.start() has not been called")
        response = await self._client.post(
            "/chat/completions",
            json={"model": "gpt-4o", "messages": messages},
        )
        response.raise_for_status()
        return response.json()["choices"][0]["message"]["content"]


llm_service = LLMService()


@asynccontextmanager
async def lifespan(app: FastAPI):
    await llm_service.start()
    yield
    await llm_service.stop()


app = FastAPI(lifespan=lifespan)
```
aiohttp Connection Pooling
aiohttp uses TCPConnector to manage connection pools. It offers additional options like DNS caching.
```python
import os

import aiohttp

API_KEY = os.environ["OPENAI_API_KEY"]


async def create_session() -> aiohttp.ClientSession:
    # Create the connector inside a running event loop, not at import time
    connector = aiohttp.TCPConnector(
        limit=100,                   # Max total connections
        limit_per_host=30,           # Max connections per host
        ttl_dns_cache=300,           # Cache DNS lookups for 5 minutes
        use_dns_cache=True,          # Enable DNS caching
        keepalive_timeout=30,        # Keep idle connections for 30s
        enable_cleanup_closed=True,  # Clean up closed connections
    )
    return aiohttp.ClientSession(
        connector=connector,
        timeout=aiohttp.ClientTimeout(
            total=120,     # Total request timeout
            connect=5,     # Connection establishment timeout
            sock_read=60,  # Socket read timeout
        ),
        headers={"Authorization": f"Bearer {API_KEY}"},
    )
```
DNS Caching
DNS resolution adds 5-50ms per request without caching. Both httpx and aiohttp can cache DNS lookups to eliminate this.
```python
# aiohttp has built-in DNS caching via TCPConnector
connector = aiohttp.TCPConnector(
    use_dns_cache=True,
    ttl_dns_cache=300,  # 5-minute cache TTL
)

# httpx has no explicit DNS cache. Because pooled connections are
# reused, DNS is only resolved when a new connection is opened, so
# keepalive reuse makes lookups rare in practice.
```
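To make the TTL behaviour concrete, here is a minimal sketch of the mechanism behind ttl_dns_cache: a lookup result is stored with a timestamp and reused until the TTL lapses. The TTLDNSCache class and its resolver parameter are illustrative, not part of either library.

```python
import time


class TTLDNSCache:
    """Illustrative TTL cache, mimicking what aiohttp does internally."""

    def __init__(self, resolver, ttl: float = 300.0):
        self._resolver = resolver  # e.g. a wrapper around socket.getaddrinfo
        self._ttl = ttl
        self._cache: dict[str, tuple[float, list]] = {}

    def resolve(self, host: str) -> list:
        now = time.monotonic()
        hit = self._cache.get(host)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]  # Cache hit: skip the 5-50ms lookup
        addresses = self._resolver(host)  # Cache miss: do the real query
        self._cache[host] = (now, addresses)
        return addresses
```

Within the TTL window, repeated calls for the same host never touch the resolver, which is exactly why a 5-minute TTL removes almost all DNS latency from a busy service.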
Monitoring Connection Pool Health
In production, monitor your pool to detect exhaustion and connection leaks.
```python
import logging
import os

import httpx

logger = logging.getLogger("llm_pool")

API_KEY = os.environ["OPENAI_API_KEY"]


class MonitoredLLMClient:
    def __init__(self, max_connections: int = 50):
        self._max = max_connections
        self._active = 0
        self._client = httpx.AsyncClient(
            limits=httpx.Limits(max_connections=max_connections),
            # A default value is required alongside per-phase overrides
            timeout=httpx.Timeout(10.0, connect=5.0, read=120.0),
            headers={"Authorization": f"Bearer {API_KEY}"},
        )

    async def request(self, messages: list[dict]) -> str:
        self._active += 1
        utilization = self._active / self._max
        if utilization > 0.8:
            logger.warning(
                f"Pool utilization high: {self._active}/{self._max} "
                f"({utilization:.0%})"
            )
        try:
            resp = await self._client.post(
                "https://api.openai.com/v1/chat/completions",
                json={"model": "gpt-4o", "messages": messages},
            )
            resp.raise_for_status()
            return resp.json()["choices"][0]["message"]["content"]
        finally:
            self._active -= 1
```
FAQ
How many max_connections should I set for LLM API calls?
Match it to your maximum expected concurrency. If your application handles 50 concurrent user requests and each makes 1-2 LLM calls, set max_connections to 50-100. Setting it too high wastes resources; too low causes requests to queue waiting for connections. Monitor pool utilization in production and adjust.
Should I use HTTP/2 for LLM API calls?
Yes, when the API supports it. HTTP/2 multiplexes multiple requests over a single TCP connection, reducing connection overhead dramatically. OpenAI and Anthropic APIs support HTTP/2. Enable it with http2=True in httpx (requires the h2 package installed).
What happens when the connection pool is exhausted?
Requests wait in a queue until a connection becomes available, up to the pool timeout. In httpx, this wait is bounded by the pool parameter of httpx.Timeout. If the timeout expires, an httpx.PoolTimeout exception is raised. Handle this by either increasing the pool size or implementing request queuing with backpressure.
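One way to add that backpressure is to cap in-flight calls below the pool size with an asyncio.Semaphore, so callers wait before a request is ever handed to the HTTP client and the pool is never oversubscribed. A sketch (the BackpressureGate name is illustrative):

```python
import asyncio


class BackpressureGate:
    """Caps in-flight calls so the connection pool is never oversubscribed."""

    def __init__(self, max_in_flight: int):
        self._sem = asyncio.Semaphore(max_in_flight)

    async def run(self, coro_fn, *args, **kwargs):
        # Callers queue here, before a connection is ever requested
        async with self._sem:
            return await coro_fn(*args, **kwargs)
```

Wrap each LLM call as `await gate.run(client.post, url, json=payload)` with max_in_flight set slightly below max_connections, and pool waits turn into orderly queuing instead of PoolTimeout exceptions.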
#Python #ConnectionPooling #Httpx #Aiohttp #Performance #AgenticAI #LearnAI #AIEngineering
CallSphere Team
Expert insights on AI voice agents and customer communication automation.