Database Scaling for AI Agents: Connection Pooling, Read Replicas, and Sharding
Master database scaling techniques for AI agent platforms including PgBouncer connection pooling, read/write splitting with replicas, horizontal sharding strategies, and migration patterns that keep your agents responsive under heavy load.
The Database Bottleneck in AI Agent Systems
AI agent platforms hit database bottlenecks faster than typical web applications. Each agent conversation generates a high volume of reads and writes — fetching conversation history, persisting each turn, logging tool calls, storing retrieved documents, and updating session metadata. A platform with 1,000 concurrent agent sessions might produce 10,000 to 50,000 database operations per minute.
PostgreSQL handles this well up to a point, but three specific problems emerge at scale: connection exhaustion, read contention, and table size.
Connection Pooling with PgBouncer
PostgreSQL creates a new process for each client connection. With 50 agent worker pods each maintaining a pool of 20 connections, you need 1,000 PostgreSQL backend processes — each consuming 5 to 10 MB of RAM. At this scale, the database server spends more resources managing connections than executing queries.
PgBouncer sits between your application and PostgreSQL, multiplexing hundreds of application connections over a smaller number of actual database connections:
# pgbouncer.ini mounted as ConfigMap
[databases]
agents = host=postgres-primary port=5432 dbname=agents
[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
# Transaction pooling: connection returned after each transaction
pool_mode = transaction
default_pool_size = 50
max_client_conn = 2000
max_db_connections = 100
reserve_pool_size = 10
reserve_pool_timeout = 3
# Timeout settings for long-running agent queries
query_timeout = 30
client_idle_timeout = 300
server_idle_timeout = 60
Transaction pooling mode is the right choice for AI agent workloads. Each agent request grabs a connection, runs its transaction (read history, write new turn), and immediately returns the connection. This lets 2,000 application connections share 100 actual PostgreSQL connections.
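The memory savings follow directly from those numbers. A back-of-envelope sketch, assuming roughly 8 MB per backend (the middle of the 5 to 10 MB range quoted earlier):

```python
# Rough capacity math: RAM held by PostgreSQL backend processes
# with and without PgBouncer, using the figures from the text.
BACKEND_MB = 8        # assumed ~8 MB per PostgreSQL backend process
client_conns = 2000   # max_client_conn in pgbouncer.ini
server_conns = 100    # max_db_connections in pgbouncer.ini

direct_mb = client_conns * BACKEND_MB  # every client gets its own backend
pooled_mb = server_conns * BACKEND_MB  # PgBouncer multiplexes onto 100

print(direct_mb, pooled_mb)  # 16000 vs 800: ~16 GB down to ~0.8 GB
```

The same 2,000 clients that would pin roughly 16 GB of backend memory in a direct-connection setup hold under 1 GB behind the pooler.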
Deploy PgBouncer as a sidecar or separate pod:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: pgbouncer
spec:
  replicas: 2
  selector:
    matchLabels:
      app: pgbouncer
  template:
    metadata:
      labels:
        app: pgbouncer
    spec:
      containers:
        - name: pgbouncer
          image: edoburu/pgbouncer:1.22.0
          ports:
            - containerPort: 6432
          volumeMounts:
            - name: config
              mountPath: /etc/pgbouncer
          resources:
            requests:
              cpu: "200m"
              memory: "128Mi"
            limits:
              cpu: "500m"
              memory: "256Mi"
      volumes:
        - name: config
          configMap:
            name: pgbouncer-config
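Application pods then connect to a stable DNS name rather than to individual pooler pods. A minimal Service in front of the Deployment might look like this (the `app: pgbouncer` selector is an assumption and must match whatever labels the pod template carries):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: pgbouncer
spec:
  selector:
    app: pgbouncer
  ports:
    - port: 6432
      targetPort: 6432
```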
Read/Write Splitting
Most AI agent database operations are reads — fetching conversation history, looking up tool configurations, loading prompt templates. Sending reads to replicas frees the primary for writes:
from sqlalchemy import create_engine, text
from sqlalchemy.orm import Session
import random

write_engine = create_engine(
    "postgresql://user:pass@pgbouncer-primary:6432/agents",
    pool_size=20,
    pool_pre_ping=True,
)

read_engines = [
    create_engine(
        f"postgresql://user:pass@pgbouncer-replica-{i}:6432/agents",
        pool_size=20,
        pool_pre_ping=True,
    )
    for i in range(3)
]

def get_write_session() -> Session:
    return Session(bind=write_engine)

def get_read_session() -> Session:
    # Simple random load balancing across the replicas
    engine = random.choice(read_engines)
    return Session(bind=engine)

# Usage in agent code
def get_conversation_history(session_id: str) -> list:
    with get_read_session() as session:
        return session.execute(
            text(
                "SELECT role, content, created_at "
                "FROM messages WHERE session_id = :sid "
                "ORDER BY created_at"
            ),
            {"sid": session_id},
        ).fetchall()

def save_message(session_id: str, role: str, content: str):
    with get_write_session() as session:
        session.execute(
            text(
                "INSERT INTO messages (session_id, role, content) "
                "VALUES (:sid, :role, :content)"
            ),
            {"sid": session_id, "role": role, "content": content},
        )
        session.commit()
One caveat: replication lag means a message you just wrote may not appear on a replica for 10 to 100 milliseconds. For the critical path where you write a user message and immediately need to read the full history to send to the LLM, read from the primary. Use replicas for dashboard queries, analytics, and search operations where slight staleness is acceptable.
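One lightweight way to encode that routing rule is a per-session "stickiness window": after any write, route that session's reads to the primary for a short period that comfortably exceeds worst-case replication lag. A sketch, with the window value and helper names being illustrative:

```python
import time

# Reads within this window after a write go to the primary, so
# replication lag cannot hide the row we just wrote. 500 ms
# comfortably covers the 10-100 ms lag range cited above.
PRIMARY_STICKINESS_SECONDS = 0.5

_last_write_at: dict[str, float] = {}

def record_write(session_id: str) -> None:
    # Call this from the write path, e.g. after saving a message.
    _last_write_at[session_id] = time.monotonic()

def read_target(session_id: str) -> str:
    # Decide which engine group should serve this read.
    last = _last_write_at.get(session_id)
    if last is not None and time.monotonic() - last < PRIMARY_STICKINESS_SECONDS:
        return "primary"
    return "replica"
```

In practice `record_write` would be invoked inside the write helper, and `read_target` consulted before picking a read engine; sessions that have been idle fall back to replicas automatically.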
Sharding by Tenant or Session
When a single PostgreSQL instance cannot hold all conversation data even with replicas, shard horizontally. The most natural shard key for AI agent platforms is tenant ID (for multi-tenant platforms) or a hash of the session ID:
import hashlib

from sqlalchemy import create_engine

SHARD_COUNT = 4

shard_engines = {
    i: create_engine(f"postgresql://user:pass@shard-{i}:6432/agents")
    for i in range(SHARD_COUNT)
}

def get_shard(session_id: str) -> int:
    # SHA-256 is stable across processes, unlike Python's built-in
    # hash(), which is randomized per interpreter run.
    hash_val = int(hashlib.sha256(session_id.encode()).hexdigest(), 16)
    return hash_val % SHARD_COUNT

def get_session_engine(session_id: str):
    shard = get_shard(session_id)
    return shard_engines[shard]
This ensures all messages for a given session live on the same shard, so fetching conversation history never requires cross-shard queries.
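Two properties of this scheme are worth pinning down in a unit test: the mapping is deterministic across processes, and sessions spread across every shard. A self-contained check, restating `get_shard` from above so it runs without a database:

```python
import hashlib

SHARD_COUNT = 4

def get_shard(session_id: str) -> int:
    hash_val = int(hashlib.sha256(session_id.encode()).hexdigest(), 16)
    return hash_val % SHARD_COUNT

# Stable: the same session always lands on the same shard.
assert get_shard("session-abc") == get_shard("session-abc")

# Spread: a modest number of sessions covers all four shards.
assert {get_shard(f"session-{i}") for i in range(1000)} == set(range(SHARD_COUNT))
```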
Migrating to a Sharded Architecture
The migration from a single database to shards follows this sequence: add the shard column to your schema, deploy the routing layer to write to the correct shard while still reading from the original database, backfill historical data to shards, then switch reads to the sharded path. Run both paths in parallel with comparison logging before cutting over fully.
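The "run both paths in parallel with comparison logging" step can be as simple as a shadow-read wrapper around the two stores. A sketch — the class and method names are illustrative, not from any library:

```python
import logging

logger = logging.getLogger("shard-migration")

class DualReadRouter:
    """Serve reads from the legacy database while shadow-reading the
    sharded path and logging mismatches, so the cutover is data-driven."""

    def __init__(self, legacy, sharded):
        # Both stores are assumed to expose fetch(session_id) -> list.
        self.legacy = legacy
        self.sharded = sharded

    def fetch_history(self, session_id: str) -> list:
        primary_result = self.legacy.fetch(session_id)
        try:
            shadow_result = self.sharded.fetch(session_id)
            if shadow_result != primary_result:
                logger.warning(
                    "shard mismatch for %s: legacy=%d rows, shard=%d rows",
                    session_id, len(primary_result), len(shadow_result),
                )
        except Exception:
            # The shadow path must never break production reads.
            logger.exception("shadow read failed for %s", session_id)
        return primary_result  # the legacy store stays authoritative
```

Once mismatch logs stay quiet over a representative traffic window, the return value flips to the sharded result and the legacy path becomes the shadow.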
FAQ
When should I introduce PgBouncer versus just increasing PostgreSQL max_connections?
Introduce PgBouncer when you have more than 200 concurrent connections or more than 10 application pods. Increasing max_connections beyond 200 to 300 degrades PostgreSQL performance because each connection is a separate process with its own memory allocation. PgBouncer multiplexes connections efficiently with minimal resource overhead.
How do I handle replication lag when an agent writes a message and immediately reads it back?
For the write-then-read-immediately pattern, always read from the primary. Route only non-critical reads — dashboards, analytics, search — to replicas. Alternatively, some connection poolers support "primary stickiness" where reads after a write in the same transaction or within a short time window automatically route to the primary.
Is sharding worth the complexity for AI agent platforms?
Only if a single PostgreSQL instance (with replicas) cannot handle your data volume. Most platforms can handle millions of conversations on a single well-tuned PostgreSQL instance with proper indexes and partitioning. Shard when you exceed 1 to 2 TB of conversation data or need to isolate tenants for compliance reasons.
#DatabaseScaling #PostgreSQL #ConnectionPooling #PgBouncer #Sharding #AIAgents #AgenticAI #LearnAI #AIEngineering
CallSphere Team
Expert insights on AI voice agents and customer communication automation.