
Deploying FastAPI AI Agents: Uvicorn, Gunicorn, and Docker Configuration

Deploy FastAPI AI agent backends to production with optimal Uvicorn and Gunicorn configuration, Docker multi-stage builds, health check endpoints, and graceful shutdown handling for long-running agent requests.

Production Deployment Considerations for AI Agents

Deploying an AI agent backend to production is different from deploying a typical web API. Agent requests are long-running because LLM calls can take 5 to 30 seconds. Streaming responses keep connections open for extended periods. Memory usage can spike when processing large documents. And a cold start that takes 10 seconds to load embeddings is unacceptable if your health check does not account for it.

This guide covers the server configuration, containerization, and operational patterns that make AI agent backends reliable in production.

Uvicorn Configuration for Development vs Production

Uvicorn is the ASGI server that runs your FastAPI application. Development and production configurations differ significantly:

# Development: run directly with auto-reload
uvicorn app.main:app --reload --host 0.0.0.0 --port 8000

# Production: multiple workers, no reload, explicit log level
uvicorn app.main:app --host 0.0.0.0 --port 8000 \
    --workers 4 \
    --timeout-keep-alive 5 \
    --log-level info

Note that Uvicorn does not read a Python configuration file of its own. Settings such as worker lifecycle, request timeouts, and worker recycling belong to a process manager like Gunicorn, covered in the next section.

For AI agents, make sure every timeout in the chain is high enough to accommodate LLM response times. A 30-second timeout will kill legitimate agent requests that are still waiting on a complex LLM response.

Gunicorn with Uvicorn Workers

For production, run Gunicorn as the process manager with Uvicorn workers. Gunicorn handles process lifecycle, auto-restart of crashed workers, and graceful reloading:

# gunicorn.conf.py
import multiprocessing

workers = multiprocessing.cpu_count() + 1  # async workers: CPU + 1, not the sync 2x + 1
worker_class = "uvicorn.workers.UvicornWorker"
worker_connections = 1000
timeout = 120           # Agent requests can be slow
graceful_timeout = 30
keepalive = 5
bind = "0.0.0.0:8000"
preload_app = True      # Share loaded models across workers
max_requests = 1000     # Restart workers to prevent leaks
max_requests_jitter = 50
accesslog = "-"
errorlog = "-"
loglevel = "info"

Key settings: preload_app loads your app once and forks workers from it, sharing memory for embeddings and models. max_requests restarts workers periodically to prevent memory leaks. The jitter prevents all workers from restarting simultaneously.

Run with:

gunicorn app.main:app -c gunicorn.conf.py

Health Check Endpoints

AI agent backends need health checks that verify the full dependency chain, not just that the HTTP server is running:

from fastapi import APIRouter, Depends
from fastapi.responses import JSONResponse
from openai import AsyncOpenAI
from sqlalchemy import text
from sqlalchemy.ext.asyncio import AsyncSession

from app.dependencies import get_db, get_llm_client  # adjust to your project layout

router = APIRouter(tags=["health"])

@router.get("/health")
async def health_check():
    return {"status": "healthy"}

@router.get("/health/ready")
async def readiness_check(
    db: AsyncSession = Depends(get_db),
    llm_client: AsyncOpenAI = Depends(get_llm_client),
):
    checks = {}
    try:
        await db.execute(text("SELECT 1"))
        checks["database"] = "ok"
    except Exception as e:
        checks["database"] = f"error: {str(e)}"

    try:
        await llm_client.models.list()
        checks["llm_api"] = "ok"
    except Exception as e:
        checks["llm_api"] = f"error: {str(e)}"

    all_healthy = all(v == "ok" for v in checks.values())
    return JSONResponse(
        status_code=200 if all_healthy else 503,
        content={"status": "ready" if all_healthy else "degraded", "checks": checks},
    )

Use /health for Kubernetes liveness probes and /health/ready for readiness probes. The readiness check verifies that downstream dependencies are reachable before accepting traffic.
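The checks above run one after another, so a single slow dependency delays the whole probe. If that becomes a problem, the checks can be gathered concurrently. Here is a minimal, stdlib-only sketch of that aggregation logic; the probe names and the two stub probes are illustrative:

```python
import asyncio

async def run_check(name: str, probe):
    """Run one dependency probe, mapping any exception to an error string."""
    try:
        await probe()
        return name, "ok"
    except Exception as e:
        return name, f"error: {e}"

async def readiness(probes: dict):
    """Run all probes concurrently and fold the results into an HTTP status."""
    results = dict(
        await asyncio.gather(*(run_check(name, probe) for name, probe in probes.items()))
    )
    healthy = all(v == "ok" for v in results.values())
    return (200 if healthy else 503), results

# Illustrative stub probes standing in for the real database / LLM checks.
async def db_ok():
    pass

async def llm_down():
    raise ConnectionError("connection refused")

status, checks = asyncio.run(readiness({"database": db_ok, "llm_api": llm_down}))
```

One failing probe is enough to produce a 503, which is exactly what a Kubernetes readiness probe needs in order to pull the pod out of rotation.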

Docker Multi-Stage Build

A multi-stage Dockerfile keeps your production image small and secure:

# Stage 1: Build dependencies
FROM python:3.12-slim AS builder

WORKDIR /build
COPY requirements.txt .
RUN pip install --no-cache-dir --prefix=/install -r requirements.txt

# Stage 2: Production image
FROM python:3.12-slim

# Security: run as non-root
RUN groupadd -r agent && useradd -r -g agent agent

WORKDIR /app

# Copy installed packages from builder
COPY --from=builder /install /usr/local

# Copy application code
COPY app/ ./app/
COPY gunicorn.conf.py .

# Set environment
ENV PYTHONUNBUFFERED=1 \
    PYTHONDONTWRITEBYTECODE=1 \
    PORT=8000

EXPOSE 8000

# Health check
HEALTHCHECK --interval=30s --timeout=10s --retries=3 \
    CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8000/health')"

# Run as non-root user
USER agent

CMD ["gunicorn", "app.main:app", "-c", "gunicorn.conf.py"]

The builder stage installs dependencies into a prefix directory. The production stage copies only the installed packages and application code, leaving behind build tools, pip cache, and other unnecessary artifacts.

Graceful Shutdown for Long-Running Requests

AI agent requests can take 30 seconds or more. Configure graceful shutdown so in-flight requests complete before the server stops:

import asyncio
from contextlib import asynccontextmanager

from fastapi import FastAPI
from openai import AsyncOpenAI

shutdown_event = asyncio.Event()

@asynccontextmanager
async def lifespan(app: FastAPI):
    # Startup
    app.state.llm_client = AsyncOpenAI()
    yield
    # Shutdown: signal all active streams to stop
    shutdown_event.set()
    # Give active requests time to complete
    await asyncio.sleep(5)
    await app.state.llm_client.close()

async def agent_stream_with_shutdown(message: str):
    # llm is your streaming client wrapper; each iteration yields one token
    async for token in llm.stream_generate(message):
        if shutdown_event.is_set():
            yield {"event": "error", "data": "Server shutting down"}
            return
        yield {"event": "token", "data": token}

In your Kubernetes deployment, set terminationGracePeriodSeconds to at least 60 seconds to allow active agent requests to finish before the pod is killed.
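The shutdown pattern can be exercised without a real LLM. The sketch below uses a stubbed token stream (`fake_llm_stream` is illustrative) to show how setting the event mid-stream terminates the generator with one final error event:

```python
import asyncio

shutdown_event = asyncio.Event()

async def fake_llm_stream():
    # Stand-in for llm.stream_generate(): yields a few tokens.
    for token in ["Hello", ",", " world"]:
        await asyncio.sleep(0)
        yield token

async def agent_stream_with_shutdown(stream):
    async for token in stream:
        if shutdown_event.is_set():
            yield {"event": "error", "data": "Server shutting down"}
            return
        yield {"event": "token", "data": token}

async def main():
    events = []
    async for event in agent_stream_with_shutdown(fake_llm_stream()):
        events.append(event)
        shutdown_event.set()  # simulate SIGTERM arriving after the first token
    return events

events = asyncio.run(main())
# The stream emits one token, then a terminal error event, then stops.
```

Clients consuming the stream see an explicit error event rather than an abruptly closed connection, so they can retry cleanly against a new pod.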

FAQ

How many Gunicorn workers should I run for an AI agent API?

For async FastAPI with AI agent workloads, start with CPU count plus 1, not the typical 2x CPU plus 1 formula. Each async worker handles many concurrent connections through the event loop, so you need fewer workers than a synchronous framework. The bottleneck is usually the LLM API, not CPU. Monitor memory usage per worker since each worker loads shared resources. If each worker uses 500MB and you have 4GB of RAM, 4 workers with overhead is your practical limit.
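The sizing rule above can be written down as a small helper. The 500 MB per worker and the 2 GB reserved for the OS and shared resources are assumptions to replace with numbers from your own memory profile:

```python
import multiprocessing

def practical_workers(cpu_count: int, ram_gb: float,
                      per_worker_mb: int = 500,
                      reserved_mb: int = 2048) -> int:
    """Worker count for an async FastAPI agent API: CPU + 1,
    capped by how many workers fit in the remaining RAM."""
    async_baseline = cpu_count + 1
    memory_cap = int((ram_gb * 1024 - reserved_mb) // per_worker_mb)
    return max(1, min(async_baseline, memory_cap))

# Example from the answer above: 500 MB workers on a 4 GB box cap out at 4.
workers = practical_workers(multiprocessing.cpu_count(), ram_gb=4)
```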

Should I use preload_app with Gunicorn?

Yes, for AI agent backends. With preload_app = True, Gunicorn loads your FastAPI application once and forks workers from it. This means loaded embeddings, model configurations, and shared data are in memory only once through copy-on-write. Without preload, each worker independently loads everything, multiplying memory usage. The trade-off is that code changes require a full Gunicorn restart rather than a graceful worker reload, but in production you are deploying new containers anyway.

How do I handle the 30-second default timeout for AI agent requests behind a reverse proxy?

Increase timeout values at every layer. Set Gunicorn timeout to 120 seconds. Configure your Nginx proxy_read_timeout to 120 seconds. Set your load balancer idle timeout to 120 seconds. For Kubernetes, set nginx.ingress.kubernetes.io/proxy-read-timeout: "120" on your Ingress. If you use streaming, many proxies reset their timeout on each chunk received, so streaming naturally avoids timeout issues as long as tokens arrive regularly.
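"Tokens arrive regularly" can be enforced rather than hoped for. Below is a sketch of a heartbeat wrapper for an SSE stream, assuming the upstream is any async iterator of tokens; the 15-second default is an arbitrary choice comfortably below typical proxy timeouts:

```python
import asyncio

async def sse_with_heartbeat(tokens, heartbeat: float = 15.0):
    """Wrap an async token iterator in SSE frames, emitting an SSE
    comment line whenever no token arrives within `heartbeat` seconds,
    so proxies that reset their read timeout per chunk see steady traffic."""
    it = tokens.__aiter__()
    pending = asyncio.ensure_future(it.__anext__())
    while True:
        done, _ = await asyncio.wait({pending}, timeout=heartbeat)
        if not done:
            yield ": keep-alive\n\n"  # SSE comment; clients ignore it
            continue
        try:
            token = pending.result()
        except StopAsyncIteration:
            return
        yield f"data: {token}\n\n"
        pending = asyncio.ensure_future(it.__anext__())

# Illustrative: a stub stream slower than the (shortened) heartbeat interval.
async def slow_tokens():
    for t in ["a", "b"]:
        await asyncio.sleep(0.05)
        yield t

async def collect():
    return [frame async for frame in sse_with_heartbeat(slow_tokens(), heartbeat=0.02)]

frames = asyncio.run(collect())
```

Because the pending read is kept as a task across heartbeats, no token is lost when a timeout fires; the wrapper simply interleaves comment frames until the next real token arrives.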


#FastAPI #Docker #Deployment #Uvicorn #AIAgents #AgenticAI #LearnAI #AIEngineering

CallSphere Team

Expert insights on AI voice agents and customer communication automation.
