FastAPI for AI Agents: Project Structure and Async Best Practices

Why FastAPI for AI Agent Backends

FastAPI has become a popular choice for building AI agent backends. Its native async support lets a single server process keep hundreds of LLM API calls in flight without blocking on any one of them. Its automatic OpenAPI documentation gives frontend teams an always-current API reference to integrate against. And its dependency injection system maps well onto the pattern of injecting LLM clients, database sessions, and agent configurations into your endpoints.

Unlike Django or Flask, FastAPI was designed from the ground up around Python type hints and async/await. When your agent backend needs to call an LLM, retrieve context from a vector database, and log the interaction simultaneously, async endpoints handle this naturally without thread pool hacks.

Recommended Project Structure

A well-organized project keeps agent logic, API routes, and infrastructure concerns cleanly separated:

ai_agent_backend/
  app/
    __init__.py
    main.py              # FastAPI app, lifespan, middleware
    config.py            # Settings with pydantic-settings
    routes/
      __init__.py
      agents.py          # Agent conversation endpoints
      tools.py           # Tool execution endpoints
      health.py          # Health check routes
    agents/
      __init__.py
      base.py            # Base agent class
      research_agent.py  # Specialized agents
      support_agent.py
    services/
      __init__.py
      llm_service.py     # LLM client wrapper
      vector_store.py    # Embedding search
    models/
      __init__.py
      requests.py        # Pydantic request models
      responses.py       # Pydantic response models
    dependencies.py      # Dependency injection providers
    middleware.py         # Custom middleware
  tests/
  Dockerfile
  requirements.txt

The agents/ directory contains your agent logic, completely decoupled from HTTP concerns. The services/ layer wraps external integrations like LLM APIs and vector databases. Routes stay thin, delegating all business logic to agents and services.
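As a rough illustration of that separation, the sketch below shows an agent class that imports nothing from FastAPI. The names LLMClient, SupportAgent, and FakeLLM are hypothetical and not part of any library; a real project would pass in its own wrapper from services/ instead of the fake.

```python
import asyncio
from typing import Protocol


class LLMClient(Protocol):
    """Hypothetical interface the agent expects from the services/ layer."""

    async def complete(self, prompt: str) -> str: ...


class SupportAgent:
    """Agent logic with no FastAPI or HTTP imports."""

    def __init__(self, llm: LLMClient) -> None:
        self.llm = llm

    async def reply(self, message: str) -> str:
        prompt = f"You are a support agent. User says: {message}"
        return await self.llm.complete(prompt)


class FakeLLM:
    """Stand-in for tests; echoes the prompt instead of calling an API."""

    async def complete(self, prompt: str) -> str:
        return f"echo: {prompt}"


async def main() -> str:
    agent = SupportAgent(FakeLLM())
    return await agent.reply("hello")


result = asyncio.run(main())
print(result)
```

Because the agent only depends on an object with a complete() coroutine, it can be exercised in a unit test without starting a server, and a route handler only has to construct it and await reply().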

Creating the Application with Lifespan Events

Lifespan events let you initialize expensive resources once at startup and clean them up at shutdown. This is essential for AI agents because creating LLM clients and loading embeddings should happen once, not per request:

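from contextlib import asynccontextmanager

import httpx
from fastapi import FastAPI

# Assumed module layout, following the project structure above
from app.config import get_settings
from app.services.vector_store import init_vector_store  # project helper, not shown here

settings = get_settings()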

@asynccontextmanager
async def lifespan(app: FastAPI):
    # Startup: initialize shared resources
    app.state.llm_client = httpx.AsyncClient(
        base_url="https://api.openai.com/v1",
        headers={"Authorization": f"Bearer {settings.openai_api_key}"},
        timeout=60.0,
    )
    app.state.vector_client = await init_vector_store()
    print("AI agent backend ready")

    yield  # Application runs here

    # Shutdown: clean up resources
    await app.state.llm_client.aclose()
    await app.state.vector_client.close()
    print("Cleanup complete")

app = FastAPI(
    title="AI Agent Backend",
    version="1.0.0",
    lifespan=lifespan,
)
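
One detail worth sketching: with a plain asynccontextmanager, code placed after a bare yield is skipped if the body of the async with block raises. Wrapping the yield in try/finally guards against that. The simulation below uses only the standard library; FakeClient and the SimpleNamespace state are hypothetical stand-ins for the real HTTP client and app.state, and in a real application FastAPI enters and exits the lifespan context manager itself.

```python
import asyncio
from contextlib import asynccontextmanager
from types import SimpleNamespace


class FakeClient:
    """Hypothetical stand-in for an expensive shared client."""

    def __init__(self) -> None:
        self.closed = False

    async def aclose(self) -> None:
        self.closed = True


@asynccontextmanager
async def lifespan(state: SimpleNamespace):
    # Startup: create the shared resource once.
    state.llm_client = FakeClient()
    try:
        yield
    finally:
        # Shutdown: runs even if the body of the `async with` raised.
        await state.llm_client.aclose()


async def run_app(fail: bool) -> FakeClient:
    state = SimpleNamespace()
    try:
        async with lifespan(state):
            if fail:
                raise RuntimeError("simulated crash while serving")
    except RuntimeError:
        pass
    return state.llm_client


print(asyncio.run(run_app(fail=False)).closed)
print(asyncio.run(run_app(fail=True)).closed)
```

Both runs report the client as closed, which is the property the try/finally is there to provide.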

Async Endpoint Best Practices

Every endpoint that calls an LLM or database should be async. This lets FastAPI handle many concurrent requests on a single event loop instead of consuming a thread per request:

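import asyncio

from fastapi import APIRouter, Depends
from sqlalchemy import select
from sqlalchemy.ext.asyncio import AsyncSession

# Assumed module layout, following the project structure above
from app.dependencies import get_db_session, get_llm_service
from app.models.requests import ChatRequest
from app.models.responses import ChatResponse
from app.services.llm_service import LLMService
# ChatHistory is assumed to be a SQLAlchemy ORM model defined elsewhere in the project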

router = APIRouter(prefix="/agents", tags=["agents"])

@router.post("/chat")
async def chat_with_agent(
    request: ChatRequest,
    llm_service: LLMService = Depends(get_llm_service),
    db: AsyncSession = Depends(get_db_session),
):
    # These run concurrently, not sequentially
    context, history = await asyncio.gather(
        llm_service.retrieve_context(request.message),
        db.execute(select(ChatHistory).where(
            ChatHistory.session_id == request.session_id
        )),
    )

    response = await llm_service.generate(
        message=request.message,
        context=context,
        history=history.scalars().all(),
    )

    return ChatResponse(
        message=response.content,
        session_id=request.session_id,
    )

Use asyncio.gather() to run independent async operations concurrently. If your agent needs to fetch context from a vector store and load chat history from a database, those two calls have no dependency on each other, so the total wait is roughly that of the slower call rather than the sum of both.
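
To see the effect of awaiting independent calls together, here is a small standard-library sketch. The functions fetch_context and load_history are hypothetical, and asyncio.sleep stands in for real network latency.

```python
import asyncio
import time


async def fetch_context(query: str) -> str:
    await asyncio.sleep(0.2)  # simulated vector-store latency
    return f"context for {query!r}"


async def load_history(session_id: str) -> list[str]:
    await asyncio.sleep(0.2)  # simulated database latency
    return [f"{session_id}: earlier message"]


async def handle(query: str, session_id: str) -> tuple[str, list[str], float]:
    start = time.perf_counter()
    # Results come back in the order the awaitables were passed in.
    context, history = await asyncio.gather(
        fetch_context(query),
        load_history(session_id),
    )
    return context, history, time.perf_counter() - start


context, history, elapsed = asyncio.run(handle("refund policy", "s1"))
print(context, history, f"{elapsed:.2f}s")
```

The elapsed time should land near 0.2 seconds instead of 0.4, since both simulated waits overlap on the same event loop.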

Dependency Injection for Configuration

FastAPI's Depends system is ideal for managing AI agent configuration. Define your settings with pydantic-settings and inject them wherever needed:

from functools import lru_cache

from pydantic_settings import BaseSettings, SettingsConfigDict

class Settings(BaseSettings):
    # pydantic-settings v2 style; replaces the older inner `class Config`
    model_config = SettingsConfigDict(env_file=".env")

    openai_api_key: str
    openai_model: str = "gpt-4o"
    max_tokens: int = 4096
    vector_db_url: str
    database_url: str

@lru_cache
def get_settings() -> Settings:
    return Settings()

# Use in any endpoint
@router.get("/config")
async def get_agent_config(
    settings: Settings = Depends(get_settings),
):
    return {"model": settings.openai_model}

The @lru_cache decorator ensures settings are parsed from the environment and the .env file only once per process. Every endpoint that depends on get_settings receives the same cached instance.
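
The caching behaviour can be sketched without pydantic-settings at all. In the example below, DemoSettings and get_demo_settings are hypothetical stand-ins that read one environment variable by hand; the point is only how lru_cache treats repeated calls.

```python
import os
from dataclasses import dataclass
from functools import lru_cache


@dataclass(frozen=True)
class DemoSettings:
    """Hypothetical stand-in for the pydantic-settings class."""

    openai_model: str


@lru_cache
def get_demo_settings() -> DemoSettings:
    # The environment is read on the first call only.
    return DemoSettings(openai_model=os.environ.get("OPENAI_MODEL", "gpt-4o"))


os.environ["OPENAI_MODEL"] = "model-a"
first = get_demo_settings()

os.environ["OPENAI_MODEL"] = "model-b"
second = get_demo_settings()  # still the cached instance
print(first is second, second.openai_model)  # True model-a

get_demo_settings.cache_clear()  # e.g. between tests
third = get_demo_settings()
print(third.openai_model)  # model-b
```

One consequence is that changing an environment variable after the first call has no effect until the cache is cleared, which is worth remembering when tests need different settings.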

Key Takeaways

FastAPI's async-first architecture aligns naturally with AI agent workloads. Structure your project to separate agent logic from HTTP routing, use lifespan events for resource management, leverage asyncio.gather() for parallel operations, and let dependency injection handle configuration and client management. This foundation makes your agent backend testable, scalable, and maintainable as you add more sophisticated agent capabilities.

FAQ

Why should I use async def instead of regular def for agent endpoints?

Agent endpoints almost always call external services like LLM APIs, vector databases, or traditional databases. With async def, the event loop can process other requests while waiting for these I/O operations to complete. A synchronous def endpoint in FastAPI runs in a thread pool, which limits concurrency to the number of available threads. With async, a single worker process can handle thousands of concurrent connections.
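
One caveat: async def only helps if the calls inside it actually yield to the event loop. A blocking call made from an async handler stalls every other request on that loop. The sketch below compares the two cases using the standard library; blocking_call is a hypothetical stand-in for a synchronous SDK or driver call, and asyncio.to_thread (Python 3.9+) is one way to offload it.

```python
import asyncio
import time


def blocking_call() -> str:
    time.sleep(0.2)  # stands in for a synchronous SDK or driver call
    return "done"


async def bad_handler() -> str:
    return blocking_call()  # blocks the event loop for the full 0.2 s


async def good_handler() -> str:
    return await asyncio.to_thread(blocking_call)  # runs in a worker thread


async def timed(handler, n: int = 3) -> float:
    start = time.perf_counter()
    await asyncio.gather(*(handler() for _ in range(n)))
    return time.perf_counter() - start


bad = asyncio.run(timed(bad_handler))
good = asyncio.run(timed(good_handler))
print(f"blocking: {bad:.2f}s, offloaded: {good:.2f}s")
```

Three blocking handlers run back to back and take about 0.6 seconds in total, while the offloaded versions overlap and finish in roughly 0.2 seconds.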

Should I put agent logic directly in route handlers?

No. Keep route handlers thin and delegate to service or agent classes. Routes should handle request parsing, dependency injection, and response formatting. The actual agent reasoning, tool calling, and LLM interaction belong in dedicated classes in the agents/ or services/ directories. This makes your agent logic independently testable without spinning up an HTTP server.

When should I use lifespan events versus Depends for initialization?

Use lifespan events for expensive, shared resources that should exist for the lifetime of the application, like HTTP clients, database connection pools, and loaded ML models. Use Depends for per-request resources like database sessions or request-scoped caches. If you create a new httpx.AsyncClient per request via Depends, you waste time on connection setup. Put it in lifespan instead and inject it from app.state.

