FastAPI for AI Agents: Project Structure and Async Best Practices
Learn how to structure a FastAPI project for AI agent backends, leverage async endpoints for concurrent LLM calls, use dependency injection effectively, and manage application lifecycle with lifespan events.
Why FastAPI for AI Agent Backends
FastAPI is a popular choice for building AI agent backends. Its native async support lets a server keep hundreds of LLM API calls in flight at once, provided the calls go through async clients. Its automatic OpenAPI documentation makes integration easier for frontend teams. And its dependency injection system fits the pattern of injecting LLM clients, database sessions, and agent configurations into your endpoints.
Django and Flask gained async support later in their lives, whereas FastAPI was designed from the start around Python type hints and async/await. When your agent backend needs to call an LLM, retrieve context from a vector database, and log the interaction at the same time, async endpoints handle this without thread pool workarounds.
Recommended Project Structure
A well-organized project keeps agent logic, API routes, and infrastructure concerns cleanly separated:
ai_agent_backend/
    app/
        __init__.py
        main.py                  # FastAPI app, lifespan, middleware
        config.py                # Settings with pydantic-settings
        routes/
            __init__.py
            agents.py            # Agent conversation endpoints
            tools.py             # Tool execution endpoints
            health.py            # Health check routes
        agents/
            __init__.py
            base.py              # Base agent class
            research_agent.py    # Specialized agents
            support_agent.py
        services/
            __init__.py
            llm_service.py       # LLM client wrapper
            vector_store.py      # Embedding search
        models/
            __init__.py
            requests.py          # Pydantic request models
            responses.py         # Pydantic response models
        dependencies.py          # Dependency injection providers
        middleware.py            # Custom middleware
    tests/
    Dockerfile
    requirements.txt
The agents/ directory contains your agent logic, completely decoupled from HTTP concerns. The services/ layer wraps external integrations like LLM APIs and vector databases. Routes stay thin, delegating all business logic to agents and services.
Creating the Application with Lifespan Events
Lifespan events let you initialize expensive resources once at startup and clean them up at shutdown. This is essential for AI agents because creating LLM clients and loading embeddings should happen once, not per request:
from contextlib import asynccontextmanager

import httpx
from fastapi import FastAPI

# Project-local imports; the module paths are assumptions based on the layout above
from app.config import get_settings
from app.services.vector_store import init_vector_store

settings = get_settings()


@asynccontextmanager
async def lifespan(app: FastAPI):
    # Startup: initialize shared resources once
    app.state.llm_client = httpx.AsyncClient(
        base_url="https://api.openai.com/v1",
        headers={"Authorization": f"Bearer {settings.openai_api_key}"},
        timeout=60.0,
    )
    app.state.vector_client = await init_vector_store()
    print("AI agent backend ready")

    yield  # Application runs here

    # Shutdown: clean up resources
    await app.state.llm_client.aclose()
    await app.state.vector_client.close()
    print("Cleanup complete")


app = FastAPI(
    title="AI Agent Backend",
    version="1.0.0",
    lifespan=lifespan,
)
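To see the startup and shutdown ordering in isolation, here is a minimal sketch that drives a lifespan-style context manager by hand, without FastAPI. FakeClient and the SimpleNamespace app stand-in are hypothetical, and the try/finally is an addition to the snippet above, included on the assumption that you want cleanup to run even if the serving phase raises.

```python
import asyncio
from contextlib import asynccontextmanager
from types import SimpleNamespace


class FakeClient:
    """Hypothetical stand-in for an HTTP or vector-store client."""

    def __init__(self) -> None:
        self.closed = False

    async def aclose(self) -> None:
        self.closed = True


@asynccontextmanager
async def lifespan(app):
    # Startup: runs once, before any request is served.
    app.state.llm_client = FakeClient()
    try:
        yield  # The application serves requests while suspended here.
    finally:
        # Shutdown: runs once, even if the body raised.
        await app.state.llm_client.aclose()


async def simulate_server() -> list[str]:
    """Enter and exit the context manager the way an ASGI server would."""
    events = []
    app = SimpleNamespace(state=SimpleNamespace())
    async with lifespan(app):
        events.append(f"serving, closed={app.state.llm_client.closed}")
    events.append(f"stopped, closed={app.state.llm_client.closed}")
    return events


if __name__ == "__main__":
    print(asyncio.run(simulate_server()))
```

The client is open for the whole serving phase and closed exactly once afterwards, which is the property the lifespan function relies on.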
Async Endpoint Best Practices
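Declare any endpoint that awaits an LLM or database call with async def, and use async client libraries inside it (such as httpx.AsyncClient or SQLAlchemy's AsyncSession), because a blocking call in an async def endpoint stalls the event loop for every other request. With non-blocking calls, FastAPI can serve many concurrent requests on a single event loop instead of consuming a thread per request: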
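import asyncio

from fastapi import APIRouter, Depends
from sqlalchemy import select
from sqlalchemy.ext.asyncio import AsyncSession

# Project-local names; the module paths are assumptions based on the layout above
from app.dependencies import get_db_session, get_llm_service
from app.models.chat import ChatHistory  # hypothetical location of the ORM model
from app.models.requests import ChatRequest
from app.models.responses import ChatResponse
from app.services.llm_service import LLMService

router = APIRouter(prefix="/agents", tags=["agents"])


@router.post("/chat")
async def chat_with_agent(
    request: ChatRequest,
    llm_service: LLMService = Depends(get_llm_service),
    db: AsyncSession = Depends(get_db_session),
):
    # These run concurrently, not sequentially
    context, history = await asyncio.gather(
        llm_service.retrieve_context(request.message),
        db.execute(select(ChatHistory).where(
            ChatHistory.session_id == request.session_id
        )),
    )

    response = await llm_service.generate(
        message=request.message,
        context=context,
        history=history.scalars().all(),
    )

    return ChatResponse(
        message=response.content,
        session_id=request.session_id,
    )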
Use asyncio.gather() to run independent async operations concurrently. If your agent needs to fetch context from a vector store and load chat history from a database, those two calls have no dependency on each other and can be awaited together, so the total wait is roughly that of the slower call rather than the sum of both.
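As a rough timing sketch of this point, the following compares awaiting two calls one after the other with awaiting them through asyncio.gather(). The functions fetch_context and load_history are hypothetical stand-ins that simulate I/O with asyncio.sleep; they are not part of the service code above.

```python
import asyncio
import time


async def fetch_context(query: str) -> str:
    # Hypothetical vector-store lookup, simulated with a 0.2 s wait.
    await asyncio.sleep(0.2)
    return f"context for {query!r}"


async def load_history(session_id: str) -> list[str]:
    # Hypothetical database read, simulated with a 0.2 s wait.
    await asyncio.sleep(0.2)
    return [f"earlier message in {session_id}"]


async def sequential() -> tuple[float, str, list[str]]:
    start = time.perf_counter()
    context = await fetch_context("pricing")
    history = await load_history("s1")
    return time.perf_counter() - start, context, history


async def concurrent() -> tuple[float, str, list[str]]:
    start = time.perf_counter()
    # Results come back in argument order, whichever call finishes first.
    context, history = await asyncio.gather(
        fetch_context("pricing"),
        load_history("s1"),
    )
    return time.perf_counter() - start, context, history


if __name__ == "__main__":
    print("sequential:", asyncio.run(sequential())[0])
    print("concurrent:", asyncio.run(concurrent())[0])
```

Both versions return the same data; only the elapsed time differs, with the concurrent version taking about as long as one call instead of two.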
Dependency Injection for Configuration
FastAPI's Depends system is ideal for managing AI agent configuration. Define your settings with pydantic-settings and inject them wherever needed:
from functools import lru_cache

from fastapi import APIRouter, Depends
from pydantic_settings import BaseSettings, SettingsConfigDict

router = APIRouter()


class Settings(BaseSettings):
    # pydantic-settings v2 style; replaces the deprecated inner `class Config`
    model_config = SettingsConfigDict(env_file=".env")

    openai_api_key: str
    openai_model: str = "gpt-4o"
    max_tokens: int = 4096
    vector_db_url: str
    database_url: str


@lru_cache
def get_settings() -> Settings:
    # Parsed once; later calls return the cached instance
    return Settings()


# Use in any endpoint
@router.get("/config")
async def get_agent_config(
    settings: Settings = Depends(get_settings),
):
    return {"model": settings.openai_model}
The @lru_cache decorator ensures settings are parsed from environment variables only once. Every endpoint that depends on get_settings receives the same cached instance.
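The caching behavior can be checked without pydantic-settings installed. The sketch below uses a plain dataclass as a hypothetical stand-in for Settings and reads a single environment variable, OPENAI_MODEL, chosen here for illustration. It also shows cache_clear(), which is one way to make a changed environment visible, for example between tests.

```python
import os
from dataclasses import dataclass
from functools import lru_cache


@dataclass(frozen=True)
class Settings:
    """Stand-in for the pydantic-settings class; holds one value."""

    openai_model: str


@lru_cache
def get_settings() -> Settings:
    # Reads the environment on the first call only; later calls hit the cache.
    return Settings(openai_model=os.environ.get("OPENAI_MODEL", "gpt-4o"))


def demo() -> tuple[bool, str, str]:
    os.environ["OPENAI_MODEL"] = "model-a"
    get_settings.cache_clear()
    first = get_settings()

    os.environ["OPENAI_MODEL"] = "model-b"
    same_object = get_settings() is first   # cache hit, env change not seen
    stale = get_settings().openai_model

    get_settings.cache_clear()              # force a re-read
    fresh = get_settings().openai_model
    return same_object, stale, fresh


if __name__ == "__main__":
    print(demo())
```

The practical consequence is that environment changes made after the first call are ignored until the cache is cleared. In tests, FastAPI's app.dependency_overrides is another way to swap in different settings without touching the cache.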
Key Takeaways
FastAPI's async-first architecture aligns naturally with AI agent workloads. Structure your project to separate agent logic from HTTP routing, use lifespan events for resource management, leverage asyncio.gather() for parallel operations, and let dependency injection handle configuration and client management. This foundation makes your agent backend testable, scalable, and maintainable as you add more sophisticated agent capabilities.
FAQ
Why should I use async def instead of regular def for agent endpoints?
Agent endpoints almost always call external services like LLM APIs, vector databases, or traditional databases. With async def, the event loop can process other requests while waiting for these I/O operations to complete. A synchronous def endpoint in FastAPI runs in a thread pool, which caps concurrency at the size of that pool. With async def and non-blocking clients, a single worker process can keep thousands of connections in flight. The benefit disappears if the handler calls a blocking library, since that call holds the event loop until it returns.
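The sketch below illustrates the case where an async def handler has to call a blocking library. Here blocking_llm_call is a hypothetical synchronous SDK call simulated with time.sleep. The first handler runs it directly on the event loop, and the second offloads it with asyncio.to_thread (Python 3.9+), which is one option among several.

```python
import asyncio
import time


def blocking_llm_call(prompt: str) -> str:
    # Hypothetical synchronous SDK call; time.sleep blocks the calling thread.
    time.sleep(0.2)
    return prompt.upper()


async def handler_blocking(prompt: str) -> str:
    # Anti-pattern: nothing else can run on the event loop during the call.
    return blocking_llm_call(prompt)


async def handler_offloaded(prompt: str) -> str:
    # The blocking call runs in a worker thread; the event loop stays free.
    return await asyncio.to_thread(blocking_llm_call, prompt)


async def time_two_requests(handler) -> float:
    """Simulate two simultaneous requests and return the total elapsed time."""
    start = time.perf_counter()
    await asyncio.gather(handler("a"), handler("b"))
    return time.perf_counter() - start


if __name__ == "__main__":
    print("blocking: ", asyncio.run(time_two_requests(handler_blocking)))
    print("offloaded:", asyncio.run(time_two_requests(handler_offloaded)))
```

With the blocking handler the two requests are served one after the other even though both are wrapped in gather(); with the offloaded handler they overlap.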
Should I put agent logic directly in route handlers?
No. Keep route handlers thin and delegate to service or agent classes. Routes should handle request parsing, dependency injection, and response formatting. The actual agent reasoning, tool calling, and LLM interaction belong in dedicated classes in the agents/ or services/ directories. This makes your agent logic independently testable without spinning up an HTTP server.
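A minimal sketch of that separation, using only the standard library: SupportAgent, fake_llm, and chat_handler are hypothetical names, and the LLM is modeled as any async callable from prompt text to completion text. Nothing here imports FastAPI, so the agent can be exercised with a test double.

```python
import asyncio
from typing import Awaitable, Callable

# An LLM is modeled as any async callable from prompt to completion text.
LLMCall = Callable[[str], Awaitable[str]]


class SupportAgent:
    """Hypothetical agent with no HTTP dependencies."""

    def __init__(self, llm: LLMCall, system_prompt: str = "You are a support agent.") -> None:
        self._llm = llm
        self._system_prompt = system_prompt

    async def reply(self, message: str, history: list[str]) -> str:
        # Build the prompt from the system prompt, prior turns, and the new message.
        prompt = "\n".join([self._system_prompt, *history, f"User: {message}"])
        return await self._llm(prompt)


async def fake_llm(prompt: str) -> str:
    # Test double: echoes the last prompt line so results are deterministic.
    return "echo: " + prompt.splitlines()[-1]


async def chat_handler(message: str, agent: SupportAgent) -> dict:
    # What a thin route body reduces to: call the agent, shape the response.
    return {"message": await agent.reply(message, history=[])}


if __name__ == "__main__":
    print(asyncio.run(chat_handler("hello", SupportAgent(fake_llm))))
```

In a real route the agent would arrive through Depends, and the same class could be given a production LLM client instead of fake_llm.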
When should I use lifespan events versus Depends for initialization?
Use lifespan events for expensive, shared resources that should exist for the lifetime of the application, like HTTP clients, database connection pools, and loaded ML models. Use Depends for per-request resources like database sessions or request-scoped caches. If you create a new httpx.AsyncClient per request via Depends, every request pays for connection setup and gets no benefit from connection pooling. Create the client in lifespan instead and inject it from app.state.
CallSphere Team
Expert insights on AI voice agents and customer communication automation.