Learn Agentic AI

Stateful vs Stateless AI Agents: Architecture Trade-Offs for Production Systems

When to use stateful agents with session history versus stateless agents with external state. Covers hybrid approaches and state externalization patterns.

The State Problem in Agent Systems

Every AI agent has state. At minimum, it maintains a conversation history that grows with each turn. More complex agents accumulate tool results, user preferences, multi-step plan progress, and intermediate reasoning artifacts. The question is not whether your agent has state — it is where that state lives and how it is managed.

This decision has profound consequences for scalability, reliability, cost, and user experience. A stateful agent that keeps everything in memory is simple to build but impossible to scale horizontally. A stateless agent that reconstructs context from scratch on every request is scalable but expensive and slow. Most production systems need a hybrid approach.

Stateful Agent Architecture

In a stateful design, the agent process maintains the full conversation context in memory. Each request from a user is routed to the same agent instance, which can immediately access prior context.

# stateful/agent_server.py
from agents import Agent, Runner
import asyncio

class StatefulAgentServer:
    """Stateful agent that maintains conversation history in memory."""

    def __init__(self):
        self.sessions: dict[str, list[dict]] = {}
        self.agent = Agent(
            name="Stateful Assistant",
            instructions="You are a helpful assistant with full conversation memory.",
            model="gpt-4o",
        )

    async def process(self, session_id: str, user_message: str) -> str:
        # Retrieve or create session
        if session_id not in self.sessions:
            self.sessions[session_id] = []

        history = self.sessions[session_id]
        history.append({"role": "user", "content": user_message})

        # Run with full history — agent has complete context
        result = await Runner.run(self.agent, history)

        history.append({"role": "assistant", "content": result.final_output})
        self.sessions[session_id] = history

        return result.final_output

    def get_session_size(self, session_id: str) -> int:
        """Returns the number of messages in a session."""
        return len(self.sessions.get(session_id, []))

Advantages of Stateful Agents

  • Low latency — No need to fetch context from external storage on each request
  • Simple implementation — The agent has all context immediately available
  • Rich interactions — Can build complex multi-turn workflows without state management overhead
  • Lower token overhead in some designs — Background context already in the live conversation does not need to be re-retrieved and re-injected on each request

Disadvantages of Stateful Agents

  • No horizontal scaling — Sessions are pinned to specific instances via sticky sessions
  • Memory pressure — Long conversations consume increasingly more memory
  • Single point of failure — If the instance crashes, all active sessions are lost
  • Uneven load distribution — Some instances may be overloaded while others are idle

Stateless Agent Architecture

In a stateless design, the agent process keeps no local state. All context is externalized to a database or cache, loaded at the start of each request, and discarded when the request completes.

# stateless/agent_server.py
from agents import Agent, Runner
import redis.asyncio as redis
import json

class StatelessAgentServer:
    """Stateless agent that loads context from Redis on each request."""

    def __init__(self, redis_url: str = "redis://localhost:6379/0"):
        self.redis = redis.from_url(redis_url)
        self.agent = Agent(
            name="Stateless Assistant",
            instructions="You are a helpful assistant.",
            model="gpt-4o",
        )

    async def process(self, session_id: str, user_message: str) -> str:
        # Load history from Redis
        raw = await self.redis.get(f"session:{session_id}")
        history = json.loads(raw) if raw else []

        history.append({"role": "user", "content": user_message})

        # Trim history if too long (sliding window)
        if len(history) > 40:
            # Keep the first 2 messages (e.g. system context) + the most recent 38
            history = history[:2] + history[-38:]

        result = await Runner.run(self.agent, history)
        history.append({"role": "assistant", "content": result.final_output})

        # Save back to Redis with TTL
        await self.redis.setex(
            f"session:{session_id}",
            3600,  # 1 hour TTL
            json.dumps(history),
        )

        return result.final_output

Advantages of Stateless Agents

  • Horizontal scaling — Any instance can handle any request, add instances freely
  • Fault tolerance — Instance crashes do not lose session state
  • Even load distribution — Load balancers can use round-robin without sticky sessions
  • Simpler deployment — No need to drain sessions during rolling updates

Disadvantages of Stateless Agents

  • Added latency — Every request starts with a Redis/DB fetch
  • Higher token cost — Must include full context in every LLM call
  • Complexity — Need to manage serialization, TTLs, and storage limits
  • Storage costs — Session data must be stored externally

Hybrid Architecture: State Externalization with Local Caching

The best production systems combine both approaches. State lives in an external store for durability, but a local cache reduces the latency penalty:

# hybrid/agent_server.py
from agents import Agent, Runner
import redis.asyncio as redis
import json
from cachetools import TTLCache

class HybridAgentServer:
    """Hybrid agent with external state and local caching."""

    def __init__(self, redis_url: str = "redis://localhost:6379/0"):
        self.redis = redis.from_url(redis_url)
        self.local_cache = TTLCache(maxsize=1000, ttl=300)  # 5 min local cache
        self.agent = Agent(
            name="Hybrid Assistant",
            instructions="You are a helpful assistant.",
            model="gpt-4o",
        )

    async def _load_session(self, session_id: str) -> list[dict]:
        # Try local cache first
        if session_id in self.local_cache:
            return self.local_cache[session_id]

        # Fall back to Redis
        raw = await self.redis.get(f"session:{session_id}")
        history = json.loads(raw) if raw else []

        # Populate local cache
        self.local_cache[session_id] = history
        return history

    async def _save_session(self, session_id: str, history: list[dict]):
        # Update both stores
        self.local_cache[session_id] = history
        await self.redis.setex(
            f"session:{session_id}",
            3600,
            json.dumps(history),
        )

    async def process(self, session_id: str, user_message: str) -> str:
        history = await self._load_session(session_id)
        history.append({"role": "user", "content": user_message})

        # Context windowing: summarize old messages to save tokens
        if len(history) > 30:
            history = await self._compress_history(history)

        result = await Runner.run(self.agent, history)
        history.append({"role": "assistant", "content": result.final_output})

        await self._save_session(session_id, history)
        return result.final_output

    async def _compress_history(self, history: list[dict]) -> list[dict]:
        """Summarize older messages to reduce token usage."""
        old_messages = history[:-10]
        recent_messages = history[-10:]

        # Naive extractive summary shown here; a production system would
        # call a lightweight model to summarize old_messages instead
        summary_text = f"Summary of {len(old_messages)} prior messages: "
        summary_text += " | ".join(
            m["content"][:100] for m in old_messages if m["role"] == "user"
        )

        compressed = [
            {"role": "system", "content": f"Previous conversation summary: {summary_text[:500]}"}
        ] + recent_messages

        return compressed

Context Window Management Strategies

As conversations grow, you must decide what to keep, what to summarize, and what to discard. Here are four strategies:

1. Sliding Window

Keep only the most recent N messages. Simple but loses long-term context.


def sliding_window(history: list[dict], max_messages: int = 20) -> list[dict]:
    if len(history) <= max_messages:
        return history
    return history[-max_messages:]

2. Summarization

Periodically compress older messages into a summary. Preserves key information but adds latency.
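A minimal sketch of this strategy, assuming a caller-supplied summarizer (in production that callable would be a cheap LLM call; the function name and defaults here are illustrative):

```python
def summarize_history(history: list[dict], keep_recent: int = 10,
                      summarize=None) -> list[dict]:
    """Compress all but the most recent messages into one summary message."""
    if len(history) <= keep_recent:
        return history
    old, recent = history[:-keep_recent], history[-keep_recent:]
    # In production, `summarize` would call a lightweight model;
    # the default below is a naive extractive fallback.
    if summarize is None:
        summarize = lambda msgs: " | ".join(m["content"][:80] for m in msgs)
    summary = summarize(old)
    return [{"role": "system",
             "content": f"Summary of earlier conversation: {summary}"}] + recent
```

Because the summarizer is injected, the latency cost can be controlled by running it asynchronously between turns rather than on the request path.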

3. Retrieval-Augmented Memory

Store all messages in a vector database and retrieve only the most relevant ones for each new request.

async def retrieval_memory(history: list[dict], query: str,
                           top_k: int = 5) -> list[dict]:
    # Embed the current query and search the vector DB for the most
    # relevant past messages; vector_search is a placeholder for your
    # vector store client's query method
    relevant = await vector_search(query, top_k=top_k)
    # Combine relevant historical messages with the most recent turns
    recent = history[-10:]
    return relevant + recent

4. Tiered Memory

Combine all approaches: recent messages in full, medium-term messages summarized, long-term messages in vector storage.
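The tiering itself reduces to slicing the history into three ranges; a sketch with illustrative thresholds (each tier would then be kept verbatim, summarized, or embedded, respectively):

```python
def tier_messages(history: list[dict], recent_n: int = 10,
                  medium_n: int = 20) -> tuple[list, list, list]:
    """Split history into recent (kept verbatim), medium-term (to be
    summarized), and long-term (to be moved into vector storage)."""
    recent = history[-recent_n:]
    # Slice clamping means these are simply empty for short histories
    medium = history[-(recent_n + medium_n):-recent_n]
    long_term = history[:-(recent_n + medium_n)] \
        if len(history) > recent_n + medium_n else []
    return recent, medium, long_term
```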

Decision Framework

Use this table to choose your approach:

Factor                | Stateful                | Stateless    | Hybrid
----------------------|-------------------------|--------------|-----------
Conversation length   | Short (< 20 turns)      | Any          | Any
Scale requirements    | Single instance         | Horizontal   | Horizontal
Latency sensitivity   | Very high               | Moderate     | High
Budget                | Low infra, high compute | Higher infra | Balanced
Failure tolerance     | Low                     | High         | High
Implementation effort | Low                     | Medium       | High

Start stateless unless you have a specific reason not to. It is easier to add caching to a stateless system than to add durability to a stateful one.

FAQ

How do I migrate from a stateful to a stateless architecture?

Start by adding external state storage alongside your in-memory state. Write session data to Redis on every update while continuing to read from memory. Once the dual-write is stable, switch reads to Redis. Finally, remove the in-memory sessions. This zero-downtime migration takes about a week for most systems.
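A minimal sketch of the dual-write phase (the class and flag names are illustrative, and the Redis client is assumed to be redis.asyncio-compatible):

```python
import json

class DualWriteSessions:
    """Migration shim: writes go to both the in-memory store and Redis;
    a flag controls which store serves reads."""

    def __init__(self, redis_client, read_from_redis: bool = False):
        self.memory: dict[str, list[dict]] = {}
        self.redis = redis_client
        self.read_from_redis = read_from_redis

    async def save(self, session_id: str, history: list[dict]) -> None:
        # Dual-write: keep both stores in sync during the migration window
        self.memory[session_id] = history
        await self.redis.setex(f"session:{session_id}", 3600,
                               json.dumps(history))

    async def load(self, session_id: str) -> list[dict]:
        if self.read_from_redis:  # flip once dual-writes are stable
            raw = await self.redis.get(f"session:{session_id}")
            return json.loads(raw) if raw else []
        return self.memory.get(session_id, [])
```

Flipping `read_from_redis` per instance lets you validate Redis reads gradually before deleting the in-memory dictionary.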

What is the performance impact of loading state from Redis on every request?

A typical Redis GET for a serialized conversation of 20 messages takes 1-3 milliseconds on a local network. This is negligible compared to the 500-5000ms latency of the LLM API call itself. The token cost of re-sending context is a bigger concern than the storage latency.

How do I handle state for multi-agent workflows?

Each agent in the workflow should have its own session state, plus a shared workflow state that tracks the overall progress. Store the workflow state in Redis with a structure like workflow:{id}:state containing the current stage, accumulated results, and the conversation history for each agent.
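As a sketch of that key layout (the naming scheme and state fields are illustrative, not a standard):

```python
import json

def workflow_keys(workflow_id: str, agent_names: list[str]) -> dict[str, str]:
    """Redis key layout for a multi-agent workflow."""
    keys = {"state": f"workflow:{workflow_id}:state"}
    for name in agent_names:
        # One per-agent session history key alongside the shared state
        keys[name] = f"workflow:{workflow_id}:agent:{name}:history"
    return keys

# Example shared workflow state stored under the "state" key
initial_state = json.dumps({
    "stage": "research",
    "completed_stages": [],
    "results": {},
})
```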

When should I use a database instead of Redis for session storage?

Use a database (PostgreSQL) when sessions need to persist for days or weeks, when you need to query across sessions (analytics), or when session data is too large for Redis memory. Use Redis when sessions are short-lived (hours), latency is critical, and you can afford to lose old sessions.

Written by

CallSphere Team
