Agent State Machines: Managing Complex Multi-Step Workflows with Explicit States

Why State Machines for Agents?

Many agent tasks are not simple request-response exchanges. They involve multi-step workflows: gather requirements, research options, draft a proposal, get approval, execute. Without explicit state management, agents tend to lose track of where they are in complex workflows, repeat steps, or skip critical stages.

A finite state machine (FSM) solves this by defining every possible state the agent can be in, every valid transition between states, and the conditions (guards) that must be met for a transition to fire. The result is an agent whose behavior is predictable, debuggable, and easy to persist and resume.
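
As a minimal sketch of those three ingredients, the snippet below models the example workflow from the previous section (gather requirements, research, draft, approval, execute) as a transition table. The `Step` names, the plain-dict context, the `approved` flag, and the `step` helper are illustrative assumptions, not part of any library.

```python
from enum import Enum
from typing import Callable, Dict, Tuple


class Step(Enum):
    GATHER = "gather_requirements"
    RESEARCH = "research_options"
    DRAFT = "draft_proposal"
    APPROVAL = "get_approval"
    EXECUTE = "execute"


# Guard type: a predicate over a plain dict of workflow data (an assumption).
Guard = Callable[[dict], bool]

# Every valid (from, to) pair, each with the guard that must pass.
TRANSITIONS: Dict[Tuple[Step, Step], Guard] = {
    (Step.GATHER, Step.RESEARCH): lambda ctx: bool(ctx.get("requirements")),
    (Step.RESEARCH, Step.DRAFT): lambda ctx: bool(ctx.get("options")),
    (Step.DRAFT, Step.APPROVAL): lambda ctx: bool(ctx.get("proposal")),
    (Step.APPROVAL, Step.EXECUTE): lambda ctx: ctx.get("approved") is True,
}


def step(current: Step, target: Step, ctx: dict) -> Step:
    """Return the new state, or raise if the move is undefined or blocked."""
    guard = TRANSITIONS.get((current, target))
    if guard is None:
        raise ValueError(f"No transition {current.value} -> {target.value}")
    if not guard(ctx):
        raise ValueError(f"Guard blocked {current.value} -> {target.value}")
    return target
```

Skipping a stage (for example, jumping from `GATHER` straight to `EXECUTE`) raises immediately, which is the property that keeps an agent from silently dropping a step.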

Designing an Agent State Machine

Consider a customer onboarding agent. It needs to: collect user info, verify identity, set up an account, configure preferences, and send a welcome message. Here is how to model this as a state machine.

from enum import Enum
from dataclasses import dataclass, field
from typing import Dict, Callable, Optional, Any, List
from datetime import datetime

class OnboardingState(Enum):
    COLLECTING_INFO = "collecting_info"
    VERIFYING_IDENTITY = "verifying_identity"
    CREATING_ACCOUNT = "creating_account"
    CONFIGURING_PREFS = "configuring_preferences"
    SENDING_WELCOME = "sending_welcome"
    COMPLETED = "completed"
    ERROR = "error"

@dataclass
class StateContext:
    """Mutable data that travels with the state machine."""
    user_data: Dict[str, Any] = field(default_factory=dict)
    verification_result: Optional[bool] = None
    account_id: Optional[str] = None
    error_message: Optional[str] = None
    history: List[str] = field(default_factory=list)

Implementing the State Machine Engine

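The engine manages transitions, enforces guards, and runs two kinds of hooks: an optional action attached to a transition, and a handler that executes on entry to the target state. Transitions are checked in the order they were registered, and the first one whose guard passes fires.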

@dataclass
class Transition:
    from_state: OnboardingState
    to_state: OnboardingState
    guard: Optional[Callable[[StateContext], bool]] = None
    action: Optional[Callable[[StateContext], None]] = None

class AgentStateMachine:
    def __init__(self, initial_state: OnboardingState, context: Optional[StateContext] = None):
        self.current_state = initial_state
        self.context = context or StateContext()
        self.transitions: List[Transition] = []
        self.state_handlers: Dict[OnboardingState, Callable] = {}
        self.context.history.append(
            f"{datetime.utcnow().isoformat()}: entered {initial_state.value}"
        )

    def add_transition(
        self,
        from_state: OnboardingState,
        to_state: OnboardingState,
        guard: Optional[Callable[[StateContext], bool]] = None,
        action: Optional[Callable[[StateContext], None]] = None,
    ):
        self.transitions.append(Transition(from_state, to_state, guard, action))

    def register_handler(self, state: OnboardingState, handler: Callable):
        """Register an async function to execute when entering a state."""
        self.state_handlers[state] = handler

    async def advance(self) -> bool:
        """Try to transition to the next valid state. Returns True if transitioned."""
        for t in self.transitions:
            if t.from_state != self.current_state:
                continue
            if t.guard and not t.guard(self.context):
                continue

            # Execute transition action
            if t.action:
                t.action(self.context)

            # Move to new state
            old_state = self.current_state
            self.current_state = t.to_state
            self.context.history.append(
                f"{datetime.utcnow().isoformat()}: "
                f"{old_state.value} -> {t.to_state.value}"
            )

            # Run the state handler
            if t.to_state in self.state_handlers:
                await self.state_handlers[t.to_state](self.context)

            return True

        return False  # No valid transition found

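    async def run_to_completion(self, max_steps: int = 20):
        """Run the state machine until it reaches a terminal state."""
        # The initial state is never entered through a transition, so advance()
        # does not run its handler. Run it once on a fresh machine (only the
        # constructor's history entry exists) so the first guard sees real data.
        handler = self.state_handlers.get(self.current_state)
        if handler and len(self.context.history) == 1:
            await handler(self.context)

        for _ in range(max_steps):
            if self.current_state in (OnboardingState.COMPLETED, OnboardingState.ERROR):
                break
            advanced = await self.advance()
            if not advanced:
                self.context.error_message = (
                    f"Stuck in {self.current_state.value}: no valid transition"
                )
                self.current_state = OnboardingState.ERROR
                break
        return self.current_state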

Wiring Up the Onboarding Workflow

Now define the handlers and guards for each state.

async def collect_info(ctx: StateContext):
    """Simulate collecting user information via agent conversation."""
    # In production, this would involve LLM-driven conversation
    ctx.user_data = {
        "name": "Alice Johnson",
        "email": "alice@example.com",
        "id_document": "passport_12345",
    }

async def verify_identity(ctx: StateContext):
    """Call an identity verification API."""
    # Placeholder check; in production, call a real verification service here.
    doc = ctx.user_data.get("id_document", "")
    ctx.verification_result = bool(doc and len(doc) > 5)

async def create_account(ctx: StateContext):
    """Create the user account in the system."""
    ctx.account_id = f"acct_{ctx.user_data['email'].split('@')[0]}"

async def configure_prefs(ctx: StateContext):
    """Set default preferences for the new account."""
    ctx.user_data["preferences"] = {"theme": "light", "notifications": True}

async def send_welcome(ctx: StateContext):
    """Send welcome email."""
    print(f"Welcome email sent to {ctx.user_data['email']}")

# Build the state machine
sm = AgentStateMachine(OnboardingState.COLLECTING_INFO)

sm.register_handler(OnboardingState.COLLECTING_INFO, collect_info)
sm.register_handler(OnboardingState.VERIFYING_IDENTITY, verify_identity)
sm.register_handler(OnboardingState.CREATING_ACCOUNT, create_account)
sm.register_handler(OnboardingState.CONFIGURING_PREFS, configure_prefs)
sm.register_handler(OnboardingState.SENDING_WELCOME, send_welcome)

# Define transitions with guards
sm.add_transition(
    OnboardingState.COLLECTING_INFO,
    OnboardingState.VERIFYING_IDENTITY,
    guard=lambda ctx: bool(ctx.user_data.get("email")),
)
sm.add_transition(
    OnboardingState.VERIFYING_IDENTITY,
    OnboardingState.CREATING_ACCOUNT,
    guard=lambda ctx: ctx.verification_result is True,
)
sm.add_transition(
    OnboardingState.VERIFYING_IDENTITY,
    OnboardingState.ERROR,
    guard=lambda ctx: ctx.verification_result is False,
    action=lambda ctx: setattr(ctx, "error_message", "Identity verification failed"),
)
sm.add_transition(
    OnboardingState.CREATING_ACCOUNT,
    OnboardingState.CONFIGURING_PREFS,
    guard=lambda ctx: ctx.account_id is not None,
)
sm.add_transition(
    OnboardingState.CONFIGURING_PREFS,
    OnboardingState.SENDING_WELCOME,
)
sm.add_transition(
    OnboardingState.SENDING_WELCOME,
    OnboardingState.COMPLETED,
)
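
Since every valid transition is listed explicitly, the table itself can be checked before any agent runs. The following is a rough sketch of such a check, assuming the same (from, to) pairs registered above with guards ignored; `check_table`, `EDGES`, and `TERMINAL` are hypothetical names introduced for illustration.

```python
from collections import deque
from enum import Enum
from typing import Dict, List, Set, Tuple


class OnboardingState(Enum):  # same members as the enum defined earlier
    COLLECTING_INFO = "collecting_info"
    VERIFYING_IDENTITY = "verifying_identity"
    CREATING_ACCOUNT = "creating_account"
    CONFIGURING_PREFS = "configuring_preferences"
    SENDING_WELCOME = "sending_welcome"
    COMPLETED = "completed"
    ERROR = "error"


S = OnboardingState
TERMINAL = {S.COMPLETED, S.ERROR}

# The (from, to) pairs registered above, with guards omitted.
EDGES: List[Tuple[S, S]] = [
    (S.COLLECTING_INFO, S.VERIFYING_IDENTITY),
    (S.VERIFYING_IDENTITY, S.CREATING_ACCOUNT),
    (S.VERIFYING_IDENTITY, S.ERROR),
    (S.CREATING_ACCOUNT, S.CONFIGURING_PREFS),
    (S.CONFIGURING_PREFS, S.SENDING_WELCOME),
    (S.SENDING_WELCOME, S.COMPLETED),
]


def check_table(start: S, edges: List[Tuple[S, S]]) -> Dict[str, Set[S]]:
    """Report unreachable states and non-terminal states with no way out."""
    outgoing: Dict[S, List[S]] = {}
    for src, dst in edges:
        outgoing.setdefault(src, []).append(dst)

    seen, queue = {start}, deque([start])
    while queue:  # breadth-first walk from the initial state
        for nxt in outgoing.get(queue.popleft(), []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)

    return {
        "unreachable": set(S) - seen,
        "dead_ends": {s for s in seen if s not in TERMINAL and s not in outgoing},
    }


report = check_table(S.COLLECTING_INFO, EDGES)
```

With the real machine, the edge list could be built from `sm.transitions` as `[(t.from_state, t.to_state) for t in sm.transitions]`. The check says nothing about guards, so a state can still get stuck at runtime if none of its guards pass.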

Persistence

Because the state machine's entire state lives in the StateContext dataclass plus the current_state enum, persisting it is straightforward: serialize both to JSON (the enum by its value) and save the result to a database. On resume, deserialize both and continue from the saved state. This assumes everything stored in user_data is itself JSON-serializable.
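
A minimal sketch of that round trip is shown below. The enum and dataclass are trimmed copies of the earlier definitions so the snippet runs on its own; `dump_snapshot` and `load_snapshot` are hypothetical helper names, and the database write is left out.

```python
import json
from dataclasses import asdict, dataclass, field
from enum import Enum
from typing import Any, Dict, List, Optional, Tuple


class OnboardingState(Enum):  # trimmed copy of the enum defined earlier
    COLLECTING_INFO = "collecting_info"
    CREATING_ACCOUNT = "creating_account"
    COMPLETED = "completed"


@dataclass
class StateContext:  # same fields as the StateContext defined earlier
    user_data: Dict[str, Any] = field(default_factory=dict)
    verification_result: Optional[bool] = None
    account_id: Optional[str] = None
    error_message: Optional[str] = None
    history: List[str] = field(default_factory=list)


def dump_snapshot(state: OnboardingState, ctx: StateContext) -> str:
    """Serialize the current state and context to a JSON string."""
    return json.dumps({"state": state.value, "context": asdict(ctx)})


def load_snapshot(raw: str) -> Tuple[OnboardingState, StateContext]:
    """Rebuild the state enum and context from a JSON string."""
    data = json.loads(raw)
    return OnboardingState(data["state"]), StateContext(**data["context"])


ctx = StateContext(user_data={"email": "alice@example.com"}, verification_result=True)
raw = dump_snapshot(OnboardingState.CREATING_ACCOUNT, ctx)
state, restored = load_snapshot(raw)
```

The restored pair can be passed back to `AgentStateMachine(state, restored)`. Note that the constructor appends a fresh "entered" line to the history, and that the transitions and handlers are code, not data, so they have to be registered again after loading.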

FAQ

When should I use a state machine instead of letting the LLM decide the next step?

Use a state machine when the workflow has clearly defined stages with strict ordering requirements, such as compliance workflows, approval chains, or multi-step onboarding. Let the LLM decide when the workflow is exploratory or the steps cannot be predicted in advance.

How do I handle errors that require retrying a state?

Add a retry counter to your StateContext and a self-transition (same from and to state) with a guard that checks the retry count. When the retry limit is exceeded, transition to the ERROR state instead.
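
The sketch below illustrates the idea with a stripped-down, synchronous stand-in for the engine. The `retries` field, `MAX_RETRIES`, and the `flaky_verify` stub are assumptions made for the example. As in the engine above, the first matching row wins, so the success transition is listed first, the retry self-transition second, and the ERROR fallback last.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Tuple


class S(Enum):
    VERIFYING = "verifying_identity"
    CREATING = "creating_account"
    ERROR = "error"


@dataclass
class Ctx:
    verification_result: Optional[bool] = None
    retries: int = 0  # hypothetical field added for this sketch


MAX_RETRIES = 3  # assumed limit


def count_retry(ctx: Ctx) -> None:
    ctx.retries += 1


# Rows are (from, to, guard, action), checked in order.
TABLE = [
    (S.VERIFYING, S.CREATING, lambda c: c.verification_result is True, None),
    (S.VERIFYING, S.VERIFYING, lambda c: c.retries < MAX_RETRIES, count_retry),
    (S.VERIFYING, S.ERROR, lambda c: True, None),
]


def advance(state: S, ctx: Ctx) -> S:
    """Fire the first transition from `state` whose guard passes."""
    for src, dst, guard, action in TABLE:
        if src == state and guard(ctx):
            if action:
                action(ctx)
            return dst
    return state


def flaky_verify(ctx: Ctx, succeed_on: int) -> None:
    """Stand-in for the state handler: succeeds on the given attempt."""
    attempt = ctx.retries + 1
    ctx.verification_result = attempt >= succeed_on


def run(succeed_on: int) -> Tuple[S, int]:
    """Re-run the handler on every (re-)entry until the state changes."""
    ctx, state = Ctx(), S.VERIFYING
    while state is S.VERIFYING:
        flaky_verify(ctx, succeed_on)
        state = advance(state, ctx)
    return state, ctx.retries
```

In the `AgentStateMachine` engine, a self-transition re-enters the state and so re-runs its handler, which is what makes the retry happen. Each retry also consumes one of the `max_steps` in `run_to_completion`, so the step budget needs to allow for the retry limit.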

Can I combine state machines with LLM-driven agents?

Absolutely. The state machine controls the high-level workflow structure, while individual state handlers can use LLM agents for the actual work within each state. This gives you the predictability of explicit states with the flexibility of AI-driven execution.
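
One way this split might look is sketched below: a handler factory receives an LLM call as a parameter and only writes its result into the context. The `LLMCall` signature, `make_welcome_handler`, the `welcome_text` key, and the `fake_llm` stub are all hypothetical; a real model client would replace the stub.

```python
import asyncio
from dataclasses import dataclass, field
from typing import Any, Awaitable, Callable, Dict

# Hypothetical signature for an LLM call: prompt in, text out.
LLMCall = Callable[[str], Awaitable[str]]


@dataclass
class StateContext:  # trimmed copy of the context defined earlier
    user_data: Dict[str, Any] = field(default_factory=dict)


def make_welcome_handler(llm: LLMCall) -> Callable[[StateContext], Awaitable[None]]:
    """Build a state handler that delegates the wording to an LLM."""

    async def write_welcome(ctx: StateContext) -> None:
        prompt = f"Write a one-line welcome for {ctx.user_data['name']}."
        # The handler only fills in context; the state machine still decides
        # which state comes next.
        ctx.user_data["welcome_text"] = await llm(prompt)

    return write_welcome


async def fake_llm(prompt: str) -> str:
    """Stub standing in for a real model client."""
    return f"[model reply to: {prompt}]"


ctx = StateContext(user_data={"name": "Alice Johnson"})
asyncio.run(make_welcome_handler(fake_llm)(ctx))
```

Such a handler would be registered like any other, for example `sm.register_handler(OnboardingState.SENDING_WELCOME, make_welcome_handler(fake_llm))`. Passing the LLM call in as an argument also makes the handler easy to test with a stub.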
