Agent State Machines: Managing Complex Multi-Step Workflows with Explicit States
Learn how to model AI agent workflows as finite state machines with explicit states, transitions, and guards — providing predictable behavior, easy debugging, and reliable persistence for long-running tasks.
Why State Machines for Agents?
Many agent tasks are not simple request-response exchanges. They involve multi-step workflows: gather requirements, research options, draft a proposal, get approval, execute. Without explicit state management, agents tend to lose track of where they are in complex workflows, repeat steps, or skip critical stages.
A finite state machine (FSM) solves this by defining every possible state the agent can be in, every valid transition between states, and the conditions (guards) that must be met for a transition to fire. The result is an agent whose behavior is predictable, debuggable, and easy to persist and resume.
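As a rough illustration of those three ingredients, here is a minimal sketch of a transition table with one guard. The `ReviewState` enum, the `approved` flag, and the `next_state` helper are made up for this example and are not part of the onboarding agent built in the next section.

```python
from enum import Enum
from typing import Callable, Dict, List, Optional, Tuple


class ReviewState(Enum):
    DRAFT = "draft"
    IN_REVIEW = "in_review"
    APPROVED = "approved"


# A guard receives the workflow data and says whether a transition may fire.
Guard = Callable[[Dict[str, bool]], bool]

# Each entry is (from_state, to_state, guard); a guard of None always passes.
TRANSITIONS: List[Tuple[ReviewState, ReviewState, Optional[Guard]]] = [
    (ReviewState.DRAFT, ReviewState.IN_REVIEW, None),
    (ReviewState.IN_REVIEW, ReviewState.APPROVED, lambda data: data.get("approved", False)),
]


def next_state(current: ReviewState, data: Dict[str, bool]) -> Optional[ReviewState]:
    """Return the target of the first transition whose guard passes, or None."""
    for from_state, to_state, guard in TRANSITIONS:
        if from_state == current and (guard is None or guard(data)):
            return to_state
    return None
```

Calling `next_state(ReviewState.IN_REVIEW, {})` returns `None` because the guard blocks the move until `approved` is set, which is the kind of explicit "not yet" answer an unstructured agent loop does not give you.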
Designing an Agent State Machine
Consider a customer onboarding agent. It needs to: collect user info, verify identity, set up an account, configure preferences, and send a welcome message. Here is how to model this as a state machine.
from enum import Enum
from dataclasses import dataclass, field
from typing import Dict, Callable, Optional, Any, List
from datetime import datetime


class OnboardingState(Enum):
    COLLECTING_INFO = "collecting_info"
    VERIFYING_IDENTITY = "verifying_identity"
    CREATING_ACCOUNT = "creating_account"
    CONFIGURING_PREFS = "configuring_preferences"
    SENDING_WELCOME = "sending_welcome"
    COMPLETED = "completed"
    ERROR = "error"


@dataclass
class StateContext:
    """Mutable data that travels with the state machine."""

    user_data: Dict[str, Any] = field(default_factory=dict)
    verification_result: Optional[bool] = None
    account_id: Optional[str] = None
    error_message: Optional[str] = None
    history: List[str] = field(default_factory=list)
Implementing the State Machine Engine
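The engine manages transitions, enforces guards, runs an optional action on each transition, and calls the handler registered for the state being entered. Transitions are checked in the order they were added, and the first one whose guard passes fires.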
from datetime import datetime, timezone  # timezone gives UTC-aware timestamps (utcnow() is deprecated)


@dataclass
class Transition:
    from_state: OnboardingState
    to_state: OnboardingState
    guard: Optional[Callable[[StateContext], bool]] = None
    action: Optional[Callable[[StateContext], None]] = None


class AgentStateMachine:
    def __init__(self, initial_state: OnboardingState, context: Optional[StateContext] = None):
        self.current_state = initial_state
        self.context = context or StateContext()
        self.transitions: List[Transition] = []
        self.state_handlers: Dict[OnboardingState, Callable] = {}
        self._started = False  # True once the initial state's handler has run
        self.context.history.append(
            f"{datetime.now(timezone.utc).isoformat()}: entered {initial_state.value}"
        )

    def add_transition(
        self,
        from_state: OnboardingState,
        to_state: OnboardingState,
        guard: Optional[Callable[[StateContext], bool]] = None,
        action: Optional[Callable[[StateContext], None]] = None,
    ):
        self.transitions.append(Transition(from_state, to_state, guard, action))

    def register_handler(self, state: OnboardingState, handler: Callable):
        """Register an async function to execute when entering a state."""
        self.state_handlers[state] = handler

    async def advance(self) -> bool:
        """Try to transition to the next valid state. Returns True if transitioned."""
        for t in self.transitions:
            if t.from_state != self.current_state:
                continue
            if t.guard and not t.guard(self.context):
                continue
            # Execute transition action
            if t.action:
                t.action(self.context)
            # Move to new state
            old_state = self.current_state
            self.current_state = t.to_state
            self.context.history.append(
                f"{datetime.now(timezone.utc).isoformat()}: "
                f"{old_state.value} -> {t.to_state.value}"
            )
            # Run the state handler
            if t.to_state in self.state_handlers:
                await self.state_handlers[t.to_state](self.context)
            return True
        return False  # No valid transition found

    async def run_to_completion(self, max_steps: int = 20) -> OnboardingState:
        """Run the state machine until it reaches a terminal state."""
        # Handlers otherwise run only when a transition enters their state,
        # so the initial state's handler has to be run once up front.
        if not self._started:
            self._started = True
            handler = self.state_handlers.get(self.current_state)
            if handler:
                await handler(self.context)
        for _ in range(max_steps):
            if self.current_state in (OnboardingState.COMPLETED, OnboardingState.ERROR):
                break
            advanced = await self.advance()
            if not advanced:
                self.context.error_message = (
                    f"Stuck in {self.current_state.value}: no valid transition"
                )
                self.current_state = OnboardingState.ERROR
                break
        return self.current_state
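Since the engine ends in the ERROR state whenever no transition is valid, it can be useful to check a transition table for dead ends before running it. The following is a small sketch of such a check, assuming states are given as plain strings and transitions as (from, to) pairs; `find_dead_ends` is a hypothetical helper and not part of the engine above.

```python
from typing import Iterable, List, Set, Tuple


def find_dead_ends(
    states: Iterable[str],
    transitions: Iterable[Tuple[str, str]],
    terminal: Set[str],
) -> List[str]:
    """Return non-terminal states that have no outgoing transition."""
    sources = {from_state for from_state, _ in transitions}
    return [s for s in states if s not in terminal and s not in sources]


states = ["collecting_info", "verifying_identity", "creating_account", "completed", "error"]
transitions = [
    ("collecting_info", "verifying_identity"),
    ("verifying_identity", "creating_account"),
    ("verifying_identity", "error"),
    # Nothing leaves "creating_account", so the check should flag it.
]
dead_ends = find_dead_ends(states, transitions, terminal={"completed", "error"})
print(dead_ends)
```

This only inspects the shape of the table. It cannot tell whether a guard will ever pass at runtime, so the "stuck" handling in `run_to_completion` is still needed.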
Wiring Up the Onboarding Workflow
Now define the handlers and guards for each state.
import asyncio


async def collect_info(ctx: StateContext):
    """Simulate collecting user information via agent conversation."""
    # In production, this would involve LLM-driven conversation
    ctx.user_data = {
        "name": "Alice Johnson",
        "email": "alice@example.com",
        "id_document": "passport_12345",
    }


async def verify_identity(ctx: StateContext):
    """Stand-in for a call to an identity verification API."""
    doc = ctx.user_data.get("id_document", "")
    # Placeholder rule: any document ID longer than 5 characters passes
    ctx.verification_result = bool(doc and len(doc) > 5)


async def create_account(ctx: StateContext):
    """Create the user account in the system."""
    ctx.account_id = f"acct_{ctx.user_data['email'].split('@')[0]}"


async def configure_prefs(ctx: StateContext):
    """Set default preferences for the new account."""
    ctx.user_data["preferences"] = {"theme": "light", "notifications": True}


async def send_welcome(ctx: StateContext):
    """Send welcome email."""
    print(f"Welcome email sent to {ctx.user_data['email']}")


# Build the state machine
sm = AgentStateMachine(OnboardingState.COLLECTING_INFO)
sm.register_handler(OnboardingState.COLLECTING_INFO, collect_info)
sm.register_handler(OnboardingState.VERIFYING_IDENTITY, verify_identity)
sm.register_handler(OnboardingState.CREATING_ACCOUNT, create_account)
sm.register_handler(OnboardingState.CONFIGURING_PREFS, configure_prefs)
sm.register_handler(OnboardingState.SENDING_WELCOME, send_welcome)

# Define transitions with guards
sm.add_transition(
    OnboardingState.COLLECTING_INFO,
    OnboardingState.VERIFYING_IDENTITY,
    guard=lambda ctx: bool(ctx.user_data.get("email")),
)
sm.add_transition(
    OnboardingState.VERIFYING_IDENTITY,
    OnboardingState.CREATING_ACCOUNT,
    guard=lambda ctx: ctx.verification_result is True,
)
sm.add_transition(
    OnboardingState.VERIFYING_IDENTITY,
    OnboardingState.ERROR,
    guard=lambda ctx: ctx.verification_result is False,
    action=lambda ctx: setattr(ctx, "error_message", "Identity verification failed"),
)
sm.add_transition(
    OnboardingState.CREATING_ACCOUNT,
    OnboardingState.CONFIGURING_PREFS,
    guard=lambda ctx: ctx.account_id is not None,
)
sm.add_transition(
    OnboardingState.CONFIGURING_PREFS,
    OnboardingState.SENDING_WELCOME,
)
sm.add_transition(
    OnboardingState.SENDING_WELCOME,
    OnboardingState.COMPLETED,
)

# Run the workflow and inspect where it ended up
final_state = asyncio.run(sm.run_to_completion())
print(final_state, sm.context.error_message)
Persistence
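Because the state machine's entire runtime state lives in the StateContext dataclass plus the current_state enum, persisting it is straightforward: serialize both to JSON (the enum by its value, the context with dataclasses.asdict) and save the result to a database. This works as long as everything stored in user_data is JSON-serializable. On resume, deserialize both, rebuild the machine with the same transitions and handlers, and continue from where you left off.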
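One possible shape for that round trip is sketched below. It repeats a trimmed copy of the enum and the same StateContext fields so that it runs on its own; `dump_snapshot` and `load_snapshot` are hypothetical helper names, and the database write itself is left out.

```python
import json
from dataclasses import asdict, dataclass, field
from enum import Enum
from typing import Any, Dict, List, Optional, Tuple


class OnboardingState(Enum):  # trimmed copy of the article's enum
    COLLECTING_INFO = "collecting_info"
    VERIFYING_IDENTITY = "verifying_identity"
    COMPLETED = "completed"


@dataclass
class StateContext:  # same fields as the article's StateContext
    user_data: Dict[str, Any] = field(default_factory=dict)
    verification_result: Optional[bool] = None
    account_id: Optional[str] = None
    error_message: Optional[str] = None
    history: List[str] = field(default_factory=list)


def dump_snapshot(state: OnboardingState, ctx: StateContext) -> str:
    """Serialize the current state and context to a JSON string."""
    return json.dumps({"state": state.value, "context": asdict(ctx)})


def load_snapshot(raw: str) -> Tuple[OnboardingState, StateContext]:
    """Rebuild the state and context from a string made by dump_snapshot."""
    data = json.loads(raw)
    return OnboardingState(data["state"]), StateContext(**data["context"])


ctx = StateContext(user_data={"email": "alice@example.com"}, verification_result=True)
raw = dump_snapshot(OnboardingState.VERIFYING_IDENTITY, ctx)
restored_state, restored_ctx = load_snapshot(raw)
```

The transitions and handlers are code rather than data, so they are registered again on resume instead of being stored in the snapshot.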
FAQ
When should I use a state machine instead of letting the LLM decide the next step?
Use state machines when the workflow has clearly defined stages with strict ordering requirements — like compliance workflows, approval chains, or multi-step onboarding. Let the LLM decide when the workflow is exploratory or the steps are not predictable in advance.
How do I handle errors that require retrying a state?
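Add a retry counter to your StateContext and a self-transition (same from and to state) whose guard checks that the counter is below the limit and whose action increments it. Add a second transition to the ERROR state whose guard passes once the limit is reached. Since the engine fires the first transition whose guard passes, either make the guards mutually exclusive or register the transitions in priority order.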
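A stripped-down sketch of that retry pattern follows. It uses plain strings for states and its own tiny `step` function rather than the engine above, and `MAX_RETRIES`, `RetryContext`, and `RULES` are names assumed for the example.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

MAX_RETRIES = 3  # assumed limit for this sketch


@dataclass
class RetryContext:
    succeeded: bool = False
    retries: int = 0


def count_retry(ctx: RetryContext) -> None:
    """Transition action: record one more attempt."""
    ctx.retries += 1


# (from_state, to_state, guard, action), checked in order; the first passing guard wins.
Rule = Tuple[str, str, Callable[[RetryContext], bool], Callable[[RetryContext], None]]
RULES: List[Rule] = [
    ("verifying", "done", lambda c: c.succeeded, lambda c: None),
    ("verifying", "verifying", lambda c: c.retries < MAX_RETRIES, count_retry),
    ("verifying", "error", lambda c: True, lambda c: None),
]


def step(state: str, ctx: RetryContext) -> str:
    """Apply the first rule for the current state whose guard passes."""
    for from_state, to_state, guard, action in RULES:
        if from_state == state and guard(ctx):
            action(ctx)
            return to_state
    return state


ctx = RetryContext()  # succeeded stays False, so every attempt fails
state = "verifying"
trace = []
while state == "verifying":
    state = step(state, ctx)
    trace.append(state)
```

In the engine from this article, a self-transition re-enters the state and therefore runs its handler again, which is what actually performs the retry.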
Can I combine state machines with LLM-driven agents?
Absolutely. The state machine controls the high-level workflow structure, while individual state handlers can use LLM agents for the actual work within each state. This gives you the predictability of explicit states with the flexibility of AI-driven execution.
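As a sketch of that split, the handler below delegates the work inside one state to a model call while the guard stays a plain check. `LLMCall`, `make_collect_info_handler`, and `fake_llm` are hypothetical names, the prompt is illustrative, and the handler takes a plain dict instead of the full StateContext to keep the example short.

```python
import asyncio
import json
from typing import Any, Awaitable, Callable, Dict

# Hypothetical async LLM client: takes a prompt, returns the model's text reply.
LLMCall = Callable[[str], Awaitable[str]]


def make_collect_info_handler(call_llm: LLMCall) -> Callable[[Dict[str, Any]], Awaitable[None]]:
    """Build a state handler that asks the LLM to extract fields into user_data."""

    async def collect_info(user_data: Dict[str, Any]) -> None:
        reply = await call_llm("Extract the user's name and email as JSON.")
        try:
            parsed = json.loads(reply)
        except json.JSONDecodeError:
            return  # leave user_data unchanged so the guard keeps blocking
        if isinstance(parsed, dict):
            user_data.update(parsed)

    return collect_info


async def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call so the sketch runs offline."""
    return '{"name": "Alice Johnson", "email": "alice@example.com"}'


user_data: Dict[str, Any] = {}
asyncio.run(make_collect_info_handler(fake_llm)(user_data))
email_guard_passes = bool(user_data.get("email"))  # same check as the article's guard
```

If the model returns something unparseable, the handler writes nothing, the guard fails, and the state machine reports that it is stuck instead of moving on with bad data.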
CallSphere Team