Error Handling in LangGraph: Retry Nodes, Fallback Paths, and Recovery
Build resilient LangGraph workflows with try/except patterns in nodes, fallback conditional edges, configurable retry logic, and dead-end recovery strategies for production agent systems.
Errors Are Inevitable in Agent Systems
Agent workflows interact with external systems — LLM APIs, databases, web services, file systems. Any of these can fail. API rate limits, network timeouts, malformed LLM outputs, and tool execution errors are not edge cases — they are normal operating conditions. Production LangGraph workflows must handle errors gracefully rather than crashing and losing all accumulated state.
Error Handling Inside Nodes
The first line of defense is try/except blocks within node functions:
from typing import TypedDict, Annotated
from langgraph.graph import StateGraph, START, END
from langgraph.graph.message import add_messages
from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage, AIMessage
class State(TypedDict):
messages: Annotated[list, add_messages]
error: str
retry_count: int
llm = ChatOpenAI(model="gpt-4o-mini")
def call_llm(state: State) -> dict:
try:
response = llm.invoke(state["messages"])
return {
"messages": [response],
"error": "",
"retry_count": 0,  # Reset on success so later failures get a fresh retry budget
}
except Exception as e:
return {
"error": str(e),
"retry_count": state.get("retry_count", 0) + 1,
}
By catching exceptions and writing error information to state, you keep the graph running and let downstream nodes or routing logic decide how to recover.
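One refinement, sketched here with a hypothetical RateLimitError standing in for a provider-specific exception, is to separate retryable failures (timeouts, rate limits) from fatal ones (bad input, bad credentials) so routing logic only retries errors that could plausibly succeed on another attempt:

```python
class RateLimitError(Exception):
    """Stand-in for a provider rate-limit exception (e.g. from an LLM SDK)."""

# Exception types worth retrying; everything else is treated as fatal.
RETRYABLE = (TimeoutError, ConnectionError, RateLimitError)

def classify_error(exc: Exception) -> str:
    """Return a routing hint to store in state alongside the error text."""
    return "retryable" if isinstance(exc, RETRYABLE) else "fatal"

def call_with_classification(fn) -> dict:
    """Run a callable, converting exceptions into classified state updates."""
    try:
        return {"result": fn(), "error": "", "error_kind": ""}
    except Exception as e:
        return {"error": str(e), "error_kind": classify_error(e)}
```

A routing function can then send "fatal" errors straight to the fallback node instead of burning retries on them.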
Fallback Edges Based on Error State
Use conditional edges to route to different nodes depending on whether an error occurred:
from typing import Literal
def check_error(state: State) -> Literal["retry", "fallback", "continue"]:
if state.get("error"):
if state.get("retry_count", 0) < 3:
return "retry"
return "fallback"
return "continue"
def retry_node(state: State) -> dict:
"""Wait briefly and clear the error for retry."""
import time
time.sleep(1) # Back off before retry
return {"error": ""}
def fallback_node(state: State) -> dict:
"""Provide a graceful degradation response."""
return {
"messages": [AIMessage(
content="I encountered an issue processing your request. "
"Here is what I can tell you based on available information."
)],
"error": "",
}
builder = StateGraph(State)
builder.add_node("agent", call_llm)
builder.add_node("retry", retry_node)
builder.add_node("fallback", fallback_node)
builder.add_node("respond", lambda s: s)
builder.add_edge(START, "agent")
builder.add_conditional_edges("agent", check_error, {
"retry": "retry",
"fallback": "fallback",
"continue": "respond",
})
builder.add_edge("retry", "agent") # Loop back for retry
builder.add_edge("fallback", END)
builder.add_edge("respond", END)
graph = builder.compile()
This pattern gives the agent three attempts before falling back to a graceful degradation response.
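The routing behavior can be checked without an LLM. This sketch replaces the model call with a deliberately flaky function (an assumption for illustration) and drives check_error by hand to show the retry-then-continue sequence:

```python
def check_error(state: dict) -> str:
    if state.get("error"):
        if state.get("retry_count", 0) < 3:
            return "retry"
        return "fallback"
    return "continue"

def flaky_call(state: dict, fail_times: int = 2) -> dict:
    """Fails the first `fail_times` attempts, then succeeds."""
    if state["retry_count"] < fail_times:
        return {"error": "timeout", "retry_count": state["retry_count"] + 1}
    return {"error": "", "retry_count": state["retry_count"]}

state = {"error": "", "retry_count": 0}
routes = []
route = None
while route not in ("continue", "fallback"):
    state.update(flaky_call(state))
    route = check_error(state)
    routes.append(route)
    if route == "retry":
        state["error"] = ""  # The retry node clears the error before looping back

# routes == ["retry", "retry", "continue"]
```

Raising fail_times to 3 or more would instead end with "fallback", matching the three-attempt budget above.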
Exponential Backoff Retry
For more sophisticated retry logic, implement exponential backoff:
import time
def smart_retry(state: State) -> dict:
count = state.get("retry_count", 0)
delay = min(2 ** count, 30)  # 2s, 4s, 8s, 16s... capped at 30s
time.sleep(delay)
return {"error": ""}
This prevents overwhelming a failing service with rapid retries while still recovering quickly from transient errors.
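A common refinement, sketched below, is adding random jitter so that many clients retrying in lockstep do not hammer the recovering service at the same instant (the "full jitter" strategy described in AWS's backoff guidance):

```python
import random

def backoff_delay(count: int, base: float = 2.0, cap: float = 30.0) -> float:
    """Exponential backoff with full jitter: uniform in [0, min(base**count, cap)]."""
    return random.uniform(0, min(base ** count, cap))

# Average delay grows with each retry, but actual delays are spread randomly.
delays = [backoff_delay(c) for c in range(1, 6)]
```

The cap keeps a long retry chain from sleeping for minutes at a time.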
Tool Error Recovery
Tools fail frequently — APIs return errors, queries time out, external services go down. Build error handling directly into your tools:
from langchain_core.tools import tool
import httpx
@tool
def fetch_data(url: str) -> str:
"""Fetch data from a URL with error handling."""
try:
response = httpx.get(url, timeout=10)
response.raise_for_status()
return response.text[:2000]
except httpx.TimeoutException:
return "ERROR: Request timed out. The server may be slow or unreachable."
except httpx.HTTPStatusError as e:
return f"ERROR: HTTP {e.response.status_code}. The resource may not exist."
except Exception as e:
return f"ERROR: {type(e).__name__}: {e}"
Returning error strings instead of raising exceptions lets the LLM see the error and decide how to proceed — perhaps by trying a different URL or rephrasing the query.
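The same convention can be applied uniformly with a decorator, sketched here with stdlib-only code, so every tool in the graph reports failures as strings instead of raising:

```python
import functools

def safe_tool(fn):
    """Wrap a tool so exceptions become ERROR strings the LLM can read."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        try:
            return fn(*args, **kwargs)
        except Exception as e:
            return f"ERROR: {type(e).__name__}: {e}"
    return wrapper

@safe_tool
def divide(a: float, b: float) -> str:
    return str(a / b)

# divide(6, 3) -> "2.0"
# divide(1, 0) -> "ERROR: ZeroDivisionError: division by zero"
```

Applied before the @tool decorator, this keeps individual tool bodies free of repetitive try/except scaffolding.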
Dead-End Detection
Sometimes the agent gets stuck in a loop without making progress. Detect this by tracking state changes:
def detect_stall(state: State) -> Literal["continue", "abort"]:
messages = state["messages"]
if len(messages) < 4:
return "continue"
# Check whether the last 3 AI messages are identical (stuck in a loop)
recent_ai = [
m.content for m in messages[-6:]
if isinstance(m, AIMessage)
][-3:]
if len(recent_ai) == 3 and len(set(recent_ai)) == 1:
return "abort"
return "continue"
def abort_node(state: State) -> dict:
return {
"messages": [AIMessage(
content="I appear to be stuck. Let me summarize what I have so far "
"and suggest a different approach."
)]
}
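Exact string equality misses loops where the model slightly rephrases the same stuck response. A looser check, sketched here with difflib from the standard library, treats messages above a similarity threshold as repeats (the 0.9 cutoff is an assumption to tune for your workload):

```python
from difflib import SequenceMatcher

def are_similar(a: str, b: str, threshold: float = 0.9) -> bool:
    """True if two strings are near-identical by character-level ratio."""
    return SequenceMatcher(None, a, b).ratio() >= threshold

def looks_stalled(recent: list[str]) -> bool:
    """True if every pair of recent messages is near-identical."""
    if len(recent) < 3:
        return False
    return all(
        are_similar(recent[i], recent[j])
        for i in range(len(recent))
        for j in range(i + 1, len(recent))
    )
```

Swapping this into detect_stall in place of the set-equality check catches "rephrased loop" stalls at the cost of a small false-positive risk.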
Combining Checkpointing with Error Recovery
Checkpointing and error handling work together for maximum resilience:
from langgraph.checkpoint.memory import MemorySaver
memory = MemorySaver()
graph = builder.compile(checkpointer=memory)
config = {"configurable": {"thread_id": "resilient-session"}}
try:
result = graph.invoke(
{"messages": [HumanMessage(content="Process this complex request")]},
config,
)
except Exception:
# Graph crashed — but state is checkpointed
# Resume from last successful node
result = graph.invoke(None, config)
Even if the entire process crashes, the checkpointed state lets you resume from the last successful node rather than restarting the entire workflow.
FAQ
Should I catch all exceptions in every node?
No. Catch exceptions that you can meaningfully handle — API errors, timeouts, validation failures. Let unexpected errors (programming bugs, out-of-memory) propagate so they surface during development rather than being silently swallowed.
How do I log errors without exposing them to the user?
Write errors to a separate state field like error_log that your response formatting node ignores. Alternatively, use Python logging within nodes to send error details to your observability stack while returning user-friendly messages to state.
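A minimal sketch of that split, assuming a logger named after your module: full error details go to the logging stack, while state only receives a generic, user-safe message (the failure here is simulated):

```python
import logging

logger = logging.getLogger("agent.nodes")

def call_service(state: dict) -> dict:
    try:
        raise ConnectionError("upstream 502 from payments API")  # Simulated failure
    except Exception as e:
        # Full detail to observability; generic text to user-facing state.
        logger.error("call_service failed: %s", e, exc_info=True)
        return {
            "error_log": f"{type(e).__name__}: {e}",
            "messages": ["Something went wrong; trying another approach."],
        }
```

The response formatting node reads messages and never touches error_log, so internal details stay out of user output.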
Can I set a global timeout for the entire graph execution?
LangGraph does not have a built-in global timeout. Implement it at the application level by running graph.ainvoke() inside an asyncio.wait_for() with your desired timeout. If the timeout triggers, the checkpointed state is still available for later resumption.
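A sketch of that application-level timeout, with a slow coroutine standing in for graph.ainvoke (the timeout value and thread_id are assumptions for illustration):

```python
import asyncio

async def slow_graph_invoke(inputs: dict, config: dict) -> dict:
    """Stand-in for graph.ainvoke -- a long-running graph execution."""
    await asyncio.sleep(5)
    return {"messages": ["done"]}

async def run_with_timeout(timeout: float = 0.1) -> dict:
    try:
        return await asyncio.wait_for(
            slow_graph_invoke(
                {"messages": []},
                {"configurable": {"thread_id": "t1"}},
            ),
            timeout=timeout,
        )
    except asyncio.TimeoutError:
        # Checkpointed state (if a checkpointer is attached) survives for resumption.
        return {"timed_out": True}

result = asyncio.run(run_with_timeout())
# result == {"timed_out": True}
```

On timeout you can later resume with graph.invoke(None, config) against the same thread_id, exactly as in the crash-recovery pattern above.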
#LangGraph #ErrorHandling #RetryLogic #FaultTolerance #Python #AgenticAI #LearnAI #AIEngineering
CallSphere Team
Expert insights on AI voice agents and customer communication automation.