
Error Handling in LangGraph: Retry Nodes, Fallback Paths, and Recovery

Build resilient LangGraph workflows with try/except patterns in nodes, fallback conditional edges, configurable retry logic, and dead-end recovery strategies for production agent systems.

Errors Are Inevitable in Agent Systems

Agent workflows interact with external systems — LLM APIs, databases, web services, file systems. Any of these can fail. API rate limits, network timeouts, malformed LLM outputs, and tool execution errors are not edge cases — they are normal operating conditions. Production LangGraph workflows must handle errors gracefully rather than crashing and losing all accumulated state.

Error Handling Inside Nodes

The first line of defense is try/except blocks within node functions:

from typing import TypedDict, Annotated
from langgraph.graph import StateGraph, START, END
from langgraph.graph.message import add_messages
from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage, AIMessage

class State(TypedDict):
    messages: Annotated[list, add_messages]
    error: str
    retry_count: int

llm = ChatOpenAI(model="gpt-4o-mini")

def call_llm(state: State) -> dict:
    try:
        response = llm.invoke(state["messages"])
        return {
            "messages": [response],
            "error": "",
            "retry_count": state.get("retry_count", 0),
        }
    except Exception as e:
        return {
            "error": str(e),
            "retry_count": state.get("retry_count", 0) + 1,
        }

By catching exceptions and writing error information to state, you keep the graph running and let downstream nodes or routing logic decide how to recover.

Fallback Edges Based on Error State

Use conditional edges to route to different nodes depending on whether an error occurred:

from typing import Literal

def check_error(state: State) -> Literal["retry", "fallback", "continue"]:
    if state.get("error"):
        if state.get("retry_count", 0) < 3:
            return "retry"
        return "fallback"
    return "continue"

import time

def retry_node(state: State) -> dict:
    """Wait briefly and clear the error before retrying."""
    time.sleep(1)  # Back off before the next attempt
    return {"error": ""}

def fallback_node(state: State) -> dict:
    """Provide a graceful degradation response."""
    return {
        "messages": [AIMessage(
            content="I encountered an issue processing your request. "
            "Here is what I can tell you based on available information."
        )],
        "error": "",
    }

builder = StateGraph(State)
builder.add_node("agent", call_llm)
builder.add_node("retry", retry_node)
builder.add_node("fallback", fallback_node)
builder.add_node("respond", lambda s: s)

builder.add_edge(START, "agent")
builder.add_conditional_edges("agent", check_error, {
    "retry": "retry",
    "fallback": "fallback",
    "continue": "respond",
})
builder.add_edge("retry", "agent")  # Loop back for retry
builder.add_edge("fallback", END)
builder.add_edge("respond", END)

graph = builder.compile()

This pattern gives the agent three attempts before falling back to a graceful degradation response.

Exponential Backoff Retry

For more sophisticated retry logic, implement exponential backoff:

import time

def smart_retry(state: State) -> dict:
    count = state.get("retry_count", 0)
    delay = min(2 ** count, 30)  # 2s, 4s, 8s... capped at 30s (count is 1+ after a failure)
    time.sleep(delay)
    return {"error": ""}

This prevents overwhelming a failing service with rapid retries while still recovering quickly from transient errors.
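One refinement worth considering: when many workflows fail at the same moment (say, a shared API outage), pure exponential backoff makes them all retry in lockstep. Adding jitter spreads the retries out. Here is a minimal sketch of the "full jitter" variant, where `backoff_delay` is a hypothetical helper name:

```python
import random

def backoff_delay(retry_count: int, base: float = 1.0, cap: float = 30.0) -> float:
    """Full-jitter backoff: sample uniformly up to the exponential ceiling."""
    ceiling = min(base * (2 ** retry_count), cap)
    return random.uniform(0, ceiling)
```

You would call `time.sleep(backoff_delay(count))` inside the retry node instead of sleeping for the fixed exponential value.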


Tool Error Recovery

Tools fail frequently — APIs return errors, queries time out, external services go down. Build error handling directly into your tools:

from langchain_core.tools import tool
import httpx

@tool
def fetch_data(url: str) -> str:
    """Fetch data from a URL with error handling."""
    try:
        response = httpx.get(url, timeout=10)
        response.raise_for_status()
        return response.text[:2000]
    except httpx.TimeoutException:
        return "ERROR: Request timed out. The server may be slow or unreachable."
    except httpx.HTTPStatusError as e:
        return f"ERROR: HTTP {e.response.status_code}. The resource may not exist."
    except Exception as e:
        return f"ERROR: {type(e).__name__}: {e}"

Returning error strings instead of raising exceptions lets the LLM see the error and decide how to proceed — perhaps by trying a different URL or rephrasing the query.
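Because failure is signaled by an `ERROR:` string rather than an exception, transient failures can also be retried before the LLM ever sees them. One way to do this is a small wrapper applied to the raw function before the `@tool` decorator; `with_retries` is a hypothetical helper, not a LangChain API:

```python
def with_retries(tool_fn, attempts: int = 3):
    """Retry a string-returning tool while it reports an ERROR: result."""
    def wrapper(*args, **kwargs):
        result = ""
        for _ in range(attempts):
            result = tool_fn(*args, **kwargs)
            if not result.startswith("ERROR:"):
                return result
        return result  # The last error survives for the LLM to reason about
    return wrapper
```

This keeps retry mechanics out of the tool body while preserving the string-based error convention.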

Dead-End Detection

Sometimes the agent gets stuck in a loop without making progress. Detect this by tracking state changes:

def detect_stall(state: State) -> Literal["continue", "abort"]:
    messages = state["messages"]
    if len(messages) < 4:
        return "continue"

    # Check if the last 3 AI messages are identical (stuck in a loop)
    recent_ai = [
        m.content for m in messages[-6:]
        if isinstance(m, AIMessage)
    ][-3:]

    if len(recent_ai) == 3 and len(set(recent_ai)) == 1:
        return "abort"
    return "continue"

def abort_node(state: State) -> dict:
    return {
        "messages": [AIMessage(
            content="I appear to be stuck. Let me summarize what I have so far "
            "and suggest a different approach."
        )]
    }
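Exact-content comparison only catches verbatim repetition. A complementary check is to fingerprint the scalar fields of state each step and abort when the fingerprint stops changing; the sketch below assumes you track the last few fingerprints in an extra state field, and `state_fingerprint` is a hypothetical helper:

```python
import hashlib
import json

def state_fingerprint(state: dict) -> str:
    """Hash the JSON-serializable scalars in state to detect no-progress loops."""
    snapshot = {
        k: v for k, v in state.items()
        if isinstance(v, (str, int, float, bool))
    }
    payload = json.dumps(snapshot, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()
```

If the same fingerprint appears several steps in a row, route to the abort node just as in the content-based check.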

Combining Checkpointing with Error Recovery

Checkpointing and error handling work together for maximum resilience:

from langgraph.checkpoint.memory import MemorySaver

memory = MemorySaver()
graph = builder.compile(checkpointer=memory)

config = {"configurable": {"thread_id": "resilient-session"}}

try:
    result = graph.invoke(
        {"messages": [HumanMessage(content="Process this complex request")]},
        config,
    )
except Exception:
    # Graph crashed — but state is checkpointed
    # Resume from last successful node
    result = graph.invoke(None, config)

Even if the entire process crashes, the checkpointed state lets you resume from the last successful node rather than restarting the entire workflow.

FAQ

Should I catch all exceptions in every node?

No. Catch exceptions that you can meaningfully handle — API errors, timeouts, validation failures. Let unexpected errors (programming bugs, out-of-memory) propagate so they surface during development rather than being silently swallowed.
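One way to enforce this discipline without repeating try/except in every node is a decorator over an explicit allowlist of exception types. This is an illustrative sketch, not a LangGraph feature; `RECOVERABLE` and `guarded` are names chosen here:

```python
RECOVERABLE = (TimeoutError, ConnectionError, ValueError)

def guarded(node_fn):
    """Catch only recoverable errors; let genuine bugs propagate."""
    def wrapper(state: dict) -> dict:
        try:
            return node_fn(state)
        except RECOVERABLE as e:
            return {"error": f"{type(e).__name__}: {e}"}
    return wrapper
```

A `KeyError` from a typo still crashes loudly in development, while timeouts and bad inputs flow into the normal error-routing machinery.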

How do I log errors without exposing them to the user?

Write errors to a separate state field like error_log that your response formatting node ignores. Alternatively, use Python logging within nodes to send error details to your observability stack while returning user-friendly messages to state.
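A small sketch of that split, assuming you add an error_log field to the State schema; `log_and_sanitize` is a hypothetical helper:

```python
import logging

logger = logging.getLogger("agent.errors")

def log_and_sanitize(e: Exception) -> dict:
    """Send full detail to the log stream; put only safe fields into state."""
    logger.error("node failure: %s: %s", type(e).__name__, e)
    return {
        "error": "temporary_failure",              # User-safe marker
        "error_log": f"{type(e).__name__}: {e}",   # Ignored by the response node
    }
```

A node's except block returns `log_and_sanitize(e)` instead of writing the raw exception text into user-visible state.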

Can I set a global timeout for the entire graph execution?

LangGraph does not have a built-in global timeout. Implement it at the application level by running graph.ainvoke() inside an asyncio.wait_for() with your desired timeout. If the timeout triggers, the checkpointed state is still available for later resumption.
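A minimal sketch of that application-level deadline; in practice the coroutine you pass in would be `graph.ainvoke(inputs, config)`, and `run_with_deadline` is a name chosen here:

```python
import asyncio

async def run_with_deadline(coro, seconds: float):
    """Return the workflow result, or None if the deadline expires."""
    try:
        return await asyncio.wait_for(coro, timeout=seconds)
    except asyncio.TimeoutError:
        return None  # Checkpointed state can still be resumed later
```

On timeout, `asyncio.wait_for` cancels the task, but the checkpointer retains everything written by nodes that completed before cancellation.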


#LangGraph #ErrorHandling #RetryLogic #FaultTolerance #Python #AgenticAI #LearnAI #AIEngineering


CallSphere Team

Expert insights on AI voice agents and customer communication automation.
