Debugging Agent Loops: Identifying and Fixing Infinite Loops and Circular Handoffs
Learn how to detect, diagnose, and fix infinite loops and circular handoffs in AI agent systems using loop detection, max_turns limits, break conditions, and real-time monitoring.
The Agent That Would Not Stop
You deploy a multi-agent system, start a test conversation, and watch the logs. Agent A calls a tool, gets a result, decides it needs more information, calls the tool again with slightly different parameters, gets a similar result, decides it still needs more, and calls the tool again. Five minutes later, you have burned through 50,000 tokens and the user has received nothing.
Agent loops are one of the most expensive and dangerous failure modes in production. They consume tokens, block users, and can cascade into resource exhaustion. Unlike traditional infinite loops that spike CPU usage, agent loops are slow and expensive — each iteration costs money and time.
Types of Agent Loops
There are three distinct patterns you need to watch for:
Tool retry loops: The agent calls the same tool repeatedly because it is unsatisfied with the result. This happens when the tool returns valid but incomplete data, and the agent does not know when to stop.
Self-reflection loops: The agent evaluates its own output, decides it is not good enough, rewrites it, evaluates again, and never reaches a quality threshold it accepts.
Circular handoffs: In multi-agent systems, Agent A hands off to Agent B, which decides the task belongs to Agent A, which hands back to Agent B. This ping-pong can continue indefinitely.
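Self-reflection loops in particular benefit from explicit break conditions. The sketch below is one possible shape, not a prescribed implementation: `refine_with_limits`, `score`, `revise`, and the default thresholds are hypothetical names and values chosen for illustration. It stops when the quality target is reached, when a revision no longer improves the score by a minimum margin, or when the revision budget runs out.

```python
from typing import Callable


def refine_with_limits(
    draft: str,
    score: Callable[[str], float],   # hypothetical quality scorer, returns 0.0-1.0
    revise: Callable[[str], str],    # hypothetical rewrite step (e.g. a model call)
    target: float = 0.8,
    max_revisions: int = 3,
    min_gain: float = 0.02,
) -> tuple[str, float, str]:
    """Revise a draft until it is good enough, stalls, or runs out of budget."""
    best, best_score = draft, score(draft)
    for _ in range(max_revisions):
        if best_score >= target:
            return best, best_score, "target_reached"
        candidate = revise(best)
        candidate_score = score(candidate)
        # Break condition: the rewrite did not improve enough to justify another pass
        if candidate_score - best_score < min_gain:
            return best, best_score, "stalled"
        best, best_score = candidate, candidate_score
    reason = "target_reached" if best_score >= target else "budget_exhausted"
    return best, best_score, reason
```

Returning the stop reason alongside the text lets the caller log why the loop ended instead of guessing from the output.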
Implementing max_turns Protection
The simplest and most important safeguard is limiting the number of turns an agent can take:
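import asyncio

# Assumes the OpenAI Agents SDK, where exceeding max_turns raises
# MaxTurnsExceeded rather than setting a flag on the result.
from agents import Agent, MaxTurnsExceeded, Runner

agent = Agent(
    name="Research Assistant",
    instructions="Answer the user question using available tools.",
)


async def main():
    try:
        # Hard limit on agent turns
        result = await Runner.run(
            agent,
            "Find the quarterly revenue for Acme Corp",
            max_turns=10,  # Stop after 10 turns
        )
        print(result.final_output)
    except MaxTurnsExceeded:
        print("Agent hit turn limit; possible loop detected")


asyncio.run(main())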
But max_turns alone is a blunt instrument. You also need intelligent loop detection.
Building a Loop Detector
A loop detector watches the sequence of agent actions and identifies repetitive patterns:
from collections import Counter
from dataclasses import dataclass


@dataclass
class AgentAction:
    action_type: str  # "tool_call", "handoff", "response"
    target: str       # tool name or agent name
    args_hash: str    # hash of the arguments


class LoopDetector:
    def __init__(self, window_size: int = 5, threshold: int = 3):
        self.actions: list[AgentAction] = []
        self.window_size = window_size
        self.threshold = threshold

    def record(self, action: AgentAction):
        self.actions.append(action)

    def check_for_loop(self) -> dict | None:
        if len(self.actions) < self.threshold:
            return None

        # Check for exact repetition within the recent window
        recent = self.actions[-self.window_size:]
        signatures = [
            f"{a.action_type}:{a.target}:{a.args_hash}"
            for a in recent
        ]
        counts = Counter(signatures)
        for sig, count in counts.items():
            if count >= self.threshold:
                return {
                    "type": "exact_repeat",
                    "signature": sig,
                    "count": count,
                }

        # Check for ping-pong pattern (A->B->A->B) across the last four handoffs.
        # Restricted to handoffs between two distinct agents so that alternating
        # tool calls with changing arguments are not flagged.
        last_four = self.actions[-4:]
        if len(last_four) == 4 and all(
            a.action_type == "handoff" for a in last_four
        ):
            targets = [a.target for a in last_four]
            if (
                targets[0] != targets[1]
                and targets[0] == targets[2]
                and targets[1] == targets[3]
            ):
                return {
                    "type": "ping_pong",
                    "agents": [targets[0], targets[1]],
                }
        return None
Integrating Loop Detection with Agent Execution
Wire the detector into your agent runner so it can intervene before costs spiral:
import hashlib
import json


class LoopDetectedError(Exception):
    pass


class MaxTurnsExceededError(Exception):
    pass


class SafeAgentRunner:
    def __init__(self, max_turns=15, loop_window=5, loop_threshold=3):
        self.detector = LoopDetector(loop_window, loop_threshold)
        self.max_turns = max_turns
        self.turn_count = 0

    def hash_args(self, args: dict) -> str:
        # Serialize with sorted keys so nested dicts hash the same regardless
        # of key order. MD5 is used as a fingerprint here, not for security.
        canonical = json.dumps(args, sort_keys=True, default=str)
        return hashlib.md5(canonical.encode()).hexdigest()[:8]

    async def on_tool_call(self, tool_name: str, arguments: dict):
        self.turn_count += 1
        action = AgentAction(
            action_type="tool_call",
            target=tool_name,
            args_hash=self.hash_args(arguments),
        )
        self.detector.record(action)

        loop = self.detector.check_for_loop()
        if loop:
            raise LoopDetectedError(
                f"Loop detected: {loop['type']}: {loop}"
            )

        # Raise only once the limit has actually been exceeded
        if self.turn_count > self.max_turns:
            raise MaxTurnsExceededError(
                f"Agent exceeded {self.max_turns} turns"
            )
Fixing Circular Handoffs
For multi-agent systems, add handoff tracking that prevents an agent from handing back to the agent that just handed to it:
class CircularHandoffError(Exception):
    pass


class TooManyHandoffsError(Exception):
    pass


class HandoffTracker:
    def __init__(self, max_handoffs: int = 5):
        self.chain: list[str] = []
        self.max_handoffs = max_handoffs

    def record_handoff(self, from_agent: str, to_agent: str):
        self.chain.append(f"{from_agent}->{to_agent}")

        # Detect immediate bounce-back: the previous handoff was the reverse
        if len(self.chain) >= 2:
            prev = self.chain[-2]
            reverse = f"{to_agent}->{from_agent}"
            if prev == reverse:
                raise CircularHandoffError(
                    f"Circular handoff: {from_agent} <-> {to_agent}"
                )

        # Cap the total length of the handoff chain
        if len(self.chain) > self.max_handoffs:
            raise TooManyHandoffsError(
                f"Exceeded {self.max_handoffs} handoffs: {self.chain}"
            )
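The bounce-back check only sees two-agent cycles. A cycle through three or more agents (A->B->C->A) passes it until the total handoff cap is reached. One complementary guard, sketched here with hypothetical names (`AgentVisitLimiter`, `RepeatedAgentError`) and an assumed default of two visits per agent, is to limit how often control may return to the same agent within one conversation.

```python
from collections import Counter


class RepeatedAgentError(Exception):
    """Raised when control returns to the same agent too many times."""


class AgentVisitLimiter:
    def __init__(self, max_visits_per_agent: int = 2):
        self.max_visits = max_visits_per_agent
        self.visits = Counter()  # agent name -> number of times it took control

    def enter(self, agent_name: str) -> None:
        """Call whenever an agent receives control, including the first agent."""
        self.visits[agent_name] += 1
        if self.visits[agent_name] > self.max_visits:
            raise RepeatedAgentError(
                f"{agent_name} received control {self.visits[agent_name]} times "
                f"(limit {self.max_visits})"
            )
```

A limit of two allows a specialist to hand a task back once for clarification while still cutting off longer cycles early.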
FAQ
How do I distinguish between a legitimate retry and a harmful loop?
A legitimate retry changes its approach: different search terms, different parameters, or a fallback strategy. A harmful loop repeats the same action with identical or near-identical parameters. Hash the tool arguments and compare recent calls to the same tool. If three or more of them produce the same hash, treat it as a loop.
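Catching near-identical parameters requires normalizing arguments before hashing, since a plain hash treats `"Acme  Revenue"` and `"acme revenue"` as different calls. The following is a minimal sketch under assumed rules (lowercasing and whitespace collapsing for strings); `normalize` and `args_fingerprint` are hypothetical helpers, and the rules would need adjusting for tools whose arguments are case-sensitive.

```python
import hashlib
import json


def normalize(value):
    """Reduce cosmetic differences so near-identical arguments compare equal."""
    if isinstance(value, str):
        # Assumption: case and extra whitespace do not change the tool's result
        return " ".join(value.lower().split())
    if isinstance(value, dict):
        return {key: normalize(item) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        return [normalize(item) for item in value]
    return value


def args_fingerprint(arguments: dict) -> str:
    """Short, order-independent fingerprint of normalized tool arguments."""
    canonical = json.dumps(normalize(arguments), sort_keys=True, default=str)
    return hashlib.sha256(canonical.encode()).hexdigest()[:8]
```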
What should the agent do when a loop is detected instead of just stopping?
Return a graceful response to the user explaining that the task could not be completed fully, along with whatever partial results were gathered. Log the full action history for debugging. Never silently drop the conversation — the user should always know what happened.
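As an illustration of that fallback, the sketch below runs a list of step callables, keeps whatever results were gathered, and converts a detected loop into a user-facing outcome instead of an unhandled exception. `RunOutcome`, `run_with_fallback`, and the message text are hypothetical, and `LoopDetectedError` is redefined here only as a stand-in so the example is self-contained.

```python
from dataclasses import dataclass, field


class LoopDetectedError(Exception):
    """Stand-in for the error raised by a loop detector."""


@dataclass
class RunOutcome:
    completed: bool
    message: str
    partial_results: list[str] = field(default_factory=list)


def run_with_fallback(steps, logger=print) -> RunOutcome:
    """Run step callables in order; on a detected loop, return what was gathered."""
    gathered: list[str] = []
    history: list[str] = []
    try:
        for step in steps:
            history.append(step.__name__)
            gathered.append(step())
    except LoopDetectedError as exc:
        # Log the full action history for debugging, then tell the user what happened
        logger(f"loop detected: {exc}; history={history}")
        return RunOutcome(
            completed=False,
            message="I could not fully complete this task. Here is what I found so far.",
            partial_results=gathered,
        )
    return RunOutcome(completed=True, message="Task completed.", partial_results=gathered)
```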
What is a safe default for max_turns in production?
For simple single-agent tasks, 10 to 15 turns is usually sufficient. For complex multi-agent workflows, 20 to 30 turns may be needed. Start low and increase based on observed behavior. Always pair max_turns with token budget limits as a second safety net.
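A token budget can be tracked independently of the turn counter. This sketch assumes the caller can read prompt and completion token counts from each model response; `TokenBudget`, `TokenBudgetExceededError`, and the limit used in the usage lines are hypothetical.

```python
class TokenBudgetExceededError(Exception):
    """Raised when a conversation spends more tokens than its budget allows."""


class TokenBudget:
    def __init__(self, max_tokens: int):
        self.max_tokens = max_tokens
        self.used = 0

    def charge(self, prompt_tokens: int, completion_tokens: int) -> int:
        """Record the usage of one model call and return the remaining budget."""
        self.used += prompt_tokens + completion_tokens
        if self.used > self.max_tokens:
            raise TokenBudgetExceededError(
                f"Used {self.used} tokens, budget is {self.max_tokens}"
            )
        return self.max_tokens - self.used


# Usage: charge the budget after every model call in the agent loop
budget = TokenBudget(max_tokens=20_000)
remaining = budget.charge(prompt_tokens=1_200, completion_tokens=300)
```

Because long contexts make each later turn more expensive, a token budget can trip well before a turn limit does, which is why the two work as complementary limits.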
CallSphere Team