Multi-Agent Systems with Claude: Building Teams of AI Agents
Learn how to design and implement multi-agent systems using the Claude API and Agent SDK. Covers architecture patterns, inter-agent communication, task delegation, and real-world production examples.
Why Single Agents Are Not Enough
The first generation of LLM-powered applications used a single model call to handle each request. The second generation introduced tool-calling agents that could loop through multi-step tasks. But as tasks grow in complexity -- analyzing a full codebase, orchestrating a customer support pipeline, or running a due diligence review across hundreds of documents -- a single agent becomes a bottleneck.
Multi-agent systems solve this by decomposing complex work across specialized agents that communicate, delegate, and collaborate. Instead of one agent trying to be an expert at everything, you build a team where each agent has a focused role, a constrained toolset, and a clear responsibility boundary.
Anthropic's own Claude Code uses this pattern internally. When you ask Claude Code to refactor a large codebase, it spawns subagent processes for file analysis, test execution, and code generation. The orchestrator coordinates their work and synthesizes a coherent result.
Core Architecture Patterns
Pattern 1: Hub-and-Spoke (Orchestrator Model)
The most common multi-agent pattern uses a central orchestrator agent that delegates tasks to specialized subagents.
```python
import asyncio

from anthropic import Anthropic

client = Anthropic()

async def orchestrator(task: str) -> str:
    """Central agent that decomposes tasks and delegates to specialists."""
    # The SDK client here is blocking, so run its calls in worker threads;
    # this is what lets asyncio.gather below actually overlap the specialists.
    response = await asyncio.to_thread(
        client.messages.create,
        model="claude-sonnet-4-5-20250514",
        max_tokens=4096,
        system="""You are an orchestrator agent. Break down the user's request
into subtasks and specify which specialist should handle each one.
Available specialists: researcher, coder, reviewer, writer.
Output a JSON array of {specialist, task, priority} objects.""",
        messages=[{"role": "user", "content": task}],
    )
    # parse_subtasks is an app-specific helper that extracts the JSON plan.
    subtasks = parse_subtasks(response.content[0].text)

    # Independent subtasks run concurrently.
    results = await asyncio.gather(*[
        dispatch_to_specialist(st["specialist"], st["task"])
        for st in subtasks
    ])

    # Synthesize results; format_results joins the specialist outputs.
    synthesis = await asyncio.to_thread(
        client.messages.create,
        model="claude-sonnet-4-5-20250514",
        max_tokens=4096,
        system="Synthesize these specialist results into a coherent final answer.",
        messages=[{"role": "user", "content": format_results(results)}],
    )
    return synthesis.content[0].text

async def dispatch_to_specialist(specialist: str, task: str) -> str:
    """Route a subtask to the appropriate specialist agent."""
    system_prompts = {
        "researcher": "You are a research specialist. Find and cite accurate information.",
        "coder": "You are a coding specialist. Write clean, tested, production code.",
        "reviewer": "You are a code review specialist. Find bugs and suggest improvements.",
        "writer": "You are a technical writer. Produce clear, well-structured documentation.",
    }
    response = await asyncio.to_thread(
        client.messages.create,
        model="claude-sonnet-4-5-20250514",
        max_tokens=4096,
        system=system_prompts[specialist],
        messages=[{"role": "user", "content": task}],
    )
    return response.content[0].text
```
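The `parse_subtasks` and `format_results` helpers above are application code, not SDK functions. A minimal sketch, assuming the orchestrator follows its instruction to emit a JSON array (possibly wrapped in prose):

```python
import json

def parse_subtasks(text: str) -> list[dict]:
    """Pull the JSON plan out of the orchestrator's reply, tolerating
    any prose the model wraps around it."""
    start, end = text.find("["), text.rfind("]")
    if start == -1 or end == -1:
        raise ValueError("no JSON array found in orchestrator output")
    return json.loads(text[start:end + 1])

def format_results(results: list[str]) -> str:
    """Label and join specialist outputs for the synthesis call."""
    return "\n\n".join(
        f"--- Specialist result {i + 1} ---\n{r}" for i, r in enumerate(results)
    )
```

Production code would want stricter validation here (e.g. checking that each object names a known specialist) rather than trusting the model's output shape.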
Pattern 2: Pipeline (Sequential Processing)
In pipeline architectures, each agent's output becomes the next agent's input. This works well for workflows with clear stage dependencies.
```python
async def pipeline(raw_data: str) -> str:
    """Sequential multi-agent pipeline for document processing.

    run_agent wraps a single Messages API call: one system prompt,
    one user message, returning the text of the reply.
    """
    # Stage 1: Extract
    extracted = await run_agent(
        system="Extract all key facts, dates, and entities from this document.",
        user_input=raw_data,
        model="claude-haiku-4-5-20250514",  # fast, cheap for extraction
    )
    # Stage 2: Analyze
    analysis = await run_agent(
        system="Analyze these extracted facts for inconsistencies and patterns.",
        user_input=extracted,
        model="claude-sonnet-4-5-20250514",  # balanced for analysis
    )
    # Stage 3: Synthesize
    report = await run_agent(
        system="Write an executive summary based on this analysis.",
        user_input=analysis,
        model="claude-sonnet-4-5-20250514",
    )
    return report
```
Pattern 3: Debate (Adversarial Collaboration)
Two agents argue opposing positions, and a judge agent synthesizes the best answer. This improves accuracy on ambiguous or high-stakes decisions.
```python
async def debate(question: str) -> str:
    """Two agents debate, a third judges."""
    advocate = await run_agent(
        system="Argue strongly IN FAVOR of the proposition. Provide evidence.",
        user_input=question,
    )
    critic = await run_agent(
        system="Argue strongly AGAINST the proposition. Provide evidence.",
        user_input=question,
    )
    judgment = await run_agent(
        system="""You are an impartial judge. Review both arguments,
identify the strongest points from each side, and deliver
a balanced, well-reasoned verdict.""",
        user_input=f"FOR:\n{advocate}\n\nAGAINST:\n{critic}",
    )
    return judgment
```
Inter-Agent Communication
The biggest challenge in multi-agent systems is not building individual agents -- it is building reliable communication between them. There are three primary strategies.
Shared Memory (Context Store)
Agents read from and write to a shared key-value store. This works well for loosely coupled agents that need access to a growing body of knowledge.
```python
from dataclasses import dataclass, field
from typing import Any

@dataclass
class SharedMemory:
    store: dict[str, Any] = field(default_factory=dict)
    history: list[dict] = field(default_factory=list)

    def write(self, agent_id: str, key: str, value: Any) -> None:
        self.store[key] = value
        self.history.append({"agent": agent_id, "action": "write", "key": key})

    def read(self, key: str) -> Any:
        return self.store.get(key)

    def get_context_for_agent(self, agent_id: str, relevant_keys: list[str]) -> str:
        context_parts = []
        for key in relevant_keys:
            if key in self.store:
                context_parts.append(f"{key}: {self.store[key]}")
        return "\n".join(context_parts)
```
Message Passing
Agents communicate through structured messages. This provides better isolation and audit trails.
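A minimal sketch of structured message passing between two agents, using an `asyncio.Queue` as the in-process transport (the message fields and agent names are illustrative):

```python
import asyncio
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class AgentMessage:
    sender: str
    recipient: str
    payload: str
    # Timestamped at creation; useful for the audit trail.
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

async def researcher(outbox: asyncio.Queue) -> None:
    # In a real system the payload would come from a Claude API call.
    await outbox.put(AgentMessage("researcher", "writer", "3 key findings..."))

async def writer(inbox: asyncio.Queue) -> str:
    msg = await inbox.get()
    return f"Report based on {msg.sender}'s input: {msg.payload}"

async def main() -> str:
    channel: asyncio.Queue = asyncio.Queue()
    await researcher(channel)
    return await writer(channel)

report = asyncio.run(main())
```

The same message schema carries over to a real broker (Redis, SQS) when agents run in separate processes; only the transport changes.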
Tool-Mediated Handoff
One agent writes output to a file or database, and the next agent reads from it. This is the simplest approach and works well with the Claude Agent SDK's built-in file tools.
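A sketch of the handoff, assuming a JSON file in a shared workspace (the file name and payload shape are illustrative):

```python
import json
import tempfile
from pathlib import Path

# A scratch directory stands in for the shared workspace the agents use.
workspace = Path(tempfile.mkdtemp())
handoff_file = workspace / "extraction_result.json"

def write_handoff(agent_id: str, result: dict) -> None:
    """Producing agent persists its output for the next stage."""
    handoff_file.write_text(json.dumps({"agent": agent_id, "result": result}))

def read_handoff() -> dict:
    """Consuming agent picks up exactly where the producer left off."""
    return json.loads(handoff_file.read_text())

write_handoff("extractor", {"entities": ["Acme Corp"], "dates": ["2024-01-15"]})
handoff = read_handoff()
```

Because the intermediate state lives on disk, a crashed downstream agent can be restarted without re-running the upstream one.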
Cost Optimization for Multi-Agent Systems
Multi-agent systems multiply API costs because each agent makes its own calls. Here are proven strategies to keep costs manageable.
| Strategy | Cost Reduction | Implementation Complexity |
|---|---|---|
| Use Haiku for simple tasks | 60-80% | Low |
| Prompt caching on system prompts | Up to 90% on cached tokens | Low |
| Batch API for non-real-time work | 50% | Medium |
| Shared context compression | 30-50% | Medium |
| Agent result caching | Variable | Medium |
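Prompt caching is often the cheapest win, since every specialist reuses a long, fixed system prompt across many calls. A sketch of the request shape, following Anthropic's prompt-caching API (`cache_control` on a system content block); the prompt text is illustrative, and details should be verified against current docs:

```python
# Stands in for several thousand tokens of stable specialist instructions.
RESEARCH_SYSTEM_PROMPT = (
    "You are a research specialist. Find and cite accurate information. "
    "<...long, stable instructions here...>"
)

cached_request = {
    "model": "claude-sonnet-4-5-20250514",
    "max_tokens": 4096,
    "system": [
        {
            "type": "text",
            "text": RESEARCH_SYSTEM_PROMPT,
            # Later calls sending this identical block read it from cache
            # at a reduced per-token rate instead of reprocessing it.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    "messages": [{"role": "user", "content": "Summarize recent findings."}],
}

# Passed to the same call used everywhere else:
# response = client.messages.create(**cached_request)
```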
Model Tiering
Not every agent needs Opus. A practical tiering strategy:
- Orchestrator: Sonnet (needs good reasoning to decompose tasks)
- Researcher: Sonnet (needs comprehension and synthesis)
- Extractor: Haiku (structured extraction is simpler)
- Formatter: Haiku (template-based output)
- Judge/Reviewer: Sonnet or Opus (needs nuanced judgment)
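The tiering above can be encoded as a single lookup so dispatch code never hard-codes model IDs (model strings match the earlier examples; check Anthropic's current model list before deploying):

```python
MODEL_TIERS = {
    "orchestrator": "claude-sonnet-4-5-20250514",
    "researcher": "claude-sonnet-4-5-20250514",
    "extractor": "claude-haiku-4-5-20250514",
    "formatter": "claude-haiku-4-5-20250514",
    "reviewer": "claude-sonnet-4-5-20250514",
}

def model_for(role: str) -> str:
    # Unknown roles default to Sonnet: over-provisioning reasoning is safer
    # than silently handing a judgment task to Haiku.
    return MODEL_TIERS.get(role, "claude-sonnet-4-5-20250514")
```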
Production Considerations
Error Isolation
When one agent fails, the system should not cascade. Wrap each agent call in error handling that captures the failure and allows the orchestrator to retry or reassign.
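A minimal sketch of such a wrapper, returning a structured result instead of raising so the orchestrator can decide whether to retry elsewhere (the result shape is illustrative):

```python
import asyncio

async def call_with_isolation(agent_fn, task: str, max_retries: int = 2) -> dict:
    """Run one agent call; retry transient failures with exponential backoff,
    then report the failure to the orchestrator instead of raising."""
    for attempt in range(max_retries + 1):
        try:
            result = await agent_fn(task)
            return {"ok": True, "result": result}
        except Exception as exc:  # in production, catch SDK-specific errors
            if attempt == max_retries:
                return {"ok": False, "error": str(exc)}
            await asyncio.sleep(2 ** attempt)  # 1s, 2s, 4s, ...
```

The orchestrator can then inspect `ok` per subtask and reassign or degrade gracefully rather than failing the whole workflow.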
Observability
Log every inter-agent message with timestamps, token counts, and costs. Without observability, debugging a multi-agent system is nearly impossible. Use structured logging with correlation IDs that tie all agent calls in a single workflow together.
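A sketch of one such structured log record (field names are illustrative; in the Python SDK the token counts come from `response.usage`):

```python
import json
import logging
import time
import uuid

logger = logging.getLogger("agents")

def log_agent_call(correlation_id: str, agent: str, input_tokens: int,
                   output_tokens: int, cost_usd: float) -> dict:
    """Emit one structured record per agent call; the correlation_id ties
    every call in a single workflow together."""
    record = {
        "correlation_id": correlation_id,
        "agent": agent,
        "timestamp": time.time(),
        "input_tokens": input_tokens,    # from response.usage.input_tokens
        "output_tokens": output_tokens,  # from response.usage.output_tokens
        "cost_usd": round(cost_usd, 6),
    }
    logger.info(json.dumps(record))
    return record

workflow_id = str(uuid.uuid4())  # one ID per end-to-end workflow
entry = log_agent_call(workflow_id, "researcher", 1200, 450, 0.0081)
```

Filtering your log store by one `correlation_id` then reconstructs the full fan-out of a single request across every agent.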
Rate Limiting
With multiple agents making concurrent API calls, you can quickly hit Claude API rate limits. Implement a shared rate limiter:
```python
import asyncio

class RateLimiter:
    """Cap request starts per interval: each permit is returned 60 seconds
    after it was taken, approximating a per-minute request budget."""

    def __init__(self, max_requests_per_minute: int = 50):
        self.semaphore = asyncio.Semaphore(max_requests_per_minute)
        self.reset_interval = 60

    async def acquire(self) -> None:
        await self.semaphore.acquire()
        # Schedule the permit's return one interval from now.
        asyncio.get_running_loop().call_later(
            self.reset_interval, self.semaphore.release
        )

rate_limiter = RateLimiter(max_requests_per_minute=50)

async def rate_limited_api_call(**kwargs):
    await rate_limiter.acquire()
    # Run the blocking SDK call in a worker thread so other agents
    # waiting on the limiter are not blocked too.
    return await asyncio.to_thread(client.messages.create, **kwargs)
```
Testing Multi-Agent Systems
Test each agent in isolation first, then test the integration. Mock the API responses to create deterministic test scenarios. Track the full conversation flow to verify agents are communicating correctly.
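A sketch of an isolated agent test using `unittest.mock.AsyncMock` to stand in for the API client (the agent under test is a toy example):

```python
import asyncio
from types import SimpleNamespace
from unittest.mock import AsyncMock

async def summarizer_agent(client, text: str) -> str:
    """Toy agent under test; takes the client as a parameter so tests
    can inject a mock."""
    response = await client.messages.create(
        model="claude-haiku-4-5-20250514",
        max_tokens=256,
        system="Summarize in one sentence.",
        messages=[{"role": "user", "content": text}],
    )
    return response.content[0].text

def test_summarizer_agent() -> None:
    client = AsyncMock()
    client.messages.create.return_value = SimpleNamespace(
        content=[SimpleNamespace(text="A one-sentence summary.")]
    )
    result = asyncio.run(summarizer_agent(client, "long document..."))
    assert result == "A one-sentence summary."
    # Verify the agent forwarded the document, not something else.
    sent = client.messages.create.call_args.kwargs
    assert sent["messages"][0]["content"] == "long document..."

test_summarizer_agent()
```

Injecting the client rather than using a module-level global is what makes this kind of deterministic test possible.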
When to Use Multi-Agent Systems
Multi-agent systems add complexity. Use them when:
- A single agent's context window is insufficient for the full task
- The task naturally decomposes into specialized subtasks
- You need different model tiers for different parts of the work
- Parallel processing would significantly reduce latency
- You need adversarial verification for high-stakes decisions
Avoid them when a single well-prompted agent with tools can handle the task in under 10 turns. The overhead of orchestration is not free, and simpler architectures are easier to debug and maintain.