Multi-Model Agent Architectures: Using Different LLMs for Different Reasoning Steps
Learn how to build agent systems that route different reasoning tasks to different language models — using fast, cheap models for classification and routing, and powerful models for generation and complex reasoning.
Why One Model Does Not Fit All Tasks
Running GPT-4o or Claude Opus for every agent step is like using a sports car to deliver groceries. Classification tasks (is this a billing question or a technical question?) need millisecond responses and cost fractions of a cent. Complex reasoning (analyze this contract and identify risky clauses) needs the most capable model available. Multi-model architectures match model capability to task complexity, cutting costs by 60-80% while maintaining output quality where it matters.
The Model Routing Pattern
The core idea is a router that examines each task and dispatches it to the appropriate model. The router itself should be fast and cheap — it is the one component that runs on every request.
```python
from enum import Enum
from dataclasses import dataclass
from typing import Any

import litellm


class ModelTier(Enum):
    FAST = "fast"          # Classification, extraction, routing
    BALANCED = "balanced"  # Summarization, simple generation
    POWERFUL = "powerful"  # Complex reasoning, creative writing


@dataclass
class ModelConfig:
    tier: ModelTier
    model_id: str
    max_tokens: int
    cost_per_1k_input: float
    cost_per_1k_output: float


MODEL_REGISTRY = {
    ModelTier.FAST: ModelConfig(
        tier=ModelTier.FAST,
        model_id="gpt-4o-mini",
        max_tokens=1024,
        cost_per_1k_input=0.00015,
        cost_per_1k_output=0.0006,
    ),
    ModelTier.BALANCED: ModelConfig(
        tier=ModelTier.BALANCED,
        model_id="gpt-4o",
        max_tokens=4096,
        cost_per_1k_input=0.0025,
        cost_per_1k_output=0.01,
    ),
    ModelTier.POWERFUL: ModelConfig(
        tier=ModelTier.POWERFUL,
        model_id="claude-opus-4-20250514",
        max_tokens=8192,
        cost_per_1k_input=0.015,
        cost_per_1k_output=0.075,
    ),
}
```
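To make the price gap concrete, here is a small self-contained sketch that computes per-call cost at each tier. The per-1k rates mirror the registry above; the `estimate_cost` helper name is illustrative, not part of any library:

```python
# Per-1k-token rates copied from MODEL_REGISTRY above: (input, output).
RATES = {
    "fast": (0.00015, 0.0006),
    "balanced": (0.0025, 0.01),
    "powerful": (0.015, 0.075),
}

def estimate_cost(tier: str, input_tokens: int, output_tokens: int) -> float:
    """Estimated dollar cost for one call at the given tier."""
    rate_in, rate_out = RATES[tier]
    return (input_tokens / 1000) * rate_in + (output_tokens / 1000) * rate_out

# A 1,000-token-in / 1,000-token-out call at each tier:
for tier in RATES:
    print(f"{tier}: ${estimate_cost(tier, 1000, 1000):.5f}")
```

At these rates the powerful tier costs roughly 120x the fast tier for the same call ($0.09 vs $0.00075), which is why routing even a modest share of traffic to the fast tier dominates the bill.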
Building the Task Router
The router classifies incoming tasks and assigns them a model tier. This classification itself uses the fast model.
```python
class TaskRouter:
    def __init__(self):
        self.fast_model = MODEL_REGISTRY[ModelTier.FAST].model_id

    async def classify_task(self, task_description: str) -> ModelTier:
        response = await litellm.acompletion(
            model=self.fast_model,
            messages=[
                {"role": "system", "content": """Classify this task into one tier:
- FAST: simple classification, yes/no questions, entity extraction, formatting
- BALANCED: summarization, translation, simple Q&A, data transformation
- POWERFUL: complex reasoning, multi-step analysis, creative writing, code generation
Respond with ONLY the tier name."""},
                {"role": "user", "content": task_description},
            ],
            max_tokens=10,
            temperature=0,
        )
        tier_name = response.choices[0].message.content.strip().upper()
        try:
            return ModelTier[tier_name]
        except KeyError:
            # The classifier returned something unexpected; fall back to the
            # middle tier rather than failing the whole request.
            return ModelTier.BALANCED

    async def route_and_execute(
        self, task: str, system_prompt: str
    ) -> dict[str, Any]:
        tier = await self.classify_task(task)
        config = MODEL_REGISTRY[tier]
        response = await litellm.acompletion(
            model=config.model_id,
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": task},
            ],
            max_tokens=config.max_tokens,
        )
        return {
            "result": response.choices[0].message.content,
            "model_used": config.model_id,
            "tier": tier.value,
            "estimated_cost": self._estimate_cost(response, config),
        }

    def _estimate_cost(self, response, config: ModelConfig) -> float:
        input_tokens = response.usage.prompt_tokens
        output_tokens = response.usage.completion_tokens
        return (
            (input_tokens / 1000) * config.cost_per_1k_input
            + (output_tokens / 1000) * config.cost_per_1k_output
        )
```
Multi-Model Agent Pipeline
In a multi-step agent pipeline, each step can use a different model. Here is a document analysis pipeline where steps are assigned different tiers.
```python
from agents import Agent

# Step 1: Extract key entities (fast model)
extractor = Agent(
    name="Entity Extractor",
    model="gpt-4o-mini",
    instructions="Extract all named entities (people, companies, dates, amounts) from the text. Return as JSON.",
)

# Step 2: Classify document type (fast model)
classifier = Agent(
    name="Document Classifier",
    model="gpt-4o-mini",
    instructions="Classify this document as: contract, invoice, letter, report, or memo. Return only the type.",
)

# Step 3: Deep analysis (powerful model)
analyzer = Agent(
    name="Document Analyzer",
    model="claude-opus-4-20250514",
    instructions="""Perform deep analysis of this document:
- Identify key obligations and deadlines
- Flag potential risks or ambiguities
- Summarize the document's purpose and implications
Use the entity data and document type provided for context.""",
)
```
Orchestrating the Pipeline
```python
from agents import Runner

async def analyze_document(document_text: str) -> dict:
    # Fast: extract entities (~$0.001 per document)
    entities_result = await Runner.run(
        extractor, f"Extract entities from: {document_text}"
    )

    # Fast: classify document (~$0.0005)
    class_result = await Runner.run(
        classifier, f"Classify: {document_text[:500]}"
    )

    # Powerful: deep analysis (~$0.05)
    analysis_prompt = f"""Document type: {class_result.final_output}
Entities found: {entities_result.final_output}
Full document: {document_text}"""
    analysis_result = await Runner.run(analyzer, analysis_prompt)

    return {
        "entities": entities_result.final_output,
        "document_type": class_result.final_output,
        "analysis": analysis_result.final_output,
        # Rough illustrative figure -- vs ~$0.15 if all steps used the powerful model
        "total_estimated_cost": 0.05,
    }
```
The fast steps cost almost nothing. The expensive model only runs for the one step that genuinely needs deep reasoning. Over thousands of documents, this architecture saves significant cost.
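The two fast steps are also independent of each other, so they can run concurrently to cut latency. A sketch of the pattern with `asyncio.gather`, where stub coroutines stand in for the `Runner.run` calls above:

```python
import asyncio

# Stubs standing in for Runner.run(extractor, ...) and Runner.run(classifier, ...).
async def extract_entities(document_text: str) -> str:
    await asyncio.sleep(0.01)  # simulated model latency
    return '{"companies": ["Acme Corp"]}'

async def classify_document(document_text: str) -> str:
    await asyncio.sleep(0.01)
    return "contract"

async def run_fast_steps(document_text: str) -> tuple[str, str]:
    # Both fast calls in flight at once; total latency ~= max, not sum.
    entities, doc_type = await asyncio.gather(
        extract_entities(document_text),
        classify_document(document_text),
    )
    return entities, doc_type

entities, doc_type = asyncio.run(run_fast_steps("Sample agreement text"))
```

The powerful analysis step still runs afterward, since it consumes both fast outputs as context.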
Cost Tracking and Model Selection Feedback
Track actual costs and quality per tier to refine routing decisions over time.
```python
import sqlite3
from datetime import datetime, timezone

class CostTracker:
    def __init__(self, db_path: str = "model_costs.db"):
        self.db = sqlite3.connect(db_path)
        self.db.execute("""
            CREATE TABLE IF NOT EXISTS model_usage (
                id INTEGER PRIMARY KEY,
                timestamp TEXT,
                task_type TEXT,
                tier TEXT,
                model_id TEXT,
                input_tokens INTEGER,
                output_tokens INTEGER,
                cost REAL,
                quality_score REAL
            )
        """)

    def log_usage(self, task_type: str, tier: str, model_id: str,
                  input_tokens: int, output_tokens: int, cost: float):
        # quality_score stays NULL at log time; the feedback loop fills
        # it in once an output has been reviewed.
        self.db.execute(
            "INSERT INTO model_usage (timestamp, task_type, tier, model_id, "
            "input_tokens, output_tokens, cost) VALUES (?, ?, ?, ?, ?, ?, ?)",
            (datetime.now(timezone.utc).isoformat(), task_type, tier, model_id,
             input_tokens, output_tokens, cost),
        )
        self.db.commit()

    def get_cost_summary(self) -> dict:
        rows = self.db.execute(
            "SELECT tier, SUM(cost), COUNT(*) FROM model_usage GROUP BY tier"
        ).fetchall()
        return {row[0]: {"total_cost": row[1], "requests": row[2]} for row in rows}
```
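A quick smoke test of that schema and summary query, using an in-memory database so nothing touches disk (the rows are made-up figures):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE model_usage (
        id INTEGER PRIMARY KEY, timestamp TEXT, task_type TEXT, tier TEXT,
        model_id TEXT, input_tokens INTEGER, output_tokens INTEGER,
        cost REAL, quality_score REAL
    )
""")
rows = [
    ("2025-01-01T00:00:00", "classify", "fast", "gpt-4o-mini", 200, 10, 0.00004, None),
    ("2025-01-01T00:01:00", "classify", "fast", "gpt-4o-mini", 180, 8, 0.00003, None),
    ("2025-01-01T00:02:00", "analyze", "powerful", "claude-opus-4-20250514", 3000, 900, 0.1125, None),
]
db.executemany(
    "INSERT INTO model_usage (timestamp, task_type, tier, model_id, "
    "input_tokens, output_tokens, cost, quality_score) "
    "VALUES (?, ?, ?, ?, ?, ?, ?, ?)", rows,
)
summary = {
    tier: {"total_cost": total, "requests": count}
    for tier, total, count in db.execute(
        "SELECT tier, SUM(cost), COUNT(*) FROM model_usage GROUP BY tier"
    )
}
print(summary)
```

Even with only three rows, the summary makes the pattern visible: the powerful tier accounts for nearly all of the spend despite handling a third of the requests.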
FAQ
How do you handle cases where the router misclassifies a task?
Add a quality feedback loop. If the output from a FAST-tier model is flagged as low quality (by a user or automated check), automatically retry with a higher tier and log the misclassification. Over time, use these logs to fine-tune the router's classification prompt or train a small classifier model specifically for routing.
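A sketch of that escalation loop, with the model call and the quality check injected as callables so it stays model-agnostic (both stubs here; `escalate` and `run_with_escalation` are illustrative names):

```python
import asyncio

TIER_ORDER = ["fast", "balanced", "powerful"]

def escalate(tier: str) -> str:
    """Next tier up, capped at the top."""
    return TIER_ORDER[min(TIER_ORDER.index(tier) + 1, len(TIER_ORDER) - 1)]

async def run_with_escalation(task, call_model, is_acceptable, start_tier="fast"):
    """Retry the task at successively higher tiers until the output
    passes the quality check (or we run out of tiers)."""
    tier = start_tier
    while True:
        output = await call_model(tier, task)
        if is_acceptable(output) or tier == TIER_ORDER[-1]:
            return {"tier": tier, "output": output}
        tier = escalate(tier)  # in practice, also log the misclassification here

# Stubs: the fast tier "fails" the quality check, the balanced tier passes.
async def fake_call(tier, task):
    return "???" if tier == "fast" else f"[{tier}] answer"

result = asyncio.run(run_with_escalation("Summarize this", fake_call, lambda o: o != "???"))
```

The misclassifications logged at the `escalate` step are exactly the training data for refining the router's prompt later.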
Should the router model itself be swappable?
Yes. The router should be the fastest and cheapest model available. As new small models are released (like GPT-4o-mini successors), swap the router model without changing the rest of the architecture. The router's accuracy requirements are modest — it just needs to distinguish simple from complex tasks.
How do you handle cross-model context passing?
Each model in the pipeline receives only the information it needs, not the full conversation history. The orchestrator extracts relevant outputs from upstream steps and formats them as context for downstream steps. This reduces token usage and prevents context window overflow when using models with smaller limits.
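A minimal sketch of that orchestrator-side context builder, under the assumption that upstream outputs are kept in a plain dict (the `build_context` name and the truncation limit are illustrative):

```python
def build_context(upstream: dict[str, str], keys: list[str], max_chars: int = 2000) -> str:
    """Format only the selected upstream outputs as downstream context,
    truncating each value so the combined prompt stays small for models
    with tighter context windows."""
    parts = []
    for key in keys:
        value = upstream.get(key, "")
        if len(value) > max_chars:
            value = value[:max_chars] + " ...[truncated]"
        parts.append(f"{key}: {value}")
    return "\n".join(parts)

upstream = {
    "document_type": "contract",
    "entities": '{"people": ["A. Smith"]}',
    "raw_html": "<div>" * 5000,  # large intermediate artifact downstream steps never need
}
context = build_context(upstream, ["document_type", "entities"])
```

The `raw_html` entry never reaches the downstream model; only the two selected keys do.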
CallSphere Team
Expert insights on AI voice agents and customer communication automation.