Multi-Model Agent Architectures: Using Different LLMs for Different Reasoning Steps
Learn how to build agent systems that route different reasoning tasks to different language models — using fast, cheap models for classification and routing, and powerful models for generation and complex reasoning.
Why One Model Does Not Fit All Tasks
Running GPT-4o or Claude Opus for every agent step is like using a sports car to deliver groceries. Classification tasks (is this a billing question or a technical question?) need millisecond responses and cost fractions of a cent. Complex reasoning (analyze this contract and identify risky clauses) needs the most capable model available. Multi-model architectures match model capability to task complexity, cutting costs by 60-80% while maintaining output quality where it matters.
The Model Routing Pattern
The core idea is a router that examines each task and dispatches it to the appropriate model. The router itself should be fast and cheap — it is the one component that runs on every request.
```python
from enum import Enum
from dataclasses import dataclass
from typing import Any

import litellm


class ModelTier(Enum):
    FAST = "fast"          # Classification, extraction, routing
    BALANCED = "balanced"  # Summarization, simple generation
    POWERFUL = "powerful"  # Complex reasoning, creative writing


@dataclass
class ModelConfig:
    tier: ModelTier
    model_id: str
    max_tokens: int
    cost_per_1k_input: float
    cost_per_1k_output: float


MODEL_REGISTRY = {
    ModelTier.FAST: ModelConfig(
        tier=ModelTier.FAST,
        model_id="gpt-4o-mini",
        max_tokens=1024,
        cost_per_1k_input=0.00015,
        cost_per_1k_output=0.0006,
    ),
    ModelTier.BALANCED: ModelConfig(
        tier=ModelTier.BALANCED,
        model_id="gpt-4o",
        max_tokens=4096,
        cost_per_1k_input=0.0025,
        cost_per_1k_output=0.01,
    ),
    ModelTier.POWERFUL: ModelConfig(
        tier=ModelTier.POWERFUL,
        model_id="claude-opus-4-20250514",
        max_tokens=8192,
        cost_per_1k_input=0.015,
        cost_per_1k_output=0.075,
    ),
}
```
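To make the price gap concrete, here is a small self-contained sketch that computes per-call cost at each tier. The per-1k rates mirror the registry above; the `estimate_cost` helper name is illustrative, not part of any library:

```python
# Per-1k-token rates copied from MODEL_REGISTRY above: (input, output).
RATES = {
    "fast": (0.00015, 0.0006),
    "balanced": (0.0025, 0.01),
    "powerful": (0.015, 0.075),
}

def estimate_cost(tier: str, input_tokens: int, output_tokens: int) -> float:
    """Estimated dollar cost for one call at the given tier."""
    rate_in, rate_out = RATES[tier]
    return (input_tokens / 1000) * rate_in + (output_tokens / 1000) * rate_out

# A 1,000-token-in / 1,000-token-out call at each tier:
for tier in RATES:
    print(f"{tier}: ${estimate_cost(tier, 1000, 1000):.5f}")
```

At these rates the powerful tier costs roughly 120x the fast tier for the same call ($0.09 vs $0.00075), which is why routing even a modest share of traffic to the fast tier dominates the bill.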
Building the Task Router
The router classifies incoming tasks and assigns them a model tier. This classification itself uses the fast model.
```python
class TaskRouter:
    def __init__(self):
        self.fast_model = MODEL_REGISTRY[ModelTier.FAST].model_id

    async def classify_task(self, task_description: str) -> ModelTier:
        response = await litellm.acompletion(
            model=self.fast_model,
            messages=[
                {"role": "system", "content": """Classify this task into one tier:
- FAST: simple classification, yes/no questions, entity extraction, formatting
- BALANCED: summarization, translation, simple Q&A, data transformation
- POWERFUL: complex reasoning, multi-step analysis, creative writing, code generation
Respond with ONLY the tier name."""},
                {"role": "user", "content": task_description},
            ],
            max_tokens=10,
            temperature=0,
        )
        tier_name = response.choices[0].message.content.strip().upper()
        try:
            return ModelTier[tier_name]
        except KeyError:
            # The classifier returned something unexpected; fall back to the
            # middle tier rather than failing the whole request.
            return ModelTier.BALANCED

    async def route_and_execute(
        self, task: str, system_prompt: str
    ) -> dict[str, Any]:
        tier = await self.classify_task(task)
        config = MODEL_REGISTRY[tier]
        response = await litellm.acompletion(
            model=config.model_id,
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": task},
            ],
            max_tokens=config.max_tokens,
        )
        return {
            "result": response.choices[0].message.content,
            "model_used": config.model_id,
            "tier": tier.value,
            "estimated_cost": self._estimate_cost(response, config),
        }

    def _estimate_cost(self, response, config: ModelConfig) -> float:
        input_tokens = response.usage.prompt_tokens
        output_tokens = response.usage.completion_tokens
        return (
            (input_tokens / 1000) * config.cost_per_1k_input
            + (output_tokens / 1000) * config.cost_per_1k_output
        )
```
Multi-Model Agent Pipeline
In a multi-step agent pipeline, each step can use a different model. Here is a document analysis pipeline where steps are assigned different tiers.
```python
from agents import Agent

# Step 1: Extract key entities (fast model)
extractor = Agent(
    name="Entity Extractor",
    model="gpt-4o-mini",
    instructions="Extract all named entities (people, companies, dates, amounts) from the text. Return as JSON.",
)

# Step 2: Classify document type (fast model)
classifier = Agent(
    name="Document Classifier",
    model="gpt-4o-mini",
    instructions="Classify this document as: contract, invoice, letter, report, or memo. Return only the type.",
)

# Step 3: Deep analysis (powerful model)
analyzer = Agent(
    name="Document Analyzer",
    model="claude-opus-4-20250514",
    instructions="""Perform deep analysis of this document:
- Identify key obligations and deadlines
- Flag potential risks or ambiguities
- Summarize the document's purpose and implications
Use the entity data and document type provided for context.""",
)
```
Orchestrating the Pipeline
```python
from agents import Runner

async def analyze_document(document_text: str) -> dict:
    # Fast: extract entities (~$0.001 per document)
    entities_result = await Runner.run(
        extractor, f"Extract entities from: {document_text}"
    )

    # Fast: classify document (~$0.0005)
    class_result = await Runner.run(
        classifier, f"Classify: {document_text[:500]}"
    )

    # Powerful: deep analysis (~$0.05)
    analysis_prompt = f"""Document type: {class_result.final_output}
Entities found: {entities_result.final_output}
Full document: {document_text}"""
    analysis_result = await Runner.run(analyzer, analysis_prompt)

    return {
        "entities": entities_result.final_output,
        "document_type": class_result.final_output,
        "analysis": analysis_result.final_output,
        # Rough illustrative figure -- vs ~$0.15 if all steps used the powerful model
        "total_estimated_cost": 0.05,
    }
```
The fast steps cost almost nothing. The expensive model only runs for the one step that genuinely needs deep reasoning. Over thousands of documents, this architecture saves significant cost.
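The two fast steps are also independent of each other, so they can run concurrently to cut latency. A sketch of the pattern with `asyncio.gather`, where stub coroutines stand in for the `Runner.run` calls above:

```python
import asyncio

# Stubs standing in for Runner.run(extractor, ...) and Runner.run(classifier, ...).
async def extract_entities(document_text: str) -> str:
    await asyncio.sleep(0.01)  # simulated model latency
    return '{"companies": ["Acme Corp"]}'

async def classify_document(document_text: str) -> str:
    await asyncio.sleep(0.01)
    return "contract"

async def run_fast_steps(document_text: str) -> tuple[str, str]:
    # Both fast calls in flight at once; total latency ~= max, not sum.
    entities, doc_type = await asyncio.gather(
        extract_entities(document_text),
        classify_document(document_text),
    )
    return entities, doc_type

entities, doc_type = asyncio.run(run_fast_steps("Sample agreement text"))
```

The powerful analysis step still runs afterward, since it consumes both fast outputs as context.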
Cost Tracking and Model Selection Feedback
Track actual costs and quality per tier to refine routing decisions over time.
```python
import sqlite3
from datetime import datetime, timezone

class CostTracker:
    def __init__(self, db_path: str = "model_costs.db"):
        self.db = sqlite3.connect(db_path)
        self.db.execute("""
            CREATE TABLE IF NOT EXISTS model_usage (
                id INTEGER PRIMARY KEY,
                timestamp TEXT,
                task_type TEXT,
                tier TEXT,
                model_id TEXT,
                input_tokens INTEGER,
                output_tokens INTEGER,
                cost REAL,
                quality_score REAL
            )
        """)

    def log_usage(self, task_type: str, tier: str, model_id: str,
                  input_tokens: int, output_tokens: int, cost: float):
        # quality_score stays NULL at log time; the feedback loop fills
        # it in once an output has been reviewed.
        self.db.execute(
            "INSERT INTO model_usage (timestamp, task_type, tier, model_id, "
            "input_tokens, output_tokens, cost) VALUES (?, ?, ?, ?, ?, ?, ?)",
            (datetime.now(timezone.utc).isoformat(), task_type, tier, model_id,
             input_tokens, output_tokens, cost),
        )
        self.db.commit()

    def get_cost_summary(self) -> dict:
        rows = self.db.execute(
            "SELECT tier, SUM(cost), COUNT(*) FROM model_usage GROUP BY tier"
        ).fetchall()
        return {row[0]: {"total_cost": row[1], "requests": row[2]} for row in rows}
```
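A quick smoke test of that schema and summary query, using an in-memory database so nothing touches disk (the rows are made-up figures):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE model_usage (
        id INTEGER PRIMARY KEY, timestamp TEXT, task_type TEXT, tier TEXT,
        model_id TEXT, input_tokens INTEGER, output_tokens INTEGER,
        cost REAL, quality_score REAL
    )
""")
rows = [
    ("2025-01-01T00:00:00", "classify", "fast", "gpt-4o-mini", 200, 10, 0.00004, None),
    ("2025-01-01T00:01:00", "classify", "fast", "gpt-4o-mini", 180, 8, 0.00003, None),
    ("2025-01-01T00:02:00", "analyze", "powerful", "claude-opus-4-20250514", 3000, 900, 0.1125, None),
]
db.executemany(
    "INSERT INTO model_usage (timestamp, task_type, tier, model_id, "
    "input_tokens, output_tokens, cost, quality_score) "
    "VALUES (?, ?, ?, ?, ?, ?, ?, ?)", rows,
)
summary = {
    tier: {"total_cost": total, "requests": count}
    for tier, total, count in db.execute(
        "SELECT tier, SUM(cost), COUNT(*) FROM model_usage GROUP BY tier"
    )
}
print(summary)
```

Even with only three rows, the summary makes the pattern visible: the powerful tier accounts for nearly all of the spend despite handling a third of the requests.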
FAQ
How do you handle cases where the router misclassifies a task?
Add a quality feedback loop. If the output from a FAST-tier model is flagged as low quality (by a user or automated check), automatically retry with a higher tier and log the misclassification. Over time, use these logs to fine-tune the router's classification prompt or train a small classifier model specifically for routing.
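A sketch of that escalation loop, with the model call and the quality check injected as callables so it stays model-agnostic (both stubs here; `escalate` and `run_with_escalation` are illustrative names):

```python
import asyncio

TIER_ORDER = ["fast", "balanced", "powerful"]

def escalate(tier: str) -> str:
    """Next tier up, capped at the top."""
    return TIER_ORDER[min(TIER_ORDER.index(tier) + 1, len(TIER_ORDER) - 1)]

async def run_with_escalation(task, call_model, is_acceptable, start_tier="fast"):
    """Retry the task at successively higher tiers until the output
    passes the quality check (or we run out of tiers)."""
    tier = start_tier
    while True:
        output = await call_model(tier, task)
        if is_acceptable(output) or tier == TIER_ORDER[-1]:
            return {"tier": tier, "output": output}
        tier = escalate(tier)  # in practice, also log the misclassification here

# Stubs: the fast tier "fails" the quality check, the balanced tier passes.
async def fake_call(tier, task):
    return "???" if tier == "fast" else f"[{tier}] answer"

result = asyncio.run(run_with_escalation("Summarize this", fake_call, lambda o: o != "???"))
```

The misclassifications logged at the `escalate` step are exactly the training data for refining the router's prompt later.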
Should the router model itself be swappable?
Yes. The router should be the fastest and cheapest model available. As new small models are released (like GPT-4o-mini successors), swap the router model without changing the rest of the architecture. The router's accuracy requirements are modest — it just needs to distinguish simple from complex tasks.
How do you handle cross-model context passing?
Each model in the pipeline receives only the information it needs, not the full conversation history. The orchestrator extracts relevant outputs from upstream steps and formats them as context for downstream steps. This reduces token usage and prevents context window overflow when using models with smaller limits.
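A minimal sketch of that orchestrator-side context builder, under the assumption that upstream outputs are kept in a plain dict (the `build_context` name and the truncation limit are illustrative):

```python
def build_context(upstream: dict[str, str], keys: list[str], max_chars: int = 2000) -> str:
    """Format only the selected upstream outputs as downstream context,
    truncating each value so the combined prompt stays small for models
    with tighter context windows."""
    parts = []
    for key in keys:
        value = upstream.get(key, "")
        if len(value) > max_chars:
            value = value[:max_chars] + " ...[truncated]"
        parts.append(f"{key}: {value}")
    return "\n".join(parts)

upstream = {
    "document_type": "contract",
    "entities": '{"people": ["A. Smith"]}',
    "raw_html": "<div>" * 5000,  # large intermediate artifact downstream steps never need
}
context = build_context(upstream, ["document_type", "entities"])
```

The `raw_html` entry never reaches the downstream model; only the two selected keys do.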
CallSphere Team
Expert insights on AI voice agents and customer communication automation.