Model Routing: Directing Agent Queries to the Optimal Model Based on Complexity
Design and implement a model router that classifies query complexity and directs agent requests to the most cost-effective model. Build fallback chains, measure routing accuracy, and optimize per-query costs.
The Cost of Using One Model for Everything
Most agent systems use a single model for all requests. If you choose a powerful model like GPT-4o, you get reliable results but pay premium prices for simple tasks that a smaller model could handle. If you choose a cheap model, complex queries fail. This is a false trade-off.
In practice, 60-80% of agent queries are straightforward — simple lookups, classification, template-based responses, or short factual answers. Only 20-40% require deep reasoning, long context processing, or complex multi-step chains. Model routing exploits this distribution by sending easy queries to small, fast, cheap models and reserving expensive models for hard queries.
A well-designed router can reduce LLM costs by 40-70% while maintaining quality on the queries that matter.
Router Architecture
A model router sits between the agent and the LLM providers. It inspects each query, classifies its complexity, and forwards it to the appropriate model:
from dataclasses import dataclass
from enum import Enum
from openai import OpenAI
class Complexity(Enum):
SIMPLE = "simple"
MODERATE = "moderate"
COMPLEX = "complex"
@dataclass
class ModelTier:
model: str
base_url: str
api_key: str
max_tokens: int
    cost_per_1m_tokens: float  # USD per 1M input tokens

class ModelRouter:
    def __init__(self):
        self.tiers = {
            Complexity.SIMPLE: ModelTier(
                model="llama3.1:8b",
                base_url="http://localhost:11434/v1",
                api_key="ollama",
                max_tokens=512,
                cost_per_1m_tokens=0.0,  # Free, local
            ),
            Complexity.MODERATE: ModelTier(
                model="gpt-4o-mini",
                base_url="https://api.openai.com/v1",
                api_key="sk-...",
                max_tokens=2048,
                cost_per_1m_tokens=0.15,
            ),
            Complexity.COMPLEX: ModelTier(
                model="gpt-4o",
                base_url="https://api.openai.com/v1",
                api_key="sk-...",
                max_tokens=4096,
                cost_per_1m_tokens=2.50,
            ),
        }
def route(self, messages: list) -> tuple[str, OpenAI]:
complexity = self.classify_complexity(messages)
tier = self.tiers[complexity]
client = OpenAI(base_url=tier.base_url, api_key=tier.api_key)
return tier.model, client
def classify_complexity(self, messages: list) -> Complexity:
user_msg = messages[-1]["content"] if messages else ""
# Rule-based classification (fast, free)
return self._rule_based_classify(user_msg)
def _rule_based_classify(self, text: str) -> Complexity:
text_lower = text.lower()
word_count = len(text.split())
# Simple: short queries, greetings, yes/no questions
simple_indicators = [
word_count < 15,
text_lower.startswith(("what is", "who is", "define", "list")),
text_lower in ("hello", "hi", "thanks", "bye"),
]
# Complex: long context, multi-step reasoning, analysis
complex_indicators = [
word_count > 200,
any(kw in text_lower for kw in [
"analyze", "compare", "explain why", "step by step",
"write a", "debug", "refactor", "design",
]),
text.count("\n") > 5, # Multi-line input
]
if sum(complex_indicators) >= 2:
return Complexity.COMPLEX
if sum(simple_indicators) >= 2:
return Complexity.SIMPLE
return Complexity.MODERATE
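To sanity-check the heuristics before wiring them into the router, the same logic can be run standalone. This is a condensed, self-contained re-implementation of `_rule_based_classify` that returns tier names as plain strings:

```python
def rule_based_classify(text: str) -> str:
    """Condensed version of ModelRouter._rule_based_classify above."""
    t = text.lower()
    words = len(text.split())
    simple = [
        words < 15,
        t.startswith(("what is", "who is", "define", "list")),
        t in ("hello", "hi", "thanks", "bye"),
    ]
    complex_ = [
        words > 200,
        any(kw in t for kw in (
            "analyze", "compare", "explain why", "step by step",
            "write a", "debug", "refactor", "design",
        )),
        text.count("\n") > 5,  # multi-line input
    ]
    if sum(complex_) >= 2:
        return "complex"
    if sum(simple) >= 2:
        return "simple"
    return "moderate"

print(rule_based_classify("What is Python?"))                       # simple
print(rule_based_classify("Summarize this article for beginners"))  # moderate
print(rule_based_classify("Analyze this trace:\n" + "at frame\n" * 6))  # complex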
LLM-Based Classification
Rule-based routing is fast but brittle. For higher accuracy, use a small, fast model to classify query complexity:
class LLMClassifier:
def __init__(self):
# Use a small local model for classification
self.client = OpenAI(
base_url="http://localhost:11434/v1",
api_key="ollama",
)
def classify(self, query: str) -> Complexity:
response = self.client.chat.completions.create(
model="gemma2:2b",
messages=[{
"role": "user",
"content": (
"Classify this query complexity as SIMPLE, MODERATE, or COMPLEX.\n"
"SIMPLE: factual lookups, definitions, yes/no, greetings\n"
"MODERATE: explanations, summaries, single-step tasks\n"
"COMPLEX: analysis, multi-step reasoning, code generation, comparisons\n"
"Respond with one word only.\n\n"
f"Query: {query}"
),
}],
temperature=0.0,
max_tokens=5,
)
label = response.choices[0].message.content.strip().upper()
return Complexity[label] if label in Complexity.__members__ else Complexity.MODERATE
The classification call adds 50-200ms of latency but costs effectively nothing when the classifier runs locally (or a fraction of a cent via a hosted small model). The savings from routing simple queries to cheap models far outweigh this overhead.
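If traffic contains repeated or near-duplicate queries, the classification call can also be memoized so each distinct query is only classified once. A minimal sketch, where `make_cached_classifier` and the stub are illustrative names rather than part of the router above:

```python
from functools import lru_cache

def make_cached_classifier(classify_fn, maxsize: int = 4096):
    """Wrap any classify(query) -> label function with an LRU cache.

    Queries are normalized (lowercased, whitespace-collapsed) so trivial
    variations map to the same cache entry.
    """
    @lru_cache(maxsize=maxsize)
    def _cached(normalized: str) -> str:
        return classify_fn(normalized)

    def classify(query: str) -> str:
        return _cached(" ".join(query.lower().split()))

    classify.cache_info = _cached.cache_info  # expose hit/miss statistics
    return classify

# Demo with a stub classifier that counts how often it is actually invoked.
calls = []
def stub_classify(q: str) -> str:
    calls.append(q)
    return "simple" if len(q.split()) < 15 else "moderate"

cached = make_cached_classifier(stub_classify)
cached("What is Python?")
cached("what  is   PYTHON?")  # normalizes to the same key -> cache hit
print(len(calls))  # 1
```

The same wrapper works around `LLMClassifier.classify` or an embedding-based classifier, since it only assumes a `str -> str` callable.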
Implementing Fallback Chains
Even with good routing, the selected model can still fail: it might produce a low-quality response, hit a rate limit, or time out. Implement automatic escalation:
import time
import logging
logger = logging.getLogger(__name__)
class RoutingAgent:
def __init__(self, router: ModelRouter):
self.router = router
self.escalation_order = [
Complexity.SIMPLE,
Complexity.MODERATE,
Complexity.COMPLEX,
]
def query(self, messages: list, system_prompt: str = "") -> str:
complexity = self.router.classify_complexity(messages)
start_idx = self.escalation_order.index(complexity)
full_messages = []
if system_prompt:
full_messages.append({"role": "system", "content": system_prompt})
full_messages.extend(messages)
# Try the classified tier, then escalate on failure
for tier_complexity in self.escalation_order[start_idx:]:
tier = self.router.tiers[tier_complexity]
client = OpenAI(base_url=tier.base_url, api_key=tier.api_key)
try:
start = time.time()
response = client.chat.completions.create(
model=tier.model,
messages=full_messages,
max_tokens=tier.max_tokens,
temperature=0.2,
timeout=30,
)
elapsed = time.time() - start
result = response.choices[0].message.content
# Quality gate: if response is too short, escalate
if len(result.strip()) < 20 and tier_complexity != Complexity.COMPLEX:
logger.warning(
f"Short response from {tier.model}, escalating"
)
continue
logger.info(
f"Routed to {tier.model} "
f"({tier_complexity.value}) in {elapsed:.2f}s"
)
return result
except Exception as e:
logger.error(f"Failed on {tier.model}: {e}")
continue
return "I'm unable to process this request at the moment."
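The escalate-on-failure pattern above can be factored into a provider-agnostic helper. Here is a minimal sketch in which each tier is just a zero-argument callable, with stub "models" standing in for real API calls:

```python
def call_with_escalation(attempts, min_length: int = 20,
                         fallback_msg: str = "Unable to process this request."):
    """Try each callable in order; escalate on exception or too-short output.

    Mirrors the quality gate above: a short answer triggers escalation
    unless we are already at the top tier.
    """
    for i, attempt in enumerate(attempts):
        is_last = i == len(attempts) - 1
        try:
            result = attempt()
        except Exception:
            continue  # provider error / timeout -> try the next tier
        if len(result.strip()) >= min_length or is_last:
            return result
    return fallback_msg  # every tier raised an exception

def flaky_small_model() -> str:
    raise TimeoutError("rate limited")

tiers = [
    flaky_small_model,  # errors -> escalate
    lambda: "ok",       # under 20 chars -> quality gate escalates
    lambda: "A sufficiently detailed answer from the top-tier model.",
]
print(call_with_escalation(tiers))
```

Keeping the escalation logic separate from provider clients also makes it easy to unit-test the quality gate without any network calls.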
Measuring Router Performance
Track routing decisions to optimize over time:
from collections import defaultdict
import json
class RouterMetrics:
def __init__(self):
self.decisions = defaultdict(int)
self.escalations = 0
self.costs = defaultdict(float)
def record(self, classified: Complexity, actual: Complexity, cost: float):
self.decisions[classified.value] += 1
if actual != classified:
self.escalations += 1
self.costs[actual.value] += cost
def report(self) -> dict:
total = sum(self.decisions.values())
return {
"total_queries": total,
"distribution": {
k: f"{v/total*100:.1f}%"
for k, v in self.decisions.items()
},
"escalation_rate": f"{self.escalations/total*100:.1f}%"
if total > 0 else "0%",
"total_cost": f"${sum(self.costs.values()):.4f}",
"cost_by_tier": {
k: f"${v:.4f}" for k, v in self.costs.items()
},
}
metrics = RouterMetrics()
# After each query:
# metrics.record(classified_complexity, actual_tier_used, cost)
print(json.dumps(metrics.report(), indent=2))
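A quick round-trip shows what the report captures. This condensed, self-contained version simulates three routed queries; the per-query dollar costs are made-up illustrative values:

```python
from collections import defaultdict

# Condensed re-run of the metrics logic with simulated traffic.
decisions = defaultdict(int)
costs = defaultdict(float)
stats = {"escalations": 0}

def record(classified: str, actual: str, cost: float) -> None:
    decisions[classified] += 1
    if actual != classified:
        stats["escalations"] += 1  # the fallback chain moved the query up a tier
    costs[actual] += cost

record("simple", "simple", 0.0)       # stayed on the free local model
record("simple", "moderate", 0.0002)  # quality gate escalated to the mid tier
record("complex", "complex", 0.0125)

total = sum(decisions.values())
print(f"escalation rate: {stats['escalations'] / total:.1%}")  # 33.3%
print(f"total cost: ${sum(costs.values()):.4f}")               # $0.0127
```

Note that costs are attributed to the tier that actually answered, not the tier the classifier predicted, so escalations show up in both the escalation rate and the cost-by-tier breakdown.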
Advanced: Embedding-Based Routing
For even better routing accuracy, use semantic similarity to a set of labeled example queries:
from sentence_transformers import SentenceTransformer
import numpy as np
class EmbeddingRouter:
def __init__(self):
self.embedder = SentenceTransformer("BAAI/bge-small-en-v1.5")
# Labeled example queries for each complexity tier
self.examples = {
Complexity.SIMPLE: [
"What is Python?",
"Define machine learning",
"Hello",
"What time is it?",
],
Complexity.MODERATE: [
"Explain how neural networks learn",
"Summarize the benefits of microservices",
"What are the pros and cons of NoSQL?",
],
Complexity.COMPLEX: [
"Design a distributed event-sourcing system for an e-commerce platform",
"Compare transformer and LSTM architectures for time-series forecasting",
"Debug this multi-threaded Python code that has a race condition",
],
}
# Pre-compute example embeddings
self.tier_embeddings = {}
for tier, texts in self.examples.items():
self.tier_embeddings[tier] = self.embedder.encode(
texts, normalize_embeddings=True
)
def classify(self, query: str) -> Complexity:
query_emb = self.embedder.encode([query], normalize_embeddings=True)
best_tier = Complexity.MODERATE
best_score = -1.0
for tier, embeddings in self.tier_embeddings.items():
similarities = np.dot(embeddings, query_emb.T).flatten()
max_sim = float(similarities.max())
if max_sim > best_score:
best_score = max_sim
best_tier = tier
return best_tier
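The nearest-example decision itself is independent of which embedding model you use. With toy 2-D unit vectors standing in for real sentence embeddings, the core logic looks like this:

```python
import numpy as np

def norm(v) -> np.ndarray:
    """L2-normalize so dot product equals cosine similarity."""
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v)

def nearest_tier(query_vec: np.ndarray, tier_vecs: dict) -> str:
    """Pick the tier whose closest example is most similar to the query."""
    best_tier, best_score = None, -1.0
    for tier, vecs in tier_vecs.items():
        score = float(np.dot(vecs, query_vec).max())
        if score > best_score:
            best_tier, best_score = tier, score
    return best_tier

# Toy "embeddings": simple queries cluster near [1, 0], complex near [0, 1].
tier_vecs = {
    "simple": np.stack([norm([1.0, 0.0]), norm([0.9, 0.1])]),
    "complex": np.stack([norm([0.0, 1.0]), norm([0.1, 0.9])]),
}
print(nearest_tier(norm([0.95, 0.05]), tier_vecs))  # simple
print(nearest_tier(norm([0.05, 0.95]), tier_vecs))  # complex
```

Because the example uses max similarity to any single example (rather than a tier centroid), one well-chosen example per query pattern is enough to capture it.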
Cost Savings in Practice
Consider an agent handling 100,000 queries per month, averaging roughly 1,000 input tokens per query, with this distribution after routing:
- 60% simple (local Llama 8B): $0
- 30% moderate (GPT-4o-mini at $0.15/1M input tokens): ~$4.50
- 10% complex (GPT-4o at $2.50/1M input tokens): ~$25
Total with routing: ~$30/month. Without routing (all GPT-4o): ~$250/month.
That is a roughly 88% reduction in input-token spend while maintaining full quality on the complex queries that actually need it. (Output tokens are priced higher, so real savings also depend on your response lengths and traffic mix.)
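The arithmetic is easy to redo for your own workload. This sketch assumes a flat average input-token count per query and per-1M input pricing; all constants are placeholders to replace with your own traffic numbers:

```python
# Hypothetical workload: plug in your own mix, volume, and token counts.
PRICE_PER_1M = {"simple": 0.00, "moderate": 0.15, "complex": 2.50}  # USD, input
MIX = {"simple": 0.60, "moderate": 0.30, "complex": 0.10}
QUERIES = 100_000
TOKENS_PER_QUERY = 1_000  # assumed average input tokens

def monthly_cost(mix: dict) -> float:
    tokens_m = QUERIES * TOKENS_PER_QUERY / 1_000_000  # total tokens, in millions
    return sum(share * tokens_m * PRICE_PER_1M[tier] for tier, share in mix.items())

routed = monthly_cost(MIX)
all_big = monthly_cost({"complex": 1.0})  # everything on the top-tier model
print(f"routed=${routed:.2f} all-top-tier=${all_big:.2f} "
      f"savings={1 - routed / all_big:.0%}")
```

Because the expensive tier dominates the bill, savings are driven almost entirely by how many queries you keep off it, not by the choice between the free and mid tiers.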
FAQ
Does the routing classification itself add meaningful latency?
Rule-based routing adds less than 1ms. LLM-based classification with a local 2B model adds 50-200ms. Embedding-based routing adds 10-30ms. For most agent applications where LLM inference takes 500ms-3s, the routing overhead is negligible — and the latency savings from using a faster model for simple queries often more than compensate.
What if the router misclassifies a complex query as simple?
This is why fallback chains are essential. If the small model produces a short, low-quality, or incoherent response, the quality gate detects this and escalates to the next tier. In practice, misclassification rates below 15% have minimal impact on user experience because the escalation mechanism catches most errors.
Can I use model routing with tool-calling agents?
Yes, but route based on the tool complexity, not just the query text. Simple tool calls (single lookup, single API call) route to small models. Complex orchestration (multi-tool chains, conditional logic) routes to large models. You can inspect the agent's tool definitions to inform the routing decision.
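For example, a hypothetical pre-routing check might count the tool definitions attached to the request; the thresholds below are illustrative, not part of the router above:

```python
def tier_for_tool_call(query: str, tools: list) -> str:
    """Hypothetical heuristic: many attached tools suggest orchestration."""
    words = len(query.split())
    if len(tools) >= 4 or words > 200:
        return "complex"   # multi-tool chains or long instructions
    if len(tools) >= 1:
        return "moderate"  # at least one function call to plan
    return "simple" if words < 15 else "moderate"

search_tool = {"type": "function", "function": {"name": "web_search"}}
print(tier_for_tool_call("Find the weather in Paris", [search_tool]))  # moderate
print(tier_for_tool_call("Hi", []))                                    # simple
```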
CallSphere Team