Smart Model Routing: Using Cheap Models First, Expensive Models When Needed
Learn how to design a model routing system that sends simple queries to cheap models and escalates complex ones to powerful models. Reduce AI agent costs by 40-60% while maintaining quality with intelligent routing.
The Model Routing Problem
Most teams default to using their best (and most expensive) model for every request. A customer asking "What are your business hours?" gets the same GPT-4o treatment as someone asking for a complex multi-step analysis. This is like sending every package via overnight express shipping — it works, but it destroys your margins.
Smart model routing classifies requests by complexity and routes them to the cheapest model that can handle them well. In practice, 60–80% of agent queries are simple enough for a small, fast model, meaning you only need the expensive model for the remaining 20–40%.
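To see why the math works, consider a hypothetical workload where 70% of requests go to a model costing 6% as much as the flagship (both numbers are illustrative assumptions). The blended cost drops to roughly a third of the always-expensive baseline:

```python
# Blended cost of routing, relative to always using the expensive model.
# The traffic split and cost ratio are assumptions, not measurements.
simple_share = 0.70      # fraction of requests handled by the cheap tier
cheap_cost_ratio = 0.06  # cheap model costs ~6% of the expensive one

blended = simple_share * cheap_cost_ratio + (1 - simple_share) * 1.0
savings_pct = (1 - blended) * 100

print(f"blended cost ratio: {blended:.3f}")  # 0.342
print(f"savings: {savings_pct:.0f}%")        # 66%
```

Even a conservative 50/50 split with the same cost ratio still cuts spend nearly in half, which is why routing pays off well before the classifier is perfect.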
Designing a Two-Tier Router
The simplest effective pattern uses two tiers: a fast/cheap model for straightforward requests and a powerful/expensive model for complex ones. A lightweight classifier decides which tier handles each request.
from dataclasses import dataclass
from enum import Enum

import openai


class Complexity(Enum):
    SIMPLE = "simple"
    COMPLEX = "complex"


@dataclass
class RoutingDecision:
    complexity: Complexity
    model: str
    reason: str
    estimated_cost_ratio: float  # relative to always using the expensive model


TIER_CONFIG = {
    Complexity.SIMPLE: {
        "model": "gpt-4o-mini",
        "max_tokens": 1024,
        "cost_ratio": 0.06,  # ~6% the cost of gpt-4o
    },
    Complexity.COMPLEX: {
        "model": "gpt-4o",
        "max_tokens": 4096,
        "cost_ratio": 1.0,
    },
}


class ModelRouter:
    def __init__(self, client: openai.OpenAI):
        self.client = client

    def classify_complexity(self, user_message: str) -> RoutingDecision:
        classification_prompt = (
            "Classify this user message as SIMPLE or COMPLEX.\n"
            "SIMPLE: factual lookups, greetings, yes/no questions, "
            "status checks, single-step tasks.\n"
            "COMPLEX: multi-step reasoning, analysis, code generation, "
            "creative writing, comparisons, ambiguous queries.\n"
            f"Message: {user_message}\n"
            "Respond with only SIMPLE or COMPLEX."
        )
        response = self.client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": classification_prompt}],
            max_tokens=10,
            temperature=0,
        )
        label = response.choices[0].message.content.strip().upper()
        complexity = Complexity.COMPLEX if "COMPLEX" in label else Complexity.SIMPLE
        config = TIER_CONFIG[complexity]
        return RoutingDecision(
            complexity=complexity,
            model=config["model"],
            reason=label,
            estimated_cost_ratio=config["cost_ratio"],
        )

    def route_and_respond(self, user_message: str, system_prompt: str) -> dict:
        decision = self.classify_complexity(user_message)
        config = TIER_CONFIG[decision.complexity]
        response = self.client.chat.completions.create(
            model=decision.model,
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": user_message},
            ],
            max_tokens=config["max_tokens"],
        )
        return {
            "response": response.choices[0].message.content,
            "model_used": decision.model,
            "complexity": decision.complexity.value,
            "cost_ratio": decision.estimated_cost_ratio,
        }
Adding Quality Gates
Routing is only valuable if quality stays high. Add a quality gate that catches cases where the cheap model underperforms and automatically retries with the expensive model.
class QualityGatedRouter(ModelRouter):
    def __init__(self, client: openai.OpenAI, quality_threshold: float = 0.7):
        super().__init__(client)
        self.quality_threshold = quality_threshold

    def check_response_quality(self, question: str, answer: str) -> float:
        check_prompt = (
            "Rate this answer's quality from 0.0 to 1.0.\n"
            f"Question: {question}\n"
            f"Answer: {answer}\n"
            "Respond with only a number."
        )
        response = self.client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": check_prompt}],
            max_tokens=5,
            temperature=0,
        )
        try:
            return float(response.choices[0].message.content.strip())
        except ValueError:
            return 0.5  # unparseable score: treat as borderline

    def route_with_fallback(self, user_message: str, system_prompt: str) -> dict:
        result = self.route_and_respond(user_message, system_prompt)
        if result["complexity"] == "simple":
            score = self.check_response_quality(user_message, result["response"])
            if score < self.quality_threshold:
                # Retry directly on the expensive tier. Re-routing through
                # route_and_respond would re-run the classifier, which would
                # likely label the message SIMPLE again.
                config = TIER_CONFIG[Complexity.COMPLEX]
                response = self.client.chat.completions.create(
                    model=config["model"],
                    messages=[
                        {"role": "system", "content": system_prompt},
                        {"role": "user", "content": user_message},
                    ],
                    max_tokens=config["max_tokens"],
                )
                result = {
                    "response": response.choices[0].message.content,
                    "model_used": config["model"],
                    "complexity": Complexity.COMPLEX.value,
                    "cost_ratio": config["cost_ratio"],
                    "escalated": True,
                    "original_quality_score": score,
                }
        return result
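Escalations are not free: an escalated request pays for the cheap attempt, the quality check, and the expensive retry. A back-of-envelope model (all rates below are illustrative assumptions, not measurements) shows the gate only erodes savings modestly at sane escalation rates:

```python
# Back-of-envelope cost model for the quality-gated router.
# All numbers are illustrative assumptions.
simple_share = 0.70     # traffic classified as SIMPLE
escalation_rate = 0.10  # fraction of SIMPLE answers that fail the gate
cheap = 0.06            # cheap-tier cost relative to the expensive model
check = 0.01            # quality-check call, also on the cheap model

# Each SIMPLE request pays the cheap attempt, the gate, and (sometimes)
# a full-price retry; COMPLEX requests pay full price as before.
per_simple = cheap + check + escalation_rate * 1.0
blended = simple_share * per_simple + (1 - simple_share) * 1.0
print(f"blended cost ratio with gate: {blended:.3f}")  # 0.419
```

Compared to the roughly 0.34 ratio of ungated routing, the gate costs a few points of savings in exchange for catching misroutes, which is usually a good trade.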
Cost Tracking Across Routes
Routing only pays off if you can prove it. Log every request's tier, token count, and cost, then compare actual spend against an always-expensive baseline.
class RoutingCostTracker:
    def __init__(self):
        self.requests = []

    def record(self, complexity: str, model: str, tokens_used: int, cost: float):
        self.requests.append({
            "complexity": complexity,
            "model": model,
            "tokens": tokens_used,
            "cost": cost,
        })

    def savings_report(self) -> dict:
        if not self.requests:
            return {}
        total_actual = sum(r["cost"] for r in self.requests)
        # Baseline: every request served by the expensive model at a blended
        # rate of $12.50 per 1M tokens; adjust to your provider's pricing.
        total_if_always_expensive = sum(
            r["tokens"] / 1_000_000 * 12.50 for r in self.requests
        )
        savings = total_if_always_expensive - total_actual
        return {
            "actual_cost": round(total_actual, 4),
            "cost_without_routing": round(total_if_always_expensive, 4),
            "savings": round(savings, 4),
            "savings_pct": round((savings / total_if_always_expensive) * 100, 1),
            "simple_pct": round(
                len([r for r in self.requests if r["complexity"] == "simple"])
                / len(self.requests) * 100, 1
            ),
        }
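The savings arithmetic behind `savings_report` is easy to sanity-check standalone. The two requests below use made-up token counts and costs, with the same $12.50/1M-token baseline assumed above:

```python
# Standalone check of the savings arithmetic, with two illustrative
# requests (token counts and costs are made up).
requests = [
    {"complexity": "simple", "tokens": 800, "cost": 0.0006},
    {"complexity": "complex", "tokens": 2000, "cost": 0.0250},
]
actual = sum(r["cost"] for r in requests)
# Baseline: every request billed at $12.50 per 1M tokens (assumed rate).
baseline = sum(r["tokens"] / 1_000_000 * 12.50 for r in requests)
savings_pct = (baseline - actual) / baseline * 100

print(round(actual, 4), round(baseline, 4), round(savings_pct, 1))
# 0.0256 0.035 26.9
```

Note that one expensive request dominates the bill even when half the traffic is cheap, which is why tracking `simple_pct` alongside cost matters.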
When Not to Route
Avoid model routing for safety-critical applications (medical, legal, financial advice), tasks requiring consistent voice or style across responses, and scenarios where the classification cost exceeds the routing savings — which happens with very short queries where the classifier itself costs more than the difference between models.
FAQ
Does the classifier itself add significant cost?
The classifier call uses a cheap model with very few output tokens (just "SIMPLE" or "COMPLEX"), so it costs roughly $0.00001–$0.00005 per classification. At typical volumes, the classifier cost is 0.1–0.5% of total LLM spend. The savings from routing far outweigh this overhead.
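That per-classification figure is straightforward to reproduce. Assuming gpt-4o-mini pricing of roughly $0.15 per 1M input tokens and $0.60 per 1M output tokens (check your provider's current rates) and an illustrative prompt size:

```python
# Rough classifier cost per call. Prices are assumed gpt-4o-mini rates:
# $0.15 / 1M input tokens, $0.60 / 1M output tokens.
input_tokens = 120   # classification prompt + user message (illustrative)
output_tokens = 3    # just "SIMPLE" or "COMPLEX"

cost = input_tokens / 1e6 * 0.15 + output_tokens / 1e6 * 0.60
print(f"${cost:.8f}")  # ~ $0.00002 per classification
```

At that rate, even a million classifications a month costs about $20, which is negligible next to the savings from routing.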
What if the classifier misroutes a complex query to the cheap model?
This is where quality gates matter. The fallback pattern detects low-quality responses and automatically escalates to the expensive model. Track your escalation rate — if it exceeds 15–20%, retune your classifier prompt or switch to a rule-based pre-filter for known complex patterns.
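A rule-based pre-filter can short-circuit the LLM classifier for messages that are obviously complex. The sketch below uses illustrative patterns of my own; tune them to the misroutes you actually observe:

```python
import re

# Illustrative patterns for messages that should always go to the big
# model, bypassing the LLM classifier entirely.
COMPLEX_PATTERNS = [
    re.compile(r"\b(compare|analyz|summariz|refactor)", re.I),
    re.compile(r"\bstep[- ]by[- ]step\b", re.I),
    re.compile(r"```"),  # inline code fences usually mean code work
]

def prefilter_is_complex(message: str) -> bool:
    """Return True if any known-complex pattern matches (rule-based gate)."""
    return any(p.search(message) for p in COMPLEX_PATTERNS)

print(prefilter_is_complex("Compare plan A and plan B"))  # True
print(prefilter_is_complex("What are your hours?"))       # False
```

Run the pre-filter before `classify_complexity`; on a hit, route straight to the expensive tier and skip the classifier call entirely.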
Can I use more than two tiers?
Absolutely. Three-tier systems (small/medium/large) work well at scale. The key is keeping the classifier logic simple enough that it does not become a cost center itself. Start with two tiers and add a middle tier only when you have enough traffic data to justify the complexity.
#ModelRouting #CostOptimization #LLMSelection #AIArchitecture #SmartRouting #AgenticAI #LearnAI #AIEngineering
CallSphere Team
Expert insights on AI voice agents and customer communication automation.