Distillation: Training Smaller Models to Mimic Larger Ones for Production Use
Learn how to use knowledge distillation to transfer capabilities from large teacher models to smaller, cheaper student models suitable for production deployment, with concrete examples of cost savings and quality tradeoffs.
The Production Cost Problem
GPT-4o produces excellent results. It also costs $2.50 per million input tokens and $10 per million output tokens. At 100,000 requests per day with an average of 2,000 tokens per request (say 1,500 input and 500 output), that is roughly $26,000 per month. A distilled GPT-4o-mini or fine-tuned Llama 3.1 8B can deliver 80-95% of the quality at 5-20% of the cost.
Knowledge distillation is the process of training a smaller "student" model to replicate the behavior of a larger "teacher" model. Unlike traditional fine-tuning where you need human-labeled data, distillation uses the teacher model itself to generate training data and labels.
The Distillation Pipeline
The basic approach is straightforward: send your production prompts to the teacher model, collect its responses, and fine-tune the student model on those input-output pairs.
from openai import OpenAI
import json
from typing import Optional

client = OpenAI()

def generate_teacher_response(
    client: OpenAI,
    messages: list[dict],
    model: str = "gpt-4o",
) -> Optional[str]:
    """Get a response from the teacher model, returning None on failure."""
    try:
        response = client.chat.completions.create(
            model=model,
            messages=messages,
            temperature=0.0,
        )
        return response.choices[0].message.content
    except Exception as e:
        print(f"Teacher error: {e}")
        return None

def build_distillation_dataset(
    production_inputs: list[dict],
    teacher_model: str = "gpt-4o",
    system_prompt: str = "",
    output_path: str = "distillation_data.jsonl",
) -> int:
    """Generate distillation training data from production inputs."""
    count = 0
    with open(output_path, "w") as f:
        for input_data in production_inputs:
            user_message = input_data["user_message"]
            messages = []
            if system_prompt:
                messages.append({"role": "system", "content": system_prompt})
            messages.append({"role": "user", "content": user_message})
            # Use the error-handling helper so one failed request
            # does not abort the whole run
            teacher_response = generate_teacher_response(
                client, messages, model=teacher_model
            )
            if teacher_response:
                training_example = {
                    "messages": messages + [
                        {"role": "assistant", "content": teacher_response}
                    ]
                }
                f.write(json.dumps(training_example) + "\n")
                count += 1
    print(f"Generated {count} distillation examples")
    return count
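Before fine-tuning, it is worth holding out a slice of the teacher data so you can later measure how close the student gets to the teacher on unseen prompts. A minimal sketch of such a split, assuming the JSONL format written above (the 10% holdout fraction and fixed seed are arbitrary choices, not requirements of the pipeline):

```python
import json
import random

def split_distillation_data(
    jsonl_path: str,
    holdout_fraction: float = 0.1,
    seed: int = 42,
) -> tuple[list[dict], list[dict]]:
    """Shuffle distillation examples and carve out a held-out eval set."""
    with open(jsonl_path) as f:
        examples = [json.loads(line) for line in f if line.strip()]
    # Deterministic shuffle so the split is reproducible across runs
    random.Random(seed).shuffle(examples)
    n_holdout = max(1, int(len(examples) * holdout_fraction))
    return examples[n_holdout:], examples[:n_holdout]
```

Fine-tune only on the first list; score the student against the teacher's responses on the second.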
Selective Distillation: Focus on What Matters
Not all teacher responses are worth learning from. A teacher that produces a mediocre response teaches mediocre behavior. Filter teacher responses before adding them to the training set.
def selective_distillation(
    inputs: list[dict],
    teacher_model: str = "gpt-4o",
    judge_model: str = "gpt-4o-mini",
    quality_threshold: float = 4.0,
    system_prompt: str = "",
) -> list[dict]:
    """Generate and filter distillation data using a quality judge."""
    high_quality = []
    for input_data in inputs:
        user_message = input_data["user_message"]
        messages = []
        if system_prompt:
            messages.append({"role": "system", "content": system_prompt})
        messages.append({"role": "user", "content": user_message})
        # Get teacher response; skip inputs the teacher returns nothing for
        teacher_response = client.chat.completions.create(
            model=teacher_model,
            messages=messages,
            temperature=0.0,
        ).choices[0].message.content
        if not teacher_response:
            continue
        # Judge the quality
        judge_response = client.chat.completions.create(
            model=judge_model,
            messages=[
                {
                    "role": "system",
                    "content": (
                        "Rate this response on a 1-5 scale for accuracy, "
                        "helpfulness, and completeness. Output only the number."
                    ),
                },
                {
                    "role": "user",
                    "content": f"Question: {user_message}\n\nResponse: {teacher_response}",
                },
            ],
            temperature=0.0,
            max_tokens=5,
        )
        try:
            score = float(judge_response.choices[0].message.content.strip())
        except (ValueError, AttributeError):
            # Skip examples where the judge output is not a parseable number
            continue
        if score >= quality_threshold:
            high_quality.append({
                "messages": messages + [
                    {"role": "assistant", "content": teacher_response}
                ],
                "quality_score": score,
            })
    print(f"Kept {len(high_quality)}/{len(inputs)} examples (score >= {quality_threshold})")
    return high_quality
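The filtered examples still carry the auxiliary quality_score field, which you will want to strip before uploading, since fine-tuning training files expect messages-only records. A small export step handles this (the output filename is just an example):

```python
import json

def export_for_finetuning(
    examples: list[dict],
    output_path: str = "filtered_distillation.jsonl",
) -> int:
    """Write messages-only JSONL, dropping fields like quality_score."""
    count = 0
    with open(output_path, "w") as f:
        for example in examples:
            # Keep only the conversation; discard bookkeeping fields
            f.write(json.dumps({"messages": example["messages"]}) + "\n")
            count += 1
    return count
```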
Chain-of-Thought Distillation
For reasoning-heavy tasks, distill the teacher's reasoning process — not just its final answer. This transfers the problem-solving strategy, not merely the output.
def distill_with_reasoning(
    inputs: list[dict],
    teacher_model: str = "gpt-4o",
    system_prompt: str = "",
) -> list[dict]:
    """Distill chain-of-thought reasoning from teacher to student."""
    examples = []
    cot_system = (
        f"{system_prompt}\n\n"
        "Think through the problem step by step before giving your final answer. "
        "Format: first show your reasoning under '## Reasoning', "
        "then give the final answer under '## Answer'."
    ).strip()
    for input_data in inputs:
        user_message = input_data["user_message"]
        messages = [
            {"role": "system", "content": cot_system},
            {"role": "user", "content": user_message},
        ]
        response = client.chat.completions.create(
            model=teacher_model,
            messages=messages,
            temperature=0.0,
        ).choices[0].message.content
        # Keep only non-empty responses that contain both required sections
        if response and "## Reasoning" in response and "## Answer" in response:
            examples.append({
                "messages": messages + [
                    {"role": "assistant", "content": response},
                ]
            })
    return examples
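A student trained this way will emit the same two-section format at inference time, so production code needs a small parser to surface only the final answer to users. A sketch, assuming the section markers from the cot_system prompt above (falling back to the full response is a design choice, not a requirement):

```python
def extract_final_answer(response: str) -> str:
    """Return the text under '## Answer', or the whole response as a fallback."""
    marker = "## Answer"
    if marker in response:
        # Everything after the marker is the user-facing answer
        return response.split(marker, 1)[1].strip()
    return response.strip()
```

Log the cases that hit the fallback: a student that stops emitting the format is an early signal of drift.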
Cost Analysis: Teacher vs Student
def calculate_distillation_roi(
    daily_requests: int,
    avg_input_tokens: int,
    avg_output_tokens: int,
    teacher_input_cost_per_m: float,   # e.g., $2.50 for GPT-4o
    teacher_output_cost_per_m: float,  # e.g., $10.00 for GPT-4o
    student_input_cost_per_m: float,   # e.g., $0.15 for GPT-4o-mini
    student_output_cost_per_m: float,  # e.g., $0.60 for GPT-4o-mini
    distillation_examples: int = 5000,
) -> dict:
    """Calculate the ROI of distillation (excludes fine-tuning training fees)."""
    # Monthly inference costs
    monthly_requests = daily_requests * 30
    monthly_input_tokens = monthly_requests * avg_input_tokens / 1_000_000
    monthly_output_tokens = monthly_requests * avg_output_tokens / 1_000_000
    teacher_monthly = (
        monthly_input_tokens * teacher_input_cost_per_m
        + monthly_output_tokens * teacher_output_cost_per_m
    )
    student_monthly = (
        monthly_input_tokens * student_input_cost_per_m
        + monthly_output_tokens * student_output_cost_per_m
    )
    # One-time distillation cost (generating training data with the teacher):
    # prompt tokens bill at the input rate, completion tokens at the output rate
    distillation_cost = (
        distillation_examples * avg_input_tokens / 1_000_000 * teacher_input_cost_per_m
        + distillation_examples * avg_output_tokens / 1_000_000 * teacher_output_cost_per_m
    )
    monthly_savings = teacher_monthly - student_monthly
    break_even_months = distillation_cost / monthly_savings if monthly_savings > 0 else float("inf")
    return {
        "teacher_monthly_cost": f"${teacher_monthly:,.2f}",
        "student_monthly_cost": f"${student_monthly:,.2f}",
        "monthly_savings": f"${monthly_savings:,.2f}",
        "distillation_cost": f"${distillation_cost:,.2f}",
        "break_even_months": round(break_even_months, 2),
        "annual_savings": f"${monthly_savings * 12 - distillation_cost:,.2f}",
    }

# Example: 50K requests/day, 500 input + 300 output tokens average
roi = calculate_distillation_roi(
    daily_requests=50_000,
    avg_input_tokens=500,
    avg_output_tokens=300,
    teacher_input_cost_per_m=2.50,
    teacher_output_cost_per_m=10.00,
    student_input_cost_per_m=0.15,
    student_output_cost_per_m=0.60,
)
# teacher_monthly: $6,375.00, student_monthly: $382.50, monthly_savings: $5,992.50
# distillation_cost: $21.25, break_even: ~0.004 months, annual_savings: ~$71,889
FAQ
How much quality loss should I expect from distillation?
For well-defined tasks (classification, extraction, formatting), distilled models retain 90-98% of teacher quality. For open-ended generation and complex reasoning, expect 80-90%. Narrow tasks distill well; broad creative tasks distill poorly because the student cannot capture the teacher's full capability distribution.
Should I distill within the same model family or cross-family?
Same-family distillation (GPT-4o to GPT-4o-mini, Llama 70B to 8B) works better because architectures share representations. Cross-family works but needs more data. Choose based on deployment needs — if you need self-hosted, distill to open-source regardless of teacher family.
How many distillation examples do I need?
For focused tasks, 1,000-3,000 high-quality examples suffice. For broader capabilities, aim for 5,000-10,000. Coverage matters more than volume — a thousand diverse examples beats ten thousand repetitive ones.
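One cheap way to push toward diversity rather than volume is to drop near-duplicate prompts before sending them to the teacher. A minimal sketch using normalized-text hashing (this only catches exact duplicates up to casing and whitespace; embedding-based deduplication would also catch paraphrases):

```python
import hashlib
import re

def dedupe_prompts(prompts: list[str]) -> list[str]:
    """Drop prompts identical after lowercasing and whitespace collapsing."""
    seen: set[str] = set()
    unique = []
    for prompt in prompts:
        # Normalize so trivial formatting differences do not count as new coverage
        normalized = re.sub(r"\s+", " ", prompt.lower()).strip()
        digest = hashlib.sha256(normalized.encode()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(prompt)
    return unique
```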
#KnowledgeDistillation #ModelCompression #FineTuning #ProductionML #CostOptimization #AgenticAI #LearnAI #AIEngineering
CallSphere Team
Expert insights on AI voice agents and customer communication automation.