
Distillation: Training Smaller Models to Mimic Larger Ones for Production Use

Learn how to use knowledge distillation to transfer capabilities from large teacher models to smaller, cheaper student models suitable for production deployment, with concrete examples of cost savings and quality tradeoffs.

The Production Cost Problem

GPT-4o produces excellent results. It also costs $2.50 per million input tokens and $10 per million output tokens. At 100,000 requests per day with an average of 2,000 tokens per request (roughly 1,500 input and 500 output), that is about $26,000 per month. A distilled GPT-4o-mini or fine-tuned Llama 3.1 8B can deliver 80-95% of the quality at 5-20% of the cost.

Knowledge distillation is the process of training a smaller "student" model to replicate the behavior of a larger "teacher" model. Unlike traditional fine-tuning where you need human-labeled data, distillation uses the teacher model itself to generate training data and labels.

The Distillation Pipeline

The basic approach is straightforward: send your production prompts to the teacher model, collect its responses, and fine-tune the student model on those input-output pairs.

from openai import OpenAI
import json
from typing import Optional

client = OpenAI()

def generate_teacher_response(
    client: OpenAI,
    messages: list[dict],
    model: str = "gpt-4o",
) -> Optional[str]:
    """Get a response from the teacher model, or None on failure."""
    try:
        response = client.chat.completions.create(
            model=model,
            messages=messages,
            temperature=0.0,
        )
        return response.choices[0].message.content
    except Exception as e:
        print(f"Teacher error: {e}")
        return None

def build_distillation_dataset(
    production_inputs: list[dict],
    teacher_model: str = "gpt-4o",
    system_prompt: str = "",
    output_path: str = "distillation_data.jsonl",
) -> int:
    """Generate distillation training data from production inputs."""
    count = 0

    with open(output_path, "w") as f:
        for input_data in production_inputs:
            user_message = input_data["user_message"]
            messages = []
            if system_prompt:
                messages.append({"role": "system", "content": system_prompt})
            messages.append({"role": "user", "content": user_message})

            try:
                teacher_response = client.chat.completions.create(
                    model=teacher_model,
                    messages=messages,
                    temperature=0.0,
                ).choices[0].message.content
            except Exception as e:
                print(f"Teacher error: {e}")
                teacher_response = None

            if teacher_response:
                training_example = {
                    "messages": messages + [
                        {"role": "assistant", "content": teacher_response}
                    ]
                }
                f.write(json.dumps(training_example) + "\n")
                count += 1

    print(f"Generated {count} distillation examples")
    return count
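Before spending money on a fine-tuning run, it is worth validating that every line of the generated file matches the chat-format schema fine-tuning APIs expect. A minimal checker (the `validate_distillation_file` helper below is illustrative, not part of any SDK):

```python
import json

def validate_distillation_file(path: str) -> tuple[int, list[str]]:
    """Check each JSONL line for the chat fine-tuning schema.

    Returns (number of valid examples, list of error descriptions).
    """
    valid, errors = 0, []
    with open(path) as f:
        for i, line in enumerate(f, start=1):
            try:
                example = json.loads(line)
            except json.JSONDecodeError:
                errors.append(f"line {i}: invalid JSON")
                continue
            messages = example.get("messages", [])
            roles = [m.get("role") for m in messages]
            if not messages:
                errors.append(f"line {i}: no messages")
            elif roles[-1] != "assistant":
                errors.append(f"line {i}: must end with an assistant turn")
            elif "user" not in roles:
                errors.append(f"line {i}: missing a user turn")
            else:
                valid += 1
    return valid, errors
```

Running this before upload catches truncated writes and malformed examples that would otherwise fail (or silently degrade) the fine-tuning job.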

Selective Distillation: Focus on What Matters

Not all teacher responses are worth learning from. A teacher that produces a mediocre response teaches mediocre behavior. Filter teacher responses before adding them to the training set.


def selective_distillation(
    inputs: list[dict],
    teacher_model: str = "gpt-4o",
    judge_model: str = "gpt-4o-mini",
    quality_threshold: float = 4.0,
    system_prompt: str = "",
) -> list[dict]:
    """Generate and filter distillation data using a quality judge."""
    high_quality = []

    for input_data in inputs:
        user_message = input_data["user_message"]
        messages = []
        if system_prompt:
            messages.append({"role": "system", "content": system_prompt})
        messages.append({"role": "user", "content": user_message})

        # Get teacher response
        try:
            teacher_response = client.chat.completions.create(
                model=teacher_model,
                messages=messages,
                temperature=0.0,
            ).choices[0].message.content
        except Exception as e:
            print(f"Teacher error: {e}")
            continue
        if not teacher_response:
            continue

        # Judge the quality
        judge_response = client.chat.completions.create(
            model=judge_model,
            messages=[
                {
                    "role": "system",
                    "content": (
                        "Rate this response on a 1-5 scale for accuracy, "
                        "helpfulness, and completeness. Output only the number."
                    ),
                },
                {
                    "role": "user",
                    "content": f"Question: {user_message}\n\nResponse: {teacher_response}",
                },
            ],
            temperature=0.0,
            max_tokens=5,
        )

        try:
            score = float(judge_response.choices[0].message.content.strip())
        except (ValueError, TypeError):
            continue

        if score >= quality_threshold:
            high_quality.append({
                "messages": messages + [
                    {"role": "assistant", "content": teacher_response}
                ],
                "quality_score": score,
            })

    print(f"Kept {len(high_quality)}/{len(inputs)} examples (score >= {quality_threshold})")
    return high_quality
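Quality filtering alone can still leave many near-identical prompts in the dataset, and duplicates waste training tokens without adding coverage. A simple deduplication pass (a sketch; `dedupe_examples` uses exact matching on normalized text, one of many possible approaches) keeps the first occurrence of each duplicate:

```python
def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(text.lower().split())

def dedupe_examples(examples: list[dict]) -> list[dict]:
    """Drop examples whose user message duplicates an earlier one."""
    seen: set[str] = set()
    unique = []
    for ex in examples:
        user_turns = [m["content"] for m in ex["messages"] if m["role"] == "user"]
        key = normalize(" ".join(user_turns))
        if key not in seen:
            seen.add(key)
            unique.append(ex)
    return unique
```

For larger datasets, fuzzier approaches (MinHash, embedding similarity) catch paraphrases that exact normalization misses.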

Chain-of-Thought Distillation

For reasoning-heavy tasks, distill the teacher's reasoning process — not just its final answer. This transfers the problem-solving strategy, not merely the output.

def distill_with_reasoning(
    inputs: list[dict],
    teacher_model: str = "gpt-4o",
    system_prompt: str = "",
) -> list[dict]:
    """Distill chain-of-thought reasoning from teacher to student."""
    examples = []

    cot_system = (
        f"{system_prompt}\n\n"
        "Think through the problem step by step before giving your final answer. "
        "Format: first show your reasoning under '## Reasoning', "
        "then give the final answer under '## Answer'."
    )

    for input_data in inputs:
        user_message = input_data["user_message"]
        messages = [
            {"role": "system", "content": cot_system},
            {"role": "user", "content": user_message},
        ]

        try:
            response = client.chat.completions.create(
                model=teacher_model,
                messages=messages,
                temperature=0.0,
            ).choices[0].message.content
        except Exception as e:
            print(f"Teacher error: {e}")
            continue

        # Keep only responses that follow the requested two-section format
        if response and "## Reasoning" in response and "## Answer" in response:
            examples.append({
                "messages": [
                    {"role": "system", "content": cot_system},
                    {"role": "user", "content": user_message},
                    {"role": "assistant", "content": response},
                ]
            })

    return examples
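If the production student should return only final answers (reasoning tokens add latency and cost at inference time), one option is to build a second, answer-only training set by stripping the reasoning section from each teacher response. A hypothetical helper, assuming the `## Reasoning` / `## Answer` format used above:

```python
def extract_answer(response: str) -> "str | None":
    """Return only the text under '## Answer', or None if the marker is absent."""
    marker = "## Answer"
    if marker not in response:
        return None
    # Everything after the first '## Answer' marker, whitespace-trimmed
    return response.split(marker, 1)[1].strip()
```

Research on chain-of-thought distillation suggests training on the full reasoning trace transfers more capability, so treat answer-only distillation as a latency optimization with a measurable quality cost.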

Cost Analysis: Teacher vs Student

def calculate_distillation_roi(
    daily_requests: int,
    avg_input_tokens: int,
    avg_output_tokens: int,
    teacher_input_cost_per_m: float,   # e.g., $2.50 for GPT-4o
    teacher_output_cost_per_m: float,  # e.g., $10.00 for GPT-4o
    student_input_cost_per_m: float,   # e.g., $0.15 for GPT-4o-mini
    student_output_cost_per_m: float,  # e.g., $0.60 for GPT-4o-mini
    distillation_examples: int = 5000,
) -> dict:
    """Calculate the ROI of distillation."""
    # Monthly inference costs
    monthly_requests = daily_requests * 30
    monthly_input_tokens = monthly_requests * avg_input_tokens / 1_000_000
    monthly_output_tokens = monthly_requests * avg_output_tokens / 1_000_000

    teacher_monthly = (
        monthly_input_tokens * teacher_input_cost_per_m
        + monthly_output_tokens * teacher_output_cost_per_m
    )
    student_monthly = (
        monthly_input_tokens * student_input_cost_per_m
        + monthly_output_tokens * student_output_cost_per_m
    )

    # One-time distillation cost (generating training data with the teacher):
    # the teacher's input rate applies to prompt tokens, its output rate to completions
    distillation_input_tokens = distillation_examples * avg_input_tokens
    distillation_output_tokens = distillation_examples * avg_output_tokens
    distillation_cost = (
        distillation_input_tokens / 1_000_000 * teacher_input_cost_per_m
        + distillation_output_tokens / 1_000_000 * teacher_output_cost_per_m
    )

    monthly_savings = teacher_monthly - student_monthly
    break_even_months = distillation_cost / monthly_savings if monthly_savings > 0 else float("inf")

    return {
        "teacher_monthly_cost": f"${teacher_monthly:,.2f}",
        "student_monthly_cost": f"${student_monthly:,.2f}",
        "monthly_savings": f"${monthly_savings:,.2f}",
        "distillation_cost": f"${distillation_cost:,.2f}",
        "break_even_months": round(break_even_months, 1),
        "annual_savings": f"${monthly_savings * 12 - distillation_cost:,.2f}",
    }

# Example: 50K requests/day, 500 input + 300 output tokens average
roi = calculate_distillation_roi(
    daily_requests=50_000,
    avg_input_tokens=500,
    avg_output_tokens=300,
    teacher_input_cost_per_m=2.50,
    teacher_output_cost_per_m=10.00,
    student_input_cost_per_m=0.15,
    student_output_cost_per_m=0.60,
)
# teacher_monthly: $6,375.00, student_monthly: $382.50, monthly_savings: $5,992.50
# break_even: well under one month; annual savings roughly $71,900

FAQ

How much quality loss should I expect from distillation?

For well-defined tasks (classification, extraction, formatting), distilled models retain 90-98% of teacher quality. For open-ended generation and complex reasoning, expect 80-90%. Narrow tasks distill well; broad creative tasks distill poorly because the student cannot capture the teacher's full capability distribution.

Should I distill within the same model family or cross-family?

Same-family distillation (GPT-4o to GPT-4o-mini, Llama 70B to 8B) works better because architectures share representations. Cross-family works but needs more data. Choose based on deployment needs — if you need self-hosted, distill to open-source regardless of teacher family.

How many distillation examples do I need?

For focused tasks, 1,000-3,000 high-quality examples suffice. For broader capabilities, aim for 5,000-10,000. Coverage matters more than volume — a thousand diverse examples beats ten thousand repetitive ones.


#KnowledgeDistillation #ModelCompression #FineTuning #ProductionML #CostOptimization #AgenticAI #LearnAI #AIEngineering

CallSphere Team

Expert insights on AI voice agents and customer communication automation.
