Upgrading LLM Models in Production: GPT-3.5 to GPT-4 to GPT-5 Migration
Learn how to safely upgrade LLM models in production systems. Covers evaluation frameworks, prompt adaptation, cost impact analysis, and progressive rollout strategies.
Why Model Upgrades Are Not Simple Config Changes
Swapping model="gpt-3.5-turbo" for model="gpt-4o" in your code takes five seconds. Making sure the upgrade actually improves your system without regressions, budget overruns, or latency spikes takes planning.
Each model generation behaves differently. Prompts that worked perfectly on GPT-3.5 may produce verbose or differently structured outputs on GPT-4. Tool calling schemas may be interpreted more strictly. Cost per token can jump by 10x or more. A disciplined upgrade process protects your users and your budget.
Step 1: Build an Evaluation Dataset
Before changing anything, create a gold-standard evaluation set from your current system.
import json
from dataclasses import dataclass


@dataclass
class EvalCase:
    input_messages: list[dict]
    expected_output: str
    category: str
    difficulty: str  # easy, medium, hard


def build_eval_set_from_logs(logs_path: str) -> list[EvalCase]:
    """Extract high-quality eval cases from production logs."""
    with open(logs_path) as f:
        logs = json.load(f)
    eval_cases = []
    for log in logs:
        if log.get("user_rating", 0) >= 4:  # only verified-good responses
            eval_cases.append(EvalCase(
                input_messages=log["messages"],
                expected_output=log["assistant_response"],
                category=log.get("category", "general"),
                difficulty=log.get("difficulty", "medium"),
            ))
    return eval_cases


eval_set = build_eval_set_from_logs("production_logs.json")
print(f"Built {len(eval_set)} evaluation cases")
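Before trusting the set, it helps to confirm that categories and difficulty levels are represented evenly; an eval set drawn from a thin slice of traffic will skew the comparison. A minimal coverage check (the `EvalCase` here is a stripped-down stand-in for the dataclass above, redeclared so the snippet runs on its own):

```python
from collections import Counter
from dataclasses import dataclass


# Minimal stand-in for the EvalCase dataclass from Step 1,
# so this sketch is self-contained.
@dataclass
class EvalCase:
    category: str
    difficulty: str


def summarize_coverage(eval_cases: list[EvalCase]) -> dict:
    """Count cases per category and difficulty so coverage gaps are visible."""
    return {
        "category": dict(Counter(c.category for c in eval_cases)),
        "difficulty": dict(Counter(c.difficulty for c in eval_cases)),
    }


cases = [
    EvalCase("billing", "easy"),
    EvalCase("billing", "hard"),
    EvalCase("general", "medium"),
]
print(summarize_coverage(cases))
```

If one category dominates, sample additional logs for the underrepresented ones before running the comparison in Step 2.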
Step 2: Run Comparative Evaluation
Test the new model against your evaluation set and score the results.
from openai import OpenAI
client = OpenAI()
def evaluate_model(eval_cases: list[EvalCase], model: str) -> dict:
    """Run eval cases against a model and compute metrics."""
    results = {"correct": 0, "total": 0, "total_tokens": 0}
    for case in eval_cases:
        response = client.chat.completions.create(
            model=model,
            messages=case.input_messages,
            temperature=0,
        )
        output = response.choices[0].message.content
        tokens = response.usage.total_tokens

        # Use LLM-as-judge for semantic comparison
        judge_response = client.chat.completions.create(
            model="gpt-4o",
            messages=[{
                "role": "user",
                "content": (
                    f"Compare these two responses for correctness.\n"
                    f"Expected: {case.expected_output}\n"
                    f"Actual: {output}\n"
                    f"Reply PASS or FAIL only."
                ),
            }],
            temperature=0,
        )
        verdict = judge_response.choices[0].message.content.strip().upper()
        results["correct"] += int(verdict.startswith("PASS"))
        results["total"] += 1
        results["total_tokens"] += tokens
    results["accuracy"] = results["correct"] / results["total"]
    return results


old_results = evaluate_model(eval_set, "gpt-3.5-turbo")
new_results = evaluate_model(eval_set, "gpt-4o")
print(f"GPT-3.5: {old_results['accuracy']:.1%} accuracy")
print(f"GPT-4o:  {new_results['accuracy']:.1%} accuracy")
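The raw numbers are easier to act on as an explicit go/no-go gate. A sketch over the result dicts that evaluate_model produces; the thresholds (`min_accuracy_gain`, `max_token_ratio`) are illustrative defaults, not recommendations — set them from your own quality bar and budget:

```python
def passes_upgrade_gate(
    old: dict,
    new: dict,
    min_accuracy_gain: float = 0.0,
    max_token_ratio: float = 3.0,
) -> bool:
    """Go/no-go check for a model upgrade.

    The new model must not regress on accuracy, and its average token
    usage per case must stay within a budget multiplier of the old model's.
    Expects dicts with "accuracy", "total_tokens", and "total" keys, as
    produced by evaluate_model.
    """
    if new["accuracy"] < old["accuracy"] + min_accuracy_gain:
        return False
    old_avg = old["total_tokens"] / old["total"]
    new_avg = new["total_tokens"] / new["total"]
    return new_avg <= old_avg * max_token_ratio


old = {"accuracy": 0.80, "total_tokens": 10_000, "total": 100}
new = {"accuracy": 0.90, "total_tokens": 20_000, "total": 100}
print(passes_upgrade_gate(old, new))
```

A gate like this is also easy to wire into CI so every future model swap reruns the evaluation automatically.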
Step 3: Adapt Prompts for the New Model
Newer models often respond better to concise instructions and may not need the verbose chain-of-thought scaffolding that older models required.
PROMPT_VERSIONS = {
    "gpt-3.5-turbo": (
        "Think step by step. First analyze the question. "
        "Then reason through the answer. Finally provide "
        "a clear, concise response."
    ),
    "gpt-4o": (
        "Answer concisely and accurately. Use examples "
        "when they add clarity."
    ),
}


def get_system_prompt(model: str) -> str:
    return PROMPT_VERSIONS.get(model, PROMPT_VERSIONS["gpt-4o"])
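Wiring the per-model prompt into request construction might look like the following sketch. It redeclares the PROMPT_VERSIONS table so it runs standalone, and `build_messages` is a hypothetical helper, not part of any SDK:

```python
PROMPT_VERSIONS = {
    "gpt-3.5-turbo": (
        "Think step by step. First analyze the question. "
        "Then reason through the answer. Finally provide "
        "a clear, concise response."
    ),
    "gpt-4o": (
        "Answer concisely and accurately. Use examples "
        "when they add clarity."
    ),
}


def get_system_prompt(model: str) -> str:
    # Unknown (e.g. newer) models fall back to the most recent prompt style.
    return PROMPT_VERSIONS.get(model, PROMPT_VERSIONS["gpt-4o"])


def build_messages(model: str, user_text: str) -> list[dict]:
    """Prepend the model-appropriate system prompt to a user turn."""
    return [
        {"role": "system", "content": get_system_prompt(model)},
        {"role": "user", "content": user_text},
    ]
```

Keeping prompts keyed by model name means a rollout can serve both generations at once without branching logic scattered through the codebase.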
Step 4: Progressive Rollout with Cost Monitoring
Roll out the new model gradually while tracking both quality and cost.
import random
import time


class ModelRouter:
    def __init__(self, new_model_pct: int = 5):
        self.new_model_pct = new_model_pct
        self.metrics = {"old": [], "new": []}

    def route(self, messages: list[dict]) -> str:
        use_new = random.randint(1, 100) <= self.new_model_pct
        model = "gpt-4o" if use_new else "gpt-3.5-turbo"
        tag = "new" if use_new else "old"
        start = time.monotonic()
        response = client.chat.completions.create(
            model=model, messages=messages
        )
        latency = time.monotonic() - start
        self.metrics[tag].append({
            "latency": latency,
            "tokens": response.usage.total_tokens,
        })
        return response.choices[0].message.content
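Once the router has collected enough traffic, raising `new_model_pct` should be a deliberate decision rather than a judgment call. One way to gate the ramp-up on the metrics structure the router collects; the sample floor and latency-ratio threshold here are illustrative assumptions:

```python
import statistics


def should_increase_rollout(
    metrics: dict,
    max_latency_ratio: float = 1.5,
    min_samples: int = 100,
) -> bool:
    """Allow ramp-up only with enough samples and bounded latency regression.

    Expects the {"old": [...], "new": [...]} structure that ModelRouter
    collects, where each entry has a "latency" key in seconds.
    """
    old, new = metrics["old"], metrics["new"]
    if len(old) < min_samples or len(new) < min_samples:
        return False

    def p95(samples: list[dict]) -> float:
        # quantiles(n=20) yields 19 cut points; index 18 is the 95th percentile
        return statistics.quantiles((s["latency"] for s in samples), n=20)[18]

    return p95(new) <= p95(old) * max_latency_ratio


# Example with synthetic metrics:
old_samples = [{"latency": 1.0, "tokens": 120} for _ in range(200)]
new_samples = [{"latency": 1.2, "tokens": 90} for _ in range(150)]
print(should_increase_rollout({"old": old_samples, "new": new_samples}))
```

Tail latency (p95) matters more than the mean here: a new model that is usually fast but occasionally stalls will look fine on averages while degrading the worst user experiences.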
FAQ
How much will upgrading from GPT-3.5 to GPT-4o cost?
GPT-4o is significantly cheaper than the original GPT-4 but still more expensive than GPT-3.5 Turbo. Expect roughly a 3-5x increase in token costs. However, GPT-4o often needs fewer tokens to produce correct answers because it requires less prompt scaffolding, which partially offsets the per-token cost increase.
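That offset is easy to sanity-check with per-request arithmetic. All prices and token counts below are made-up illustrations, not current rates — substitute your provider's published pricing:

```python
def request_cost(
    input_tokens: int,
    output_tokens: int,
    input_price: float,
    output_price: float,
) -> float:
    """Cost of one request in dollars; prices are per million tokens."""
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# Hypothetical numbers for illustration only.
# Old: verbose scaffolded prompt and long output on a cheap model.
old = request_cost(900, 450, input_price=0.50, output_price=1.50)
# New: leaner prompt and terser output on a model ~5x the per-token price.
new = request_cost(300, 150, input_price=2.50, output_price=7.50)
print(f"per-request: ${old:.6f} -> ${new:.6f} ({new / old:.1f}x)")
```

In this made-up scenario a 5x per-token price becomes well under a 2x per-request increase once the shorter prompt and output are factored in — which is why you should measure cost per request against your eval set, not just compare price sheets.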
Should I update all my prompts when upgrading models?
Not immediately. Start by running your existing prompts against the new model. Many prompts work fine across model generations. Only rewrite prompts that show regressions in your evaluation. Over time, simplify prompts that were using workarounds for older model limitations.
How do I handle model deprecation deadlines?
OpenAI announces deprecation dates months in advance. Set calendar reminders for 60 and 30 days before deprecation. Run your evaluation suite against the replacement model immediately after announcement, so you have maximum time to adapt prompts and test.
CallSphere Team
Expert insights on AI voice agents and customer communication automation.