
Evaluating Fine-Tuned Models: Benchmarks, Human Eval, and A/B Testing

Learn a comprehensive evaluation methodology for fine-tuned LLMs, combining automated benchmarks, human evaluation, and production A/B testing to measure real-world improvement with statistical rigor.

Why Evaluation Is the Hardest Part

Training a fine-tuned model takes hours. Evaluating whether it actually improved takes weeks. The reason is that "better" is multidimensional: a model can improve on formatting while regressing on accuracy, or handle common cases better while breaking on edge cases.

A production-grade evaluation strategy combines three layers: automated benchmarks for fast iteration, human evaluation for nuanced quality assessment, and A/B testing for real-world impact measurement.

Layer 1: Automated Benchmarks

Automated benchmarks give fast feedback during the training cycle. Build a held-out test set of 200-500 examples that the model never sees during training, then evaluate after each training run.

import json
from openai import OpenAI
from dataclasses import dataclass

client = OpenAI()

@dataclass
class EvalResult:
    example_id: int
    input_text: str
    expected: str
    predicted: str
    exact_match: bool
    format_correct: bool

def run_automated_eval(
    model: str,
    test_file: str,
    system_prompt: str = "",
) -> list[EvalResult]:
    """Run model on test set and collect results."""
    results = []

    with open(test_file, "r") as f:
        for idx, line in enumerate(f):
            data = json.loads(line)
            messages = data["messages"]
            expected = messages[-1]["content"]
            prompt = messages[:-1]
            # Optional override: replace any stored system prompt
            if system_prompt:
                prompt = [{"role": "system", "content": system_prompt}] + [
                    m for m in prompt if m["role"] != "system"
                ]

            response = client.chat.completions.create(
                model=model,
                messages=prompt,
                temperature=0.0,
                max_tokens=1024,
            )
            predicted = response.choices[0].message.content.strip()

            results.append(EvalResult(
                example_id=idx,
                input_text=messages[-2]["content"],
                expected=expected,
                predicted=predicted,
                exact_match=predicted == expected,
                format_correct=check_format(predicted),
            ))

    return results

def check_format(output: str) -> bool:
    """Validate output matches expected format. Customize per use case."""
    # Example: check for ICD-10 code format
    import re
    lines = output.strip().split("\n")
    for line in lines:
        if not re.match(r"^[A-Z]\d{2}\.\d{1,2}: .+", line):
            return False
    return True

Computing Metrics

def compute_metrics(results: list[EvalResult]) -> dict:
    """Compute aggregate metrics from evaluation results."""
    total = len(results)
    if total == 0:
        raise ValueError("No evaluation results to score")
    exact_matches = sum(1 for r in results if r.exact_match)
    format_correct = sum(1 for r in results if r.format_correct)

    # Character-level similarity (partial credit for near-matches)
    from difflib import SequenceMatcher
    similarities = [
        SequenceMatcher(None, r.expected, r.predicted).ratio()
        for r in results
    ]

    return {
        "total_examples": total,
        "exact_match_rate": exact_matches / total,
        "format_accuracy": format_correct / total,
        "avg_similarity": sum(similarities) / total,
        "min_similarity": min(similarities),
    }
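SequenceMatcher gives partial credit for near-misses that exact match would score as zero. A quick standalone sanity check of the similarity metric, using an illustrative ICD-10 output pair (hypothetical data):

```python
from difflib import SequenceMatcher

expected = "J45.9: Asthma, unspecified"
predicted = "J45.9: Asthma unspecified"  # missing comma

ratio = SequenceMatcher(None, expected, predicted).ratio()
print(f"{ratio:.3f}")  # high similarity, but exact_match would be False
```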

def compare_models(
    base_results: list[EvalResult],
    ft_results: list[EvalResult],
) -> dict:
    """Compare base model vs fine-tuned model metrics."""
    base_metrics = compute_metrics(base_results)
    ft_metrics = compute_metrics(ft_results)

    return {
        "exact_match_improvement": (
            ft_metrics["exact_match_rate"] - base_metrics["exact_match_rate"]
        ),
        "format_improvement": (
            ft_metrics["format_accuracy"] - base_metrics["format_accuracy"]
        ),
        "similarity_improvement": (
            ft_metrics["avg_similarity"] - base_metrics["avg_similarity"]
        ),
        "base": base_metrics,
        "fine_tuned": ft_metrics,
    }

Layer 2: Human Evaluation

Automated metrics miss nuances that humans catch: tone, helpfulness, factual correctness in context, and whether the response actually addresses the user's intent.

import random

def prepare_human_eval_batch(base_results, ft_results, sample_size=50):
    """Prepare a blind evaluation batch for human reviewers."""
    indices = random.sample(range(len(base_results)), sample_size)
    batch = []
    for idx in indices:
        # Randomly assign A/B to avoid position bias
        a_is_base = random.random() > 0.5
        if a_is_base:
            a, b = base_results[idx].predicted, ft_results[idx].predicted
        else:
            a, b = ft_results[idx].predicted, base_results[idx].predicted
        batch.append({
            "input": base_results[idx].input_text,
            "response_a": a,
            "response_b": b,
            # Unblinding key for later analysis; never shown to reviewers
            "a_is_base": a_is_base,
        })
    return batch
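Once reviewers submit their preferences, map the blind labels back to models and compute a win rate. A minimal sketch, assuming each judgment records the reviewer's pick ("a" or "b") alongside an unblinding key noting which slot held the base model (field names are illustrative):

```python
def compute_win_rate(judgments: list[dict]) -> dict:
    """Aggregate blind A/B judgments into per-model win rates.

    Each judgment is a dict like {"pick": "a", "a_is_base": True}.
    """
    ft_wins = base_wins = 0
    for j in judgments:
        # The base model won if the picked slot is the one holding it
        picked_base = (j["pick"] == "a") == j["a_is_base"]
        if picked_base:
            base_wins += 1
        else:
            ft_wins += 1
    total = len(judgments)
    return {
        "ft_win_rate": ft_wins / total,
        "base_win_rate": base_wins / total,
    }

judgments = [
    {"pick": "a", "a_is_base": False},  # reviewer preferred fine-tuned
    {"pick": "b", "a_is_base": True},   # reviewer preferred fine-tuned
    {"pick": "a", "a_is_base": True},   # reviewer preferred base
    {"pick": "b", "a_is_base": False},  # reviewer preferred base
]
print(compute_win_rate(judgments))  # → {'ft_win_rate': 0.5, 'base_win_rate': 0.5}
```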

Layer 3: A/B Testing in Production

The ultimate test is whether the fine-tuned model improves outcomes for real users. A/B testing routes a percentage of traffic to the fine-tuned model and compares business metrics.


import hashlib
import time
from dataclasses import dataclass, field

@dataclass
class ABTestConfig:
    experiment_name: str
    control_model: str
    treatment_model: str
    traffic_split: float = 0.1  # 10% to treatment
    min_samples: int = 1000

@dataclass
class ABTestResult:
    model: str
    variant: str
    user_id: str
    response: str
    latency_ms: float
    timestamp: float = field(default_factory=time.time)

def assign_variant(user_id: str, config: ABTestConfig) -> str:
    """Deterministic assignment based on user ID hash."""
    hash_val = int(hashlib.md5(
        f"{config.experiment_name}:{user_id}".encode()
    ).hexdigest(), 16)
    if (hash_val % 1000) / 1000 < config.traffic_split:
        return "treatment"
    return "control"
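Hash-based assignment has two properties worth verifying: the same user always lands in the same bucket (so their experience is consistent across sessions), and across many users the treatment fraction converges to the configured split. A quick standalone check with the same hashing scheme inlined and illustrative user IDs:

```python
import hashlib

def assign_variant(user_id: str, experiment: str, split: float = 0.1) -> str:
    # Same scheme as above, inlined here for a self-contained check
    h = int(hashlib.md5(f"{experiment}:{user_id}".encode()).hexdigest(), 16)
    return "treatment" if (h % 1000) / 1000 < split else "control"

# Determinism: repeated calls for the same user always agree
assert assign_variant("user-42", "exp-1") == assign_variant("user-42", "exp-1")

# Split accuracy: over many users, treatment share approaches 10%
n = 10_000
treated = sum(assign_variant(f"user-{i}", "exp-1") == "treatment" for i in range(n))
print(treated / n)  # close to 0.10
```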

def run_ab_request(
    user_id: str,
    messages: list[dict],
    config: ABTestConfig,
    client: OpenAI,
) -> ABTestResult:
    """Route a request through the A/B test."""
    variant = assign_variant(user_id, config)
    model = (
        config.treatment_model if variant == "treatment"
        else config.control_model
    )

    start = time.perf_counter()
    response = client.chat.completions.create(
        model=model,
        messages=messages,
        temperature=0.0,
    )
    latency = (time.perf_counter() - start) * 1000

    return ABTestResult(
        model=model,
        variant=variant,
        user_id=user_id,
        response=response.choices[0].message.content,
        latency_ms=latency,
    )

Statistical Significance

Do not declare a winner until the difference is statistically significant. For conversion-style metrics, a two-proportion z-test is sufficient.

import math

def proportion_z_test(
    successes_a: int, total_a: int,
    successes_b: int, total_b: int,
) -> dict:
    """Two-proportion z-test for A/B test significance."""
    p_a = successes_a / total_a
    p_b = successes_b / total_b
    p_pool = (successes_a + successes_b) / (total_a + total_b)

    se = math.sqrt(p_pool * (1 - p_pool) * (1 / total_a + 1 / total_b))

    if se == 0:
        return {"significant": False, "reason": "No variance"}

    z = (p_b - p_a) / se
    # Approximate p-value for two-tailed test
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

    return {
        "control_rate": f"{p_a:.4f}",
        "treatment_rate": f"{p_b:.4f}",
        "lift": f"{(p_b - p_a) / p_a:.2%}" if p_a > 0 else "N/A",
        "z_score": f"{z:.3f}",
        "p_value": f"{p_value:.4f}",
        "significant": p_value < 0.05,
    }
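As a worked example, suppose the control succeeds on 520 of 1,000 requests and the treatment on 570 of 1,000 (hypothetical numbers). Running the same arithmetic as the test above by hand:

```python
import math

successes_a, total_a = 520, 1000  # control
successes_b, total_b = 570, 1000  # treatment

p_a, p_b = successes_a / total_a, successes_b / total_b
p_pool = (successes_a + successes_b) / (total_a + total_b)
se = math.sqrt(p_pool * (1 - p_pool) * (1 / total_a + 1 / total_b))
z = (p_b - p_a) / se
p_value = math.erfc(abs(z) / math.sqrt(2))  # equivalent to 2 * (1 - Phi(|z|))

print(f"z = {z:.3f}, p = {p_value:.4f}")  # significant at the 0.05 level
```

The 5-point lift clears the 0.05 threshold here; a 2-point lift on the same sample sizes would not, which is why sample size planning matters.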

FAQ

How large should my test set be for automated evaluation?

A test set of 200-500 examples is the sweet spot for most fine-tuning projects. Fewer than 100 examples gives unreliable metrics — individual examples have too much influence. More than 1,000 examples increases evaluation cost without proportionally improving confidence. Make sure your test set covers the distribution of real inputs, including edge cases.

When should I use human evaluation versus automated metrics?

Use automated metrics for fast iteration during training (comparing hyperparameters, checking for regressions). Use human evaluation before any production deployment to catch quality issues that automated metrics miss, such as hallucinations, unhelpful but technically correct responses, or subtle tone problems. In practice, run automated eval after every training run and human eval before every deployment.

How long should I run an A/B test before making a decision?

Run until you reach statistical significance (p < 0.05) with a minimum of 1,000 samples per variant. For most applications, this takes 1-2 weeks. Avoid peeking at results early and stopping when they look good — this inflates false positive rates. Pre-register your success metrics and minimum sample size before starting the test.
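Pre-registering a minimum sample size can be done with the standard two-proportion formula. A sketch, where the baseline rate and minimum detectable effect are assumptions you set per experiment (defaults correspond to alpha = 0.05 two-tailed and 80% power):

```python
import math

def min_samples_per_variant(
    baseline: float,        # control success rate, e.g. 0.52
    min_effect: float,      # smallest absolute lift worth detecting, e.g. 0.05
    z_alpha: float = 1.96,  # two-tailed alpha = 0.05
    z_beta: float = 0.84,   # power = 0.80
) -> int:
    """Approximate per-variant sample size for a two-proportion z-test."""
    p_avg = baseline + min_effect / 2   # average rate under the alternative
    var = 2 * p_avg * (1 - p_avg)       # pooled variance term
    n = var * (z_alpha + z_beta) ** 2 / min_effect ** 2
    return math.ceil(n)

print(min_samples_per_variant(0.52, 0.05))  # → 1556
```

Note how quickly the requirement grows as the detectable effect shrinks: halving `min_effect` roughly quadruples the sample size.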


#ModelEvaluation #FineTuning #ABTesting #Benchmarks #LLMQuality #AgenticAI #LearnAI #AIEngineering


CallSphere Team

Expert insights on AI voice agents and customer communication automation.
