Evaluating Fine-Tuned Models: Benchmarks, Human Eval, and A/B Testing
Learn a comprehensive evaluation methodology for fine-tuned LLMs, combining automated benchmarks, human evaluation, and production A/B testing to measure real-world improvement with statistical rigor.
Why Evaluation Is the Hardest Part
Training a fine-tuned model takes hours. Evaluating whether it actually improved takes weeks. The reason is that "better" is multidimensional: a model can improve on formatting while regressing on accuracy, or handle common cases better while breaking on edge cases.
A production-grade evaluation strategy combines three layers: automated benchmarks for fast iteration, human evaluation for nuanced quality assessment, and A/B testing for real-world impact measurement.
Layer 1: Automated Benchmarks
Automated benchmarks give fast feedback during the training cycle. Build a test set of 100-500 examples that the model never sees during training, then evaluate after each training run.
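Each line of that test file is a chat-format example whose final assistant message is the expected completion. A sketch of one such line (the medical-coding content here is a hypothetical example, not real data):

```python
import json

# A hypothetical test example in the same chat format used for fine-tuning:
# everything before the last message is the prompt; the final assistant
# message is the expected completion the model is scored against.
example = {
    "messages": [
        {"role": "system", "content": "You are a medical coding assistant."},
        {"role": "user", "content": "Patient presents with type 2 diabetes, no complications."},
        {"role": "assistant", "content": "E11.9: Type 2 diabetes mellitus without complications"},
    ]
}

with open("test_set.jsonl", "w") as f:
    f.write(json.dumps(example) + "\n")
```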
```python
import json
import re
from dataclasses import dataclass

from openai import OpenAI

client = OpenAI()


@dataclass
class EvalResult:
    example_id: int
    input_text: str
    expected: str
    predicted: str
    exact_match: bool
    format_correct: bool


def run_automated_eval(
    model: str,
    test_file: str,
    system_prompt: str = "",
) -> list[EvalResult]:
    """Run the model on the held-out test set and collect results."""
    results = []
    with open(test_file, "r") as f:
        for idx, line in enumerate(f):
            data = json.loads(line)
            messages = data["messages"]
            expected = messages[-1]["content"]
            prompt = messages[:-1]
            # If an override system prompt was supplied, replace any
            # system message from the test file with it.
            if system_prompt:
                prompt = [{"role": "system", "content": system_prompt}] + [
                    m for m in prompt if m["role"] != "system"
                ]
            response = client.chat.completions.create(
                model=model,
                messages=prompt,
                temperature=0.0,
                max_tokens=1024,
            )
            predicted = response.choices[0].message.content.strip()
            results.append(EvalResult(
                example_id=idx,
                input_text=messages[-2]["content"],
                expected=expected,
                predicted=predicted,
                exact_match=predicted == expected,
                format_correct=check_format(predicted),
            ))
    return results


def check_format(output: str) -> bool:
    """Validate output matches the expected format. Customize per use case."""
    # Example: every line must be an ICD-10 code followed by a description.
    for line in output.strip().split("\n"):
        if not re.match(r"^[A-Z]\d{2}\.\d{1,2}: .+", line):
            return False
    return True
```
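A quick sanity check of the ICD-10 line pattern, standalone so it can run without the rest of the harness:

```python
import re

# The same pattern check_format uses: a letter, two digits, a decimal part,
# then a colon-separated description.
ICD10_LINE = r"^[A-Z]\d{2}\.\d{1,2}: .+"

ok = bool(re.match(ICD10_LINE, "E11.9: Type 2 diabetes mellitus without complications"))
missing_code = bool(re.match(ICD10_LINE, "Type 2 diabetes mellitus"))
no_decimal = bool(re.match(ICD10_LINE, "E11: Type 2 diabetes mellitus"))
```

Testing the format checker itself before trusting it in an eval loop is cheap insurance: a regex that silently rejects valid outputs will make every model look broken.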
Computing Metrics
```python
from difflib import SequenceMatcher


def compute_metrics(results: list[EvalResult]) -> dict:
    """Compute aggregate metrics from evaluation results."""
    total = len(results)
    exact_matches = sum(1 for r in results if r.exact_match)
    format_correct = sum(1 for r in results if r.format_correct)
    # Character-level similarity gives partial credit for near misses.
    similarities = [
        SequenceMatcher(None, r.expected, r.predicted).ratio()
        for r in results
    ]
    return {
        "total_examples": total,
        "exact_match_rate": exact_matches / total,
        "format_accuracy": format_correct / total,
        "avg_similarity": sum(similarities) / total,
        "min_similarity": min(similarities),
    }


def compare_models(
    base_results: list[EvalResult],
    ft_results: list[EvalResult],
) -> dict:
    """Compare base model vs fine-tuned model metrics."""
    base_metrics = compute_metrics(base_results)
    ft_metrics = compute_metrics(ft_results)
    return {
        "exact_match_improvement": (
            ft_metrics["exact_match_rate"] - base_metrics["exact_match_rate"]
        ),
        "format_improvement": (
            ft_metrics["format_accuracy"] - base_metrics["format_accuracy"]
        ),
        "similarity_improvement": (
            ft_metrics["avg_similarity"] - base_metrics["avg_similarity"]
        ),
        "base": base_metrics,
        "fine_tuned": ft_metrics,
    }
```
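On a test set of a few hundred examples, a raw point delta can be noise. A paired bootstrap over per-example outcomes gives a rough confidence interval on the improvement; the sketch below uses synthetic match flags purely for illustration, not real results:

```python
import random


def bootstrap_improvement(base_correct, ft_correct, n_boot=2000, seed=0):
    """Paired bootstrap 95% CI for the difference in exact-match rate.

    base_correct / ft_correct: per-example booleans for the same test set.
    """
    rng = random.Random(seed)
    n = len(base_correct)
    deltas = []
    for _ in range(n_boot):
        # Resample example indices with replacement, keeping pairs aligned.
        idx = [rng.randrange(n) for _ in range(n)]
        base_rate = sum(base_correct[i] for i in idx) / n
        ft_rate = sum(ft_correct[i] for i in idx) / n
        deltas.append(ft_rate - base_rate)
    deltas.sort()
    return deltas[int(0.025 * n_boot)], deltas[int(0.975 * n_boot)]


# Synthetic flags: base correct ~33% of the time, fine-tuned ~50%.
base = [i % 3 == 0 for i in range(300)]
ft = [i % 2 == 0 for i in range(300)]
lo, hi = bootstrap_improvement(base, ft)
```

If the interval excludes zero, the improvement is unlikely to be a sampling artifact of the particular test examples chosen.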
Layer 2: Human Evaluation
Automated metrics miss nuances that humans catch: tone, helpfulness, factual correctness in context, and whether the response actually addresses the user's intent.
```python
import random


def prepare_human_eval_batch(base_results, ft_results, sample_size=50):
    """Prepare a blind evaluation batch for human reviewers."""
    indices = random.sample(range(len(base_results)), sample_size)
    batch = []
    for idx in indices:
        # Randomly assign A/B to avoid position bias, and record which
        # variant got slot A so preferences can be unblinded later.
        # Strip the "a_is_base" key before showing items to reviewers.
        a_is_base = random.random() > 0.5
        if a_is_base:
            a, b = base_results[idx].predicted, ft_results[idx].predicted
        else:
            a, b = ft_results[idx].predicted, base_results[idx].predicted
        batch.append({
            "input": base_results[idx].input_text,
            "response_a": a,
            "response_b": b,
            "a_is_base": a_is_base,
        })
    return batch
```
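Once reviewers return their judgments, unblind and aggregate them into a win rate for the fine-tuned model. A sketch, assuming each judgment records the reviewer's choice plus a hidden `a_is_base` assignment key (the field names here are illustrative):

```python
def compute_win_rate(judgments):
    """Fine-tuned win rate among decided (non-tie) judgments.

    judgments: dicts with "choice" ("A", "B", or "tie") and "a_is_base"
    (True if slot A held the base model's response).
    """
    ft_wins = ties = 0
    for j in judgments:
        if j["choice"] == "tie":
            ties += 1
        elif (j["choice"] == "A") != j["a_is_base"]:
            # The chosen slot was NOT the base model, so the
            # fine-tuned response won this comparison.
            ft_wins += 1
    decided = len(judgments) - ties
    return ft_wins / decided if decided else 0.0


judgments = [
    {"choice": "A", "a_is_base": False},  # fine-tuned shown as A, chosen
    {"choice": "B", "a_is_base": True},   # fine-tuned shown as B, chosen
    {"choice": "A", "a_is_base": True},   # base model chosen
    {"choice": "tie", "a_is_base": False},
]
win_rate = compute_win_rate(judgments)  # 2 wins of 3 decided
```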
Layer 3: A/B Testing in Production
The ultimate test is whether the fine-tuned model improves outcomes for real users. A/B testing routes a percentage of traffic to the fine-tuned model and compares business metrics.
```python
import hashlib
import time
from dataclasses import dataclass, field


@dataclass
class ABTestConfig:
    experiment_name: str
    control_model: str
    treatment_model: str
    traffic_split: float = 0.1  # 10% of traffic to treatment
    min_samples: int = 1000


@dataclass
class ABTestResult:
    model: str
    variant: str
    user_id: str
    response: str
    latency_ms: float
    timestamp: float = field(default_factory=time.time)


def assign_variant(user_id: str, config: ABTestConfig) -> str:
    """Deterministic assignment based on a hash of the user ID."""
    hash_val = int(hashlib.md5(
        f"{config.experiment_name}:{user_id}".encode()
    ).hexdigest(), 16)
    if (hash_val % 1000) / 1000 < config.traffic_split:
        return "treatment"
    return "control"
```
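Hash-based assignment is deterministic (the same user always sees the same variant) and should land close to the configured split over many users. A standalone check, mirroring the assignment logic above with hypothetical user IDs:

```python
import hashlib


def assign(user_id: str, experiment: str, split: float) -> str:
    # Same scheme as assign_variant: hash the experiment:user pair and
    # bucket into thousandths of the traffic range.
    h = int(hashlib.md5(f"{experiment}:{user_id}".encode()).hexdigest(), 16)
    return "treatment" if (h % 1000) / 1000 < split else "control"


variants = [assign(f"user-{i}", "ft-rollout-v1", 0.1) for i in range(10_000)]
treatment_share = variants.count("treatment") / len(variants)
```

Salting the hash with the experiment name matters: it re-randomizes assignment across experiments, so the same users are not always the guinea pigs.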
```python
def run_ab_request(
    user_id: str,
    messages: list[dict],
    config: ABTestConfig,
    client: OpenAI,
) -> ABTestResult:
    """Route a request through the A/B test."""
    variant = assign_variant(user_id, config)
    model = (
        config.treatment_model if variant == "treatment"
        else config.control_model
    )
    start = time.perf_counter()
    response = client.chat.completions.create(
        model=model,
        messages=messages,
        temperature=0.0,
    )
    latency = (time.perf_counter() - start) * 1000
    return ABTestResult(
        model=model,
        variant=variant,
        user_id=user_id,
        response=response.choices[0].message.content,
        latency_ms=latency,
    )
```
Statistical Significance
Do not declare a winner until you have statistical significance. Use a simple proportion test.
```python
import math


def proportion_z_test(
    successes_a: int, total_a: int,
    successes_b: int, total_b: int,
) -> dict:
    """Two-proportion z-test for A/B test significance."""
    p_a = successes_a / total_a
    p_b = successes_b / total_b
    p_pool = (successes_a + successes_b) / (total_a + total_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / total_a + 1 / total_b))
    if se == 0:
        return {"significant": False, "reason": "No variance"}
    z = (p_b - p_a) / se
    # Two-tailed p-value from the standard normal CDF, via erf.
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return {
        "control_rate": f"{p_a:.4f}",
        "treatment_rate": f"{p_b:.4f}",
        "lift": f"{(p_b - p_a) / p_a:.2%}" if p_a > 0 else "N/A",
        "z_score": f"{z:.3f}",
        "p_value": f"{p_value:.4f}",
        "significant": p_value < 0.05,
    }
```
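A worked example with made-up counts: 520/1,000 successes for control versus 570/1,000 for treatment, computed inline so the arithmetic is visible:

```python
import math

successes_a, total_a = 520, 1000  # control (hypothetical counts)
successes_b, total_b = 570, 1000  # treatment (hypothetical counts)

p_a, p_b = successes_a / total_a, successes_b / total_b
p_pool = (successes_a + successes_b) / (total_a + total_b)
se = math.sqrt(p_pool * (1 - p_pool) * (1 / total_a + 1 / total_b))
z = (p_b - p_a) / se
p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
```

Here a 5-point absolute lift on 1,000 samples per arm yields a z-score around 2.2 and a p-value under 0.05, so the treatment would be declared significantly better; with only a couple hundred samples per arm, the same lift would not reach significance.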
FAQ
How large should my test set be for automated evaluation?
A test set of 200-500 examples is the sweet spot for most fine-tuning projects. Fewer than 100 examples gives unreliable metrics — individual examples have too much influence. More than 1,000 examples increases evaluation cost without proportionally improving confidence. Make sure your test set covers the distribution of real inputs, including edge cases.
When should I use human evaluation versus automated metrics?
Use automated metrics for fast iteration during training (comparing hyperparameters, checking for regressions). Use human evaluation before any production deployment to catch quality issues that automated metrics miss, such as hallucinations, unhelpful but technically correct responses, or subtle tone problems. In practice, run automated eval after every training run and human eval before every deployment.
How long should I run an A/B test before making a decision?
Run until you reach statistical significance (p < 0.05) with a minimum of 1,000 samples per variant. For most applications, this takes 1-2 weeks. Avoid peeking at results early and stopping when they look good — this inflates false positive rates. Pre-register your success metrics and minimum sample size before starting the test.
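The minimum sample size can be pre-computed from the smallest lift worth detecting, using the standard two-proportion power approximation; the baseline rate and minimum detectable lift below are placeholders to adapt to your own metric:

```python
import math


def samples_per_variant(p_base: float, min_lift_abs: float) -> int:
    """Approximate samples per arm for a two-sided test,
    alpha = 0.05 and 80% power (z values hard-coded accordingly)."""
    z_alpha = 1.96  # two-sided alpha = 0.05
    z_beta = 0.84   # power = 0.80
    p_treat = p_base + min_lift_abs
    variance = p_base * (1 - p_base) + p_treat * (1 - p_treat)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / min_lift_abs ** 2)


# Detecting a 5-point absolute lift over a 50% baseline:
n = samples_per_variant(p_base=0.50, min_lift_abs=0.05)
```

Note how the required n grows quadratically as the minimum detectable lift shrinks, which is why chasing sub-point improvements needs far more traffic than the 1,000-sample floor above.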
CallSphere Team
Expert insights on AI voice agents and customer communication automation.