Building Evaluation Datasets: Synthetic Generation, Human Labeling, and Active Learning
A practical guide to creating high-quality evaluation datasets for AI agents using synthetic data generation, human annotation pipelines, active learning for efficient labeling, and dataset versioning strategies.
The Dataset Is the Evaluation
Your evaluation is only as good as your dataset. A perfect scoring pipeline running against a biased or unrepresentative dataset gives you false confidence. Building evaluation datasets for AI agents is particularly challenging because agent interactions are multi-turn, involve tool calls, and have complex success criteria that go beyond simple text matching.
This guide covers three complementary approaches: synthetic generation for scale, human labeling for quality, and active learning for efficiency. Used together, they give you a dataset that is large enough for statistical reliability, accurate enough for trust, and continuously improving as your agent evolves.
Synthetic Dataset Generation
Use an LLM to generate diverse evaluation samples at scale. The key is generating both the user inputs and the expected agent behavior.
import json
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SyntheticSample:
    sample_id: str
    user_input: str
    expected_response: str
    expected_tool_calls: list[dict] = field(default_factory=list)
    difficulty: str = "medium"
    tags: list[str] = field(default_factory=list)
    generated_by: str = "synthetic"


async def generate_synthetic_samples(
    llm_client,
    task_description: str,
    tool_definitions: list[dict],
    count: int = 20,
    difficulties: Optional[list[str]] = None,
) -> list[SyntheticSample]:
    difficulties = difficulties or ["easy", "medium", "hard"]
    tools_text = json.dumps(tool_definitions, indent=2)
    prompt = f"""Generate {count} diverse evaluation samples for
an AI agent with the following task and available tools.

## Task Description
{task_description}

## Available Tools
{tools_text}

For each sample, generate:
1. A realistic user input message
2. The expected agent response (or key points)
3. Expected tool calls with parameters
4. Difficulty level: {difficulties}
5. Tags describing the capability tested

Vary the samples across:
- Different user phrasings and communication styles
- Edge cases and unusual requests
- Multi-step and single-step tasks
- Clear and ambiguous intents

Return a JSON object with a "samples" array:
{{
  "samples": [
    {{
      "user_input": "...",
      "expected_response_summary": "...",
      "expected_tool_calls": [{{"name": "...", "params": {{}}}}],
      "difficulty": "easy|medium|hard",
      "tags": ["..."]
    }}
  ]
}}"""
    response = await llm_client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        # json_object mode requires a top-level object, hence the
        # {"samples": [...]} wrapper requested in the prompt
        response_format={"type": "json_object"},
        temperature=0.9,
    )
    raw = json.loads(response.choices[0].message.content)
    items = raw if isinstance(raw, list) else raw.get("samples", [])
    samples = []
    for i, item in enumerate(items):
        samples.append(SyntheticSample(
            sample_id=f"syn-{i:04d}",
            user_input=item["user_input"],
            expected_response=item.get("expected_response_summary", ""),
            expected_tool_calls=item.get("expected_tool_calls", []),
            difficulty=item.get("difficulty", "medium"),
            tags=item.get("tags", []),
        ))
    return samples
Set the temperature high (0.8 to 1.0) for generation to maximize diversity. Then filter and validate the results. Synthetic data is a starting point — it fills the volume gap while you build out human-labeled gold sets.
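Before any human spot-check, a deterministic pass can reject the most common failure modes. The sketch below is a hypothetical `filter_synthetic_samples` helper that operates on the raw generated dicts; the minimum-length cutoff and the difficulty label set are illustrative assumptions.

```python
def filter_synthetic_samples(items: list[dict]) -> list[dict]:
    """Drop malformed or duplicate generated samples.

    A hypothetical quality gate: real pipelines usually add
    schema validation and near-duplicate detection on top.
    """
    seen_inputs: set[str] = set()
    valid_difficulties = {"easy", "medium", "hard"}
    kept = []
    for item in items:
        user_input = item.get("user_input", "").strip()
        # Reject empty or trivially short inputs
        if len(user_input) < 10:
            continue
        # Reject exact duplicates (case- and whitespace-insensitive)
        key = " ".join(user_input.lower().split())
        if key in seen_inputs:
            continue
        # Reject invalid difficulty labels
        if item.get("difficulty", "medium") not in valid_difficulties:
            continue
        seen_inputs.add(key)
        kept.append(item)
    return kept
```

Tracking how many samples each rule rejects per batch is a cheap signal for tightening the generation prompt.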
Human Annotation Pipeline
Human-labeled data is your ground truth. Design the annotation workflow to maximize consistency and minimize annotator fatigue.
@dataclass
class AnnotationTask:
    task_id: str
    conversation: list[dict]
    agent_response: str
    questions: list[dict]  # What to annotate


@dataclass
class Annotation:
    task_id: str
    annotator_id: str
    labels: dict
    confidence: float  # 0.0 to 1.0
    time_seconds: float
    notes: Optional[str] = None


class AnnotationPipeline:
    def __init__(self, min_annotators: int = 2):
        self.min_annotators = min_annotators
        self.tasks: list[AnnotationTask] = []
        self.annotations: list[Annotation] = []

    def add_task(self, task: AnnotationTask):
        self.tasks.append(task)

    def submit_annotation(self, annotation: Annotation):
        self.annotations.append(annotation)

    def get_consensus(self, task_id: str) -> Optional[dict]:
        task_annotations = [
            a for a in self.annotations if a.task_id == task_id
        ]
        if len(task_annotations) < self.min_annotators:
            return None
        # Majority vote per label field
        label_keys = task_annotations[0].labels.keys()
        consensus = {}
        for key in label_keys:
            values = [a.labels[key] for a in task_annotations]
            consensus[key] = max(set(values), key=values.count)
        # Agreement rate: fraction of fields where all annotators agree
        agreements = sum(
            1 for key in label_keys
            if len(set(a.labels[key] for a in task_annotations)) == 1
        )
        agreement_rate = agreements / len(label_keys)
        return {
            "task_id": task_id,
            "consensus_labels": consensus,
            "agreement_rate": round(agreement_rate, 3),
            "annotator_count": len(task_annotations),
            "avg_confidence": round(
                sum(a.confidence for a in task_annotations)
                / len(task_annotations),
                3,
            ),
        }
Require at least two annotators per sample to catch individual mistakes. When annotators disagree, route the sample to a third annotator or a domain expert. Samples with persistent disagreement often reveal genuinely ambiguous cases that deserve special handling in your evaluation.
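Raw agreement rate can look healthy even when one label dominates the dataset. Cohen's kappa corrects for chance agreement between two annotators; the sketch below is an illustrative implementation for a single categorical label, separate from the pipeline above.

```python
from collections import Counter


def cohens_kappa(labels_a: list, labels_b: list) -> float:
    """Chance-corrected agreement between two annotators on one
    categorical label: kappa = (p_o - p_e) / (1 - p_e)."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    # Observed agreement
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement from each annotator's label frequencies
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    p_e = sum(
        (counts_a[k] / n) * (counts_b[k] / n)
        for k in set(labels_a) | set(labels_b)
    )
    if p_e == 1.0:  # Both annotators used a single identical label
        return 1.0
    return (p_o - p_e) / (1 - p_e)
```

As a rule of thumb, kappa below roughly 0.6 suggests the annotation guidelines need revision before the labels are trusted as ground truth.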
Active Learning for Efficient Labeling
Label the samples that matter most — the ones your agent currently gets wrong or is uncertain about.
import random


class ActiveLearningSelector:
    def __init__(self, uncertainty_threshold: float = 0.6):
        self.threshold = uncertainty_threshold
        self.labeled: list[dict] = []
        self.unlabeled: list[dict] = []

    def add_unlabeled(self, samples: list[dict]):
        self.unlabeled.extend(samples)

    def score_uncertainty(self, sample: dict) -> float:
        """Score how uncertain the agent is on this sample.
        Higher = more valuable to label."""
        agent_confidence = sample.get("agent_confidence", 0.5)
        # Invert: low agent confidence = high labeling value
        uncertainty = 1.0 - agent_confidence
        # Boost novel patterns
        if sample.get("is_novel_pattern", False):
            uncertainty = min(1.0, uncertainty + 0.2)
        return uncertainty

    def select_batch(self, batch_size: int = 50) -> list[dict]:
        scored = [
            (self.score_uncertainty(s), s) for s in self.unlabeled
        ]
        # Mix: 70% highest uncertainty, 30% random
        scored.sort(key=lambda x: -x[0])
        n_uncertain = int(batch_size * 0.7)
        n_random = batch_size - n_uncertain
        selected = [s for _, s in scored[:n_uncertain]]
        remaining = [s for _, s in scored[n_uncertain:]]
        if remaining:
            selected.extend(
                random.sample(remaining, min(n_random, len(remaining)))
            )
        # Remove selected from the unlabeled pool
        # (assumes every sample carries a unique "id" field)
        selected_ids = {s.get("id") for s in selected}
        self.unlabeled = [
            s for s in self.unlabeled
            if s.get("id") not in selected_ids
        ]
        return selected
The 70/30 split between uncertain and random samples is important. Pure uncertainty sampling can create a biased dataset that only covers hard cases. The random component ensures your dataset still represents the full distribution of user requests.
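To make the bias concrete, here is a standalone toy comparison (not using the selector class above; pool sizes and confidences are made up for illustration). Pure uncertainty sampling fills the entire batch with hard cases, while the 70/30 mix reserves slots where easy cases can enter.

```python
import random


def pick_batch(pool: list[dict], batch_size: int,
               random_frac: float) -> list[dict]:
    """Take the most-uncertain samples first, fill the rest randomly."""
    ranked = sorted(pool, key=lambda s: s["agent_confidence"])
    n_uncertain = int(batch_size * (1 - random_frac))
    selected = ranked[:n_uncertain]
    rng = random.Random(42)  # Seeded for reproducibility
    selected += rng.sample(ranked[n_uncertain:], batch_size - n_uncertain)
    return selected


# Toy pool: 30 hard samples the agent is unsure about, 70 easy ones
pool = (
    [{"difficulty": "hard", "agent_confidence": 0.2}] * 30
    + [{"difficulty": "easy", "agent_confidence": 0.9}] * 70
)
pure = pick_batch(pool, batch_size=20, random_frac=0.0)
mixed = pick_batch(pool, batch_size=20, random_frac=0.3)
```

Every sample in `pure` is hard; in `mixed`, only the first 14 slots are guaranteed hard, and the random tail gives easy cases a chance to enter the batch.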
Dataset Versioning and Quality Control
Track every change to your dataset so evaluation results are always reproducible.
import hashlib
from datetime import datetime, timezone


@dataclass
class DatasetVersion:
    version: str
    fingerprint: str
    sample_count: int
    created_at: str
    parent_version: Optional[str] = None
    changes: list[str] = field(default_factory=list)


class VersionedDataset:
    def __init__(self, name: str):
        self.name = name
        self.samples: list[dict] = []
        self.versions: list[DatasetVersion] = []

    def fingerprint(self) -> str:
        content = json.dumps(self.samples, sort_keys=True)
        return hashlib.sha256(content.encode()).hexdigest()[:12]

    def commit(
        self, version: str, changes: list[str]
    ) -> DatasetVersion:
        parent = (
            self.versions[-1].version if self.versions else None
        )
        v = DatasetVersion(
            version=version,
            fingerprint=self.fingerprint(),
            sample_count=len(self.samples),
            created_at=datetime.now(timezone.utc).isoformat(),
            parent_version=parent,
            changes=changes,
        )
        self.versions.append(v)
        return v

    def quality_report(self) -> dict:
        tags_coverage = set()
        difficulties = {"easy": 0, "medium": 0, "hard": 0}
        for sample in self.samples:
            tags_coverage.update(sample.get("tags", []))
            diff = sample.get("difficulty", "medium")
            difficulties[diff] = difficulties.get(diff, 0) + 1
        return {
            "total_samples": len(self.samples),
            "unique_tags": len(tags_coverage),
            "difficulty_distribution": difficulties,
            "fingerprint": self.fingerprint(),
            "version": (
                self.versions[-1].version
                if self.versions else "uncommitted"
            ),
        }
Always reference the dataset fingerprint alongside evaluation results. When a score changes, you can immediately determine whether it was caused by a model change or a dataset change.
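A minimal sketch of that convention, assuming results are stored as plain dicts: compute the fingerprint at evaluation time, using the same canonical-JSON hash as the `VersionedDataset` class above, and embed it in every results record so later dataset drift is detectable.

```python
import hashlib
import json


def dataset_fingerprint(samples: list[dict]) -> str:
    """Content hash over a canonical JSON serialization."""
    content = json.dumps(samples, sort_keys=True)
    return hashlib.sha256(content.encode()).hexdigest()[:12]


def record_result(samples: list[dict], score: float,
                  model: str) -> dict:
    """Attach the dataset fingerprint to an evaluation result."""
    return {
        "model": model,
        "score": score,
        "dataset_fingerprint": dataset_fingerprint(samples),
        "sample_count": len(samples),
    }
```

Two result records are only comparable when their fingerprints match; a mismatch means the dataset changed between runs.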
FAQ
How many samples do I need for a reliable evaluation dataset?
Aim for at least 50 samples per capability or task type. For statistical significance when comparing two models, you need 200 or more samples per comparison. Start with synthetic generation to reach volume, then replace low-quality synthetic samples with human-labeled ones over time. A 500-sample dataset that is 30 percent human-labeled and 70 percent high-quality synthetic is a strong starting point.
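The 200-sample figure can be sanity-checked with a quick standard-error calculation (a back-of-the-envelope sketch, assuming independent samples and a pass rate near 50 percent, where variance is highest).

```python
import math


def pass_rate_ci_halfwidth(p: float, n: int, z: float = 1.96) -> float:
    """Approximate 95% confidence half-width for a pass rate
    measured on n independent samples (normal approximation)."""
    return z * math.sqrt(p * (1 - p) / n)


# At p = 0.5, the worst case:
# n = 50  -> about +/- 0.139 (14 points of noise)
# n = 200 -> about +/- 0.069 (7 points)
# n = 800 -> about +/- 0.035 (3.5 points)
```

Halving the noise requires quadrupling the sample count, which is why per-capability sample budgets matter more than one large undifferentiated pool.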
How do I detect and remove bad synthetic samples?
Run three quality filters. First, a deterministic filter that catches formatting issues, empty fields, and duplicate inputs. Second, a self-consistency check where you generate the same task twice with different seeds and compare — inconsistent outputs suggest an underspecified prompt. Third, a human spot-check on 10 percent of each generated batch. Track the rejection rate to improve your generation prompts.
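For the first filter, exact-match deduplication misses paraphrased repeats; a cheap token-overlap check catches many of them. The sketch below uses Jaccard similarity over word sets, with an illustrative 0.8 threshold.

```python
def token_jaccard(a: str, b: str) -> float:
    """Word-set overlap between two inputs, 0.0 to 1.0."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def drop_near_duplicates(inputs: list[str],
                         threshold: float = 0.8) -> list[str]:
    """Keep each input only if it is not too similar to one
    already kept. O(n^2), fine for eval-sized datasets."""
    kept: list[str] = []
    for text in inputs:
        if all(token_jaccard(text, k) < threshold for k in kept):
            kept.append(text)
    return kept
```

Word-set overlap ignores word order, so reordered paraphrases of the same request collapse to a single sample; embedding-based similarity is the usual next step when this is too coarse.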
When should I create a new dataset version versus modifying the existing one?
Create a new version whenever you add more than 10 percent new samples, remove samples, change annotation guidelines, or fix systematic labeling errors. For small additions (under 10 percent), append and increment a minor version. Always preserve old versions so you can re-run evaluations against them for trend analysis.
Written by
CallSphere Team
Expert insights on AI voice agents and customer communication automation.