Building Evaluation Datasets: Synthetic Generation, Human Labeling, and Active Learning
A practical guide to creating high-quality evaluation datasets for AI agents using synthetic data generation, human annotation pipelines, active learning for efficient labeling, and dataset versioning strategies.
The Dataset Is the Evaluation
Your evaluation is only as good as your dataset. A perfect scoring pipeline running against a biased or unrepresentative dataset gives you false confidence. Building evaluation datasets for AI agents is particularly challenging because agent interactions are multi-turn, involve tool calls, and have complex success criteria that go beyond simple text matching.
This guide covers three complementary approaches: synthetic generation for scale, human labeling for quality, and active learning for efficiency. Used together, they give you a dataset that is large enough for statistical reliability, accurate enough for trust, and continuously improving as your agent evolves.
Synthetic Dataset Generation
Use an LLM to generate diverse evaluation samples at scale. The key is generating both the user inputs and the expected agent behavior.
import json
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SyntheticSample:
    sample_id: str
    user_input: str
    expected_response: str
    expected_tool_calls: list[dict] = field(default_factory=list)
    difficulty: str = "medium"
    tags: list[str] = field(default_factory=list)
    generated_by: str = "synthetic"


async def generate_synthetic_samples(
    llm_client,
    task_description: str,
    tool_definitions: list[dict],
    count: int = 20,
    difficulties: Optional[list[str]] = None,
) -> list[SyntheticSample]:
    difficulties = difficulties or ["easy", "medium", "hard"]
    tools_text = json.dumps(tool_definitions, indent=2)
    prompt = f"""Generate {count} diverse evaluation samples for
an AI agent with the following task and available tools.

## Task Description
{task_description}

## Available Tools
{tools_text}

For each sample, generate:
1. A realistic user input message
2. The expected agent response (or key points)
3. Expected tool calls with parameters
4. Difficulty level: {difficulties}
5. Tags describing the capability tested

Vary the samples across:
- Different user phrasings and communication styles
- Edge cases and unusual requests
- Multi-step and single-step tasks
- Clear and ambiguous intents

Return a JSON object with a "samples" array:
{{
  "samples": [
    {{
      "user_input": "...",
      "expected_response_summary": "...",
      "expected_tool_calls": [{{"name": "...", "params": {{}}}}],
      "difficulty": "easy|medium|hard",
      "tags": ["..."]
    }}
  ]
}}"""
    response = await llm_client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        # json_object mode requires a top-level object, hence the
        # {"samples": [...]} wrapper requested in the prompt
        response_format={"type": "json_object"},
        temperature=0.9,
    )
    raw = json.loads(response.choices[0].message.content)
    items = raw if isinstance(raw, list) else raw.get("samples", [])
    samples = []
    for i, item in enumerate(items):
        samples.append(SyntheticSample(
            sample_id=f"syn-{i:04d}",
            user_input=item["user_input"],
            expected_response=item.get("expected_response_summary", ""),
            expected_tool_calls=item.get("expected_tool_calls", []),
            difficulty=item.get("difficulty", "medium"),
            tags=item.get("tags", []),
        ))
    return samples
Set the temperature high (0.8 to 1.0) for generation to maximize diversity. Then filter and validate the results. Synthetic data is a starting point — it fills the volume gap while you build out human-labeled gold sets.
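Before any human spot-check, a deterministic pass can reject the most common failure modes. The sketch below is a hypothetical `filter_synthetic_samples` helper that operates on the raw generated dicts; the minimum-length cutoff and the difficulty label set are illustrative assumptions.

```python
def filter_synthetic_samples(items: list[dict]) -> list[dict]:
    """Drop malformed or duplicate generated samples.

    A hypothetical quality gate: real pipelines usually add
    schema validation and near-duplicate detection on top.
    """
    seen_inputs: set[str] = set()
    valid_difficulties = {"easy", "medium", "hard"}
    kept = []
    for item in items:
        user_input = item.get("user_input", "").strip()
        # Reject empty or trivially short inputs
        if len(user_input) < 10:
            continue
        # Reject exact duplicates (case- and whitespace-insensitive)
        key = " ".join(user_input.lower().split())
        if key in seen_inputs:
            continue
        # Reject invalid difficulty labels
        if item.get("difficulty", "medium") not in valid_difficulties:
            continue
        seen_inputs.add(key)
        kept.append(item)
    return kept
```

Tracking how many samples each rule rejects per batch is a cheap signal for tightening the generation prompt.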
Human Annotation Pipeline
Human-labeled data is your ground truth. Design the annotation workflow to maximize consistency and minimize annotator fatigue.
@dataclass
class AnnotationTask:
    task_id: str
    conversation: list[dict]
    agent_response: str
    questions: list[dict]  # What to annotate


@dataclass
class Annotation:
    task_id: str
    annotator_id: str
    labels: dict
    confidence: float  # 0.0 to 1.0
    time_seconds: float
    notes: Optional[str] = None


class AnnotationPipeline:
    def __init__(self, min_annotators: int = 2):
        self.min_annotators = min_annotators
        self.tasks: list[AnnotationTask] = []
        self.annotations: list[Annotation] = []

    def add_task(self, task: AnnotationTask):
        self.tasks.append(task)

    def submit_annotation(self, annotation: Annotation):
        self.annotations.append(annotation)

    def get_consensus(self, task_id: str) -> Optional[dict]:
        task_annotations = [
            a for a in self.annotations if a.task_id == task_id
        ]
        if len(task_annotations) < self.min_annotators:
            return None
        # Majority vote per label field
        label_keys = task_annotations[0].labels.keys()
        consensus = {}
        for key in label_keys:
            values = [a.labels[key] for a in task_annotations]
            consensus[key] = max(set(values), key=values.count)
        # Agreement rate: fraction of fields where all annotators agree
        agreements = sum(
            1 for key in label_keys
            if len(set(a.labels[key] for a in task_annotations)) == 1
        )
        agreement_rate = agreements / len(label_keys)
        return {
            "task_id": task_id,
            "consensus_labels": consensus,
            "agreement_rate": round(agreement_rate, 3),
            "annotator_count": len(task_annotations),
            "avg_confidence": round(
                sum(a.confidence for a in task_annotations)
                / len(task_annotations),
                3,
            ),
        }
Require at least two annotators per sample to catch individual mistakes. When annotators disagree, route the sample to a third annotator or a domain expert. Samples with persistent disagreement often reveal genuinely ambiguous cases that deserve special handling in your evaluation.
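Raw agreement rate can look healthy even when one label dominates the dataset. Cohen's kappa corrects for chance agreement between two annotators; the sketch below is an illustrative implementation for a single categorical label, separate from the pipeline above.

```python
from collections import Counter


def cohens_kappa(labels_a: list, labels_b: list) -> float:
    """Chance-corrected agreement between two annotators on one
    categorical label: kappa = (p_o - p_e) / (1 - p_e)."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    # Observed agreement
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement from each annotator's label frequencies
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    p_e = sum(
        (counts_a[k] / n) * (counts_b[k] / n)
        for k in set(labels_a) | set(labels_b)
    )
    if p_e == 1.0:  # Both annotators used a single identical label
        return 1.0
    return (p_o - p_e) / (1 - p_e)
```

As a rule of thumb, kappa below roughly 0.6 suggests the annotation guidelines need revision before the labels are trusted as ground truth.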
Active Learning for Efficient Labeling
Label the samples that matter most — the ones your agent currently gets wrong or is uncertain about.
import random


class ActiveLearningSelector:
    def __init__(self, uncertainty_threshold: float = 0.6):
        self.threshold = uncertainty_threshold
        self.labeled: list[dict] = []
        self.unlabeled: list[dict] = []

    def add_unlabeled(self, samples: list[dict]):
        self.unlabeled.extend(samples)

    def score_uncertainty(self, sample: dict) -> float:
        """Score how uncertain the agent is on this sample.
        Higher = more valuable to label."""
        agent_confidence = sample.get("agent_confidence", 0.5)
        # Invert: low agent confidence = high labeling value
        uncertainty = 1.0 - agent_confidence
        # Boost novel patterns
        if sample.get("is_novel_pattern", False):
            uncertainty = min(1.0, uncertainty + 0.2)
        return uncertainty

    def select_batch(self, batch_size: int = 50) -> list[dict]:
        scored = [
            (self.score_uncertainty(s), s) for s in self.unlabeled
        ]
        # Mix: 70% highest uncertainty, 30% random
        scored.sort(key=lambda x: -x[0])
        n_uncertain = int(batch_size * 0.7)
        n_random = batch_size - n_uncertain
        selected = [s for _, s in scored[:n_uncertain]]
        remaining = [s for _, s in scored[n_uncertain:]]
        if remaining:
            selected.extend(
                random.sample(remaining, min(n_random, len(remaining)))
            )
        # Remove selected from the unlabeled pool
        # (assumes every sample carries a unique "id" field)
        selected_ids = {s.get("id") for s in selected}
        self.unlabeled = [
            s for s in self.unlabeled
            if s.get("id") not in selected_ids
        ]
        return selected
The 70/30 split between uncertain and random samples is important. Pure uncertainty sampling can create a biased dataset that only covers hard cases. The random component ensures your dataset still represents the full distribution of user requests.
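To make the bias concrete, here is a standalone toy comparison (not using the selector class above; pool sizes and confidences are made up for illustration). Pure uncertainty sampling fills the entire batch with hard cases, while the 70/30 mix reserves slots where easy cases can enter.

```python
import random


def pick_batch(pool: list[dict], batch_size: int,
               random_frac: float) -> list[dict]:
    """Take the most-uncertain samples first, fill the rest randomly."""
    ranked = sorted(pool, key=lambda s: s["agent_confidence"])
    n_uncertain = int(batch_size * (1 - random_frac))
    selected = ranked[:n_uncertain]
    rng = random.Random(42)  # Seeded for reproducibility
    selected += rng.sample(ranked[n_uncertain:], batch_size - n_uncertain)
    return selected


# Toy pool: 30 hard samples the agent is unsure about, 70 easy ones
pool = (
    [{"difficulty": "hard", "agent_confidence": 0.2}] * 30
    + [{"difficulty": "easy", "agent_confidence": 0.9}] * 70
)
pure = pick_batch(pool, batch_size=20, random_frac=0.0)
mixed = pick_batch(pool, batch_size=20, random_frac=0.3)
```

Every sample in `pure` is hard; in `mixed`, only the first 14 slots are guaranteed hard, and the random tail gives easy cases a chance to enter the batch.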
Dataset Versioning and Quality Control
Track every change to your dataset so evaluation results are always reproducible.
import hashlib
from datetime import datetime, timezone


@dataclass
class DatasetVersion:
    version: str
    fingerprint: str
    sample_count: int
    created_at: str
    parent_version: Optional[str] = None
    changes: list[str] = field(default_factory=list)


class VersionedDataset:
    def __init__(self, name: str):
        self.name = name
        self.samples: list[dict] = []
        self.versions: list[DatasetVersion] = []

    def fingerprint(self) -> str:
        content = json.dumps(self.samples, sort_keys=True)
        return hashlib.sha256(content.encode()).hexdigest()[:12]

    def commit(
        self, version: str, changes: list[str]
    ) -> DatasetVersion:
        parent = (
            self.versions[-1].version if self.versions else None
        )
        v = DatasetVersion(
            version=version,
            fingerprint=self.fingerprint(),
            sample_count=len(self.samples),
            created_at=datetime.now(timezone.utc).isoformat(),
            parent_version=parent,
            changes=changes,
        )
        self.versions.append(v)
        return v

    def quality_report(self) -> dict:
        tags_coverage = set()
        difficulties = {"easy": 0, "medium": 0, "hard": 0}
        for sample in self.samples:
            tags_coverage.update(sample.get("tags", []))
            diff = sample.get("difficulty", "medium")
            difficulties[diff] = difficulties.get(diff, 0) + 1
        return {
            "total_samples": len(self.samples),
            "unique_tags": len(tags_coverage),
            "difficulty_distribution": difficulties,
            "fingerprint": self.fingerprint(),
            "version": (
                self.versions[-1].version
                if self.versions else "uncommitted"
            ),
        }
Always reference the dataset fingerprint alongside evaluation results. When a score changes, you can immediately determine whether it was caused by a model change or a dataset change.
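A minimal sketch of that convention, assuming results are stored as plain dicts: compute the fingerprint at evaluation time, using the same canonical-JSON hash as the `VersionedDataset` class above, and embed it in every results record so later dataset drift is detectable.

```python
import hashlib
import json


def dataset_fingerprint(samples: list[dict]) -> str:
    """Content hash over a canonical JSON serialization."""
    content = json.dumps(samples, sort_keys=True)
    return hashlib.sha256(content.encode()).hexdigest()[:12]


def record_result(samples: list[dict], score: float,
                  model: str) -> dict:
    """Attach the dataset fingerprint to an evaluation result."""
    return {
        "model": model,
        "score": score,
        "dataset_fingerprint": dataset_fingerprint(samples),
        "sample_count": len(samples),
    }
```

Two result records are only comparable when their fingerprints match; a mismatch means the dataset changed between runs.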
FAQ
How many samples do I need for a reliable evaluation dataset?
Aim for at least 50 samples per capability or task type. For statistical significance when comparing two models, you need 200 or more samples per comparison. Start with synthetic generation to reach volume, then replace low-quality synthetic samples with human-labeled ones over time. A 500-sample dataset that is 30 percent human-labeled and 70 percent high-quality synthetic is a strong starting point.
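The 200-sample figure can be sanity-checked with a quick standard-error calculation (a back-of-the-envelope sketch, assuming independent samples and a pass rate near 50 percent, where variance is highest).

```python
import math


def pass_rate_ci_halfwidth(p: float, n: int, z: float = 1.96) -> float:
    """Approximate 95% confidence half-width for a pass rate
    measured on n independent samples (normal approximation)."""
    return z * math.sqrt(p * (1 - p) / n)


# At p = 0.5, the worst case:
# n = 50  -> about +/- 0.139 (14 points of noise)
# n = 200 -> about +/- 0.069 (7 points)
# n = 800 -> about +/- 0.035 (3.5 points)
```

Halving the noise requires quadrupling the sample count, which is why per-capability sample budgets matter more than one large undifferentiated pool.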
How do I detect and remove bad synthetic samples?
Run three quality filters. First, a deterministic filter that catches formatting issues, empty fields, and duplicate inputs. Second, a self-consistency check where you generate the same task twice with different seeds and compare — inconsistent outputs suggest an underspecified prompt. Third, a human spot-check on 10 percent of each generated batch. Track the rejection rate to improve your generation prompts.
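For the first filter, exact-match deduplication misses paraphrased repeats; a cheap token-overlap check catches many of them. The sketch below uses Jaccard similarity over word sets, with an illustrative 0.8 threshold.

```python
def token_jaccard(a: str, b: str) -> float:
    """Word-set overlap between two inputs, 0.0 to 1.0."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def drop_near_duplicates(inputs: list[str],
                         threshold: float = 0.8) -> list[str]:
    """Keep each input only if it is not too similar to one
    already kept. O(n^2), fine for eval-sized datasets."""
    kept: list[str] = []
    for text in inputs:
        if all(token_jaccard(text, k) < threshold for k in kept):
            kept.append(text)
    return kept
```

Word-set overlap ignores word order, so reordered paraphrases of the same request collapse to a single sample; embedding-based similarity is the usual next step when this is too coarse.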
When should I create a new dataset version versus modifying the existing one?
Create a new version whenever you add more than 10 percent new samples, remove samples, change annotation guidelines, or fix systematic labeling errors. For small additions (under 10 percent), append and increment a minor version. Always preserve old versions so you can re-run evaluations against them for trend analysis.
Written by
CallSphere Team
Expert insights on AI voice agents and customer communication automation.