
Evaluation Datasets for AI Agents: Building Ground Truth for Automated Testing

Learn how to design, label, and maintain evaluation datasets for AI agents, covering dataset structure, diversity requirements, edge cases, and ongoing maintenance strategies.

Why Evaluation Datasets Are the Foundation of Agent Quality

An AI agent without an evaluation dataset is like a web service without tests — you only discover problems after users report them. Evaluation datasets provide ground truth: curated input-output pairs that define what correct behavior looks like. They enable automated regression testing, prompt comparison, and model migration decisions.

The difference between a toy eval set and a production-grade one is coverage, labeling quality, and maintenance discipline. This guide walks through building eval datasets that actually catch real problems.

Dataset Structure

An eval dataset is a collection of test cases, each containing an input, the expected behavior, and metadata for slicing results.

from dataclasses import dataclass, field
from enum import Enum
from typing import Optional

class Difficulty(str, Enum):
    EASY = "easy"
    MEDIUM = "medium"
    HARD = "hard"

class Category(str, Enum):
    TOOL_USE = "tool_use"
    REASONING = "reasoning"
    REFUSAL = "refusal"
    MULTI_STEP = "multi_step"

@dataclass
class EvalCase:
    id: str
    input_text: str
    expected_output: str
    expected_tool_calls: list[str] = field(default_factory=list)
    category: Category = Category.REASONING
    difficulty: Difficulty = Difficulty.MEDIUM
    tags: list[str] = field(default_factory=list)
    notes: Optional[str] = None

Store eval cases in a structured format — JSON Lines works well because you can append new cases without rewriting the file.

import json
from pathlib import Path

def save_eval_dataset(cases: list[EvalCase], path: Path) -> None:
    # One JSON object per line; the str-backed enums serialize as plain strings.
    with open(path, "w") as f:
        for case in cases:
            f.write(json.dumps(vars(case)) + "\n")

def load_eval_dataset(path: Path) -> list[EvalCase]:
    cases = []
    with open(path) as f:
        for line in f:
            data = json.loads(line)
            # Rehydrate enum fields from their string values.
            data["category"] = Category(data["category"])
            data["difficulty"] = Difficulty(data["difficulty"])
            cases.append(EvalCase(**data))
    return cases
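The main advantage of JSON Lines is cheap appends. A minimal sketch of the append pattern, using a plain dict whose keys mirror the EvalCase fields above (the file name is illustrative):

```python
import json
from pathlib import Path

def append_eval_case(case_dict: dict, path: Path) -> None:
    # Open in append mode: one serialized case per line, no full-file rewrite.
    with open(path, "a") as f:
        f.write(json.dumps(case_dict) + "\n")

path = Path("evals.jsonl")
append_eval_case({
    "id": "case-001",
    "input_text": "What is the refund policy?",
    "expected_output": "Summarizes the 30-day refund policy",
    "expected_tool_calls": ["lookup_policy"],
    "category": "tool_use",
    "difficulty": "easy",
    "tags": ["billing"],
    "notes": None,
}, path)
```

This makes it easy for multiple contributors to add cases without merge conflicts on a single large file.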

Designing for Diversity

A common mistake is building eval sets that only test the happy path. Effective datasets cover three dimensions of diversity — intent type, input variation, and expected behavior — with several variants in each.

DIVERSITY_CHECKLIST = {
    "intent_types": [
        "simple_question",      # "What is X?"
        "multi_step_task",      # "Find X, then do Y with it"
        "ambiguous_request",    # "Help me with the thing"
        "out_of_scope",         # "Write me a poem" (if agent is task-specific)
        "adversarial",          # Prompt injection attempts
    ],
    "input_variations": [
        "formal_english",
        "casual_with_typos",
        "non_english",
        "very_long_input",
        "empty_or_minimal",
    ],
    "expected_behaviors": [
        "direct_answer",
        "tool_call",
        "clarifying_question",
        "polite_refusal",
        "multi_tool_chain",
    ],
}

from collections import Counter

def audit_coverage(cases: list[EvalCase]) -> dict:
    """Check which categories and difficulties are represented."""
    return {
        "categories": dict(Counter(c.category.value for c in cases)),
        "difficulties": dict(Counter(c.difficulty.value for c in cases)),
        "total": len(cases),
    }
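Raw counts are only half the story; the useful signal is which slices fall below a minimum. A sketch of gap detection on plain dicts (the field name and threshold are illustrative, not a fixed schema):

```python
from collections import Counter

def find_coverage_gaps(cases: list[dict], expected_categories: list[str],
                       min_per_category: int = 5) -> dict:
    # Count cases per category, then report any category below the floor.
    counts = Counter(c["category"] for c in cases)
    return {
        cat: counts.get(cat, 0)
        for cat in expected_categories
        if counts.get(cat, 0) < min_per_category
    }

cases = [{"category": "tool_use"}] * 6 + [{"category": "refusal"}] * 2
gaps = find_coverage_gaps(cases, ["tool_use", "refusal", "multi_step"])
```

Running the gap report in CI turns "we should add more refusal cases" from a vague intention into a failing check.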

Labeling Best Practices

Ground truth labels must be unambiguous. For open-ended outputs, use criteria-based labels instead of exact strings.

@dataclass
class CriteriaLabel:
    """Define correctness as a checklist rather than an exact string."""
    must_contain: list[str] = field(default_factory=list)
    must_not_contain: list[str] = field(default_factory=list)
    expected_tool: Optional[str] = None
    min_length: int = 0
    max_length: int = 10_000

    def evaluate(self, output: str, tool_calls: list[str]) -> dict:
        results = {}
        results["contains_required"] = all(
            kw.lower() in output.lower() for kw in self.must_contain
        )
        results["avoids_forbidden"] = not any(
            kw.lower() in output.lower() for kw in self.must_not_contain
        )
        results["correct_tool"] = (
            self.expected_tool in tool_calls if self.expected_tool else True
        )
        results["length_ok"] = self.min_length <= len(output) <= self.max_length
        results["pass"] = all(results.values())
        return results
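Per-case pass/fail results become actionable when sliced by the metadata attached to each case. A minimal sketch (field names are assumptions) that rolls evaluate() outputs plus case metadata into per-category pass rates:

```python
from collections import defaultdict

def pass_rate_by_category(results: list[dict]) -> dict:
    # results: one dict per case, e.g. {"category": "tool_use", "pass": True}
    buckets: dict = defaultdict(lambda: [0, 0])  # category -> [passed, total]
    for r in results:
        buckets[r["category"]][1] += 1
        buckets[r["category"]][0] += int(r["pass"])
    return {cat: passed / total for cat, (passed, total) in buckets.items()}

results = [
    {"category": "tool_use", "pass": True},
    {"category": "tool_use", "pass": False},
    {"category": "refusal", "pass": True},
]
rates = pass_rate_by_category(results)
```

An aggregate score can hide a regression in one slice; per-category rates surface, say, a drop in refusal behavior even when overall accuracy holds steady.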

Maintaining Eval Datasets Over Time

Eval datasets rot when your agent's capabilities change but the dataset does not. Schedule quarterly reviews.

from datetime import datetime

@dataclass
class EvalMetadata:
    created: str
    last_reviewed: str
    owner: str
    version: int = 1

    def needs_review(self, review_interval_days: int = 90) -> bool:
        last = datetime.fromisoformat(self.last_reviewed)
        return (datetime.now() - last).days > review_interval_days

Add new cases from production failures — every bug report is a potential eval case, already labeled with a real input and a known-wrong output. Remove cases that no longer represent valid behavior.
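Converting a failure into an eval case can be a one-function pipeline. A sketch, assuming a hypothetical bug-record schema (every field name here is illustrative):

```python
def failure_to_eval_case(failure: dict) -> dict:
    # failure: a production bug record; the schema here is hypothetical.
    return {
        "id": f"prod-{failure['ticket_id']}",
        "input_text": failure["user_input"],
        "expected_output": failure["corrected_output"],  # human-reviewed fix
        "expected_tool_calls": failure.get("expected_tools", []),
        "category": "reasoning",
        "difficulty": "hard",  # production escapes tend to be hard cases
        "tags": ["from_production"],
        "notes": f"source ticket {failure['ticket_id']}",
    }

case = failure_to_eval_case({
    "ticket_id": "1042",
    "user_input": "Cancel my order and refund me",
    "corrected_output": "Calls cancel_order, then issue_refund",
    "expected_tools": ["cancel_order", "issue_refund"],
})
```

Tagging these cases (here with "from_production") lets you track separately whether the agent keeps passing the cases it once failed.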

FAQ

How many eval cases do I need?

Start with 50-100 cases that cover your major use cases and known edge cases. Grow the dataset over time by adding cases from production failures. Quality and diversity matter more than raw count.

Should I use synthetic data to generate eval cases?

Synthetic generation with an LLM is useful for initial dataset bootstrapping, but always have a human review and correct the labels. LLM-generated ground truth inherits the model's biases and errors.

How do I handle eval cases where multiple answers are correct?

Use criteria-based labels (must contain certain keywords, must call certain tools) instead of exact string matching. This accommodates valid variation in phrasing while still catching incorrect behavior.


#Evaluation #Datasets #AIAgents #GroundTruth #Testing #Python #AgenticAI #LearnAI #AIEngineering
