Evaluation Datasets for AI Agents: Building Ground Truth for Automated Testing
Learn how to design, label, and maintain evaluation datasets for AI agents, covering dataset structure, diversity requirements, edge cases, and ongoing maintenance strategies.
Why Evaluation Datasets Are the Foundation of Agent Quality
An AI agent without an evaluation dataset is like a web service without tests — you only discover problems after users report them. Evaluation datasets provide ground truth: curated input-output pairs that define what correct behavior looks like. They enable automated regression testing, prompt comparison, and model migration decisions.
The difference between a toy eval set and a production-grade one is coverage, labeling quality, and maintenance discipline. This guide walks through building eval datasets that actually catch real problems.
Dataset Structure
An eval dataset is a collection of test cases, each containing an input, the expected behavior, and metadata for slicing results.
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional

class Difficulty(str, Enum):
    EASY = "easy"
    MEDIUM = "medium"
    HARD = "hard"

class Category(str, Enum):
    TOOL_USE = "tool_use"
    REASONING = "reasoning"
    REFUSAL = "refusal"
    MULTI_STEP = "multi_step"

@dataclass
class EvalCase:
    """A single test case: the input, the expected behavior, and metadata for slicing results."""
    id: str
    input_text: str
    expected_output: str
    expected_tool_calls: list[str] = field(default_factory=list)
    category: Category = Category.REASONING
    difficulty: Difficulty = Difficulty.MEDIUM
    tags: list[str] = field(default_factory=list)
    notes: Optional[str] = None
Store eval cases in a structured format — JSON Lines works well because you can append new cases without rewriting the file.
import json
from pathlib import Path

def save_eval_dataset(cases: list[EvalCase], path: Path) -> None:
    """Write eval cases to a JSON Lines file, one case per line."""
    with open(path, "w") as f:
        for case in cases:
            f.write(json.dumps(vars(case)) + "\n")

def load_eval_dataset(path: Path) -> list[EvalCase]:
    """Read eval cases back, restoring the enum fields."""
    cases = []
    with open(path) as f:
        for line in f:
            data = json.loads(line)
            data["category"] = Category(data["category"])
            data["difficulty"] = Difficulty(data["difficulty"])
            cases.append(EvalCase(**data))
    return cases
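To turn the dataset into an automated regression test, iterate over the cases and compare the agent's output against the expected behavior. The sketch below assumes a hypothetical run_agent callable that takes the input text and returns the agent's response string; the exact-string comparison is only a placeholder for whatever grading you actually use.

def run_regression(dataset_path: Path, run_agent) -> dict:
    """Run every eval case through the agent and report pass/fail counts.

    `run_agent` is a hypothetical callable: it takes the input text
    and returns the agent's response as a string.
    """
    cases = load_eval_dataset(dataset_path)
    results = {"passed": 0, "failed": [], "total": len(cases)}
    for case in cases:
        output = run_agent(case.input_text)
        # Exact match is a placeholder; swap in criteria-based grading
        # (see the criteria labels later in this guide) for open-ended outputs.
        if output.strip() == case.expected_output.strip():
            results["passed"] += 1
        else:
            results["failed"].append(case.id)
    return results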
Designing for Diversity
A common mistake is building eval sets that only test the happy path. Effective datasets vary their cases along several dimensions: the intent behind the request, how the input is phrased, and the behavior you expect in response. The checklist below summarizes these axes.
DIVERSITY_CHECKLIST = {
    "intent_types": [
        "simple_question",    # "What is X?"
        "multi_step_task",    # "Find X, then do Y with it"
        "ambiguous_request",  # "Help me with the thing"
        "out_of_scope",       # "Write me a poem" (if agent is task-specific)
        "adversarial",        # Prompt injection attempts
    ],
    "input_variations": [
        "formal_english",
        "casual_with_typos",
        "non_english",
        "very_long_input",
        "empty_or_minimal",
    ],
    "expected_behaviors": [
        "direct_answer",
        "tool_call",
        "clarifying_question",
        "polite_refusal",
        "multi_tool_chain",
    ],
}
def audit_coverage(cases: list[EvalCase]) -> dict:
    """Check which categories and difficulties are represented."""
    coverage = {
        "categories": {},
        "difficulties": {},
        "total": len(cases),
    }
    for case in cases:
        coverage["categories"][case.category.value] = (
            coverage["categories"].get(case.category.value, 0) + 1
        )
        coverage["difficulties"][case.difficulty.value] = (
            coverage["difficulties"].get(case.difficulty.value, 0) + 1
        )
    return coverage
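As a rough illustration of how the audit might run in CI, the snippet below flags categories that fall below a minimum case count. The threshold is an arbitrary value chosen for the example, not a recommendation.

MIN_CASES_PER_CATEGORY = 5  # arbitrary threshold for illustration

def find_coverage_gaps(cases: list[EvalCase]) -> list[str]:
    """Return the names of categories with too few eval cases."""
    counts = audit_coverage(cases)["categories"]
    return [
        cat.value
        for cat in Category
        if counts.get(cat.value, 0) < MIN_CASES_PER_CATEGORY
    ]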
Labeling Best Practices
Ground truth labels must be unambiguous. For open-ended outputs, use criteria-based labels instead of exact strings.
@dataclass
class CriteriaLabel:
    """Define correctness as a checklist rather than an exact string."""
    must_contain: list[str] = field(default_factory=list)
    must_not_contain: list[str] = field(default_factory=list)
    expected_tool: Optional[str] = None
    min_length: int = 0
    max_length: int = 10_000

    def evaluate(self, output: str, tool_calls: list[str]) -> dict:
        results = {}
        results["contains_required"] = all(
            kw.lower() in output.lower() for kw in self.must_contain
        )
        results["avoids_forbidden"] = not any(
            kw.lower() in output.lower() for kw in self.must_not_contain
        )
        results["correct_tool"] = (
            self.expected_tool in tool_calls if self.expected_tool else True
        )
        results["length_ok"] = self.min_length <= len(output) <= self.max_length
        results["pass"] = all(results.values())
        return results
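For example, a case that expects the agent to look up an order through a tool might be labeled as below. The keywords, tool name, and sample output are illustrative, not part of any real agent.

label = CriteriaLabel(
    must_contain=["order", "shipped"],
    must_not_contain=["I cannot help"],
    expected_tool="lookup_order",  # illustrative tool name
    min_length=20,
)

# The output and tool-call list would come from an actual agent run.
result = label.evaluate(
    "Your order #1234 shipped yesterday and should arrive Friday.",
    ["lookup_order"],
)
print(result["pass"])  # True only if every criterion is met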
Maintaining Eval Datasets Over Time
Eval datasets rot when your agent's capabilities change but the dataset does not. Schedule quarterly reviews.
from datetime import datetime

@dataclass
class EvalMetadata:
    """Tracks ownership and review history for an eval dataset."""
    created: str
    last_reviewed: str
    owner: str
    version: int = 1

    def needs_review(self, review_interval_days: int = 90) -> bool:
        last = datetime.fromisoformat(self.last_reviewed)
        return (datetime.now() - last).days > review_interval_days
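A review check like this can then run as part of the regular eval job; the dates and owner below are placeholders.

meta = EvalMetadata(
    created="2024-01-15",
    last_reviewed="2024-06-01",
    owner="agent-quality-team",
)
if meta.needs_review():
    print("Eval dataset is overdue for review.")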
Add new cases from production failures — every bug report is a potential eval case. Remove cases that no longer represent valid behavior.
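One lightweight way to capture a failure is a helper that turns the bug report into a new case and appends it to the JSONL file. The function and field names here are assumptions about what your triage process produces, not a fixed interface.

def add_case_from_failure(
    path: Path,
    failure_input: str,
    corrected_output: str,
    case_id: str,
    category: Category = Category.REASONING,
) -> None:
    """Append an eval case derived from a production failure.

    `failure_input` is the user input that exposed the bug and
    `corrected_output` is the human-verified expected behavior.
    """
    case = EvalCase(
        id=case_id,
        input_text=failure_input,
        expected_output=corrected_output,
        category=category,
        difficulty=Difficulty.HARD,  # production failures usually mark hard cases
        tags=["from_production"],
        notes="Added from a production bug report",
    )
    # JSON Lines lets us append without rewriting the rest of the file.
    with open(path, "a") as f:
        f.write(json.dumps(vars(case)) + "\n")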
FAQ
How many eval cases do I need?
Start with 50-100 cases that cover your major use cases and known edge cases. Grow the dataset over time by adding cases from production failures. Quality and diversity matter more than raw count.
Should I use synthetic data to generate eval cases?
Synthetic generation with an LLM is useful for initial dataset bootstrapping, but always have a human review and correct the labels. LLM-generated ground truth inherits the model's biases and errors.
How do I handle eval cases where multiple answers are correct?
Use criteria-based labels (must contain certain keywords, must call certain tools) instead of exact string matching. This accommodates valid variation in phrasing while still catching incorrect behavior.
#Evaluation #Datasets #AIAgents #GroundTruth #Testing #Python #AgenticAI #LearnAI #AIEngineering