Building an Agent Evaluation Framework: Metrics, Datasets, and Automated Scoring
Learn how to design a comprehensive evaluation framework for AI agents covering metric selection, dataset creation, and automated scoring pipelines that scale across dozens of agent capabilities.
Why You Need a Structured Evaluation Framework
Deploying an AI agent without structured evaluation is like shipping software without tests. The agent might work perfectly in a demo, then fail spectacularly on the first edge case a real user throws at it. An evaluation framework gives you repeatable, quantitative measurements that tell you exactly where your agent excels and where it breaks.
A good framework has three pillars: metrics that capture what matters, datasets that represent real usage, and scoring pipelines that run automatically. Let us build each one from scratch.
Designing Your Metric Taxonomy
Metrics for AI agents fall into four categories. Each captures a different dimension of quality.
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable

class MetricCategory(Enum):
    TASK = "task_completion"
    QUALITY = "response_quality"
    EFFICIENCY = "efficiency"
    SAFETY = "safety"

@dataclass
class EvalMetric:
    name: str
    category: MetricCategory
    scorer: Callable[[dict], float]
    weight: float = 1.0
    description: str = ""

@dataclass
class EvalFramework:
    metrics: list[EvalMetric] = field(default_factory=list)

    def register(self, metric: EvalMetric):
        self.metrics.append(metric)

    def score(self, sample: dict) -> dict[str, float]:
        results = {}
        for metric in self.metrics:
            try:
                results[metric.name] = metric.scorer(sample)
            except Exception as e:
                # A failing scorer counts as zero; the error is kept for debugging.
                results[metric.name] = 0.0
                results[f"{metric.name}_error"] = str(e)
        return results

    def weighted_aggregate(self, results: dict[str, float]) -> float:
        total_weight = sum(m.weight for m in self.metrics)
        weighted_sum = sum(
            results.get(m.name, 0.0) * m.weight
            for m in self.metrics
        )
        return weighted_sum / total_weight if total_weight > 0 else 0.0
This gives you a registry where each metric is a named scorer function. The framework runs all metrics against a sample and returns individual scores plus a weighted aggregate.
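As a usage sketch, here is how two metrics might be registered and run against one sample. The class definitions are condensed copies of the ones above so the snippet runs standalone, and the `exact_match` and `brevity` scorers plus the sample's field names are illustrative assumptions, not part of the framework itself.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable

# Condensed copies of the article's classes so this snippet runs standalone.
class MetricCategory(Enum):
    TASK = "task_completion"
    QUALITY = "response_quality"

@dataclass
class EvalMetric:
    name: str
    category: MetricCategory
    scorer: Callable[[dict], float]
    weight: float = 1.0

@dataclass
class EvalFramework:
    metrics: list[EvalMetric] = field(default_factory=list)

    def register(self, metric: EvalMetric) -> None:
        self.metrics.append(metric)

    def score(self, sample: dict) -> dict:
        results = {}
        for metric in self.metrics:
            try:
                results[metric.name] = metric.scorer(sample)
            except Exception as e:
                results[metric.name] = 0.0
                results[f"{metric.name}_error"] = str(e)
        return results

    def weighted_aggregate(self, results: dict) -> float:
        total = sum(m.weight for m in self.metrics)
        s = sum(results.get(m.name, 0.0) * m.weight for m in self.metrics)
        return s / total if total > 0 else 0.0

framework = EvalFramework()
# Exact-match on the expected output, weighted double.
framework.register(EvalMetric(
    name="exact_match",
    category=MetricCategory.TASK,
    scorer=lambda s: float(s["output"] == s["expected"]),
    weight=2.0,
))
# A simple length-based quality check (illustrative threshold).
framework.register(EvalMetric(
    name="brevity",
    category=MetricCategory.QUALITY,
    scorer=lambda s: 1.0 if len(s["output"]) <= 80 else 0.0,
))

scored = framework.score({"output": "refund issued", "expected": "refund issued"})
aggregate = framework.weighted_aggregate(scored)
```

Note that weights only matter at aggregation time; individual metric scores stay untouched, so you can reweight later without re-running anything.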
Creating Evaluation Datasets
Your dataset should mirror production traffic. Each sample includes the user input, the expected behavior, and any context the agent had access to.
import json
import hashlib
from datetime import datetime, timezone

@dataclass
class EvalSample:
    sample_id: str
    user_input: str
    expected_output: str
    expected_tool_calls: list[dict] = field(default_factory=list)
    context: dict = field(default_factory=dict)
    tags: list[str] = field(default_factory=list)
    difficulty: str = "medium"

class EvalDataset:
    def __init__(self, name: str, version: str = "1.0"):
        self.name = name
        self.version = version
        self.samples: list[EvalSample] = []
        self.created_at = datetime.now(timezone.utc).isoformat()

    def add_sample(self, sample: EvalSample):
        self.samples.append(sample)

    def fingerprint(self) -> str:
        content = json.dumps(
            [s.__dict__ for s in self.samples],
            sort_keys=True
        )
        return hashlib.sha256(content.encode()).hexdigest()[:12]

    def filter_by_tag(self, tag: str) -> list[EvalSample]:
        return [s for s in self.samples if tag in s.tags]

    def save(self, path: str):
        data = {
            "name": self.name,
            "version": self.version,
            "fingerprint": self.fingerprint(),
            "created_at": self.created_at,
            "samples": [s.__dict__ for s in self.samples],
        }
        with open(path, "w") as f:
            json.dump(data, f, indent=2)
The fingerprint ensures you always know exactly which dataset version produced a given set of results. Tag-based filtering lets you slice results by capability, difficulty, or domain.
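A short sketch of assembling a dataset and using the fingerprint and tag filter. The classes are condensed copies of the ones above so the snippet runs standalone, and the sample contents and tag names are invented for illustration.

```python
import hashlib
import json
from dataclasses import dataclass, field

# Condensed copies of the article's classes so this snippet runs standalone.
@dataclass
class EvalSample:
    sample_id: str
    user_input: str
    expected_output: str
    tags: list[str] = field(default_factory=list)

class EvalDataset:
    def __init__(self, name: str):
        self.name = name
        self.samples: list[EvalSample] = []

    def add_sample(self, sample: EvalSample) -> None:
        self.samples.append(sample)

    def fingerprint(self) -> str:
        content = json.dumps([s.__dict__ for s in self.samples], sort_keys=True)
        return hashlib.sha256(content.encode()).hexdigest()[:12]

    def filter_by_tag(self, tag: str) -> list[EvalSample]:
        return [s for s in self.samples if tag in s.tags]

ds = EvalDataset("billing-evals")
ds.add_sample(EvalSample("s1", "Cancel my plan", "cancellation confirmed",
                         tags=["billing", "hard"]))
ds.add_sample(EvalSample("s2", "What's my balance?", "balance stated",
                         tags=["billing"]))

fp_before = ds.fingerprint()
hard_samples = ds.filter_by_tag("hard")   # slice by difficulty tag
fp_after = ds.fingerprint()               # filtering does not mutate the dataset
```

Because the fingerprint hashes a canonical JSON dump of the samples, any edit to any sample changes it, while read-only operations like filtering leave it stable.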
Building the Automated Scoring Pipeline
The pipeline connects your agent, dataset, and metrics into a single runnable evaluation.
import asyncio
from typing import Protocol

class AgentRunner(Protocol):
    async def run(self, user_input: str, context: dict) -> dict:
        ...

async def run_evaluation(
    agent: AgentRunner,
    dataset: EvalDataset,
    framework: EvalFramework,
    concurrency: int = 5,
) -> list[dict]:
    semaphore = asyncio.Semaphore(concurrency)

    async def evaluate_sample(sample: EvalSample):
        async with semaphore:
            agent_output = await agent.run(
                sample.user_input, sample.context
            )
            scored = framework.score({
                "sample": sample.__dict__,
                "output": agent_output,
            })
            scored["sample_id"] = sample.sample_id
            scored["aggregate"] = framework.weighted_aggregate(scored)
            return scored

    tasks = [evaluate_sample(s) for s in dataset.samples]
    results = await asyncio.gather(*tasks, return_exceptions=True)
    return [
        r if isinstance(r, dict) else {"error": str(r)}
        for r in results
    ]
The semaphore controls concurrency so you do not overwhelm your agent or your LLM provider. Results come back with per-sample scores and an aggregate, ready for analysis.
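Here is an end-to-end sketch of the same pattern with a stub agent. `StubAgent`, the plain-dict samples, and the one-line scorer are illustrative stand-ins, and `run_evaluation` is condensed to the essentials (semaphore, gather with `return_exceptions=True`) so the snippet runs standalone.

```python
import asyncio

class StubAgent:
    """Stand-in for a real agent: uppercases the input so scoring is deterministic."""
    async def run(self, user_input: str, context: dict) -> dict:
        return {"text": user_input.upper()}

async def run_evaluation(agent, samples, scorer, concurrency: int = 5):
    # Semaphore caps in-flight agent calls, as in the full pipeline above.
    semaphore = asyncio.Semaphore(concurrency)

    async def evaluate_one(sample: dict) -> dict:
        async with semaphore:
            output = await agent.run(sample["user_input"], sample.get("context", {}))
            return {"sample_id": sample["sample_id"],
                    "score": scorer(sample, output)}

    gathered = await asyncio.gather(
        *(evaluate_one(s) for s in samples), return_exceptions=True
    )
    # Exceptions become error records instead of crashing the run.
    return [g if isinstance(g, dict) else {"error": str(g)} for g in gathered]

samples = [
    {"sample_id": "s1", "user_input": "hello", "expected": "HELLO"},
    {"sample_id": "s2", "user_input": "bye", "expected": "BYE"},
]
scorer = lambda sample, output: float(output["text"] == sample["expected"])

results = asyncio.run(run_evaluation(StubAgent(), samples, scorer, concurrency=2))
```

A stub agent like this is also useful as a smoke test for the pipeline itself: if the stub does not score perfectly, the bug is in your harness, not your agent.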
Interpreting Results
After running the pipeline, aggregate results by category and tag to find patterns.
from collections import defaultdict

def summarize_results(
    results: list[dict], framework: EvalFramework
) -> dict:
    category_scores = defaultdict(list)
    for result in results:
        for metric in framework.metrics:
            if metric.name in result:
                category_scores[metric.category.value].append(
                    result[metric.name]
                )
    summary = {}
    for category, scores in category_scores.items():
        summary[category] = {
            "mean": sum(scores) / len(scores),
            "min": min(scores),
            "max": max(scores),
            "count": len(scores),
        }
    return summary
Look for categories where the minimum score is significantly below the mean — those represent your worst failure modes and should be your top priority for improvement.
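That check can be automated with a small helper over the summary shape produced above. The `flag_failure_modes` name, the 0.3 gap threshold, and the example numbers are all hypothetical choices for illustration.

```python
def flag_failure_modes(summary: dict, gap: float = 0.3) -> list[str]:
    """Return categories whose minimum score lags the category mean by more than `gap`."""
    return sorted(
        category for category, stats in summary.items()
        if stats["mean"] - stats["min"] > gap
    )

# Example summary in the shape summarize_results() returns (numbers invented).
summary = {
    "task_completion": {"mean": 0.9, "min": 0.8, "max": 1.0, "count": 50},
    "safety": {"mean": 0.85, "min": 0.2, "max": 1.0, "count": 50},
}
flagged = flag_failure_modes(summary)  # safety's min trails its mean by 0.65
```

A fixed gap is a crude heuristic; with larger sample counts you may prefer flagging on a low percentile rather than the single worst score, which is noisy.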
FAQ
How many evaluation samples do I need for reliable results?
Start with at least 50 to 100 samples per capability you want to measure. For statistical significance when comparing two agent versions, you typically need 200 or more samples. The key is coverage across edge cases, not raw volume. Ten diverse samples beat a hundred repetitive ones.
Should I use LLM-as-judge or deterministic scoring?
Use deterministic scoring wherever possible — exact match for tool calls, regex for structured outputs, keyword checks for required information. Reserve LLM-as-judge for subjective quality dimensions like helpfulness or coherence. Deterministic metrics are faster, cheaper, and reproducible.
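Two deterministic scorers of the kind described here might look like the following. The field names (`tool_calls`, `text`, `expected_tool_calls`) and the order-ID pattern are illustrative assumptions about your sample schema.

```python
import re

def tool_call_match(sample: dict) -> float:
    """1.0 iff the agent made exactly the expected tool calls with the expected args."""
    return float(sample["output"]["tool_calls"] == sample["expected_tool_calls"])

def order_id_present(sample: dict) -> float:
    """1.0 iff the response text contains an order ID like ORD-12345."""
    return float(bool(re.search(r"\bORD-\d{5}\b", sample["output"]["text"])))

sample = {
    "expected_tool_calls": [{"name": "lookup_order", "args": {"id": "ORD-12345"}}],
    "output": {
        "tool_calls": [{"name": "lookup_order", "args": {"id": "ORD-12345"}}],
        "text": "Your order ORD-12345 shipped yesterday.",
    },
}
```

Both return floats so they plug straight into the `scorer` slot of an `EvalMetric`, and both give the same answer on every run, which is exactly what an LLM judge cannot promise.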
How often should I re-run evaluations?
Run the full evaluation suite on every model change, prompt update, or tool modification. Set up nightly runs against your production configuration to catch regressions from upstream model updates. Store every result with the dataset fingerprint and agent version so you can track trends over time.
#AgentEvaluation #Benchmarking #Python #MLOps #Testing #AgenticAI #LearnAI #AIEngineering
CallSphere Team
Expert insights on AI voice agents and customer communication automation.