Building an Agent Evaluation Framework: Metrics, Datasets, and Automated Scoring
Learn how to design a comprehensive evaluation framework for AI agents covering metric selection, dataset creation, and automated scoring pipelines that scale across dozens of agent capabilities.
Why You Need a Structured Evaluation Framework
Deploying an AI agent without structured evaluation is like shipping software without tests. The agent might work perfectly in a demo, then fail spectacularly on the first edge case a real user throws at it. An evaluation framework gives you repeatable, quantitative measurements that tell you exactly where your agent excels and where it breaks.
A good framework has three pillars: metrics that capture what matters, datasets that represent real usage, and scoring pipelines that run automatically. Let us build each one from scratch.
Designing Your Metric Taxonomy
Metrics for AI agents fall into four categories. Each captures a different dimension of quality.
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable

class MetricCategory(Enum):
    TASK = "task_completion"
    QUALITY = "response_quality"
    EFFICIENCY = "efficiency"
    SAFETY = "safety"

@dataclass
class EvalMetric:
    name: str
    category: MetricCategory
    scorer: Callable[[dict], float]
    weight: float = 1.0
    description: str = ""

@dataclass
class EvalFramework:
    metrics: list[EvalMetric] = field(default_factory=list)

    def register(self, metric: EvalMetric):
        self.metrics.append(metric)

    def score(self, sample: dict) -> dict[str, float]:
        results = {}
        for metric in self.metrics:
            try:
                results[metric.name] = metric.scorer(sample)
            except Exception as e:
                # A failing scorer counts as zero; the error is kept for debugging.
                results[metric.name] = 0.0
                results[f"{metric.name}_error"] = str(e)
        return results

    def weighted_aggregate(self, results: dict[str, float]) -> float:
        total_weight = sum(m.weight for m in self.metrics)
        weighted_sum = sum(
            results.get(m.name, 0.0) * m.weight
            for m in self.metrics
        )
        return weighted_sum / total_weight if total_weight > 0 else 0.0
This gives you a registry where each metric is a named scorer function. The framework runs all metrics against a sample and returns individual scores plus a weighted aggregate.
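As a usage sketch, here is how two metrics might be registered and run against one sample. The class definitions are condensed copies of the ones above so the snippet runs standalone, and the `exact_match` and `brevity` scorers plus the sample's field names are illustrative assumptions, not part of the framework itself.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable

# Condensed copies of the article's classes so this snippet runs standalone.
class MetricCategory(Enum):
    TASK = "task_completion"
    QUALITY = "response_quality"

@dataclass
class EvalMetric:
    name: str
    category: MetricCategory
    scorer: Callable[[dict], float]
    weight: float = 1.0

@dataclass
class EvalFramework:
    metrics: list[EvalMetric] = field(default_factory=list)

    def register(self, metric: EvalMetric) -> None:
        self.metrics.append(metric)

    def score(self, sample: dict) -> dict:
        results = {}
        for metric in self.metrics:
            try:
                results[metric.name] = metric.scorer(sample)
            except Exception as e:
                results[metric.name] = 0.0
                results[f"{metric.name}_error"] = str(e)
        return results

    def weighted_aggregate(self, results: dict) -> float:
        total = sum(m.weight for m in self.metrics)
        s = sum(results.get(m.name, 0.0) * m.weight for m in self.metrics)
        return s / total if total > 0 else 0.0

framework = EvalFramework()
# Exact-match on the expected output, weighted double.
framework.register(EvalMetric(
    name="exact_match",
    category=MetricCategory.TASK,
    scorer=lambda s: float(s["output"] == s["expected"]),
    weight=2.0,
))
# A simple length-based quality check (illustrative threshold).
framework.register(EvalMetric(
    name="brevity",
    category=MetricCategory.QUALITY,
    scorer=lambda s: 1.0 if len(s["output"]) <= 80 else 0.0,
))

scored = framework.score({"output": "refund issued", "expected": "refund issued"})
aggregate = framework.weighted_aggregate(scored)
```

Note that weights only matter at aggregation time; individual metric scores stay untouched, so you can reweight later without re-running anything.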
Creating Evaluation Datasets
Your dataset should mirror production traffic. Each sample includes the user input, the expected behavior, and any context the agent had access to.
import json
import hashlib
from datetime import datetime, timezone

@dataclass
class EvalSample:
    sample_id: str
    user_input: str
    expected_output: str
    expected_tool_calls: list[dict] = field(default_factory=list)
    context: dict = field(default_factory=dict)
    tags: list[str] = field(default_factory=list)
    difficulty: str = "medium"

class EvalDataset:
    def __init__(self, name: str, version: str = "1.0"):
        self.name = name
        self.version = version
        self.samples: list[EvalSample] = []
        self.created_at = datetime.now(timezone.utc).isoformat()

    def add_sample(self, sample: EvalSample):
        self.samples.append(sample)

    def fingerprint(self) -> str:
        content = json.dumps(
            [s.__dict__ for s in self.samples],
            sort_keys=True
        )
        return hashlib.sha256(content.encode()).hexdigest()[:12]

    def filter_by_tag(self, tag: str) -> list[EvalSample]:
        return [s for s in self.samples if tag in s.tags]

    def save(self, path: str):
        data = {
            "name": self.name,
            "version": self.version,
            "fingerprint": self.fingerprint(),
            "created_at": self.created_at,
            "samples": [s.__dict__ for s in self.samples],
        }
        with open(path, "w") as f:
            json.dump(data, f, indent=2)
The fingerprint ensures you always know exactly which dataset version produced a given set of results. Tag-based filtering lets you slice results by capability, difficulty, or domain.
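A short sketch of assembling a dataset and using the fingerprint and tag filter. The classes are condensed copies of the ones above so the snippet runs standalone, and the sample contents and tag names are invented for illustration.

```python
import hashlib
import json
from dataclasses import dataclass, field

# Condensed copies of the article's classes so this snippet runs standalone.
@dataclass
class EvalSample:
    sample_id: str
    user_input: str
    expected_output: str
    tags: list[str] = field(default_factory=list)

class EvalDataset:
    def __init__(self, name: str):
        self.name = name
        self.samples: list[EvalSample] = []

    def add_sample(self, sample: EvalSample) -> None:
        self.samples.append(sample)

    def fingerprint(self) -> str:
        content = json.dumps([s.__dict__ for s in self.samples], sort_keys=True)
        return hashlib.sha256(content.encode()).hexdigest()[:12]

    def filter_by_tag(self, tag: str) -> list[EvalSample]:
        return [s for s in self.samples if tag in s.tags]

ds = EvalDataset("billing-evals")
ds.add_sample(EvalSample("s1", "Cancel my plan", "cancellation confirmed",
                         tags=["billing", "hard"]))
ds.add_sample(EvalSample("s2", "What's my balance?", "balance stated",
                         tags=["billing"]))

fp_before = ds.fingerprint()
hard_samples = ds.filter_by_tag("hard")   # slice by difficulty tag
fp_after = ds.fingerprint()               # filtering does not mutate the dataset
```

Because the fingerprint hashes a canonical JSON dump of the samples, any edit to any sample changes it, while read-only operations like filtering leave it stable.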
Building the Automated Scoring Pipeline
The pipeline connects your agent, dataset, and metrics into a single runnable evaluation.
import asyncio
from typing import Protocol

class AgentRunner(Protocol):
    async def run(self, user_input: str, context: dict) -> dict:
        ...

async def run_evaluation(
    agent: AgentRunner,
    dataset: EvalDataset,
    framework: EvalFramework,
    concurrency: int = 5,
) -> list[dict]:
    semaphore = asyncio.Semaphore(concurrency)

    async def evaluate_sample(sample: EvalSample):
        async with semaphore:
            agent_output = await agent.run(
                sample.user_input, sample.context
            )
            scored = framework.score({
                "sample": sample.__dict__,
                "output": agent_output,
            })
            scored["sample_id"] = sample.sample_id
            scored["aggregate"] = framework.weighted_aggregate(scored)
            return scored

    tasks = [evaluate_sample(s) for s in dataset.samples]
    results = await asyncio.gather(*tasks, return_exceptions=True)
    return [
        r if isinstance(r, dict) else {"error": str(r)}
        for r in results
    ]
The semaphore controls concurrency so you do not overwhelm your agent or your LLM provider. Results come back with per-sample scores and an aggregate, ready for analysis.
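Here is an end-to-end sketch of the same pattern with a stub agent. `StubAgent`, the plain-dict samples, and the one-line scorer are illustrative stand-ins, and `run_evaluation` is condensed to the essentials (semaphore, gather with `return_exceptions=True`) so the snippet runs standalone.

```python
import asyncio

class StubAgent:
    """Stand-in for a real agent: uppercases the input so scoring is deterministic."""
    async def run(self, user_input: str, context: dict) -> dict:
        return {"text": user_input.upper()}

async def run_evaluation(agent, samples, scorer, concurrency: int = 5):
    # Semaphore caps in-flight agent calls, as in the full pipeline above.
    semaphore = asyncio.Semaphore(concurrency)

    async def evaluate_one(sample: dict) -> dict:
        async with semaphore:
            output = await agent.run(sample["user_input"], sample.get("context", {}))
            return {"sample_id": sample["sample_id"],
                    "score": scorer(sample, output)}

    gathered = await asyncio.gather(
        *(evaluate_one(s) for s in samples), return_exceptions=True
    )
    # Exceptions become error records instead of crashing the run.
    return [g if isinstance(g, dict) else {"error": str(g)} for g in gathered]

samples = [
    {"sample_id": "s1", "user_input": "hello", "expected": "HELLO"},
    {"sample_id": "s2", "user_input": "bye", "expected": "BYE"},
]
scorer = lambda sample, output: float(output["text"] == sample["expected"])

results = asyncio.run(run_evaluation(StubAgent(), samples, scorer, concurrency=2))
```

A stub agent like this is also useful as a smoke test for the pipeline itself: if the stub does not score perfectly, the bug is in your harness, not your agent.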
Interpreting Results
After running the pipeline, aggregate results by category and tag to find patterns.
from collections import defaultdict

def summarize_results(
    results: list[dict], framework: EvalFramework
) -> dict:
    category_scores = defaultdict(list)
    for result in results:
        for metric in framework.metrics:
            if metric.name in result:
                category_scores[metric.category.value].append(
                    result[metric.name]
                )
    summary = {}
    for category, scores in category_scores.items():
        summary[category] = {
            "mean": sum(scores) / len(scores),
            "min": min(scores),
            "max": max(scores),
            "count": len(scores),
        }
    return summary
Look for categories where the minimum score is significantly below the mean — those represent your worst failure modes and should be your top priority for improvement.
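That check can be automated with a small helper over the summary shape produced above. The `flag_failure_modes` name, the 0.3 gap threshold, and the example numbers are all hypothetical choices for illustration.

```python
def flag_failure_modes(summary: dict, gap: float = 0.3) -> list[str]:
    """Return categories whose minimum score lags the category mean by more than `gap`."""
    return sorted(
        category for category, stats in summary.items()
        if stats["mean"] - stats["min"] > gap
    )

# Example summary in the shape summarize_results() returns (numbers invented).
summary = {
    "task_completion": {"mean": 0.9, "min": 0.8, "max": 1.0, "count": 50},
    "safety": {"mean": 0.85, "min": 0.2, "max": 1.0, "count": 50},
}
flagged = flag_failure_modes(summary)  # safety's min trails its mean by 0.65
```

A fixed gap is a crude heuristic; with larger sample counts you may prefer flagging on a low percentile rather than the single worst score, which is noisy.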
FAQ
How many evaluation samples do I need for reliable results?
Start with at least 50 to 100 samples per capability you want to measure. For statistical significance when comparing two agent versions, you typically need 200 or more samples. The key is coverage across edge cases, not raw volume. Ten diverse samples beat a hundred repetitive ones.
Should I use LLM-as-judge or deterministic scoring?
Use deterministic scoring wherever possible — exact match for tool calls, regex for structured outputs, keyword checks for required information. Reserve LLM-as-judge for subjective quality dimensions like helpfulness or coherence. Deterministic metrics are faster, cheaper, and reproducible.
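Two deterministic scorers of the kind described here might look like the following. The field names (`tool_calls`, `text`, `expected_tool_calls`) and the order-ID pattern are illustrative assumptions about your sample schema.

```python
import re

def tool_call_match(sample: dict) -> float:
    """1.0 iff the agent made exactly the expected tool calls with the expected args."""
    return float(sample["output"]["tool_calls"] == sample["expected_tool_calls"])

def order_id_present(sample: dict) -> float:
    """1.0 iff the response text contains an order ID like ORD-12345."""
    return float(bool(re.search(r"\bORD-\d{5}\b", sample["output"]["text"])))

sample = {
    "expected_tool_calls": [{"name": "lookup_order", "args": {"id": "ORD-12345"}}],
    "output": {
        "tool_calls": [{"name": "lookup_order", "args": {"id": "ORD-12345"}}],
        "text": "Your order ORD-12345 shipped yesterday.",
    },
}
```

Both return floats so they plug straight into the `scorer` slot of an `EvalMetric`, and both give the same answer on every run, which is exactly what an LLM judge cannot promise.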
How often should I re-run evaluations?
Run the full evaluation suite on every model change, prompt update, or tool modification. Set up nightly runs against your production configuration to catch regressions from upstream model updates. Store every result with the dataset fingerprint and agent version so you can track trends over time.
#AgentEvaluation #Benchmarking #Python #MLOps #Testing #AgenticAI #LearnAI #AIEngineering
CallSphere Team
Expert insights on AI voice agents and customer communication automation.