
Tool Usage Accuracy: Evaluating Whether Agents Call the Right Tools with Right Parameters

Learn how to measure and improve AI agent tool usage accuracy by logging tool calls, validating parameters, building accuracy benchmarks, and diagnosing common failure patterns.

Why Tool Usage Accuracy Is Critical

An AI agent's power comes from the tools it can call — APIs, databases, calculators, search engines. But a tool called incorrectly is worse than no tool call at all. A wrong API parameter can book the wrong flight, charge the wrong amount, or delete the wrong record. Tool usage accuracy measures whether your agent selects the correct tool for a given intent and passes the correct parameters every time.

This metric splits into three sub-dimensions: tool selection accuracy (did it pick the right tool?), parameter accuracy (did it pass the right values?), and sequencing accuracy (did it call tools in the right order for multi-step operations?).

Logging Tool Calls for Evaluation

The foundation of tool accuracy measurement is a detailed log of every tool call the agent makes.

from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any, Optional

@dataclass
class ToolCallLog:
    call_id: str
    tool_name: str
    parameters: dict[str, Any]
    result: Any = None
    error: Optional[str] = None
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )
    latency_ms: Optional[int] = None

@dataclass
class ConversationToolTrace:
    conversation_id: str
    calls: list[ToolCallLog] = field(default_factory=list)

    def add_call(self, call: ToolCallLog):
        self.calls.append(call)

    def tool_sequence(self) -> list[str]:
        return [c.tool_name for c in self.calls]

    def to_dict(self) -> dict:
        return {
            "conversation_id": self.conversation_id,
            "calls": [
                {
                    "call_id": c.call_id,
                    "tool_name": c.tool_name,
                    "parameters": c.parameters,
                    "error": c.error,
                }
                for c in self.calls
            ],
        }

Wrap your tool execution layer so every call is automatically captured. Never rely on the agent to self-report which tools it called — instrument the execution layer directly.
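As a minimal sketch of that instrumentation, a thin wrapper around a name-to-callable tool registry can capture every call, error, and latency automatically. The registry shape and `make_logged_executor` helper here are illustrative, not a fixed API; the log entries mirror the `ToolCallLog` fields above as plain dicts so the sketch runs standalone.

```python
import time
import uuid

def make_logged_executor(tools, call_log):
    """Wrap a name->callable tool registry so every call is captured.

    `tools` maps tool names to callables; `call_log` is a list that
    receives one dict per call (fields mirror ToolCallLog above).
    """
    def run_tool(tool_name, **params):
        entry = {
            "call_id": str(uuid.uuid4()),
            "tool_name": tool_name,
            "parameters": params,
            "result": None,
            "error": None,
        }
        start = time.monotonic()
        try:
            entry["result"] = tools[tool_name](**params)
            return entry["result"]
        except Exception as exc:
            entry["error"] = str(exc)
            raise
        finally:
            # Logged even on failure, so errors are never silently dropped.
            entry["latency_ms"] = int((time.monotonic() - start) * 1000)
            call_log.append(entry)
    return run_tool
```

Because the log append lives in a `finally` block, failed calls are recorded with their error message rather than vanishing from the trace.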

Measuring Tool Selection Accuracy

Given a user intent, did the agent pick the correct tool? This requires a ground truth mapping from intents to expected tools.

@dataclass
class ToolAccuracyEval:
    expected_tool: str
    expected_params: dict[str, Any]
    param_match_mode: str = "exact"  # exact, subset, fuzzy

def score_tool_selection(
    actual_calls: list[ToolCallLog],
    expected: list[ToolAccuracyEval],
) -> dict:
    if not expected:
        return {
            "selection_accuracy": 1.0 if not actual_calls else 0.0,
            "spurious_calls": len(actual_calls),
        }

    matched = 0
    for i, exp in enumerate(expected):
        if i < len(actual_calls):
            if actual_calls[i].tool_name == exp.expected_tool:
                matched += 1

    return {
        "selection_accuracy": matched / len(expected),
        "expected_count": len(expected),
        "actual_count": len(actual_calls),
        "spurious_calls": max(0, len(actual_calls) - len(expected)),
        "missed_calls": max(0, len(expected) - len(actual_calls)),
    }

Spurious calls — tools the agent called that it should not have — are just as important as missed calls. An agent that calls a payment API unnecessarily is a liability.
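The positional matching above can under-count when the agent reaches the right tools in a different order. As a complementary, order-insensitive view (a standalone sketch, not part of the scorer above), multiset precision and recall over tool names surface spurious and missed calls directly:

```python
from collections import Counter

def tool_precision_recall(actual: list[str], expected: list[str]) -> dict:
    """Order-insensitive comparison of tool-name multisets."""
    act, exp = Counter(actual), Counter(expected)
    hit = sum((act & exp).values())  # names matched, with multiplicity
    return {
        "precision": hit / len(actual) if actual else 1.0,
        "recall": hit / len(expected) if expected else 1.0,
        "spurious": list((act - exp).elements()),  # called but not expected
        "missed": list((exp - act).elements()),    # expected but not called
    }
```

A duplicate payment call, for example, shows up in `spurious` even though precision and recall alone might look acceptable.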

Parameter Validation Scoring

Selecting the right tool is necessary but not sufficient. The parameters must also be correct.



def score_parameters(
    actual: dict[str, Any],
    expected: dict[str, Any],
    mode: str = "exact",
) -> dict:
    if mode == "exact":
        return _exact_match(actual, expected)
    elif mode == "subset":
        return _subset_match(actual, expected)
    elif mode == "fuzzy":
        return _fuzzy_match(actual, expected)
    raise ValueError(f"Unknown mode: {mode}")

def _exact_match(actual: dict, expected: dict) -> dict:
    correct = 0
    total = len(expected)
    errors = []

    for key, exp_value in expected.items():
        act_value = actual.get(key)
        if act_value == exp_value:
            correct += 1
        else:
            errors.append({
                "param": key,
                "expected": exp_value,
                "actual": act_value,
            })

    extra_params = set(actual.keys()) - set(expected.keys())

    return {
        "param_accuracy": correct / total if total > 0 else 1.0,
        "correct": correct,
        "total": total,
        "errors": errors,
        "extra_params": list(extra_params),
    }

def _subset_match(actual: dict, expected: dict) -> dict:
    correct = sum(
        1 for k, v in expected.items()
        if actual.get(k) == v
    )
    return {
        "param_accuracy": correct / len(expected) if expected else 1.0,
        "correct": correct,
        "total": len(expected),
    }

def _fuzzy_match(actual: dict, expected: dict) -> dict:
    correct = 0
    for key, exp_value in expected.items():
        act_value = actual.get(key)
        if act_value == exp_value:
            correct += 1
        elif (
            isinstance(exp_value, str)
            and isinstance(act_value, str)
            and exp_value.lower().strip() == act_value.lower().strip()
        ):
            correct += 1
    return {
        "param_accuracy": correct / len(expected) if expected else 1.0,
        "correct": correct,
        "total": len(expected),
    }

Use exact match for IDs, amounts, and dates. Use fuzzy match for names and free-text fields where minor differences are acceptable. Always log the specific parameter errors — they reveal systematic patterns like date format confusion or unit mismatches.
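Date-format confusion and similar systematic errors can also be caught before scoring by validating values against a lightweight per-parameter schema. The schema shape and check table below are an assumption for illustration, separate from the scorers above:

```python
import re

# Hypothetical per-parameter format checks.
PARAM_CHECKS = {
    "date": re.compile(r"^\d{4}-\d{2}-\d{2}$"),    # ISO 8601 date
    "amount": re.compile(r"^\d+(\.\d{1,2})?$"),    # decimal currency
}

def validate_formats(params: dict, schema: dict) -> list[str]:
    """Return a list of human-readable format violations.

    `schema` maps parameter names to a kind in PARAM_CHECKS.
    """
    problems = []
    for key, kind in schema.items():
        value = params.get(key)
        pattern = PARAM_CHECKS.get(kind)
        if value is None:
            problems.append(f"missing parameter: {key}")
        elif pattern and not pattern.match(str(value)):
            problems.append(f"{key}={value!r} is not a valid {kind}")
    return problems
```

Running this before value comparison separates "wrong format" from "wrong value", which makes the error logs far easier to aggregate.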

Sequence Accuracy for Multi-Step Operations

Some tasks require tools to be called in a specific order: checking availability before booking, or looking up a customer before modifying their account.

def score_sequence(
    actual_sequence: list[str],
    expected_sequence: list[str],
) -> dict:
    if not expected_sequence:
        return {"sequence_accuracy": 1.0}

    # Longest common subsequence approach
    m, n = len(actual_sequence), len(expected_sequence)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if actual_sequence[i-1] == expected_sequence[j-1]:
                dp[i][j] = dp[i-1][j-1] + 1
            else:
                dp[i][j] = max(dp[i-1][j], dp[i][j-1])

    lcs_length = dp[m][n]
    return {
        "sequence_accuracy": lcs_length / len(expected_sequence),
        "lcs_length": lcs_length,
        "expected_length": len(expected_sequence),
        "actual_length": len(actual_sequence),
    }

The longest common subsequence (LCS) approach is forgiving of extra calls the agent inserts (like a redundant lookup) while still penalizing wrong ordering and missing steps.
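To see that forgiveness concretely, here is an equivalent standalone sketch of the same LCS score as a memoized recursion, with the two cases side by side:

```python
from functools import lru_cache

def lcs_ratio(actual: list[str], expected: list[str]) -> float:
    """Longest common subsequence length divided by len(expected)."""
    @lru_cache(maxsize=None)
    def lcs(i: int, j: int) -> int:
        if i == len(actual) or j == len(expected):
            return 0
        if actual[i] == expected[j]:
            return 1 + lcs(i + 1, j + 1)
        return max(lcs(i + 1, j), lcs(i, j + 1))
    return lcs(0, 0) / len(expected) if expected else 1.0
```

An inserted redundant lookup leaves the score at 1.0 (`["check_availability", "lookup_customer", "book"]` against `["check_availability", "book"]`), while reversing the two expected calls drops it to 0.5.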

Putting It All Together

Combine selection, parameter, and sequence scores into a single tool usage report.

def full_tool_accuracy_report(
    trace: ConversationToolTrace,
    expected_evals: list[ToolAccuracyEval],
) -> dict:
    selection = score_tool_selection(trace.calls, expected_evals)
    param_scores = []
    for i, exp in enumerate(expected_evals):
        if i < len(trace.calls):
            ps = score_parameters(
                trace.calls[i].parameters,
                exp.expected_params,
                exp.param_match_mode,
            )
            param_scores.append(ps["param_accuracy"])
    sequence = score_sequence(
        trace.tool_sequence(),
        [e.expected_tool for e in expected_evals],
    )
    avg_param = (
        sum(param_scores) / len(param_scores)
        if param_scores else 0.0
    )
    return {
        "selection": selection,
        "avg_param_accuracy": round(avg_param, 3),
        "sequence": sequence,
        "composite_score": round(
            selection["selection_accuracy"] * 0.4
            + avg_param * 0.4
            + sequence["sequence_accuracy"] * 0.2,
            3,
        ),
    }
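At benchmark scale, the per-conversation reports roll up into a suite-level summary. A standalone sketch over plain score dicts (the field name mirrors `composite_score` above; the 0.95 threshold is an illustrative default):

```python
import statistics

def summarize_benchmark(reports: list[dict], threshold: float = 0.95) -> dict:
    """Aggregate per-conversation composite scores into suite stats."""
    scores = [r["composite_score"] for r in reports]
    return {
        "n": len(scores),
        "mean": round(statistics.mean(scores), 3),
        "min": min(scores),  # the worst conversation in the suite
        "pass_rate": round(
            sum(s >= threshold for s in scores) / len(scores), 3
        ),
    }
```

Tracking `min` and `pass_rate` alongside the mean keeps a handful of badly handled conversations from hiding inside a healthy-looking average.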

FAQ

How do I build ground truth for tool call evaluation?

Start with your most common user intents. For each intent, manually define the expected tool calls and parameters. Use production conversation logs as your source — sample 50 conversations per task type and annotate the correct tool sequence. Automate what you can with deterministic rules, and use human annotators for ambiguous cases.

What is an acceptable tool selection accuracy?

For production agents handling real transactions, target 95 percent or higher tool selection accuracy. Anything below 90 percent means roughly one in ten user requests triggers the wrong action. For read-only tools like search or lookup, 85 percent is workable. For tools that modify state — payments, bookings, deletions — you need near-perfect accuracy.

How do I handle cases where multiple tool sequences are valid?

Define a set of acceptable sequences rather than a single expected sequence. Score against the best-matching sequence from the set. Alternatively, define ordering constraints (A must come before B) rather than a full sequence, and verify that all constraints are satisfied.
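The constraint-based variant can be sketched directly: each constraint is a `(before, after)` pair, violated when `after` occurs without an earlier occurrence of `before`. The tool names here are illustrative:

```python
def check_ordering_constraints(
    sequence: list[str],
    constraints: list[tuple[str, str]],
) -> list[tuple[str, str]]:
    """Return the constraints that the tool sequence violates.

    A constraint (before, after) is violated when `after` appears
    without `before` occurring earlier in the sequence.
    """
    violations = []
    for before, after in constraints:
        if after in sequence:
            first_after = sequence.index(after)
            if before not in sequence[:first_after]:
                violations.append((before, after))
    return violations
```

A sequence that never calls the constrained tool vacuously satisfies the constraint, which is usually the behavior you want for optional steps.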


#ToolUse #AgentEvaluation #FunctionCalling #Python #Benchmarking #AgenticAI #LearnAI #AIEngineering


CallSphere Team

Expert insights on AI voice agents and customer communication automation.
