
Dynamic Tool Selection: AI Agents That Choose Tools Based on Context

Learn how AI agents select the right tool from a large toolset. Covers tool routing strategies, writing descriptions that guide selection, handling the too-many-tools problem, and building intelligent tool dispatchers.

The Tool Selection Problem

When an agent has 3 tools, the LLM picks the right one almost every time. At 10 tools, accuracy starts declining. At 50+ tools, the model frequently picks the wrong tool, hallucinates parameters, or calls tools that are irrelevant to the task. This is the too-many-tools problem, and solving it is essential for building agents that work with large toolsets.

The fundamental insight is that tool selection is a search problem. The LLM needs enough information to discriminate between tools, but not so much that it is overwhelmed.

How LLMs Select Tools

When you provide tools to an LLM, the model uses three signals to decide which tool to call:

  1. The tool name — semantic meaning extracted from the function name
  2. The tool description — the primary source of selection guidance
  3. The parameter schema — structural hints about what data the tool expects

The description is by far the most important. A good description acts as a routing instruction.

Writing Descriptions That Discriminate

Each tool description should answer: what does this tool do, when should it be used, and when should a different tool be used instead.

# Bad: overlapping, ambiguous descriptions
tools_bad = [
    {"name": "search", "description": "Search for information"},
    {"name": "lookup", "description": "Look up data"},
    {"name": "find", "description": "Find results"},
]

# Good: clear boundaries between tools
tools_good = [
    {
        "name": "search_web",
        "description": "Search the public internet for current information. Use for recent events, general knowledge, or topics not in our internal database. Do NOT use for internal company data."
    },
    {
        "name": "search_knowledge_base",
        "description": "Search the internal company knowledge base for policies, procedures, and documentation. Use for company-specific questions. Do NOT use for general internet searches."
    },
    {
        "name": "search_customer_db",
        "description": "Look up a specific customer by name, email, or ID in the customer database. Use when the user asks about a specific customer's account, orders, or status. Requires at least one identifier."
    },
]

The "Do NOT use for" clause is surprisingly effective. It gives the LLM a negative signal that prevents common misrouting.
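For brevity, the dicts above show only a name and description. When passed to the OpenAI Chat Completions API, each tool is wrapped in a function schema with a JSON Schema `parameters` block. A full entry for `search_web` might look like this (the `query` parameter is an assumption for illustration):

```python
# Full tool schema as expected by the OpenAI Chat Completions API.
# The "query" parameter is illustrative.
search_web_tool = {
    "type": "function",
    "function": {
        "name": "search_web",
        "description": (
            "Search the public internet for current information. Use for "
            "recent events, general knowledge, or topics not in our internal "
            "database. Do NOT use for internal company data."
        ),
        "parameters": {
            "type": "object",
            "properties": {
                "query": {
                    "type": "string",
                    "description": "The search query to run",
                },
            },
            "required": ["query"],
        },
    },
}
```

Parameter descriptions matter too: the model reads them when filling in arguments, so vague parameter descriptions produce hallucinated values even when the right tool was chosen.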

Strategy 1: Tool Categories with Pre-Routing

For large toolsets, pre-filter tools based on the conversation context before passing them to the LLM:


from dataclasses import dataclass

@dataclass
class ToolCategory:
    name: str
    description: str
    keywords: list[str]
    tools: list[dict]

class ToolRouter:
    def __init__(self):
        self.categories: list[ToolCategory] = []

    def add_category(self, category: ToolCategory):
        self.categories.append(category)

    def select_tools(self, user_message: str, max_tools: int = 10) -> list[dict]:
        message_lower = user_message.lower()
        scored_categories = []

        for category in self.categories:
            score = sum(
                1 for kw in category.keywords
                if kw.lower() in message_lower
            )
            if score > 0:
                scored_categories.append((score, category))

        scored_categories.sort(key=lambda x: x[0], reverse=True)

        selected_tools = []
        for _, category in scored_categories:
            for tool in category.tools:
                if len(selected_tools) >= max_tools:
                    break
                selected_tools.append(tool)

        # Fall back to the first category's tools if no keywords matched
        if not selected_tools:
            return self.categories[0].tools[:max_tools]

        return selected_tools

Usage:

router = ToolRouter()

router.add_category(ToolCategory(
    name="data_analysis",
    description="Tools for querying and analyzing data",
    keywords=["data", "query", "sql", "analyze", "statistics", "count", "average"],
    tools=[query_db_tool, chart_tool, export_csv_tool],
))

router.add_category(ToolCategory(
    name="communication",
    description="Tools for sending messages and notifications",
    keywords=["send", "email", "message", "notify", "slack", "alert"],
    tools=[send_email_tool, slack_tool, sms_tool],
))

# At runtime, only pass relevant tools to the LLM
# (client is an initialized openai.AsyncOpenAI instance)
relevant_tools = router.select_tools(user_message, max_tools=8)
response = await client.chat.completions.create(
    model="gpt-4o",
    messages=messages,
    tools=relevant_tools,
)
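Selecting tools is only half the job: once the model responds with a tool call, something has to route it to actual code. A minimal dispatcher sketch, assuming a registry of local Python handlers (the `search_web` handler and registry names here are hypothetical):

```python
import json

# Registry mapping tool names to local Python handlers (illustrative)
TOOL_HANDLERS: dict = {}

def register_tool(name: str):
    def decorator(fn):
        TOOL_HANDLERS[name] = fn
        return fn
    return decorator

@register_tool("search_web")
def search_web(query: str) -> str:
    # Placeholder implementation for demonstration
    return f"web results for: {query}"

def dispatch(name: str, arguments: str) -> str:
    """Route a tool call to its handler. `arguments` is the JSON string
    the model returns (e.g. tool_call.function.arguments)."""
    handler = TOOL_HANDLERS.get(name)
    if handler is None:
        # Return an error string instead of raising, so the agent loop
        # can feed the failure back to the model and recover
        return f"Error: unknown tool '{name}'"
    return handler(**json.loads(arguments))
```

Returning an error string for unknown tools, rather than raising, lets the agent loop pass the failure back to the model as a tool result so it can try a different tool.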

Strategy 2: Two-Stage Tool Selection

For very large toolsets (50+ tools), use a two-stage approach where the first LLM call selects the tool category, and the second call uses only tools from that category:

async def two_stage_tool_selection(user_message: str, all_categories: list[ToolCategory]):
    # Stage 1: Ask LLM to pick the right category
    category_descriptions = "\n".join(
        f"- {cat.name}: {cat.description}"
        for cat in all_categories
    )

    stage1_response = await client.chat.completions.create(
        model="gpt-4o-mini",  # Cheaper model for routing
        messages=[
            {"role": "system", "content": f"Select the tool category most relevant to the user's request. Available categories:\n{category_descriptions}\n\nRespond with only the category name."},
            {"role": "user", "content": user_message},
        ],
    )

    selected_name = stage1_response.choices[0].message.content.strip()

    # Stage 2: Run agent with only tools from the selected category.
    # Match case-insensitively, since the model may vary capitalization.
    selected_category = next(
        (cat for cat in all_categories if cat.name.lower() == selected_name.lower()),
        all_categories[0],  # fall back to the first category on an unrecognized name
    )

    return await run_agent(
        user_message,
        tools=selected_category.tools,
        system_prompt="You are a helpful assistant.",
    )

Using a cheaper model (GPT-4o-mini) for routing keeps costs low while ensuring the main agent only sees relevant tools.

Strategy 3: Embedding-Based Tool Selection

For the most sophisticated approach, use embeddings to match user intent to tool descriptions:

import numpy as np

class EmbeddingToolSelector:
    def __init__(self, tools: list[dict]):
        self.tools = tools
        self.embeddings = None

    async def build_index(self):
        descriptions = [
            f"{t['function']['name']}: {t['function']['description']}"
            for t in self.tools
        ]
        response = await client.embeddings.create(
            model="text-embedding-3-small",
            input=descriptions,
        )
        self.embeddings = np.array([e.embedding for e in response.data])

    async def select(self, query: str, top_k: int = 5) -> list[dict]:
        response = await client.embeddings.create(
            model="text-embedding-3-small",
            input=[query],
        )
        query_embedding = np.array(response.data[0].embedding)

        # text-embedding-3 vectors are unit-normalized, so the dot
        # product is the cosine similarity
        similarities = np.dot(self.embeddings, query_embedding)
        top_indices = np.argsort(similarities)[-top_k:][::-1]

        return [self.tools[i] for i in top_indices]

This approach scales to hundreds of tools and handles semantic matching — "show me revenue numbers" correctly routes to the database query tool even without the word "query" appearing.
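The ranking step can be seen on toy vectors. The 3-D embeddings below are made up for illustration (real text-embedding-3-small vectors have 1536 dimensions), but the math is the same:

```python
import numpy as np

# Made-up 3-D "embeddings" for three tools, one row per tool
tool_embeddings = np.array([
    [1.0, 0.0, 0.0],   # 0: query_db
    [0.0, 1.0, 0.0],   # 1: send_email
    [0.7, 0.7, 0.0],   # 2: export_csv
])
# Normalize rows so the dot product equals cosine similarity
tool_embeddings /= np.linalg.norm(tool_embeddings, axis=1, keepdims=True)

# A query whose intent is close to "query_db"
query = np.array([0.9, 0.1, 0.0])
query /= np.linalg.norm(query)

similarities = tool_embeddings @ query
top2 = np.argsort(similarities)[-2:][::-1]  # indices of the 2 best tools
```

Here `top2` ranks `query_db` first and `export_csv` second, while `send_email` is excluded, which is exactly the narrowing effect you want before the LLM call.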

FAQ

What is the maximum number of tools I should give an LLM at once?

Empirically, most models handle 10-15 tools well. Beyond 20, selection accuracy degrades noticeably. If you have more than 20 tools, use one of the pre-routing strategies described above to narrow the active toolset per conversation turn.

How do I debug tool selection mistakes?

Log the tool calls the LLM makes alongside the user message. Look for patterns: does the model confuse two specific tools? Add "Do NOT use for" clauses to their descriptions. Does it pick the right tool but with wrong parameters? The parameter descriptions need improvement. Track selection accuracy as a metric over time.
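A minimal sketch of such a log, assuming you label at least some conversations with the tool a human would have expected (the class and method names are illustrative):

```python
from collections import Counter

class SelectionLog:
    """Records (message, tool_called, tool_expected) triples so tool
    selection accuracy can be tracked as a metric over time."""

    def __init__(self):
        self.records = []

    def log(self, user_message: str, tool_called: str, tool_expected=None):
        self.records.append((user_message, tool_called, tool_expected))

    def confusion_pairs(self) -> Counter:
        # Count (expected, actually_called) pairs where the model misrouted;
        # frequent pairs are candidates for "Do NOT use for" clauses
        return Counter(
            (expected, called)
            for _, called, expected in self.records
            if expected is not None and expected != called
        )

    def accuracy(self) -> float:
        labeled = [(c, e) for _, c, e in self.records if e is not None]
        if not labeled:
            return 0.0
        return sum(c == e for c, e in labeled) / len(labeled)
```

The `confusion_pairs` output directly answers the "does the model confuse two specific tools?" question: the most common pair tells you which two descriptions need sharper boundaries.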

Should I fine-tune a model for tool selection?

Only as a last resort. For most applications, better tool descriptions, pre-routing, and the two-stage approach solve selection problems without fine-tuning. Fine-tuning makes sense when you have a very large, domain-specific toolset and can generate training data from production logs.


#ToolSelection #AgentArchitecture #FunctionCalling #AIAgents #AgenticAI #LearnAI #AIEngineering


CallSphere Team

Expert insights on AI voice agents and customer communication automation.
