
Tool Result Formatting: Helping LLMs Understand Tool Outputs

Master the art of formatting tool results so LLMs can effectively parse and reason about them. Covers string formatting strategies, truncation, structured vs unstructured results, error messages, and token-efficient output design.

The Forgotten Half of Tool Design

Most developers spend their time on tool schemas and execution logic. But the tool result — the string you pass back to the LLM — is equally important. A well-formatted result helps the LLM extract the right information on the first pass. A poorly formatted result leads to hallucinations, missed data, or unnecessary follow-up tool calls.

The tool result is a string. That is your only interface. Everything you need the LLM to understand must be encoded in that string.
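To make that concrete, here is a minimal sketch of where the result string ends up, assuming an OpenAI-style chat API (the "tool" role and "tool_call_id" field are that API's convention; other providers use different envelopes, but the content is still just a string):

```python
def make_tool_result_message(tool_call_id: str, result: str) -> dict:
    """Wrap a formatted result string in a tool-role chat message."""
    return {
        "role": "tool",
        "tool_call_id": tool_call_id,
        "content": result,  # the only channel the model sees
    }

msg = make_tool_result_message("call_123", "Current weather in Paris: 18C, clear")
```

Everything that follows in this article is about what to put in that `content` field.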

Principle 1: Lead with the Answer

Put the most important information first. LLMs process text sequentially and are better at using information that appears early in a message:

# Bad: answer buried after metadata
def format_weather(data: dict) -> str:
    return f"""API Response:
Status: 200 OK
Cache: HIT
Request ID: abc-123
Timestamp: 2026-03-17T10:30:00Z

Location: {data['city']}
Temperature: {data['temp_f']}F
Conditions: {data['conditions']}"""

# Good: answer first, metadata optional
def format_weather(data: dict) -> str:
    return f"""Current weather in {data['city']}:
Temperature: {data['temp_f']}F ({data['temp_c']}C)
Conditions: {data['conditions']}
Humidity: {data['humidity']}%
Wind: {data['wind_speed']} mph {data['wind_dir']}"""

The LLM does not need your HTTP status codes, cache headers, or request IDs. It needs the weather data.

Principle 2: Use Consistent Structure

When a tool can return different types of results, maintain a consistent format:

def format_search_results(results: list[dict]) -> str:
    if not results:
        return "No results found."

    lines = [f"Found {len(results)} result(s):\n"]

    for i, r in enumerate(results, 1):
        lines.append(f"{i}. {r['title']}")
        lines.append(f"   URL: {r['url']}")
        lines.append(f"   Snippet: {r['snippet']}")
        lines.append("")

    return "\n".join(lines)

Numbered items with consistent indentation and field labels make it easy for the LLM to parse individual results and refer to them by number in its response.
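Running the formatter on two made-up results shows the shape the model sees (the function is repeated here so the snippet runs standalone):

```python
def format_search_results(results: list[dict]) -> str:
    if not results:
        return "No results found."
    lines = [f"Found {len(results)} result(s):\n"]
    for i, r in enumerate(results, 1):
        lines.append(f"{i}. {r['title']}")
        lines.append(f"   URL: {r['url']}")
        lines.append(f"   Snippet: {r['snippet']}")
        lines.append("")
    return "\n".join(lines)

sample = [
    {"title": "Intro to RAG", "url": "https://example.com/rag",
     "snippet": "Retrieval basics."},
    {"title": "Vector stores", "url": "https://example.com/vec",
     "snippet": "Choosing an index."},
]
output = format_search_results(sample)
```

Because every result follows the same `number / URL / Snippet` template, the model can say "open result 2" and you can map that back unambiguously.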

Principle 3: Truncate Thoughtfully

Raw tool outputs can be massive. Truncation is not optional — it is a core design decision:

def truncate_result(content: str, max_chars: int = 4000) -> str:
    if len(content) <= max_chars:
        return content

    # Try to truncate at a natural boundary
    truncated = content[:max_chars]
    last_newline = truncated.rfind("\n")
    if last_newline > max_chars * 0.8:
        truncated = truncated[:last_newline]

    total_chars = len(content)
    return f"{truncated}\n\n[Truncated: showing {len(truncated)} of {total_chars} characters. Call with offset parameter to see more.]"

The truncation message tells the LLM how much data was cut and how to get more. Without this, the LLM may assume it has all the data and produce incomplete answers.
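The offset mentioned in the truncation notice is not free; the tool has to actually support it. A minimal sketch of what that could look like (the `offset` parameter and its semantics are an assumption for illustration, not a standard):

```python
def read_slice(content: str, offset: int = 0, max_chars: int = 4000) -> str:
    """Return one window of a large result, with a continuation hint."""
    window = content[offset:offset + max_chars]
    end = offset + len(window)
    if end < len(content):
        # Tell the model exactly how to fetch the next window.
        window += (f"\n\n[Truncated: showing chars {offset}-{end} of "
                   f"{len(content)}. Call again with offset={end}.]")
    return window

page1 = read_slice("x" * 10000, offset=0, max_chars=4000)
page2 = read_slice("x" * 10000, offset=4000, max_chars=4000)
```

Each window ends with the exact follow-up call needed, so the model never has to guess at pagination.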


Principle 4: Format Errors as Actionable Messages

Error results should tell the LLM what went wrong and what it can do about it:

# Bad: generic error
def handle_error_bad(e: Exception) -> str:
    return f"Error: {str(e)}"

# Good: actionable error with context
def handle_error_good(tool_name: str, e: Exception, suggestion: str = "") -> str:
    error_msg = f"Tool '{tool_name}' failed: {str(e)}"

    if suggestion:
        error_msg += f"\nSuggestion: {suggestion}"

    return error_msg

# Usage examples
handle_error_good(
    "query_database",
    Exception("relation 'users' does not exist"),
    "The table might be named 'customers'. Call get_schema to check available tables."
)

handle_error_good(
    "fetch_webpage",
    Exception("HTTP 403 Forbidden"),
    "This site blocks automated requests. Try a different source for this information."
)

The suggestion guides the LLM toward recovery instead of blindly retrying the same call.

Principle 5: Tables for Tabular Data

When returning rows of data, format them as aligned tables:

def format_as_table(rows: list[dict], columns: list[str] | None = None) -> str:
    if not rows:
        return "No data."

    if columns is None:
        columns = list(rows[0].keys())

    # Calculate column widths
    widths = {col: len(col) for col in columns}
    for row in rows:
        for col in columns:
            val = str(row.get(col, ""))
            widths[col] = max(widths[col], min(len(val), 40))

    # Build header
    header = " | ".join(col.ljust(widths[col]) for col in columns)
    separator = "-+-".join("-" * widths[col] for col in columns)

    # Build rows
    lines = [header, separator]
    for row in rows:
        line = " | ".join(
            str(row.get(col, ""))[:40].ljust(widths[col])
            for col in columns
        )
        lines.append(line)

    return "\n".join(lines)

Tables are more token-efficient than JSON for tabular data and easier for the LLM to scan visually.
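The efficiency claim is easy to sanity-check: the same three rows rendered as indented JSON versus a plain table (character counts are a rough proxy for tokens; the exact ratio depends on the tokenizer):

```python
import json

rows = [
    {"id": 1, "name": "Ada", "plan": "pro"},
    {"id": 2, "name": "Grace", "plan": "free"},
    {"id": 3, "name": "Alan", "plan": "pro"},
]

as_json = json.dumps(rows, indent=2)

# Plain pipe-separated table: keys appear once, in the header.
cols = list(rows[0].keys())
table_lines = [" | ".join(cols)]
table_lines += [" | ".join(str(r[c]) for c in cols) for r in rows]
as_table = "\n".join(table_lines)

# JSON repeats every key, plus quotes and braces, for every row.
assert len(as_table) < len(as_json)
```

The gap widens with row count: the table's per-row overhead is a few separators, while JSON pays the full key names again on each row.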

Principle 6: Include Metadata When It Aids Reasoning

Some metadata helps the LLM make better decisions:

def format_db_results(rows: list[dict], query_time_ms: float, total_count: int) -> str:
    output = format_as_table(rows)

    metadata = []
    if len(rows) < total_count:
        metadata.append(f"Showing {len(rows)} of {total_count} total rows")
    metadata.append(f"Query executed in {query_time_ms:.0f}ms")

    if metadata:
        output += "\n\n" + " | ".join(metadata)

    return output

Knowing that there are 500 more rows helps the LLM decide whether to add filters. Knowing query time helps it avoid expensive queries.

Combining Patterns: A Complete Formatter

class ToolResultFormatter:
    def __init__(self, max_chars: int = 4000):
        self.max_chars = max_chars

    def format(self, data, tool_name: str) -> str:
        if isinstance(data, list) and data and isinstance(data[0], dict):
            result = format_as_table(data)
        elif isinstance(data, dict):
            import json
            result = json.dumps(data, indent=2, default=str)
        elif isinstance(data, str):
            result = data
        else:
            result = str(data)

        return truncate_result(result, self.max_chars)

    def error(self, tool_name: str, error: str, suggestion: str = "") -> str:
        msg = f"[{tool_name}] Error: {error}"
        if suggestion:
            msg += f"\nSuggestion: {suggestion}"
        return msg

    def empty(self, tool_name: str, query_description: str = "") -> str:
        msg = f"[{tool_name}] No results found"
        if query_description:
            msg += f" for: {query_description}"
        msg += ". Try broadening your search criteria."
        return msg

FAQ

Should I return JSON or plain text from tools?

It depends on the data. For structured records (API responses, database rows), JSON or tables work well. For text content (web pages, file contents, search snippets), plain text is more natural. The key metric is: can the LLM parse the result accurately on the first attempt?

How long should tool results be?

Keep results under 4000 characters as a default. Beyond that, you are spending tokens on data the LLM may not fully process. For data-heavy tools, return summaries or the first N results with a note about how to get more. The sweet spot is enough data to answer the question without drowning the model in noise.

Should I format results differently for different LLMs?

In practice, the formatting principles above work well across all major models. The differences in how GPT-4, Claude, and Gemini process tool results are minor compared to the impact of good formatting practices. Focus on clarity, conciseness, and putting important information first.


#ToolDesign #LLMOptimization #FunctionCalling #AIAgents #AgenticAI #LearnAI #AIEngineering

Written by

CallSphere Team

Expert insights on AI voice agents and customer communication automation.
