Learn Agentic AI · 11 min read

Debugging Token Usage: Finding Why Your Agent Consumes More Tokens Than Expected

Discover how to identify and fix excessive token consumption in AI agents by analyzing prompt bloat, conversation history growth, tool definition overhead, and applying targeted optimization strategies.

Why Your Token Bill Keeps Growing

You launch an AI agent that costs a few cents per conversation in testing. In production, some conversations cost several dollars. The model is the same, the prompts have not changed, but the token usage has exploded. Where are the tokens going?

Token consumption in agentic systems is fundamentally different from simple chat applications. Every tool call, every tool result, every intermediate reasoning step, and every message in the conversation history gets sent back to the model on the next turn. A 10-turn agent conversation does not cost 10 times a single turn — it can cost 55 times (1 + 2 + 3 + ... + 10) because of the accumulating context window.
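The arithmetic is easy to verify. Assuming each turn adds a fixed number of new tokens and the full history is re-sent on every turn, total input tokens grow with the triangular number of the turn count (the 500-token turn size below is an arbitrary assumption):

```python
def cumulative_input_tokens(turns: int, turn_tokens: int) -> int:
    """Total input tokens across a conversation where turn n re-sends
    all n turns of `turn_tokens` each."""
    return sum(n * turn_tokens for n in range(1, turns + 1))

single = cumulative_input_tokens(1, 500)   # 500 tokens
ten = cumulative_input_tokens(10, 500)     # 27,500 tokens
print(f"{ten // single}x the cost of one turn")
```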

Building a Token Profiler

The first step is measuring where tokens are actually being spent:

import tiktoken
from dataclasses import dataclass

@dataclass
class TokenBreakdown:
    system_prompt: int = 0
    tool_definitions: int = 0
    conversation_history: int = 0
    current_turn: int = 0
    total: int = 0

class TokenProfiler:
    def __init__(self, model: str = "gpt-4o"):
        self.encoder = tiktoken.encoding_for_model(model)
        self.turn_snapshots: list[TokenBreakdown] = []

    def count(self, text: str) -> int:
        return len(self.encoder.encode(text))

    def profile_request(
        self, messages: list[dict], tools: list[dict] | None = None
    ) -> TokenBreakdown:
        breakdown = TokenBreakdown()

        for i, msg in enumerate(messages):
            tokens = self.count(msg.get("content") or "")
            if msg["role"] == "system":
                breakdown.system_prompt += tokens
            elif i == len(messages) - 1:
                # Compare by index: `msg == messages[-1]` would also match
                # a duplicate of the final message earlier in the history.
                breakdown.current_turn += tokens
            else:
                breakdown.conversation_history += tokens

        if tools:
            import json

            tool_text = json.dumps(tools)
            breakdown.tool_definitions = self.count(tool_text)

        breakdown.total = (
            breakdown.system_prompt
            + breakdown.tool_definitions
            + breakdown.conversation_history
            + breakdown.current_turn
        )
        self.turn_snapshots.append(breakdown)
        return breakdown

    def print_report(self):
        print("Turn | System | Tools | History | Current |  Total")
        print("-----|--------|-------|---------|---------|-------")
        for i, snap in enumerate(self.turn_snapshots):
            print(
                f"  {i + 1:2d} | {snap.system_prompt:6d} | "
                f"{snap.tool_definitions:5d} | {snap.conversation_history:7d} | "
                f"{snap.current_turn:7d} | {snap.total:6d}"
            )

Running this profiler across a multi-turn conversation reveals exactly where the growth happens.
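A minimal usage sketch — with a crude whitespace word count standing in for tiktoken so it runs without the dependency, and message contents that are invented for illustration:

```python
import json
from dataclasses import dataclass

@dataclass
class TokenBreakdown:
    system_prompt: int = 0
    tool_definitions: int = 0
    conversation_history: int = 0
    current_turn: int = 0
    total: int = 0

class SimpleProfiler:
    """Same classification logic as TokenProfiler above, but count()
    splits on whitespace instead of calling tiktoken."""

    def __init__(self):
        self.turn_snapshots: list[TokenBreakdown] = []

    def count(self, text: str) -> int:
        return len(text.split())  # crude stand-in for encoder.encode

    def profile_request(self, messages, tools=None):
        breakdown = TokenBreakdown()
        for i, msg in enumerate(messages):
            tokens = self.count(msg.get("content") or "")
            if msg["role"] == "system":
                breakdown.system_prompt += tokens
            elif i == len(messages) - 1:
                breakdown.current_turn += tokens
            else:
                breakdown.conversation_history += tokens
        if tools:
            breakdown.tool_definitions = self.count(json.dumps(tools))
        breakdown.total = (
            breakdown.system_prompt
            + breakdown.tool_definitions
            + breakdown.conversation_history
            + breakdown.current_turn
        )
        self.turn_snapshots.append(breakdown)
        return breakdown

profiler = SimpleProfiler()
snap = profiler.profile_request(
    [
        {"role": "system", "content": "You are a support agent."},
        {"role": "user", "content": "Where is my order?"},
        {"role": "assistant", "content": "Let me check that for you."},
        {"role": "user", "content": "Thanks"},
    ],
    tools=[{"name": "get_order"}],
)
print(snap)
```

Swapping the whitespace counter back for the real encoder changes the numbers, not the breakdown logic.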

Common Token Bloat Patterns

Pattern 1: Tool results that are too large. A database query tool returns the entire row set including columns the agent does not need:


# Bad: returns everything
@function_tool
async def get_customer(customer_id: str) -> str:
    row = await db.fetch_one(
        "SELECT * FROM customers WHERE id = $1", customer_id
    )
    return json.dumps(dict(row))  # 50+ columns, 2000 tokens

# Good: return only what the agent needs
@function_tool
async def get_customer(customer_id: str) -> str:
    row = await db.fetch_one(
        "SELECT name, email, plan, status FROM customers WHERE id = $1",
        customer_id,
    )
    return json.dumps(dict(row))  # 4 columns, 80 tokens
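When the query itself cannot be narrowed, a defensive cap on any tool result keeps a single call from flooding the context. A sketch (`cap_tool_result` is a hypothetical helper; the 4-characters-per-token ratio is a rough rule of thumb):

```python
def cap_tool_result(result: str, max_chars: int = 2000) -> str:
    """Truncate oversized tool output before it enters the context.
    At roughly 4 characters per token, 2000 chars is about 500 tokens."""
    if len(result) <= max_chars:
        return result
    return result[:max_chars] + f"\n[truncated {len(result) - max_chars} chars]"

print(cap_tool_result("x" * 10_000)[-30:])
```

Telling the model the result was truncated (rather than silently cutting it) lets it re-query with a narrower request if it needs more.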

Pattern 2: Conversation history that never gets trimmed. Every message from every turn stays in the context:

class ConversationManager:
    def __init__(self, max_history_tokens: int = 4000):
        self.messages: list[dict] = []
        self.max_tokens = max_history_tokens
        self.encoder = tiktoken.encoding_for_model("gpt-4o")

    def add_message(self, role: str, content: str):
        self.messages.append({"role": role, "content": content})
        self._trim()

    def _trim(self):
        """Remove oldest messages when history exceeds token budget."""
        while self._total_tokens() > self.max_tokens and len(self.messages) > 2:
            # Keep system prompt (index 0), remove oldest user/assistant
            self.messages.pop(1)

    def _total_tokens(self) -> int:
        return sum(
            len(self.encoder.encode(m.get("content", "") or ""))
            for m in self.messages
        )

Pattern 3: Verbose system prompts that repeat information already in tool descriptions. Consolidate instructions and avoid duplication between your system prompt and tool docstrings.
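A before-and-after sketch of that duplication — the prompt text and tool names here are invented for illustration:

```python
# Bad: the system prompt restates what each tool's docstring already
# tells the model through the tool schema.
BLOATED_PROMPT = (
    "You are a support agent. Use get_customer to look up a customer "
    "by ID; it returns name, email, plan, and status. Use search_orders "
    "to find orders; pass a customer ID and an optional date range."
)

# Good: the prompt covers behavior; the tool schemas describe the tools.
LEAN_PROMPT = (
    "You are a support agent. Use the available tools to answer "
    "billing and order questions. Never state account data you have "
    "not looked up."
)

print(f"saved {len(BLOATED_PROMPT) - len(LEAN_PROMPT)} characters per request")
```

The savings repeat on every single turn, which is what makes system-prompt bloat expensive.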

Setting Token Budgets

Define per-conversation and per-turn budgets to catch runaway usage early:

class TokenBudgetExceeded(Exception):
    pass

class TokenBudget:
    def __init__(self, per_turn: int = 8000, per_conversation: int = 50000):
        self.per_turn = per_turn
        self.per_conversation = per_conversation
        self.total_used = 0

    def check(self, turn_tokens: int) -> bool:
        if turn_tokens > self.per_turn:
            raise TokenBudgetExceeded(
                f"Turn used {turn_tokens} tokens (limit: {self.per_turn})"
            )
        self.total_used += turn_tokens
        if self.total_used > self.per_conversation:
            raise TokenBudgetExceeded(
                f"Conversation total {self.total_used} tokens "
                f"(limit: {self.per_conversation})"
            )
        return True
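Wiring the budget into an agent loop turns a runaway conversation into a clean failure. A sketch (the budget class is repeated so this runs standalone, and the per-turn counts are made up):

```python
class TokenBudgetExceeded(Exception):
    pass

class TokenBudget:
    """Same logic as above, repeated so this sketch is self-contained."""

    def __init__(self, per_turn: int = 8000, per_conversation: int = 50000):
        self.per_turn = per_turn
        self.per_conversation = per_conversation
        self.total_used = 0

    def check(self, turn_tokens: int) -> bool:
        if turn_tokens > self.per_turn:
            raise TokenBudgetExceeded(
                f"Turn used {turn_tokens} tokens (limit: {self.per_turn})"
            )
        self.total_used += turn_tokens
        if self.total_used > self.per_conversation:
            raise TokenBudgetExceeded(
                f"Conversation total {self.total_used} tokens "
                f"(limit: {self.per_conversation})"
            )
        return True

budget = TokenBudget(per_turn=8000, per_conversation=20000)
completed_turns = 0
for turn_tokens in [5000, 7000, 6000, 4000]:
    try:
        budget.check(turn_tokens)
        completed_turns += 1
    except TokenBudgetExceeded as exc:
        print(f"Stopping agent loop: {exc}")
        break
```

In production you would log the exception and hand the conversation off gracefully rather than just breaking out of the loop.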

FAQ

Why does the same agent cost five times more for some conversations than others?

Conversation length is the primary driver. A 3-turn conversation might use 15,000 tokens total, but a 10-turn conversation with large tool results can use 150,000 tokens because the full history is re-sent on every turn. Tool result size also varies — a search returning 2 results costs far less than one returning 20.

How do I reduce token usage without losing agent capabilities?

Focus on the three biggest levers: trim tool results to include only fields the agent needs, implement conversation history summarization for long sessions, and remove redundancy between your system prompt and tool descriptions. These three changes typically reduce token usage by 40 to 60 percent.
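The summarization lever can be sketched like this. In practice `summarize` would call a small, cheap model; the one-line stub here is purely illustrative:

```python
def summarize(messages: list[dict]) -> str:
    """Stub for a call to a small, cheap model that compresses old
    turns. Here it just keeps the first sentence of each message."""
    firsts = [(m.get("content") or "").split(".")[0] for m in messages]
    return "Earlier in this conversation: " + "; ".join(f for f in firsts if f)

def compact_history(messages: list[dict], keep_recent: int = 4) -> list[dict]:
    """Replace everything between the system prompt and the most
    recent turns with a single summary message."""
    if len(messages) <= keep_recent + 1:
        return messages
    system = messages[0]
    old, recent = messages[1:-keep_recent], messages[-keep_recent:]
    summary = {"role": "system", "content": summarize(old)}
    return [system, summary] + recent

history = [{"role": "system", "content": "You are a support agent."}] + [
    {"role": "user", "content": f"Question {i}. More detail here."} for i in range(8)
]
compacted = compact_history(history)
print(len(history), "->", len(compacted))
```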

Should I use a cheaper model for some turns to save tokens?

Yes. Route simple classification or extraction tasks to smaller, cheaper models and reserve the large model for complex reasoning. This is called model cascading and can cut costs by 60 to 80 percent while maintaining quality for the tasks that need it.
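A minimal routing sketch — the model names and the task-type heuristic are invented for illustration, not any provider's API:

```python
CHEAP_MODEL = "small-model"   # hypothetical model names
LARGE_MODEL = "large-model"

def pick_model(task_type: str) -> str:
    """Route simple classification/extraction tasks to the cheap
    model; reserve the large model for multi-step reasoning."""
    simple_tasks = {"classify", "extract", "route"}
    return CHEAP_MODEL if task_type in simple_tasks else LARGE_MODEL

print(pick_model("classify"), pick_model("plan_refund"))
```

Real cascades often add a confidence check: if the cheap model's answer looks uncertain, the request is retried on the large model.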


#Debugging #TokenUsage #CostOptimization #AIAgents #Performance #AgenticAI #LearnAI #AIEngineering

CallSphere Team

Expert insights on AI voice agents and customer communication automation.
