Learn Agentic AI · 11 min read

Debugging Token Usage: Finding Why Your Agent Consumes More Tokens Than Expected

Discover how to identify and fix excessive token consumption in AI agents by analyzing prompt bloat, conversation history growth, tool definition overhead, and applying targeted optimization strategies.

Why Your Token Bill Keeps Growing

You launch an AI agent that costs a few cents per conversation in testing. In production, some conversations cost several dollars. The model is the same, the prompts have not changed, but the token usage has exploded. Where are the tokens going?

Token consumption in agentic systems is fundamentally different from simple chat applications. Every tool call, every tool result, every intermediate reasoning step, and every message in the conversation history gets sent back to the model on the next turn. A 10-turn agent conversation does not cost 10 times a single turn — it can cost 55 times (1 + 2 + 3 + ... + 10) because of the accumulating context window.
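The arithmetic is easy to verify. Assuming each turn adds a fixed number of new tokens and the full history is re-sent on every turn, total input tokens grow with the triangular number of the turn count (the 500-token turn size below is an arbitrary assumption):

```python
def cumulative_input_tokens(turns: int, turn_tokens: int) -> int:
    """Total input tokens across a conversation where turn n re-sends
    all n turns of `turn_tokens` each."""
    return sum(n * turn_tokens for n in range(1, turns + 1))

single = cumulative_input_tokens(1, 500)   # 500 tokens
ten = cumulative_input_tokens(10, 500)     # 27,500 tokens
print(f"{ten // single}x the cost of one turn")
```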

Building a Token Profiler

The first step is measuring where tokens are actually being spent:

import tiktoken
from dataclasses import dataclass

@dataclass
class TokenBreakdown:
    system_prompt: int = 0
    tool_definitions: int = 0
    conversation_history: int = 0
    current_turn: int = 0
    total: int = 0

class TokenProfiler:
    def __init__(self, model: str = "gpt-4o"):
        self.encoder = tiktoken.encoding_for_model(model)
        self.turn_snapshots: list[TokenBreakdown] = []

    def count(self, text: str) -> int:
        return len(self.encoder.encode(text))

    def profile_request(
        self, messages: list[dict], tools: list[dict] | None = None
    ) -> TokenBreakdown:
        breakdown = TokenBreakdown()

        for i, msg in enumerate(messages):
            tokens = self.count(msg.get("content") or "")
            if msg["role"] == "system":
                breakdown.system_prompt += tokens
            elif i == len(messages) - 1:
                # Compare by index: `msg == messages[-1]` would also match
                # a duplicate of the final message earlier in the history.
                breakdown.current_turn += tokens
            else:
                breakdown.conversation_history += tokens

        if tools:
            import json

            tool_text = json.dumps(tools)
            breakdown.tool_definitions = self.count(tool_text)

        breakdown.total = (
            breakdown.system_prompt
            + breakdown.tool_definitions
            + breakdown.conversation_history
            + breakdown.current_turn
        )
        self.turn_snapshots.append(breakdown)
        return breakdown

    def print_report(self):
        print("Turn | System | Tools | History | Current |  Total")
        print("-----|--------|-------|---------|---------|-------")
        for i, snap in enumerate(self.turn_snapshots):
            print(
                f"  {i + 1:2d} | {snap.system_prompt:6d} | "
                f"{snap.tool_definitions:5d} | {snap.conversation_history:7d} | "
                f"{snap.current_turn:7d} | {snap.total:6d}"
            )

Running this profiler across a multi-turn conversation reveals exactly where the growth happens.
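A minimal usage sketch — with a crude whitespace word count standing in for tiktoken so it runs without the dependency, and message contents that are invented for illustration:

```python
import json
from dataclasses import dataclass

@dataclass
class TokenBreakdown:
    system_prompt: int = 0
    tool_definitions: int = 0
    conversation_history: int = 0
    current_turn: int = 0
    total: int = 0

class SimpleProfiler:
    """Same classification logic as TokenProfiler above, but count()
    splits on whitespace instead of calling tiktoken."""

    def __init__(self):
        self.turn_snapshots: list[TokenBreakdown] = []

    def count(self, text: str) -> int:
        return len(text.split())  # crude stand-in for encoder.encode

    def profile_request(self, messages, tools=None):
        breakdown = TokenBreakdown()
        for i, msg in enumerate(messages):
            tokens = self.count(msg.get("content") or "")
            if msg["role"] == "system":
                breakdown.system_prompt += tokens
            elif i == len(messages) - 1:
                breakdown.current_turn += tokens
            else:
                breakdown.conversation_history += tokens
        if tools:
            breakdown.tool_definitions = self.count(json.dumps(tools))
        breakdown.total = (
            breakdown.system_prompt
            + breakdown.tool_definitions
            + breakdown.conversation_history
            + breakdown.current_turn
        )
        self.turn_snapshots.append(breakdown)
        return breakdown

profiler = SimpleProfiler()
snap = profiler.profile_request(
    [
        {"role": "system", "content": "You are a support agent."},
        {"role": "user", "content": "Where is my order?"},
        {"role": "assistant", "content": "Let me check that for you."},
        {"role": "user", "content": "Thanks"},
    ],
    tools=[{"name": "get_order"}],
)
print(snap)
```

Swapping the whitespace counter back for the real encoder changes the numbers, not the breakdown logic.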

Common Token Bloat Patterns

Pattern 1: Tool results that are too large. A database query tool returns the entire row set including columns the agent does not need:


# Bad: returns everything
@function_tool
async def get_customer(customer_id: str) -> str:
    row = await db.fetch_one(
        "SELECT * FROM customers WHERE id = $1", customer_id
    )
    return json.dumps(dict(row))  # 50+ columns, 2000 tokens

# Good: return only what the agent needs
@function_tool
async def get_customer(customer_id: str) -> str:
    row = await db.fetch_one(
        "SELECT name, email, plan, status FROM customers WHERE id = $1",
        customer_id,
    )
    return json.dumps(dict(row))  # 4 columns, 80 tokens
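When the query itself cannot be narrowed, a defensive cap on any tool result keeps a single call from flooding the context. A sketch (`cap_tool_result` is a hypothetical helper; the 4-characters-per-token ratio is a rough rule of thumb):

```python
def cap_tool_result(result: str, max_chars: int = 2000) -> str:
    """Truncate oversized tool output before it enters the context.
    At roughly 4 characters per token, 2000 chars is about 500 tokens."""
    if len(result) <= max_chars:
        return result
    return result[:max_chars] + f"\n[truncated {len(result) - max_chars} chars]"

print(cap_tool_result("x" * 10_000)[-30:])
```

Telling the model the result was truncated (rather than silently cutting it) lets it re-query with a narrower request if it needs more.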

Pattern 2: Conversation history that never gets trimmed. Every message from every turn stays in the context:

class ConversationManager:
    def __init__(self, max_history_tokens: int = 4000):
        self.messages: list[dict] = []
        self.max_tokens = max_history_tokens
        self.encoder = tiktoken.encoding_for_model("gpt-4o")

    def add_message(self, role: str, content: str):
        self.messages.append({"role": role, "content": content})
        self._trim()

    def _trim(self):
        """Remove oldest messages when history exceeds token budget."""
        while self._total_tokens() > self.max_tokens and len(self.messages) > 2:
            # Keep system prompt (index 0), remove oldest user/assistant
            self.messages.pop(1)

    def _total_tokens(self) -> int:
        return sum(
            len(self.encoder.encode(m.get("content", "") or ""))
            for m in self.messages
        )

Pattern 3: Verbose system prompts that repeat information already in tool descriptions. Consolidate instructions and avoid duplication between your system prompt and tool docstrings.
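A before-and-after sketch of that duplication — the prompt text and tool names here are invented for illustration:

```python
# Bad: the system prompt restates what each tool's docstring already
# tells the model through the tool schema.
BLOATED_PROMPT = (
    "You are a support agent. Use get_customer to look up a customer "
    "by ID; it returns name, email, plan, and status. Use search_orders "
    "to find orders; pass a customer ID and an optional date range."
)

# Good: the prompt covers behavior; the tool schemas describe the tools.
LEAN_PROMPT = (
    "You are a support agent. Use the available tools to answer "
    "billing and order questions. Never state account data you have "
    "not looked up."
)

print(f"saved {len(BLOATED_PROMPT) - len(LEAN_PROMPT)} characters per request")
```

The savings repeat on every single turn, which is what makes system-prompt bloat expensive.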

Setting Token Budgets

Define per-conversation and per-turn budgets to catch runaway usage early:

class TokenBudgetExceeded(Exception):
    pass

class TokenBudget:
    def __init__(self, per_turn: int = 8000, per_conversation: int = 50000):
        self.per_turn = per_turn
        self.per_conversation = per_conversation
        self.total_used = 0

    def check(self, turn_tokens: int) -> bool:
        if turn_tokens > self.per_turn:
            raise TokenBudgetExceeded(
                f"Turn used {turn_tokens} tokens (limit: {self.per_turn})"
            )
        self.total_used += turn_tokens
        if self.total_used > self.per_conversation:
            raise TokenBudgetExceeded(
                f"Conversation total {self.total_used} tokens "
                f"(limit: {self.per_conversation})"
            )
        return True
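Wiring the budget into an agent loop turns a runaway conversation into a clean failure. A sketch (the budget class is repeated so this runs standalone, and the per-turn counts are made up):

```python
class TokenBudgetExceeded(Exception):
    pass

class TokenBudget:
    """Same logic as above, repeated so this sketch is self-contained."""

    def __init__(self, per_turn: int = 8000, per_conversation: int = 50000):
        self.per_turn = per_turn
        self.per_conversation = per_conversation
        self.total_used = 0

    def check(self, turn_tokens: int) -> bool:
        if turn_tokens > self.per_turn:
            raise TokenBudgetExceeded(
                f"Turn used {turn_tokens} tokens (limit: {self.per_turn})"
            )
        self.total_used += turn_tokens
        if self.total_used > self.per_conversation:
            raise TokenBudgetExceeded(
                f"Conversation total {self.total_used} tokens "
                f"(limit: {self.per_conversation})"
            )
        return True

budget = TokenBudget(per_turn=8000, per_conversation=20000)
completed_turns = 0
for turn_tokens in [5000, 7000, 6000, 4000]:
    try:
        budget.check(turn_tokens)
        completed_turns += 1
    except TokenBudgetExceeded as exc:
        print(f"Stopping agent loop: {exc}")
        break
```

In production you would log the exception and hand the conversation off gracefully rather than just breaking out of the loop.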

FAQ

Why does the same agent cost five times more for some conversations than others?

Conversation length is the primary driver. A 3-turn conversation might use 15,000 tokens total, but a 10-turn conversation with large tool results can use 150,000 tokens because the full history is re-sent on every turn. Tool result size also varies — a search returning 2 results costs far less than one returning 20.

How do I reduce token usage without losing agent capabilities?

Focus on the three biggest levers: trim tool results to include only fields the agent needs, implement conversation history summarization for long sessions, and remove redundancy between your system prompt and tool descriptions. These three changes typically reduce token usage by 40 to 60 percent.
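The summarization lever can be sketched like this. In practice `summarize` would call a small, cheap model; the one-line stub here is purely illustrative:

```python
def summarize(messages: list[dict]) -> str:
    """Stub for a call to a small, cheap model that compresses old
    turns. Here it just keeps the first sentence of each message."""
    firsts = [(m.get("content") or "").split(".")[0] for m in messages]
    return "Earlier in this conversation: " + "; ".join(f for f in firsts if f)

def compact_history(messages: list[dict], keep_recent: int = 4) -> list[dict]:
    """Replace everything between the system prompt and the most
    recent turns with a single summary message."""
    if len(messages) <= keep_recent + 1:
        return messages
    system = messages[0]
    old, recent = messages[1:-keep_recent], messages[-keep_recent:]
    summary = {"role": "system", "content": summarize(old)}
    return [system, summary] + recent

history = [{"role": "system", "content": "You are a support agent."}] + [
    {"role": "user", "content": f"Question {i}. More detail here."} for i in range(8)
]
compacted = compact_history(history)
print(len(history), "->", len(compacted))
```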

Should I use a cheaper model for some turns to save tokens?

Yes. Route simple classification or extraction tasks to smaller, cheaper models and reserve the large model for complex reasoning. This is called model cascading and can cut costs by 60 to 80 percent while maintaining quality for the tasks that need it.
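A minimal routing sketch — the model names and the task-type heuristic are invented for illustration, not any provider's API:

```python
CHEAP_MODEL = "small-model"   # hypothetical model names
LARGE_MODEL = "large-model"

def pick_model(task_type: str) -> str:
    """Route simple classification/extraction tasks to the cheap
    model; reserve the large model for multi-step reasoning."""
    simple_tasks = {"classify", "extract", "route"}
    return CHEAP_MODEL if task_type in simple_tasks else LARGE_MODEL

print(pick_model("classify"), pick_model("plan_refund"))
```

Real cascades often add a confidence check: if the cheap model's answer looks uncertain, the request is retried on the large model.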


#Debugging #TokenUsage #CostOptimization #AIAgents #Performance #AgenticAI #LearnAI #AIEngineering

CallSphere Team

Expert insights on AI voice agents and customer communication automation.
