Token-Efficient Agent Design: Reducing LLM Costs Without Sacrificing Quality
Practical strategies for reducing LLM token costs in agentic systems including compact prompts, tool result summarization, selective context, and model tiering approaches.
Why Token Costs Compound in Agentic Systems
A single chatbot exchange might use 2,000 tokens. A single agent interaction that involves planning, tool use, evaluation, and response generation can easily consume 50,000-200,000 tokens. Multiply that by thousands of daily interactions and the cost curve becomes a serious business constraint.
The problem compounds because of how agent loops work. Each iteration of the planning loop sends the full conversation history (including all previous tool calls and results) back to the model. If an agent takes 8 steps to complete a task and each step adds 3,000 tokens of tool results, the final call includes 24,000 tokens of accumulated context on top of the system prompt and original user message.
Token-efficient agent design is not about making your agents dumber. It is about being strategic about what information reaches the model at each step, using the right model for each task, and eliminating waste without sacrificing the quality of the agent's reasoning.
Strategy 1: Compact System Prompts
System prompts are the largest fixed cost in agent systems because they are sent with every single LLM call. A verbose system prompt of 3,000 tokens multiplied by 10 calls per interaction multiplied by 10,000 daily interactions equals 300 million tokens per day in system prompts alone.
The solution is not to remove information from system prompts but to express the same information more concisely.
# Before: Verbose system prompt (2,847 tokens)
VERBOSE_PROMPT = """
You are a helpful customer service assistant for TechCorp.
Your name is Alex. You should always be polite and professional.
When a customer asks about their order, you should look up the
order using the order_lookup tool. Make sure to verify the
customer's identity before sharing order details. You have
access to the following tools...
[... 2000 more tokens of instructions ...]
"""
# After: Compact system prompt (892 tokens)
COMPACT_PROMPT = """Role: TechCorp customer service agent (Alex)
Tone: Professional, concise
## Rules
1. Verify identity before sharing account data
2. Use tools for data lookup; never fabricate order details
3. Escalate to human if: refund > $500, legal threat, repeated failure
## Tool Selection
- order_lookup: order status, tracking, history
- account_info: profile, preferences, subscription
- refund_process: initiate refunds (auto-approve ≤ $500)
- escalate: transfer to human agent with context summary
"""
# Token savings: 1,955 tokens per call
# At 10 calls/interaction, 10K interactions/day:
# 195.5M tokens saved daily
Key techniques for compact prompts:
- Use structured formats (markdown headers, numbered lists) instead of prose
- Eliminate redundancy: "You should look up the order using the order_lookup tool" becomes a tool description
- Replace examples with rules: instead of showing 5 example conversations, state the behavioral rules they illustrate
- Use abbreviations consistently within the prompt
Prompt Caching
Most major LLM providers now support prompt caching, where the system prompt (and any static prefix) is cached between calls. This can reduce costs by 80-90% for the cached portion. To maximize cache hit rates:
- Keep your system prompt identical across all calls (do not inject dynamic data into the system prompt)
- Place static content before dynamic content in your messages
- Use the same model for all calls within an agent session
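To keep the static prefix cacheable, assemble messages so the unchanging part is byte-identical on every call. A minimal sketch of this ordering; the names `STATIC_SYSTEM` and `build_cacheable_messages` are illustrative, not a provider API:

```python
# Sketch: static system prompt first (cache-friendly), dynamic data after.
STATIC_SYSTEM = (
    "Role: TechCorp customer service agent (Alex)\n"
    "Tone: Professional, concise\n"
    "Rules: verify identity; use tools; escalate per policy."
)

def build_cacheable_messages(user_context: dict, user_message: str) -> list[dict]:
    """The first message is identical on every call, so providers can cache it."""
    dynamic = f"Current user context: {user_context}"
    return [
        {"role": "system", "content": STATIC_SYSTEM},  # never changes
        {"role": "system", "content": dynamic},        # varies per call
        {"role": "user", "content": user_message},
    ]
```

With provider-side caching (for example, Anthropic's `cache_control` content blocks), only the static first message would be marked cacheable; the dynamic context and user message change per call and are billed normally.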
Strategy 2: Tool Result Summarization
Tool results are the fastest-growing cost center in agent systems. A database query might return a 5,000-token JSON response, but the agent only needs 3 fields from it. A web search might return 10,000 tokens of content, but only 2 paragraphs are relevant.
# Tool result summarization pipeline
import json
from typing import Any
class ToolResultSummarizer:
"""
Reduces tool output tokens before they enter the agent context.
Uses rules-based summarization for structured data and
a fast model for unstructured content.
"""
def __init__(self, fast_model):
self.fast_model = fast_model
self.rules = {}
def register_rule(self, tool_name: str, summarizer):
"""Register a rules-based summarizer for a specific tool."""
self.rules[tool_name] = summarizer
async def summarize(
self, tool_name: str, raw_result: Any, query_context: str
) -> str:
# Try rules-based summarization first (zero token cost)
if tool_name in self.rules:
return self.rules[tool_name](raw_result)
# Fall back to model-based summarization for unstructured data
return await self._model_summarize(raw_result, query_context)
async def _model_summarize(self, raw_result: Any, context: str) -> str:
result_str = str(raw_result)
if len(result_str) < 500:
return result_str # Short enough, no summarization needed
response = await self.fast_model.complete(
prompt=(
f"Summarize this tool result in under 200 words, "
f"keeping only information relevant to: {context}\n\n"
f"Tool result:\n{result_str[:3000]}" # Cap input
),
max_tokens=300,
)
return response.text
# Rules-based summarizers for structured data
def summarize_order_lookup(result: dict) -> str:
"""Extract only the fields the agent needs."""
order = result.get("order", {})
return (
f"Order #{order.get('id')}: "
f"Status={order.get('status')}, "
f"Items={len(order.get('items', []))}, "
f"Total=${order.get('total', 0):.2f}, "
f"Shipped={order.get('shipped_at', 'pending')}, "
f"ETA={order.get('estimated_delivery', 'unknown')}"
)
def summarize_db_query(result: list[dict]) -> str:
"""Summarize database query results."""
if not result:
return "No results found."
count = len(result)
# Include first 3 rows in detail, summarize the rest
detail = "\n".join(
f"- {json.dumps(row, default=str)}" for row in result[:3]
)
suffix = f"\n... and {count - 3} more rows" if count > 3 else ""
return f"Found {count} results:\n{detail}{suffix}"
# Usage
summarizer = ToolResultSummarizer(fast_model=haiku_client)  # haiku_client: any fast-model client
summarizer.register_rule("order_lookup", summarize_order_lookup)
summarizer.register_rule("db_query", summarize_db_query)
The impact is substantial. A raw order lookup response might be 1,200 tokens. The summarized version is 40 tokens. Over 8 agent steps, that saves 9,280 tokens per interaction.
Strategy 3: Selective Context Inclusion
Not every previous message needs to be in the context window for every LLM call. An agent executing step 8 of a plan rarely needs the full verbatim content of steps 1-3. It needs the plan, the current step, and the results of the immediately preceding steps.
# Context window manager with selective inclusion
from dataclasses import dataclass
@dataclass
class ContextBudget:
max_tokens: int
system_prompt_tokens: int
current_message_tokens: int
reserved_for_response: int
@property
def available_for_history(self) -> int:
return (
self.max_tokens
- self.system_prompt_tokens
- self.current_message_tokens
- self.reserved_for_response
)
class SelectiveContextManager:
def __init__(self, tokenizer):
self.tokenizer = tokenizer
def build_context(
self,
full_history: list[dict],
budget: ContextBudget,
current_step: int,
) -> list[dict]:
available = budget.available_for_history
context = []
used_tokens = 0
# Priority 1: Always include the original user request
if full_history:
first_msg = full_history[0]
tokens = self.tokenizer.count(str(first_msg))
context.append(first_msg)
used_tokens += tokens
# Priority 2: Include the last 3 exchanges (most recent context)
        # 3 exchanges = 6 messages; skip index 0, which is already included
        recent = full_history[-6:] if len(full_history) > 6 else full_history[1:]
for msg in recent:
tokens = self.tokenizer.count(str(msg))
if used_tokens + tokens > available:
break
context.append(msg)
used_tokens += tokens
# Priority 3: Include summarized middle context if budget allows
middle = full_history[1:-6] if len(full_history) > 7 else []
if middle and used_tokens < available * 0.7:
summary = self._summarize_middle(middle)
summary_tokens = self.tokenizer.count(summary)
if used_tokens + summary_tokens <= available:
context.insert(1, {
"role": "system",
"content": f"[Summary of earlier conversation]\n{summary}"
})
return context
def _summarize_middle(self, messages: list[dict]) -> str:
"""Create a bullet-point summary of middle conversation turns."""
points = []
for msg in messages:
role = msg["role"]
content = msg.get("content", "")
            if role == "tool":
                # Compress tool results aggressively
                points.append(f"- Tool returned: {str(content)[:100]}...")
            elif role == "assistant" and "tool_use" in str(msg):
                points.append("- Agent called tool")
            else:
                points.append(f"- {role}: {str(content)[:80]}...")
return "\n".join(points)
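A minimal usage sketch of the budget arithmetic above. The toy whitespace tokenizer is an illustration only; in production you would use your model's real tokenizer (for example, tiktoken):

```python
from dataclasses import dataclass

class WordCountTokenizer:
    """Toy stand-in for a real tokenizer: counts whitespace-separated words."""
    def count(self, text: str) -> int:
        return len(text.split())

@dataclass
class ContextBudget:  # mirrors the ContextBudget defined above
    max_tokens: int
    system_prompt_tokens: int
    current_message_tokens: int
    reserved_for_response: int

    @property
    def available_for_history(self) -> int:
        return (
            self.max_tokens
            - self.system_prompt_tokens
            - self.current_message_tokens
            - self.reserved_for_response
        )

tok = WordCountTokenizer()
budget = ContextBudget(
    max_tokens=8000,
    system_prompt_tokens=900,
    current_message_tokens=tok.count("Where is my order? It was due yesterday."),
    reserved_for_response=1000,
)
# 8000 - 900 - 8 - 1000 = 6092 tokens left for conversation history
```

Reserving tokens for the response up front matters: without it, a fully packed history can leave the model too little room to answer, forcing a truncated completion and a retry.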
Strategy 4: Model Tiering
Not every LLM call in an agent pipeline requires the same capability. Classification and routing can use a fast, cheap model. Complex reasoning requires a capable, expensive model. Using the right model for each task can reduce costs by 60-80%.
# Model tiering strategy for agent pipelines
from enum import Enum
class ModelTier(Enum):
FAST = "fast" # Classification, routing, simple extraction
CAPABLE = "capable" # Reasoning, planning, complex tool use
PREMIUM = "premium" # Critical decisions, complex analysis
# Model mapping (adjust based on your provider)
MODEL_MAP = {
ModelTier.FAST: {
"name": "claude-3-5-haiku-20241022",
"cost_per_1m_input": 0.80,
"cost_per_1m_output": 4.00,
},
ModelTier.CAPABLE: {
"name": "claude-sonnet-4-20250514",
"cost_per_1m_input": 3.00,
"cost_per_1m_output": 15.00,
},
ModelTier.PREMIUM: {
"name": "claude-opus-4-20250918",
"cost_per_1m_input": 15.00,
"cost_per_1m_output": 75.00,
},
}
class TieredAgentExecutor:
    def __init__(self, llm_pool):
        # llm_pool: any client exposing async chat_completion(model, messages, max_tokens)
        self.pool = llm_pool
async def route_message(self, message: str, context: dict) -> str:
"""FAST tier: classify and route incoming messages."""
return await self.pool.chat_completion(
model=MODEL_MAP[ModelTier.FAST]["name"],
messages=[{
"role": "user",
"content": f"Classify this message into one of: "
f"billing, technical, account, escalation.\n"
f"Message: {message}\nCategory:"
}],
max_tokens=20,
)
    async def plan_actions(self, task: str, context: dict) -> str:
        """CAPABLE tier: create an execution plan as text."""
return await self.pool.chat_completion(
model=MODEL_MAP[ModelTier.CAPABLE]["name"],
messages=[{
"role": "system",
"content": "Create an action plan for the given task."
}, {
"role": "user",
"content": f"Task: {task}\nContext: {context}"
}],
max_tokens=1000,
)
async def critical_decision(self, decision: str, stakes: dict) -> dict:
"""PREMIUM tier: high-stakes decisions requiring maximum accuracy."""
return await self.pool.chat_completion(
model=MODEL_MAP[ModelTier.PREMIUM]["name"],
messages=[{
"role": "system",
"content": "You are making a high-stakes decision. "
"Reason carefully and explain your logic."
}, {
"role": "user",
"content": f"Decision: {decision}\nStakes: {stakes}"
}],
max_tokens=2000,
)
# Cost comparison per interaction:
# All-premium: ~$0.45/interaction
# All-capable: ~$0.09/interaction
# Tiered (70% fast, 25% capable, 5% premium): ~$0.04/interaction
# Savings: 91% vs all-premium, 56% vs all-capable
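The blended figure above can be reproduced with simple expected-value arithmetic. A sketch where the rates mirror MODEL_MAP but the per-call token counts and tier mix are assumptions chosen for illustration:

```python
def blended_cost(
    mix: dict[str, float],
    rates: dict[str, tuple[float, float]],  # (input, output) $ per 1M tokens
    input_tokens: int,
    output_tokens: int,
    calls: int,
) -> float:
    """Expected cost of one interaction given a tier mix (shares sum to 1)."""
    cost = 0.0
    for tier, share in mix.items():
        in_rate, out_rate = rates[tier]
        per_call = (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000
        cost += share * per_call * calls
    return cost

RATES = {"fast": (0.80, 4.00), "capable": (3.00, 15.00), "premium": (15.00, 75.00)}
tiered = blended_cost(
    {"fast": 0.70, "capable": 0.25, "premium": 0.05},
    RATES, input_tokens=2000, output_tokens=200, calls=8,
)
# With these assumed token counts, tiered lands near $0.05/interaction
```

The exact figure depends heavily on the assumed tokens per call, so treat the ~$0.04 number as an order-of-magnitude estimate rather than a constant.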
Strategy 5: Response Streaming and Early Termination
Streaming responses reduce perceived latency and enable early termination when the model starts generating irrelevant content. This saves both output tokens and user wait time.
Implement a streaming monitor that watches for quality signals:
- If the model starts repeating itself, stop generation
- If the model produces a complete tool call, stop waiting for more text
- If the model produces a complete answer before reaching max tokens, the streaming endpoint closes naturally
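The first check above can be implemented with a simple rules-based monitor on the accumulated stream. A sketch; the trailing-window repetition heuristic and its thresholds are assumptions, and a real monitor would also watch for stop sequences and complete tool-call blocks:

```python
def should_stop_stream(text: str, window: int = 40, repeats: int = 3) -> bool:
    """Return True when the last `window` characters occur `repeats` or more
    times in the tail of the output, i.e. the model is looping."""
    if len(text) < window * repeats:
        return False
    tail = text[-window:]
    return text[-window * repeats:].count(tail) >= repeats
```

Call this after each streamed chunk on the accumulated text; when it returns True, cancel the stream and keep the output generated so far.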
Combined with the other strategies, streaming and early termination typically save 10-15% of output tokens.
Putting It All Together: Cost Impact Analysis
For a system processing 10,000 agent interactions per day with an average of 8 LLM calls per interaction:
| Strategy | Token Savings | Cost Reduction |
|---|---|---|
| Compact prompts | 30-50% of system tokens | 15-20% total |
| Tool summarization | 60-80% of tool tokens | 20-30% total |
| Selective context | 40-60% of history tokens | 15-25% total |
| Model tiering | N/A (model cost reduction) | 50-70% total |
| Streaming + early stop | 10-15% of output tokens | 5-10% total |
Applied together, these strategies can reduce total LLM costs by 70-85% compared to a naive implementation. For a system that would cost $5,000 per day without optimization, this brings the cost down to $750-1,500 per day.
FAQ
Do token optimization strategies degrade agent quality?
When applied carefully, no. The key is to optimize information density, not reduce information. A summarized tool result that contains all relevant fields is just as useful to the model as the full JSON response. A compact system prompt that covers the same rules is just as effective as a verbose one. The risk comes from over-aggressive summarization that drops critical context. Always evaluate agent quality metrics after applying optimizations.
How do you measure token efficiency?
Track three metrics: tokens per interaction (total tokens consumed for a complete agent interaction), cost per successful resolution (total cost divided by the number of interactions that achieved the user's goal), and quality-adjusted cost (cost weighted by customer satisfaction score). The third metric prevents optimizing cost at the expense of quality.
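A minimal tracker for these three metrics might look like the following sketch; the field names and the 0-1 satisfaction scale are assumptions:

```python
from dataclasses import dataclass, field

@dataclass
class EfficiencyTracker:
    # Each record: (tokens, cost, resolved, csat in [0, 1])
    interactions: list = field(default_factory=list)

    def record(self, tokens: int, cost: float, resolved: bool, csat: float):
        self.interactions.append((tokens, cost, resolved, csat))

    def tokens_per_interaction(self) -> float:
        return sum(t for t, *_ in self.interactions) / len(self.interactions)

    def cost_per_resolution(self) -> float:
        total_cost = sum(c for _, c, *_ in self.interactions)
        resolved = sum(1 for *_, r, _ in self.interactions if r)
        return total_cost / resolved if resolved else float("inf")

    def quality_adjusted_cost(self) -> float:
        # Dividing by mean satisfaction inflates the effective cost when CSAT drops
        total_cost = sum(c for _, c, *_ in self.interactions)
        mean_csat = sum(s for *_, s in self.interactions) / len(self.interactions)
        return total_cost / max(mean_csat, 1e-9)
```

A cheap change that tanks `quality_adjusted_cost` while lowering raw cost is a regression, not a win.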
Is prompt caching compatible with dynamic system prompts?
Prompt caching works best with static prefixes. If your system prompt changes between calls (e.g., injecting current user data), the dynamic portion will not be cached. The solution is to structure your prompts with the static portion first (agent role, rules, tool descriptions) and dynamic data second (current user context, conversation history). The static prefix gets cached even if the dynamic suffix changes.
When should I use a smaller model versus context truncation?
Use a smaller model when the task is inherently simple (classification, extraction, formatting) regardless of context length. Use context truncation when the task is complex but the model does not need all available context. If the task is complex and requires extensive context, use the capable model with full context and accept the higher cost. The worst outcome is using a small model on a complex task where it fails and requires a retry on the expensive model, doubling your cost.
Written by
CallSphere Team
Expert insights on AI voice agents and customer communication automation.