Token-Efficient Agent Design: Reducing LLM Costs Without Sacrificing Quality
Practical strategies for reducing LLM token costs in agentic systems including compact prompts, tool result summarization, selective context, and model tiering approaches.
Why Token Costs Compound in Agentic Systems
A single chatbot exchange might use 2,000 tokens. A single agent interaction that involves planning, tool use, evaluation, and response generation can easily consume 50,000-200,000 tokens. Multiply that by thousands of daily interactions and the cost curve becomes a serious business constraint.
The problem compounds because of how agent loops work. Each iteration of the planning loop sends the full conversation history (including all previous tool calls and results) back to the model. If an agent takes 8 steps to complete a task and each step adds 3,000 tokens of tool results, the final call includes 24,000 tokens of accumulated context on top of the system prompt and original user message.
Token-efficient agent design is not about making your agents dumber. It is about being strategic about what information reaches the model at each step, using the right model for each task, and eliminating waste without sacrificing the quality of the agent's reasoning.
Strategy 1: Compact System Prompts
System prompts are the largest fixed cost in agent systems because they are sent with every single LLM call. A verbose system prompt of 3,000 tokens multiplied by 10 calls per interaction multiplied by 10,000 daily interactions equals 300 million tokens per day in system prompts alone.
The solution is not to remove information from system prompts but to express the same information more concisely.
# Before: Verbose system prompt (2,847 tokens)
VERBOSE_PROMPT = """
You are a helpful customer service assistant for TechCorp.
Your name is Alex. You should always be polite and professional.
When a customer asks about their order, you should look up the
order using the order_lookup tool. Make sure to verify the
customer's identity before sharing order details. You have
access to the following tools...
[... 2000 more tokens of instructions ...]
"""
# After: Compact system prompt (892 tokens)
COMPACT_PROMPT = """Role: TechCorp customer service agent (Alex)
Tone: Professional, concise
## Rules
1. Verify identity before sharing account data
2. Use tools for data lookup; never fabricate order details
3. Escalate to human if: refund > $500, legal threat, repeated failure
## Tool Selection
- order_lookup: order status, tracking, history
- account_info: profile, preferences, subscription
- refund_process: initiate refunds (auto-approve ≤ $500)
- escalate: transfer to human agent with context summary
"""
# Token savings: 1,955 tokens per call
# At 10 calls/interaction, 10K interactions/day:
# 195.5M tokens saved daily
Key techniques for compact prompts:
- Use structured formats (markdown headers, numbered lists) instead of prose
- Eliminate redundancy: "You should look up the order using the order_lookup tool" becomes a tool description
- Replace examples with rules: instead of showing 5 example conversations, state the behavioral rules they illustrate
- Use abbreviations consistently within the prompt
Prompt Caching
Most major LLM providers now support prompt caching, where the system prompt (and any static prefix) is cached between calls. This can reduce costs by 80-90% for the cached portion. To maximize cache hit rates:
- Keep your system prompt identical across all calls (do not inject dynamic data into the system prompt)
- Place static content before dynamic content in your messages
- Use the same model for all calls within an agent session
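To keep the static prefix cacheable, assemble messages so the unchanging part is byte-identical on every call. A minimal sketch of this ordering; the names `STATIC_SYSTEM` and `build_cacheable_messages` are illustrative, not a provider API:

```python
# Sketch: static system prompt first (cache-friendly), dynamic data after.
STATIC_SYSTEM = (
    "Role: TechCorp customer service agent (Alex)\n"
    "Tone: Professional, concise\n"
    "Rules: verify identity; use tools; escalate per policy."
)

def build_cacheable_messages(user_context: dict, user_message: str) -> list[dict]:
    """The first message is identical on every call, so providers can cache it."""
    dynamic = f"Current user context: {user_context}"
    return [
        {"role": "system", "content": STATIC_SYSTEM},  # never changes
        {"role": "system", "content": dynamic},        # varies per call
        {"role": "user", "content": user_message},
    ]
```

With provider-side caching (for example, Anthropic's `cache_control` content blocks), only the static first message would be marked cacheable; the dynamic context and user message change per call and are billed normally.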
Strategy 2: Tool Result Summarization
Tool results are the fastest-growing cost center in agent systems. A database query might return a 5,000-token JSON response, but the agent only needs 3 fields from it. A web search might return 10,000 tokens of content, but only 2 paragraphs are relevant.
# Tool result summarization pipeline
import json
from typing import Any
class ToolResultSummarizer:
"""
Reduces tool output tokens before they enter the agent context.
Uses rules-based summarization for structured data and
a fast model for unstructured content.
"""
def __init__(self, fast_model):
self.fast_model = fast_model
self.rules = {}
def register_rule(self, tool_name: str, summarizer):
"""Register a rules-based summarizer for a specific tool."""
self.rules[tool_name] = summarizer
async def summarize(
self, tool_name: str, raw_result: Any, query_context: str
) -> str:
# Try rules-based summarization first (zero token cost)
if tool_name in self.rules:
return self.rules[tool_name](raw_result)
# Fall back to model-based summarization for unstructured data
return await self._model_summarize(raw_result, query_context)
async def _model_summarize(self, raw_result: Any, context: str) -> str:
result_str = str(raw_result)
if len(result_str) < 500:
return result_str # Short enough, no summarization needed
response = await self.fast_model.complete(
prompt=(
f"Summarize this tool result in under 200 words, "
f"keeping only information relevant to: {context}\n\n"
f"Tool result:\n{result_str[:3000]}" # Cap input
),
max_tokens=300,
)
return response.text
# Rules-based summarizers for structured data
def summarize_order_lookup(result: dict) -> str:
"""Extract only the fields the agent needs."""
order = result.get("order", {})
return (
f"Order #{order.get('id')}: "
f"Status={order.get('status')}, "
f"Items={len(order.get('items', []))}, "
f"Total=${order.get('total', 0):.2f}, "
f"Shipped={order.get('shipped_at', 'pending')}, "
f"ETA={order.get('estimated_delivery', 'unknown')}"
)
def summarize_db_query(result: list[dict]) -> str:
"""Summarize database query results."""
if not result:
return "No results found."
count = len(result)
# Include first 3 rows in detail, summarize the rest
detail = "\n".join(
f"- {json.dumps(row, default=str)}" for row in result[:3]
)
suffix = f"\n... and {count - 3} more rows" if count > 3 else ""
return f"Found {count} results:\n{detail}{suffix}"
# Usage
summarizer = ToolResultSummarizer(fast_model=haiku_client)  # haiku_client: any fast-model client
summarizer.register_rule("order_lookup", summarize_order_lookup)
summarizer.register_rule("db_query", summarize_db_query)
The impact is substantial. A raw order lookup response might be 1,200 tokens. The summarized version is 40 tokens. Over 8 agent steps, that saves 9,280 tokens per interaction.
Strategy 3: Selective Context Inclusion
Not every previous message needs to be in the context window for every LLM call. An agent executing step 8 of a plan rarely needs the full verbatim content of steps 1-3. It needs the plan, the current step, and the results of the immediately preceding steps.
# Context window manager with selective inclusion
from dataclasses import dataclass
@dataclass
class ContextBudget:
max_tokens: int
system_prompt_tokens: int
current_message_tokens: int
reserved_for_response: int
@property
def available_for_history(self) -> int:
return (
self.max_tokens
- self.system_prompt_tokens
- self.current_message_tokens
- self.reserved_for_response
)
class SelectiveContextManager:
def __init__(self, tokenizer):
self.tokenizer = tokenizer
def build_context(
self,
full_history: list[dict],
budget: ContextBudget,
current_step: int,
) -> list[dict]:
available = budget.available_for_history
context = []
used_tokens = 0
# Priority 1: Always include the original user request
if full_history:
first_msg = full_history[0]
tokens = self.tokenizer.count(str(first_msg))
context.append(first_msg)
used_tokens += tokens
# Priority 2: Include the last 3 exchanges (most recent context)
        # 3 exchanges = 6 messages; skip index 0, which is already included
        recent = full_history[-6:] if len(full_history) > 6 else full_history[1:]
for msg in recent:
tokens = self.tokenizer.count(str(msg))
if used_tokens + tokens > available:
break
context.append(msg)
used_tokens += tokens
# Priority 3: Include summarized middle context if budget allows
middle = full_history[1:-6] if len(full_history) > 7 else []
if middle and used_tokens < available * 0.7:
summary = self._summarize_middle(middle)
summary_tokens = self.tokenizer.count(summary)
if used_tokens + summary_tokens <= available:
context.insert(1, {
"role": "system",
"content": f"[Summary of earlier conversation]\n{summary}"
})
return context
def _summarize_middle(self, messages: list[dict]) -> str:
"""Create a bullet-point summary of middle conversation turns."""
points = []
for msg in messages:
role = msg["role"]
content = msg.get("content", "")
            if role == "tool":
                # Compress tool results aggressively
                points.append(f"- Tool returned: {str(content)[:100]}...")
            elif role == "assistant" and "tool_use" in str(msg):
                points.append("- Agent called tool")
            else:
                points.append(f"- {role}: {str(content)[:80]}...")
return "\n".join(points)
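A minimal usage sketch of the budget arithmetic above. The toy whitespace tokenizer is an illustration only; in production you would use your model's real tokenizer (for example, tiktoken):

```python
from dataclasses import dataclass

class WordCountTokenizer:
    """Toy stand-in for a real tokenizer: counts whitespace-separated words."""
    def count(self, text: str) -> int:
        return len(text.split())

@dataclass
class ContextBudget:  # mirrors the ContextBudget defined above
    max_tokens: int
    system_prompt_tokens: int
    current_message_tokens: int
    reserved_for_response: int

    @property
    def available_for_history(self) -> int:
        return (
            self.max_tokens
            - self.system_prompt_tokens
            - self.current_message_tokens
            - self.reserved_for_response
        )

tok = WordCountTokenizer()
budget = ContextBudget(
    max_tokens=8000,
    system_prompt_tokens=900,
    current_message_tokens=tok.count("Where is my order? It was due yesterday."),
    reserved_for_response=1000,
)
# 8000 - 900 - 8 - 1000 = 6092 tokens left for conversation history
```

Reserving tokens for the response up front matters: without it, a fully packed history can leave the model too little room to answer, forcing a truncated completion and a retry.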
Strategy 4: Model Tiering
Not every LLM call in an agent pipeline requires the same capability. Classification and routing can use a fast, cheap model. Complex reasoning requires a capable, expensive model. Using the right model for each task can reduce costs by 60-80%.
# Model tiering strategy for agent pipelines
from enum import Enum
class ModelTier(Enum):
FAST = "fast" # Classification, routing, simple extraction
CAPABLE = "capable" # Reasoning, planning, complex tool use
PREMIUM = "premium" # Critical decisions, complex analysis
# Model mapping (adjust based on your provider)
MODEL_MAP = {
ModelTier.FAST: {
"name": "claude-3-5-haiku-20241022",
"cost_per_1m_input": 0.80,
"cost_per_1m_output": 4.00,
},
ModelTier.CAPABLE: {
"name": "claude-sonnet-4-20250514",
"cost_per_1m_input": 3.00,
"cost_per_1m_output": 15.00,
},
ModelTier.PREMIUM: {
"name": "claude-opus-4-20250918",
"cost_per_1m_input": 15.00,
"cost_per_1m_output": 75.00,
},
}
class TieredAgentExecutor:
    def __init__(self, llm_pool):
        # llm_pool: any client exposing async chat_completion(model, messages, max_tokens)
        self.pool = llm_pool
async def route_message(self, message: str, context: dict) -> str:
"""FAST tier: classify and route incoming messages."""
return await self.pool.chat_completion(
model=MODEL_MAP[ModelTier.FAST]["name"],
messages=[{
"role": "user",
"content": f"Classify this message into one of: "
f"billing, technical, account, escalation.\n"
f"Message: {message}\nCategory:"
}],
max_tokens=20,
)
    async def plan_actions(self, task: str, context: dict) -> str:
        """CAPABLE tier: create an execution plan as text."""
return await self.pool.chat_completion(
model=MODEL_MAP[ModelTier.CAPABLE]["name"],
messages=[{
"role": "system",
"content": "Create an action plan for the given task."
}, {
"role": "user",
"content": f"Task: {task}\nContext: {context}"
}],
max_tokens=1000,
)
async def critical_decision(self, decision: str, stakes: dict) -> dict:
"""PREMIUM tier: high-stakes decisions requiring maximum accuracy."""
return await self.pool.chat_completion(
model=MODEL_MAP[ModelTier.PREMIUM]["name"],
messages=[{
"role": "system",
"content": "You are making a high-stakes decision. "
"Reason carefully and explain your logic."
}, {
"role": "user",
"content": f"Decision: {decision}\nStakes: {stakes}"
}],
max_tokens=2000,
)
# Cost comparison per interaction:
# All-premium: ~$0.45/interaction
# All-capable: ~$0.09/interaction
# Tiered (70% fast, 25% capable, 5% premium): ~$0.04/interaction
# Savings: 91% vs all-premium, 56% vs all-capable
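The blended figure above can be reproduced with simple expected-value arithmetic. A sketch where the rates mirror MODEL_MAP but the per-call token counts and tier mix are assumptions chosen for illustration:

```python
def blended_cost(
    mix: dict[str, float],
    rates: dict[str, tuple[float, float]],  # (input, output) $ per 1M tokens
    input_tokens: int,
    output_tokens: int,
    calls: int,
) -> float:
    """Expected cost of one interaction given a tier mix (shares sum to 1)."""
    cost = 0.0
    for tier, share in mix.items():
        in_rate, out_rate = rates[tier]
        per_call = (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000
        cost += share * per_call * calls
    return cost

RATES = {"fast": (0.80, 4.00), "capable": (3.00, 15.00), "premium": (15.00, 75.00)}
tiered = blended_cost(
    {"fast": 0.70, "capable": 0.25, "premium": 0.05},
    RATES, input_tokens=2000, output_tokens=200, calls=8,
)
# With these assumed token counts, tiered lands near $0.05/interaction
```

The exact figure depends heavily on the assumed tokens per call, so treat the ~$0.04 number as an order-of-magnitude estimate rather than a constant.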
Strategy 5: Response Streaming and Early Termination
Streaming responses reduce perceived latency and enable early termination when the model starts generating irrelevant content. This saves both output tokens and user wait time.
Implement a streaming monitor that watches for quality signals:
- If the model starts repeating itself, stop generation
- If the model produces a complete tool call, stop waiting for more text
- If the model produces a complete answer before reaching max tokens, the streaming endpoint closes naturally
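The first check above can be implemented with a simple rules-based monitor on the accumulated stream. A sketch; the trailing-window repetition heuristic and its thresholds are assumptions, and a real monitor would also watch for stop sequences and complete tool-call blocks:

```python
def should_stop_stream(text: str, window: int = 40, repeats: int = 3) -> bool:
    """Return True when the last `window` characters occur `repeats` or more
    times in the tail of the output, i.e. the model is looping."""
    if len(text) < window * repeats:
        return False
    tail = text[-window:]
    return text[-window * repeats:].count(tail) >= repeats
```

Call this after each streamed chunk on the accumulated text; when it returns True, cancel the stream and keep the output generated so far.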
Combined with the other strategies, streaming and early termination typically save 10-15% of output tokens.
Putting It All Together: Cost Impact Analysis
For a system processing 10,000 agent interactions per day with an average of 8 LLM calls per interaction:
| Strategy | Token Savings | Cost Reduction |
|---|---|---|
| Compact prompts | 30-50% of system tokens | 15-20% total |
| Tool summarization | 60-80% of tool tokens | 20-30% total |
| Selective context | 40-60% of history tokens | 15-25% total |
| Model tiering | N/A (model cost reduction) | 50-70% total |
| Streaming + early stop | 10-15% of output tokens | 5-10% total |
Applied together, these strategies can reduce total LLM costs by 70-85% compared to a naive implementation. For a system that would cost $5,000 per day without optimization, this brings the cost down to $750-1,500 per day.
FAQ
Do token optimization strategies degrade agent quality?
When applied carefully, no. The key is to optimize information density, not reduce information. A summarized tool result that contains all relevant fields is just as useful to the model as the full JSON response. A compact system prompt that covers the same rules is just as effective as a verbose one. The risk comes from over-aggressive summarization that drops critical context. Always evaluate agent quality metrics after applying optimizations.
How do you measure token efficiency?
Track three metrics: tokens per interaction (total tokens consumed for a complete agent interaction), cost per successful resolution (total cost divided by the number of interactions that achieved the user's goal), and quality-adjusted cost (cost weighted by customer satisfaction score). The third metric prevents optimizing cost at the expense of quality.
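A minimal tracker for these three metrics might look like the following sketch; the field names and the 0-1 satisfaction scale are assumptions:

```python
from dataclasses import dataclass, field

@dataclass
class EfficiencyTracker:
    # Each record: (tokens, cost, resolved, csat in [0, 1])
    interactions: list = field(default_factory=list)

    def record(self, tokens: int, cost: float, resolved: bool, csat: float):
        self.interactions.append((tokens, cost, resolved, csat))

    def tokens_per_interaction(self) -> float:
        return sum(t for t, *_ in self.interactions) / len(self.interactions)

    def cost_per_resolution(self) -> float:
        total_cost = sum(c for _, c, *_ in self.interactions)
        resolved = sum(1 for *_, r, _ in self.interactions if r)
        return total_cost / resolved if resolved else float("inf")

    def quality_adjusted_cost(self) -> float:
        # Dividing by mean satisfaction inflates the effective cost when CSAT drops
        total_cost = sum(c for _, c, *_ in self.interactions)
        mean_csat = sum(s for *_, s in self.interactions) / len(self.interactions)
        return total_cost / max(mean_csat, 1e-9)
```

A cheap change that tanks `quality_adjusted_cost` while lowering raw cost is a regression, not a win.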
Is prompt caching compatible with dynamic system prompts?
Prompt caching works best with static prefixes. If your system prompt changes between calls (e.g., injecting current user data), the dynamic portion will not be cached. The solution is to structure your prompts with the static portion first (agent role, rules, tool descriptions) and dynamic data second (current user context, conversation history). The static prefix gets cached even if the dynamic suffix changes.
When should I use a smaller model versus context truncation?
Use a smaller model when the task is inherently simple (classification, extraction, formatting) regardless of context length. Use context truncation when the task is complex but the model does not need all available context. If the task is complex and requires extensive context, use the capable model with full context and accept the higher cost. The worst outcome is using a small model on a complex task where it fails and requires a retry on the expensive model, doubling your cost.
Written by
CallSphere Team
Expert insights on AI voice agents and customer communication automation.