
Token Optimization: Reducing LLM Input Size Without Losing Quality

Master prompt compression, context pruning, conversation summarization, and selective history techniques to cut LLM costs and latency while preserving response quality in your AI agents.

Why Token Count Is Your Primary Cost and Latency Driver

Every token sent to an LLM costs money and adds latency. Input tokens are billed per token (providers typically quote rates per million tokens), and the time the model spends processing your prompt scales roughly linearly with token count. A 4,000-token prompt processes noticeably faster than a 16,000-token prompt, and its input cost is 75% lower.

For AI agents that maintain conversation history, tool outputs, and system instructions, token counts grow rapidly. A 20-turn conversation with tool results can easily reach 30,000+ input tokens per completion call. Optimizing this is not premature — it is essential for production viability.
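To see how quickly history accumulates, a rough token estimate is often enough for planning. A minimal sketch using the common ~4 characters-per-token heuristic for English text (use a real tokenizer when exact counts matter for billing):

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate: ~4 characters per token for English text.
    Use a real tokenizer (e.g. tiktoken) when exact counts matter."""
    return max(1, len(text) // 4)

# A 20-turn conversation where each turn averages ~300 words adds up
# fast once tool outputs and system instructions are included.
turn = "user: " + "word " * 300
print(estimate_tokens(turn) * 20)
```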

Prompt Compression: Saying the Same Thing in Fewer Tokens

System prompts are sent with every request. Compressing them yields compounding savings. The key principle is to remove redundancy without removing information.

# BEFORE: 87 tokens
VERBOSE_PROMPT = """
You are a helpful customer service assistant for our company.
You should always be polite and professional in your responses.
When a customer asks a question, you should try to provide
a helpful and accurate answer. If you do not know the answer,
you should let the customer know that you will escalate their
question to a human agent who can help them.
"""

# AFTER: 34 tokens (61% reduction)
COMPRESSED_PROMPT = """You are a customer service assistant. Be polite and professional.
Answer accurately. If unsure, escalate to a human agent."""

Rules for prompt compression without quality loss: remove filler words ("try to", "should always"), eliminate repeated instructions, use imperative mood, and combine related sentences.
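Some of this cleanup can even be mechanized. A sketch of a filler-stripping pass, where the `FILLERS` list is purely illustrative and would need tuning for your own prompts:

```python
import re

# Hypothetical filler patterns; tune this list for your own prompts.
FILLERS = [r"\btry to\b", r"\byou should\b", r"\balways\b", r"\bin your responses\b"]

def strip_fillers(prompt: str) -> str:
    """Drop low-information filler phrases, then tidy the whitespace."""
    for pattern in FILLERS:
        prompt = re.sub(pattern, "", prompt, flags=re.IGNORECASE)
    return re.sub(r"\s+", " ", prompt).strip().replace(" .", ".")

print(strip_fillers("You should always be polite and professional in your responses."))
# -> "be polite and professional."
```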

Context Pruning: Keeping Only What Matters

Not every message in a conversation is relevant to the current turn. Context pruning removes or shortens messages that no longer contribute to the response.

from dataclasses import dataclass

@dataclass
class Message:
    role: str
    content: str
    turn_number: int
    token_count: int

class ContextPruner:
    def __init__(self, max_tokens: int = 8000):
        self.max_tokens = max_tokens

    def prune(self, messages: list[Message]) -> list[Message]:
        """Keep the system prompt and recent messages, then fit as many
        older messages as the token budget allows."""
        system_msgs = [m for m in messages if m.role == "system"]
        conversation = [m for m in messages if m.role != "system"]

        # Always keep the last 6 messages (3 turns)
        recent = conversation[-6:]
        older = conversation[:-6]

        # Calculate remaining token budget
        system_tokens = sum(m.token_count for m in system_msgs)
        recent_tokens = sum(m.token_count for m in recent)
        budget = self.max_tokens - system_tokens - recent_tokens

        # From older messages, keep only those within budget
        kept_older = []
        used = 0
        for msg in reversed(older):
            if used + msg.token_count <= budget:
                kept_older.insert(0, msg)
                used += msg.token_count
            else:
                break

        return system_msgs + kept_older + recent

This approach guarantees the most recent context is always preserved while gracefully dropping older messages when the budget is tight.
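The budget loop at the heart of the pruner can be exercised in isolation. A self-contained sketch with hypothetical token counts, walking older messages newest-first:

```python
def keep_within_budget(token_counts: list[int], budget: int) -> list[int]:
    """Walk older messages newest-first, keeping whole messages until
    the budget is exhausted. Mirrors the pruner's inner loop."""
    kept = []
    used = 0
    for count in reversed(token_counts):
        if used + count <= budget:
            kept.insert(0, count)
            used += count
        else:
            break
    return kept

# Older messages cost 400, 900, 300, 250 tokens; only 600 tokens spare.
print(keep_within_budget([400, 900, 300, 250], budget=600))  # [300, 250]
```

Note that the loop breaks at the first message that does not fit, so the kept messages are always a contiguous suffix — no gaps appear in the middle of the retained history.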


Conversation Summarization: Compressing History Into Summaries

When a conversation grows long, you can replace older messages with a summary that captures the essential information in far fewer tokens.

from openai import AsyncOpenAI

class ConversationSummarizer:
    def __init__(self, client: AsyncOpenAI):
        self.client = client

    async def summarize_window(self, messages: list[dict]) -> str:
        """Compress a window of messages into a concise summary."""
        formatted = "\n".join(
            f"{m['role']}: {m['content']}" for m in messages
        )
        response = await self.client.chat.completions.create(
            model="gpt-4o-mini",  # Use a cheap model for summarization
            messages=[
                {
                    "role": "system",
                    "content": "Summarize this conversation in 2-3 sentences. "
                    "Preserve key facts, decisions, and user preferences.",
                },
                {"role": "user", "content": formatted},
            ],
            max_tokens=150,
        )
        return response.choices[0].message.content

class SlidingWindowManager:
    def __init__(self, summarizer: ConversationSummarizer, window_size: int = 10):
        self.summarizer = summarizer
        self.window_size = window_size
        self.summary: str = ""
        self.messages: list[dict] = []

    async def add_and_compact(self, message: dict) -> list[dict]:
        self.messages.append(message)

        if len(self.messages) > self.window_size:
            # Summarize the oldest half
            split = len(self.messages) // 2
            to_summarize = self.messages[:split]
            self.messages = self.messages[split:]

            new_summary = await self.summarizer.summarize_window(to_summarize)
            self.summary = (
                f"{self.summary} {new_summary}".strip() if self.summary else new_summary
            )

        # Build the context for the LLM
        context = []
        if self.summary:
            context.append({
                "role": "system",
                "content": f"Conversation summary so far: {self.summary}",
            })
        context.extend(self.messages)
        return context

The cost of the summarization call (using a cheap model like gpt-4o-mini) is far less than sending the full history to an expensive model on every turn.
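The arithmetic behind that claim is easy to check. A sketch with illustrative per-token prices (the rates below are assumptions — check your provider's current rate card):

```python
# Illustrative prices only -- check your provider's current rate card.
EXPENSIVE_INPUT_PER_M = 2.50   # $ per 1M input tokens, frontier model
CHEAP_INPUT_PER_M = 0.15       # $ per 1M input tokens, small model

full_history_tokens = 20_000     # sent on every turn without compaction
summary_context_tokens = 2_400   # summary + recent window, per turn
summarize_call_tokens = 3_000    # one-off cost of the summarization call

per_turn_full = full_history_tokens / 1e6 * EXPENSIVE_INPUT_PER_M
per_turn_compact = summary_context_tokens / 1e6 * EXPENSIVE_INPUT_PER_M
summarize_cost = summarize_call_tokens / 1e6 * CHEAP_INPUT_PER_M

print(f"full history:  ${per_turn_full:.4f} per turn")
print(f"compacted:     ${per_turn_compact:.4f} per turn")
print(f"summarization: ${summarize_cost:.6f} one-off")
```

Under these assumed rates, the one-off summarization call costs a small fraction of a single uncompacted turn, and the savings repeat on every subsequent turn.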

Selective History: Including Only Relevant Turns

Instead of sending the entire conversation, you can use embedding similarity to select only the turns that are relevant to the current query.

import numpy as np

class SelectiveHistory:
    def __init__(self, embedder, top_k: int = 5):
        self.embedder = embedder
        self.top_k = top_k
        self.history: list[dict] = []
        self.embeddings: list[np.ndarray] = []

    async def add_turn(self, message: dict):
        self.history.append(message)
        embedding = await self.embedder.embed(message["content"])
        self.embeddings.append(embedding)

    async def get_relevant_context(self, query: str) -> list[dict]:
        if len(self.history) <= self.top_k:
            return self.history

        query_embedding = await self.embedder.embed(query)
        similarities = [
            np.dot(query_embedding, emb)
            / (np.linalg.norm(query_embedding) * np.linalg.norm(emb))
            for emb in self.embeddings
        ]

        # Always include the last 2 messages plus top-k most similar
        recent_indices = set(range(len(self.history) - 2, len(self.history)))
        top_indices = set(np.argsort(similarities)[-self.top_k:])
        selected = sorted(recent_indices | top_indices)

        return [self.history[i] for i in selected]
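The selection step above relies on cosine similarity between embedding vectors. A dependency-free sketch of the same computation, for readers who want it without NumPy:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means identical
    direction, 0.0 means orthogonal (unrelated)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0
```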

Truncating Tool Outputs

Tool outputs are often the largest token consumers. A database query result or API response can be thousands of tokens when only a few fields matter.

import json

def truncate_tool_output(output: str, max_tokens: int = 500) -> str:
    """Reduce tool output size while preserving structure."""
    char_limit = max_tokens * 4  # rough heuristic: ~4 characters per token
    try:
        data = json.loads(output)
        if isinstance(data, list) and len(data) > 5:
            truncated = data[:5]
            return json.dumps(truncated) + f"\n... ({len(data) - 5} more items)"
        compact = json.dumps(data, indent=None, separators=(",", ":"))
        if len(compact) > char_limit:
            return compact[:char_limit] + "... (truncated)"
        return compact
    except json.JSONDecodeError:
        # Plain text: truncate by character count
        if len(output) > char_limit:
            return output[:char_limit] + "... (truncated)"
        return output

FAQ

Does reducing tokens actually change the quality of LLM responses?

It depends on what you remove. Removing filler words, redundant instructions, and irrelevant old messages has minimal impact on quality. Removing recent context, key user preferences, or important facts will degrade responses. The techniques above specifically target low-information content.

When should I use summarization vs. pruning vs. selective history?

Use pruning when conversations are short-to-medium (under 30 turns) and you just need to stay within the context window. Use summarization for long-running sessions where old context still matters broadly. Use selective history when conversations cover many topics and only specific past turns are relevant to the current query.

How do I measure whether my token optimization is hurting quality?

Run A/B evaluations. Send the same set of test queries through both the full-context and optimized-context paths, then compare response quality using an LLM-as-judge or human reviewers. Track a metric like "answer correctness" alongside your token savings to find the optimal tradeoff.
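The evaluation loop itself is simple. A sketch where `judge` is a placeholder stub — in practice it would call an LLM-as-judge or route answers to human reviewers:

```python
def judge(query: str, answer: str) -> float:
    """Placeholder scorer returning 0.0-1.0 answer correctness.
    Replace with an LLM-as-judge call or human review in practice."""
    return 1.0 if query.split()[-1] in answer else 0.0

def evaluate(pairs: list[tuple[str, str]]) -> float:
    """Mean correctness over (query, answer) pairs."""
    return sum(judge(q, a) for q, a in pairs) / len(pairs)

full_ctx = [("reset my password", "To reset your password, ..."),
            ("cancel my order", "Your order is cancelled.")]
optimized = [("reset my password", "To reset your password, ..."),
             ("cancel my order", "I don't have that context.")]

# Compare the two scores against the token savings to pick a tradeoff.
print(evaluate(full_ctx), evaluate(optimized))
```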


#TokenOptimization #PromptEngineering #CostReduction #ContextManagement #Python #AgenticAI #LearnAI #AIEngineering

CallSphere Team

Expert insights on AI voice agents and customer communication automation.
