Agentic AI Context Optimization: Managing Million-Token Agent Conversations
Optimize million-token context windows for agentic AI with summarization, compression, sliding windows, and hierarchical context injection.
The Context Window Is Your Agent's Working Memory
Every piece of information in the context window competes for the model's attention. System prompts, conversation history, tool definitions, tool results, retrieved documents — they all consume tokens and influence the model's behavior. As agent conversations grow longer and tools return large payloads, context management becomes a critical engineering challenge.
Modern models offer large context windows — Claude supports up to 200K tokens, Gemini supports up to 1M tokens, and GPT-4o supports 128K tokens. But larger windows do not solve the problem. Research consistently shows that model performance degrades on information placed in the middle of long contexts (the "lost in the middle" effect). Throwing everything into the context is not a strategy — it is an anti-pattern.
Effective context management means putting the right information in the right place at the right time, and aggressively removing information that is no longer relevant.
Conversation Summarization
Long-running agent conversations accumulate history that is no longer directly relevant. A customer support session that started with account verification twenty turns ago does not need those verification turns in full detail — a summary suffices.
Rolling Summarization
After every N turns (typically 5-10), summarize the oldest unsummarized turns and replace them with the summary. This keeps the full context within a budget while preserving the key information.
class ConversationSummarizer:
    def __init__(self, llm_client, max_full_turns: int = 10):
        self.llm = llm_client
        self.max_full_turns = max_full_turns
        self.summaries: list[str] = []
        self.full_turns: list[dict] = []

    async def add_turn(self, role: str, content: str):
        self.full_turns.append({"role": role, "content": content})
        if len(self.full_turns) > self.max_full_turns:
            # Summarize the oldest turns and drop them from the full history
            turns_to_summarize = self.full_turns[:5]
            summary = await self._summarize_turns(turns_to_summarize)
            self.summaries.append(summary)
            self.full_turns = self.full_turns[5:]

    async def _summarize_turns(self, turns: list[dict]) -> str:
        turn_text = "\n".join(
            f"{t['role']}: {t['content']}" for t in turns
        )
        response = await self.llm.chat(
            system="Summarize this conversation segment concisely. "
                   "Preserve key decisions, facts, and action items. "
                   "Omit pleasantries and redundant confirmations.",
            messages=[{"role": "user", "content": turn_text}],
        )
        return response

    def build_context(self) -> list[dict]:
        context = []
        if self.summaries:
            summary_block = "\n\n".join(self.summaries)
            context.append({
                "role": "system",
                "content": f"Previous conversation summary:\n{summary_block}",
            })
        context.extend(self.full_turns)
        return context
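The rolling mechanic can be seen in miniature with plain strings — a placeholder string stands in for the actual LLM summary call, and the `max_full`/`chunk` values are illustrative:

```python
def roll_up(turns: list[str], max_full: int = 4, chunk: int = 2):
    """Fold the oldest `chunk` turns into a summary once `max_full` is exceeded."""
    summaries: list[str] = []
    history: list[str] = []
    for turn in turns:
        history.append(turn)
        if len(history) > max_full:
            oldest = history[:chunk]
            # Placeholder: a real implementation would call the LLM here
            summaries.append(f"[summary of {len(oldest)} turns]")
            history = history[chunk:]
    return summaries, history

summaries, history = roll_up([f"turn {i}" for i in range(7)])
print(len(summaries), len(history))  # 2 3
```

Each roll-up trades two full turns for one short summary line, so the context grows sublinearly with conversation length.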
Importance-Based Retention
Not all turns are equal. Turns where the user provided key information (account number, problem description, preferences) or where the agent made important decisions should be retained in full, while routine exchanges can be summarized more aggressively.
class ImportanceScorer:
    HIGH_IMPORTANCE_SIGNALS = [
        "account", "order", "booking", "confirmed", "agreed",
        "decided", "problem is", "issue is", "error",
    ]

    def score_turn(self, turn: dict) -> float:
        content_lower = turn["content"].lower()
        score = 0.5  # Base score

        # Tool calls are always important
        if turn.get("tool_calls"):
            score += 0.3

        # Key information signals
        for signal in self.HIGH_IMPORTANCE_SIGNALS:
            if signal in content_lower:
                score += 0.1

        # Long turns tend to contain more information
        word_count = len(turn["content"].split())
        if word_count > 100:
            score += 0.1

        return min(score, 1.0)
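A sketch of how scores drive retention — turns above a threshold stay in full, the rest become candidates for summarization. The 0.6 threshold and the trimmed signal list are arbitrary choices for illustration:

```python
def partition_turns(turns: list[dict], score_fn, threshold: float = 0.6):
    """Split turns into those kept in full and those eligible for summarization."""
    keep, summarize = [], []
    for turn in turns:
        (keep if score_fn(turn) >= threshold else summarize).append(turn)
    return keep, summarize

def simple_score(turn: dict) -> float:
    # Minimal stand-in scorer: keyword hits add weight to a 0.5 base
    signals = ["account", "order", "error", "decided"]
    score = 0.5 + 0.1 * sum(s in turn["content"].lower() for s in signals)
    if turn.get("tool_calls"):
        score += 0.3
    return min(score, 1.0)

turns = [
    {"role": "user", "content": "Hi there!"},
    {"role": "user", "content": "My order 4412 shows an error at checkout."},
]
keep, summarize = partition_turns(turns, simple_score)
print(len(keep), len(summarize))  # 1 1 -- the order/error turn is kept in full
```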
Sliding Window Techniques
For agents that process streams of data (monitoring agents, chat agents handling rapid-fire messages), a sliding window ensures the context stays current without growing unbounded.
Token-Budget Sliding Window
Instead of a fixed number of turns, define a token budget for conversation history and drop the oldest turns when the budget is exceeded.
import tiktoken

class TokenBudgetWindow:
    def __init__(self, token_budget: int = 50000, model: str = "gpt-4o"):
        self.token_budget = token_budget
        self.encoder = tiktoken.encoding_for_model(model)
        self.turns: list[dict] = []

    def count_tokens(self, text: str) -> int:
        return len(self.encoder.encode(text))

    def add_turn(self, turn: dict):
        self.turns.append(turn)
        self._enforce_budget()

    def _enforce_budget(self):
        total = sum(
            self.count_tokens(t["content"]) for t in self.turns
        )
        # Drop oldest turns first, but always keep the most recent turn
        while total > self.token_budget and len(self.turns) > 1:
            removed = self.turns.pop(0)
            total -= self.count_tokens(removed["content"])

    def get_turns(self) -> list[dict]:
        return self.turns
Context Compression
Sometimes you need all the information in the context but in a more compact form. Context compression techniques reduce token count while preserving information density.
Tool Result Compression
Tool results are often the largest context consumers. A database query might return 50 rows when the agent only needs 3. A web search might return full page content when the agent only needs key paragraphs.
class ToolResultCompressor:
    def __init__(self, llm_client, model: str = "gpt-4o"):
        self.llm = llm_client
        self.encoder = tiktoken.encoding_for_model(model)

    def count_tokens(self, text: str) -> int:
        return len(self.encoder.encode(text))

    async def compress_tool_result(
        self,
        tool_name: str,
        raw_result: str,
        user_query: str,
        max_tokens: int = 500,
    ) -> str:
        # Small results pass through untouched
        if self.count_tokens(raw_result) <= max_tokens:
            return raw_result
        compressed = await self.llm.chat(
            system=(
                f"Compress the following {tool_name} result to under "
                f"{max_tokens} tokens. Preserve all information relevant "
                f"to the user's query. Remove redundant or irrelevant data."
            ),
            messages=[
                {
                    "role": "user",
                    "content": f"User query: {user_query}\n\n"
                               f"Tool result:\n{raw_result}",
                }
            ],
        )
        return compressed
Structured Data Summarization
When tools return tabular data, convert it to a narrative summary rather than including the raw table.
def summarize_table_result(
    rows: list[dict],
    query_context: str,
) -> str:
    if len(rows) <= 5:
        # Small result set: include as-is
        # (format_as_table is assumed to exist elsewhere in the codebase)
        return format_as_table(rows)

    # Summarize large result sets
    summary_parts = [
        f"Query returned {len(rows)} results.",
        "Key statistics:",
    ]

    # Add relevant aggregations based on data types
    numeric_cols = [k for k, v in rows[0].items() if isinstance(v, (int, float))]
    for col in numeric_cols:
        values = [r[col] for r in rows if r.get(col) is not None]
        if values:
            summary_parts.append(
                f"  - {col}: min={min(values)}, max={max(values)}, "
                f"avg={sum(values)/len(values):.1f}"
            )

    # Include top 5 results
    summary_parts.append("\nTop 5 results:")
    for row in rows[:5]:
        summary_parts.append(f"  {row}")
    return "\n".join(summary_parts)
Selective Memory Injection
Not all agent memory should be in the context at all times. Selective injection loads relevant memories on demand based on the current conversation turn.
Relevance-Based Memory Loading
class SelectiveMemory:
    def __init__(self, vector_store, max_memory_tokens: int = 2000):
        self.vector_store = vector_store
        self.max_memory_tokens = max_memory_tokens

    async def get_relevant_memories(
        self,
        current_message: str,
        session_id: str,
    ) -> str:
        # generate_embedding and count_tokens are helpers assumed to exist
        # elsewhere (embedding API call and tokenizer wrapper)
        embedding = await generate_embedding(current_message)
        memories = await self.vector_store.query(
            vector=embedding,
            top_k=10,
            filter={"session_id": session_id},
        )
        # Select memories that fit within the token budget, best match first
        selected = []
        token_count = 0
        for memory in memories.matches:
            memory_tokens = count_tokens(memory.metadata["content"])
            if token_count + memory_tokens > self.max_memory_tokens:
                break
            selected.append(memory.metadata["content"])
            token_count += memory_tokens
        if not selected:
            return ""
        return (
            "Relevant context from earlier in this session:\n"
            + "\n".join(selected)
        )
Hierarchical Context Structure
Organize the context window into layers with different update frequencies and priority levels.
The Context Hierarchy
- System layer (static): Agent identity, role, rules, capabilities — loaded once per session
- Session layer (slow-changing): User profile, session metadata, business rules — updated on session events
- Conversation layer (dynamic): Recent conversation history — updated every turn
- Retrieval layer (per-turn): RAG results, tool outputs — replaced each turn
- Instruction layer (static): Output format requirements, safety constraints — loaded once
class HierarchicalContext:
    def __init__(self, total_budget: int = 100000):
        self.budgets = {
            "system": int(total_budget * 0.15),
            "session": int(total_budget * 0.10),
            "conversation": int(total_budget * 0.40),
            "retrieval": int(total_budget * 0.25),
            "instruction": int(total_budget * 0.10),
        }
        self.layers: dict[str, str] = {}

    def set_layer(self, layer: str, content: str):
        # count_tokens and truncate_to_tokens are assumed tokenizer helpers
        tokens = count_tokens(content)
        if tokens > self.budgets[layer]:
            content = truncate_to_tokens(content, self.budgets[layer])
        self.layers[layer] = content

    def build_prompt(self) -> str:
        # Stable layers first, most dynamic content (recent turns) last
        ordered = ["system", "session", "instruction", "retrieval", "conversation"]
        parts = []
        for layer in ordered:
            if layer in self.layers and self.layers[layer]:
                parts.append(self.layers[layer])
        return "\n\n---\n\n".join(parts)
Token Budgeting Per Agent
Different agents need different context distributions. A customer support agent needs more conversation history budget (to maintain context across a long troubleshooting session) while a research agent needs more retrieval budget (to incorporate multiple sources). Define per-agent token budgets as configuration.
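A minimal sketch of per-agent budgets as configuration — the agent names and ratios here are illustrative, not prescriptive:

```python
AGENT_BUDGETS = {
    # Support agent: favor conversation history for long troubleshooting sessions
    "support": {"system": 0.15, "session": 0.10, "conversation": 0.50,
                "retrieval": 0.15, "instruction": 0.10},
    # Research agent: favor retrieval to incorporate multiple sources
    "research": {"system": 0.10, "session": 0.05, "conversation": 0.25,
                 "retrieval": 0.50, "instruction": 0.10},
}

def layer_budgets(agent: str, total_tokens: int) -> dict[str, int]:
    """Turn an agent's budget ratios into absolute per-layer token budgets."""
    ratios = AGENT_BUDGETS[agent]
    assert abs(sum(ratios.values()) - 1.0) < 1e-9, "ratios must sum to 1"
    return {layer: int(total_tokens * r) for layer, r in ratios.items()}

print(layer_budgets("research", 100_000)["retrieval"])  # 50000
```

Keeping the ratios in configuration rather than code means a new agent type only needs a new entry, not a new context-building implementation.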
Frequently Asked Questions
Does a larger context window mean better agent performance?
Not necessarily. Larger context windows allow more information to be included, but model attention degrades with length. The "lost in the middle" effect means information placed in the middle of long contexts is less likely to be used by the model. Strategic context management — putting the most relevant information at the beginning and end of the context — typically outperforms simply filling a large window with everything available.
How often should conversation history be summarized?
Summarize when the conversation history exceeds your token budget for that context layer. A common approach is to summarize every 5-10 turns, keeping the most recent turns in full detail and older turns as summaries. For high-stakes conversations (financial transactions, medical consultations), retain more turns in full to ensure no critical detail is lost in summarization.
What is the cost impact of large context windows?
LLM API pricing is typically per-token for both input and output. Because input pricing is linear in tokens, a 100K-token context costs roughly 20x more in input tokens per request than a 5K-token context. Context optimization directly reduces API costs. Aggressive summarization and compression can reduce context size by 60-80% without meaningful quality loss for most agent applications.
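The input-cost arithmetic is simple to sketch. The $3-per-million-token price below is hypothetical; actual prices vary by model and provider:

```python
PRICE_PER_MILLION_INPUT = 3.00  # hypothetical; check your provider's pricing

def input_cost(context_tokens: int, requests: int = 1) -> float:
    """Estimate input-token spend in dollars for a given context size."""
    return context_tokens * requests * PRICE_PER_MILLION_INPUT / 1_000_000

# Input cost scales linearly with context size
ratio = input_cost(100_000) / input_cost(5_000)
print(ratio)  # 20.0 -- a 100K context costs 20x a 5K context per request
```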
How do you handle tool results that exceed the token budget?
Three strategies: truncation (cut the result to fit, losing tail data), compression (use an LLM to summarize the result, preserving the most relevant information), and pagination (return a subset of results with a "get more" tool the agent can call if needed). Compression is generally preferred because it preserves relevance, but pagination works well for structured data like database query results.
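The pagination strategy can be sketched as a tool that returns one page plus a cursor, which the agent passes back to a "get more" call when it needs the rest (the function name and return shape are illustrative):

```python
def paginated_tool_result(rows: list[dict], cursor: int = 0,
                          page_size: int = 5) -> dict:
    """Return one page of results plus a cursor for a follow-up 'get more' call."""
    page = rows[cursor:cursor + page_size]
    next_cursor = cursor + page_size if cursor + page_size < len(rows) else None
    return {
        "rows": page,
        "total": len(rows),
        # The agent calls the tool again with next_cursor to fetch more
        "next_cursor": next_cursor,
    }

rows = [{"id": i} for i in range(12)]
first = paginated_tool_result(rows)
second = paginated_tool_result(rows, cursor=first["next_cursor"])
print(len(first["rows"]), second["next_cursor"])  # 5 10
```

Only the pages the agent actually requests enter the context, so a 500-row result consumes tokens proportional to the agent's curiosity rather than the table's size.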
Should each agent in a multi-agent system have its own context window?
Yes. Each agent should maintain its own context optimized for its role. A triage agent needs minimal context (just the current request). A specialist agent needs rich domain context. A supervisor agent needs summaries from subordinate agents. Sharing a single context across all agents leads to bloat and confusion.
CallSphere Team
Expert insights on AI voice agents and customer communication automation.