Agentic AI Context Optimization: Managing Million-Token Agent Conversations
Optimize million-token context windows for agentic AI with summarization, compression, sliding windows, and hierarchical context injection.
The Context Window Is Your Agent's Working Memory
Every piece of information in the context window competes for the model's attention. System prompts, conversation history, tool definitions, tool results, retrieved documents — they all consume tokens and influence the model's behavior. As agent conversations grow longer and tools return large payloads, context management becomes a critical engineering challenge.
Modern models offer large context windows — Claude supports up to 200K tokens, Gemini supports up to 1M tokens, and GPT-4o supports 128K tokens. But larger windows do not solve the problem. Research consistently shows that model performance degrades on information placed in the middle of long contexts (the "lost in the middle" effect). Throwing everything into the context is not a strategy — it is an anti-pattern.
Effective context management means putting the right information in the right place at the right time, and aggressively removing information that is no longer relevant.
Conversation Summarization
Long-running agent conversations accumulate history that is no longer directly relevant. A customer support session that started with account verification twenty turns ago does not need those verification turns in full detail — a summary suffices.
Rolling Summarization
After every N turns (typically 5-10), summarize the oldest unsummarized turns and replace them with the summary. This keeps the full context within a budget while preserving the key information.
class ConversationSummarizer:
    def __init__(self, llm_client, max_full_turns: int = 10):
        self.llm = llm_client
        self.max_full_turns = max_full_turns
        self.summaries: list[str] = []
        self.full_turns: list[dict] = []

    async def add_turn(self, role: str, content: str):
        self.full_turns.append({"role": role, "content": content})
        if len(self.full_turns) > self.max_full_turns:
            # Summarize the oldest turns and drop them from the full history
            turns_to_summarize = self.full_turns[:5]
            summary = await self._summarize_turns(turns_to_summarize)
            self.summaries.append(summary)
            self.full_turns = self.full_turns[5:]

    async def _summarize_turns(self, turns: list[dict]) -> str:
        turn_text = "\n".join(
            f"{t['role']}: {t['content']}" for t in turns
        )
        response = await self.llm.chat(
            system="Summarize this conversation segment concisely. "
                   "Preserve key decisions, facts, and action items. "
                   "Omit pleasantries and redundant confirmations.",
            messages=[{"role": "user", "content": turn_text}],
        )
        return response

    def build_context(self) -> list[dict]:
        context = []
        if self.summaries:
            summary_block = "\n\n".join(self.summaries)
            context.append({
                "role": "system",
                "content": f"Previous conversation summary:\n{summary_block}",
            })
        context.extend(self.full_turns)
        return context
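The rolling mechanic can be seen in miniature with plain strings — a placeholder string stands in for the actual LLM summary call, and the `max_full`/`chunk` values are illustrative:

```python
def roll_up(turns: list[str], max_full: int = 4, chunk: int = 2):
    """Fold the oldest `chunk` turns into a summary once `max_full` is exceeded."""
    summaries: list[str] = []
    history: list[str] = []
    for turn in turns:
        history.append(turn)
        if len(history) > max_full:
            oldest = history[:chunk]
            # Placeholder: a real implementation would call the LLM here
            summaries.append(f"[summary of {len(oldest)} turns]")
            history = history[chunk:]
    return summaries, history

summaries, history = roll_up([f"turn {i}" for i in range(7)])
print(len(summaries), len(history))  # 2 3
```

Each roll-up trades two full turns for one short summary line, so the context grows sublinearly with conversation length.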
Importance-Based Retention
Not all turns are equal. Turns where the user provided key information (account number, problem description, preferences) or where the agent made important decisions should be retained in full, while routine exchanges can be summarized more aggressively.
class ImportanceScorer:
    HIGH_IMPORTANCE_SIGNALS = [
        "account", "order", "booking", "confirmed", "agreed",
        "decided", "problem is", "issue is", "error",
    ]

    def score_turn(self, turn: dict) -> float:
        content_lower = turn["content"].lower()
        score = 0.5  # Base score

        # Tool calls are always important
        if turn.get("tool_calls"):
            score += 0.3

        # Key information signals
        for signal in self.HIGH_IMPORTANCE_SIGNALS:
            if signal in content_lower:
                score += 0.1

        # Long turns tend to contain more information
        word_count = len(turn["content"].split())
        if word_count > 100:
            score += 0.1

        return min(score, 1.0)
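A sketch of how scores drive retention — turns above a threshold stay in full, the rest become candidates for summarization. The 0.6 threshold and the trimmed signal list are arbitrary choices for illustration:

```python
def partition_turns(turns: list[dict], score_fn, threshold: float = 0.6):
    """Split turns into those kept in full and those eligible for summarization."""
    keep, summarize = [], []
    for turn in turns:
        (keep if score_fn(turn) >= threshold else summarize).append(turn)
    return keep, summarize

def simple_score(turn: dict) -> float:
    # Minimal stand-in scorer: keyword hits add weight to a 0.5 base
    signals = ["account", "order", "error", "decided"]
    score = 0.5 + 0.1 * sum(s in turn["content"].lower() for s in signals)
    if turn.get("tool_calls"):
        score += 0.3
    return min(score, 1.0)

turns = [
    {"role": "user", "content": "Hi there!"},
    {"role": "user", "content": "My order 4412 shows an error at checkout."},
]
keep, summarize = partition_turns(turns, simple_score)
print(len(keep), len(summarize))  # 1 1 -- the order/error turn is kept in full
```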
Sliding Window Techniques
For agents that process streams of data (monitoring agents, chat agents handling rapid-fire messages), a sliding window ensures the context stays current without growing unbounded.
Token-Budget Sliding Window
Instead of a fixed number of turns, define a token budget for conversation history and drop the oldest turns when the budget is exceeded.
import tiktoken

class TokenBudgetWindow:
    def __init__(self, token_budget: int = 50000, model: str = "gpt-4o"):
        self.token_budget = token_budget
        self.encoder = tiktoken.encoding_for_model(model)
        self.turns: list[dict] = []

    def count_tokens(self, text: str) -> int:
        return len(self.encoder.encode(text))

    def add_turn(self, turn: dict):
        self.turns.append(turn)
        self._enforce_budget()

    def _enforce_budget(self):
        total = sum(
            self.count_tokens(t["content"]) for t in self.turns
        )
        # Drop oldest turns first, but always keep the most recent turn
        while total > self.token_budget and len(self.turns) > 1:
            removed = self.turns.pop(0)
            total -= self.count_tokens(removed["content"])

    def get_turns(self) -> list[dict]:
        return self.turns
Context Compression
Sometimes you need all the information in the context but in a more compact form. Context compression techniques reduce token count while preserving information density.
Tool Result Compression
Tool results are often the largest context consumers. A database query might return 50 rows when the agent only needs 3. A web search might return full page content when the agent only needs key paragraphs.
class ToolResultCompressor:
    def __init__(self, llm_client, model: str = "gpt-4o"):
        self.llm = llm_client
        self.encoder = tiktoken.encoding_for_model(model)

    def count_tokens(self, text: str) -> int:
        return len(self.encoder.encode(text))

    async def compress_tool_result(
        self,
        tool_name: str,
        raw_result: str,
        user_query: str,
        max_tokens: int = 500,
    ) -> str:
        # Small results pass through untouched
        if self.count_tokens(raw_result) <= max_tokens:
            return raw_result
        compressed = await self.llm.chat(
            system=(
                f"Compress the following {tool_name} result to under "
                f"{max_tokens} tokens. Preserve all information relevant "
                f"to the user's query. Remove redundant or irrelevant data."
            ),
            messages=[
                {
                    "role": "user",
                    "content": f"User query: {user_query}\n\n"
                               f"Tool result:\n{raw_result}",
                }
            ],
        )
        return compressed
Structured Data Summarization
When tools return tabular data, convert it to a narrative summary rather than including the raw table.
def summarize_table_result(
    rows: list[dict],
    query_context: str,
) -> str:
    if len(rows) <= 5:
        # Small result set: include as-is
        # (format_as_table is assumed to exist elsewhere in the codebase)
        return format_as_table(rows)

    # Summarize large result sets
    summary_parts = [
        f"Query returned {len(rows)} results.",
        "Key statistics:",
    ]

    # Add relevant aggregations based on data types
    numeric_cols = [k for k, v in rows[0].items() if isinstance(v, (int, float))]
    for col in numeric_cols:
        values = [r[col] for r in rows if r.get(col) is not None]
        if values:
            summary_parts.append(
                f"  - {col}: min={min(values)}, max={max(values)}, "
                f"avg={sum(values)/len(values):.1f}"
            )

    # Include top 5 results
    summary_parts.append("\nTop 5 results:")
    for row in rows[:5]:
        summary_parts.append(f"  {row}")
    return "\n".join(summary_parts)
Selective Memory Injection
Not all agent memory should be in the context at all times. Selective injection loads relevant memories on demand based on the current conversation turn.
Relevance-Based Memory Loading
class SelectiveMemory:
    def __init__(self, vector_store, max_memory_tokens: int = 2000):
        self.vector_store = vector_store
        self.max_memory_tokens = max_memory_tokens

    async def get_relevant_memories(
        self,
        current_message: str,
        session_id: str,
    ) -> str:
        # generate_embedding and count_tokens are helpers assumed to exist
        # elsewhere (embedding API call and tokenizer wrapper)
        embedding = await generate_embedding(current_message)
        memories = await self.vector_store.query(
            vector=embedding,
            top_k=10,
            filter={"session_id": session_id},
        )
        # Select memories that fit within the token budget, best match first
        selected = []
        token_count = 0
        for memory in memories.matches:
            memory_tokens = count_tokens(memory.metadata["content"])
            if token_count + memory_tokens > self.max_memory_tokens:
                break
            selected.append(memory.metadata["content"])
            token_count += memory_tokens
        if not selected:
            return ""
        return (
            "Relevant context from earlier in this session:\n"
            + "\n".join(selected)
        )
Hierarchical Context Structure
Organize the context window into layers with different update frequencies and priority levels.
The Context Hierarchy
- System layer (static): Agent identity, role, rules, capabilities — loaded once per session
- Session layer (slow-changing): User profile, session metadata, business rules — updated on session events
- Conversation layer (dynamic): Recent conversation history — updated every turn
- Retrieval layer (per-turn): RAG results, tool outputs — replaced each turn
- Instruction layer (static): Output format requirements, safety constraints — loaded once
class HierarchicalContext:
    def __init__(self, total_budget: int = 100000):
        self.budgets = {
            "system": int(total_budget * 0.15),
            "session": int(total_budget * 0.10),
            "conversation": int(total_budget * 0.40),
            "retrieval": int(total_budget * 0.25),
            "instruction": int(total_budget * 0.10),
        }
        self.layers: dict[str, str] = {}

    def set_layer(self, layer: str, content: str):
        # count_tokens and truncate_to_tokens are assumed tokenizer helpers
        tokens = count_tokens(content)
        if tokens > self.budgets[layer]:
            content = truncate_to_tokens(content, self.budgets[layer])
        self.layers[layer] = content

    def build_prompt(self) -> str:
        # Stable layers first, most dynamic content (recent turns) last
        ordered = ["system", "session", "instruction", "retrieval", "conversation"]
        parts = []
        for layer in ordered:
            if layer in self.layers and self.layers[layer]:
                parts.append(self.layers[layer])
        return "\n\n---\n\n".join(parts)
Token Budgeting Per Agent
Different agents need different context distributions. A customer support agent needs more conversation history budget (to maintain context across a long troubleshooting session) while a research agent needs more retrieval budget (to incorporate multiple sources). Define per-agent token budgets as configuration.
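A minimal sketch of per-agent budgets as configuration — the agent names and ratios here are illustrative, not prescriptive:

```python
AGENT_BUDGETS = {
    # Support agent: favor conversation history for long troubleshooting sessions
    "support": {"system": 0.15, "session": 0.10, "conversation": 0.50,
                "retrieval": 0.15, "instruction": 0.10},
    # Research agent: favor retrieval to incorporate multiple sources
    "research": {"system": 0.10, "session": 0.05, "conversation": 0.25,
                 "retrieval": 0.50, "instruction": 0.10},
}

def layer_budgets(agent: str, total_tokens: int) -> dict[str, int]:
    """Turn an agent's budget ratios into absolute per-layer token budgets."""
    ratios = AGENT_BUDGETS[agent]
    assert abs(sum(ratios.values()) - 1.0) < 1e-9, "ratios must sum to 1"
    return {layer: int(total_tokens * r) for layer, r in ratios.items()}

print(layer_budgets("research", 100_000)["retrieval"])  # 50000
```

Keeping the ratios in configuration rather than code means a new agent type only needs a new entry, not a new context-building implementation.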
Frequently Asked Questions
Does a larger context window mean better agent performance?
Not necessarily. Larger context windows allow more information to be included, but model attention degrades with length. The "lost in the middle" effect means information placed in the middle of long contexts is less likely to be used by the model. Strategic context management — putting the most relevant information at the beginning and end of the context — typically outperforms simply filling a large window with everything available.
How often should conversation history be summarized?
Summarize when the conversation history exceeds your token budget for that context layer. A common approach is to summarize every 5-10 turns, keeping the most recent turns in full detail and older turns as summaries. For high-stakes conversations (financial transactions, medical consultations), retain more turns in full to ensure no critical detail is lost in summarization.
What is the cost impact of large context windows?
LLM API pricing is typically per-token for both input and output. Because input pricing is linear in tokens, a 100K-token context costs roughly 20x more in input tokens per request than a 5K-token context. Context optimization directly reduces API costs. Aggressive summarization and compression can reduce context size by 60-80% without meaningful quality loss for most agent applications.
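The input-cost arithmetic is simple to sketch. The $3-per-million-token price below is hypothetical; actual prices vary by model and provider:

```python
PRICE_PER_MILLION_INPUT = 3.00  # hypothetical; check your provider's pricing

def input_cost(context_tokens: int, requests: int = 1) -> float:
    """Estimate input-token spend in dollars for a given context size."""
    return context_tokens * requests * PRICE_PER_MILLION_INPUT / 1_000_000

# Input cost scales linearly with context size
ratio = input_cost(100_000) / input_cost(5_000)
print(ratio)  # 20.0 -- a 100K context costs 20x a 5K context per request
```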
How do you handle tool results that exceed the token budget?
Three strategies: truncation (cut the result to fit, losing tail data), compression (use an LLM to summarize the result, preserving the most relevant information), and pagination (return a subset of results with a "get more" tool the agent can call if needed). Compression is generally preferred because it preserves relevance, but pagination works well for structured data like database query results.
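The pagination strategy can be sketched as a tool that returns one page plus a cursor, which the agent passes back to a "get more" call when it needs the rest (the function name and return shape are illustrative):

```python
def paginated_tool_result(rows: list[dict], cursor: int = 0,
                          page_size: int = 5) -> dict:
    """Return one page of results plus a cursor for a follow-up 'get more' call."""
    page = rows[cursor:cursor + page_size]
    next_cursor = cursor + page_size if cursor + page_size < len(rows) else None
    return {
        "rows": page,
        "total": len(rows),
        # The agent calls the tool again with next_cursor to fetch more
        "next_cursor": next_cursor,
    }

rows = [{"id": i} for i in range(12)]
first = paginated_tool_result(rows)
second = paginated_tool_result(rows, cursor=first["next_cursor"])
print(len(first["rows"]), second["next_cursor"])  # 5 10
```

Only the pages the agent actually requests enter the context, so a 500-row result consumes tokens proportional to the agent's curiosity rather than the table's size.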
Should each agent in a multi-agent system have its own context window?
Yes. Each agent should maintain its own context optimized for its role. A triage agent needs minimal context (just the current request). A specialist agent needs rich domain context. A supervisor agent needs summaries from subordinate agents. Sharing a single context across all agents leads to bloat and confusion.
CallSphere Team
Expert insights on AI voice agents and customer communication automation.