Context Window Management for AI Agents: Summarization, Pruning, and Sliding Window Strategies
Managing context in long-running AI agents: conversation summarization, selective pruning, sliding window approaches, and when to leverage 1M token context versus optimization strategies.
The Context Window Bottleneck
Every AI agent runs within the constraints of its model's context window — the maximum number of tokens the model can process in a single request. Even with models offering 200K to 1M token windows, context management matters because: (1) cost scales linearly with input tokens, (2) latency increases with context length, (3) model attention degrades on very long contexts ("lost in the middle" effect), and (4) many production tasks involve agents that run for hours or days, generating more context than any window can hold.
A customer service agent handling 50 calls per day with an average of 20 turns per call generates roughly 100,000 tokens of conversation history. A coding agent working on a large codebase might need to reference hundreds of files. A research agent exploring a topic might traverse dozens of web pages. Without active context management, these agents either crash against the token limit or degrade in quality as the context fills with noise.
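The arithmetic behind that first example is worth making explicit. A quick sketch — the ~100 tokens-per-turn figure is an assumption, not a measurement:

```python
# Back-of-envelope estimate of daily context growth for the customer
# service agent described above. TOKENS_PER_TURN is an assumed average.
CALLS_PER_DAY = 50
TURNS_PER_CALL = 20
TOKENS_PER_TURN = 100  # assumption: short conversational turns

daily_tokens = CALLS_PER_DAY * TURNS_PER_CALL * TOKENS_PER_TURN
print(daily_tokens)  # 100000 tokens of history per day
```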
Strategy 1: Conversation Summarization
The most common approach for long-running conversational agents is to periodically summarize older parts of the conversation, replacing verbose history with a compact summary that preserves key facts.
```python
from dataclasses import dataclass, field


@dataclass
class ConversationMemory:
    summary: str = ""
    recent_messages: list[dict] = field(default_factory=list)
    key_facts: list[str] = field(default_factory=list)
    total_messages_processed: int = 0


class SummarizationManager:
    """Manages context through periodic summarization."""

    def __init__(
        self,
        llm_client,
        max_recent_messages: int = 20,
        summarize_every: int = 10,
        max_summary_tokens: int = 500,
    ):
        self.llm = llm_client
        self.max_recent = max_recent_messages
        self.summarize_every = summarize_every
        self.max_summary_tokens = max_summary_tokens
        self.memory = ConversationMemory()

    async def add_message(self, message: dict):
        self.memory.recent_messages.append(message)
        self.memory.total_messages_processed += 1
        # Summarize in batches of `summarize_every` rather than one
        # message at a time, to avoid an LLM call per overflow message
        overflow = len(self.memory.recent_messages) - self.max_recent
        if overflow >= self.summarize_every:
            await self._summarize_oldest()

    async def _summarize_oldest(self):
        # Fold the oldest messages beyond the recent window into the summary
        to_summarize = self.memory.recent_messages[: -self.max_recent]
        self.memory.recent_messages = self.memory.recent_messages[
            -self.max_recent:
        ]
        conversation_text = "\n".join(
            f"{m['role']}: {m['content']}" for m in to_summarize
        )
        response = await self.llm.chat(
            messages=[{
                "role": "user",
                "content": (
                    f"Summarize this conversation segment, preserving "
                    f"key facts, decisions, and unresolved items. "
                    f"Be concise but complete.\n\n"
                    f"Previous summary: {self.memory.summary}\n\n"
                    f"New conversation to summarize:\n"
                    f"{conversation_text}"
                ),
            }],
            max_tokens=self.max_summary_tokens,
        )
        self.memory.summary = response.content
        # Extract key facts for quick reference
        facts = await self._extract_key_facts(to_summarize)
        self.memory.key_facts.extend(facts)
        # Keep only the most recent 20 key facts
        self.memory.key_facts = self.memory.key_facts[-20:]

    async def _extract_key_facts(self, messages: list[dict]) -> list[str]:
        conversation_text = "\n".join(
            f"{m['role']}: {m['content']}" for m in messages
        )
        response = await self.llm.chat(messages=[{
            "role": "user",
            "content": (
                f"Extract key facts from this conversation as a "
                f"bullet list. Include: names, numbers, decisions, "
                f"commitments, and unresolved questions.\n\n"
                f"{conversation_text}"
            ),
        }])
        return [
            line.strip().lstrip("- ")
            for line in response.content.split("\n")
            if line.strip().startswith("-")
        ]

    def build_context(self) -> list[dict]:
        """Build the message list to send to the LLM."""
        context = []
        if self.memory.summary:
            context.append({
                "role": "system",
                "content": (
                    f"CONVERSATION HISTORY SUMMARY:\n"
                    f"{self.memory.summary}\n\n"
                    f"KEY FACTS:\n"
                    + "\n".join(f"- {f}" for f in self.memory.key_facts)
                ),
            })
        context.extend(self.memory.recent_messages)
        return context
```
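To see the summarize-and-evict loop in action without a real LLM, here is a self-contained sketch of the same mechanism; `fake_summarize` is a stand-in for the actual summarization call:

```python
import asyncio

# Minimal, self-contained sketch of the summarize-and-evict loop above,
# with the LLM call replaced by a stub so it runs offline.

MAX_RECENT = 3

async def fake_summarize(prev_summary: str, evicted: list[dict]) -> str:
    # A real implementation would call the LLM; here we just count
    # how many turns have been folded into the running summary.
    prior = int(prev_summary.split()[0]) if prev_summary else 0
    return f"{prior + len(evicted)} earlier turns summarized"

async def add_message(state: dict, message: dict):
    state["recent"].append(message)
    if len(state["recent"]) > MAX_RECENT:
        # Evict everything beyond the recent window into the summary
        evicted = state["recent"][:-MAX_RECENT]
        state["recent"] = state["recent"][-MAX_RECENT:]
        state["summary"] = await fake_summarize(state["summary"], evicted)

async def main():
    state = {"summary": "", "recent": []}
    for i in range(8):
        await add_message(state, {"role": "user", "content": f"turn {i}"})
    return state

state = asyncio.run(main())
print(state["summary"])      # 5 earlier turns summarized
print(len(state["recent"]))  # 3
```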
Strategy 2: Selective Pruning
Summarization compresses everything equally. Selective pruning is smarter: it identifies which parts of the context are most relevant to the current task and drops the rest. This is particularly useful for coding agents that need to reference specific files.
```python
from dataclasses import dataclass


@dataclass
class ContextBlock:
    id: str
    content: str
    token_count: int
    relevance_score: float = 0.0
    category: str = "general"  # "code", "conversation", "tool_result"
    timestamp: float = 0.0
    pinned: bool = False  # pinned blocks are never pruned


class SelectivePruner:
    """Prunes context blocks based on relevance to the current task."""

    def __init__(
        self,
        llm_client,
        embeddings_client,
        max_tokens: int = 100_000,
        reserve_tokens: int = 4_000,  # reserved for the model's response
    ):
        self.llm = llm_client
        self.embeddings = embeddings_client
        self.max_tokens = max_tokens
        self.reserve = reserve_tokens
        self.blocks: list[ContextBlock] = []

    def add_block(self, block: ContextBlock):
        self.blocks.append(block)

    async def prune_for_query(self, query: str) -> list[ContextBlock]:
        available_tokens = self.max_tokens - self.reserve
        # Always include pinned blocks
        pinned = [b for b in self.blocks if b.pinned]
        pinned_tokens = sum(b.token_count for b in pinned)
        if pinned_tokens > available_tokens:
            raise ValueError("Pinned blocks alone exceed context limit")
        unpinned = [b for b in self.blocks if not b.pinned]
        # Score unpinned blocks by relevance
        scored = await self._score_relevance(query, unpinned)
        scored.sort(key=lambda b: b.relevance_score, reverse=True)
        # Greedily add blocks until we hit the token budget.
        # tokens_used already counts the pinned blocks, so compare
        # against the full available budget, not the remainder.
        selected = list(pinned)
        tokens_used = pinned_tokens
        for block in scored:
            if tokens_used + block.token_count <= available_tokens:
                selected.append(block)
                tokens_used += block.token_count
        # Restore chronological order for the model
        selected.sort(key=lambda b: b.timestamp)
        return selected

    async def _score_relevance(
        self, query: str, blocks: list[ContextBlock]
    ) -> list[ContextBlock]:
        if not blocks:
            return blocks
        query_embedding = await self.embeddings.embed(query)
        for block in blocks:
            block_embedding = await self.embeddings.embed(
                block.content[:500]  # embed only the first 500 chars
            )
            # Cosine similarity between query and block embeddings
            dot = sum(
                a * b for a, b in zip(query_embedding, block_embedding)
            )
            norm_q = sum(a ** 2 for a in query_embedding) ** 0.5
            norm_b = sum(b ** 2 for b in block_embedding) ** 0.5
            block.relevance_score = (
                dot / (norm_q * norm_b) if norm_q and norm_b else 0.0
            )
            # Small recency bonus so recent blocks win ties
            block.relevance_score += min(block.timestamp / 1e10, 0.1)
        return blocks
```
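The scoring-and-budgeting step can be demonstrated in isolation. The sketch below uses tiny hand-made vectors (block names and embeddings are hypothetical) in place of a real embeddings API:

```python
# Self-contained demo of relevance scoring plus greedy token budgeting.

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

# (name, embedding, token_count) for three hypothetical context blocks
blocks = [
    ("auth_module", [1.0, 0.0, 0.1], 400),
    ("billing_docs", [0.0, 1.0, 0.0], 500),
    ("auth_tests", [0.9, 0.1, 0.0], 300),
]
query_embedding = [1.0, 0.0, 0.0]  # the current query is about auth
budget = 750  # token budget after reserving response tokens

# Sort by similarity, then greedily fill the budget
scored = sorted(
    blocks, key=lambda blk: cosine(query_embedding, blk[1]), reverse=True
)
selected, used = [], 0
for name, _, tokens in scored:
    if used + tokens <= budget:
        selected.append(name)
        used += tokens

print(selected)  # ['auth_module', 'auth_tests']
```

The billing block is dropped: it is both irrelevant to the query and too large for the remaining budget.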
Strategy 3: Sliding Window with Memory Store
The sliding window approach maintains a fixed-size recent context window while persisting older information in an external memory store (database, vector store) that can be queried on demand.
```python
from dataclasses import dataclass, field


@dataclass
class MemoryEntry:
    id: str
    content: str
    embedding: list[float] = field(default_factory=list)
    metadata: dict = field(default_factory=dict)
    timestamp: float = 0.0


class SlidingWindowWithMemory:
    """Fixed-size context window backed by a queryable memory store."""

    def __init__(
        self,
        llm_client,
        embeddings_client,
        vector_store,
        window_size: int = 20,
        memory_retrieval_k: int = 5,
    ):
        self.llm = llm_client
        self.embeddings = embeddings_client
        self.store = vector_store
        self.window_size = window_size
        self.retrieval_k = memory_retrieval_k
        self.window: list[dict] = []
        self._message_counter = 0
        self._persisted = 0

    async def add_message(self, message: dict):
        self.window.append(message)
        self._message_counter += 1
        # When the window overflows, move the oldest message to memory
        while len(self.window) > self.window_size:
            oldest = self.window.pop(0)
            await self._persist_to_memory(oldest)

    async def _persist_to_memory(self, message: dict):
        content = message.get("content", "")
        embedding = await self.embeddings.embed(content)
        # Eviction is FIFO, so the persisted count is also the evicted
        # message's original position in the conversation.
        self._persisted += 1
        entry = MemoryEntry(
            id=f"msg_{self._persisted}",
            content=content,
            embedding=embedding,
            metadata={
                "role": message.get("role", "unknown"),
                "message_number": self._persisted,
            },
            timestamp=self._persisted,  # doubles as a logical timestamp
        )
        await self.store.upsert({
            "id": entry.id,
            "embedding": entry.embedding,
            "text": entry.content,
            "metadata": entry.metadata,
        })

    async def build_context(self, current_query: str) -> list[dict]:
        # Retrieve memories relevant to the current query
        query_embedding = await self.embeddings.embed(current_query)
        memories = await self.store.query(
            embedding=query_embedding,
            top_k=self.retrieval_k,
        )
        context = []
        # Surface retrieved memories as system context
        if memories:
            memory_text = "\n".join(
                f"[{m['metadata']['role']}] {m['text']}" for m in memories
            )
            context.append({
                "role": "system",
                "content": (
                    f"RELEVANT CONTEXT FROM EARLIER:\n{memory_text}"
                ),
            })
        # Then the current sliding window
        context.extend(self.window)
        return context
```
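The eviction path is easy to exercise with a plain dict standing in for the vector store — a minimal sketch with embedding omitted:

```python
# Self-contained demo of FIFO eviction from a fixed-size window into
# a dict-based "memory store" (a real store would also embed the text).

WINDOW_SIZE = 3
window: list[dict] = []
memory_store: dict[str, dict] = {}
persisted = 0

def add_message(message: dict):
    global persisted
    window.append(message)
    while len(window) > WINDOW_SIZE:
        oldest = window.pop(0)
        persisted += 1  # FIFO order, so this is the message's position
        memory_store[f"msg_{persisted}"] = oldest

for i in range(6):
    add_message({"role": "user", "content": f"turn {i}"})

print(len(window))           # 3 (turns 3, 4, 5 remain in the window)
print(sorted(memory_store))  # ['msg_1', 'msg_2', 'msg_3']
```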
When to Use 1M Context vs Optimization
Models with 1M token context windows (like Claude with extended context) change the calculus. But "can fit" does not mean "should fit."
Use the full 1M context when:
- The task genuinely requires cross-referencing information spread across a large corpus (entire codebase analysis, long document QA)
- Accuracy on distant context references is critical (legal document review, compliance checking)
- The cost of missing a detail outweighs the inference cost
- The task is latency-insensitive (batch processing, async analysis)
Optimize context even with 1M available when:
- The agent runs in a real-time conversational loop (latency matters)
- The task processes many requests (cost scales with volume)
- Most of the context is noise for any given query
- The agent runs for extended periods generating massive context
```python
class AdaptiveContextManager:
    """Automatically selects a context strategy based on the task."""

    def __init__(
        self,
        summarizer: SummarizationManager,
        pruner: SelectivePruner,
        sliding_window: SlidingWindowWithMemory,
        model_context_limit: int = 200_000,
    ):
        self.summarizer = summarizer
        self.pruner = pruner
        self.sliding = sliding_window
        self.limit = model_context_limit

    async def build_context(
        self,
        query: str,
        total_context_tokens: int,
        latency_sensitive: bool = True,
        accuracy_critical: bool = False,
    ) -> list[dict]:
        # Decision tree
        if total_context_tokens < self.limit * 0.3:
            # Under 30% of the limit: use everything
            return self.sliding.window
        if accuracy_critical and total_context_tokens < self.limit:
            # Accuracy-critical and it fits: use everything
            return self.sliding.window
        if latency_sensitive:
            # Real-time loop: prune to fast, relevant context
            blocks = await self.pruner.prune_for_query(query)
            return [
                {"role": "system", "content": b.content} for b in blocks
            ]
        # Default: summary of older history plus the recent window
        return self.summarizer.build_context()
```
Measuring Context Management Quality
How do you know if your context management strategy is working? Track these metrics:
- Recall rate: When the agent needs information from earlier in the conversation, how often does the context management system provide it? Test by asking the agent about facts from messages that have been summarized or pruned.
- Context utilization: What percentage of the context window is actively relevant to the current query? Low utilization means you are paying for tokens that do not help.
- Summary accuracy: Periodically compare summaries against the original messages. Do they preserve the key facts? Automated evaluation can score this.
- Latency impact: Measure the time difference between full-context and optimized-context requests. The optimization is only valuable if it saves meaningful latency.
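A recall-rate check can be automated even without an LLM judge. The sketch below uses a naive substring match as a stand-in for a real evaluation; `context_contains` and the probe facts are hypothetical:

```python
# Probe the assembled context with facts that were summarized or
# pruned, and score how many survived context management.

def context_contains(context: list[dict], fact: str) -> bool:
    # Naive substring check; a production evaluator would use an LLM
    return any(fact.lower() in m["content"].lower() for m in context)

def recall_rate(context: list[dict], probe_facts: list[str]) -> float:
    if not probe_facts:
        return 1.0
    hits = sum(context_contains(context, f) for f in probe_facts)
    return hits / len(probe_facts)

context = [
    {"role": "system",
     "content": "SUMMARY: customer Dana wants a refund of $120"},
    {"role": "user", "content": "what's the status?"},
]
score = recall_rate(context, ["Dana", "$120", "order #4512"])
print(score)  # 2 of 3 probe facts survived -> ~0.67
```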
FAQ
Does the "lost in the middle" problem affect all models equally?
No. The "lost in the middle" effect — where models attend less to information in the middle of long contexts compared to the beginning and end — varies significantly by model architecture and training. Models trained with long-context-specific objectives (like those using ALiBi positional encoding or trained on long documents) show less degradation. However, even the best models show some attention bias. For critical information, placing it near the beginning or end of the context (or repeating it) is a practical mitigation.
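The "beginning or end" mitigation can be as simple as an ordering helper that repeats critical blocks at both high-attention positions — a hypothetical sketch:

```python
# Place critical blocks at the start of the context and repeat them at
# the end, where long-context models attend most reliably.
def order_for_attention(critical: list[str], rest: list[str]) -> list[str]:
    return critical + rest + critical

ordered = order_for_attention(
    ["policy: refunds under $50 auto-approve"],
    ["chat turn 1", "chat turn 2"],
)
print(ordered[0] == ordered[-1])  # True: the policy bookends the context
```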
Should I always summarize or can I just use a larger context window?
Larger context windows are a valid strategy when cost and latency are acceptable. However, summarization provides benefits beyond fitting in the window: it forces information distillation, reduces noise, and can actually improve quality by removing irrelevant details that might confuse the model. The best approach is hybrid — use the full window for the current session and summarize across sessions.
How do you handle context management for multi-agent systems where agents share context?
In multi-agent systems, each agent should maintain its own context relevant to its specialization, plus a shared context layer that contains cross-agent information. The shared layer should use the selective pruning strategy — each agent retrieves from it based on its current task relevance. Avoid broadcasting all context to all agents, which wastes tokens and can confuse specialists with irrelevant information.
What is the cost difference between full context and optimized context for a high-volume agent?
For an agent processing 1,000 interactions per day at 50,000 tokens per interaction with full context: ~50M input tokens/day at $3/M tokens = $150/day. With context optimization reducing average input to 15,000 tokens: ~15M tokens/day = $45/day. That is $105/day saved, or $38,000/year — for a single agent deployment. At enterprise scale with hundreds of agents, context optimization is a significant cost lever.
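The same arithmetic as a small reusable function (request volume, token counts, and the $3/M price are the example's assumptions):

```python
# Daily cost of input tokens for an agent, given request volume,
# average tokens per request, and price per million input tokens.
def daily_cost(requests: int, tokens_per_request: int,
               price_per_million: float) -> float:
    return requests * tokens_per_request / 1_000_000 * price_per_million

full = daily_cost(1_000, 50_000, 3.0)       # 150.0 dollars/day
optimized = daily_cost(1_000, 15_000, 3.0)  # 45.0 dollars/day
print(full - optimized)                 # 105.0 dollars/day saved
print(round((full - optimized) * 365))  # 38325, roughly $38K/year
```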
Written by
CallSphere Team
Expert insights on AI voice agents and customer communication automation.