
Context Window Management for AI Agents: Summarization, Pruning, and Sliding Window Strategies

Managing context in long-running AI agents: conversation summarization, selective pruning, sliding window approaches, and when to leverage 1M token context versus optimization strategies.

The Context Window Bottleneck

Every AI agent runs within the constraints of its model's context window — the maximum number of tokens the model can process in a single request. Even with models offering 200K to 1M token windows, context management matters because: (1) cost scales linearly with input tokens, (2) latency increases with context length, (3) model attention degrades on very long contexts ("lost in the middle" effect), and (4) many production tasks involve agents that run for hours or days, generating more context than any window can hold.

A customer service agent handling 50 calls per day with an average of 20 turns per call generates roughly 100,000 tokens of conversation history (assuming around 100 tokens per turn). A coding agent working on a large codebase might need to reference hundreds of files. A research agent exploring a topic might traverse dozens of web pages. Without active context management, these agents either crash against the token limit or degrade in quality as the context fills with noise.
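The arithmetic above can be sketched as a quick back-of-the-envelope estimator. The ~100 tokens-per-turn figure is an assumption for illustration; measure it for your own workload:

```python
def daily_history_tokens(
    calls_per_day: int,
    turns_per_call: int,
    tokens_per_turn: int = 100,  # assumed average; measure for your workload
) -> int:
    """Estimate how many tokens of conversation history accumulate per day."""
    return calls_per_day * turns_per_call * tokens_per_turn

# 50 calls/day x 20 turns/call x ~100 tokens/turn
print(daily_history_tokens(50, 20))  # 100000
```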

Strategy 1: Conversation Summarization

The most common approach for long-running conversational agents is to periodically summarize older parts of the conversation, replacing verbose history with a compact summary that preserves key facts.

from dataclasses import dataclass, field

@dataclass
class ConversationMemory:
    summary: str = ""
    recent_messages: list[dict] = field(default_factory=list)
    key_facts: list[str] = field(default_factory=list)
    total_messages_processed: int = 0

class SummarizationManager:
    """Manages context through periodic summarization."""

    def __init__(
        self,
        llm_client,
        max_recent_messages: int = 20,
        summarize_every: int = 10,
        max_summary_tokens: int = 500,
    ):
        self.llm = llm_client
        self.max_recent = max_recent_messages
        self.summarize_every = summarize_every
        self.max_summary_tokens = max_summary_tokens
        self.memory = ConversationMemory()

    async def add_message(self, message: dict):
        self.memory.recent_messages.append(message)
        self.memory.total_messages_processed += 1

        # Summarize once a full batch has accumulated beyond the
        # recent window (this is what `summarize_every` controls)
        if (
            len(self.memory.recent_messages)
            >= self.max_recent + self.summarize_every
        ):
            await self._summarize_oldest()

    async def _summarize_oldest(self):
        # Take the oldest messages beyond the recent window
        to_summarize = self.memory.recent_messages[
            : -self.max_recent
        ]
        self.memory.recent_messages = self.memory.recent_messages[
            -self.max_recent :
        ]

        conversation_text = "\n".join(
            f"{m['role']}: {m['content']}" for m in to_summarize
        )

        response = await self.llm.chat(
            messages=[{
                "role": "user",
                "content": (
                    f"Summarize this conversation segment, preserving "
                    f"key facts, decisions, and unresolved items. "
                    f"Be concise but complete.\n\n"
                    f"Previous summary: {self.memory.summary}\n\n"
                    f"New conversation to summarize:\n"
                    f"{conversation_text}"
                ),
            }],
            max_tokens=self.max_summary_tokens,
        )

        self.memory.summary = response.content

        # Extract key facts for quick reference
        facts = await self._extract_key_facts(to_summarize)
        self.memory.key_facts.extend(facts)
        # Keep only the most recent 20 key facts
        self.memory.key_facts = self.memory.key_facts[-20:]

    async def _extract_key_facts(
        self, messages: list[dict]
    ) -> list[str]:
        conversation_text = "\n".join(
            f"{m['role']}: {m['content']}" for m in messages
        )

        response = await self.llm.chat(messages=[{
            "role": "user",
            "content": (
                f"Extract key facts from this conversation as a "
                f"bullet list. Include: names, numbers, decisions, "
                f"commitments, and unresolved questions.\n\n"
                f"{conversation_text}"
            ),
        }])

        facts = [
            # removeprefix strips exactly one leading "-", so facts
            # that themselves start with "-" are preserved
            stripped.removeprefix("-").strip()
            for line in response.content.split("\n")
            if (stripped := line.strip()).startswith("-")
        ]
        return facts

    def build_context(self) -> list[dict]:
        """Build the context to send to the LLM."""
        context = []

        if self.memory.summary:
            context.append({
                "role": "system",
                "content": (
                    f"CONVERSATION HISTORY SUMMARY:\n"
                    f"{self.memory.summary}\n\n"
                    f"KEY FACTS:\n"
                    + "\n".join(
                        f"- {f}" for f in self.memory.key_facts
                    )
                ),
            })

        context.extend(self.memory.recent_messages)
        return context
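The slicing inside `_summarize_oldest` is easy to get wrong, so here is the same split as a standalone, testable helper (the function name is illustrative, not part of the class above):

```python
def split_for_summarization(
    messages: list[dict], max_recent: int
) -> tuple[list[dict], list[dict]]:
    """Split messages into (to_summarize, recent): everything beyond the
    most recent `max_recent` messages is summarized; the tail stays verbatim."""
    if max_recent <= 0:
        # messages[:-0] would be empty, so handle this case explicitly
        return list(messages), []
    return messages[:-max_recent], messages[-max_recent:]

msgs = [{"role": "user", "content": f"m{i}"} for i in range(25)]
old, recent = split_for_summarization(msgs, max_recent=20)
# old holds m0..m4, recent holds m5..m24; a shorter-than-window
# list yields an empty `old` and keeps everything verbatim
```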

Strategy 2: Selective Pruning

Summarization compresses everything equally. Selective pruning is smarter: it identifies which parts of the context are most relevant to the current task and drops the rest. This is particularly useful for coding agents that need to reference specific files.

from dataclasses import dataclass

@dataclass
class ContextBlock:
    id: str
    content: str
    token_count: int
    relevance_score: float = 0.0
    category: str = "general"  # "code", "conversation", "tool_result"
    timestamp: float = 0.0
    pinned: bool = False  # pinned items are never pruned

class SelectivePruner:
    """Prunes context blocks based on relevance to current task."""

    def __init__(
        self,
        llm_client,
        embeddings_client,
        max_tokens: int = 100000,
        reserve_tokens: int = 4000,  # reserve for response
    ):
        self.llm = llm_client
        self.embeddings = embeddings_client
        self.max_tokens = max_tokens
        self.reserve = reserve_tokens
        self.blocks: list[ContextBlock] = []

    def add_block(self, block: ContextBlock):
        self.blocks.append(block)

    async def prune_for_query(
        self, query: str
    ) -> list[ContextBlock]:
        available_tokens = self.max_tokens - self.reserve

        # Always include pinned blocks
        pinned = [b for b in self.blocks if b.pinned]
        pinned_tokens = sum(b.token_count for b in pinned)

        if pinned_tokens > available_tokens:
            raise ValueError(
                "Pinned blocks alone exceed context limit"
            )

        remaining_tokens = available_tokens - pinned_tokens
        unpinned = [b for b in self.blocks if not b.pinned]

        # Score unpinned blocks by relevance
        scored = await self._score_relevance(query, unpinned)
        scored.sort(key=lambda b: b.relevance_score, reverse=True)

        # Greedily add blocks until the unpinned budget is exhausted
        # (pinned tokens are already accounted for in remaining_tokens)
        selected = list(pinned)
        unpinned_tokens = 0

        for block in scored:
            if unpinned_tokens + block.token_count <= remaining_tokens:
                selected.append(block)
                unpinned_tokens += block.token_count

        # Sort selected by original order (timestamp)
        selected.sort(key=lambda b: b.timestamp)
        return selected

    async def _score_relevance(
        self, query: str, blocks: list[ContextBlock]
    ) -> list[ContextBlock]:
        if not blocks:
            return blocks

        query_embedding = await self.embeddings.embed(query)

        # Newest timestamp, used to normalize the recency bonus
        max_ts = max(b.timestamp for b in blocks) or 1.0

        for block in blocks:
            block_embedding = await self.embeddings.embed(
                block.content[:500]  # embed first 500 chars
            )
            # Cosine similarity
            dot = sum(
                a * b for a, b in zip(
                    query_embedding, block_embedding
                )
            )
            norm_q = sum(a ** 2 for a in query_embedding) ** 0.5
            norm_b = sum(b ** 2 for b in block_embedding) ** 0.5
            block.relevance_score = (
                dot / (norm_q * norm_b) if norm_q and norm_b else 0.0
            )

            # Boost recent blocks slightly, scaled relative to the
            # newest block (up to +0.1 for the most recent)
            block.relevance_score += 0.1 * (block.timestamp / max_ts)

        return blocks
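The greedy selection step can be isolated and unit-tested with plain dicts; this sketch mirrors the budget logic above without the embedding calls (field names are illustrative):

```python
def greedy_select(blocks: list[dict], budget: int) -> list[dict]:
    """Pick the highest-scoring blocks that fit in `budget` tokens,
    then restore chronological order for the final context."""
    ranked = sorted(blocks, key=lambda b: b["score"], reverse=True)
    selected, used = [], 0
    for b in ranked:
        if used + b["tokens"] <= budget:
            selected.append(b)
            used += b["tokens"]
    selected.sort(key=lambda b: b["ts"])
    return selected

blocks = [
    {"id": "a", "tokens": 400, "score": 0.9, "ts": 1},
    {"id": "b", "tokens": 700, "score": 0.8, "ts": 2},
    {"id": "c", "tokens": 300, "score": 0.2, "ts": 3},
]
picked = greedy_select(blocks, budget=1000)
print([b["id"] for b in picked])  # ['a', 'c']
```

Note that block "b" is skipped despite scoring higher than "c": greedy selection by score is not an optimal knapsack solution, but it is fast and good enough in practice.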

Strategy 3: Sliding Window with Memory Store

The sliding window approach maintains a fixed-size recent context window while persisting older information in an external memory store (database, vector store) that can be queried on demand.

from dataclasses import dataclass, field
from typing import Any

@dataclass
class MemoryEntry:
    id: str
    content: str
    embedding: list[float] = field(default_factory=list)
    metadata: dict = field(default_factory=dict)
    timestamp: float = 0.0

class SlidingWindowWithMemory:
    """Fixed-size context window backed by queryable memory store."""

    def __init__(
        self,
        llm_client,
        embeddings_client,
        vector_store,
        window_size: int = 20,
        memory_retrieval_k: int = 5,
    ):
        self.llm = llm_client
        self.embeddings = embeddings_client
        self.store = vector_store
        self.window_size = window_size
        self.retrieval_k = memory_retrieval_k
        self.window: list[dict] = []
        self._message_counter = 0

    async def add_message(self, message: dict):
        self.window.append(message)
        self._message_counter += 1

        # When window overflows, move oldest to memory store
        while len(self.window) > self.window_size:
            oldest = self.window.pop(0)
            await self._persist_to_memory(oldest)

    async def _persist_to_memory(self, message: dict):
        content = message.get("content", "")
        embedding = await self.embeddings.embed(content)

        entry = MemoryEntry(
            id=f"msg_{self._message_counter}",
            content=content,
            embedding=embedding,
            metadata={
                "role": message.get("role", "unknown"),
                "message_number": self._message_counter,
            },
            timestamp=self._message_counter,
        )

        await self.store.upsert({
            "id": entry.id,
            "embedding": entry.embedding,
            "text": entry.content,
            "metadata": entry.metadata,
        })

    async def build_context(
        self, current_query: str
    ) -> list[dict]:
        # Retrieve relevant memories
        query_embedding = await self.embeddings.embed(current_query)
        memories = await self.store.query(
            embedding=query_embedding,
            top_k=self.retrieval_k,
        )

        context = []

        # Add retrieved memories as system context
        if memories:
            memory_text = "\n".join(
                f"[{m['metadata']['role']}] {m['text']}"
                for m in memories
            )
            context.append({
                "role": "system",
                "content": (
                    f"RELEVANT CONTEXT FROM EARLIER:\n"
                    f"{memory_text}"
                ),
            })

        # Add the current sliding window
        context.extend(self.window)
        return context
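The `vector_store` dependency is left abstract above. For local testing, a minimal in-memory stand-in might look like this; the `upsert`/`query` shape is an assumption matching the calls in the class, not a real library API:

```python
import asyncio

class InMemoryVectorStore:
    """Toy vector store: cosine-similarity search over a dict."""

    def __init__(self):
        self._items: dict[str, dict] = {}

    async def upsert(self, item: dict):
        self._items[item["id"]] = item

    async def query(self, embedding: list[float], top_k: int = 5) -> list[dict]:
        def cosine(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            na = sum(x * x for x in a) ** 0.5
            nb = sum(x * x for x in b) ** 0.5
            return dot / (na * nb) if na and nb else 0.0

        ranked = sorted(
            self._items.values(),
            key=lambda it: cosine(embedding, it["embedding"]),
            reverse=True,
        )
        return ranked[:top_k]

async def demo():
    store = InMemoryVectorStore()
    await store.upsert({"id": "1", "embedding": [1.0, 0.0],
                        "text": "refund policy", "metadata": {"role": "user"}})
    await store.upsert({"id": "2", "embedding": [0.0, 1.0],
                        "text": "shipping times", "metadata": {"role": "user"}})
    hits = await store.query([0.9, 0.1], top_k=1)
    return hits[0]["text"]

print(asyncio.run(demo()))  # refund policy
```

In production you would swap this for a real vector database, but the toy version keeps unit tests fast and deterministic.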

When to Use 1M Context vs Optimization

Models with 1M token context windows (like Claude with extended context) change the calculus. But "can fit" does not mean "should fit."


Use the full 1M context when:

  • The task genuinely requires cross-referencing information spread across a large corpus (entire codebase analysis, long document QA)
  • Accuracy on distant context references is critical (legal document review, compliance checking)
  • The cost of missing a detail outweighs the inference cost
  • The task is latency-insensitive (batch processing, async analysis)

Optimize context even with 1M available when:

  • The agent runs in a real-time conversational loop (latency matters)
  • The task processes many requests (cost scales with volume)
  • Most of the context is noise for any given query
  • The agent runs for extended periods generating massive context

These heuristics can be combined into a single adaptive manager:
class AdaptiveContextManager:
    """Automatically selects context strategy based on task."""

    def __init__(
        self,
        summarizer: SummarizationManager,
        pruner: SelectivePruner,
        sliding_window: SlidingWindowWithMemory,
        model_context_limit: int = 200000,
    ):
        self.summarizer = summarizer
        self.pruner = pruner
        self.sliding = sliding_window
        self.limit = model_context_limit

    async def build_context(
        self,
        query: str,
        total_context_tokens: int,
        latency_sensitive: bool = True,
        accuracy_critical: bool = False,
    ) -> list[dict]:
        # Decision tree
        if total_context_tokens < self.limit * 0.3:
            # Under 30% of limit: use everything (copy, so callers
            # cannot mutate the live window)
            return list(self.sliding.window)

        if accuracy_critical and total_context_tokens < self.limit:
            # Accuracy critical and fits: use everything
            return list(self.sliding.window)

        if latency_sensitive:
            # Real-time: use pruning for fast, relevant context
            blocks = await self.pruner.prune_for_query(query)
            return [
                {"role": "system", "content": b.content}
                for b in blocks
            ]

        # Default: summarization for older + recent window
        return self.summarizer.build_context()

Measuring Context Management Quality

How do you know if your context management strategy is working? Track these metrics:

  • Recall rate: When the agent needs information from earlier in the conversation, how often does the context management system provide it? Test by asking the agent about facts from messages that have been summarized or pruned.
  • Context utilization: What percentage of the context window is actively relevant to the current query? Low utilization means you are paying for tokens that do not help.
  • Summary accuracy: Periodically compare summaries against the original messages. Do they preserve the key facts? Automated evaluation can score this.
  • Latency impact: Measure the time difference between full-context and optimized-context requests. The optimization is only valuable if it saves meaningful latency.
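A minimal recall probe, assuming you keep a list of planted facts and can collect the agent's answer about each one. Scoring by substring match is a deliberate simplification here; production evals usually use an LLM judge:

```python
def recall_rate(planted_facts: list[str], agent_answers: list[str]) -> float:
    """Fraction of planted facts the agent can still reproduce after
    summarization/pruning. Matching is naive case-insensitive substring."""
    if not planted_facts:
        return 1.0
    hits = sum(
        1 for fact, answer in zip(planted_facts, agent_answers)
        if fact.lower() in answer.lower()
    )
    return hits / len(planted_facts)

facts = ["order #4521", "deadline is Friday", "budget $12,000"]
answers = [
    "You mentioned order #4521 earlier.",
    "I don't recall a deadline.",
    "Yes, budget $12,000 as agreed.",
]
print(recall_rate(facts, answers))  # 2 of 3 facts recalled
```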

FAQ

Does the "lost in the middle" problem affect all models equally?

No. The "lost in the middle" effect — where models attend less to information in the middle of long contexts compared to the beginning and end — varies significantly by model architecture and training. Models trained with long-context-specific objectives (like those using ALiBi positional encoding or trained on long documents) show less degradation. However, even the best models show some attention bias. For critical information, placing it near the beginning or end of the context (or repeating it) is a practical mitigation.

Should I always summarize or can I just use a larger context window?

Larger context windows are a valid strategy when cost and latency are acceptable. However, summarization provides benefits beyond fitting in the window: it forces information distillation, reduces noise, and can actually improve quality by removing irrelevant details that might confuse the model. The best approach is hybrid — use the full window for the current session and summarize across sessions.

How do you handle context management for multi-agent systems where agents share context?

In multi-agent systems, each agent should maintain its own context relevant to its specialization, plus a shared context layer that contains cross-agent information. The shared layer should use the selective pruning strategy — each agent retrieves from it based on its current task relevance. Avoid broadcasting all context to all agents, which wastes tokens and can confuse specialists with irrelevant information.

What is the cost difference between full context and optimized context for a high-volume agent?

For an agent processing 1,000 interactions per day at 50,000 tokens per interaction with full context: ~50M input tokens/day at $3/M tokens = $150/day. With context optimization reducing average input to 15,000 tokens: ~15M tokens/day = $45/day. That is $105/day saved, or roughly $38,000/year — for a single agent deployment. At enterprise scale with hundreds of agents, context optimization is a significant cost lever.
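The FAQ arithmetic as a reusable sketch (the $3/M input-token price is this article's example figure, not a quote for any specific model):

```python
def daily_cost_usd(requests_per_day: int, tokens_per_request: int,
                   price_per_million: float = 3.0) -> float:
    """Daily input-token cost at a given per-million-token price."""
    tokens = requests_per_day * tokens_per_request
    return tokens / 1_000_000 * price_per_million

full = daily_cost_usd(1_000, 50_000)       # $150.00/day
optimized = daily_cost_usd(1_000, 15_000)  # $45.00/day
print(f"${full - optimized:.2f}/day saved, "
      f"${(full - optimized) * 365:,.0f}/year")  # $105.00/day saved, $38,325/year
```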


#ContextWindow #MemoryManagement #Summarization #AIAgents #Optimization #TokenManagement

Written by CallSphere Team