Conversation History Management: Sliding Windows, Summarization, and Compaction
Learn the three core strategies for managing conversation history in AI agents — sliding windows, summary-based compression, and compaction — to stay within context window limits while preserving critical information.
Why Conversation History Management Matters
Every LLM has a finite context window. GPT-4o supports 128K tokens, Claude supports up to 200K, and many open-source models cap at 8K or 32K. When your AI agent runs a multi-turn conversation or a long-running task, the raw message history can easily exceed these limits. Without a strategy, you either truncate blindly and lose critical context, or you hit token errors and the agent crashes.
Conversation history management is the discipline of deciding what stays in the context window, what gets compressed, and what gets discarded. There are three primary strategies: sliding windows, summarization, and compaction. Each has distinct tradeoffs between simplicity, fidelity, and compute cost.
Strategy 1: Sliding Window
The simplest approach keeps only the most recent N messages (or N tokens) and drops everything older. This works well for conversational agents where recent context matters most.
```python
from typing import List, Dict

def sliding_window(
    messages: List[Dict[str, str]],
    max_tokens: int = 4000,
    token_counter=None,
) -> List[Dict[str, str]]:
    """Keep the system message and the most recent messages that fit."""
    if token_counter is None:
        # Rough heuristic: ~4 characters per token for English text.
        token_counter = lambda msg: len(msg["content"]) // 4
    system_msgs = [m for m in messages if m["role"] == "system"]
    non_system = [m for m in messages if m["role"] != "system"]
    system_tokens = sum(token_counter(m) for m in system_msgs)
    budget = max_tokens - system_tokens
    kept = []
    running = 0
    # Walk backwards from the newest message, keeping as many as fit.
    for msg in reversed(non_system):
        cost = token_counter(msg)
        if running + cost > budget:
            break
        kept.append(msg)
        running += cost
    return system_msgs + list(reversed(kept))
```
The sliding window is fast and predictable, but it has a major flaw: once a message scrolls out of the window, the agent forgets it entirely. If a user stated their name or a key requirement 50 messages ago, that information vanishes.
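This failure mode is easy to reproduce. The sketch below uses a simplified keep-last-N window (the hypothetical `last_n` helper is illustrative, not the token-budgeted function above) to show how a fact stated early in the conversation silently disappears:

```python
from typing import List, Dict

def last_n(messages: List[Dict[str, str]], n: int) -> List[Dict[str, str]]:
    # Simplified window: keep only the n most recent messages.
    return messages[-n:]

history = [
    {"role": "user", "content": "My name is Dana and I need a refund."},
    {"role": "assistant", "content": "Got it, Dana. Which order?"},
    {"role": "user", "content": "Order #1042."},
    {"role": "assistant", "content": "Looking into order #1042 now."},
    {"role": "user", "content": "Any update?"},
]

window = last_n(history, 2)
# The user's name and the refund request have scrolled out of the window.
assert all("Dana" not in m["content"] for m in window)
```

The agent answering "Any update?" no longer knows who Dana is or what she asked for, which is exactly the gap summarization is designed to close.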
Strategy 2: Summarization
Summarization compresses older history into a shorter summary that preserves key facts while reducing token count. You periodically call the LLM to summarize the oldest portion of the conversation, then replace those messages with the summary.
```python
import openai

async def summarize_history(
    messages: List[Dict[str, str]],
    threshold: int = 3000,
    keep_recent: int = 10,
    token_counter=None,
) -> List[Dict[str, str]]:
    """Summarize old messages when total tokens exceed threshold."""
    if token_counter is None:
        token_counter = lambda msg: len(msg["content"]) // 4
    total = sum(token_counter(m) for m in messages)
    if total <= threshold:
        return messages
    system_msgs = [m for m in messages if m["role"] == "system"]
    non_system = [m for m in messages if m["role"] != "system"]
    old_messages = non_system[:-keep_recent]
    recent_messages = non_system[-keep_recent:]
    if not old_messages:
        # Nothing old enough to compress; return the history unchanged.
        return messages
    old_text = "\n".join(
        f"{m['role']}: {m['content']}" for m in old_messages
    )
    client = openai.AsyncOpenAI()
    summary_response = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": (
                "Summarize this conversation history. Preserve all key facts, "
                "decisions, user preferences, and action items:\n\n"
                f"{old_text}"
            ),
        }],
        max_tokens=500,
    )
    summary = summary_response.choices[0].message.content
    summary_msg = {
        "role": "system",
        "content": f"Summary of earlier conversation:\n{summary}",
    }
    return system_msgs + [summary_msg] + recent_messages
```
Summarization preserves long-range context at the cost of an extra LLM call and potential information loss during compression.
Strategy 3: Compaction (Hybrid)
Compaction combines both approaches. It maintains a rolling summary that gets updated incrementally as messages age out of the sliding window. Each time the window shifts, new messages are merged into the existing summary rather than re-summarizing the entire history.
```python
class CompactionManager:
    def __init__(self, window_size: int = 20, summary: str = ""):
        self.window_size = window_size
        self.summary = summary
        self.messages: List[Dict[str, str]] = []

    def add_message(self, role: str, content: str):
        self.messages.append({"role": role, "content": content})

    async def get_context(self, system_prompt: str) -> List[Dict[str, str]]:
        # When the window overflows, fold the oldest messages into the summary.
        if len(self.messages) > self.window_size:
            overflow = self.messages[:-self.window_size]
            self.messages = self.messages[-self.window_size:]
            await self._update_summary(overflow)
        context = [{"role": "system", "content": system_prompt}]
        if self.summary:
            context.append({
                "role": "system",
                "content": f"Context from earlier: {self.summary}",
            })
        context.extend(self.messages)
        return context

    async def _update_summary(self, new_messages):
        new_text = "\n".join(
            f"{m['role']}: {m['content']}" for m in new_messages
        )
        client = openai.AsyncOpenAI()
        resp = await client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{
                "role": "user",
                "content": (
                    f"Existing summary: {self.summary}\n\n"
                    f"New messages to incorporate:\n{new_text}\n\n"
                    "Produce an updated summary preserving all key facts."
                ),
            }],
            max_tokens=400,
        )
        self.summary = resp.choices[0].message.content
```
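The window-shift bookkeeping can be exercised without any API calls by injecting a stub summarizer. The class below is a simplified, synchronous sketch of the same idea (the `SimpleCompactor` name and pluggable-summarizer design are illustrative, not part of the code above):

```python
from typing import Callable, Dict, List

class SimpleCompactor:
    """Synchronous sketch: same windowing logic, pluggable summarizer."""
    def __init__(self, window_size: int, summarize: Callable[[str, str], str]):
        self.window_size = window_size
        self.summarize = summarize
        self.summary = ""
        self.messages: List[Dict[str, str]] = []

    def add_message(self, role: str, content: str) -> None:
        self.messages.append({"role": role, "content": content})
        if len(self.messages) > self.window_size:
            # Evict the oldest messages and merge them into the summary.
            overflow = self.messages[:-self.window_size]
            self.messages = self.messages[-self.window_size:]
            overflow_text = "\n".join(
                f"{m['role']}: {m['content']}" for m in overflow
            )
            self.summary = self.summarize(self.summary, overflow_text)

# Stub summarizer: just concatenates (a real one would call an LLM).
compactor = SimpleCompactor(
    window_size=2,
    summarize=lambda old, new: (old + " | " + new).strip(" |"),
)
for i in range(4):
    compactor.add_message("user", f"message {i}")

# Only the two newest messages remain verbatim...
assert [m["content"] for m in compactor.messages] == ["message 2", "message 3"]
# ...while older ones survive in the rolling summary.
assert "message 0" in compactor.summary and "message 1" in compactor.summary
```

Swapping the lambda for an LLM call recovers the behavior of `_update_summary` above, while keeping the eviction logic independently testable.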
Choosing the Right Strategy
| Strategy | Complexity | Long-Range Memory | Extra LLM Calls | Best For |
|---|---|---|---|---|
| Sliding Window | Low | None | Zero | Short conversations, chatbots |
| Summarization | Medium | Good | Periodic | Customer support, assistants |
| Compaction | High | Best | Incremental | Long-running agents, research tasks |
For most production agents, compaction provides the best balance. It keeps recent messages verbatim for accuracy while maintaining a compressed record of everything that came before.
FAQ
How do I count tokens accurately instead of estimating?
Use the `tiktoken` library for OpenAI models. Call `tiktoken.encoding_for_model("gpt-4o")` to get an encoder, then `len(encoder.encode(text))` for exact token counts. For Claude, Anthropic provides a token counting API endpoint.
Should the system message ever be summarized?
No. The system message defines the agent's behavior and should always remain in full. Only user and assistant messages should be candidates for summarization or eviction. Treat the system prompt as immutable context.
Can I combine sliding windows with an external memory store?
Yes, and this is a common production pattern. Use a sliding window for the immediate context, but persist all messages to a database or vector store. When the agent needs old information, it queries the external store and injects relevant results into the current context.
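A minimal sketch of that pattern, using an in-memory list with naive keyword matching in place of a real database or vector store (the `ArchivedHistory` class and its method names are illustrative assumptions):

```python
from typing import Dict, List

class ArchivedHistory:
    """Sliding window for the prompt, plus a full archive for recall.
    In production the archive would be a database or vector store."""
    def __init__(self, window_size: int = 6):
        self.window_size = window_size
        self.archive: List[Dict[str, str]] = []

    def add(self, role: str, content: str) -> None:
        self.archive.append({"role": role, "content": content})

    def window(self) -> List[Dict[str, str]]:
        # The immediate context sent to the model.
        return self.archive[-self.window_size:]

    def recall(self, query: str) -> List[Dict[str, str]]:
        # Naive keyword match; a vector store would use embeddings here.
        return [m for m in self.archive if query.lower() in m["content"].lower()]

hist = ArchivedHistory(window_size=2)
hist.add("user", "My account ID is 7781.")
hist.add("assistant", "Thanks, noted.")
hist.add("user", "Please check my billing status.")

# The account ID has left the window but is still retrievable on demand.
assert all("7781" not in m["content"] for m in hist.window())
assert hist.recall("account id")[0]["content"] == "My account ID is 7781."
```

Recalled messages are typically injected as an extra system or tool message ahead of the current turn, so the model sees them without the full history re-entering the window.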
#ConversationHistory #ContextWindow #TokenManagement #LLMMemory #AgenticAI #LearnAI #AIEngineering