Conversation History Management: Sliding Windows, Summarization, and Compaction
Learn the three core strategies for managing conversation history in AI agents — sliding windows, summary-based compression, and compaction — to stay within context window limits while preserving critical information.
Why Conversation History Management Matters
Every LLM has a finite context window. GPT-4o supports 128K tokens, Claude supports up to 200K, and many open-source models cap at 8K or 32K. When your AI agent runs a multi-turn conversation or a long-running task, the raw message history can easily exceed these limits. Without a strategy, you either truncate blindly and lose critical context, or you hit token errors and the agent crashes.
Conversation history management is the discipline of deciding what stays in the context window, what gets compressed, and what gets discarded. There are three primary strategies: sliding windows, summarization, and compaction. Each has distinct tradeoffs between simplicity, fidelity, and compute cost.
Strategy 1: Sliding Window
The simplest approach keeps only the most recent N messages (or N tokens) and drops everything older. This works well for conversational agents where recent context matters most.
```python
from typing import List, Dict

def sliding_window(
    messages: List[Dict[str, str]],
    max_tokens: int = 4000,
    token_counter=None,
) -> List[Dict[str, str]]:
    """Keep the system message and the most recent messages that fit."""
    if token_counter is None:
        # Rough heuristic: ~4 characters per token for English text.
        token_counter = lambda msg: len(msg["content"]) // 4
    system_msgs = [m for m in messages if m["role"] == "system"]
    non_system = [m for m in messages if m["role"] != "system"]
    system_tokens = sum(token_counter(m) for m in system_msgs)
    budget = max_tokens - system_tokens
    kept = []
    running = 0
    # Walk backwards from the newest message, keeping as many as fit.
    for msg in reversed(non_system):
        cost = token_counter(msg)
        if running + cost > budget:
            break
        kept.append(msg)
        running += cost
    return system_msgs + list(reversed(kept))
```
The sliding window is fast and predictable, but it has a major flaw: once a message scrolls out of the window, the agent forgets it entirely. If a user stated their name or a key requirement 50 messages ago, that information vanishes.
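This failure mode is easy to reproduce. The sketch below uses a simplified keep-last-N window (the hypothetical `last_n` helper is illustrative, not the token-budgeted function above) to show how a fact stated early in the conversation silently disappears:

```python
from typing import List, Dict

def last_n(messages: List[Dict[str, str]], n: int) -> List[Dict[str, str]]:
    # Simplified window: keep only the n most recent messages.
    return messages[-n:]

history = [
    {"role": "user", "content": "My name is Dana and I need a refund."},
    {"role": "assistant", "content": "Got it, Dana. Which order?"},
    {"role": "user", "content": "Order #1042."},
    {"role": "assistant", "content": "Looking into order #1042 now."},
    {"role": "user", "content": "Any update?"},
]

window = last_n(history, 2)
# The user's name and the refund request have scrolled out of the window.
assert all("Dana" not in m["content"] for m in window)
```

The agent answering "Any update?" no longer knows who Dana is or what she asked for, which is exactly the gap summarization is designed to close.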
Strategy 2: Summarization
Summarization compresses older history into a shorter summary that preserves key facts while reducing token count. You periodically call the LLM to summarize the oldest portion of the conversation, then replace those messages with the summary.
```python
import openai

async def summarize_history(
    messages: List[Dict[str, str]],
    threshold: int = 3000,
    keep_recent: int = 10,
    token_counter=None,
) -> List[Dict[str, str]]:
    """Summarize old messages when total tokens exceed threshold."""
    if token_counter is None:
        token_counter = lambda msg: len(msg["content"]) // 4
    total = sum(token_counter(m) for m in messages)
    if total <= threshold:
        return messages
    system_msgs = [m for m in messages if m["role"] == "system"]
    non_system = [m for m in messages if m["role"] != "system"]
    old_messages = non_system[:-keep_recent]
    recent_messages = non_system[-keep_recent:]
    if not old_messages:
        # Nothing old enough to compress; return the history unchanged.
        return messages
    old_text = "\n".join(
        f"{m['role']}: {m['content']}" for m in old_messages
    )
    client = openai.AsyncOpenAI()
    summary_response = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": (
                "Summarize this conversation history. Preserve all key facts, "
                "decisions, user preferences, and action items:\n\n"
                f"{old_text}"
            ),
        }],
        max_tokens=500,
    )
    summary = summary_response.choices[0].message.content
    summary_msg = {
        "role": "system",
        "content": f"Summary of earlier conversation:\n{summary}",
    }
    return system_msgs + [summary_msg] + recent_messages
```
Summarization preserves long-range context at the cost of an extra LLM call and potential information loss during compression.
Strategy 3: Compaction (Hybrid)
Compaction combines both approaches. It maintains a rolling summary that gets updated incrementally as messages age out of the sliding window. Each time the window shifts, new messages are merged into the existing summary rather than re-summarizing the entire history.
```python
class CompactionManager:
    def __init__(self, window_size: int = 20, summary: str = ""):
        self.window_size = window_size
        self.summary = summary
        self.messages: List[Dict[str, str]] = []

    def add_message(self, role: str, content: str):
        self.messages.append({"role": role, "content": content})

    async def get_context(self, system_prompt: str) -> List[Dict[str, str]]:
        # When the window overflows, fold the oldest messages into the summary.
        if len(self.messages) > self.window_size:
            overflow = self.messages[:-self.window_size]
            self.messages = self.messages[-self.window_size:]
            await self._update_summary(overflow)
        context = [{"role": "system", "content": system_prompt}]
        if self.summary:
            context.append({
                "role": "system",
                "content": f"Context from earlier: {self.summary}",
            })
        context.extend(self.messages)
        return context

    async def _update_summary(self, new_messages):
        new_text = "\n".join(
            f"{m['role']}: {m['content']}" for m in new_messages
        )
        client = openai.AsyncOpenAI()
        resp = await client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{
                "role": "user",
                "content": (
                    f"Existing summary: {self.summary}\n\n"
                    f"New messages to incorporate:\n{new_text}\n\n"
                    "Produce an updated summary preserving all key facts."
                ),
            }],
            max_tokens=400,
        )
        self.summary = resp.choices[0].message.content
```
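The window-shift bookkeeping can be exercised without any API calls by injecting a stub summarizer. The class below is a simplified, synchronous sketch of the same idea (the `SimpleCompactor` name and pluggable-summarizer design are illustrative, not part of the code above):

```python
from typing import Callable, Dict, List

class SimpleCompactor:
    """Synchronous sketch: same windowing logic, pluggable summarizer."""
    def __init__(self, window_size: int, summarize: Callable[[str, str], str]):
        self.window_size = window_size
        self.summarize = summarize
        self.summary = ""
        self.messages: List[Dict[str, str]] = []

    def add_message(self, role: str, content: str) -> None:
        self.messages.append({"role": role, "content": content})
        if len(self.messages) > self.window_size:
            # Evict the oldest messages and merge them into the summary.
            overflow = self.messages[:-self.window_size]
            self.messages = self.messages[-self.window_size:]
            overflow_text = "\n".join(
                f"{m['role']}: {m['content']}" for m in overflow
            )
            self.summary = self.summarize(self.summary, overflow_text)

# Stub summarizer: just concatenates (a real one would call an LLM).
compactor = SimpleCompactor(
    window_size=2,
    summarize=lambda old, new: (old + " | " + new).strip(" |"),
)
for i in range(4):
    compactor.add_message("user", f"message {i}")

# Only the two newest messages remain verbatim...
assert [m["content"] for m in compactor.messages] == ["message 2", "message 3"]
# ...while older ones survive in the rolling summary.
assert "message 0" in compactor.summary and "message 1" in compactor.summary
```

Swapping the lambda for an LLM call recovers the behavior of `_update_summary` above, while keeping the eviction logic independently testable.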
Choosing the Right Strategy
| Strategy | Complexity | Long-Range Memory | Extra LLM Calls | Best For |
|---|---|---|---|---|
| Sliding Window | Low | None | Zero | Short conversations, chatbots |
| Summarization | Medium | Good | Periodic | Customer support, assistants |
| Compaction | High | Best | Incremental | Long-running agents, research tasks |
For most production agents, compaction provides the best balance. It keeps recent messages verbatim for accuracy while maintaining a compressed record of everything that came before.
FAQ
How do I count tokens accurately instead of estimating?
Use the `tiktoken` library for OpenAI models. Call `tiktoken.encoding_for_model("gpt-4o")` to get an encoder, then `len(encoder.encode(text))` for exact token counts. For Claude, Anthropic provides a token counting API endpoint.
Should the system message ever be summarized?
No. The system message defines the agent's behavior and should always remain in full. Only user and assistant messages should be candidates for summarization or eviction. Treat the system prompt as immutable context.
Can I combine sliding windows with an external memory store?
Yes, and this is a common production pattern. Use a sliding window for the immediate context, but persist all messages to a database or vector store. When the agent needs old information, it queries the external store and injects relevant results into the current context.
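A minimal sketch of that pattern, using an in-memory list with naive keyword matching in place of a real database or vector store (the `ArchivedHistory` class and its method names are illustrative assumptions):

```python
from typing import Dict, List

class ArchivedHistory:
    """Sliding window for the prompt, plus a full archive for recall.
    In production the archive would be a database or vector store."""
    def __init__(self, window_size: int = 6):
        self.window_size = window_size
        self.archive: List[Dict[str, str]] = []

    def add(self, role: str, content: str) -> None:
        self.archive.append({"role": role, "content": content})

    def window(self) -> List[Dict[str, str]]:
        # The immediate context sent to the model.
        return self.archive[-self.window_size:]

    def recall(self, query: str) -> List[Dict[str, str]]:
        # Naive keyword match; a vector store would use embeddings here.
        return [m for m in self.archive if query.lower() in m["content"].lower()]

hist = ArchivedHistory(window_size=2)
hist.add("user", "My account ID is 7781.")
hist.add("assistant", "Thanks, noted.")
hist.add("user", "Please check my billing status.")

# The account ID has left the window but is still retrievable on demand.
assert all("7781" not in m["content"] for m in hist.window())
assert hist.recall("account id")[0]["content"] == "My account ID is 7781."
```

Recalled messages are typically injected as an extra system or tool message ahead of the current turn, so the model sees them without the full history re-entering the window.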
#ConversationHistory #ContextWindow #TokenManagement #LLMMemory #AgenticAI #LearnAI #AIEngineering