Chat Agent Context Management: Maintaining Coherent Multi-Turn Conversations
Master the techniques for managing conversation context in chat agents, including context window optimization, message pruning strategies, summarization, and topic tracking for coherent multi-turn interactions.
The Context Window Problem
Every LLM has a finite context window. GPT-4o supports 128K tokens, Claude supports up to 200K, but even these generous limits get consumed quickly in production chat agents. A busy customer support conversation with tool calls, system prompts, and previous messages can easily hit 50K tokens within 20 turns. Without active context management, your agent either crashes with a token limit error or starts losing track of earlier conversation details.
Context management is the discipline of deciding what information the model sees at each turn. Get it right, and your agent maintains coherent conversations across dozens of turns. Get it wrong, and users experience an agent that forgets what they said three messages ago.
Strategy 1: Sliding Window with Priority
The simplest approach is a sliding window — keep the last N messages and drop everything else. But naive truncation drops important context. A better approach assigns priority levels:
```python
from dataclasses import dataclass
from enum import IntEnum


class Priority(IntEnum):
    SYSTEM = 0      # Always keep
    PINNED = 1      # User-critical context
    RECENT = 2      # Last N messages
    HISTORICAL = 3  # Older messages, drop first


@dataclass
class ContextMessage:
    role: str
    content: str
    priority: Priority
    token_count: int


class ContextManager:
    def __init__(self, max_tokens: int = 8000):
        self.max_tokens = max_tokens
        self.messages: list[ContextMessage] = []

    def add_message(self, role: str, content: str,
                    priority: Priority = Priority.RECENT):
        tokens = len(content.split()) * 1.3  # Rough estimate
        self.messages.append(ContextMessage(role, content, priority, int(tokens)))

    def build_context(self) -> list[dict]:
        # Select by priority (system first, historical dropped first),
        # tracking indices so duplicate contents are handled correctly
        selected: set[int] = set()
        used_tokens = 0
        for i in sorted(range(len(self.messages)),
                        key=lambda i: self.messages[i].priority):
            msg = self.messages[i]
            if used_tokens + msg.token_count <= self.max_tokens:
                selected.add(i)
                used_tokens += msg.token_count
        # Restore chronological order for the LLM
        return [{"role": self.messages[i].role, "content": self.messages[i].content}
                for i in sorted(selected)]
```
The system prompt always stays. Pinned messages — things like the user's name, account number, or current issue — survive pruning. Recent messages form the active conversation. Historical messages get dropped first when space runs low.
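The prune-then-reorder step is the part that is easiest to get wrong, so here it is as a minimal standalone sketch, using plain `(priority, token_count, text)` tuples instead of the dataclass (the `prune_by_priority` name and sample history are illustrative):

```python
def prune_by_priority(messages: list[tuple[int, int, str]],
                      max_tokens: int) -> list[str]:
    """Keep the highest-priority messages that fit the budget,
    then restore chronological order. Lower value = keep first."""
    kept: set[int] = set()
    used = 0
    for i in sorted(range(len(messages)), key=lambda i: messages[i][0]):
        _, tokens, _ = messages[i]
        if used + tokens <= max_tokens:
            kept.add(i)
            used += tokens
    # Survivors go back out in their original (chronological) order.
    return [messages[i][2] for i in sorted(kept)]


history = [
    (0, 10, "system prompt"),   # always kept
    (3, 50, "old message"),     # dropped first
    (2, 30, "recent reply"),
    (2, 30, "latest question"),
]
```

With a 70-token budget, the 50-token historical message is the one that gets dropped, while the remaining messages come back in their original order.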
Strategy 2: Conversation Summarization
When a conversation grows long, summarize older turns instead of dropping them entirely. This preserves context at a fraction of the token cost:
```python
from openai import AsyncOpenAI

client = AsyncOpenAI()


async def summarize_conversation(messages: list[dict]) -> str:
    summary_prompt = (
        "Summarize the following conversation history in 2-3 sentences. "
        "Focus on: the user's main issue, any decisions made, "
        "and any pending actions. Be factual and concise."
    )
    response = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": summary_prompt},
            *messages,
        ],
        max_tokens=200,
    )
    return response.choices[0].message.content
```
```python
class SummarizingContextManager:
    def __init__(self, max_tokens: int = 8000, summarize_threshold: int = 6000):
        self.max_tokens = max_tokens
        self.summarize_threshold = summarize_threshold
        self.messages: list[dict] = []
        self.summary: str | None = None

    async def add_and_manage(self, role: str, content: str):
        self.messages.append({"role": role, "content": content})
        total_tokens = sum(len(m["content"].split()) for m in self.messages)
        if total_tokens * 1.3 > self.summarize_threshold:
            # Summarize older messages, keep last 4
            old_messages = self.messages[:-4]
            self.summary = await summarize_conversation(old_messages)
            self.messages = self.messages[-4:]

    def build_context(self, system_prompt: str) -> list[dict]:
        context = [{"role": "system", "content": system_prompt}]
        if self.summary:
            context.append({
                "role": "system",
                "content": f"Previous conversation summary: {self.summary}",
            })
        context.extend(self.messages)
        return context
```
The trick is choosing when to summarize. Set a threshold at roughly 75% of your token budget. When the conversation crosses that line, summarize everything except the last few messages.
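That decision can be isolated into a small predicate. A sketch using the same rough word-count estimate as the managers above (`should_summarize` and its defaults are illustrative, not from any library):

```python
def should_summarize(messages: list[dict], max_tokens: int = 8000,
                     ratio: float = 0.75) -> bool:
    """True once the estimated token count crosses ratio * max_tokens."""
    estimated = sum(len(m["content"].split()) for m in messages) * 1.3
    return estimated > max_tokens * ratio
```

Call it after every turn; the first time it returns True, summarize everything but the tail of the conversation.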
Strategy 3: Topic Tracking
Track what topics have been discussed so the agent can reference earlier context without keeping every message:
```python
from collections import defaultdict

from openai import AsyncOpenAI

client = AsyncOpenAI()


class TopicTracker:
    def __init__(self):
        self.topics: dict[str, list[str]] = defaultdict(list)
        self.current_topic: str | None = None

    async def classify_topic(self, message: str) -> str:
        response = await client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{
                "role": "system",
                "content": (
                    "Classify this message into one topic category. "
                    "Return only the category name. Examples: "
                    "billing, technical_support, account, shipping, general"
                ),
            }, {
                "role": "user",
                "content": message,
            }],
            max_tokens=20,
        )
        return response.choices[0].message.content.strip().lower()

    async def track(self, role: str, content: str):
        topic = await self.classify_topic(content)
        self.topics[topic].append(f"{role}: {content}")
        self.current_topic = topic

    def get_relevant_context(self) -> str:
        if not self.current_topic:
            return ""
        relevant = self.topics[self.current_topic][-6:]
        return "\n".join(relevant)
```
Topic tracking is especially powerful for support agents where users switch between issues mid-conversation. The agent can pull in context about billing when the user returns to a billing question, even if several technical support messages intervened.
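To see the mechanics without making API calls, here is an offline variant that swaps the LLM classifier for a keyword lookup (the class name and keyword table are made up for illustration; production systems should keep the LLM classifier):

```python
from collections import defaultdict


class SimpleTopicTracker:
    """Offline sketch: keyword matching stands in for the LLM classifier."""
    KEYWORDS = {
        "billing": ["invoice", "charge", "refund"],
        "shipping": ["delivery", "tracking", "package"],
    }

    def __init__(self):
        self.topics: dict[str, list[str]] = defaultdict(list)
        self.current_topic: str | None = None

    def classify(self, message: str) -> str:
        lowered = message.lower()
        for topic, words in self.KEYWORDS.items():
            if any(w in lowered for w in words):
                return topic
        return "general"

    def track(self, role: str, content: str):
        topic = self.classify(content)
        self.topics[topic].append(f"{role}: {content}")
        self.current_topic = topic

    def get_relevant_context(self) -> str:
        if not self.current_topic:
            return ""
        return "\n".join(self.topics[self.current_topic][-6:])
```

When the user returns to billing after a shipping digression, `get_relevant_context()` surfaces the earlier billing messages rather than the intervening ones.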
Combining Strategies in TypeScript
Here is a TypeScript implementation that combines sliding window with summarization:
```typescript
interface ManagedMessage {
  role: "user" | "assistant" | "system";
  content: string;
  timestamp: number;
  pinned: boolean;
}

class ConversationContext {
  private messages: ManagedMessage[] = [];
  private summary: string | null = null;
  private readonly maxTokens = 8000;

  addMessage(role: ManagedMessage["role"], content: string, pinned = false) {
    this.messages.push({
      role, content, timestamp: Date.now(), pinned,
    });
  }

  async compact(summarizer: (msgs: ManagedMessage[]) => Promise<string>) {
    const tokenEstimate = this.messages
      .reduce((sum, m) => sum + m.content.split(" ").length * 1.3, 0);
    if (tokenEstimate > this.maxTokens * 0.75) {
      const pinned = this.messages.filter((m) => m.pinned);
      const recent = this.messages.filter((m) => !m.pinned).slice(-4);
      const old = this.messages.filter(
        (m) => !m.pinned && !recent.includes(m)
      );
      this.summary = await summarizer(old);
      this.messages = [...pinned, ...recent];
    }
  }

  build(systemPrompt: string): Array<{ role: string; content: string }> {
    const ctx: Array<{ role: string; content: string }> = [
      { role: "system", content: systemPrompt },
    ];
    if (this.summary) {
      ctx.push({ role: "system", content: `Prior context: ${this.summary}` });
    }
    this.messages.forEach((m) => ctx.push({ role: m.role, content: m.content }));
    return ctx;
  }
}
```
FAQ
How do I count tokens accurately instead of estimating?
Use the tiktoken library for OpenAI models. Call tiktoken.encoding_for_model("gpt-4o") to get the tokenizer, then len(encoding.encode(text)) for exact counts. For Claude, use Anthropic's token counting API endpoint. Accurate counting prevents both wasted context space and unexpected truncation errors.
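A minimal sketch of exact counting with a graceful fallback when tiktoken is not installed (the fallback mirrors the rough word-count estimate used earlier in this post):

```python
def count_tokens(text: str, model: str = "gpt-4o") -> int:
    """Exact token count via tiktoken, rough estimate otherwise."""
    try:
        import tiktoken
        encoding = tiktoken.encoding_for_model(model)
        return len(encoding.encode(text))
    except ImportError:
        # Fallback: the word-based heuristic from the managers above.
        return int(len(text.split()) * 1.3)
```

Swap this in for the `len(content.split()) * 1.3` estimates wherever accuracy matters more than the small encoding cost.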
When should I summarize versus just truncate old messages?
Summarize when the conversation involves ongoing state — like a support ticket where the user described their problem early on and is now troubleshooting. Truncate when messages are mostly independent exchanges, like a FAQ bot where each question stands alone. The cost of a summarization call (latency and tokens) only pays off when the summary carries information the agent genuinely needs.
How do I handle tool call results in context management?
Tool call results can be verbose. Store the full result in your database but inject only a condensed version into the context. For example, if a database query returns 50 rows, summarize it as "Query returned 50 orders, most recent from March 15, total value $4,230." This preserves the key facts while saving thousands of tokens.
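A sketch of that condensation step, assuming query rows shaped like `{"id": ..., "date": ..., "total": ...}` (the helper name and row shape are illustrative):

```python
def condense_order_rows(rows: list[dict]) -> str:
    """Reduce a verbose query result to one context-friendly line."""
    if not rows:
        return "Query returned no orders."
    latest = max(r["date"] for r in rows)       # ISO dates sort lexically
    total = sum(r["total"] for r in rows)
    return (f"Query returned {len(rows)} orders, "
            f"most recent from {latest}, total value ${total:,.2f}.")
```

Store the full rows in your database keyed by conversation ID, and inject only this one-liner into the model's context.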
CallSphere Team