
Chat Agent Context Management: Maintaining Coherent Multi-Turn Conversations

Master the techniques for managing conversation context in chat agents, including context window optimization, message pruning strategies, summarization, and topic tracking for coherent multi-turn interactions.

The Context Window Problem

Every LLM has a finite context window. GPT-4o supports 128K tokens and Claude supports up to 200K, but even these generous limits get consumed quickly in production chat agents. A busy customer support conversation with tool calls, system prompts, and previous messages can easily hit 50K tokens within 20 turns. Without active context management, your agent either fails with a token limit error or starts losing track of earlier conversation details.

Context management is the discipline of deciding what information the model sees at each turn. Get it right, and your agent maintains coherent conversations across dozens of turns. Get it wrong, and users experience an agent that forgets what they said three messages ago.

Strategy 1: Sliding Window with Priority

The simplest approach is a sliding window — keep the last N messages and drop everything else. But naive truncation drops important context. A better approach assigns priority levels:

from dataclasses import dataclass, field
from enum import IntEnum

class Priority(IntEnum):
    SYSTEM = 0      # Always keep
    PINNED = 1      # User-critical context
    RECENT = 2      # Last N messages
    HISTORICAL = 3  # Older messages, drop first

@dataclass
class ContextMessage:
    role: str
    content: str
    priority: Priority
    token_count: int

class ContextManager:
    def __init__(self, max_tokens: int = 8000):
        self.max_tokens = max_tokens
        self.messages: list[ContextMessage] = []

    def add_message(self, role: str, content: str, priority: Priority = Priority.RECENT):
        tokens = len(content.split()) * 1.3  # Rough estimate
        self.messages.append(ContextMessage(role, content, priority, int(tokens)))

    def build_context(self) -> list[dict]:
        # Select by priority (system first, historical last), tracking
        # original positions so duplicate content can't confuse reordering
        by_priority = sorted(enumerate(self.messages), key=lambda pair: pair[1].priority)
        kept: list[int] = []
        used_tokens = 0

        for index, msg in by_priority:
            if used_tokens + msg.token_count <= self.max_tokens:
                kept.append(index)
                used_tokens += msg.token_count

        # Restore chronological order for the LLM
        return [
            {"role": self.messages[i].role, "content": self.messages[i].content}
            for i in sorted(kept)
        ]

The system prompt always stays. Pinned messages — things like the user's name, account number, or current issue — survive pruning. Recent messages form the active conversation. Historical messages get dropped first when space runs low.
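The select-then-reorder idea is easy to sanity-check in isolation. Here is a standalone sketch on toy data (not the class above): messages are walked in priority order, survivors are remembered by position, and the final context is rebuilt chronologically:

```python
# Toy messages: (priority, token_count, text); lower priority is kept first
msgs = [
    (0, 20, "system prompt"),
    (3, 500, "old small talk"),
    (1, 10, "pinned: account #123"),
    (2, 40, "recent question"),
]
budget = 100

kept: list[int] = []
used = 0
# Walk messages in priority order, remembering original positions
for i, (prio, toks, text) in sorted(enumerate(msgs), key=lambda p: p[1][0]):
    if used + toks <= budget:
        kept.append(i)
        used += toks

# Restore chronological order before sending to the model
context = [msgs[i][2] for i in sorted(kept)]
# The 500-token historical message is the only one dropped
```

Note that the pinned account number survives even though it arrived before the recent question; a naive last-N window would have treated both the same.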

Strategy 2: Conversation Summarization

When a conversation grows long, summarize older turns instead of dropping them entirely. This preserves context at a fraction of the token cost:

from openai import AsyncOpenAI

client = AsyncOpenAI()

async def summarize_conversation(messages: list[dict]) -> str:
    summary_prompt = (
        "Summarize the following conversation history in 2-3 sentences. "
        "Focus on: the user's main issue, any decisions made, "
        "and any pending actions. Be factual and concise."
    )
    # The v1 OpenAI SDK requires an AsyncOpenAI client for awaitable calls
    response = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": summary_prompt},
            *messages,
        ],
        max_tokens=200,
    )
    return response.choices[0].message.content

class SummarizingContextManager:
    def __init__(self, max_tokens: int = 8000, summarize_threshold: int = 6000):
        self.max_tokens = max_tokens
        self.summarize_threshold = summarize_threshold
        self.messages: list[dict] = []
        self.summary: str | None = None

    async def add_and_manage(self, role: str, content: str):
        self.messages.append({"role": role, "content": content})
        est_tokens = sum(len(m["content"].split()) for m in self.messages) * 1.3

        if est_tokens > self.summarize_threshold:
            # Fold the previous summary back in so earlier context isn't
            # lost across repeated compactions, then keep the last 4 turns
            old_messages = self.messages[:-4]
            if self.summary:
                old_messages.insert(0, {
                    "role": "system",
                    "content": f"Earlier summary: {self.summary}",
                })
            self.summary = await summarize_conversation(old_messages)
            self.messages = self.messages[-4:]

    def build_context(self, system_prompt: str) -> list[dict]:
        context = [{"role": "system", "content": system_prompt}]
        if self.summary:
            context.append({
                "role": "system",
                "content": f"Previous conversation summary: {self.summary}",
            })
        context.extend(self.messages)
        return context

The trick is choosing when to summarize. Set a threshold at roughly 75% of your token budget. When the conversation crosses that line, summarize everything except the last few messages.
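The threshold check itself fits in two small helpers. A minimal sketch using the same rough ~1.3 tokens-per-word estimate as the code above (the function names here are illustrative, not part of any library):

```python
def estimate_tokens(messages: list[dict]) -> int:
    # ~1.3 tokens per whitespace-delimited word, a rough English-text ratio
    return int(sum(len(m["content"].split()) for m in messages) * 1.3)

def should_summarize(messages: list[dict], budget: int = 8000, ratio: float = 0.75) -> bool:
    # Trigger once the estimate crosses ~75% of the token budget
    return estimate_tokens(messages) > budget * ratio
```

Checking `should_summarize` after every turn keeps compaction cost predictable: at most one summarization call per threshold crossing, rather than one per message.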


Strategy 3: Topic Tracking

Track what topics have been discussed so the agent can reference earlier context without keeping every message:

from collections import defaultdict

from openai import AsyncOpenAI

client = AsyncOpenAI()

class TopicTracker:
    def __init__(self):
        self.topics: dict[str, list[str]] = defaultdict(list)
        self.current_topic: str | None = None

    async def classify_topic(self, message: str) -> str:
        response = await client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{
                "role": "system",
                "content": (
                    "Classify this message into one topic category. "
                    "Return only the category name. Examples: "
                    "billing, technical_support, account, shipping, general"
                ),
            }, {
                "role": "user",
                "content": message,
            }],
            max_tokens=20,
        )
        return response.choices[0].message.content.strip().lower()

    async def track(self, role: str, content: str):
        topic = await self.classify_topic(content)
        self.topics[topic].append(f"{role}: {content}")
        self.current_topic = topic

    def get_relevant_context(self) -> str:
        if not self.current_topic:
            return ""
        relevant = self.topics[self.current_topic][-6:]
        return "\n".join(relevant)

Topic tracking is especially powerful for support agents where users switch between issues mid-conversation. The agent can pull in context about billing when the user returns to a billing question, even if several technical support messages intervened.
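To see that switch-back behavior without an API call, here is a toy version that swaps the LLM classifier for a keyword lookup (the keyword table is purely illustrative):

```python
from collections import defaultdict

# Illustrative keyword table standing in for the LLM classifier
KEYWORDS = {
    "billing": ["invoice", "charge", "refund"],
    "technical_support": ["error", "crash", "bug"],
}

def classify(message: str) -> str:
    lowered = message.lower()
    for topic, words in KEYWORDS.items():
        if any(w in lowered for w in words):
            return topic
    return "general"

topics: dict[str, list[str]] = defaultdict(list)
turns = [
    ("user", "I was charged twice on my last invoice"),
    ("user", "Also, the app crashes when I log in"),
    ("user", "Back to that duplicate charge: can I get a refund?"),
]
for role, text in turns:
    topics[classify(text)].append(f"{role}: {text}")

# The billing thread survives intact despite the technical interruption
```

When the user returns to billing in the third turn, both billing messages are available as a contiguous thread, even though a technical-support message sits between them chronologically.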

Combining Strategies in TypeScript

Here is a TypeScript implementation that combines sliding window with summarization:

interface ManagedMessage {
  role: "user" | "assistant" | "system";
  content: string;
  timestamp: number;
  pinned: boolean;
}

class ConversationContext {
  private messages: ManagedMessage[] = [];
  private summary: string | null = null;
  private readonly maxTokens = 8000;

  addMessage(role: ManagedMessage["role"], content: string, pinned = false) {
    this.messages.push({
      role, content, timestamp: Date.now(), pinned,
    });
  }

  async compact(summarizer: (msgs: ManagedMessage[]) => Promise<string>) {
    const tokenEstimate = this.messages
      .reduce((sum, m) => sum + m.content.split(" ").length * 1.3, 0);

    if (tokenEstimate > this.maxTokens * 0.75) {
      const pinned = this.messages.filter((m) => m.pinned);
      const recent = this.messages.filter((m) => !m.pinned).slice(-4);
      const old = this.messages.filter(
        (m) => !m.pinned && !recent.includes(m)
      );
      // Skip the summarizer call (and avoid clobbering the prior
      // summary) when there is nothing old to summarize
      if (old.length > 0) {
        this.summary = await summarizer(old);
      }
      this.messages = [...pinned, ...recent];
    }
  }

  build(systemPrompt: string): Array<{ role: string; content: string }> {
    const ctx: Array<{ role: string; content: string }> = [
      { role: "system", content: systemPrompt },
    ];
    if (this.summary) {
      ctx.push({ role: "system", content: `Prior context: ${this.summary}` });
    }
    this.messages.forEach((m) => ctx.push({ role: m.role, content: m.content }));
    return ctx;
  }
}

FAQ

How do I count tokens accurately instead of estimating?

Use the tiktoken library for OpenAI models. Call tiktoken.encoding_for_model("gpt-4o") to get the tokenizer, then len(encoding.encode(text)) for exact counts. For Claude, use Anthropic's token counting API endpoint. Accurate counting prevents both wasted context space and unexpected truncation errors.

When should I summarize versus just truncate old messages?

Summarize when the conversation involves ongoing state — like a support ticket where the user described their problem early on and is now troubleshooting. Truncate when messages are mostly independent exchanges, like a FAQ bot where each question stands alone. The cost of a summarization call (latency and tokens) only pays off when the summary carries information the agent genuinely needs.

How do I handle tool call results in context management?

Tool call results can be verbose. Store the full result in your database but inject only a condensed version into the context. For example, if a database query returns 50 rows, summarize it as "Query returned 50 orders, most recent from March 15, total value $4,230." This preserves the key facts while saving thousands of tokens.
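That condensation step can be a small helper that runs before the result enters the context. A sketch under assumptions: the row shape and field names (`date`, `total`) are invented for illustration, not part of any particular tool's schema:

```python
def condense_order_rows(rows: list[dict]) -> str:
    """Collapse a verbose query result into a one-line digest for the context."""
    if not rows:
        return "Query returned no orders."
    total_value = sum(r["total"] for r in rows)
    most_recent = max(r["date"] for r in rows)  # ISO dates sort lexicographically
    return (
        f"Query returned {len(rows)} orders, "
        f"most recent from {most_recent}, total value ${total_value:,.2f}"
    )
```

Store the raw rows in your database keyed by conversation ID; only this one-line digest enters the prompt, and the agent can fetch the full result via a tool call if it needs row-level detail.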


#ContextManagement #ConversationMemory #MultiTurn #LLM #ChatAgent #AgenticAI #LearnAI #AIEngineering

CallSphere Team

Expert insights on AI voice agents and customer communication automation.
