
Context Windows Explained: Why Token Limits Matter for AI Applications

Understand context windows in LLMs — what they are, how they differ across models, and practical strategies for building applications that work within token limits.

What Is a Context Window?

The context window is the total amount of text (measured in tokens) that a language model can process in a single request. It includes everything: the system prompt, conversation history, any documents you provide, the user's question, and the model's response. Think of it as the model's working memory — anything outside the context window simply does not exist to the model.

This is fundamentally different from how humans read. A human can reference a book they read years ago. An LLM can only work with what is currently in its context window. Understanding this constraint is essential for building reliable AI applications.
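Tokens are not the same as words: common English words are often a single token, while rarer words split into several. When you only need a rough estimate without a tokenizer library, a common heuristic is about four characters per token for English text. A minimal estimator built on that assumption (a ballpark figure, not an exact count):

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate: ~4 characters per token for English text.

    A heuristic only; use a real tokenizer (e.g. tiktoken) when
    exact counts matter for budgeting.
    """
    return max(1, round(len(text) / 4))

print(estimate_tokens("The context window is the model's working memory."))
```

For budget decisions near the limit, always switch to the model's actual tokenizer, since the real ratio varies with the text.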

Context Window Sizes Across Models

The context window landscape has expanded dramatically:

| Model | Context Window | Approximate Pages of Text |
|---|---|---|
| GPT-3.5 Turbo | 16K tokens | ~24 pages |
| GPT-4o | 128K tokens | ~192 pages |
| Claude 3.5 Sonnet | 200K tokens | ~300 pages |
| Gemini 1.5 Pro | 1M tokens | ~1,500 pages |
| Llama 3.1 405B | 128K tokens | ~192 pages |
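The page estimates above come from a rule of thumb rather than a precise conversion: roughly 0.75 English words per token and roughly 500 words per printed page (both assumptions, used here only for ballpark math):

```python
def tokens_to_pages(
    tokens: int,
    words_per_token: float = 0.75,  # assumed average for English text
    words_per_page: int = 500,      # assumed words on a printed page
) -> int:
    """Convert a token count to an approximate page count."""
    return round(tokens * words_per_token / words_per_page)

for window in (16_000, 128_000, 200_000, 1_000_000):
    print(f"{window:>9,} tokens ~ {tokens_to_pages(window):>5,} pages")
```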

Here is how to measure context window usage in practice:

import tiktoken

def analyze_context_budget(
    system_prompt: str,
    conversation_history: list[dict],
    retrieved_documents: list[str],
    max_context: int = 128_000,
    reserved_for_output: int = 4_096,
    model: str = "gpt-4o",
):
    """
    Analyze how your context budget is being spent.
    Returns a breakdown showing where tokens are going.
    """
    enc = tiktoken.encoding_for_model(model)

    system_tokens = len(enc.encode(system_prompt))

    history_tokens = 0
    for msg in conversation_history:
        # Each message has ~4 tokens of overhead for role and formatting
        history_tokens += len(enc.encode(msg["content"])) + 4

    doc_tokens = sum(len(enc.encode(doc)) for doc in retrieved_documents)

    total_input = system_tokens + history_tokens + doc_tokens
    available_for_output = max_context - total_input
    effective_output_limit = min(available_for_output, reserved_for_output)

    budget = {
        "system_prompt": system_tokens,
        "conversation_history": history_tokens,
        "retrieved_documents": doc_tokens,
        "total_input": total_input,
        "max_context": max_context,
        "utilization": f"{total_input / max_context * 100:.1f}%",
        "remaining_for_output": available_for_output,
        "effective_output_limit": effective_output_limit,
    }

    for key, value in budget.items():
        print(f"  {key}: {value:>10}" if isinstance(value, int) else f"  {key}: {value}")

    return budget

The Hidden Cost: Input vs Output

Context windows are shared between input and output. If you use 120K tokens of a 128K context window for input, the model can only generate an 8K token response. This is a common source of bugs — applications that stuff the context window with documents leave no room for a meaningful response:

def safe_document_loading(
    documents: list[str],
    system_prompt: str,
    user_query: str,
    max_context: int = 128_000,
    output_reserve: int = 4_096,
    model: str = "gpt-4o",
) -> list[str]:
    """
    Load as many documents as fit while reserving space for output.
    Returns the subset of documents that fit within the budget.
    """
    enc = tiktoken.encoding_for_model(model)

    # Calculate fixed costs
    fixed_tokens = (
        len(enc.encode(system_prompt))
        + len(enc.encode(user_query))
        + 20  # overhead for message formatting
    )

    available_for_docs = max_context - fixed_tokens - output_reserve
    print(f"Token budget for documents: {available_for_docs:,}")

    selected_docs = []
    used_tokens = 0

    for doc in documents:
        doc_tokens = len(enc.encode(doc))
        if used_tokens + doc_tokens <= available_for_docs:
            selected_docs.append(doc)
            used_tokens += doc_tokens
        else:
            print(f"Dropping document ({doc_tokens:,} tokens) — would exceed budget")
            break

    print(f"Loaded {len(selected_docs)}/{len(documents)} documents ({used_tokens:,} tokens)")
    return selected_docs
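The arithmetic behind the reserve is worth spelling out: every input token comes directly out of the response budget. A minimal helper makes the failure mode explicit (the `max_output_tokens` name is illustrative):

```python
def max_output_tokens(input_tokens: int, max_context: int = 128_000) -> int:
    """Tokens left for the model's response after the input is accounted for."""
    remaining = max_context - input_tokens
    if remaining <= 0:
        raise ValueError(
            f"Input of {input_tokens:,} tokens leaves no room for output "
            f"in a {max_context:,}-token context window"
        )
    return remaining

print(max_output_tokens(120_000))  # a 120K input leaves only 8K for the response
```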

Strategy 1: Sliding Window Conversations

For chatbot applications, conversation history grows with every exchange. Without management, it will eventually exceed the context window. A sliding window keeps the most recent messages:


def sliding_window_history(
    messages: list[dict],
    max_history_tokens: int = 8_000,
    model: str = "gpt-4o",
) -> list[dict]:
    """
    Keep recent messages that fit within the token budget.
    Always preserves the system message.
    """
    enc = tiktoken.encoding_for_model(model)

    # Always keep the system message
    system_msgs = [m for m in messages if m["role"] == "system"]
    non_system = [m for m in messages if m["role"] != "system"]

    # Count tokens from most recent backwards
    selected = []
    token_count = 0

    for msg in reversed(non_system):
        msg_tokens = len(enc.encode(msg["content"])) + 4
        if token_count + msg_tokens > max_history_tokens:
            break
        selected.insert(0, msg)
        token_count += msg_tokens

    return system_msgs + selected
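To see the windowing behavior in isolation, here is a stripped-down version with a stubbed token counter (the `fake_tokens` heuristic of ~4 characters per token stands in for a real tokenizer, so the numbers are illustrative):

```python
def fake_tokens(text: str) -> int:
    # Rough stand-in for a real tokenizer: ~4 characters per token
    return max(1, len(text) // 4)

def sliding_window(messages: list[dict], max_history_tokens: int) -> list[dict]:
    """Keep the most recent messages that fit within the token budget."""
    selected: list[dict] = []
    used = 0
    for msg in reversed(messages):
        cost = fake_tokens(msg["content"]) + 4  # +4 for role/formatting overhead
        if used + cost > max_history_tokens:
            break
        selected.insert(0, msg)
        used += cost
    return selected

history = [{"role": "user", "content": f"message {i}: " + "x" * 400} for i in range(10)]
kept = sliding_window(history, max_history_tokens=500)
print(f"Kept {len(kept)} of {len(history)} messages")
```

The oldest messages drop first; with a 500-token budget, only the most recent four of the ten messages survive.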

Strategy 2: Summarize and Compress

Instead of dropping old messages entirely, summarize them. This preserves important context while reducing token usage:

from openai import OpenAI

client = OpenAI()

def summarize_old_history(
    messages: list[dict],
    keep_recent: int = 6,
) -> list[dict]:
    """
    Summarize older messages and keep recent ones verbatim.
    """
    if len(messages) <= keep_recent + 1:  # +1 for system message
        return messages

    system_msg = messages[0]
    old_messages = messages[1:-keep_recent]
    recent_messages = messages[-keep_recent:]

    # Create a summary of old messages
    old_text = "\n".join(
        f"{m['role']}: {m['content']}" for m in old_messages
    )

    summary_response = client.chat.completions.create(
        model="gpt-4o-mini",  # Use a cheap model for summarization
        messages=[{
            "role": "user",
            "content": f"Summarize this conversation in 2-3 sentences, "
                       f"preserving key facts and decisions:\n\n{old_text}",
        }],
        max_tokens=200,
    )

    summary = summary_response.choices[0].message.content
    summary_msg = {
        "role": "system",
        "content": f"Summary of earlier conversation: {summary}",
    }

    return [system_msg, summary_msg] + recent_messages

Strategy 3: Retrieval-Augmented Generation (RAG)

For applications that need access to large knowledge bases, RAG retrieves only the relevant documents instead of loading everything into the context:

def rag_query(
    user_question: str,
    vector_store,
    top_k: int = 5,
    max_doc_tokens: int = 4_000,
):
    """
    Retrieve relevant documents and query the LLM with only
    the most relevant context — not the entire knowledge base.
    """
    # Step 1: Find relevant documents using semantic search
    relevant_docs = vector_store.similarity_search(
        query=user_question,
        k=top_k,
    )

    # Step 2: Build context from retrieved documents
    enc = tiktoken.encoding_for_model("gpt-4o")
    context_parts = []
    token_count = 0

    for doc in relevant_docs:
        doc_tokens = len(enc.encode(doc.page_content))
        if token_count + doc_tokens > max_doc_tokens:
            break
        context_parts.append(doc.page_content)
        token_count += doc_tokens

    context = "\n\n---\n\n".join(context_parts)

    # Step 3: Query with focused context
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Answer based on the provided context. "
                                           "If the answer is not in the context, say so."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {user_question}"},
        ],
    )

    return response.choices[0].message.content

The "Lost in the Middle" Problem

Research has shown that LLMs pay more attention to information at the beginning and end of the context window, with weaker recall for information in the middle. This is called the "lost in the middle" problem, and it has practical implications:

def position_aware_context(documents: list[str], query: str) -> list[str]:
    """
    Reorder documents to place the most relevant ones at the
    beginning and end of the context, avoiding the weak middle.
    """
    # Assume documents are ranked by relevance (index 0 = most relevant)
    if len(documents) <= 2:
        return documents

    # Interleave: best at start, second-best at end, etc.
    start = []
    end = []

    for i, doc in enumerate(documents):
        if i % 2 == 0:
            start.append(doc)
        else:
            end.append(doc)

    return start + list(reversed(end))
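Condensed to its core with slicing, the reordering is easy to verify (this `reorder` helper is just the interleaving logic above, renamed for a standalone check):

```python
def reorder(ranked_docs: list[str]) -> list[str]:
    # Even ranks go to the front, odd ranks to the back (reversed),
    # so the least relevant documents land in the middle
    start = ranked_docs[::2]
    end = ranked_docs[1::2]
    return start + end[::-1]

print(reorder(["d1", "d2", "d3", "d4", "d5"]))  # d1 first, d2 last
```

Five documents ranked d1 (most relevant) through d5 come out as `["d1", "d3", "d5", "d4", "d2"]`: the top two matches sit at the edges, and the weakest matches sink toward the middle.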

FAQ

What happens if my input exceeds the context window?

The API will return an error. It will not silently truncate your input. You must manage context size yourself. Always count tokens before making an API call and truncate or paginate as needed. Some models offer a truncation parameter that automatically trims the conversation from the beginning, but relying on this means losing potentially important context without awareness.
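A minimal pre-flight guard illustrates the idea; this sketch uses a rough character-based estimate, so a real tokenizer should replace `len(prompt) // 4` in production:

```python
def preflight_check(
    prompt: str,
    max_context: int = 128_000,
    output_reserve: int = 4_096,
) -> None:
    """Raise before the API call if the input would blow the context budget."""
    estimated = len(prompt) // 4  # rough heuristic; use a real tokenizer in production
    budget = max_context - output_reserve
    if estimated > budget:
        raise ValueError(
            f"Estimated {estimated:,} tokens exceeds the input budget of {budget:,}; "
            "truncate, summarize, or paginate before sending"
        )
```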

Does a larger context window always mean better results?

Not necessarily. Larger context windows let you include more information, but they come with trade-offs: higher cost (you pay for all input tokens), higher latency (more tokens to process), and the "lost in the middle" problem. In many cases, retrieving a focused 2,000-token context via RAG produces better results than dumping 50,000 tokens of loosely related documents into the prompt.
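The cost side of the trade-off is easy to quantify. Assuming a hypothetical rate of $2.50 per million input tokens (actual pricing varies by model and changes over time):

```python
def input_cost(tokens: int, usd_per_million: float = 2.50) -> float:
    # Hypothetical rate; check your provider's current pricing
    return tokens * usd_per_million / 1_000_000

print(f"50K-token prompt:   ${input_cost(50_000):.4f} per call")
print(f"2K-token RAG slice: ${input_cost(2_000):.4f} per call")
```

At that rate the 50K-token prompt costs 25x more per call than the focused 2K-token context, before accounting for the extra latency.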

How do multi-turn conversations consume the context window?

Every message in the conversation — both user and assistant messages — is sent with every API call. A 20-turn conversation with detailed responses can easily consume 10,000 to 20,000 tokens of context before the user even asks their next question. This is why sliding window and summarization strategies are essential for production chatbots.
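A quick simulation makes the growth concrete, assuming each turn adds a 50-token question and a 500-token answer (illustrative numbers):

```python
def history_tokens_at_turn(
    turn: int,
    user_tokens: int = 50,
    assistant_tokens: int = 500,
) -> int:
    """Tokens of accumulated history resent with the API call for the given turn (1-indexed)."""
    return (turn - 1) * (user_tokens + assistant_tokens)

print(history_tokens_at_turn(21))  # history carried into the 21st user question
```

By the 21st question, 11,000 tokens of history ride along with every call, and the total only grows from there.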


#ContextWindow #TokenLimits #LLM #RAG #PromptEngineering #AgenticAI #LearnAI #AIEngineering


CallSphere Team

Expert insights on AI voice agents and customer communication automation.
