# Context Windows Explained: Why Token Limits Matter for AI Applications
Understand context windows in LLMs — what they are, how they differ across models, and practical strategies for building applications that work within token limits.
## What Is a Context Window?
The context window is the total amount of text (measured in tokens) that a language model can process in a single request. It includes everything: the system prompt, conversation history, any documents you provide, the user's question, and the model's response. Think of it as the model's working memory — anything outside the context window simply does not exist to the model.
This is fundamentally different from how humans read. A human can reference a book they read years ago. An LLM can only work with what is currently in its context window. Understanding this constraint is essential for building reliable AI applications.
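A token is not a word or a character; it is a chunk produced by the model's tokenizer, typically around four characters of English prose. A rough estimate is often enough for budgeting (a heuristic sketch only; use a real tokenizer such as tiktoken for exact counts):

```python
def rough_token_estimate(text: str) -> int:
    # Rule of thumb: roughly 4 characters per token for English prose.
    # Real BPE tokenizers vary by language and content, so treat this
    # as an estimate only, not an exact count.
    return max(1, len(text) // 4)

print(rough_token_estimate("Anything outside the context window does not exist to the model."))
```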
## Context Window Sizes Across Models
The context window landscape has expanded dramatically:
| Model | Context Window | Approximate Pages of Text |
|---|---|---|
| GPT-3.5 Turbo | 16K tokens | ~24 pages |
| GPT-4o | 128K tokens | ~192 pages |
| Claude 3.5 Sonnet | 200K tokens | ~300 pages |
| Gemini 1.5 Pro | 1M tokens | ~1,500 pages |
| Llama 3.1 405B | 128K tokens | ~192 pages |
Here is how to measure context window usage in practice:
```python
import tiktoken

def analyze_context_budget(
    system_prompt: str,
    conversation_history: list[dict],
    retrieved_documents: list[str],
    max_context: int = 128_000,
    reserved_for_output: int = 4_096,
    model: str = "gpt-4o",
):
    """
    Analyze how your context budget is being spent.
    Returns a breakdown showing where tokens are going.
    """
    enc = tiktoken.encoding_for_model(model)
    system_tokens = len(enc.encode(system_prompt))

    history_tokens = 0
    for msg in conversation_history:
        # Each message has ~4 tokens of overhead for role and formatting
        history_tokens += len(enc.encode(msg["content"])) + 4

    doc_tokens = sum(len(enc.encode(doc)) for doc in retrieved_documents)

    total_input = system_tokens + history_tokens + doc_tokens
    available_for_output = max_context - total_input
    effective_output_limit = min(available_for_output, reserved_for_output)

    budget = {
        "system_prompt": system_tokens,
        "conversation_history": history_tokens,
        "retrieved_documents": doc_tokens,
        "total_input": total_input,
        "max_context": max_context,
        "utilization": f"{total_input / max_context * 100:.1f}%",
        "remaining_for_output": available_for_output,
        "effective_output_limit": effective_output_limit,
    }
    for key, value in budget.items():
        print(f"  {key}: {value:>10}" if isinstance(value, int) else f"  {key}: {value}")
    return budget
```
## The Hidden Cost: Input vs Output
Context windows are shared between input and output. If you use 120K tokens of a 128K context window for input, the model can only generate an 8K token response. This is a common source of bugs — applications that stuff the context window with documents leave no room for a meaningful response:
```python
def safe_document_loading(
    documents: list[str],
    system_prompt: str,
    user_query: str,
    max_context: int = 128_000,
    output_reserve: int = 4_096,
    model: str = "gpt-4o",
) -> list[str]:
    """
    Load as many documents as fit while reserving space for output.
    Returns the subset of documents that fit within the budget.
    """
    enc = tiktoken.encoding_for_model(model)

    # Calculate fixed costs
    fixed_tokens = (
        len(enc.encode(system_prompt))
        + len(enc.encode(user_query))
        + 20  # overhead for message formatting
    )
    available_for_docs = max_context - fixed_tokens - output_reserve
    print(f"Token budget for documents: {available_for_docs:,}")

    selected_docs = []
    used_tokens = 0
    for doc in documents:
        doc_tokens = len(enc.encode(doc))
        if used_tokens + doc_tokens <= available_for_docs:
            selected_docs.append(doc)
            used_tokens += doc_tokens
        else:
            print(f"Dropping document ({doc_tokens:,} tokens): would exceed budget")
            break

    print(f"Loaded {len(selected_docs)}/{len(documents)} documents ({used_tokens:,} tokens)")
    return selected_docs
```
## Strategy 1: Sliding Window Conversations
For chatbot applications, conversation history grows with every exchange. Without management, it will eventually exceed the context window. A sliding window keeps the most recent messages:
```python
def sliding_window_history(
    messages: list[dict],
    max_history_tokens: int = 8_000,
    model: str = "gpt-4o",
) -> list[dict]:
    """
    Keep recent messages that fit within the token budget.
    Always preserves the system message.
    """
    enc = tiktoken.encoding_for_model(model)

    # Always keep the system message
    system_msgs = [m for m in messages if m["role"] == "system"]
    non_system = [m for m in messages if m["role"] != "system"]

    # Count tokens from most recent backwards
    selected = []
    token_count = 0
    for msg in reversed(non_system):
        msg_tokens = len(enc.encode(msg["content"])) + 4
        if token_count + msg_tokens > max_history_tokens:
            break
        selected.insert(0, msg)
        token_count += msg_tokens

    return system_msgs + selected
```
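When a tokenizer is not available (or exact counts are not worth the dependency), the same pattern can fall back to a plain message-count window. A minimal sketch under that assumption:

```python
def sliding_window_by_count(messages: list[dict], max_messages: int = 10) -> list[dict]:
    # Keep the system message(s) plus only the most recent exchanges.
    system_msgs = [m for m in messages if m["role"] == "system"]
    non_system = [m for m in messages if m["role"] != "system"]
    return system_msgs + non_system[-max_messages:]

history = [{"role": "system", "content": "You are helpful."}] + [
    {"role": "user" if i % 2 == 0 else "assistant", "content": f"turn {i}"}
    for i in range(30)
]
trimmed = sliding_window_by_count(history, max_messages=10)
print(len(trimmed))  # 1 system message + 10 most recent messages
```

The trade-off is obvious: message count is a poor proxy for token count when message lengths vary widely, which is why the token-based version above is preferable in production.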
## Strategy 2: Summarize and Compress
Instead of dropping old messages entirely, summarize them. This preserves important context while reducing token usage:
```python
from openai import OpenAI

client = OpenAI()

def summarize_old_history(
    messages: list[dict],
    keep_recent: int = 6,
) -> list[dict]:
    """
    Summarize older messages and keep recent ones verbatim.
    """
    if len(messages) <= keep_recent + 1:  # +1 for system message
        return messages

    system_msg = messages[0]
    old_messages = messages[1:-keep_recent]
    recent_messages = messages[-keep_recent:]

    # Create a summary of old messages
    old_text = "\n".join(
        f"{m['role']}: {m['content']}" for m in old_messages
    )
    summary_response = client.chat.completions.create(
        model="gpt-4o-mini",  # Use a cheap model for summarization
        messages=[{
            "role": "user",
            "content": f"Summarize this conversation in 2-3 sentences, "
                       f"preserving key facts and decisions:\n\n{old_text}",
        }],
        max_tokens=200,
    )
    summary = summary_response.choices[0].message.content

    summary_msg = {
        "role": "system",
        "content": f"Summary of earlier conversation: {summary}",
    }
    return [system_msg, summary_msg] + recent_messages
```
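The list-splicing logic can be exercised without an API call by injecting the summarizer as a plain function; the lambda below is a stand-in for the gpt-4o-mini call, and the helper name is our own, not from the snippet above:

```python
def compress_history(messages: list[dict], summarize_fn, keep_recent: int = 6) -> list[dict]:
    # Same splice as above: system message, summary of old turns, recent turns.
    if len(messages) <= keep_recent + 1:
        return messages
    summary = summarize_fn(messages[1:-keep_recent])
    summary_msg = {"role": "system",
                   "content": f"Summary of earlier conversation: {summary}"}
    return [messages[0], summary_msg] + messages[-keep_recent:]

msgs = [{"role": "system", "content": "You are helpful."}] + [
    {"role": "user" if i % 2 == 0 else "assistant", "content": f"turn {i}"}
    for i in range(10)
]
compressed = compress_history(msgs, lambda old: f"{len(old)} older messages elided")
print(len(compressed))  # system + summary + 6 recent = 8 messages
```

Separating the splice from the API call also makes the truncation behavior unit-testable, which matters because off-by-one slicing bugs here silently drop messages.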
## Strategy 3: Retrieval-Augmented Generation (RAG)
For applications that need access to large knowledge bases, RAG retrieves only the relevant documents instead of loading everything into the context:
```python
def rag_query(
    user_question: str,
    vector_store,
    top_k: int = 5,
    max_doc_tokens: int = 4_000,
):
    """
    Retrieve relevant documents and query the LLM with only
    the most relevant context, not the entire knowledge base.
    """
    # Step 1: Find relevant documents using semantic search
    relevant_docs = vector_store.similarity_search(
        query=user_question,
        k=top_k,
    )

    # Step 2: Build context from retrieved documents
    enc = tiktoken.encoding_for_model("gpt-4o")
    context_parts = []
    token_count = 0
    for doc in relevant_docs:
        doc_tokens = len(enc.encode(doc.page_content))
        if token_count + doc_tokens > max_doc_tokens:
            break
        context_parts.append(doc.page_content)
        token_count += doc_tokens
    context = "\n\n---\n\n".join(context_parts)

    # Step 3: Query with focused context
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Answer based on the provided context. "
                                          "If the answer is not in the context, say so."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {user_question}"},
        ],
    )
    return response.choices[0].message.content
```
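The `vector_store.similarity_search` call above assumes a LangChain-style vector store. To see the retrieval step in isolation, here is a toy keyword-overlap ranker standing in for semantic search; it is an illustration of "retrieve only what's relevant," not a substitute for embeddings:

```python
def keyword_search(query: str, docs: list[str], k: int = 2) -> list[str]:
    # Rank documents by how many whole words they share with the query.
    q_words = set(query.lower().split())
    scored = sorted(
        docs,
        key=lambda d: len(q_words & set(d.lower().split())),
        reverse=True,
    )
    return scored[:k]

docs = [
    "The refund policy covers purchases within 30 days.",
    "Shipping times vary by region and carrier.",
    "Our office is closed on public holidays.",
]
print(keyword_search("what is the refund policy", docs, k=1))
```

Real systems replace the word-overlap score with cosine similarity over embeddings, but the contract is the same: given a query, return the top-k most relevant documents, and only those go into the context.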
## The "Lost in the Middle" Problem
Research (notably Liu et al., 2023, "Lost in the Middle: How Language Models Use Long Contexts") has shown that LLMs recall information at the beginning and end of the context window more reliably than information buried in the middle. This "lost in the middle" problem has practical implications:
```python
def position_aware_context(documents: list[str]) -> list[str]:
    """
    Reorder documents (pre-ranked by relevance, index 0 = most relevant)
    so the strongest ones sit at the beginning and end of the context,
    avoiding the weak middle.
    """
    if len(documents) <= 2:
        return documents

    # Interleave: best at start, second-best at end, and so on inward.
    start = []
    end = []
    for i, doc in enumerate(documents):
        if i % 2 == 0:
            start.append(doc)
        else:
            end.append(doc)
    return start + list(reversed(end))
```
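Applied to five documents ranked d1 (most relevant) through d5, the interleaving puts d1 first and d2 last, restated here as a self-contained sketch:

```python
def interleave_ranked(docs: list[str]) -> list[str]:
    # Even indices go to the front half, odd indices to the back (reversed).
    start, end = [], []
    for i, doc in enumerate(docs):
        (start if i % 2 == 0 else end).append(doc)
    return start + list(reversed(end))

print(interleave_ranked(["d1", "d2", "d3", "d4", "d5"]))
# d1 leads and d2 closes; the weakest documents land in the middle.
```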
## FAQ
### What happens if my input exceeds the context window?
The API will return an error. It will not silently truncate your input. You must manage context size yourself: always count tokens before making an API call and truncate or paginate as needed. Some models offer a truncation parameter that automatically trims the conversation from the beginning, but relying on it means silently losing potentially important context.
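A pre-flight check can be as simple as comparing an estimated input size against the window minus the output reserve. The sketch below uses a character-based estimate in place of a real tokenizer; the 3-characters-per-token ratio deliberately overestimates for English text, so passing the check is conservative:

```python
def fits_in_context(prompt: str, max_context: int = 128_000,
                    output_reserve: int = 4_096) -> bool:
    # Deliberately pessimistic estimate: ~3 characters per token
    # overcounts for English prose, erring on the safe side.
    estimated_tokens = len(prompt) // 3 + 1
    return estimated_tokens + output_reserve <= max_context

print(fits_in_context("short prompt"))
```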
### Does a larger context window always mean better results?
Not necessarily. Larger context windows let you include more information, but they come with trade-offs: higher cost (you pay for all input tokens), higher latency (more tokens to process), and the "lost in the middle" problem. In many cases, retrieving a focused 2,000-token context via RAG produces better results than dumping 50,000 tokens of loosely related documents into the prompt.
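A quick cost comparison makes the trade-off concrete. The $2.50-per-million-input-token rate below is purely illustrative; check current provider pricing:

```python
price_per_million_input = 2.50  # illustrative USD rate, not current pricing

for input_tokens in (2_000, 50_000):
    cost = input_tokens / 1_000_000 * price_per_million_input
    print(f"{input_tokens:>6} input tokens -> ${cost:.4f} per request")
```

At these assumed rates the 50,000-token prompt costs 25x more per request than the focused 2,000-token one, before accounting for the extra latency.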
### How do multi-turn conversations consume the context window?
Every message in the conversation — both user and assistant messages — is sent with every API call. A 20-turn conversation with detailed responses can easily consume 10,000 to 20,000 tokens of context before the user even asks their next question. This is why sliding window and summarization strategies are essential for production chatbots.
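A back-of-envelope calculation, assuming an average of 500 tokens per user-plus-assistant exchange, shows how resending the full history compounds:

```python
per_exchange = 500  # assumed average tokens added per exchange
total_sent = 0
for turn in range(1, 21):
    history_tokens = per_exchange * turn  # full history is resent each call
    total_sent += history_tokens
print(total_sent)  # cumulative input tokens billed across 20 turns
```

Because each call resends everything before it, total input tokens grow quadratically with conversation length, which is exactly what sliding windows and summarization cap.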
CallSphere Team
Expert insights on AI voice agents and customer communication automation.