Context Distillation: Compressing Long Contexts into Efficient Representations
Learn how context distillation compresses lengthy documents, conversation histories, and knowledge bases into compact representations that preserve essential information while dramatically reducing token costs.
The Long Context Problem
Modern agents often need to reason over massive contexts: entire codebases, long conversation histories, large document collections, or extensive knowledge bases. While newer models support 128K or even 1M token context windows, using them fully is expensive — API costs scale linearly with input tokens, and attention computation scales quadratically with sequence length in standard transformers.
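To make the cost argument concrete, here is a back-of-envelope sketch; the per-token price is a hypothetical placeholder for illustration, not any provider's actual rate:

```python
# Illustrative only: substitute your provider's real input-token price.
PRICE_PER_1K_INPUT_TOKENS = 0.01  # USD per 1K input tokens (assumed)

def input_cost(tokens: int, price_per_1k: float = PRICE_PER_1K_INPUT_TOKENS) -> float:
    """Input cost grows linearly with the number of tokens sent."""
    return tokens / 1000 * price_per_1k

full = input_cost(100_000)     # a 100K-token document sent as-is
distilled = input_cost(5_000)  # the same document distilled to 5K tokens
savings = full / distilled     # 20x cheaper per request at this ratio
```

At a 20:1 compression ratio the per-request input cost drops by the same factor, which is why distillation pays off most on contexts that are re-sent on every turn.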
Context distillation addresses this by compressing long contexts into shorter representations that preserve the essential information needed for downstream tasks, reducing both cost and latency.
What Is Context Distillation?
Context distillation is the process of converting a long, detailed context into a shorter form that retains the information most relevant to subsequent queries. This can happen at multiple levels:
Text-level distillation uses an LLM to summarize or extract key information from long documents, producing a shorter text that replaces the original in the context window.
Embedding-level distillation compresses text into dense vector representations that can be injected into the model's hidden states, bypassing the tokenization step entirely.
Soft-prompt distillation trains continuous vectors that encode the information content of a long context into a fixed number of virtual tokens.
Text-Level Context Compression
The simplest form of context distillation uses the model itself to compress information. This is practical, requires no special infrastructure, and works with any API-based model:
```python
from openai import OpenAI


class ContextDistiller:
    """Compresses long contexts into shorter, information-dense summaries."""

    def __init__(self, client: OpenAI, model: str = "gpt-4"):
        self.client = client
        self.model = model

    def distill(
        self, long_context: str, task_description: str, target_tokens: int = 500
    ) -> str:
        """Compress context while preserving task-relevant information."""
        response = self.client.chat.completions.create(
            model=self.model,
            messages=[{
                "role": "user",
                "content": f"""Compress the following context into approximately {target_tokens} tokens.

Preserve all information that would be relevant to this task: {task_description}

Rules:
- Keep specific numbers, names, dates, and technical details
- Remove redundant explanations and filler
- Use dense, information-rich language
- Maintain factual accuracy — never infer or add information

Context to compress:
{long_context}""",
            }],
        )
        return response.choices[0].message.content

    def hierarchical_distill(
        self, documents: list[str], task_description: str, chunk_size: int = 4000
    ) -> str:
        """Distill multiple documents using a hierarchical approach."""
        # Level 1: distill each document individually, chunk by chunk
        summaries = []
        for doc in documents:
            chunks = [doc[i:i + chunk_size] for i in range(0, len(doc), chunk_size)]
            chunk_summaries = [
                self.distill(chunk, task_description, target_tokens=200)
                for chunk in chunks
            ]
            summaries.append("\n".join(chunk_summaries))
        # Level 2: distill the combined summaries into one final summary
        combined = "\n---\n".join(summaries)
        return self.distill(combined, task_description, target_tokens=800)
```
Selective Context: Keeping What Matters
Instead of summarizing everything, selective context identifies and retains only the portions of the context that are relevant to the current task. This preserves exact wording (important for quotation and code) while discarding irrelevant sections:
```python
import numpy as np
from openai import OpenAI


class SelectiveContext:
    """Retains only task-relevant portions of a long context."""

    def __init__(self, client: OpenAI):
        self.client = client

    def select(
        self, paragraphs: list[str], query: str, budget: int = 10
    ) -> list[str]:
        """Select the most relevant paragraphs for a given query."""
        # Get embeddings for the query and all paragraphs in one call
        all_texts = [query] + paragraphs
        response = self.client.embeddings.create(
            model="text-embedding-3-small", input=all_texts,
        )
        embeddings = [np.array(e.embedding) for e in response.data]
        query_emb = embeddings[0]
        para_embs = embeddings[1:]

        # Compute cosine similarity between the query and each paragraph
        similarities = []
        for i, emb in enumerate(para_embs):
            sim = np.dot(query_emb, emb) / (
                np.linalg.norm(query_emb) * np.linalg.norm(emb)
            )
            similarities.append((i, sim))

        # Select top-k most relevant paragraphs, maintaining original order
        similarities.sort(key=lambda x: x[1], reverse=True)
        selected_indices = sorted(idx for idx, _ in similarities[:budget])
        return [paragraphs[i] for i in selected_indices]
```
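The order-preserving top-k step at the end is worth seeing in isolation. Here is the same pattern on hypothetical precomputed similarity scores, without any API calls:

```python
def select_top_k_in_order(items: list[str], scores: list[float], k: int) -> list[str]:
    """Keep the k highest-scoring items, but return them in their original order."""
    ranked = sorted(range(len(items)), key=lambda i: scores[i], reverse=True)
    keep = sorted(ranked[:k])  # re-sort the winners by position, not score
    return [items[i] for i in keep]

paras = ["intro", "pricing table", "legal boilerplate", "pricing notes"]
scores = [0.12, 0.91, 0.05, 0.77]
select_top_k_in_order(paras, scores, k=2)  # ['pricing table', 'pricing notes']
```

Restoring the original order matters: downstream prompts read more coherently when the surviving paragraphs appear in the sequence the source document used.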
Quality Preservation Techniques
Context compression always risks losing important information. Several techniques help preserve quality:
Task-aware compression. Always compress with the downstream task in mind. A context compressed for question-answering should retain different details than one compressed for summarization.
Compression ratio monitoring. Track the ratio of original to compressed token counts. Ratios above 10:1 often show significant quality degradation. A 3:1 to 5:1 ratio is typically safe for most tasks.
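A minimal monitoring sketch, using the rough heuristic of about four characters per English token; in production you would measure with your model's actual tokenizer (e.g. tiktoken) instead:

```python
def estimate_tokens(text: str) -> int:
    """Rough heuristic: ~4 characters per token for English text.
    Swap in your model's real tokenizer for accurate counts."""
    return max(1, len(text) // 4)

def compression_ratio(original: str, compressed: str) -> float:
    """Ratio of original to compressed token counts (e.g. 4.0 means 4:1)."""
    return estimate_tokens(original) / estimate_tokens(compressed)

def check_ratio(original: str, compressed: str, max_ratio: float = 5.0) -> bool:
    """Flag compressions beyond the typically safe 3:1 to 5:1 range for review."""
    return compression_ratio(original, compressed) <= max_ratio
```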
Validation through reconstruction. After compression, test whether the compressed context supports answering the same questions as the original. If accuracy drops below a threshold, reduce the compression ratio.
```python
def validate_compression(
    original: str, compressed: str, validation_questions: list[str], client: OpenAI
) -> dict:
    """Measure information loss from context compression.

    Relies on two helpers (not shown): ask_with_context(context, question,
    client), which answers a question against a given context, and
    check_semantic_match(a, b, client), which judges whether two answers agree.
    """
    results = {"questions": len(validation_questions), "matches": 0}
    for question in validation_questions:
        # Answer the same question with the original and the compressed context
        orig_answer = ask_with_context(original, question, client)
        comp_answer = ask_with_context(compressed, question, client)
        # Compare answers semantically rather than string-for-string
        if check_semantic_match(orig_answer, comp_answer, client):
            results["matches"] += 1
    results["retention_rate"] = results["matches"] / results["questions"]
    return results
```
Practical Usage in Agent Pipelines
In multi-turn agent conversations, context distillation can be applied to conversation history. Instead of passing the full history (which grows with every turn), periodically compress older turns into a summary while keeping recent turns intact:
```python
class ConversationCompressor:
    """Manages conversation history with rolling compression."""

    def __init__(self, client: OpenAI, recent_turns: int = 5, max_summary_tokens: int = 500):
        self.client = client
        self.recent_turns = recent_turns
        self.max_summary_tokens = max_summary_tokens
        self.summary = ""
        self.history: list[dict] = []

    def add_turn(self, role: str, content: str):
        self.history.append({"role": role, "content": content})
        if len(self.history) > self.recent_turns * 2:
            self._compress_old_turns()

    def _compress_old_turns(self):
        old = self.history[:-self.recent_turns]
        self.history = self.history[-self.recent_turns:]
        old_text = "\n".join(f"{t['role']}: {t['content']}" for t in old)
        context = (
            f"Previous summary: {self.summary}\n\nNew turns:\n{old_text}"
            if self.summary
            else old_text
        )
        distiller = ContextDistiller(self.client)
        self.summary = distiller.distill(context, "ongoing conversation", self.max_summary_tokens)

    def get_messages(self) -> list[dict]:
        messages = []
        if self.summary:
            messages.append({
                "role": "system",
                "content": f"Summary of earlier conversation: {self.summary}",
            })
        messages.extend(self.history)
        return messages
```
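The bookkeeping behind this rolling window can be exercised without an LLM. The following stub mirrors the same trigger and window logic but replaces the summarizer with a fake that just counts folded turns, which makes the bounded-history behavior easy to verify:

```python
class RollingHistory:
    """Illustrative stub of the rolling-compression bookkeeping.
    The 'summarizer' is a fake that counts folded turns instead of calling an LLM."""

    def __init__(self, recent_turns: int = 2):
        self.recent_turns = recent_turns
        self.summary = ""
        self.history: list[dict] = []

    def add_turn(self, role: str, content: str):
        self.history.append({"role": role, "content": content})
        if len(self.history) > self.recent_turns * 2:
            old = self.history[:-self.recent_turns]
            self.history = self.history[-self.recent_turns:]
            # Stub summarizer: accumulate a count of compressed-away turns
            prev = int(self.summary or 0)
            self.summary = str(prev + len(old))

demo = RollingHistory(recent_turns=2)
for i in range(6):
    demo.add_turn("user", f"turn {i}")
# History stays bounded; older turns have been folded into the summary
```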
FAQ
How much can I compress without losing quality?
For factual question-answering tasks, 3-5x compression typically preserves 90%+ of answer accuracy. For tasks requiring exact details (code, legal language, numbers), keep compression ratios below 3x or use selective context instead of summarization. Always validate with task-specific benchmarks.
Is context distillation better than using a long-context model?
They are complementary. Long-context models eliminate the need for compression up to their window size, but costs scale linearly with context length. Distillation reduces those costs. For a 100K-token document where you need only specific facts, distilling to 5K tokens and using a standard model is both cheaper and often more accurate than stuffing the full document into a long-context window.
Does compression introduce hallucinations?
Yes, LLM-based text compression can introduce subtle hallucinations — the summarizer may infer connections or generalize details that change meaning. This is why selective context (retaining exact original text) is preferable for high-stakes applications. When using summarization-based distillation, always validate compressed outputs against the original source.
#ContextDistillation #ContextCompression #LongContext #TokenEfficiency #AgenticAI #LearnAI #AIEngineering
CallSphere Team
Expert insights on AI voice agents and customer communication automation.