Context Distillation: Compressing Long Contexts into Efficient Representations
Learn how context distillation compresses lengthy documents, conversation histories, and knowledge bases into compact representations that preserve essential information while dramatically reducing token costs.
The Long Context Problem
Modern agents often need to reason over massive contexts: entire codebases, long conversation histories, large document collections, or extensive knowledge bases. While newer models support 128K or even 1M token context windows, using them fully is expensive — API costs scale linearly with input tokens, and attention computation scales quadratically with sequence length in standard transformers.
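To make the cost argument concrete, here is a back-of-envelope sketch; the per-token price is a hypothetical placeholder for illustration, not any provider's actual rate:

```python
# Illustrative only: substitute your provider's real input-token price.
PRICE_PER_1K_INPUT_TOKENS = 0.01  # USD per 1K input tokens (assumed)

def input_cost(tokens: int, price_per_1k: float = PRICE_PER_1K_INPUT_TOKENS) -> float:
    """Input cost grows linearly with the number of tokens sent."""
    return tokens / 1000 * price_per_1k

full = input_cost(100_000)     # a 100K-token document sent as-is
distilled = input_cost(5_000)  # the same document distilled to 5K tokens
savings = full / distilled     # 20x cheaper per request at this ratio
```

At a 20:1 compression ratio the per-request input cost drops by the same factor, which is why distillation pays off most on contexts that are re-sent on every turn.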
Context distillation addresses this by compressing long contexts into shorter representations that preserve the essential information needed for downstream tasks, reducing both cost and latency.
What Is Context Distillation?
Context distillation is the process of converting a long, detailed context into a shorter form that retains the information most relevant to subsequent queries. This can happen at multiple levels:
Text-level distillation uses an LLM to summarize or extract key information from long documents, producing a shorter text that replaces the original in the context window.
Embedding-level distillation compresses text into dense vector representations that can be injected into the model's hidden states, bypassing the tokenization step entirely.
Soft-prompt distillation trains continuous vectors that encode the information content of a long context into a fixed number of virtual tokens.
Text-Level Context Compression
The simplest form of context distillation uses the model itself to compress information. This is practical, requires no special infrastructure, and works with any API-based model:
```python
from openai import OpenAI


class ContextDistiller:
    """Compresses long contexts into shorter, information-dense summaries."""

    def __init__(self, client: OpenAI, model: str = "gpt-4"):
        self.client = client
        self.model = model

    def distill(
        self, long_context: str, task_description: str, target_tokens: int = 500
    ) -> str:
        """Compress context while preserving task-relevant information."""
        response = self.client.chat.completions.create(
            model=self.model,
            messages=[{
                "role": "user",
                "content": f"""Compress the following context into approximately {target_tokens} tokens.

Preserve all information that would be relevant to this task: {task_description}

Rules:
- Keep specific numbers, names, dates, and technical details
- Remove redundant explanations and filler
- Use dense, information-rich language
- Maintain factual accuracy — never infer or add information

Context to compress:
{long_context}""",
            }],
        )
        return response.choices[0].message.content

    def hierarchical_distill(
        self, documents: list[str], task_description: str, chunk_size: int = 4000
    ) -> str:
        """Distill multiple documents using a hierarchical approach."""
        # Level 1: distill each document individually, chunk by chunk
        summaries = []
        for doc in documents:
            chunks = [doc[i:i + chunk_size] for i in range(0, len(doc), chunk_size)]
            chunk_summaries = [
                self.distill(chunk, task_description, target_tokens=200)
                for chunk in chunks
            ]
            summaries.append("\n".join(chunk_summaries))
        # Level 2: distill the combined summaries into one final summary
        combined = "\n---\n".join(summaries)
        return self.distill(combined, task_description, target_tokens=800)
```
Selective Context: Keeping What Matters
Instead of summarizing everything, selective context identifies and retains only the portions of the context that are relevant to the current task. This preserves exact wording (important for quotation and code) while discarding irrelevant sections:
```python
import numpy as np
from openai import OpenAI


class SelectiveContext:
    """Retains only task-relevant portions of a long context."""

    def __init__(self, client: OpenAI):
        self.client = client

    def select(
        self, paragraphs: list[str], query: str, budget: int = 10
    ) -> list[str]:
        """Select the most relevant paragraphs for a given query."""
        # Get embeddings for the query and all paragraphs in one call
        all_texts = [query] + paragraphs
        response = self.client.embeddings.create(
            model="text-embedding-3-small", input=all_texts,
        )
        embeddings = [np.array(e.embedding) for e in response.data]
        query_emb = embeddings[0]
        para_embs = embeddings[1:]

        # Compute cosine similarity between the query and each paragraph
        similarities = []
        for i, emb in enumerate(para_embs):
            sim = np.dot(query_emb, emb) / (
                np.linalg.norm(query_emb) * np.linalg.norm(emb)
            )
            similarities.append((i, sim))

        # Select top-k most relevant paragraphs, maintaining original order
        similarities.sort(key=lambda x: x[1], reverse=True)
        selected_indices = sorted(idx for idx, _ in similarities[:budget])
        return [paragraphs[i] for i in selected_indices]
```
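The order-preserving top-k step at the end is worth seeing in isolation. Here is the same pattern on hypothetical precomputed similarity scores, without any API calls:

```python
def select_top_k_in_order(items: list[str], scores: list[float], k: int) -> list[str]:
    """Keep the k highest-scoring items, but return them in their original order."""
    ranked = sorted(range(len(items)), key=lambda i: scores[i], reverse=True)
    keep = sorted(ranked[:k])  # re-sort the winners by position, not score
    return [items[i] for i in keep]

paras = ["intro", "pricing table", "legal boilerplate", "pricing notes"]
scores = [0.12, 0.91, 0.05, 0.77]
select_top_k_in_order(paras, scores, k=2)  # ['pricing table', 'pricing notes']
```

Restoring the original order matters: downstream prompts read more coherently when the surviving paragraphs appear in the sequence the source document used.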
Quality Preservation Techniques
Context compression always risks losing important information. Several techniques help preserve quality:
Task-aware compression. Always compress with the downstream task in mind. A context compressed for question-answering should retain different details than one compressed for summarization.
Compression ratio monitoring. Track the ratio of original to compressed token counts. Ratios above 10:1 often show significant quality degradation. A 3:1 to 5:1 ratio is typically safe for most tasks.
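A minimal monitoring sketch, using the rough heuristic of about four characters per English token; in production you would measure with your model's actual tokenizer (e.g. tiktoken) instead:

```python
def estimate_tokens(text: str) -> int:
    """Rough heuristic: ~4 characters per token for English text.
    Swap in your model's real tokenizer for accurate counts."""
    return max(1, len(text) // 4)

def compression_ratio(original: str, compressed: str) -> float:
    """Ratio of original to compressed token counts (e.g. 4.0 means 4:1)."""
    return estimate_tokens(original) / estimate_tokens(compressed)

def check_ratio(original: str, compressed: str, max_ratio: float = 5.0) -> bool:
    """Flag compressions beyond the typically safe 3:1 to 5:1 range for review."""
    return compression_ratio(original, compressed) <= max_ratio
```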
Validation through reconstruction. After compression, test whether the compressed context supports answering the same questions as the original. If accuracy drops below a threshold, reduce the compression ratio.
```python
def validate_compression(
    original: str, compressed: str, validation_questions: list[str], client: OpenAI
) -> dict:
    """Measure information loss from context compression.

    Relies on two helpers (not shown): ask_with_context(context, question,
    client), which answers a question against a given context, and
    check_semantic_match(a, b, client), which judges whether two answers agree.
    """
    results = {"questions": len(validation_questions), "matches": 0}
    for question in validation_questions:
        # Answer the same question with the original and the compressed context
        orig_answer = ask_with_context(original, question, client)
        comp_answer = ask_with_context(compressed, question, client)
        # Compare answers semantically rather than string-for-string
        if check_semantic_match(orig_answer, comp_answer, client):
            results["matches"] += 1
    results["retention_rate"] = results["matches"] / results["questions"]
    return results
```
Practical Usage in Agent Pipelines
In multi-turn agent conversations, context distillation can be applied to conversation history. Instead of passing the full history (which grows with every turn), periodically compress older turns into a summary while keeping recent turns intact:
```python
class ConversationCompressor:
    """Manages conversation history with rolling compression."""

    def __init__(self, client: OpenAI, recent_turns: int = 5, max_summary_tokens: int = 500):
        self.client = client
        self.recent_turns = recent_turns
        self.max_summary_tokens = max_summary_tokens
        self.summary = ""
        self.history: list[dict] = []

    def add_turn(self, role: str, content: str):
        self.history.append({"role": role, "content": content})
        if len(self.history) > self.recent_turns * 2:
            self._compress_old_turns()

    def _compress_old_turns(self):
        old = self.history[:-self.recent_turns]
        self.history = self.history[-self.recent_turns:]
        old_text = "\n".join(f"{t['role']}: {t['content']}" for t in old)
        context = (
            f"Previous summary: {self.summary}\n\nNew turns:\n{old_text}"
            if self.summary
            else old_text
        )
        distiller = ContextDistiller(self.client)
        self.summary = distiller.distill(context, "ongoing conversation", self.max_summary_tokens)

    def get_messages(self) -> list[dict]:
        messages = []
        if self.summary:
            messages.append({
                "role": "system",
                "content": f"Summary of earlier conversation: {self.summary}",
            })
        messages.extend(self.history)
        return messages
```
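The bookkeeping behind this rolling window can be exercised without an LLM. The following stub mirrors the same trigger and window logic but replaces the summarizer with a fake that just counts folded turns, which makes the bounded-history behavior easy to verify:

```python
class RollingHistory:
    """Illustrative stub of the rolling-compression bookkeeping.
    The 'summarizer' is a fake that counts folded turns instead of calling an LLM."""

    def __init__(self, recent_turns: int = 2):
        self.recent_turns = recent_turns
        self.summary = ""
        self.history: list[dict] = []

    def add_turn(self, role: str, content: str):
        self.history.append({"role": role, "content": content})
        if len(self.history) > self.recent_turns * 2:
            old = self.history[:-self.recent_turns]
            self.history = self.history[-self.recent_turns:]
            # Stub summarizer: accumulate a count of compressed-away turns
            prev = int(self.summary or 0)
            self.summary = str(prev + len(old))

demo = RollingHistory(recent_turns=2)
for i in range(6):
    demo.add_turn("user", f"turn {i}")
# History stays bounded; older turns have been folded into the summary
```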
FAQ
How much can I compress without losing quality?
For factual question-answering tasks, 3-5x compression typically preserves 90%+ of answer accuracy. For tasks requiring exact details (code, legal language, numbers), keep compression ratios below 3x or use selective context instead of summarization. Always validate with task-specific benchmarks.
Is context distillation better than using a long-context model?
They are complementary. Long-context models eliminate the need for compression up to their window size, but costs scale linearly with context length. Distillation reduces those costs. For a 100K-token document where you need only specific facts, distilling to 5K tokens and using a standard model is both cheaper and often more accurate than stuffing the full document into a long-context window.
Does compression introduce hallucinations?
Yes, LLM-based text compression can introduce subtle hallucinations — the summarizer may infer connections or generalize details that change meaning. This is why selective context (retaining exact original text) is preferable for high-stakes applications. When using summarization-based distillation, always validate compressed outputs against the original source.
#ContextDistillation #ContextCompression #LongContext #TokenEfficiency #AgenticAI #LearnAI #AIEngineering
CallSphere Team
Expert insights on AI voice agents and customer communication automation.