Contextual Compression for RAG: Reducing Retrieved Context to What Matters

The Retrieval Noise Problem

When you retrieve the top 5 chunks from a vector store, each chunk is typically 500-1000 tokens. That is 2,500-5,000 tokens of context passed to your LLM. But here is the critical insight: usually only 10-20% of those tokens are actually relevant to the specific question being asked.

A chunk might be retrieved because it contains a paragraph about your topic, but the rest of the chunk covers unrelated details. This noise dilutes the signal, increases token costs, and — most importantly — can confuse the LLM into generating responses that blend relevant and irrelevant information.

Contextual compression addresses this by extracting or summarizing only the question-relevant portions of each retrieved document before passing them to the generator.

Three Approaches to Compression

1. Extractive Compression

Extract only the sentences or passages that directly relate to the query. This preserves exact wording from the source, maintaining fidelity.

2. LLM-Based Abstractive Compression

Use a language model to rewrite each chunk, keeping only query-relevant information. More flexible but introduces the possibility of subtle distortion.

3. Cross-Encoder Reranking with Truncation

Score individual sentences within each chunk for relevance, then keep only the top-scoring sentences. A hybrid approach that balances precision and speed.

See AI Voice Agents Handle Real Calls

Book a free demo or calculate how much you can save with AI voice automation.

Book a Demo ROI Calculator

Implementing Extractive Compression

from openai import OpenAI
import re

client = OpenAI()

def extractive_compress(
    query: str,
    documents: list[str],
) -> list[str]:
    """Extract only query-relevant sentences from each document."""
    compressed = []

    for doc in documents:
        # Split document into sentences
        sentences = re.split(r'(?<=[.!?])\s+', doc)

        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{
                "role": "system",
                "content": """Given a query and numbered sentences,
                return a JSON object with a "relevant_indices" key
                containing a list of sentence numbers (0-indexed)
                that are relevant to answering the query.
                Only include directly relevant sentences."""
            }, {
                "role": "user",
                "content": (
                    f"Query: {query}\n\nSentences:\n"
                    + "\n".join(
                        f"[{i}] {s}"
                        for i, s in enumerate(sentences)
                    )
                )
            }],
            response_format={"type": "json_object"}
        )
        import json
        result = json.loads(
            response.choices[0].message.content
        )
        indices = result.get("relevant_indices", [])

        relevant_text = " ".join(
            sentences[i] for i in indices
            if i < len(sentences)
        )
        if relevant_text.strip():
            compressed.append(relevant_text)

    return compressed

LLM-Based Abstractive Compression

When exact sentences are too fragmented, abstractive compression creates coherent summaries:

def abstractive_compress(
    query: str,
    documents: list[str],
    max_tokens_per_doc: int = 150,
) -> list[str]:
    """Compress each document to only query-relevant content."""
    compressed = []

    for doc in documents:
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{
                "role": "system",
                "content": f"""Extract and summarize ONLY the
                information from this document that is relevant
                to answering the user's query. Omit everything
                else. Keep the summary under
                {max_tokens_per_doc} tokens. If nothing in the
                document is relevant, respond with 'NOT_RELEVANT'.
                """
            }, {
                "role": "user",
                "content": f"Query: {query}\n\nDocument: {doc}"
            }],
            max_tokens=max_tokens_per_doc,
        )
        result = response.choices[0].message.content.strip()
        if result != "NOT_RELEVANT":
            compressed.append(result)

    return compressed

Fast Compression with Cross-Encoders

For production systems where LLM compression is too slow, use a cross-encoder to score individual sentences:

from sentence_transformers import CrossEncoder
import re

# Load a small, fast cross-encoder model
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def cross_encoder_compress(
    query: str,
    documents: list[str],
    top_sentences: int = 10,
) -> str:
    """Use cross-encoder to select most relevant sentences."""
    all_sentences = []
    for doc in documents:
        sentences = re.split(r'(?<=[.!?])\s+', doc)
        all_sentences.extend(sentences)

    # Score all sentences against the query
    pairs = [[query, sent] for sent in all_sentences]
    scores = reranker.predict(pairs)

    # Rank and select top sentences
    scored = sorted(
        zip(all_sentences, scores),
        key=lambda x: x[1],
        reverse=True,
    )

    top = scored[:top_sentences]
    # Return in original order for coherence
    ordered = sorted(
        top,
        key=lambda x: all_sentences.index(x[0]),
    )
    return " ".join(sent for sent, _ in ordered)

Putting It All Together

A complete compression-augmented RAG pipeline:

def compressed_rag(
    query: str,
    retriever,
    compression: str = "extractive",
) -> str:
    """RAG pipeline with contextual compression."""
    # Retrieve more documents than usual since we will compress
    raw_docs = retriever.search(query, k=10)

    # Compress based on strategy
    if compression == "extractive":
        context_docs = extractive_compress(query, raw_docs)
    elif compression == "abstractive":
        context_docs = abstractive_compress(query, raw_docs)
    elif compression == "cross_encoder":
        context_docs = [cross_encoder_compress(query, raw_docs)]
    else:
        context_docs = raw_docs

    context = "\n\n".join(context_docs)

    # Generate with compressed context
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "system",
            "content": "Answer using the provided context."
        }, {
            "role": "user",
            "content": f"Context:\n{context}\n\nQuestion: {query}"
        }],
    )
    return response.choices[0].message.content

Compression Ratios in Practice

In our testing, extractive compression reduces context by 60-75% while retaining answer quality. Abstractive compression achieves 70-85% reduction. Cross-encoder sentence selection achieves 80-90% reduction. The sweet spot depends on your use case — higher compression saves tokens but risks dropping subtle details that matter for nuanced questions.

FAQ

Does compression hurt answer quality?

When done well, compression actually improves answer quality because the LLM sees less noise. The risk is over-compression — removing context that seems irrelevant to a simple classifier but contains nuances the LLM needs. Monitor your answer quality metrics when tuning compression aggressiveness.

Which compression method should I use in production?

Cross-encoder compression is the best starting point for production. It runs in milliseconds (no LLM call required), provides good compression ratios, and scales well. Graduate to LLM-based compression only if cross-encoder results are insufficient for your quality requirements.

Can I combine compression with reranking?

Yes, and this is a powerful pattern. First rerank your retrieved documents to get the best ordering, then apply compression to the top-ranked results. This ensures you compress the most relevant documents rather than wasting compression effort on documents that would have been discarded anyway.

#ContextualCompression #RAG #TokenOptimization #LLMContext #Retrieval #AgenticAI #LearnAI #AIEngineering