Contextual Compression for RAG: Reducing Retrieved Context to What Matters
Learn how contextual compression techniques strip irrelevant information from retrieved chunks before they reach the LLM, improving both answer quality and token efficiency.
The Retrieval Noise Problem
When you retrieve the top 5 chunks from a vector store, each chunk is typically 500-1000 tokens. That is 2,500-5,000 tokens of context passed to your LLM. But here is the critical insight: usually only 10-20% of those tokens are actually relevant to the specific question being asked.
A chunk might be retrieved because it contains a paragraph about your topic, but the rest of the chunk covers unrelated details. This noise dilutes the signal, increases token costs, and — most importantly — can confuse the LLM into generating responses that blend relevant and irrelevant information.
Contextual compression addresses this by extracting or summarizing only the question-relevant portions of each retrieved document before passing them to the generator.
Three Approaches to Compression
1. Extractive Compression
Extract only the sentences or passages that directly relate to the query. This preserves exact wording from the source, maintaining fidelity.
2. LLM-Based Abstractive Compression
Use a language model to rewrite each chunk, keeping only query-relevant information. More flexible but introduces the possibility of subtle distortion.
3. Cross-Encoder Reranking with Truncation
Score individual sentences within each chunk for relevance, then keep only the top-scoring sentences. A hybrid approach that balances precision and speed.
See AI Voice Agents Handle Real Calls
Book a free demo or calculate how much you can save with AI voice automation.
Implementing Extractive Compression
from openai import OpenAI
import re
client = OpenAI()
def extractive_compress(
query: str,
documents: list[str],
) -> list[str]:
"""Extract only query-relevant sentences from each document."""
compressed = []
for doc in documents:
# Split document into sentences
sentences = re.split(r'(?<=[.!?])\s+', doc)
response = client.chat.completions.create(
model="gpt-4o-mini",
messages=[{
"role": "system",
"content": """Given a query and numbered sentences,
return a JSON object with a "relevant_indices" key
containing a list of sentence numbers (0-indexed)
that are relevant to answering the query.
Only include directly relevant sentences."""
}, {
"role": "user",
"content": (
f"Query: {query}\n\nSentences:\n"
+ "\n".join(
f"[{i}] {s}"
for i, s in enumerate(sentences)
)
)
}],
response_format={"type": "json_object"}
)
import json
result = json.loads(
response.choices[0].message.content
)
indices = result.get("relevant_indices", [])
relevant_text = " ".join(
sentences[i] for i in indices
if i < len(sentences)
)
if relevant_text.strip():
compressed.append(relevant_text)
return compressed
LLM-Based Abstractive Compression
When exact sentences are too fragmented, abstractive compression creates coherent summaries:
def abstractive_compress(
query: str,
documents: list[str],
max_tokens_per_doc: int = 150,
) -> list[str]:
"""Compress each document to only query-relevant content."""
compressed = []
for doc in documents:
response = client.chat.completions.create(
model="gpt-4o-mini",
messages=[{
"role": "system",
"content": f"""Extract and summarize ONLY the
information from this document that is relevant
to answering the user's query. Omit everything
else. Keep the summary under
{max_tokens_per_doc} tokens. If nothing in the
document is relevant, respond with 'NOT_RELEVANT'.
"""
}, {
"role": "user",
"content": f"Query: {query}\n\nDocument: {doc}"
}],
max_tokens=max_tokens_per_doc,
)
result = response.choices[0].message.content.strip()
if result != "NOT_RELEVANT":
compressed.append(result)
return compressed
Fast Compression with Cross-Encoders
For production systems where LLM compression is too slow, use a cross-encoder to score individual sentences:
from sentence_transformers import CrossEncoder
import re
# Load a small, fast cross-encoder model
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
def cross_encoder_compress(
query: str,
documents: list[str],
top_sentences: int = 10,
) -> str:
"""Use cross-encoder to select most relevant sentences."""
all_sentences = []
for doc in documents:
sentences = re.split(r'(?<=[.!?])\s+', doc)
all_sentences.extend(sentences)
# Score all sentences against the query
pairs = [[query, sent] for sent in all_sentences]
scores = reranker.predict(pairs)
# Rank and select top sentences
scored = sorted(
zip(all_sentences, scores),
key=lambda x: x[1],
reverse=True,
)
top = scored[:top_sentences]
# Return in original order for coherence
ordered = sorted(
top,
key=lambda x: all_sentences.index(x[0]),
)
return " ".join(sent for sent, _ in ordered)
Putting It All Together
A complete compression-augmented RAG pipeline:
def compressed_rag(
query: str,
retriever,
compression: str = "extractive",
) -> str:
"""RAG pipeline with contextual compression."""
# Retrieve more documents than usual since we will compress
raw_docs = retriever.search(query, k=10)
# Compress based on strategy
if compression == "extractive":
context_docs = extractive_compress(query, raw_docs)
elif compression == "abstractive":
context_docs = abstractive_compress(query, raw_docs)
elif compression == "cross_encoder":
context_docs = [cross_encoder_compress(query, raw_docs)]
else:
context_docs = raw_docs
context = "\n\n".join(context_docs)
# Generate with compressed context
response = client.chat.completions.create(
model="gpt-4o",
messages=[{
"role": "system",
"content": "Answer using the provided context."
}, {
"role": "user",
"content": f"Context:\n{context}\n\nQuestion: {query}"
}],
)
return response.choices[0].message.content
Compression Ratios in Practice
In our testing, extractive compression reduces context by 60-75% while retaining answer quality. Abstractive compression achieves 70-85% reduction. Cross-encoder sentence selection achieves 80-90% reduction. The sweet spot depends on your use case — higher compression saves tokens but risks dropping subtle details that matter for nuanced questions.
FAQ
Does compression hurt answer quality?
When done well, compression actually improves answer quality because the LLM sees less noise. The risk is over-compression — removing context that seems irrelevant to a simple classifier but contains nuances the LLM needs. Monitor your answer quality metrics when tuning compression aggressiveness.
Which compression method should I use in production?
Cross-encoder compression is the best starting point for production. It runs in milliseconds (no LLM call required), provides good compression ratios, and scales well. Graduate to LLM-based compression only if cross-encoder results are insufficient for your quality requirements.
Can I combine compression with reranking?
Yes, and this is a powerful pattern. First rerank your retrieved documents to get the best ordering, then apply compression to the top-ranked results. This ensures you compress the most relevant documents rather than wasting compression effort on documents that would have been discarded anyway.
#ContextualCompression #RAG #TokenOptimization #LLMContext #Retrieval #AgenticAI #LearnAI #AIEngineering
CallSphere Team
Expert insights on AI voice agents and customer communication automation.
Try CallSphere AI Voice Agents
See how AI voice agents work for your industry. Live demo available -- no signup required.