Prompt Compression: Reducing Token Count Without Losing Quality
Learn practical prompt compression techniques including LLMLingua, selective context pruning, and abstractive compression to cut token costs while preserving output quality.
Why Prompt Compression Matters
Token costs add up fast in production systems. A RAG pipeline that retrieves 8,000 tokens of context per query and handles 10,000 queries per day pushes 80 million input tokens through the model daily — on GPT-4o, that is a substantial bill before a single output token is generated. Prompt compression aims to reduce this cost by delivering the same essential information in fewer tokens.
But compression is not just about cost. Shorter prompts also reduce latency — time-to-first-token scales with prompt length. And there is a quality argument: LLMs attend better to concise, relevant context than to verbose, padded text. Compression can actually improve output quality when it removes noise.
Technique 1: Selective Context Pruning
The simplest compression strategy removes low-value content from retrieved context before inserting it into the prompt:
import tiktoken

def prune_context(
    chunks: list[dict],
    max_tokens: int = 4000,
    min_relevance: float = 0.3,
) -> list[dict]:
    """Remove low-relevance chunks and trim to budget."""
    encoder = tiktoken.encoding_for_model("gpt-4o")

    # Filter by relevance threshold, highest-scoring first
    relevant = [c for c in chunks if c["score"] >= min_relevance]
    relevant.sort(key=lambda c: c["score"], reverse=True)

    # Remove redundant content via simple overlap detection
    selected = []
    seen_sentences = set()
    token_count = 0
    for chunk in relevant:
        sentences = chunk["text"].split(". ")
        unique_sentences = []
        for s in sentences:
            normalized = s.strip().lower()[:80]
            if normalized not in seen_sentences:
                seen_sentences.add(normalized)
                unique_sentences.append(s)
        if not unique_sentences:
            continue
        deduped_text = ". ".join(unique_sentences)
        chunk_tokens = len(encoder.encode(deduped_text))
        if token_count + chunk_tokens > max_tokens:
            break
        selected.append({**chunk, "text": deduped_text})
        token_count += chunk_tokens
    return selected
This approach typically achieves 20 to 40 percent token reduction by eliminating duplicate information and low-relevance content.
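To see the deduplication step in isolation, here is a minimal self-contained sketch of the same logic, using whitespace word counts as a stand-in for tiktoken so it runs without dependencies (the chunks and scores are made up for illustration):

```python
# Minimal sketch of the dedup-and-budget step from prune_context,
# with a crude whitespace token count standing in for tiktoken.
def dedup_chunks(chunks: list[dict], max_tokens: int = 30) -> list[str]:
    seen = set()
    selected, budget = [], 0
    for chunk in sorted(chunks, key=lambda c: c["score"], reverse=True):
        unique = []
        for s in chunk["text"].split(". "):
            key = s.strip().lower()[:80]
            if key and key not in seen:
                seen.add(key)
                unique.append(s)
        if not unique:
            continue
        text = ". ".join(unique)
        tokens = len(text.split())  # crude proxy for a real tokenizer
        if budget + tokens > max_tokens:
            break
        selected.append(text)
        budget += tokens
    return selected

# The second chunk repeats a sentence from the first; only the new part survives.
chunks = [
    {"score": 0.9, "text": "Redis supports pub/sub. It also offers streams"},
    {"score": 0.6, "text": "Redis supports pub/sub. Keys can expire"},
]
print(dedup_chunks(chunks))
```

The repeated sentence is dropped from the lower-scoring chunk, so each fact enters the prompt exactly once.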
Technique 2: Abstractive Compression
Instead of cutting content, abstractive compression rewrites the context into a shorter form that preserves the key information:
import openai

client = openai.OpenAI()

def compress_context_abstractive(
    context: str,
    query: str,
    target_ratio: float = 0.4,
) -> str:
    """Compress context to target ratio using abstractive summary."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": (
                "You are a context compression engine. Rewrite the "
                "provided context to be much shorter while preserving "
                "ALL facts and details relevant to the given query. "
                "Remove tangential information, redundancy, and "
                "verbose phrasing. Keep technical terms and numbers exact. "
                "Do not add any information not present in the original."
            )},
            {"role": "user", "content": (
                f"Query: {query}\n\n"
                f"Context to compress (target ~{target_ratio:.0%} of "
                f"original length):\n\n{context}"
            )},
        ],
        temperature=0,
    )
    return response.choices[0].message.content
The tradeoff: abstractive compression uses an additional LLM call, but the cost of that call (using a cheaper model) is often far less than the savings from reduced tokens in the main prompt to the more expensive model.
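A back-of-envelope version of that break-even, with placeholder per-token prices — the numbers below are assumptions for illustration, not current provider rates, so plug in your own:

```python
# Break-even check for abstractive compression. Prices are illustrative
# placeholders (dollars per input token), not actual OpenAI rates.
MAIN_PRICE = 2.50 / 1_000_000      # expensive main model (assumed)
COMPRESS_PRICE = 0.15 / 1_000_000  # cheap compressor model (assumed)

def net_savings(context_tokens: int, ratio: float) -> float:
    """Savings per request after paying for the compression call."""
    # Tokens removed from the main prompt, valued at the main model's rate
    saved = context_tokens * (1 - ratio) * MAIN_PRICE
    # The compressor must read the full context once (its output cost
    # is ignored here for simplicity).
    overhead = context_tokens * COMPRESS_PRICE
    return saved - overhead

print(f"${net_savings(8000, 0.4):.5f} per request")
```

With these assumed rates, compressing an 8,000-token context to 40 percent saves about a cent per request net of the compressor call — small per query, but it compounds across thousands of queries a day.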
Technique 3: Extractive Compression with Sentence Scoring
This technique scores each sentence by relevance to the query and keeps only the top-scoring ones:
from sentence_transformers import SentenceTransformer, util
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")

def extractive_compress(
    context: str,
    query: str,
    keep_ratio: float = 0.5,
) -> str:
    """Keep only the most relevant sentences from context."""
    # Strip trailing periods so the final join does not double them
    sentences = [
        s.strip().rstrip(".")
        for s in context.split(". ")
        if len(s.strip()) > 10
    ]
    if not sentences:
        return context

    query_embedding = model.encode(query, convert_to_tensor=True)
    sentence_embeddings = model.encode(sentences, convert_to_tensor=True)
    scores = util.cos_sim(query_embedding, sentence_embeddings)[0]
    scores = scores.cpu().numpy()

    # Keep top sentences by relevance, but maintain original order
    n_keep = max(1, int(len(sentences) * keep_ratio))
    top_indices = np.argsort(scores)[-n_keep:]
    top_indices_sorted = sorted(top_indices)
    compressed = ". ".join(sentences[i] for i in top_indices_sorted) + "."
    return compressed
Extractive compression has the advantage of never introducing errors — every sentence in the output existed verbatim in the input. The risk is that removing sentences can break coherence between the remaining ones.
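The order-preserving selection is the subtle part, so here it is in isolation — no embedding model needed, with hypothetical sentences and scores standing in for the cosine similarities:

```python
# The core selection step of extractive compression: keep the top-k
# sentences by score, but emit them in original document order so the
# compressed text still reads front-to-back.
def top_k_in_order(sentences: list[str], scores: list[float],
                   keep_ratio: float = 0.5) -> list[str]:
    n_keep = max(1, int(len(sentences) * keep_ratio))
    # Indices of the n_keep highest-scoring sentences...
    top = sorted(range(len(sentences)),
                 key=lambda i: scores[i], reverse=True)[:n_keep]
    # ...restored to document order before joining.
    return [sentences[i] for i in sorted(top)]

sentences = ["A intro", "B key fact", "C aside", "D key fact"]
scores = [0.2, 0.9, 0.1, 0.8]
print(top_k_in_order(sentences, scores))  # ['B key fact', 'D key fact']
```

Sorting the selected indices back into document order is what keeps the surviving sentences coherent; emitting them in score order instead would scramble the narrative.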
Technique 4: Instruction Compression
Often the biggest compression gains come from the system prompt itself, not the retrieved context:
# Verbose (87 tokens)
VERBOSE_PROMPT = (
    "You are a highly knowledgeable and experienced customer support "
    "assistant who works for our company. Your role is to help "
    "customers with their questions and issues. You should always "
    "be polite, professional, and helpful in your responses. If you "
    "do not know the answer to a question, you should let the "
    "customer know honestly rather than making something up."
)

# Compressed (34 tokens)
COMPRESSED_PROMPT = (
    "Customer support agent. Be helpful and professional. "
    "If unsure, say so honestly rather than guessing."
)
For most models, the compressed version produces virtually identical behavior. LLMs are good at inferring expected behavior from brief instructions. The verbose version wastes tokens on obvious implications that the model already understands from the role description.
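You can sanity-check the gap without a tokenizer dependency using the rough four-characters-per-token heuristic for English prose (use tiktoken when exact counts matter):

```python
VERBOSE_PROMPT = (
    "You are a highly knowledgeable and experienced customer support "
    "assistant who works for our company. Your role is to help "
    "customers with their questions and issues. You should always "
    "be polite, professional, and helpful in your responses. If you "
    "do not know the answer to a question, you should let the "
    "customer know honestly rather than making something up."
)
COMPRESSED_PROMPT = (
    "Customer support agent. Be helpful and professional. "
    "If unsure, say so honestly rather than guessing."
)

# Rough estimate: ~4 characters per token for English text.
def approx_tokens(text: str) -> int:
    return max(1, len(text) // 4)

ratio = approx_tokens(COMPRESSED_PROMPT) / approx_tokens(VERBOSE_PROMPT)
print(f"compressed prompt is ~{ratio:.0%} the length of the verbose one")
```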
Measuring Compression Quality
Always validate that compression does not degrade output quality:
from openai import OpenAI

client = OpenAI()

def evaluate_compression(
    original_context: str,
    compressed_context: str,
    test_queries: list[dict],
) -> dict:
    """Compare answer quality between original and compressed context."""
    results = {"original_score": 0, "compressed_score": 0}
    for test in test_queries:
        for context_type, context in [
            ("original", original_context),
            ("compressed", compressed_context),
        ]:
            response = client.chat.completions.create(
                model="gpt-4o",
                messages=[
                    {"role": "system", "content": f"Context: {context}"},
                    {"role": "user", "content": test["query"]},
                ],
                temperature=0,
            )
            answer = response.choices[0].message.content
            # Lowercase both sides so the substring check is case-insensitive
            score = 1.0 if test["expected"].lower() in answer.lower() else 0.0
            results[f"{context_type}_score"] += score
    n = len(test_queries)
    results["original_score"] /= n
    results["compressed_score"] /= n
    results["quality_retained"] = (
        results["compressed_score"] / max(results["original_score"], 0.01)
    )
    return results
A good compression retains 95 percent or more of answer quality. If quality drops below 90 percent, the compression is too aggressive for that use case.
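Those thresholds translate directly into a gating function — a minimal sketch, assuming the `results` dict shape returned by `evaluate_compression` above:

```python
# Turn the evaluation numbers into a go/no-go decision using the
# thresholds above: >=95% retained is good, below 90% is too aggressive.
def compression_verdict(results: dict) -> str:
    retained = results["quality_retained"]
    if retained >= 0.95:
        return "accept"
    if retained >= 0.90:
        return "review"  # borderline: inspect the failing queries by hand
    return "reject"

print(compression_verdict({"quality_retained": 0.97}))  # accept
```

Running this in CI against a fixed set of test queries catches regressions when you tune compression ratios or swap models.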
FAQ
How much compression is safe without quality loss?
For most tasks, you can compress context by 30 to 50 percent without measurable quality degradation. Beyond 50 percent, you need to evaluate carefully. The safe ratio depends on information density — highly technical content with precise numbers tolerates less compression than narrative or descriptive text.
Should I compress the prompt or use a model with a larger context window?
Both. Larger context windows reduce the urgency of compression, but cost scales linearly with token count. Compressing a 12,000-token context to 6,000 tokens halves the input cost regardless of the context window size. Compression and larger windows are complementary strategies.
Does LLMLingua work with all models?
LLMLingua is a research tool that uses a small language model to score token importance and drop unimportant tokens. It works well as a pre-processing step for any model since the compressed text is still natural language. However, aggressive LLMLingua compression can produce text that looks unnatural, which some models handle better than others. Test with your specific model before deploying.
#PromptEngineering #Compression #CostOptimization #Tokens #Python #AgenticAI #LearnAI #AIEngineering
CallSphere Team
Expert insights on AI voice agents and customer communication automation.