Prompt Compression: Reducing Token Count Without Losing Quality
Learn practical prompt compression techniques including LLMLingua, selective context pruning, and abstractive compression to cut token costs while preserving output quality.
Why Prompt Compression Matters
Token costs add up fast in production systems. A RAG pipeline that retrieves 8,000 tokens of context per query and handles 10,000 queries per day pushes 80 million input tokens through the model daily — on GPT-4o, that is a substantial bill before a single output token is generated. Prompt compression aims to reduce this cost by delivering the same essential information in fewer tokens.
But compression is not just about cost. Shorter prompts also reduce latency — time-to-first-token scales with prompt length. And there is a quality argument: LLMs attend better to concise, relevant context than to verbose, padded text. Compression can actually improve output quality when it removes noise.
Technique 1: Selective Context Pruning
The simplest compression strategy removes low-value content from retrieved context before inserting it into the prompt:
import tiktoken

def prune_context(
    chunks: list[dict],
    max_tokens: int = 4000,
    min_relevance: float = 0.3,
) -> list[dict]:
    """Remove low-relevance chunks and trim to budget."""
    encoder = tiktoken.encoding_for_model("gpt-4o")

    # Filter by relevance threshold, highest-scoring first
    relevant = [c for c in chunks if c["score"] >= min_relevance]
    relevant.sort(key=lambda c: c["score"], reverse=True)

    # Remove redundant content via simple overlap detection
    selected = []
    seen_sentences = set()
    token_count = 0
    for chunk in relevant:
        sentences = chunk["text"].split(". ")
        unique_sentences = []
        for s in sentences:
            normalized = s.strip().lower()[:80]
            if normalized not in seen_sentences:
                seen_sentences.add(normalized)
                unique_sentences.append(s)
        if not unique_sentences:
            continue
        deduped_text = ". ".join(unique_sentences)
        chunk_tokens = len(encoder.encode(deduped_text))
        if token_count + chunk_tokens > max_tokens:
            break
        selected.append({**chunk, "text": deduped_text})
        token_count += chunk_tokens
    return selected
This approach typically achieves 20 to 40 percent token reduction by eliminating duplicate information and low-relevance content.
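To see the deduplication step in isolation, here is a minimal self-contained sketch of the same logic, using whitespace word counts as a stand-in for tiktoken so it runs without dependencies (the chunks and scores are made up for illustration):

```python
# Minimal sketch of the dedup-and-budget step from prune_context,
# with a crude whitespace token count standing in for tiktoken.
def dedup_chunks(chunks: list[dict], max_tokens: int = 30) -> list[str]:
    seen = set()
    selected, budget = [], 0
    for chunk in sorted(chunks, key=lambda c: c["score"], reverse=True):
        unique = []
        for s in chunk["text"].split(". "):
            key = s.strip().lower()[:80]
            if key and key not in seen:
                seen.add(key)
                unique.append(s)
        if not unique:
            continue
        text = ". ".join(unique)
        tokens = len(text.split())  # crude proxy for a real tokenizer
        if budget + tokens > max_tokens:
            break
        selected.append(text)
        budget += tokens
    return selected

# The second chunk repeats a sentence from the first; only the new part survives.
chunks = [
    {"score": 0.9, "text": "Redis supports pub/sub. It also offers streams"},
    {"score": 0.6, "text": "Redis supports pub/sub. Keys can expire"},
]
print(dedup_chunks(chunks))
```

The repeated sentence is dropped from the lower-scoring chunk, so each fact enters the prompt exactly once.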
Technique 2: Abstractive Compression
Instead of cutting content, abstractive compression rewrites the context into a shorter form that preserves the key information:
import openai

client = openai.OpenAI()

def compress_context_abstractive(
    context: str,
    query: str,
    target_ratio: float = 0.4,
) -> str:
    """Compress context to target ratio using abstractive summary."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": (
                "You are a context compression engine. Rewrite the "
                "provided context to be much shorter while preserving "
                "ALL facts and details relevant to the given query. "
                "Remove tangential information, redundancy, and "
                "verbose phrasing. Keep technical terms and numbers exact. "
                "Do not add any information not present in the original."
            )},
            {"role": "user", "content": (
                f"Query: {query}\n\n"
                f"Context to compress (target ~{target_ratio:.0%} of "
                f"original length):\n\n{context}"
            )},
        ],
        temperature=0,
    )
    return response.choices[0].message.content
The tradeoff: abstractive compression uses an additional LLM call, but the cost of that call (using a cheaper model) is often far less than the savings from reduced tokens in the main prompt to the more expensive model.
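A back-of-envelope version of that break-even, with placeholder per-token prices — the numbers below are assumptions for illustration, not current provider rates, so plug in your own:

```python
# Break-even check for abstractive compression. Prices are illustrative
# placeholders (dollars per input token), not actual OpenAI rates.
MAIN_PRICE = 2.50 / 1_000_000      # expensive main model (assumed)
COMPRESS_PRICE = 0.15 / 1_000_000  # cheap compressor model (assumed)

def net_savings(context_tokens: int, ratio: float) -> float:
    """Savings per request after paying for the compression call."""
    # Tokens removed from the main prompt, valued at the main model's rate
    saved = context_tokens * (1 - ratio) * MAIN_PRICE
    # The compressor must read the full context once (its output cost
    # is ignored here for simplicity).
    overhead = context_tokens * COMPRESS_PRICE
    return saved - overhead

print(f"${net_savings(8000, 0.4):.5f} per request")
```

With these assumed rates, compressing an 8,000-token context to 40 percent saves about a cent per request net of the compressor call — small per query, but it compounds across thousands of queries a day.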
Technique 3: Extractive Compression with Sentence Scoring
This technique scores each sentence by relevance to the query and keeps only the top-scoring ones:
from sentence_transformers import SentenceTransformer, util
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")

def extractive_compress(
    context: str,
    query: str,
    keep_ratio: float = 0.5,
) -> str:
    """Keep only the most relevant sentences from context."""
    # Strip trailing periods so the final join does not double them
    sentences = [
        s.strip().rstrip(".")
        for s in context.split(". ")
        if len(s.strip()) > 10
    ]
    if not sentences:
        return context

    query_embedding = model.encode(query, convert_to_tensor=True)
    sentence_embeddings = model.encode(sentences, convert_to_tensor=True)
    scores = util.cos_sim(query_embedding, sentence_embeddings)[0]
    scores = scores.cpu().numpy()

    # Keep top sentences by relevance, but maintain original order
    n_keep = max(1, int(len(sentences) * keep_ratio))
    top_indices = np.argsort(scores)[-n_keep:]
    top_indices_sorted = sorted(top_indices)
    compressed = ". ".join(sentences[i] for i in top_indices_sorted) + "."
    return compressed
Extractive compression has the advantage of never introducing errors — every sentence in the output existed verbatim in the input. The risk is that removing sentences can break coherence between the remaining ones.
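The order-preserving selection is the subtle part, so here it is in isolation — no embedding model needed, with hypothetical sentences and scores standing in for the cosine similarities:

```python
# The core selection step of extractive compression: keep the top-k
# sentences by score, but emit them in original document order so the
# compressed text still reads front-to-back.
def top_k_in_order(sentences: list[str], scores: list[float],
                   keep_ratio: float = 0.5) -> list[str]:
    n_keep = max(1, int(len(sentences) * keep_ratio))
    # Indices of the n_keep highest-scoring sentences...
    top = sorted(range(len(sentences)),
                 key=lambda i: scores[i], reverse=True)[:n_keep]
    # ...restored to document order before joining.
    return [sentences[i] for i in sorted(top)]

sentences = ["A intro", "B key fact", "C aside", "D key fact"]
scores = [0.2, 0.9, 0.1, 0.8]
print(top_k_in_order(sentences, scores))  # ['B key fact', 'D key fact']
```

Sorting the selected indices back into document order is what keeps the surviving sentences coherent; emitting them in score order instead would scramble the narrative.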
Technique 4: Instruction Compression
Often the biggest compression gains come from the system prompt itself, not the retrieved context:
# Verbose (87 tokens)
VERBOSE_PROMPT = (
    "You are a highly knowledgeable and experienced customer support "
    "assistant who works for our company. Your role is to help "
    "customers with their questions and issues. You should always "
    "be polite, professional, and helpful in your responses. If you "
    "do not know the answer to a question, you should let the "
    "customer know honestly rather than making something up."
)

# Compressed (34 tokens)
COMPRESSED_PROMPT = (
    "Customer support agent. Be helpful and professional. "
    "If unsure, say so honestly rather than guessing."
)
For most models, the compressed version produces virtually identical behavior. LLMs are good at inferring expected behavior from brief instructions. The verbose version wastes tokens on obvious implications that the model already understands from the role description.
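You can sanity-check the gap without a tokenizer dependency using the rough four-characters-per-token heuristic for English prose (use tiktoken when exact counts matter):

```python
VERBOSE_PROMPT = (
    "You are a highly knowledgeable and experienced customer support "
    "assistant who works for our company. Your role is to help "
    "customers with their questions and issues. You should always "
    "be polite, professional, and helpful in your responses. If you "
    "do not know the answer to a question, you should let the "
    "customer know honestly rather than making something up."
)
COMPRESSED_PROMPT = (
    "Customer support agent. Be helpful and professional. "
    "If unsure, say so honestly rather than guessing."
)

# Rough estimate: ~4 characters per token for English text.
def approx_tokens(text: str) -> int:
    return max(1, len(text) // 4)

ratio = approx_tokens(COMPRESSED_PROMPT) / approx_tokens(VERBOSE_PROMPT)
print(f"compressed prompt is ~{ratio:.0%} the length of the verbose one")
```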
Measuring Compression Quality
Always validate that compression does not degrade output quality:
from openai import OpenAI

client = OpenAI()

def evaluate_compression(
    original_context: str,
    compressed_context: str,
    test_queries: list[dict],
) -> dict:
    """Compare answer quality between original and compressed context."""
    results = {"original_score": 0, "compressed_score": 0}
    for test in test_queries:
        for context_type, context in [
            ("original", original_context),
            ("compressed", compressed_context),
        ]:
            response = client.chat.completions.create(
                model="gpt-4o",
                messages=[
                    {"role": "system", "content": f"Context: {context}"},
                    {"role": "user", "content": test["query"]},
                ],
                temperature=0,
            )
            answer = response.choices[0].message.content
            # Lowercase both sides so the substring check is case-insensitive
            score = 1.0 if test["expected"].lower() in answer.lower() else 0.0
            results[f"{context_type}_score"] += score
    n = len(test_queries)
    results["original_score"] /= n
    results["compressed_score"] /= n
    results["quality_retained"] = (
        results["compressed_score"] / max(results["original_score"], 0.01)
    )
    return results
A good compression retains 95 percent or more of answer quality. If quality drops below 90 percent, the compression is too aggressive for that use case.
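Those thresholds translate directly into a gating function — a minimal sketch, assuming the `results` dict shape returned by `evaluate_compression` above:

```python
# Turn the evaluation numbers into a go/no-go decision using the
# thresholds above: >=95% retained is good, below 90% is too aggressive.
def compression_verdict(results: dict) -> str:
    retained = results["quality_retained"]
    if retained >= 0.95:
        return "accept"
    if retained >= 0.90:
        return "review"  # borderline: inspect the failing queries by hand
    return "reject"

print(compression_verdict({"quality_retained": 0.97}))  # accept
```

Running this in CI against a fixed set of test queries catches regressions when you tune compression ratios or swap models.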
FAQ
How much compression is safe without quality loss?
For most tasks, you can compress context by 30 to 50 percent without measurable quality degradation. Beyond 50 percent, you need to evaluate carefully. The safe ratio depends on information density — highly technical content with precise numbers tolerates less compression than narrative or descriptive text.
Should I compress the prompt or use a model with a larger context window?
Both. Larger context windows reduce the urgency of compression, but cost scales linearly with token count. Compressing a 12,000-token context to 6,000 tokens halves the input cost regardless of the context window size. Compression and larger windows are complementary strategies.
Does LLMLingua work with all models?
LLMLingua is a research tool that uses a small language model to score token importance and drop unimportant tokens. It works well as a pre-processing step for any model since the compressed text is still natural language. However, aggressive LLMLingua compression can produce text that looks unnatural, which some models handle better than others. Test with your specific model before deploying.
#PromptEngineering #Compression #CostOptimization #Tokens #Python #AgenticAI #LearnAI #AIEngineering
CallSphere Team
Expert insights on AI voice agents and customer communication automation.