Learn Agentic AI · 10 min read

Prompt Compression Techniques: Reducing Token Count by 50% Without Quality Loss

Master prompt compression methods including selective context pruning, abstractive compression, and structural cleanup to halve your token costs while maintaining output quality. Practical Python implementations included.

The Token Cost Problem

Every token in your prompt costs money. For agents that include conversation history, RAG context, tool outputs, and system instructions, prompts routinely hit 10,000–50,000 tokens. At GPT-4o’s input pricing, a 30,000-token prompt costs about $0.075 per request. Serve 100,000 requests per day and that is $7,500 a day, or roughly $225,000 a month, just for input tokens.

Prompt compression reduces token count while preserving the information the model needs. Done well, you can cut token counts by 40–60% with negligible quality impact.
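To see what those percentages mean in dollars, here is a quick back-of-the-envelope calculation using the illustrative GPT-4o input rate from above (swap in your model’s actual pricing):

```python
INPUT_PRICE_PER_M = 2.50  # illustrative GPT-4o input price, USD per 1M tokens

def monthly_input_cost(tokens_per_request: int, requests_per_day: int) -> float:
    """Monthly spend on input tokens alone, assuming 30 days."""
    per_request = tokens_per_request / 1_000_000 * INPUT_PRICE_PER_M
    return per_request * requests_per_day * 30

before = monthly_input_cost(30_000, 100_000)
after = monthly_input_cost(15_000, 100_000)  # 50% compression
print(f"${before:,.0f} -> ${after:,.0f}/month")  # $225,000 -> $112,500/month
```

Halving a 30,000-token prompt at this volume recovers over $100,000 a month before output costs are even considered.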

Technique 1: Selective Context Pruning

Not all context is equally important. Prune low-relevance content before sending it to the model.

from typing import List, Tuple

class SelectiveContextPruner:
    """Prune context passages by relevance score."""

    def __init__(self, max_tokens: int = 4000):
        self.max_tokens = max_tokens

    def estimate_tokens(self, text: str) -> int:
        return len(text.split()) * 4 // 3  # rough approximation

    def prune_by_relevance(
        self,
        passages: List[Tuple[str, float]],  # (text, relevance_score)
    ) -> List[str]:
        sorted_passages = sorted(passages, key=lambda x: x[1], reverse=True)
        selected = []
        total_tokens = 0
        for text, score in sorted_passages:
            tokens = self.estimate_tokens(text)
            if total_tokens + tokens <= self.max_tokens:
                selected.append(text)
                total_tokens += tokens
            else:
                break
        return selected

    def prune_conversation_history(
        self,
        messages: List[dict],
        keep_last_n: int = 4,
        keep_system: bool = True,
    ) -> List[dict]:
        system_msgs = [m for m in messages if m["role"] == "system"] if keep_system else []
        non_system = [m for m in messages if m["role"] != "system"]
        recent = non_system[-keep_last_n:] if len(non_system) > keep_last_n else non_system
        return system_msgs + recent

pruner = SelectiveContextPruner(max_tokens=3000)
passages = [
    ("The product supports SSO via SAML 2.0 and OIDC.", 0.92),
    ("Our office is located in San Francisco.", 0.15),
    ("Pricing starts at $49/month per seat.", 0.88),
    ("The company was founded in 2019.", 0.20),
    ("API rate limits are 1000 req/min on the Pro plan.", 0.85),
]
selected = pruner.prune_by_relevance(passages)
print(f"Kept {len(selected)} of {len(passages)} passages")
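The history-pruning path works the same way. A standalone restatement of that logic (repeated here so the snippet runs on its own) shows the effect on a longer conversation:

```python
from typing import List

def prune_history(messages: List[dict], keep_last_n: int = 4) -> List[dict]:
    # Same idea as SelectiveContextPruner.prune_conversation_history:
    # always keep system messages, then only the most recent turns.
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    return system + rest[-keep_last_n:]

conversation = [{"role": "system", "content": "You are a support agent."}]
for i in range(6):
    conversation.append({"role": "user", "content": f"question {i}"})
    conversation.append({"role": "assistant", "content": f"answer {i}"})

pruned = prune_history(conversation)
print(len(conversation), "->", len(pruned))  # 13 -> 5
```

Thirteen messages shrink to five: the system prompt plus the last two user/assistant exchanges.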

Technique 2: Abstractive Compression

Use a cheap model to summarize verbose context before passing it to the main model. This trades a small cheap-model call for significant token savings on the expensive call.

from typing import Tuple

import openai

class AbstractiveCompressor:
    def __init__(self, client: openai.OpenAI, model: str = "gpt-4o-mini"):
        self.client = client
        self.model = model

    def compress_context(self, context: str, max_summary_tokens: int = 500) -> str:
        response = self.client.chat.completions.create(
            model=self.model,
            messages=[
                {
                    "role": "system",
                    "content": (
                        "Compress the following context into a dense summary. "
                        "Preserve all facts, numbers, names, and relationships. "
                        "Remove filler words, redundancies, and formatting. "
                        "Output only the compressed version."
                    ),
                },
                {"role": "user", "content": context},
            ],
            max_tokens=max_summary_tokens,
            temperature=0,
        )
        return response.choices[0].message.content

    def compress_if_beneficial(
        self,
        context: str,
        threshold_tokens: int = 2000,
    ) -> Tuple[str, dict]:
        est_tokens = len(context.split()) * 4 // 3
        if est_tokens <= threshold_tokens:
            return context, {"compressed": False, "original_tokens": est_tokens}
        compressed = self.compress_context(context)
        compressed_tokens = len(compressed.split()) * 4 // 3
        return compressed, {
            "compressed": True,
            "original_tokens": est_tokens,
            "compressed_tokens": compressed_tokens,
            "reduction_pct": round((1 - compressed_tokens / est_tokens) * 100, 1),
        }
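The compression call itself needs an API key, but the gating heuristic can be checked in isolation. A minimal sketch of the same threshold logic, using the 4/3 words-to-tokens approximation from compress_if_beneficial:

```python
def should_compress(context: str, threshold_tokens: int = 2000) -> bool:
    # Same words-to-tokens heuristic used in compress_if_beneficial.
    est_tokens = len(context.split()) * 4 // 3
    return est_tokens > threshold_tokens

short_context = "SSO is supported via SAML 2.0 and OIDC."
long_context = " ".join(["token"] * 3000)  # ~4000 estimated tokens

print(should_compress(short_context), should_compress(long_context))  # False True
```

Short contexts skip the extra call entirely; only genuinely large contexts pay the compression overhead.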

Technique 3: Structural Compression

Remove formatting that consumes tokens without adding information value.

import re

def compress_structural(text: str) -> str:
    text = re.sub(r'\n{3,}', '\n\n', text)
    text = re.sub(r' {2,}', ' ', text)
    text = re.sub(r'^#{1,6} ', '', text, flags=re.MULTILINE)  # remove markdown headers (line starts only)
    text = re.sub(r'\*{1,2}([^*]+)\*{1,2}', r'\1', text)  # remove bold/italic
    text = re.sub(r'^[-*] ', '', text, flags=re.MULTILINE)  # remove list markers
    return text.strip()

def compress_json_output(json_str: str) -> str:
    """Remove whitespace from JSON tool outputs."""
    import json
    try:
        data = json.loads(json_str)
        return json.dumps(data, separators=(',', ':'))
    except json.JSONDecodeError:
        return json_str
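JSON minification alone often saves a surprising number of tokens on pretty-printed tool outputs. A quick check, with the helper restated so the snippet runs standalone:

```python
import json

def compress_json_output(json_str: str) -> str:
    # Restated from above so this snippet runs on its own.
    try:
        return json.dumps(json.loads(json_str), separators=(',', ':'))
    except json.JSONDecodeError:
        return json_str

pretty = json.dumps({"status": "ok", "items": [1, 2, 3]}, indent=2)
minified = compress_json_output(pretty)
print(len(pretty), "->", len(minified))
print(minified)  # {"status":"ok","items":[1,2,3]}
```

Note the except branch: malformed output passes through unchanged rather than raising, which matters when tool results are not guaranteed to be valid JSON.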

Measuring Compression Quality

Always validate that compression does not degrade response quality. Run an A/B test comparing full-context and compressed-context responses.

from dataclasses import dataclass

@dataclass
class CompressionResult:
    original_tokens: int
    compressed_tokens: int
    quality_score: float  # 0.0 to 1.0
    cost_saved_per_request: float

    @property
    def compression_ratio(self) -> float:
        return 1 - (self.compressed_tokens / self.original_tokens)

    @property
    def is_acceptable(self) -> bool:
        return self.quality_score >= 0.85 and self.compression_ratio >= 0.25
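Plugging in the numbers from the intro (a 30,000-token prompt compressed to 12,000 tokens, with hypothetical quality scores) shows how the thresholds play out. The dataclass is restated so the snippet runs on its own:

```python
from dataclasses import dataclass

@dataclass
class CompressionResult:
    original_tokens: int
    compressed_tokens: int
    quality_score: float  # 0.0 to 1.0
    cost_saved_per_request: float

    @property
    def compression_ratio(self) -> float:
        return 1 - (self.compressed_tokens / self.original_tokens)

    @property
    def is_acceptable(self) -> bool:
        return self.quality_score >= 0.85 and self.compression_ratio >= 0.25

# Quality scores below are illustrative, not measured.
good = CompressionResult(30_000, 12_000, quality_score=0.91, cost_saved_per_request=0.045)
too_lossy = CompressionResult(30_000, 12_000, quality_score=0.80, cost_saved_per_request=0.045)

print(good.compression_ratio, good.is_acceptable, too_lossy.is_acceptable)  # 0.6 True False
```

A 60% reduction passes only when the quality gate holds; the same compression with a 0.80 quality score is rejected as too aggressive.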

FAQ

How much quality degradation should I accept from compression?

Target less than 5% quality degradation as measured by automated evaluation or human review. If your quality score drops below 0.85 on a 0–1 scale, the compression is too aggressive. Start conservative and increase compression gradually while monitoring quality metrics.

Is it worth using a paid API call just to compress the context?

Yes, when the context is large enough. If compressing 10,000 tokens of context down to a 3,000-token summary costs roughly $0.003 with GPT-4o-mini (counting both its input and output tokens) but saves about $0.017 in GPT-4o input costs, the net saving is around $0.014 per request. At scale, this compounds significantly.

Should I compress system prompts or just user context?

System prompts are usually already concise and carefully tuned, so compressing them risks degrading the model’s behavior. Focus compression on RAG context, conversation history, and tool outputs — these are the sources of token bloat in most agent systems.


#PromptCompression #TokenOptimization #CostReduction #LLMLingua #ContextManagement #AgenticAI #LearnAI #AIEngineering

CallSphere Team