Prompt Compression Techniques: Reducing Token Count by 50% Without Quality Loss
Master prompt compression methods including LLMLingua, selective context pruning, and abstractive compression to halve your token costs while maintaining output quality. Practical Python implementations included.
The Token Cost Problem
Every token in your prompt costs money. For agents that include conversation history, RAG context, tool outputs, and system instructions, prompts routinely hit 10,000–50,000 tokens. At GPT-4o’s input pricing, a 30,000-token prompt costs about $0.075 per request. Serve 100,000 requests per day and that is $7,500 per day, roughly $225,000 per month, just for input tokens.
Prompt compression reduces token count while preserving the information the model needs. Done well, you can cut token counts by 40–60% with negligible quality impact.
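Back-of-envelope checks like the one above are worth scripting before you invest in compression. A minimal sketch, assuming an input price of $2.50 per million tokens (verify current rates before relying on the constants):

```python
def input_cost_usd(tokens: int, price_per_million: float) -> float:
    """Cost in USD of sending `tokens` input tokens at a per-million-token price."""
    return tokens / 1_000_000 * price_per_million


per_request = input_cost_usd(30_000, 2.50)
per_day = per_request * 100_000  # at 100,000 requests/day
print(f"${per_request:.3f}/request, ${per_day:,.0f}/day")
```

Halving the prompt halves this number, which is why even a rough pruning pass pays for itself quickly at volume.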
Technique 1: Selective Context Pruning
Not all context is equally important. Prune low-relevance content before sending it to the model.
```python
from typing import List, Tuple


class SelectiveContextPruner:
    """Prune context passages by relevance score."""

    def __init__(self, max_tokens: int = 4000):
        self.max_tokens = max_tokens

    def estimate_tokens(self, text: str) -> int:
        # Rough heuristic: ~4 tokens per 3 English words.
        return len(text.split()) * 4 // 3

    def prune_by_relevance(
        self,
        passages: List[Tuple[str, float]],  # (text, relevance_score)
    ) -> List[str]:
        # Greedily keep the highest-relevance passages that fit the budget.
        sorted_passages = sorted(passages, key=lambda x: x[1], reverse=True)
        selected = []
        total_tokens = 0
        for text, _score in sorted_passages:
            tokens = self.estimate_tokens(text)
            if total_tokens + tokens <= self.max_tokens:
                selected.append(text)
                total_tokens += tokens
            else:
                break
        return selected

    def prune_conversation_history(
        self,
        messages: List[dict],
        keep_last_n: int = 4,
        keep_system: bool = True,
    ) -> List[dict]:
        # Keep system messages (optionally) plus the N most recent turns.
        system_msgs = [m for m in messages if m["role"] == "system"] if keep_system else []
        non_system = [m for m in messages if m["role"] != "system"]
        recent = non_system[-keep_last_n:] if len(non_system) > keep_last_n else non_system
        return system_msgs + recent


pruner = SelectiveContextPruner(max_tokens=3000)
passages = [
    ("The product supports SSO via SAML 2.0 and OIDC.", 0.92),
    ("Our office is located in San Francisco.", 0.15),
    ("Pricing starts at $49/month per seat.", 0.88),
    ("The company was founded in 2019.", 0.20),
    ("API rate limits are 1000 req/min on the Pro plan.", 0.85),
]
selected = pruner.prune_by_relevance(passages)
print(f"Kept {len(selected)} of {len(passages)} passages")
```
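The same keep-system-plus-recent policy can be checked inline on a hypothetical transcript (written out without the class so the snippet stands alone):

```python
messages = [
    {"role": "system", "content": "You are a support agent."},
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "Hello! How can I help?"},
    {"role": "user", "content": "My SSO login fails."},
    {"role": "assistant", "content": "Which identity provider?"},
    {"role": "user", "content": "Okta, via SAML."},
]

keep_last_n = 3
system_msgs = [m for m in messages if m["role"] == "system"]
recent = [m for m in messages if m["role"] != "system"][-keep_last_n:]
trimmed = system_msgs + recent
print([m["role"] for m in trimmed])  # ['system', 'user', 'assistant', 'user']
```

The early greeting turns are dropped; the system instruction and the live troubleshooting thread survive.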
Technique 2: Abstractive Compression
Use a cheap model to summarize verbose context before passing it to the main model. This trades a small cheap-model call for significant token savings on the expensive call.
```python
from typing import Tuple

import openai


class AbstractiveCompressor:
    def __init__(self, client: openai.OpenAI, model: str = "gpt-4o-mini"):
        self.client = client
        self.model = model

    def compress_context(self, context: str, max_summary_tokens: int = 500) -> str:
        response = self.client.chat.completions.create(
            model=self.model,
            messages=[
                {
                    "role": "system",
                    "content": (
                        "Compress the following context into a dense summary. "
                        "Preserve all facts, numbers, names, and relationships. "
                        "Remove filler words, redundancies, and formatting. "
                        "Output only the compressed version."
                    ),
                },
                {"role": "user", "content": context},
            ],
            max_tokens=max_summary_tokens,
            temperature=0,
        )
        return response.choices[0].message.content

    def compress_if_beneficial(
        self,
        context: str,
        threshold_tokens: int = 2000,
    ) -> Tuple[str, dict]:
        # Skip compression for small contexts: the extra call would
        # cost more than it saves.
        est_tokens = len(context.split()) * 4 // 3
        if est_tokens <= threshold_tokens:
            return context, {"compressed": False, "original_tokens": est_tokens}
        compressed = self.compress_context(context)
        compressed_tokens = len(compressed.split()) * 4 // 3
        return compressed, {
            "compressed": True,
            "original_tokens": est_tokens,
            "compressed_tokens": compressed_tokens,
            "reduction_pct": round((1 - compressed_tokens / est_tokens) * 100, 1),
        }
```
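Whether the extra cheap-model call pays off is simple arithmetic. A hedged sketch, with per-million-token prices as illustrative assumptions (check current list prices):

```python
def compression_net_saving(
    context_tokens: int,
    compressed_tokens: int,
    main_input_price: float = 2.50,   # assumed: GPT-4o input, $/1M tokens
    mini_input_price: float = 0.15,   # assumed: GPT-4o-mini input, $/1M tokens
    mini_output_price: float = 0.60,  # assumed: GPT-4o-mini output, $/1M tokens
) -> float:
    """Net USD saved per request by compressing context with a cheap model."""
    saved = (context_tokens - compressed_tokens) / 1e6 * main_input_price
    compressor_cost = (
        context_tokens / 1e6 * mini_input_price        # mini reads the full context
        + compressed_tokens / 1e6 * mini_output_price  # mini writes the summary
    )
    return saved - compressor_cost


print(round(compression_net_saving(10_000, 3_000), 4))  # ≈ 0.0142
```

Note the function goes negative for small or incompressible contexts, which is exactly the case the `threshold_tokens` guard above is meant to catch.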
Technique 3: Structural Compression
Remove formatting that consumes tokens without adding information value.
```python
import json
import re


def compress_structural(text: str) -> str:
    """Strip formatting that costs tokens without adding information."""
    text = re.sub(r'\n{3,}', '\n\n', text)                # collapse blank-line runs
    text = re.sub(r' {2,}', ' ', text)                    # collapse repeated spaces
    text = re.sub(r'#{1,6} ', '', text)                   # remove markdown headers
    text = re.sub(r'\*{1,2}([^*]+)\*{1,2}', r'\1', text)  # remove bold/italic
    text = re.sub(r'^[-*] ', '', text, flags=re.MULTILINE)  # remove list markers
    return text.strip()


def compress_json_output(json_str: str) -> str:
    """Remove whitespace from JSON tool outputs."""
    try:
        data = json.loads(json_str)
        return json.dumps(data, separators=(',', ':'))
    except json.JSONDecodeError:
        return json_str
```
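As a quick sanity check, minifying a pretty-printed JSON tool output with the stdlib alone already trims a noticeable fraction of the characters, with zero information loss:

```python
import json

# A small, hypothetical tool output, pretty-printed as many APIs return it.
pretty = json.dumps(
    {"results": [{"id": 1, "status": "ok"}, {"id": 2, "status": "error"}]},
    indent=2,
)
minified = json.dumps(json.loads(pretty), separators=(",", ":"))
print(len(pretty), len(minified))  # minified is substantially shorter
```

Because the round-trip goes through `json.loads`, the compressed form parses back to an identical object; only the layout is gone.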
Measuring Compression Quality
Always validate that compression does not degrade response quality. Run an A/B test comparing full-context and compressed-context responses.
```python
from dataclasses import dataclass


@dataclass
class CompressionResult:
    original_tokens: int
    compressed_tokens: int
    quality_score: float  # 0.0 to 1.0
    cost_saved_per_request: float

    @property
    def compression_ratio(self) -> float:
        return 1 - (self.compressed_tokens / self.original_tokens)

    @property
    def is_acceptable(self) -> bool:
        return self.quality_score >= 0.85 and self.compression_ratio >= 0.25
```
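One cheap proxy for the quality score in such an A/B run is fact retention: check which must-keep facts survive compression. A minimal sketch, where the fact list and substring-match scorer are illustrative assumptions rather than a production evaluator:

```python
def fact_retention_score(compressed: str, required_facts: list[str]) -> float:
    """Fraction of required facts still present (case-insensitive substring match)."""
    text = compressed.lower()
    kept = sum(1 for fact in required_facts if fact.lower() in text)
    return kept / len(required_facts)


facts = ["saml 2.0", "$49/month", "1000 req/min"]
compressed = "SSO via SAML 2.0; pricing $49/month; Pro plan limit 1000 req/min."
print(fact_retention_score(compressed, facts))  # 1.0
```

Substring matching misses paraphrases, so treat this as a fast regression guard and fall back to LLM-based or human evaluation for borderline cases.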
FAQ
How much quality degradation should I accept from compression?
Target less than 5% quality degradation as measured by automated evaluation or human review. If your quality score drops below 0.85 on a 0–1 scale, the compression is too aggressive. Start conservative and increase compression gradually while monitoring quality metrics.
Is it worth using a paid API call just to compress the context?
Yes, when the context is large enough. If compressing 10,000 tokens of context down to 3,000 tokens costs roughly $0.003 with GPT-4o-mini (input plus summary output) but saves about $0.017 in GPT-4o input costs, the net saving is around $0.014 per request. At scale, this compounds significantly.
Should I compress system prompts or just user context?
System prompts are usually already concise and carefully tuned, so compressing them risks degrading the model’s behavior. Focus compression on RAG context, conversation history, and tool outputs — these are the sources of token bloat in most agent systems.
CallSphere Team
Expert insights on AI voice agents and customer communication automation.