Text Preprocessing for AI Agents: Cleaning, Normalizing, and Preparing Input Data
Build robust text preprocessing pipelines for AI agents covering HTML stripping, Unicode normalization, tokenization, length management, and input sanitization with production-ready Python code.
Why Preprocessing Is the Agent's First Line of Defense
Every message that enters an AI agent's pipeline carries noise: HTML tags from web scraping, invisible Unicode characters from copy-paste, excessive whitespace from formatting, or text that exceeds the model's context window. If unprocessed, this noise wastes tokens, confuses models, and produces unreliable outputs. Text preprocessing transforms raw input into a clean, consistent format that downstream NLP components can handle reliably.
Good preprocessing is invisible when it works. You only notice it when it is missing — when an agent chokes on an emoji, truncates a message mid-sentence, or treats "cafe" and "caf\u00e9" as different words.
HTML and Markup Stripping
Agents that process web content, emails, or rich-text inputs encounter HTML regularly. Stripping it cleanly requires handling nested tags, entities, and preserving meaningful structure.
import re
from html import unescape
def strip_html(text: str) -> str:
"""Remove HTML tags while preserving text content and structure."""
# Replace block-level tags with newlines
block_tags = r"</?(?:p|div|br|h[1-6]|li|tr|blockquote)[^>]*>"
text = re.sub(block_tags, "\n", text, flags=re.IGNORECASE)
# Remove all remaining HTML tags
text = re.sub(r"<[^>]+>", "", text)
# Decode HTML entities
text = unescape(text)
# Clean up excessive whitespace
text = re.sub(r"\n{3,}", "\n\n", text)
text = re.sub(r"[ \t]+", " ", text)
return text.strip()
html_input = """<div class="msg">
<p>Hello <b>World</b>!</p>
<p>Visit us at & learn more.</p>
</div>"""
print(strip_html(html_input))
# "Hello World!
Visit us at & learn more."
Unicode Normalization
The same character can have multiple Unicode representations. "caf\u00e9" can be stored as a single code point (\u00e9) or as "e" followed by a combining acute accent. This causes exact-match failures, inconsistent embeddings, and search misses.
import unicodedata
import re
def normalize_unicode(text: str) -> str:
"""Normalize Unicode to a consistent form and remove control characters."""
# NFC normalization: compose characters where possible
text = unicodedata.normalize("NFC", text)
# Remove zero-width characters and other invisible formatting
invisible_chars = re.compile(
"[\u200b\u200c\u200d\u200e\u200f"
"\ufeff\u00ad\u2060\u2061\u2062\u2063]"
)
text = invisible_chars.sub("", text)
# Remove control characters except newline and tab
text = "".join(
ch for ch in text
if unicodedata.category(ch) != "Cc" or ch in "\n\t"
)
return text
# Invisible characters often appear in copy-pasted text
messy = "Hello\u200b \u200dWorld\ufeff" # Contains zero-width chars
print(repr(normalize_unicode(messy)))
# 'Hello World'
Smart Whitespace Cleaning
Whitespace issues come in many forms: tabs mixed with spaces, multiple consecutive newlines, trailing spaces, and non-breaking spaces masquerading as regular spaces.
import re
def clean_whitespace(text: str) -> str:
"""Normalize all whitespace to standard forms."""
# Replace non-breaking spaces and other space variants
text = re.sub(r"[\u00a0\u2000-\u200a\u202f\u205f\u3000]", " ", text)
# Replace tabs with spaces
text = text.replace("\t", " ")
# Collapse multiple spaces into one
text = re.sub(r" {2,}", " ", text)
# Normalize line endings
text = text.replace("\r\n", "\n").replace("\r", "\n")
# Collapse more than 2 consecutive newlines
text = re.sub(r"\n{3,}", "\n\n", text)
# Strip leading/trailing whitespace from each line
lines = [line.strip() for line in text.split("\n")]
text = "\n".join(lines)
return text.strip()
Tokenization and Length Management
LLMs have fixed context windows measured in tokens, not characters. Preprocessing must ensure inputs fit within limits while preserving meaning.
See AI Voice Agents Handle Real Calls
Book a free demo or calculate how much you can save with AI voice automation.
import tiktoken
class TokenManager:
def __init__(self, model: str = "gpt-4o"):
self.encoder = tiktoken.encoding_for_model(model)
def count_tokens(self, text: str) -> int:
return len(self.encoder.encode(text))
def truncate_to_tokens(
self,
text: str,
max_tokens: int,
strategy: str = "end",
) -> str:
"""Truncate text to fit within a token limit."""
tokens = self.encoder.encode(text)
if len(tokens) <= max_tokens:
return text
if strategy == "end":
truncated = tokens[:max_tokens]
elif strategy == "start":
truncated = tokens[-max_tokens:]
elif strategy == "middle":
half = max_tokens // 2
truncated = tokens[:half] + tokens[-half:]
else:
raise ValueError(f"Unknown strategy: {strategy}")
return self.encoder.decode(truncated)
def smart_truncate(
self,
text: str,
max_tokens: int,
) -> str:
"""Truncate at sentence boundaries to avoid mid-sentence cuts."""
tokens = self.encoder.encode(text)
if len(tokens) <= max_tokens:
return text
# Binary search for the last complete sentence within the limit
sentences = text.split(". ")
result = ""
for sentence in sentences:
candidate = result + sentence + ". "
if self.count_tokens(candidate) > max_tokens:
break
result = candidate
return result.strip() or self.truncate_to_tokens(text, max_tokens)
token_mgr = TokenManager()
print(token_mgr.count_tokens("Hello, world!")) # 4
Building a Complete Preprocessing Pipeline
Combine all preprocessing steps into a configurable pipeline that agents invoke before processing any input.
from dataclasses import dataclass, field
from typing import Callable
@dataclass
class PreprocessConfig:
strip_html: bool = True
normalize_unicode: bool = True
clean_whitespace: bool = True
max_tokens: int = 4000
lowercase: bool = False
remove_urls: bool = False
class TextPreprocessor:
def __init__(self, config: PreprocessConfig):
self.config = config
self.token_mgr = TokenManager()
self.steps: list[Callable[[str], str]] = self._build_steps()
def _build_steps(self) -> list[Callable[[str], str]]:
steps = []
if self.config.strip_html:
steps.append(strip_html)
if self.config.normalize_unicode:
steps.append(normalize_unicode)
if self.config.remove_urls:
steps.append(self._remove_urls)
if self.config.lowercase:
steps.append(str.lower)
if self.config.clean_whitespace:
steps.append(clean_whitespace)
return steps
def _remove_urls(self, text: str) -> str:
return re.sub(
r"https?://\S+|www\.\S+", "[URL]", text
)
def process(self, text: str) -> dict:
"""Run the full preprocessing pipeline."""
original_length = len(text)
for step in self.steps:
text = step(text)
original_tokens = self.token_mgr.count_tokens(text)
if original_tokens > self.config.max_tokens:
text = self.token_mgr.smart_truncate(
text, self.config.max_tokens
)
was_truncated = True
else:
was_truncated = False
return {
"text": text,
"original_chars": original_length,
"processed_chars": len(text),
"token_count": self.token_mgr.count_tokens(text),
"was_truncated": was_truncated,
}
# Usage
preprocessor = TextPreprocessor(PreprocessConfig(
max_tokens=2000,
remove_urls=True,
))
result = preprocessor.process(raw_user_input)
clean_text = result["text"]
Input Sanitization for Security
Agents that process user input must guard against prompt injection and other adversarial inputs.
import re
def sanitize_for_agent(text: str) -> str:
"""Remove potential prompt injection patterns."""
# Remove common injection patterns
injection_patterns = [
r"ignore (?:all )?previous instructions",
r"you are now",
r"system:\s*",
r"\[INST\]",
r"<\|(?:im_start|im_end|system|user|assistant)\|>",
]
sanitized = text
for pattern in injection_patterns:
sanitized = re.sub(pattern, "[FILTERED]", sanitized, flags=re.IGNORECASE)
return sanitized
This is a heuristic defense — not a complete solution. Always combine input sanitization with proper system prompt hardening and output validation for defense in depth.
FAQ
Should I preprocess text before or after language detection?
Preprocess before language detection, but only apply language-agnostic steps: HTML stripping, Unicode normalization, and whitespace cleaning. Do not apply lowercase normalization or stop word removal before detection, as these can degrade detection accuracy. Language-specific preprocessing (stemming, lemmatization) should happen after detection confirms the language.
How do I handle emojis and special characters in agent input?
Do not remove emojis by default. Modern LLMs understand emojis and they carry sentiment and intent information. Remove emojis only if they cause issues with a specific downstream model. Replace them with text descriptions (using the emoji library's demojize() function) if you need to preserve the meaning while using models that do not handle emojis well.
What is the right max token limit for preprocessing?
It depends on your prompt design. If your system prompt uses 1,000 tokens and the model has an 8,000 token context window, your user input budget is roughly 7,000 tokens minus whatever you reserve for the model's response (typically 1,000 to 2,000 tokens). Calculate it as: max_input_tokens = context_window - system_prompt_tokens - max_output_tokens - safety_margin. Track these budgets explicitly in your preprocessing configuration.
#TextPreprocessing #NLP #Tokenization #DataCleaning #AIAgents #Python #AgenticAI #LearnAI #AIEngineering
CallSphere Team
Expert insights on AI voice agents and customer communication automation.
Try CallSphere AI Voice Agents
See how AI voice agents work for your industry. Live demo available -- no signup required.