Skip to content
Learn Agentic AI9 min read0 views

Text Preprocessing for AI Agents: Cleaning, Normalizing, and Preparing Input Data

Build robust text preprocessing pipelines for AI agents covering HTML stripping, Unicode normalization, tokenization, length management, and input sanitization with production-ready Python code.

Why Preprocessing Is the Agent's First Line of Defense

Every message that enters an AI agent's pipeline carries noise: HTML tags from web scraping, invisible Unicode characters from copy-paste, excessive whitespace from formatting, or text that exceeds the model's context window. If unprocessed, this noise wastes tokens, confuses models, and produces unreliable outputs. Text preprocessing transforms raw input into a clean, consistent format that downstream NLP components can handle reliably.

Good preprocessing is invisible when it works. You only notice it when it is missing — when an agent chokes on an emoji, truncates a message mid-sentence, or treats "cafe" and "caf\u00e9" as different words.

HTML and Markup Stripping

Agents that process web content, emails, or rich-text inputs encounter HTML regularly. Stripping it cleanly requires handling nested tags, entities, and preserving meaningful structure.

import re
from html import unescape

def strip_html(text: str) -> str:
    """Remove HTML tags while preserving text content and structure."""
    # Replace block-level tags with newlines
    block_tags = r"</?(?:p|div|br|h[1-6]|li|tr|blockquote)[^>]*>"
    text = re.sub(block_tags, "\n", text, flags=re.IGNORECASE)

    # Remove all remaining HTML tags
    text = re.sub(r"<[^>]+>", "", text)

    # Decode HTML entities
    text = unescape(text)

    # Clean up excessive whitespace
    text = re.sub(r"\n{3,}", "\n\n", text)
    text = re.sub(r"[ \t]+", " ", text)

    return text.strip()

html_input = """<div class="msg">
  <p>Hello <b>World</b>!</p>
  <p>Visit us at &amp; learn more.</p>
</div>"""

print(strip_html(html_input))
# "Hello World!

Visit us at & learn more."

Unicode Normalization

The same character can have multiple Unicode representations. "caf\u00e9" can be stored as a single code point (\u00e9) or as "e" followed by a combining acute accent. This causes exact-match failures, inconsistent embeddings, and search misses.

import unicodedata
import re

def normalize_unicode(text: str) -> str:
    """Normalize Unicode to a consistent form and remove control characters."""
    # NFC normalization: compose characters where possible
    text = unicodedata.normalize("NFC", text)

    # Remove zero-width characters and other invisible formatting
    invisible_chars = re.compile(
        "[\u200b\u200c\u200d\u200e\u200f"
        "\ufeff\u00ad\u2060\u2061\u2062\u2063]"
    )
    text = invisible_chars.sub("", text)

    # Remove control characters except newline and tab
    text = "".join(
        ch for ch in text
        if unicodedata.category(ch) != "Cc" or ch in "\n\t"
    )

    return text

# Invisible characters often appear in copy-pasted text
messy = "Hello\u200b \u200dWorld\ufeff"  # Contains zero-width chars
print(repr(normalize_unicode(messy)))
# 'Hello World'

Smart Whitespace Cleaning

Whitespace issues come in many forms: tabs mixed with spaces, multiple consecutive newlines, trailing spaces, and non-breaking spaces masquerading as regular spaces.

import re

def clean_whitespace(text: str) -> str:
    """Normalize all whitespace to standard forms."""
    # Replace non-breaking spaces and other space variants
    text = re.sub(r"[\u00a0\u2000-\u200a\u202f\u205f\u3000]", " ", text)

    # Replace tabs with spaces
    text = text.replace("\t", " ")

    # Collapse multiple spaces into one
    text = re.sub(r" {2,}", " ", text)

    # Normalize line endings
    text = text.replace("\r\n", "\n").replace("\r", "\n")

    # Collapse more than 2 consecutive newlines
    text = re.sub(r"\n{3,}", "\n\n", text)

    # Strip leading/trailing whitespace from each line
    lines = [line.strip() for line in text.split("\n")]
    text = "\n".join(lines)

    return text.strip()

Tokenization and Length Management

LLMs have fixed context windows measured in tokens, not characters. Preprocessing must ensure inputs fit within limits while preserving meaning.

See AI Voice Agents Handle Real Calls

Book a free demo or calculate how much you can save with AI voice automation.

import tiktoken

class TokenManager:
    def __init__(self, model: str = "gpt-4o"):
        self.encoder = tiktoken.encoding_for_model(model)

    def count_tokens(self, text: str) -> int:
        return len(self.encoder.encode(text))

    def truncate_to_tokens(
        self,
        text: str,
        max_tokens: int,
        strategy: str = "end",
    ) -> str:
        """Truncate text to fit within a token limit."""
        tokens = self.encoder.encode(text)
        if len(tokens) <= max_tokens:
            return text

        if strategy == "end":
            truncated = tokens[:max_tokens]
        elif strategy == "start":
            truncated = tokens[-max_tokens:]
        elif strategy == "middle":
            half = max_tokens // 2
            truncated = tokens[:half] + tokens[-half:]
        else:
            raise ValueError(f"Unknown strategy: {strategy}")

        return self.encoder.decode(truncated)

    def smart_truncate(
        self,
        text: str,
        max_tokens: int,
    ) -> str:
        """Truncate at sentence boundaries to avoid mid-sentence cuts."""
        tokens = self.encoder.encode(text)
        if len(tokens) <= max_tokens:
            return text

        # Binary search for the last complete sentence within the limit
        sentences = text.split(". ")
        result = ""
        for sentence in sentences:
            candidate = result + sentence + ". "
            if self.count_tokens(candidate) > max_tokens:
                break
            result = candidate

        return result.strip() or self.truncate_to_tokens(text, max_tokens)

token_mgr = TokenManager()
print(token_mgr.count_tokens("Hello, world!"))  # 4

Building a Complete Preprocessing Pipeline

Combine all preprocessing steps into a configurable pipeline that agents invoke before processing any input.

from dataclasses import dataclass, field
from typing import Callable

@dataclass
class PreprocessConfig:
    strip_html: bool = True
    normalize_unicode: bool = True
    clean_whitespace: bool = True
    max_tokens: int = 4000
    lowercase: bool = False
    remove_urls: bool = False

class TextPreprocessor:
    def __init__(self, config: PreprocessConfig):
        self.config = config
        self.token_mgr = TokenManager()
        self.steps: list[Callable[[str], str]] = self._build_steps()

    def _build_steps(self) -> list[Callable[[str], str]]:
        steps = []
        if self.config.strip_html:
            steps.append(strip_html)
        if self.config.normalize_unicode:
            steps.append(normalize_unicode)
        if self.config.remove_urls:
            steps.append(self._remove_urls)
        if self.config.lowercase:
            steps.append(str.lower)
        if self.config.clean_whitespace:
            steps.append(clean_whitespace)
        return steps

    def _remove_urls(self, text: str) -> str:
        return re.sub(
            r"https?://\S+|www\.\S+", "[URL]", text
        )

    def process(self, text: str) -> dict:
        """Run the full preprocessing pipeline."""
        original_length = len(text)
        for step in self.steps:
            text = step(text)

        original_tokens = self.token_mgr.count_tokens(text)
        if original_tokens > self.config.max_tokens:
            text = self.token_mgr.smart_truncate(
                text, self.config.max_tokens
            )
            was_truncated = True
        else:
            was_truncated = False

        return {
            "text": text,
            "original_chars": original_length,
            "processed_chars": len(text),
            "token_count": self.token_mgr.count_tokens(text),
            "was_truncated": was_truncated,
        }

# Usage
preprocessor = TextPreprocessor(PreprocessConfig(
    max_tokens=2000,
    remove_urls=True,
))

result = preprocessor.process(raw_user_input)
clean_text = result["text"]

Input Sanitization for Security

Agents that process user input must guard against prompt injection and other adversarial inputs.

import re

def sanitize_for_agent(text: str) -> str:
    """Remove potential prompt injection patterns."""
    # Remove common injection patterns
    injection_patterns = [
        r"ignore (?:all )?previous instructions",
        r"you are now",
        r"system:\s*",
        r"\[INST\]",
        r"<\|(?:im_start|im_end|system|user|assistant)\|>",
    ]

    sanitized = text
    for pattern in injection_patterns:
        sanitized = re.sub(pattern, "[FILTERED]", sanitized, flags=re.IGNORECASE)

    return sanitized

This is a heuristic defense — not a complete solution. Always combine input sanitization with proper system prompt hardening and output validation for defense in depth.

FAQ

Should I preprocess text before or after language detection?

Preprocess before language detection, but only apply language-agnostic steps: HTML stripping, Unicode normalization, and whitespace cleaning. Do not apply lowercase normalization or stop word removal before detection, as these can degrade detection accuracy. Language-specific preprocessing (stemming, lemmatization) should happen after detection confirms the language.

How do I handle emojis and special characters in agent input?

Do not remove emojis by default. Modern LLMs understand emojis and they carry sentiment and intent information. Remove emojis only if they cause issues with a specific downstream model. Replace them with text descriptions (using the emoji library's demojize() function) if you need to preserve the meaning while using models that do not handle emojis well.

What is the right max token limit for preprocessing?

It depends on your prompt design. If your system prompt uses 1,000 tokens and the model has an 8,000 token context window, your user input budget is roughly 7,000 tokens minus whatever you reserve for the model's response (typically 1,000 to 2,000 tokens). Calculate it as: max_input_tokens = context_window - system_prompt_tokens - max_output_tokens - safety_margin. Track these budgets explicitly in your preprocessing configuration.


#TextPreprocessing #NLP #Tokenization #DataCleaning #AIAgents #Python #AgenticAI #LearnAI #AIEngineering

Share this article
C

CallSphere Team

Expert insights on AI voice agents and customer communication automation.

Try CallSphere AI Voice Agents

See how AI voice agents work for your industry. Live demo available -- no signup required.