
Content Security Policies for AI Agents: Preventing Malicious Output Generation

Build robust output filtering systems for AI agents using allowlists, blocklists, regex patterns, ML classifiers, and structured output validation to prevent harmful, toxic, or policy-violating content from reaching end users.

Why Output Filtering Is Non-Negotiable

An AI agent can generate any text the underlying LLM is capable of producing. Without output filtering, agents can leak private data, generate harmful instructions, produce policy-violating content, or output executable code that acts as a cross-site scripting payload when rendered in a browser.

Content security for AI agents operates on a different model than traditional web content security policies. Instead of restricting which resources a browser can load, agent content security restricts what the agent can say. The enforcement point sits between the LLM's raw output and the delivery layer that sends responses to users.

Layered Filtering Architecture

Build your content security as a pipeline of filters that each response must pass through. If any filter rejects the response, it is blocked or sanitized before delivery:

from abc import ABC, abstractmethod
from dataclasses import dataclass
from enum import Enum


class FilterVerdict(Enum):
    PASS = "pass"
    BLOCK = "block"
    SANITIZE = "sanitize"


@dataclass
class FilterResult:
    verdict: FilterVerdict
    filter_name: str
    reason: str
    sanitized_content: str | None = None


class ContentFilter(ABC):
    """Base class for content security filters."""

    @abstractmethod
    def evaluate(self, content: str, context: dict) -> FilterResult:
        ...


class ContentSecurityPipeline:
    """Runs agent output through a chain of content filters."""

    def __init__(self):
        self.filters: list[ContentFilter] = []

    def add_filter(self, f: ContentFilter) -> None:
        self.filters.append(f)

    def process(self, content: str, context: dict | None = None) -> tuple[str, list[FilterResult]]:
        """Process content through all filters.
        Returns (final_content, filter_results)."""
        ctx = context or {}
        results = []
        current_content = content

        for f in self.filters:
            result = f.evaluate(current_content, ctx)
            results.append(result)

            if result.verdict == FilterVerdict.BLOCK:
                return (
                    "I cannot provide that information.",
                    results,
                )

            if result.verdict == FilterVerdict.SANITIZE and result.sanitized_content:
                current_content = result.sanitized_content

        return current_content, results

Pattern-Based Filtering

Use regex patterns to catch common dangerous outputs like PII, credentials, and code injection attempts:

import re


class PatternFilter(ContentFilter):
    """Blocks or sanitizes content matching dangerous patterns."""

    PATTERNS = {
        "ssn": {
            "pattern": r"\b\d{3}-\d{2}-\d{4}\b",
            "action": FilterVerdict.SANITIZE,
            "replacement": "[SSN REDACTED]",
            "reason": "Social Security Number detected",
        },
        "credit_card": {
            "pattern": r"\b(?:\d{4}[- ]?){3}\d{4}\b",
            "action": FilterVerdict.SANITIZE,
            "replacement": "[CARD REDACTED]",
            "reason": "Credit card number detected",
        },
        "api_key": {
            "pattern": r"\b(sk-[a-zA-Z0-9]{32,}|AKIA[0-9A-Z]{16})\b",
            "action": FilterVerdict.BLOCK,
            "replacement": "",
            "reason": "API key or credential detected",
        },
        "script_injection": {
            "pattern": r"<script[^>]*>.*?</script>",
            "action": FilterVerdict.SANITIZE,
            "replacement": "[SCRIPT REMOVED]",
            "reason": "Script injection detected",
        },
    }

    def evaluate(self, content: str, context: dict) -> FilterResult:
        sanitized = content
        triggered: list[str] = []

        for name, config in self.PATTERNS.items():
            if not re.search(config["pattern"], sanitized, re.IGNORECASE | re.DOTALL):
                continue

            # A block verdict wins immediately: there is nothing safe to deliver.
            if config["action"] == FilterVerdict.BLOCK:
                return FilterResult(
                    verdict=FilterVerdict.BLOCK,
                    filter_name=f"pattern:{name}",
                    reason=config["reason"],
                )

            # Apply every sanitizing pattern, not just the first one that matches,
            # so a response containing both an SSN and a card number is fully redacted.
            sanitized = re.sub(
                config["pattern"],
                config["replacement"],
                sanitized,
                flags=re.IGNORECASE | re.DOTALL,
            )
            triggered.append(config["reason"])

        if triggered:
            return FilterResult(
                verdict=FilterVerdict.SANITIZE,
                filter_name="pattern",
                reason="; ".join(triggered),
                sanitized_content=sanitized,
            )

        return FilterResult(
            verdict=FilterVerdict.PASS,
            filter_name="pattern",
            reason="No dangerous patterns detected",
        )

Allowlist-Based Output Control

For high-security environments, define exactly what the agent is allowed to output rather than trying to block everything dangerous:


class TopicAllowlistFilter(ContentFilter):
    """Restricts agent output to pre-approved topics."""

    def __init__(self, allowed_topics: list[str], classifier_fn=None):
        self.allowed_topics = set(allowed_topics)
        self.classifier_fn = classifier_fn or self._default_classifier

    def _default_classifier(self, content: str) -> list[str]:
        """Simple keyword-based topic classification."""
        topic_keywords = {
            "product_info": ["product", "feature", "pricing", "plan"],
            "support": ["help", "issue", "error", "troubleshoot"],
            "billing": ["invoice", "payment", "subscription", "charge"],
        }
        detected = []
        content_lower = content.lower()
        for topic, keywords in topic_keywords.items():
            if any(kw in content_lower for kw in keywords):
                detected.append(topic)
        return detected if detected else ["unknown"]

    def evaluate(self, content: str, context: dict) -> FilterResult:
        detected_topics = self.classifier_fn(content)

        for topic in detected_topics:
            if topic not in self.allowed_topics:
                return FilterResult(
                    verdict=FilterVerdict.BLOCK,
                    filter_name="topic_allowlist",
                    reason=f"Topic '{topic}' not in allowlist",
                )

        return FilterResult(
            verdict=FilterVerdict.PASS,
            filter_name="topic_allowlist",
            reason="All topics within allowed set",
        )

Structured Output Validation

Enforce output schemas that make it structurally impossible for the agent to produce certain types of content:

import re

from pydantic import BaseModel, field_validator


class SafeAgentResponse(BaseModel):
    """Validated agent response that prevents dangerous outputs."""
    message: str
    sources: list[str]
    confidence: float

    @field_validator("message")
    @classmethod
    def validate_message(cls, v: str) -> str:
        # Reject responses containing HTML tags
        if re.search(r"<[a-zA-Z][^>]*>", v):
            raise ValueError("Response must not contain HTML tags")

        # Reject responses exceeding length limit
        if len(v) > 5000:
            raise ValueError("Response exceeds maximum length")

        return v

    @field_validator("confidence")
    @classmethod
    def validate_confidence(cls, v: float) -> float:
        if not 0.0 <= v <= 1.0:
            raise ValueError("Confidence must be between 0 and 1")
        return v


# Usage in pipeline
pipeline = ContentSecurityPipeline()
pipeline.add_filter(PatternFilter())
pipeline.add_filter(TopicAllowlistFilter(
    allowed_topics=["product_info", "support", "billing"]
))

raw_output = "Your API key is sk-abcdefghijklmnopqrstuvwxyz123456. Your next bill is $49."
safe_output, results = pipeline.process(raw_output)
# The credential pattern triggers a BLOCK, so safe_output is the refusal message.

FAQ

How do I handle false positives in pattern-based filtering?

Track your false positive rate by logging all filter verdicts and reviewing blocked responses. Tune your patterns to be more specific — for example, use a Luhn check for credit card numbers rather than just matching digit patterns. Implement a review queue where blocked responses can be manually approved, and feed those approvals back into pattern refinement.
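The Luhn check suggested above can be sketched as a small standalone helper (the name luhn_valid is illustrative, not part of the filter classes in this article). It confirms that a digit sequence matched by the credit-card regex is plausibly a real card number before redacting it, which cuts false positives on phone numbers and order IDs:

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    digits = [int(d) for d in number if d.isdigit()]
    if len(digits) < 13:  # real card numbers are 13-19 digits
        return False
    checksum = 0
    # Double every second digit from the right; subtract 9 when the double exceeds 9.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        checksum += d
    return checksum % 10 == 0
```

Wiring this in as a post-match check on the credit_card pattern means the regex stays broad while the checksum provides the precision.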

Should I filter tool call outputs or only final responses?

Filter both. Tool call outputs can contain injected content that influences the agent's subsequent reasoning. Final responses are what users see. Apply the full security pipeline to tool outputs as they are ingested, and apply it again to the agent's final response before delivery.
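A minimal sketch of that ingestion-time step (the sanitize_tool_output helper and its two rules are illustrative stand-ins for the full pipeline, not part of the code above): redact tool results before they re-enter the agent's context so injected content is neutralized at the point of ingestion.

```python
import re

# Illustrative redaction rules; in practice, reuse the same PATTERNS as the
# response-side filter so both enforcement points stay in sync.
REDACTIONS = [
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[SSN REDACTED]"),
    (re.compile(r"<script[^>]*>.*?</script>", re.IGNORECASE | re.DOTALL), "[SCRIPT REMOVED]"),
]


def sanitize_tool_output(raw: str) -> str:
    """Apply each redaction rule to a tool result before context ingestion."""
    for pattern, replacement in REDACTIONS:
        raw = pattern.sub(replacement, raw)
    return raw
```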

How does output filtering interact with streaming responses?

Streaming complicates content security because you cannot analyze the full response before sending tokens to the user. Buffer a configurable amount of text (for example, sentence boundaries) and run filters on each buffer before flushing to the client. For pattern-based filters, maintain state across buffers to detect patterns that span chunk boundaries.
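The buffering approach can be sketched like this (SentenceBuffer is illustrative, and it redacts SSNs as a stand-in for the full filter pipeline). Because unflushed text accumulates in the pending buffer, a pattern split across two streamed chunks is still caught when the second chunk arrives:

```python
import re

SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


class SentenceBuffer:
    """Buffers streamed tokens and flushes only complete, filtered sentences."""

    def __init__(self):
        self.pending = ""

    def feed(self, chunk: str) -> list[str]:
        """Accumulate a streamed chunk; return filtered sentences ready to send."""
        self.pending += chunk
        ready = []
        # Flush on sentence boundaries only, so filters see complete sentences.
        while (m := re.search(r"[.!?]\s", self.pending)):
            sentence, self.pending = self.pending[:m.end()], self.pending[m.end():]
            ready.append(SSN_PATTERN.sub("[SSN REDACTED]", sentence))
        return ready

    def flush(self) -> str:
        """Filter and return whatever remains at end of stream."""
        out = SSN_PATTERN.sub("[SSN REDACTED]", self.pending)
        self.pending = ""
        return out
```

The trade-off is latency: users wait up to one sentence before seeing tokens, in exchange for never seeing a half-redacted secret.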


#ContentSecurity #OutputFiltering #AISafety #ContentModeration #AgentGuardrails #AgenticAI #LearnAI #AIEngineering


CallSphere Team

Expert insights on AI voice agents and customer communication automation.
