Content Security Policies for AI Agents: Preventing Malicious Output Generation
Build robust output filtering systems for AI agents using allowlists, blocklists, regex patterns, ML classifiers, and structured output validation to prevent harmful, toxic, or policy-violating content from reaching end users.
Why Output Filtering Is Non-Negotiable
An AI agent can generate any text the underlying LLM is capable of producing. Without output filtering, agents can leak private data, generate harmful instructions, produce policy-violating content, or output executable code that acts as a cross-site scripting payload when rendered in a browser.
Content security for AI agents operates on a different model than traditional web content security policies. Instead of restricting which resources a browser can load, agent content security restricts what the agent can say. The enforcement point sits between the LLM's raw output and the delivery layer that sends responses to users.
Layered Filtering Architecture
Build your content security as a pipeline of filters that each response must pass through. If any filter rejects the response, it is blocked or sanitized before delivery:
```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from enum import Enum


class FilterVerdict(Enum):
    PASS = "pass"
    BLOCK = "block"
    SANITIZE = "sanitize"


@dataclass
class FilterResult:
    verdict: FilterVerdict
    filter_name: str
    reason: str
    sanitized_content: str | None = None


class ContentFilter(ABC):
    """Base class for content security filters."""

    @abstractmethod
    def evaluate(self, content: str, context: dict) -> FilterResult:
        ...


class ContentSecurityPipeline:
    """Runs agent output through a chain of content filters."""

    def __init__(self):
        self.filters: list[ContentFilter] = []

    def add_filter(self, f: ContentFilter) -> None:
        self.filters.append(f)

    def process(self, content: str, context: dict | None = None) -> tuple[str, list[FilterResult]]:
        """Process content through all filters.

        Returns (final_content, filter_results)."""
        ctx = context or {}
        results = []
        current_content = content
        for f in self.filters:
            result = f.evaluate(current_content, ctx)
            results.append(result)
            if result.verdict == FilterVerdict.BLOCK:
                return (
                    "I cannot provide that information.",
                    results,
                )
            if result.verdict == FilterVerdict.SANITIZE and result.sanitized_content:
                current_content = result.sanitized_content
        return current_content, results
```
Pattern-Based Filtering
Use regex patterns to catch common dangerous outputs like PII, credentials, and code injection attempts:
```python
import re


class PatternFilter(ContentFilter):
    """Blocks or sanitizes content matching dangerous patterns."""

    PATTERNS = {
        "ssn": {
            "pattern": r"\b\d{3}-\d{2}-\d{4}\b",
            "action": FilterVerdict.SANITIZE,
            "replacement": "[SSN REDACTED]",
            "reason": "Social Security Number detected",
        },
        "credit_card": {
            "pattern": r"\b(?:\d{4}[- ]?){3}\d{4}\b",
            "action": FilterVerdict.SANITIZE,
            "replacement": "[CARD REDACTED]",
            "reason": "Credit card number detected",
        },
        "api_key": {
            "pattern": r"\b(sk-[a-zA-Z0-9]{32,}|AKIA[0-9A-Z]{16})\b",
            "action": FilterVerdict.BLOCK,
            "replacement": "",
            "reason": "API key or credential detected",
        },
        "script_injection": {
            "pattern": r"<script[^>]*>.*?</script>",
            "action": FilterVerdict.SANITIZE,
            "replacement": "[SCRIPT REMOVED]",
            "reason": "Script injection detected",
        },
    }

    def evaluate(self, content: str, context: dict) -> FilterResult:
        sanitized = content
        reasons = []
        for name, config in self.PATTERNS.items():
            if not re.search(config["pattern"], sanitized, re.IGNORECASE | re.DOTALL):
                continue
            if config["action"] == FilterVerdict.BLOCK:
                return FilterResult(
                    verdict=FilterVerdict.BLOCK,
                    filter_name=f"pattern:{name}",
                    reason=config["reason"],
                )
            # Apply every sanitizing pattern, not just the first match,
            # so content with both an SSN and a card number gets both redactions
            sanitized = re.sub(
                config["pattern"],
                config["replacement"],
                sanitized,
                flags=re.IGNORECASE | re.DOTALL,
            )
            reasons.append(config["reason"])
        if reasons:
            return FilterResult(
                verdict=FilterVerdict.SANITIZE,
                filter_name="pattern",
                reason="; ".join(reasons),
                sanitized_content=sanitized,
            )
        return FilterResult(
            verdict=FilterVerdict.PASS,
            filter_name="pattern",
            reason="No dangerous patterns detected",
        )
```
Allowlist-Based Output Control
For high-security environments, define exactly what the agent is allowed to output rather than trying to block everything dangerous:
```python
class TopicAllowlistFilter(ContentFilter):
    """Restricts agent output to pre-approved topics."""

    def __init__(self, allowed_topics: list[str], classifier_fn=None):
        self.allowed_topics = set(allowed_topics)
        self.classifier_fn = classifier_fn or self._default_classifier

    def _default_classifier(self, content: str) -> list[str]:
        """Simple keyword-based topic classification."""
        topic_keywords = {
            "product_info": ["product", "feature", "pricing", "plan"],
            "support": ["help", "issue", "error", "troubleshoot"],
            "billing": ["invoice", "payment", "subscription", "charge"],
        }
        detected = []
        content_lower = content.lower()
        for topic, keywords in topic_keywords.items():
            if any(kw in content_lower for kw in keywords):
                detected.append(topic)
        return detected if detected else ["unknown"]

    def evaluate(self, content: str, context: dict) -> FilterResult:
        detected_topics = self.classifier_fn(content)
        for topic in detected_topics:
            if topic not in self.allowed_topics:
                return FilterResult(
                    verdict=FilterVerdict.BLOCK,
                    filter_name="topic_allowlist",
                    reason=f"Topic '{topic}' not in allowlist",
                )
        return FilterResult(
            verdict=FilterVerdict.PASS,
            filter_name="topic_allowlist",
            reason="All topics within allowed set",
        )
```
Structured Output Validation
Enforce output schemas that make it structurally impossible for the agent to produce certain types of content:
```python
import re

from pydantic import BaseModel, field_validator


class SafeAgentResponse(BaseModel):
    """Validated agent response that prevents dangerous outputs."""

    message: str
    sources: list[str]
    confidence: float

    @field_validator("message")
    @classmethod
    def validate_message(cls, v: str) -> str:
        # Reject responses containing HTML tags
        if re.search(r"<[a-zA-Z][^>]*>", v):
            raise ValueError("Response must not contain HTML tags")
        # Reject responses exceeding length limit
        if len(v) > 5000:
            raise ValueError("Response exceeds maximum length")
        return v

    @field_validator("confidence")
    @classmethod
    def validate_confidence(cls, v: float) -> float:
        if not 0.0 <= v <= 1.0:
            raise ValueError("Confidence must be between 0 and 1")
        return v
```
```python
# Usage in pipeline
pipeline = ContentSecurityPipeline()
pipeline.add_filter(PatternFilter())
pipeline.add_filter(TopicAllowlistFilter(
    allowed_topics=["product_info", "support", "billing"]
))

# A 32+ character key so the api_key pattern actually matches
raw_output = "Your API key is sk-abc123def456ghi789jkl012mno345x0. Your next bill is $49."
safe_output, results = pipeline.process(raw_output)
# The api_key pattern triggers BLOCK, so safe_output is the refusal message
```
FAQ
How do I handle false positives in pattern-based filtering?
Track your false positive rate by logging all filter verdicts and reviewing blocked responses. Tune your patterns to be more specific — for example, use a Luhn check for credit card numbers rather than just matching digit patterns. Implement a review queue where blocked responses can be manually approved, and feed those approvals back into pattern refinement.
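The Luhn refinement can be sketched in a few lines; the `luhn_valid` helper below is illustrative, meant to validate regex candidates before the credit_card pattern redacts them:

```python
def luhn_valid(candidate: str) -> bool:
    """Return True if the digit string passes the Luhn checksum,
    so random 16-digit strings are not redacted as card numbers."""
    digits = [int(ch) for ch in candidate if ch.isdigit()]
    if not 13 <= len(digits) <= 19:  # plausible card number lengths
        return False
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:  # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0
```

Run this check on each regex match and only sanitize the matches that pass; everything else is almost certainly not a card number.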
Should I filter tool call outputs or only final responses?
Filter both. Tool call outputs can contain injected content that influences the agent's subsequent reasoning. Final responses are what users see. Apply the full security pipeline to tool outputs as they are ingested, and apply it again to the agent's final response before delivery.
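Wiring this in can be as small as a wrapper that routes every tool result through the policy before it enters the agent's context. A minimal sketch, where `run_tool` and the stub `apply_output_policy` are illustrative names standing in for the full pipeline:

```python
import re


def apply_output_policy(text: str) -> str:
    """Stub standing in for the full content security pipeline."""
    return re.sub(r"\bsk-[a-zA-Z0-9]+\b", "[KEY REDACTED]", text)


def run_tool(messages: list, tool_name: str, tool_fn) -> str:
    """Filter a tool result *before* it enters the agent's context,
    mirroring the filtering applied to the final response."""
    raw = tool_fn()
    safe = apply_output_policy(raw)  # same enforcement, earlier point
    messages.append({"role": "tool", "name": tool_name, "content": safe})
    return safe
```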
How does output filtering interact with streaming responses?
Streaming complicates content security because you cannot analyze the full response before sending tokens to the user. Buffer a configurable amount of text (for example, sentence boundaries) and run filters on each buffer before flushing to the client. For pattern-based filters, maintain state across buffers to detect patterns that span chunk boundaries.
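A minimal sketch of that buffering idea, assuming a `filter_fn` that maps a text chunk to its filtered form:

```python
import re


def filtered_stream(token_stream, filter_fn):
    """Buffer streamed tokens until a sentence boundary, filter each
    complete span, then flush it; leftover text is filtered at the end."""
    buffer = ""
    for token in token_stream:
        buffer += token
        # Find the last sentence-ending punctuation followed by whitespace.
        boundaries = list(re.finditer(r"[.!?]\s", buffer))
        if boundaries:
            cut = boundaries[-1].end()
            yield filter_fn(buffer[:cut])
            buffer = buffer[cut:]
    if buffer:
        yield filter_fn(buffer)
```

A production version would also carry redaction state across flushes so a pattern split over two buffers is still caught.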