Output Guardrails: Preventing AI Agents from Returning Harmful Content
Build output scanning systems that detect PII leaks, toxic content, format violations, and off-topic responses before they reach your users, with practical Python implementations for each guardrail type.
Why Input Validation Is Not Enough
Even with robust input validation, an AI agent can still produce harmful outputs. The model might hallucinate sensitive data, generate toxic content from benign prompts, leak system prompt details, or return responses that violate your application's business rules. Output guardrails are the last line of defense between the agent and your users.
This post builds a complete output guardrail system in Python with four types of checks: PII detection, toxicity filtering, format validation, and topic adherence.
Output Guardrail Architecture
The guardrail system mirrors the input validation pipeline but runs on the agent's response before it is delivered:
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional
import re

class GuardrailAction(Enum):
    ALLOW = "allow"
    REDACT = "redact"
    BLOCK = "block"

@dataclass
class GuardrailResult:
    action: GuardrailAction
    output: str
    violations: list[str] = field(default_factory=list)
    blocked_reason: Optional[str] = None

class OutputGuardrailPipeline:
    def __init__(self, guardrails: list):
        self.guardrails = guardrails

    def evaluate(self, agent_output: str) -> GuardrailResult:
        current_output = agent_output
        all_violations = []
        for guardrail in self.guardrails:
            result = guardrail.check(current_output)
            all_violations.extend(result.violations)
            if result.action == GuardrailAction.BLOCK:
                return GuardrailResult(
                    action=GuardrailAction.BLOCK,
                    output="",
                    violations=all_violations,
                    blocked_reason=result.blocked_reason,
                )
            if result.action == GuardrailAction.REDACT:
                current_output = result.output
        action = (
            GuardrailAction.REDACT if all_violations
            else GuardrailAction.ALLOW
        )
        return GuardrailResult(
            action=action,
            output=current_output,
            violations=all_violations,
        )
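The pipeline's two key behaviors, BLOCK short-circuits immediately while REDACT chains the cleaned output into the next guardrail, can be exercised with stub guardrails. This sketch reproduces the pipeline so it runs on its own; `UpperRedactor` and `NeverBlocks` are invented purely for the demo ("redacting" by uppercasing makes the chaining visible):

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional

class GuardrailAction(Enum):
    ALLOW = "allow"
    REDACT = "redact"
    BLOCK = "block"

@dataclass
class GuardrailResult:
    action: GuardrailAction
    output: str
    violations: list[str] = field(default_factory=list)
    blocked_reason: Optional[str] = None

class UpperRedactor:
    """Stub guardrail: 'redacts' by uppercasing, so chaining is visible."""
    def check(self, text: str) -> GuardrailResult:
        return GuardrailResult(GuardrailAction.REDACT, text.upper(), ["stub_redact"])

class NeverBlocks:
    """Stub guardrail that always allows."""
    def check(self, text: str) -> GuardrailResult:
        return GuardrailResult(GuardrailAction.ALLOW, text)

class OutputGuardrailPipeline:
    # Same evaluate() logic as the main article, reproduced for a runnable demo.
    def __init__(self, guardrails: list):
        self.guardrails = guardrails

    def evaluate(self, agent_output: str) -> GuardrailResult:
        current_output = agent_output
        all_violations: list[str] = []
        for guardrail in self.guardrails:
            result = guardrail.check(current_output)
            all_violations.extend(result.violations)
            if result.action == GuardrailAction.BLOCK:
                return GuardrailResult(
                    GuardrailAction.BLOCK, "", all_violations, result.blocked_reason
                )
            if result.action == GuardrailAction.REDACT:
                current_output = result.output
        action = GuardrailAction.REDACT if all_violations else GuardrailAction.ALLOW
        return GuardrailResult(action, current_output, all_violations)

result = OutputGuardrailPipeline([UpperRedactor(), NeverBlocks()]).evaluate("hello")
# The second guardrail sees "HELLO", and the final action is REDACT.
```

Note that the final action is REDACT whenever any guardrail reported a violation, even if a later guardrail allowed the text unchanged.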
Guardrail 1: PII Detection and Redaction
PII leaks are one of the highest-risk output failures. An agent might include email addresses, phone numbers, or Social Security numbers drawn from its training data or retrieved documents:
class PIIGuardrail:
    """Detect and redact personally identifiable information."""

    PII_PATTERNS = {
        "email": r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}",
        # Leading (?<!\d) rather than \b: \b never matches between a space
        # and "+", so a leading \b would silently skip "+1" numbers.
        "phone_us": r"(?<!\d)(?:\+1[-.\s]?)?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}\b",
        "ssn": r"\b\d{3}-\d{2}-\d{4}\b",
        "credit_card": r"\b(?:\d{4}[-\s]?){3}\d{4}\b",
        "ip_address": r"\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b",
    }

    REDACTION_MAP = {
        "email": "[EMAIL REDACTED]",
        "phone_us": "[PHONE REDACTED]",
        "ssn": "[SSN REDACTED]",
        "credit_card": "[CARD REDACTED]",
        "ip_address": "[IP REDACTED]",
    }

    def check(self, text: str) -> GuardrailResult:
        violations = []
        redacted = text
        for pii_type, pattern in self.PII_PATTERNS.items():
            matches = re.findall(pattern, redacted)
            if matches:
                violations.append(f"pii_{pii_type}:{len(matches)}_instances")
                replacement = self.REDACTION_MAP[pii_type]
                redacted = re.sub(pattern, replacement, redacted)
        if violations:
            return GuardrailResult(
                action=GuardrailAction.REDACT,
                output=redacted,
                violations=violations,
            )
        return GuardrailResult(
            action=GuardrailAction.ALLOW,
            output=text,
        )
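The redaction behavior is easy to check in isolation. This standalone sketch applies the same email and SSN patterns directly to a sample string (the contact details are invented for illustration):

```python
import re

# Subset of the PIIGuardrail patterns, applied directly for a quick check.
PATTERNS = {
    "email": (r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}", "[EMAIL REDACTED]"),
    "ssn": (r"\b\d{3}-\d{2}-\d{4}\b", "[SSN REDACTED]"),
}

def redact(text: str) -> tuple[str, list[str]]:
    """Redact known PII patterns, returning cleaned text and violation tags."""
    violations = []
    for name, (pattern, replacement) in PATTERNS.items():
        matches = re.findall(pattern, text)
        if matches:
            violations.append(f"pii_{name}:{len(matches)}_instances")
            text = re.sub(pattern, replacement, text)
    return text, violations

sample = "Contact jane.doe@example.com, SSN 123-45-6789."
cleaned, violations = redact(sample)
# cleaned == "Contact [EMAIL REDACTED], SSN [SSN REDACTED]."
```

Because redaction runs per pattern over the already-redacted text, earlier replacements can never be re-matched by later patterns.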
Guardrail 2: Toxicity and Harmful Content Filter
Toxicity detection prevents the agent from outputting offensive, violent, or otherwise harmful content:
class ToxicityGuardrail:
    """Detect toxic or harmful content via the OpenAI moderation API."""

    def __init__(self, threshold: float = 0.7):
        self.threshold = threshold
        # Imported lazily so the dependency stays optional; the client is
        # constructed once here rather than on every check() call.
        from openai import OpenAI
        self.client = OpenAI()

    def check(self, text: str) -> GuardrailResult:
        response = self.client.moderations.create(
            model="omni-moderation-latest",
            input=text,
        )
        result = response.results[0]
        # Flag anything the API flagged, plus any category whose score
        # crosses our own threshold.
        scores = result.category_scores.model_dump()
        flagged_categories = sorted(
            cat for cat, flagged in result.categories.model_dump().items()
            if flagged or scores.get(cat, 0.0) >= self.threshold
        )
        if flagged_categories:
            return GuardrailResult(
                action=GuardrailAction.BLOCK,
                output="",
                violations=[f"toxicity:{cat}" for cat in flagged_categories],
                blocked_reason="Response contained harmful content",
            )
        return GuardrailResult(
            action=GuardrailAction.ALLOW,
            output=text,
        )
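When the moderation API is unreachable (offline tests, network failures), a crude local fallback can run in its place. This is a sketch only: the word blocklist is hypothetical and a keyword match is no substitute for a real moderation model, but it keeps the pipeline fail-closed rather than skipping the check entirely:

```python
import re

# Hypothetical blocklist for illustration; a production system would use a
# moderation model, not keyword matching.
BLOCKLIST = {"slur_example", "threat_example"}

def local_toxicity_check(text: str) -> list[str]:
    """Return violation tags for any blocklisted terms found in the text."""
    words = set(re.findall(r"[a-z_]+", text.lower()))
    return sorted(f"toxicity:{w}" for w in (words & BLOCKLIST))

violations = local_toxicity_check("This contains threat_example language.")
```

The set intersection makes the check O(words) regardless of blocklist size, so it stays in the "microseconds" tier discussed in the FAQ below.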
Guardrail 3: Format and Schema Validation
When agents return structured data, format validation ensures correctness:
import json

class FormatGuardrail:
    """Validate that agent output conforms to expected schema."""

    def __init__(self, expected_format: str = "text", schema: dict | None = None):
        self.expected_format = expected_format
        self.schema = schema

    def check(self, text: str) -> GuardrailResult:
        if self.expected_format == "json":
            return self._validate_json(text)
        elif self.expected_format == "no_code":
            return self._validate_no_code(text)
        return GuardrailResult(action=GuardrailAction.ALLOW, output=text)

    def _validate_json(self, text: str) -> GuardrailResult:
        try:
            parsed = json.loads(text)
            if self.schema:
                missing = [k for k in self.schema.get("required", []) if k not in parsed]
                if missing:
                    return GuardrailResult(
                        action=GuardrailAction.BLOCK,
                        output=text,
                        violations=[f"missing_fields:{missing}"],
                        blocked_reason="Response missing required fields",
                    )
        except json.JSONDecodeError:
            return GuardrailResult(
                action=GuardrailAction.BLOCK,
                output=text,
                violations=["invalid_json"],
                blocked_reason="Response is not valid JSON",
            )
        return GuardrailResult(action=GuardrailAction.ALLOW, output=text)

    def _validate_no_code(self, text: str) -> GuardrailResult:
        # Catch both markdown fence styles plus telltale Python constructs.
        code_patterns = [r"```", r"~~~", r"import\s+\w+", r"def\s+\w+\("]
        for pattern in code_patterns:
            if re.search(pattern, text):
                return GuardrailResult(
                    action=GuardrailAction.BLOCK,
                    output=text,
                    violations=["contains_code"],
                    blocked_reason="Response contains code blocks",
                )
        return GuardrailResult(action=GuardrailAction.ALLOW, output=text)
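The required-fields logic can be verified standalone. This sketch runs the same parse-then-check steps on three hand-made payloads (the `status`/`id` schema is invented for the example):

```python
import json

def check_required(text: str, required: list[str]) -> list[str]:
    """Parse JSON text and report any missing required keys."""
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError:
        return ["invalid_json"]
    missing = [k for k in required if k not in parsed]
    return [f"missing_fields:{missing}"] if missing else []

ok = check_required('{"status": "done", "id": 7}', ["status", "id"])      # passes
bad = check_required('{"status": "done"}', ["status", "id"])              # missing id
broken = check_required("not json", ["status"])                           # parse failure
```

Note that a parse failure and a schema failure produce distinct violation tags, which matters when you review guardrail logs later.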
Guardrail 4: Topic Adherence
Ensure the agent stays on topic and does not reveal system internals:
class TopicAdherenceGuardrail:
    """Block responses that leak system prompts or go off-topic."""

    SYSTEM_LEAK_PATTERNS = [
        r"my (system |initial )?instructions (are|say|tell)",
        r"I was (told|instructed|programmed) to",
        r"my (system )?prompt (is|says|contains)",
    ]

    def __init__(self, allowed_topics: list[str] | None = None):
        # allowed_topics is reserved for an off-topic classifier (embedding
        # similarity or an LLM judge); only leak patterns are checked below.
        self.allowed_topics = allowed_topics

    def check(self, text: str) -> GuardrailResult:
        violations = []
        for pattern in self.SYSTEM_LEAK_PATTERNS:
            if re.search(pattern, text, re.IGNORECASE):
                violations.append("system_prompt_leak")
                return GuardrailResult(
                    action=GuardrailAction.BLOCK,
                    output="",
                    violations=violations,
                    blocked_reason="Response may reveal system instructions",
                )
        return GuardrailResult(
            action=GuardrailAction.ALLOW,
            output=text,
            violations=violations,
        )
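The leak patterns are worth sanity-checking against both a leaking and a benign sentence, since false positives here block the whole response. A standalone sketch using the same three patterns:

```python
import re

LEAK_PATTERNS = [
    r"my (system |initial )?instructions (are|say|tell)",
    r"I was (told|instructed|programmed) to",
    r"my (system )?prompt (is|says|contains)",
]

def leaks_system_prompt(text: str) -> bool:
    """True when the text appears to describe its own instructions."""
    return any(re.search(p, text, re.IGNORECASE) for p in LEAK_PATTERNS)

leak = leaks_system_prompt("My instructions say I must always be polite.")
clean = leaks_system_prompt("Your order ships on Tuesday.")
```

Pattern lists like this inevitably lag behind new phrasings, so they belong in version control with a test file of known leak and non-leak sentences.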
Assembling the Pipeline
guardrails = OutputGuardrailPipeline(guardrails=[
    PIIGuardrail(),
    ToxicityGuardrail(),
    TopicAdherenceGuardrail(),
    FormatGuardrail(expected_format="text"),
])

def deliver_response(agent_output: str) -> str:
    result = guardrails.evaluate(agent_output)
    if result.action == GuardrailAction.BLOCK:
        return "I'm unable to provide that response. Please rephrase your question."
    return result.output
FAQ
Do output guardrails add noticeable latency?
Regex-based checks like PII detection add microseconds. LLM-based checks like toxicity scoring and topic classification add 200-500ms per call. The best strategy is to run fast regex checks first and only invoke LLM-based guardrails when the fast checks pass. For latency-sensitive applications, you can run guardrail checks in parallel with response streaming and cancel the stream if a violation is detected.
Should I block or redact PII in agent outputs?
It depends on context. For customer-facing applications, redaction is often better because it preserves the useful parts of the response while removing sensitive data. For internal tools where the user might need the data, logging the PII detection and alerting is better than silently redacting. Always log PII detections regardless of the action taken.
How do I handle false positives in output guardrails?
Log every guardrail trigger with the original output, the violation type, and whether the action was block or redact. Review these logs weekly to tune your patterns and thresholds. Build a test suite of known-good outputs that should pass all guardrails and run it as part of your CI pipeline to catch regressions.
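That regression suite can start as nothing more than a list of known-good outputs asserted against the fast regex checks. A minimal sketch with one PII pattern (the sample outputs are invented; in practice the list grows from your reviewed logs):

```python
import re

EMAIL = re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")

# Outputs that must pass every guardrail untouched; grow this list over time
# from logged false positives you have triaged.
KNOWN_GOOD = [
    "Your order is scheduled for delivery on Tuesday.",
    "The total comes to $42.50 including tax.",
]

failures = [text for text in KNOWN_GOOD if EMAIL.search(text)]
# An empty failures list means no false positives on the known-good set.
```

Wire the same loop into your CI runner so a tightened regex that starts matching benign text fails the build instead of silently blocking users.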