Output Guardrails: Preventing AI Agents from Returning Harmful Content
Build output scanning systems that detect PII leaks, toxic content, format violations, and off-topic responses before they reach your users, with practical Python implementations for each guardrail type.
Why Input Validation Is Not Enough
Even with robust input validation, an AI agent can still produce harmful outputs. The model might hallucinate sensitive data, generate toxic content from benign prompts, leak system prompt details, or return responses that violate your application's business rules. Output guardrails are the last line of defense between the agent and your users.
This post builds a complete output guardrail system in Python with four types of checks: PII detection, toxicity filtering, format validation, and topic adherence.
Output Guardrail Architecture
The guardrail system mirrors the input validation pipeline but runs on the agent's response before it is delivered:
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional
import re

class GuardrailAction(Enum):
    ALLOW = "allow"
    REDACT = "redact"
    BLOCK = "block"

@dataclass
class GuardrailResult:
    action: GuardrailAction
    output: str
    violations: list[str] = field(default_factory=list)
    blocked_reason: Optional[str] = None

class OutputGuardrailPipeline:
    def __init__(self, guardrails: list):
        self.guardrails = guardrails

    def evaluate(self, agent_output: str) -> GuardrailResult:
        current_output = agent_output
        all_violations = []
        for guardrail in self.guardrails:
            result = guardrail.check(current_output)
            all_violations.extend(result.violations)
            if result.action == GuardrailAction.BLOCK:
                return GuardrailResult(
                    action=GuardrailAction.BLOCK,
                    output="",
                    violations=all_violations,
                    blocked_reason=result.blocked_reason,
                )
            if result.action == GuardrailAction.REDACT:
                current_output = result.output
        action = (
            GuardrailAction.REDACT if all_violations
            else GuardrailAction.ALLOW
        )
        return GuardrailResult(
            action=action,
            output=current_output,
            violations=all_violations,
        )
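The pipeline's two key behaviors, BLOCK short-circuits immediately while REDACT chains the cleaned output into the next guardrail, can be exercised with stub guardrails. This sketch reproduces the pipeline so it runs on its own; `UpperRedactor` and `NeverBlocks` are invented purely for the demo ("redacting" by uppercasing makes the chaining visible):

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional

class GuardrailAction(Enum):
    ALLOW = "allow"
    REDACT = "redact"
    BLOCK = "block"

@dataclass
class GuardrailResult:
    action: GuardrailAction
    output: str
    violations: list[str] = field(default_factory=list)
    blocked_reason: Optional[str] = None

class UpperRedactor:
    """Stub guardrail: 'redacts' by uppercasing, so chaining is visible."""
    def check(self, text: str) -> GuardrailResult:
        return GuardrailResult(GuardrailAction.REDACT, text.upper(), ["stub_redact"])

class NeverBlocks:
    """Stub guardrail that always allows."""
    def check(self, text: str) -> GuardrailResult:
        return GuardrailResult(GuardrailAction.ALLOW, text)

class OutputGuardrailPipeline:
    # Same evaluate() logic as the main article, reproduced for a runnable demo.
    def __init__(self, guardrails: list):
        self.guardrails = guardrails

    def evaluate(self, agent_output: str) -> GuardrailResult:
        current_output = agent_output
        all_violations: list[str] = []
        for guardrail in self.guardrails:
            result = guardrail.check(current_output)
            all_violations.extend(result.violations)
            if result.action == GuardrailAction.BLOCK:
                return GuardrailResult(
                    GuardrailAction.BLOCK, "", all_violations, result.blocked_reason
                )
            if result.action == GuardrailAction.REDACT:
                current_output = result.output
        action = GuardrailAction.REDACT if all_violations else GuardrailAction.ALLOW
        return GuardrailResult(action, current_output, all_violations)

result = OutputGuardrailPipeline([UpperRedactor(), NeverBlocks()]).evaluate("hello")
# The second guardrail sees "HELLO", and the final action is REDACT.
```

Note that the final action is REDACT whenever any guardrail reported a violation, even if a later guardrail allowed the text unchanged.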
Guardrail 1: PII Detection and Redaction
PII leaks are one of the highest-risk output failures. An agent might include email addresses, phone numbers, or Social Security numbers drawn from its training data or retrieved documents:
class PIIGuardrail:
    """Detect and redact personally identifiable information."""

    PII_PATTERNS = {
        "email": r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}",
        # Leading (?<!\d) rather than \b: \b never matches between a space
        # and "+", so a leading \b would silently skip "+1" numbers.
        "phone_us": r"(?<!\d)(?:\+1[-.\s]?)?\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}\b",
        "ssn": r"\b\d{3}-\d{2}-\d{4}\b",
        "credit_card": r"\b(?:\d{4}[-\s]?){3}\d{4}\b",
        "ip_address": r"\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b",
    }

    REDACTION_MAP = {
        "email": "[EMAIL REDACTED]",
        "phone_us": "[PHONE REDACTED]",
        "ssn": "[SSN REDACTED]",
        "credit_card": "[CARD REDACTED]",
        "ip_address": "[IP REDACTED]",
    }

    def check(self, text: str) -> GuardrailResult:
        violations = []
        redacted = text
        for pii_type, pattern in self.PII_PATTERNS.items():
            matches = re.findall(pattern, redacted)
            if matches:
                violations.append(f"pii_{pii_type}:{len(matches)}_instances")
                replacement = self.REDACTION_MAP[pii_type]
                redacted = re.sub(pattern, replacement, redacted)
        if violations:
            return GuardrailResult(
                action=GuardrailAction.REDACT,
                output=redacted,
                violations=violations,
            )
        return GuardrailResult(
            action=GuardrailAction.ALLOW,
            output=text,
        )
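The redaction behavior is easy to check in isolation. This standalone sketch applies the same email and SSN patterns directly to a sample string (the contact details are invented for illustration):

```python
import re

# Subset of the PIIGuardrail patterns, applied directly for a quick check.
PATTERNS = {
    "email": (r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}", "[EMAIL REDACTED]"),
    "ssn": (r"\b\d{3}-\d{2}-\d{4}\b", "[SSN REDACTED]"),
}

def redact(text: str) -> tuple[str, list[str]]:
    """Redact known PII patterns, returning cleaned text and violation tags."""
    violations = []
    for name, (pattern, replacement) in PATTERNS.items():
        matches = re.findall(pattern, text)
        if matches:
            violations.append(f"pii_{name}:{len(matches)}_instances")
            text = re.sub(pattern, replacement, text)
    return text, violations

sample = "Contact jane.doe@example.com, SSN 123-45-6789."
cleaned, violations = redact(sample)
# cleaned == "Contact [EMAIL REDACTED], SSN [SSN REDACTED]."
```

Because redaction runs per pattern over the already-redacted text, earlier replacements can never be re-matched by later patterns.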
Guardrail 2: Toxicity and Harmful Content Filter
Toxicity detection prevents the agent from outputting offensive, violent, or otherwise harmful content:
class ToxicityGuardrail:
    """Detect toxic or harmful content via the OpenAI moderation API."""

    def __init__(self, threshold: float = 0.7):
        self.threshold = threshold
        # Imported lazily so the dependency stays optional; the client is
        # constructed once here rather than on every check() call.
        from openai import OpenAI
        self.client = OpenAI()

    def check(self, text: str) -> GuardrailResult:
        response = self.client.moderations.create(
            model="omni-moderation-latest",
            input=text,
        )
        result = response.results[0]
        # Flag anything the API flagged, plus any category whose score
        # crosses our own threshold.
        scores = result.category_scores.model_dump()
        flagged_categories = sorted(
            cat for cat, flagged in result.categories.model_dump().items()
            if flagged or scores.get(cat, 0.0) >= self.threshold
        )
        if flagged_categories:
            return GuardrailResult(
                action=GuardrailAction.BLOCK,
                output="",
                violations=[f"toxicity:{cat}" for cat in flagged_categories],
                blocked_reason="Response contained harmful content",
            )
        return GuardrailResult(
            action=GuardrailAction.ALLOW,
            output=text,
        )
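When the moderation API is unreachable (offline tests, network failures), a crude local fallback can run in its place. This is a sketch only: the word blocklist is hypothetical and a keyword match is no substitute for a real moderation model, but it keeps the pipeline fail-closed rather than skipping the check entirely:

```python
import re

# Hypothetical blocklist for illustration; a production system would use a
# moderation model, not keyword matching.
BLOCKLIST = {"slur_example", "threat_example"}

def local_toxicity_check(text: str) -> list[str]:
    """Return violation tags for any blocklisted terms found in the text."""
    words = set(re.findall(r"[a-z_]+", text.lower()))
    return sorted(f"toxicity:{w}" for w in (words & BLOCKLIST))

violations = local_toxicity_check("This contains threat_example language.")
```

The set intersection makes the check O(words) regardless of blocklist size, so it stays in the "microseconds" tier discussed in the FAQ below.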
Guardrail 3: Format and Schema Validation
When agents return structured data, format validation ensures correctness:
import json

class FormatGuardrail:
    """Validate that agent output conforms to expected schema."""

    def __init__(self, expected_format: str = "text", schema: dict | None = None):
        self.expected_format = expected_format
        self.schema = schema

    def check(self, text: str) -> GuardrailResult:
        if self.expected_format == "json":
            return self._validate_json(text)
        elif self.expected_format == "no_code":
            return self._validate_no_code(text)
        return GuardrailResult(action=GuardrailAction.ALLOW, output=text)

    def _validate_json(self, text: str) -> GuardrailResult:
        try:
            parsed = json.loads(text)
            if self.schema:
                missing = [k for k in self.schema.get("required", []) if k not in parsed]
                if missing:
                    return GuardrailResult(
                        action=GuardrailAction.BLOCK,
                        output=text,
                        violations=[f"missing_fields:{missing}"],
                        blocked_reason="Response missing required fields",
                    )
        except json.JSONDecodeError:
            return GuardrailResult(
                action=GuardrailAction.BLOCK,
                output=text,
                violations=["invalid_json"],
                blocked_reason="Response is not valid JSON",
            )
        return GuardrailResult(action=GuardrailAction.ALLOW, output=text)

    def _validate_no_code(self, text: str) -> GuardrailResult:
        # Catch both markdown fence styles plus telltale Python constructs.
        code_patterns = [r"```", r"~~~", r"import\s+\w+", r"def\s+\w+\("]
        for pattern in code_patterns:
            if re.search(pattern, text):
                return GuardrailResult(
                    action=GuardrailAction.BLOCK,
                    output=text,
                    violations=["contains_code"],
                    blocked_reason="Response contains code blocks",
                )
        return GuardrailResult(action=GuardrailAction.ALLOW, output=text)
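The required-fields logic can be verified standalone. This sketch runs the same parse-then-check steps on three hand-made payloads (the `status`/`id` schema is invented for the example):

```python
import json

def check_required(text: str, required: list[str]) -> list[str]:
    """Parse JSON text and report any missing required keys."""
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError:
        return ["invalid_json"]
    missing = [k for k in required if k not in parsed]
    return [f"missing_fields:{missing}"] if missing else []

ok = check_required('{"status": "done", "id": 7}', ["status", "id"])      # passes
bad = check_required('{"status": "done"}', ["status", "id"])              # missing id
broken = check_required("not json", ["status"])                           # parse failure
```

Note that a parse failure and a schema failure produce distinct violation tags, which matters when you review guardrail logs later.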
Guardrail 4: Topic Adherence
Ensure the agent stays on topic and does not reveal system internals:
class TopicAdherenceGuardrail:
    """Block responses that leak system prompts or go off-topic."""

    SYSTEM_LEAK_PATTERNS = [
        r"my (system |initial )?instructions (are|say|tell)",
        r"I was (told|instructed|programmed) to",
        r"my (system )?prompt (is|says|contains)",
    ]

    def __init__(self, allowed_topics: list[str] | None = None):
        # allowed_topics is reserved for an off-topic classifier (embedding
        # similarity or an LLM judge); only leak patterns are checked below.
        self.allowed_topics = allowed_topics

    def check(self, text: str) -> GuardrailResult:
        violations = []
        for pattern in self.SYSTEM_LEAK_PATTERNS:
            if re.search(pattern, text, re.IGNORECASE):
                violations.append("system_prompt_leak")
                return GuardrailResult(
                    action=GuardrailAction.BLOCK,
                    output="",
                    violations=violations,
                    blocked_reason="Response may reveal system instructions",
                )
        return GuardrailResult(
            action=GuardrailAction.ALLOW,
            output=text,
            violations=violations,
        )
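The leak patterns are worth sanity-checking against both a leaking and a benign sentence, since false positives here block the whole response. A standalone sketch using the same three patterns:

```python
import re

LEAK_PATTERNS = [
    r"my (system |initial )?instructions (are|say|tell)",
    r"I was (told|instructed|programmed) to",
    r"my (system )?prompt (is|says|contains)",
]

def leaks_system_prompt(text: str) -> bool:
    """True when the text appears to describe its own instructions."""
    return any(re.search(p, text, re.IGNORECASE) for p in LEAK_PATTERNS)

leak = leaks_system_prompt("My instructions say I must always be polite.")
clean = leaks_system_prompt("Your order ships on Tuesday.")
```

Pattern lists like this inevitably lag behind new phrasings, so they belong in version control with a test file of known leak and non-leak sentences.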
Assembling the Pipeline
guardrails = OutputGuardrailPipeline(guardrails=[
    PIIGuardrail(),
    ToxicityGuardrail(),
    TopicAdherenceGuardrail(),
    FormatGuardrail(expected_format="text"),
])

def deliver_response(agent_output: str) -> str:
    result = guardrails.evaluate(agent_output)
    if result.action == GuardrailAction.BLOCK:
        return "I'm unable to provide that response. Please rephrase your question."
    return result.output
FAQ
Do output guardrails add noticeable latency?
Regex-based checks like PII detection add microseconds. LLM-based checks like toxicity scoring and topic classification add 200-500ms per call. The best strategy is to run fast regex checks first and only invoke LLM-based guardrails when the fast checks pass. For latency-sensitive applications, you can run guardrail checks in parallel with response streaming and cancel the stream if a violation is detected.
Should I block or redact PII in agent outputs?
It depends on context. For customer-facing applications, redaction is often better because it preserves the useful parts of the response while removing sensitive data. For internal tools where the user might need the data, logging the PII detection and alerting is better than silently redacting. Always log PII detections regardless of the action taken.
How do I handle false positives in output guardrails?
Log every guardrail trigger with the original output, the violation type, and whether the action was block or redact. Review these logs weekly to tune your patterns and thresholds. Build a test suite of known-good outputs that should pass all guardrails and run it as part of your CI pipeline to catch regressions.
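That regression suite can start as nothing more than a list of known-good outputs asserted against the fast regex checks. A minimal sketch with one PII pattern (the sample outputs are invented; in practice the list grows from your reviewed logs):

```python
import re

EMAIL = re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")

# Outputs that must pass every guardrail untouched; grow this list over time
# from logged false positives you have triaged.
KNOWN_GOOD = [
    "Your order is scheduled for delivery on Tuesday.",
    "The total comes to $42.50 including tax.",
]

failures = [text for text in KNOWN_GOOD if EMAIL.search(text)]
# An empty failures list means no false positives on the known-good set.
```

Wire the same loop into your CI runner so a tightened regex that starts matching benign text fails the build instead of silently blocking users.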