Hallucination Detection and Mitigation in AI Agent Systems
Learn practical techniques to detect and reduce LLM hallucinations in AI agents, including grounding with source documents, citation verification, confidence scoring, and human-in-the-loop escalation patterns.
The Hallucination Problem in Agentic Systems
When a chatbot hallucinates, a user gets wrong information. When an AI agent hallucinates, it takes wrong actions — booking fake appointments, citing nonexistent regulations, or executing tool calls based on fabricated data. In agentic systems, hallucination is not just an accuracy problem; it is a safety problem.
Hallucinations in agents fall into three categories: factual errors (stating incorrect facts), fabrication (inventing data, URLs, or citations that do not exist), and reasoning errors (drawing wrong conclusions from correct data). Each requires different detection and mitigation strategies.
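Fabrication is often the easiest of the three to detect mechanically. As a minimal sketch (a hypothetical helper, not part of any library), a regex pass can flag URLs in an agent's output that never appear in its source material — invented URLs are a classic fabrication signal:

```python
import re

URL_PATTERN = re.compile(r"https?://[^\s)\"']+")

def find_fabricated_urls(response: str, source_texts: list[str]) -> list[str]:
    """Return URLs in the response that appear in none of the sources."""
    combined_sources = " ".join(source_texts)
    return [
        url for url in URL_PATTERN.findall(response)
        if url not in combined_sources
    ]
```

Flagged URLs can then be double-checked (for example with a HEAD request) or stripped before the response is delivered.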
Technique 1: Source Grounding with Citation Verification
The most effective hallucination mitigation is grounding agent responses in retrieved source documents and verifying that claims map back to those sources:
```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class SourceDocument:
    id: str
    content: str
    metadata: dict


@dataclass
class CitedClaim:
    claim: str
    source_id: Optional[str]
    source_text: Optional[str]
    verified: bool
    confidence: float


class GroundedResponseGenerator:
    """Generate responses grounded in source documents with citation tracking."""

    def __init__(self, llm_client):
        self.llm = llm_client

    def generate_grounded_response(
        self,
        query: str,
        sources: list[SourceDocument],
    ) -> tuple[str, list[CitedClaim]]:
        source_context = "\n\n".join(
            f"[Source {s.id}]: {s.content}" for s in sources
        )
        response = self.llm.chat.completions.create(
            model="gpt-4o",
            messages=[
                {
                    "role": "system",
                    "content": """Answer the user's question using ONLY the provided sources.
For every factual claim, include a citation in the format [Source X].
If the sources do not contain enough information to answer,
say "I don't have enough information to answer that."
Never make claims that are not supported by the provided sources.""",
                },
                {
                    "role": "user",
                    "content": f"Sources:\n{source_context}\n\nQuestion: {query}",
                },
            ],
            temperature=0,
        )
        answer = response.choices[0].message.content or ""
        claims = self._extract_and_verify_claims(answer, sources)
        return answer, claims

    def _extract_and_verify_claims(
        self,
        response: str,
        sources: list[SourceDocument],
    ) -> list[CitedClaim]:
        verification_response = self.llm.chat.completions.create(
            model="gpt-4o",
            messages=[
                {
                    "role": "system",
                    "content": """Extract each factual claim from the response.
For each claim, output a JSON array with objects:
{"claim": "...", "source_id": "...", "verified": true/false, "confidence": 0.0-1.0}
Set verified=true only if the claim is directly supported by the cited source.""",
                },
                {
                    "role": "user",
                    "content": f"Response: {response}\n\nSources: {[s.content for s in sources]}",
                },
            ],
            temperature=0,
        )
        claims_data = json.loads(
            verification_response.choices[0].message.content or "[]"
        )
        return [
            CitedClaim(
                claim=c["claim"],
                source_id=c.get("source_id"),
                source_text=self._get_source_text(c.get("source_id"), sources),
                verified=c["verified"],
                confidence=c["confidence"],
            )
            for c in claims_data
        ]

    def _get_source_text(
        self,
        source_id: Optional[str],
        sources: list[SourceDocument],
    ) -> Optional[str]:
        if not source_id:
            return None
        for s in sources:
            if s.id == source_id:
                return s.content
        return None
```
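Before spending a second LLM call on verification, a cheap lexical screen can triage the obvious cases. The sketch below (a hypothetical pre-filter, not part of the generator above) measures what fraction of a claim's content words appear in the cited source; very low scores suggest the claim deserves full LLM verification:

```python
def lexical_support_score(claim: str, source_text: str) -> float:
    """Fraction of the claim's content words that occur in the source text.

    A crude screen: it ignores word order and meaning, so use it only to
    prioritize which claims get a full LLM verification pass.
    """
    stopwords = {"the", "a", "an", "is", "are", "was", "were",
                 "of", "to", "in", "and", "that"}
    claim_words = {w for w in claim.lower().split() if w not in stopwords}
    if not claim_words:
        return 1.0  # nothing substantive to check
    source_words = set(source_text.lower().split())
    return len(claim_words & source_words) / len(claim_words)
```

Because it is pure string matching, this runs in microseconds and adds no API cost; it complements rather than replaces the LLM-based verification step.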
Technique 2: Confidence Scoring
Confidence scoring estimates how likely the agent's output is to be correct, enabling conditional handling of low-confidence responses:
```python
class ConfidenceScorer:
    """Score the confidence of agent responses using multiple signals."""

    def __init__(self, llm_client):
        self.llm = llm_client

    def score_response(
        self,
        query: str,
        response: str,
        sources: list[SourceDocument] | None = None,
    ) -> dict:
        signals = {}
        # Signal 1: Self-consistency (generate multiple responses, check agreement)
        signals["self_consistency"] = self._check_self_consistency(query)
        # Signal 2: Source coverage
        if sources:
            signals["source_coverage"] = self._check_source_coverage(
                response, sources
            )
        # Signal 3: Hedging language detection
        signals["hedging_score"] = self._detect_hedging(response)
        # Weighted average (missing signals default to a neutral 0.5)
        weights = {"self_consistency": 0.4, "source_coverage": 0.4, "hedging_score": 0.2}
        total = sum(
            signals.get(k, 0.5) * v
            for k, v in weights.items()
        )
        signals["overall_confidence"] = round(total, 3)
        return signals

    def _check_self_consistency(self, query: str, n_samples: int = 3) -> float:
        """Generate multiple responses and measure agreement."""
        responses = []
        for _ in range(n_samples):
            result = self.llm.chat.completions.create(
                model="gpt-4o-mini",
                messages=[{"role": "user", "content": query}],
                temperature=0.7,
                max_tokens=200,
            )
            responses.append(result.choices[0].message.content)
        agreement_check = self.llm.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{
                "role": "user",
                "content": f"""Rate the factual agreement between these responses
from 0.0 (completely contradictory) to 1.0 (fully consistent).
Respond with ONLY a number.
Responses: {responses}""",
            }],
            temperature=0,
        )
        try:
            return float(agreement_check.choices[0].message.content.strip())
        except ValueError:
            return 0.5

    def _check_source_coverage(
        self,
        response: str,
        sources: list[SourceDocument],
    ) -> float:
        """Check what fraction of response claims are covered by sources."""
        source_text = " ".join(s.content for s in sources)
        check = self.llm.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{
                "role": "user",
                "content": f"""What fraction of factual claims in the Response
are supported by the Source text? Respond with ONLY a number between 0.0 and 1.0.
Response: {response}
Source: {source_text}""",
            }],
            temperature=0,
        )
        try:
            return float(check.choices[0].message.content.strip())
        except ValueError:
            return 0.5

    def _detect_hedging(self, response: str) -> float:
        """Detect hedging language as a proxy for uncertainty."""
        hedging_phrases = [
            "I think", "probably", "might be", "I'm not sure",
            "it's possible", "approximately", "roughly",
            "I believe", "as far as I know", "it seems",
        ]
        lower_resp = response.lower()
        hedge_count = sum(1 for p in hedging_phrases if p.lower() in lower_resp)
        # More hedging means lower confidence
        return max(0.0, 1.0 - hedge_count * 0.15)
```
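The LLM-judged agreement check above costs one extra API call per scoring pass. A cheaper and rougher alternative, sketched below, is mean pairwise Jaccard word overlap between the sampled responses. It is purely lexical (paraphrases score lower than they should), so treat it as a complement to, not a replacement for, the LLM judge:

```python
from itertools import combinations

def jaccard_consistency(responses: list[str]) -> float:
    """Mean pairwise Jaccard word overlap across sampled responses (0.0-1.0)."""
    def jaccard(a: str, b: str) -> float:
        wa, wb = set(a.lower().split()), set(b.lower().split())
        if not wa and not wb:
            return 1.0
        return len(wa & wb) / len(wa | wb)

    pairs = list(combinations(responses, 2))
    if not pairs:
        return 1.0  # a single sample is trivially consistent with itself
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)
```

One reasonable design is to gate on the cheap score first and only invoke the LLM judge when lexical agreement falls below a threshold.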
Technique 3: Human-in-the-Loop Escalation
When confidence is low, escalate to a human reviewer instead of delivering a potentially hallucinated response:
```python
from enum import Enum


class EscalationLevel(Enum):
    AUTO_APPROVE = "auto_approve"
    FLAG_FOR_REVIEW = "flag_for_review"
    REQUIRE_APPROVAL = "require_approval"
    BLOCK = "block"


class HumanInTheLoopEscalation:
    def __init__(
        self,
        auto_approve_threshold: float = 0.85,
        review_threshold: float = 0.6,
        block_threshold: float = 0.3,
    ):
        self.auto_approve = auto_approve_threshold
        self.review = review_threshold
        self.block = block_threshold

    def determine_escalation(self, confidence: float) -> EscalationLevel:
        if confidence >= self.auto_approve:
            return EscalationLevel.AUTO_APPROVE
        elif confidence >= self.review:
            return EscalationLevel.FLAG_FOR_REVIEW
        elif confidence >= self.block:
            return EscalationLevel.REQUIRE_APPROVAL
        else:
            return EscalationLevel.BLOCK

    async def handle_response(
        self,
        response: str,
        confidence: float,
        query: str,
    ) -> str:
        level = self.determine_escalation(confidence)
        if level == EscalationLevel.AUTO_APPROVE:
            return response
        if level == EscalationLevel.FLAG_FOR_REVIEW:
            await self._queue_for_review(query, response, confidence)
            return response + "\n\n_This response has been flagged for review._"
        if level == EscalationLevel.REQUIRE_APPROVAL:
            await self._notify_reviewer(query, response, confidence)
            return ("This question requires human verification. "
                    "A team member will respond shortly.")
        return "I don't have enough reliable information to answer this question."

    async def _queue_for_review(self, query, response, confidence):
        """Add to async review queue — reviewer checks later."""
        pass  # Integrate with your task queue

    async def _notify_reviewer(self, query, response, confidence):
        """Send real-time notification to reviewer."""
        pass  # Integrate with Slack, email, etc.
```
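The threshold comparison reduces to a pure function of the confidence score, which makes the routing logic easy to unit-test in isolation. A standalone sketch using the same default thresholds as above (the `route` name is illustrative):

```python
from enum import Enum

class EscalationLevel(Enum):
    AUTO_APPROVE = "auto_approve"
    FLAG_FOR_REVIEW = "flag_for_review"
    REQUIRE_APPROVAL = "require_approval"
    BLOCK = "block"

def route(confidence: float,
          auto_approve: float = 0.85,
          review: float = 0.6,
          block: float = 0.3) -> EscalationLevel:
    """Map a confidence score to an escalation level; higher means more trusted."""
    if confidence >= auto_approve:
        return EscalationLevel.AUTO_APPROVE
    if confidence >= review:
        return EscalationLevel.FLAG_FOR_REVIEW
    if confidence >= block:
        return EscalationLevel.REQUIRE_APPROVAL
    return EscalationLevel.BLOCK
```

Note that the thresholds are inclusive at each boundary: a confidence of exactly 0.85 auto-approves. Pinning down boundary behavior like this in tests prevents off-by-one surprises when you later tune the thresholds.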
Putting It All Together
```python
async def handle_agent_query(query: str, sources: list[SourceDocument]) -> str:
    # llm_client: an already-configured OpenAI-compatible client instance
    grounded_gen = GroundedResponseGenerator(llm_client)
    scorer = ConfidenceScorer(llm_client)
    escalation = HumanInTheLoopEscalation()

    response, claims = grounded_gen.generate_grounded_response(query, sources)

    # Annotate the response if more than 30% of claims failed verification
    unverified = [c for c in claims if not c.verified]
    if len(unverified) > len(claims) * 0.3:
        response += "\n\nNote: Some claims could not be verified against sources."

    scores = scorer.score_response(query, response, sources)
    confidence = scores["overall_confidence"]
    return await escalation.handle_response(response, confidence, query)
```
FAQ
How much does hallucination detection add to latency and cost?
Self-consistency checking multiplies your LLM calls by the number of samples (typically 3-5x). Citation verification adds one additional LLM call. For latency-sensitive applications, run these checks asynchronously — deliver the initial response immediately and update it if verification fails. For high-stakes applications (medical, legal, financial), the additional 1-3 seconds and cost are well justified.
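The "respond now, verify in the background" pattern described above can be sketched with plain asyncio: return the response immediately, run the verifier as a fire-and-forget task, and invoke a failure handler if the check does not pass. The `verify_async` and `on_failure` callables here are placeholders for your own verifier and remediation logic:

```python
import asyncio

async def deliver_then_verify(response: str, verify_async, on_failure) -> str:
    """Return the response immediately; verify it in a background task.

    verify_async(response) -> bool; on_failure(response) runs if it fails.
    """
    async def _verify():
        if not await verify_async(response):
            await on_failure(response)

    # Fire-and-forget: the caller gets the response without waiting
    asyncio.get_running_loop().create_task(_verify())
    return response
```

In production you would keep a reference to the task (or use a TaskGroup) so failures in the verifier itself are not silently dropped, and `on_failure` might retract the message, post a correction, or alert a reviewer.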
Can I fine-tune a model to hallucinate less?
Fine-tuning on high-quality, factually verified data can reduce hallucinations in a specific domain. However, fine-tuning cannot eliminate hallucinations entirely because they are an inherent property of how language models generate text. The detection and mitigation strategies in this post provide defense regardless of the model's base hallucination rate. Use fine-tuning to reduce the rate, and use these techniques to catch what remains.
What is the difference between RAG grounding and the citation verification shown here?
RAG (Retrieval-Augmented Generation) provides relevant source documents to the model as context. Citation verification goes a step further by checking that the model's claims actually match what those sources say. RAG reduces hallucination by giving the model correct information to reference, but the model can still hallucinate claims that are not in the retrieved documents. Citation verification catches those cases.
#HallucinationDetection #AISafety #Grounding #RAG #Python #AgenticAI #LearnAI #AIEngineering
CallSphere Team
Expert insights on AI voice agents and customer communication automation.