Hallucination Detection and Mitigation in AI Agent Systems
Learn practical techniques to detect and reduce LLM hallucinations in AI agents, including grounding with source documents, citation verification, confidence scoring, and human-in-the-loop escalation patterns.
The Hallucination Problem in Agentic Systems
When a chatbot hallucinates, a user gets wrong information. When an AI agent hallucinates, it takes wrong actions — booking fake appointments, citing nonexistent regulations, or executing tool calls based on fabricated data. In agentic systems, hallucination is not just an accuracy problem; it is a safety problem.
Hallucinations in agents fall into three categories: factual errors (stating incorrect facts), fabrication (inventing data, URLs, or citations that do not exist), and reasoning errors (drawing wrong conclusions from correct data). Each requires different detection and mitigation strategies.
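Fabrication is often the easiest of the three to detect mechanically. As a minimal sketch (a hypothetical helper, not part of any library), a regex pass can flag URLs in an agent's output that never appear in its source material — invented URLs are a classic fabrication signal:

```python
import re

URL_PATTERN = re.compile(r"https?://[^\s)\"']+")

def find_fabricated_urls(response: str, source_texts: list[str]) -> list[str]:
    """Return URLs in the response that appear in none of the sources."""
    combined_sources = " ".join(source_texts)
    return [
        url for url in URL_PATTERN.findall(response)
        if url not in combined_sources
    ]
```

Flagged URLs can then be double-checked (for example with a HEAD request) or stripped before the response is delivered.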
Technique 1: Source Grounding with Citation Verification
The most effective hallucination mitigation is grounding agent responses in retrieved source documents and verifying that claims map back to those sources:
```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class SourceDocument:
    id: str
    content: str
    metadata: dict


@dataclass
class CitedClaim:
    claim: str
    source_id: Optional[str]
    source_text: Optional[str]
    verified: bool
    confidence: float


class GroundedResponseGenerator:
    """Generate responses grounded in source documents with citation tracking."""

    def __init__(self, llm_client):
        self.llm = llm_client

    def generate_grounded_response(
        self,
        query: str,
        sources: list[SourceDocument],
    ) -> tuple[str, list[CitedClaim]]:
        source_context = "\n\n".join(
            f"[Source {s.id}]: {s.content}" for s in sources
        )
        response = self.llm.chat.completions.create(
            model="gpt-4o",
            messages=[
                {
                    "role": "system",
                    "content": """Answer the user's question using ONLY the provided sources.
For every factual claim, include a citation in the format [Source X].
If the sources do not contain enough information to answer,
say "I don't have enough information to answer that."
Never make claims that are not supported by the provided sources.""",
                },
                {
                    "role": "user",
                    "content": f"Sources:\n{source_context}\n\nQuestion: {query}",
                },
            ],
            temperature=0,
        )
        answer = response.choices[0].message.content or ""
        claims = self._extract_and_verify_claims(answer, sources)
        return answer, claims

    def _extract_and_verify_claims(
        self,
        response: str,
        sources: list[SourceDocument],
    ) -> list[CitedClaim]:
        verification_response = self.llm.chat.completions.create(
            model="gpt-4o",
            messages=[
                {
                    "role": "system",
                    "content": """Extract each factual claim from the response.
For each claim, output a JSON array with objects:
{"claim": "...", "source_id": "...", "verified": true/false, "confidence": 0.0-1.0}
Set verified=true only if the claim is directly supported by the cited source.""",
                },
                {
                    "role": "user",
                    "content": f"Response: {response}\n\nSources: {[s.content for s in sources]}",
                },
            ],
            temperature=0,
        )
        claims_data = json.loads(
            verification_response.choices[0].message.content or "[]"
        )
        return [
            CitedClaim(
                claim=c["claim"],
                source_id=c.get("source_id"),
                source_text=self._get_source_text(c.get("source_id"), sources),
                verified=c["verified"],
                confidence=c["confidence"],
            )
            for c in claims_data
        ]

    def _get_source_text(
        self,
        source_id: Optional[str],
        sources: list[SourceDocument],
    ) -> Optional[str]:
        if not source_id:
            return None
        for s in sources:
            if s.id == source_id:
                return s.content
        return None
```
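Before spending a second LLM call on verification, a cheap lexical screen can triage the obvious cases. The sketch below (a hypothetical pre-filter, not part of the generator above) measures what fraction of a claim's content words appear in the cited source; very low scores suggest the claim deserves full LLM verification:

```python
def lexical_support_score(claim: str, source_text: str) -> float:
    """Fraction of the claim's content words that occur in the source text.

    A crude screen: it ignores word order and meaning, so use it only to
    prioritize which claims get a full LLM verification pass.
    """
    stopwords = {"the", "a", "an", "is", "are", "was", "were",
                 "of", "to", "in", "and", "that"}
    claim_words = {w for w in claim.lower().split() if w not in stopwords}
    if not claim_words:
        return 1.0  # nothing substantive to check
    source_words = set(source_text.lower().split())
    return len(claim_words & source_words) / len(claim_words)
```

Because it is pure string matching, this runs in microseconds and adds no API cost; it complements rather than replaces the LLM-based verification step.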
Technique 2: Confidence Scoring
Confidence scoring estimates how likely the agent's output is to be correct, enabling conditional handling of low-confidence responses:
```python
class ConfidenceScorer:
    """Score the confidence of agent responses using multiple signals."""

    def __init__(self, llm_client):
        self.llm = llm_client

    def score_response(
        self,
        query: str,
        response: str,
        sources: list[SourceDocument] | None = None,
    ) -> dict:
        signals = {}
        # Signal 1: Self-consistency (generate multiple responses, check agreement)
        signals["self_consistency"] = self._check_self_consistency(query)
        # Signal 2: Source coverage
        if sources:
            signals["source_coverage"] = self._check_source_coverage(
                response, sources
            )
        # Signal 3: Hedging language detection
        signals["hedging_score"] = self._detect_hedging(response)
        # Weighted average (missing signals default to a neutral 0.5)
        weights = {"self_consistency": 0.4, "source_coverage": 0.4, "hedging_score": 0.2}
        total = sum(
            signals.get(k, 0.5) * v
            for k, v in weights.items()
        )
        signals["overall_confidence"] = round(total, 3)
        return signals

    def _check_self_consistency(self, query: str, n_samples: int = 3) -> float:
        """Generate multiple responses and measure agreement."""
        responses = []
        for _ in range(n_samples):
            result = self.llm.chat.completions.create(
                model="gpt-4o-mini",
                messages=[{"role": "user", "content": query}],
                temperature=0.7,
                max_tokens=200,
            )
            responses.append(result.choices[0].message.content)
        agreement_check = self.llm.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{
                "role": "user",
                "content": f"""Rate the factual agreement between these responses
from 0.0 (completely contradictory) to 1.0 (fully consistent).
Respond with ONLY a number.
Responses: {responses}""",
            }],
            temperature=0,
        )
        try:
            return float(agreement_check.choices[0].message.content.strip())
        except ValueError:
            return 0.5

    def _check_source_coverage(
        self,
        response: str,
        sources: list[SourceDocument],
    ) -> float:
        """Check what fraction of response claims are covered by sources."""
        source_text = " ".join(s.content for s in sources)
        check = self.llm.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{
                "role": "user",
                "content": f"""What fraction of factual claims in the Response
are supported by the Source text? Respond with ONLY a number between 0.0 and 1.0.
Response: {response}
Source: {source_text}""",
            }],
            temperature=0,
        )
        try:
            return float(check.choices[0].message.content.strip())
        except ValueError:
            return 0.5

    def _detect_hedging(self, response: str) -> float:
        """Detect hedging language as a proxy for uncertainty."""
        hedging_phrases = [
            "I think", "probably", "might be", "I'm not sure",
            "it's possible", "approximately", "roughly",
            "I believe", "as far as I know", "it seems",
        ]
        lower_resp = response.lower()
        hedge_count = sum(1 for p in hedging_phrases if p.lower() in lower_resp)
        # More hedging means lower confidence
        return max(0.0, 1.0 - hedge_count * 0.15)
```
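The LLM-judged agreement check above costs one extra API call per scoring pass. A cheaper and rougher alternative, sketched below, is mean pairwise Jaccard word overlap between the sampled responses. It is purely lexical (paraphrases score lower than they should), so treat it as a complement to, not a replacement for, the LLM judge:

```python
from itertools import combinations

def jaccard_consistency(responses: list[str]) -> float:
    """Mean pairwise Jaccard word overlap across sampled responses (0.0-1.0)."""
    def jaccard(a: str, b: str) -> float:
        wa, wb = set(a.lower().split()), set(b.lower().split())
        if not wa and not wb:
            return 1.0
        return len(wa & wb) / len(wa | wb)

    pairs = list(combinations(responses, 2))
    if not pairs:
        return 1.0  # a single sample is trivially consistent with itself
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)
```

One reasonable design is to gate on the cheap score first and only invoke the LLM judge when lexical agreement falls below a threshold.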
Technique 3: Human-in-the-Loop Escalation
When confidence is low, escalate to a human reviewer instead of delivering a potentially hallucinated response:
```python
from enum import Enum


class EscalationLevel(Enum):
    AUTO_APPROVE = "auto_approve"
    FLAG_FOR_REVIEW = "flag_for_review"
    REQUIRE_APPROVAL = "require_approval"
    BLOCK = "block"


class HumanInTheLoopEscalation:
    def __init__(
        self,
        auto_approve_threshold: float = 0.85,
        review_threshold: float = 0.6,
        block_threshold: float = 0.3,
    ):
        self.auto_approve = auto_approve_threshold
        self.review = review_threshold
        self.block = block_threshold

    def determine_escalation(self, confidence: float) -> EscalationLevel:
        if confidence >= self.auto_approve:
            return EscalationLevel.AUTO_APPROVE
        elif confidence >= self.review:
            return EscalationLevel.FLAG_FOR_REVIEW
        elif confidence >= self.block:
            return EscalationLevel.REQUIRE_APPROVAL
        else:
            return EscalationLevel.BLOCK

    async def handle_response(
        self,
        response: str,
        confidence: float,
        query: str,
    ) -> str:
        level = self.determine_escalation(confidence)
        if level == EscalationLevel.AUTO_APPROVE:
            return response
        if level == EscalationLevel.FLAG_FOR_REVIEW:
            await self._queue_for_review(query, response, confidence)
            return response + "\n\n_This response has been flagged for review._"
        if level == EscalationLevel.REQUIRE_APPROVAL:
            await self._notify_reviewer(query, response, confidence)
            return ("This question requires human verification. "
                    "A team member will respond shortly.")
        return "I don't have enough reliable information to answer this question."

    async def _queue_for_review(self, query, response, confidence):
        """Add to async review queue — reviewer checks later."""
        pass  # Integrate with your task queue

    async def _notify_reviewer(self, query, response, confidence):
        """Send real-time notification to reviewer."""
        pass  # Integrate with Slack, email, etc.
```
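The threshold comparison reduces to a pure function of the confidence score, which makes the routing logic easy to unit-test in isolation. A standalone sketch using the same default thresholds as above (the `route` name is illustrative):

```python
from enum import Enum

class EscalationLevel(Enum):
    AUTO_APPROVE = "auto_approve"
    FLAG_FOR_REVIEW = "flag_for_review"
    REQUIRE_APPROVAL = "require_approval"
    BLOCK = "block"

def route(confidence: float,
          auto_approve: float = 0.85,
          review: float = 0.6,
          block: float = 0.3) -> EscalationLevel:
    """Map a confidence score to an escalation level; higher means more trusted."""
    if confidence >= auto_approve:
        return EscalationLevel.AUTO_APPROVE
    if confidence >= review:
        return EscalationLevel.FLAG_FOR_REVIEW
    if confidence >= block:
        return EscalationLevel.REQUIRE_APPROVAL
    return EscalationLevel.BLOCK
```

Note that the thresholds are inclusive at each boundary: a confidence of exactly 0.85 auto-approves. Pinning down boundary behavior like this in tests prevents off-by-one surprises when you later tune the thresholds.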
Putting It All Together
```python
async def handle_agent_query(query: str, sources: list[SourceDocument]) -> str:
    # llm_client: an already-configured OpenAI-compatible client instance
    grounded_gen = GroundedResponseGenerator(llm_client)
    scorer = ConfidenceScorer(llm_client)
    escalation = HumanInTheLoopEscalation()

    response, claims = grounded_gen.generate_grounded_response(query, sources)

    # Annotate the response if more than 30% of claims failed verification
    unverified = [c for c in claims if not c.verified]
    if len(unverified) > len(claims) * 0.3:
        response += "\n\nNote: Some claims could not be verified against sources."

    scores = scorer.score_response(query, response, sources)
    confidence = scores["overall_confidence"]
    return await escalation.handle_response(response, confidence, query)
```
FAQ
How much does hallucination detection add to latency and cost?
Self-consistency checking multiplies your LLM calls by the number of samples (typically 3-5x). Citation verification adds one additional LLM call. For latency-sensitive applications, run these checks asynchronously — deliver the initial response immediately and update it if verification fails. For high-stakes applications (medical, legal, financial), the additional 1-3 seconds and cost are well justified.
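The "respond now, verify in the background" pattern described above can be sketched with plain asyncio: return the response immediately, run the verifier as a fire-and-forget task, and invoke a failure handler if the check does not pass. The `verify_async` and `on_failure` callables here are placeholders for your own verifier and remediation logic:

```python
import asyncio

async def deliver_then_verify(response: str, verify_async, on_failure) -> str:
    """Return the response immediately; verify it in a background task.

    verify_async(response) -> bool; on_failure(response) runs if it fails.
    """
    async def _verify():
        if not await verify_async(response):
            await on_failure(response)

    # Fire-and-forget: the caller gets the response without waiting
    asyncio.get_running_loop().create_task(_verify())
    return response
```

In production you would keep a reference to the task (or use a TaskGroup) so failures in the verifier itself are not silently dropped, and `on_failure` might retract the message, post a correction, or alert a reviewer.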
Can I fine-tune a model to hallucinate less?
Fine-tuning on high-quality, factually verified data can reduce hallucinations in a specific domain. However, fine-tuning cannot eliminate hallucinations entirely because they are an inherent property of how language models generate text. The detection and mitigation strategies in this post provide defense regardless of the model's base hallucination rate. Use fine-tuning to reduce the rate, and use these techniques to catch what remains.
What is the difference between RAG grounding and the citation verification shown here?
RAG (Retrieval-Augmented Generation) provides relevant source documents to the model as context. Citation verification goes a step further by checking that the model's claims actually match what those sources say. RAG reduces hallucination by giving the model correct information to reference, but the model can still hallucinate claims that are not in the retrieved documents. Citation verification catches those cases.
#HallucinationDetection #AISafety #Grounding #RAG #Python #AgenticAI #LearnAI #AIEngineering
CallSphere Team
Expert insights on AI voice agents and customer communication automation.