Conversation Quality Metrics: Coherence, Relevance, and Helpfulness Scoring
Learn how to measure AI agent conversation quality using automated scoring rubrics, LLM-as-judge patterns, and human correlation validation to ensure your agent produces coherent and helpful responses.
Beyond Task Completion: Measuring How Well Agents Communicate
An agent can solve the user's problem yet still deliver a terrible experience. Rambling answers, contradicting itself mid-conversation, ignoring context from three messages ago, or answering a question nobody asked — these quality failures erode trust even when the final outcome is correct. Conversation quality metrics capture the how of agent communication, not just the what.
The three pillars of conversation quality are coherence (does the response make logical sense and flow naturally?), relevance (does it address what the user actually asked?), and helpfulness (does it provide actionable, useful information?).
Designing Scoring Rubrics
Rubrics convert subjective quality into structured, repeatable measurements. Each dimension gets a clear scale with concrete anchors.
```python
from dataclasses import dataclass
from enum import IntEnum


class QualityScore(IntEnum):
    POOR = 1
    BELOW_AVERAGE = 2
    ADEQUATE = 3
    GOOD = 4
    EXCELLENT = 5


@dataclass
class QualityRubric:
    dimension: str
    anchors: dict[int, str]  # score -> concrete behavioral description


COHERENCE_RUBRIC = QualityRubric(
    dimension="coherence",
    anchors={
        1: "Response contradicts itself or is incoherent",
        2: "Mostly understandable but has logical gaps",
        3: "Logically consistent, minor flow issues",
        4: "Well-structured with clear logical flow",
        5: "Perfectly coherent, natural conversational flow",
    },
)

RELEVANCE_RUBRIC = QualityRubric(
    dimension="relevance",
    anchors={
        1: "Completely off-topic or ignores the question",
        2: "Partially addresses the question with filler",
        3: "Addresses the question but includes irrelevant info",
        4: "Directly addresses the question with minimal extras",
        5: "Precisely targets the question with no wasted content",
    },
)

HELPFULNESS_RUBRIC = QualityRubric(
    dimension="helpfulness",
    anchors={
        1: "Provides no useful information or next steps",
        2: "Provides vague or generic information",
        3: "Provides some useful information",
        4: "Provides clear, actionable information",
        5: "Provides exceptional value, anticipates follow-ups",
    },
)
```
Concrete anchors are essential. Without them, a score of 3 means something different to every evaluator — human or LLM.
Implementing LLM-as-Judge
Using a language model to evaluate another language model's output is surprisingly effective when done carefully. The key is a structured prompt with explicit rubric criteria.
```python
import json


async def llm_judge_quality(
    llm_client,
    user_message: str,
    agent_response: str,
    conversation_history: list[dict],
    rubrics: list[QualityRubric],
    model: str = "gpt-4o-mini",
) -> dict:
    # Render each rubric's anchors into the prompt so the judge
    # scores against concrete criteria, not its own intuition.
    rubric_text = ""
    for rubric in rubrics:
        rubric_text += f"\n### {rubric.dimension.title()}\n"
        for score, desc in sorted(rubric.anchors.items()):
            rubric_text += f"  {score}: {desc}\n"

    # Include recent history so relevance can be judged in context.
    history_text = "\n".join(
        f"{m['role']}: {m['content']}"
        for m in conversation_history[-6:]
    )

    prompt = f"""You are an expert evaluator for AI agent responses.
Score the agent response on each dimension using the rubric.

## Conversation History (last 6 messages)
{history_text}

## Current User Message
{user_message}

## Agent Response
{agent_response}

## Rubrics
{rubric_text}

Return JSON with this exact structure:
{{
  "coherence": {{"score": <1-5>, "reasoning": "<1 sentence>"}},
  "relevance": {{"score": <1-5>, "reasoning": "<1 sentence>"}},
  "helpfulness": {{"score": <1-5>, "reasoning": "<1 sentence>"}}
}}"""

    response = await llm_client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
        temperature=0.0,  # deterministic scoring
    )
    return json.loads(response.choices[0].message.content)
```
Setting temperature to 0 improves scoring consistency. Including conversation history is critical — relevance cannot be judged without knowing what came before.
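Even with `response_format` constraining the output to JSON, it is worth validating the judge's answer before recording it. A defensive sketch; the schema simply mirrors the prompt above and nothing here is provider-specific:

```python
def validate_judge_output(
    raw: dict,
    dimensions: tuple[str, ...] = ("coherence", "relevance", "helpfulness"),
) -> dict:
    """Reject malformed or out-of-range judge scores instead of logging them."""
    validated = {}
    for dim in dimensions:
        entry = raw.get(dim)
        if not isinstance(entry, dict) or "score" not in entry:
            raise ValueError(f"Judge output missing dimension: {dim}")
        score = int(entry["score"])
        if not 1 <= score <= 5:
            raise ValueError(f"{dim} score out of range: {score}")
        validated[dim] = {
            "score": score,
            "reasoning": str(entry.get("reasoning", "")),
        }
    return validated
```

A failed validation is itself a useful signal: if the judge frequently returns malformed scores, the prompt or the judge model needs work before the scores are worth trusting.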
Deterministic Quality Checks
Not every quality signal requires an LLM. Some checks are simple and fast.
```python
import re


def deterministic_quality_checks(
    user_message: str,
    agent_response: str,
    conversation_history: list[dict],
) -> dict:
    checks = {}

    # Response length: flag answers that are suspiciously short or bloated.
    word_count = len(agent_response.split())
    checks["response_length"] = word_count
    checks["too_short"] = word_count < 10
    checks["too_long"] = word_count > 500

    # Repetition: ratio of unique sentences to total sentences.
    sentences = re.split(r"[.!?]+", agent_response)
    sentences = [s.strip().lower() for s in sentences if s.strip()]
    unique_ratio = (
        len(set(sentences)) / len(sentences) if sentences else 1.0
    )
    checks["repetition_score"] = round(unique_ratio, 3)

    # Self-contradiction signal (simple phrase heuristic).
    # Phrases must be lowercase, since we match against the lowercased response.
    contradiction_phrases = [
        "actually, i was wrong",
        "i apologize, that's incorrect",
        "let me correct myself",
    ]
    checks["self_correction"] = any(
        phrase in agent_response.lower()
        for phrase in contradiction_phrases
    )

    # Question echo: does the first sentence mostly restate the question?
    # Tokenize with \w+ so punctuation does not break word matching.
    if user_message.strip().endswith("?"):
        user_words = set(re.findall(r"\w+", user_message.lower()))
        first_sentence = sentences[0] if sentences else ""
        overlap = len(user_words & set(re.findall(r"\w+", first_sentence)))
        checks["question_echo_ratio"] = round(
            overlap / max(len(user_words), 1), 3
        )

    return checks
```
These fast checks flag obvious problems — extremely short responses, excessive repetition, or the agent echoing the question back instead of answering it. Run them on every response before invoking the more expensive LLM judge.
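One way to wire the two tiers together is an escalation gate: record the cheap signals for every response, but only pay for the LLM judge when a check trips. A sketch with illustrative, untuned thresholds; in practice you would also judge a random sample of clean responses for baseline coverage:

```python
def needs_llm_review(checks: dict) -> bool:
    """Escalate to the LLM judge only when a deterministic check trips."""
    return bool(
        checks.get("too_short", False)
        or checks.get("too_long", False)
        or checks.get("repetition_score", 1.0) < 0.8
        or checks.get("self_correction", False)
        or checks.get("question_echo_ratio", 0.0) > 0.6
    )


print(needs_llm_review({"too_short": True, "repetition_score": 1.0}))    # True
print(needs_llm_review({"too_short": False, "repetition_score": 0.95}))  # False
```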
Validating LLM-Judge Correlation with Humans
An LLM judge is only useful if it agrees with human evaluators. Measure this with inter-annotator agreement.
```python
from scipy import stats


def compute_judge_correlation(
    human_scores: list[int],
    llm_scores: list[int],
) -> dict:
    if len(human_scores) != len(llm_scores):
        raise ValueError("Score lists must be the same length")
    if not human_scores:
        raise ValueError("Score lists must be non-empty")

    # Rank correlation: does the judge order responses the way humans do?
    spearman_corr, spearman_p = stats.spearmanr(human_scores, llm_scores)

    # Exact agreement: identical scores.
    exact_agreement = sum(
        1 for h, l in zip(human_scores, llm_scores) if h == l
    ) / len(human_scores)

    # Within-one agreement: scores differing by at most one point.
    within_one = sum(
        1 for h, l in zip(human_scores, llm_scores) if abs(h - l) <= 1
    ) / len(human_scores)

    return {
        "spearman_correlation": round(spearman_corr, 3),
        "spearman_p_value": round(spearman_p, 4),
        "exact_agreement": round(exact_agreement, 3),
        "within_one_agreement": round(within_one, 3),
        "sample_size": len(human_scores),
    }
```
Target a Spearman correlation of 0.7 or higher and within-one agreement of 85 percent or better. If your LLM judge falls below these thresholds, refine your rubric anchors or switch to a more capable judge model.
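Those thresholds can be enforced as a release gate on the judge itself. A sketch using scipy directly with toy scores; real validation needs a properly annotated sample, and the threshold defaults below simply restate the targets above:

```python
from scipy import stats


def judge_meets_bar(
    human: list[int],
    llm: list[int],
    min_rho: float = 0.7,
    min_within_one: float = 0.85,
) -> bool:
    """True if the LLM judge clears the correlation and agreement bars."""
    rho, _ = stats.spearmanr(human, llm)
    within_one = sum(abs(h - l) <= 1 for h, l in zip(human, llm)) / len(human)
    return rho >= min_rho and within_one >= min_within_one


human = [5, 4, 3, 2, 1, 4, 3, 5, 2, 4]
llm = [5, 4, 3, 1, 1, 4, 4, 5, 2, 3]
print(judge_meets_bar(human, llm))  # True
```

Running this gate in CI whenever the judge prompt or model changes catches silent regressions in judge quality before they contaminate your metrics.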
FAQ
Which LLM should I use as a judge?
Use the strongest model you can afford for judging. GPT-4o or Claude work well. Do not use the same model as your agent — this introduces self-preference bias. A smaller, cheaper model like GPT-4o-mini works surprisingly well for structured rubric scoring where the anchors are clear and concrete.
How many human annotations do I need to validate the judge?
Annotate at least 100 samples with two independent human raters per sample. Compute inter-rater agreement between humans first to establish your ceiling — the LLM judge cannot be more reliable than humans agree with each other. If human agreement is only 60 percent, do not expect 90 percent from the LLM.
Should I score each turn individually or the whole conversation?
Both. Turn-level scoring catches individual bad responses. Conversation-level scoring catches patterns like the agent losing context over time or becoming repetitive across turns. Aggregate turn-level scores to get conversation-level trends, but also run a separate holistic evaluation on the full conversation.
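Rolling turn-level judge output up to a conversation-level summary can be as simple as mean, worst turn, and a crude trend flag. A sketch over the score structure the judge returns; the `declining` heuristic is illustrative:

```python
from statistics import mean


def aggregate_turn_scores(turn_scores: list[dict]) -> dict:
    """Roll per-turn dimension scores up to conversation level."""
    summary = {}
    for dim in ("coherence", "relevance", "helpfulness"):
        values = [turn[dim]["score"] for turn in turn_scores]
        summary[dim] = {
            "mean": round(mean(values), 2),
            "min": min(values),                   # worst single turn
            "declining": values[-1] < values[0],  # crude drift signal
        }
    return summary


turns = [
    {"coherence": {"score": 5}, "relevance": {"score": 4}, "helpfulness": {"score": 4}},
    {"coherence": {"score": 4}, "relevance": {"score": 4}, "helpfulness": {"score": 3}},
    {"coherence": {"score": 3}, "relevance": {"score": 2}, "helpfulness": {"score": 3}},
]
print(aggregate_turn_scores(turns)["relevance"])  # {'mean': 3.33, 'min': 2, 'declining': True}
```

The `min` and `declining` fields are often more actionable than the mean: a conversation that starts at 5 and ends at 2 has a very different problem from one that hovers at 3.5 throughout.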
#ConversationQuality #LLMasJudge #AgentEvaluation #NLPMetrics #Python #AgenticAI #LearnAI #AIEngineering
CallSphere Team
Expert insights on AI voice agents and customer communication automation.