Conversation Quality Metrics: Coherence, Relevance, and Helpfulness Scoring
Learn how to measure AI agent conversation quality using automated scoring rubrics, LLM-as-judge patterns, and human correlation validation to ensure your agent produces coherent and helpful responses.
Beyond Task Completion: Measuring How Well Agents Communicate
An agent can solve the user's problem yet still deliver a terrible experience. Rambling answers, contradicting itself mid-conversation, ignoring context from three messages ago, or answering a question nobody asked — these quality failures erode trust even when the final outcome is correct. Conversation quality metrics capture the how of agent communication, not just the what.
The three pillars of conversation quality are coherence (does the response make logical sense and flow naturally?), relevance (does it address what the user actually asked?), and helpfulness (does it provide actionable, useful information?).
Designing Scoring Rubrics
Rubrics convert subjective quality into structured, repeatable measurements. Each dimension gets a clear scale with concrete anchors.
```python
from dataclasses import dataclass
from enum import IntEnum


class QualityScore(IntEnum):
    POOR = 1
    BELOW_AVERAGE = 2
    ADEQUATE = 3
    GOOD = 4
    EXCELLENT = 5


@dataclass
class QualityRubric:
    dimension: str
    anchors: dict[int, str]  # score -> concrete behavioral description


COHERENCE_RUBRIC = QualityRubric(
    dimension="coherence",
    anchors={
        1: "Response contradicts itself or is incoherent",
        2: "Mostly understandable but has logical gaps",
        3: "Logically consistent, minor flow issues",
        4: "Well-structured with clear logical flow",
        5: "Perfectly coherent, natural conversational flow",
    },
)

RELEVANCE_RUBRIC = QualityRubric(
    dimension="relevance",
    anchors={
        1: "Completely off-topic or ignores the question",
        2: "Partially addresses the question with filler",
        3: "Addresses the question but includes irrelevant info",
        4: "Directly addresses the question with minimal extras",
        5: "Precisely targets the question with no wasted content",
    },
)

HELPFULNESS_RUBRIC = QualityRubric(
    dimension="helpfulness",
    anchors={
        1: "Provides no useful information or next steps",
        2: "Provides vague or generic information",
        3: "Provides some useful information",
        4: "Provides clear, actionable information",
        5: "Provides exceptional value, anticipates follow-ups",
    },
)
```
Concrete anchors are essential. Without them, a score of 3 means something different to every evaluator — human or LLM.
Implementing LLM-as-Judge
Using a language model to evaluate another language model's output is surprisingly effective when done carefully. The key is a structured prompt with explicit rubric criteria.
```python
import json


async def llm_judge_quality(
    llm_client,
    user_message: str,
    agent_response: str,
    conversation_history: list[dict],
    rubrics: list[QualityRubric],
    model: str = "gpt-4o-mini",
) -> dict:
    # Render each rubric's anchors into the prompt so the judge
    # scores against concrete criteria, not its own intuition.
    rubric_text = ""
    for rubric in rubrics:
        rubric_text += f"\n### {rubric.dimension.title()}\n"
        for score, desc in sorted(rubric.anchors.items()):
            rubric_text += f"  {score}: {desc}\n"

    # Include recent history so relevance can be judged in context.
    history_text = "\n".join(
        f"{m['role']}: {m['content']}"
        for m in conversation_history[-6:]
    )

    prompt = f"""You are an expert evaluator for AI agent responses.
Score the agent response on each dimension using the rubric.

## Conversation History (last 6 messages)
{history_text}

## Current User Message
{user_message}

## Agent Response
{agent_response}

## Rubrics
{rubric_text}

Return JSON with this exact structure:
{{
  "coherence": {{"score": <1-5>, "reasoning": "<1 sentence>"}},
  "relevance": {{"score": <1-5>, "reasoning": "<1 sentence>"}},
  "helpfulness": {{"score": <1-5>, "reasoning": "<1 sentence>"}}
}}"""

    response = await llm_client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
        temperature=0.0,  # deterministic scoring
    )
    return json.loads(response.choices[0].message.content)
```
Setting temperature to 0 improves scoring consistency. Including conversation history is critical — relevance cannot be judged without knowing what came before.
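Even with `response_format` constraining the output to JSON, it is worth validating the judge's answer before recording it. A defensive sketch; the schema simply mirrors the prompt above and nothing here is provider-specific:

```python
def validate_judge_output(
    raw: dict,
    dimensions: tuple[str, ...] = ("coherence", "relevance", "helpfulness"),
) -> dict:
    """Reject malformed or out-of-range judge scores instead of logging them."""
    validated = {}
    for dim in dimensions:
        entry = raw.get(dim)
        if not isinstance(entry, dict) or "score" not in entry:
            raise ValueError(f"Judge output missing dimension: {dim}")
        score = int(entry["score"])
        if not 1 <= score <= 5:
            raise ValueError(f"{dim} score out of range: {score}")
        validated[dim] = {
            "score": score,
            "reasoning": str(entry.get("reasoning", "")),
        }
    return validated
```

A failed validation is itself a useful signal: if the judge frequently returns malformed scores, the prompt or the judge model needs work before the scores are worth trusting.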
Deterministic Quality Checks
Not every quality signal requires an LLM. Some checks are simple and fast.
```python
import re


def deterministic_quality_checks(
    user_message: str,
    agent_response: str,
    conversation_history: list[dict],
) -> dict:
    checks = {}

    # Response length: flag answers that are suspiciously short or bloated.
    word_count = len(agent_response.split())
    checks["response_length"] = word_count
    checks["too_short"] = word_count < 10
    checks["too_long"] = word_count > 500

    # Repetition: ratio of unique sentences to total sentences.
    sentences = re.split(r"[.!?]+", agent_response)
    sentences = [s.strip().lower() for s in sentences if s.strip()]
    unique_ratio = (
        len(set(sentences)) / len(sentences) if sentences else 1.0
    )
    checks["repetition_score"] = round(unique_ratio, 3)

    # Self-contradiction signal (simple phrase heuristic).
    # Phrases must be lowercase, since we match against the lowercased response.
    contradiction_phrases = [
        "actually, i was wrong",
        "i apologize, that's incorrect",
        "let me correct myself",
    ]
    checks["self_correction"] = any(
        phrase in agent_response.lower()
        for phrase in contradiction_phrases
    )

    # Question echo: does the first sentence mostly restate the question?
    # Tokenize with \w+ so punctuation does not break word matching.
    if user_message.strip().endswith("?"):
        user_words = set(re.findall(r"\w+", user_message.lower()))
        first_sentence = sentences[0] if sentences else ""
        overlap = len(user_words & set(re.findall(r"\w+", first_sentence)))
        checks["question_echo_ratio"] = round(
            overlap / max(len(user_words), 1), 3
        )

    return checks
```
These fast checks flag obvious problems — extremely short responses, excessive repetition, or the agent echoing the question back instead of answering it. Run them on every response before invoking the more expensive LLM judge.
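One way to wire the two tiers together is an escalation gate: record the cheap signals for every response, but only pay for the LLM judge when a check trips. A sketch with illustrative, untuned thresholds; in practice you would also judge a random sample of clean responses for baseline coverage:

```python
def needs_llm_review(checks: dict) -> bool:
    """Escalate to the LLM judge only when a deterministic check trips."""
    return bool(
        checks.get("too_short", False)
        or checks.get("too_long", False)
        or checks.get("repetition_score", 1.0) < 0.8
        or checks.get("self_correction", False)
        or checks.get("question_echo_ratio", 0.0) > 0.6
    )


print(needs_llm_review({"too_short": True, "repetition_score": 1.0}))    # True
print(needs_llm_review({"too_short": False, "repetition_score": 0.95}))  # False
```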
Validating LLM-Judge Correlation with Humans
An LLM judge is only useful if it agrees with human evaluators. Measure this with inter-annotator agreement.
```python
from scipy import stats


def compute_judge_correlation(
    human_scores: list[int],
    llm_scores: list[int],
) -> dict:
    if len(human_scores) != len(llm_scores):
        raise ValueError("Score lists must be the same length")
    if not human_scores:
        raise ValueError("Score lists must be non-empty")

    # Rank correlation: does the judge order responses the way humans do?
    spearman_corr, spearman_p = stats.spearmanr(human_scores, llm_scores)

    # Exact agreement: identical scores.
    exact_agreement = sum(
        1 for h, l in zip(human_scores, llm_scores) if h == l
    ) / len(human_scores)

    # Within-one agreement: scores differing by at most one point.
    within_one = sum(
        1 for h, l in zip(human_scores, llm_scores) if abs(h - l) <= 1
    ) / len(human_scores)

    return {
        "spearman_correlation": round(spearman_corr, 3),
        "spearman_p_value": round(spearman_p, 4),
        "exact_agreement": round(exact_agreement, 3),
        "within_one_agreement": round(within_one, 3),
        "sample_size": len(human_scores),
    }
```
Target a Spearman correlation of 0.7 or higher and within-one agreement of 85 percent or better. If your LLM judge falls below these thresholds, refine your rubric anchors or switch to a more capable judge model.
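Those thresholds can be enforced as a release gate on the judge itself. A sketch using scipy directly with toy scores; real validation needs a properly annotated sample, and the threshold defaults below simply restate the targets above:

```python
from scipy import stats


def judge_meets_bar(
    human: list[int],
    llm: list[int],
    min_rho: float = 0.7,
    min_within_one: float = 0.85,
) -> bool:
    """True if the LLM judge clears the correlation and agreement bars."""
    rho, _ = stats.spearmanr(human, llm)
    within_one = sum(abs(h - l) <= 1 for h, l in zip(human, llm)) / len(human)
    return rho >= min_rho and within_one >= min_within_one


human = [5, 4, 3, 2, 1, 4, 3, 5, 2, 4]
llm = [5, 4, 3, 1, 1, 4, 4, 5, 2, 3]
print(judge_meets_bar(human, llm))  # True
```

Running this gate in CI whenever the judge prompt or model changes catches silent regressions in judge quality before they contaminate your metrics.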
FAQ
Which LLM should I use as a judge?
Use the strongest model you can afford for judging. GPT-4o or Claude work well. Do not use the same model as your agent — this introduces self-preference bias. A smaller, cheaper model like GPT-4o-mini works surprisingly well for structured rubric scoring where the anchors are clear and concrete.
How many human annotations do I need to validate the judge?
Annotate at least 100 samples with two independent human raters per sample. Compute inter-rater agreement between humans first to establish your ceiling — the LLM judge cannot be more reliable than humans agree with each other. If human agreement is only 60 percent, do not expect 90 percent from the LLM.
Should I score each turn individually or the whole conversation?
Both. Turn-level scoring catches individual bad responses. Conversation-level scoring catches patterns like the agent losing context over time or becoming repetitive across turns. Aggregate turn-level scores to get conversation-level trends, but also run a separate holistic evaluation on the full conversation.
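Rolling turn-level judge output up to a conversation-level summary can be as simple as mean, worst turn, and a crude trend flag. A sketch over the score structure the judge returns; the `declining` heuristic is illustrative:

```python
from statistics import mean


def aggregate_turn_scores(turn_scores: list[dict]) -> dict:
    """Roll per-turn dimension scores up to conversation level."""
    summary = {}
    for dim in ("coherence", "relevance", "helpfulness"):
        values = [turn[dim]["score"] for turn in turn_scores]
        summary[dim] = {
            "mean": round(mean(values), 2),
            "min": min(values),                   # worst single turn
            "declining": values[-1] < values[0],  # crude drift signal
        }
    return summary


turns = [
    {"coherence": {"score": 5}, "relevance": {"score": 4}, "helpfulness": {"score": 4}},
    {"coherence": {"score": 4}, "relevance": {"score": 4}, "helpfulness": {"score": 3}},
    {"coherence": {"score": 3}, "relevance": {"score": 2}, "helpfulness": {"score": 3}},
]
print(aggregate_turn_scores(turns)["relevance"])  # {'mean': 3.33, 'min': 2, 'declining': True}
```

The `min` and `declining` fields are often more actionable than the mean: a conversation that starts at 5 and ends at 2 has a very different problem from one that hovers at 3.5 throughout.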
#ConversationQuality #LLMasJudge #AgentEvaluation #NLPMetrics #Python #AgenticAI #LearnAI #AIEngineering
CallSphere Team
Expert insights on AI voice agents and customer communication automation.