
LLM-as-Judge: Using AI to Evaluate AI Agent Responses Automatically

Learn how to use LLMs as automated judges to evaluate AI agent responses with scoring rubrics, calibration techniques, and multi-criteria evaluation frameworks in Python.

Why Use an LLM to Judge Another LLM

Human evaluation is the gold standard for assessing agent quality, but it does not scale. Reviewing 500 agent responses manually takes days. LLM-as-Judge is a technique where you use a strong language model to score the outputs of your agent automatically, giving you scalable evaluation that correlates well with human judgment when calibrated correctly.

Published studies of LLM judging, most notably the MT-Bench evaluation, report that GPT-4-class judges reach roughly 80% agreement with human raters on well-defined criteria, on par with agreement between human raters themselves. The key is writing precise rubrics and calibrating the judge against human labels.

Basic Judge Implementation

A judge is simply an LLM call with a structured prompt that asks for a score and justification.

import openai
import json
from dataclasses import dataclass

@dataclass
class JudgeResult:
    score: int           # 1-5
    reasoning: str
    criteria_scores: dict[str, int]

def evaluate_response(
    question: str,
    agent_response: str,
    reference_answer: str,
    client: openai.OpenAI,
    model: str = "gpt-4o",
) -> JudgeResult:
    prompt = f"""You are an expert evaluator. Score the following agent response.

Question: {question}
Reference Answer: {reference_answer}
Agent Response: {agent_response}

Score each criterion from 1 (poor) to 5 (excellent):
1. Correctness: Is the information accurate?
2. Completeness: Does it address all parts of the question?
3. Clarity: Is the response well-organized and easy to understand?

Return JSON:
{{"correctness": <int>, "completeness": <int>, "clarity": <int>, "overall": <int>, "reasoning": "<explanation>"}}
"""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
        temperature=0,  # deterministic scoring improves judge consistency
    )
    data = json.loads(response.choices[0].message.content)
    return JudgeResult(
        score=data["overall"],
        reasoning=data["reasoning"],
        criteria_scores={
            k: data[k] for k in ["correctness", "completeness", "clarity"]
        },
    )
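Even with JSON mode enabled, judges occasionally emit scores outside the 1-5 range or omit a criterion, and a single bad parse should not crash a long evaluation run. One way to harden the parse step is a small validation helper; the function name and `CRITERIA` list below are illustrative, not part of any library:

```python
# Hypothetical guard for the judge's JSON output: clamp scores into the
# 1-5 range and mark missing or non-numeric values for re-judging.
CRITERIA = ["correctness", "completeness", "clarity"]

def sanitize_judge_json(data: dict) -> dict:
    """Clamp each score to 1-5; use None for missing or non-numeric values."""
    cleaned = {}
    for key in CRITERIA + ["overall"]:
        value = data.get(key)
        if isinstance(value, (int, float)):
            cleaned[key] = max(1, min(5, int(value)))
        else:
            cleaned[key] = None  # flag this case for a retry
    cleaned["reasoning"] = str(data.get("reasoning", ""))
    return cleaned
```

Calling `sanitize_judge_json(data)` before constructing `JudgeResult` turns a malformed judge reply into a flagged case instead of a `KeyError` mid-batch.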

Designing Scoring Rubrics

Vague rubrics produce inconsistent scores. Anchor each score level to concrete behaviors.

RUBRIC = """
## Correctness Rubric
- 5: All facts are accurate, no hallucinations
- 4: Minor inaccuracy that does not affect the main answer
- 3: One significant error, but the core answer is correct
- 2: Multiple errors or a critical factual mistake
- 1: The answer is fundamentally wrong or fabricated

## Completeness Rubric
- 5: Addresses every part of the question with sufficient detail
- 4: Addresses all parts but one lacks detail
- 3: Misses one part of a multi-part question
- 2: Only partially addresses the question
- 1: Fails to address the question at all
"""

def build_judge_prompt(question: str, response: str, rubric: str = RUBRIC) -> str:
    return f"""Evaluate this agent response using the rubric below.

{rubric}

Question: {question}
Response: {response}

Return JSON with scores and reasoning for each criterion."""

Calibration Against Human Labels

Before trusting LLM-as-Judge scores, calibrate by comparing judge scores to human ratings on a labeled subset.


import numpy as np
from scipy import stats

def calibrate_judge(
    human_scores: list[int],
    judge_scores: list[int],
) -> dict:
    """Compare judge scores against human ground truth."""
    correlation, p_value = stats.spearmanr(human_scores, judge_scores)

    exact_match = sum(h == j for h, j in zip(human_scores, judge_scores))
    within_one = sum(abs(h - j) <= 1 for h, j in zip(human_scores, judge_scores))

    return {
        "spearman_correlation": round(correlation, 3),
        "p_value": round(p_value, 4),
        "exact_match_pct": round(exact_match / len(human_scores) * 100, 1),
        "within_one_pct": round(within_one / len(human_scores) * 100, 1),
        "judge_mean": round(np.mean(judge_scores), 2),
        "human_mean": round(np.mean(human_scores), 2),
        "bias": round(np.mean(judge_scores) - np.mean(human_scores), 2),
    }

A Spearman correlation above 0.7 and within-one agreement above 85% indicates a reliable judge. If bias is consistently positive, the judge is too lenient — adjust the rubric.
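Rubric adjustment is the principled fix, but when the measured bias is stable across runs, a pragmatic stopgap is to subtract it from future judge scores and clamp back onto the scale. A minimal sketch (the function is illustrative, not from any library):

```python
def debias_scores(judge_scores: list[int], bias: float) -> list[int]:
    """Subtract the measured judge bias, then clamp back to the 1-5 scale."""
    return [max(1, min(5, round(s - bias))) for s in judge_scores]

# A judge that runs one point hot gets pulled back toward the human scale.
print(debias_scores([5, 4, 3, 1], 1.0))  # [4, 3, 2, 1]
```

Recompute the calibration metrics after correction; if the Spearman correlation is still low, the judge is noisy rather than merely biased, and no offset will fix that.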

Multi-Criteria Evaluation

Combine individual criteria into a weighted overall score.

def weighted_score(criteria_scores: dict[str, int], weights: dict[str, float]) -> float:
    total = sum(criteria_scores[k] * weights[k] for k in criteria_scores)
    weight_sum = sum(weights[k] for k in criteria_scores)
    return round(total / weight_sum, 2)

# For a customer support agent, correctness matters most
SUPPORT_WEIGHTS = {"correctness": 0.5, "completeness": 0.3, "clarity": 0.2}

# For a creative writing agent, clarity matters most
CREATIVE_WEIGHTS = {"correctness": 0.2, "completeness": 0.2, "clarity": 0.6}
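To make the weighting concrete, here is the same scoring applied to one set of criteria scores under both profiles (`weighted_score` and the weight dicts are repeated so the snippet runs standalone):

```python
def weighted_score(criteria_scores: dict[str, int], weights: dict[str, float]) -> float:
    total = sum(criteria_scores[k] * weights[k] for k in criteria_scores)
    weight_sum = sum(weights[k] for k in criteria_scores)
    return round(total / weight_sum, 2)

SUPPORT_WEIGHTS = {"correctness": 0.5, "completeness": 0.3, "clarity": 0.2}
CREATIVE_WEIGHTS = {"correctness": 0.2, "completeness": 0.2, "clarity": 0.6}

scores = {"correctness": 5, "completeness": 4, "clarity": 3}
# Support: 5*0.5 + 4*0.3 + 3*0.2 = 4.3
print(weighted_score(scores, SUPPORT_WEIGHTS))   # 4.3
# Creative: 5*0.2 + 4*0.2 + 3*0.6 = 3.6
print(weighted_score(scores, CREATIVE_WEIGHTS))  # 3.6
```

The same response scores 4.3 for support but only 3.6 for creative work, which is exactly the point: the weights encode what "good" means for each agent.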

Running Batch Evaluations

Evaluate your full dataset efficiently with concurrent judge calls.

import asyncio

async def batch_evaluate(
    eval_cases: list[dict],
    agent_fn,
    judge_fn,
    concurrency: int = 5,
) -> list[dict]:
    semaphore = asyncio.Semaphore(concurrency)

    async def process_case(case):
        async with semaphore:
            agent_output = await agent_fn(case["input"])
            judge_result = await judge_fn(
                case["input"], agent_output, case["expected"]
            )
            return {**case, "output": agent_output, "judge": judge_result}

    tasks = [process_case(c) for c in eval_cases]
    results = await asyncio.gather(*tasks)
    return results
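Because `agent_fn` and `judge_fn` are injected, the harness can be exercised end to end without API calls. The sketch below wires `batch_evaluate` to stubbed async functions (the stubs are assumptions for illustration, not real agent or judge implementations):

```python
import asyncio

async def batch_evaluate(eval_cases, agent_fn, judge_fn, concurrency=5):
    semaphore = asyncio.Semaphore(concurrency)

    async def process_case(case):
        async with semaphore:
            agent_output = await agent_fn(case["input"])
            judge_result = await judge_fn(case["input"], agent_output, case["expected"])
            return {**case, "output": agent_output, "judge": judge_result}

    return await asyncio.gather(*(process_case(c) for c in eval_cases))

# Stub agent and judge so the wiring can be tested without network calls.
async def fake_agent(question: str) -> str:
    return f"answer to: {question}"

async def fake_judge(question: str, output: str, expected: str) -> dict:
    return {"overall": 5 if expected in output else 3}

cases = [
    {"input": "q1", "expected": "q1"},
    {"input": "q2", "expected": "other"},
]
results = asyncio.run(batch_evaluate(cases, fake_agent, fake_judge))
print([r["judge"]["overall"] for r in results])  # [5, 3]
```

Swapping the stubs for real async OpenAI calls is the only change needed for production runs; consider passing `return_exceptions=True` to `gather` there so one failed case does not abort the batch.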

FAQ

Does the judge model need to be stronger than the agent model?

Yes, generally. A GPT-4o judge evaluating GPT-3.5 agent outputs works well. Judging a model with an equally capable or weaker model produces unreliable scores because the judge may share the same blind spots.

How do I prevent position bias in the judge?

When comparing two responses (A vs B), run the evaluation twice — once with A first, once with B first — and average the results. This counteracts the tendency for LLMs to prefer whichever response appears first.
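The swap-and-average idea can be sketched in a few lines. Here `judge_fn(question, first, second)` is assumed to return the first response's score on a 0-1 scale, and the deliberately biased toy judge shows the averaging cancelling the position effect:

```python
def judge_pair_debiased(question: str, resp_a: str, resp_b: str, judge_fn) -> float:
    """Judge the pair in both orders and return A's position-averaged score."""
    a_first = judge_fn(question, resp_a, resp_b)  # A's score when shown first
    b_first = judge_fn(question, resp_b, resp_a)  # B's score when shown first
    return (a_first + (1 - b_first)) / 2

# Toy judge with position bias: favors whichever response appears first.
def biased_judge(question, first, second):
    true_score = 0.5  # the two responses are actually tied
    return true_score + 0.2

print(judge_pair_debiased("q", "A", "B", biased_judge))  # 0.5
```

Despite the judge adding +0.2 to whichever response it sees first, the averaged score recovers the true tie.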

How much does LLM-as-Judge cost compared to human evaluation?

Evaluating 1,000 responses with GPT-4o costs roughly two to five dollars depending on response length. The same volume of human evaluation typically costs hundreds of dollars and takes days. LLM-as-Judge is roughly 50-100x cheaper.


#LLMasJudge #Evaluation #AIAgents #AutomatedTesting #Python #ScoringRubrics #AgenticAI #LearnAI #AIEngineering


CallSphere Team

Expert insights on AI voice agents and customer communication automation.
