
Reflection Agents: Building AI Systems That Critique and Improve Their Own Output

Learn how to build reflection agents that evaluate their own outputs, assign quality scores, and iteratively refine results through multi-round self-improvement loops using the Reflexion pattern.

What Are Reflection Agents?

A reflection agent is an AI system that generates an initial output, then turns around and critiques that output against explicit quality criteria. Based on the critique, it produces an improved version. This loop repeats until the output meets a quality threshold or a maximum number of iterations is reached.

The concept draws from the Reflexion paper (Shinn et al., 2023), which demonstrated that LLM agents equipped with verbal self-reflection significantly outperform single-pass agents on coding, reasoning, and decision-making benchmarks.

The Core Loop: Generate, Evaluate, Refine

Every reflection agent follows the same three-step cycle:

  1. Generate — produce an initial response to the task
  2. Evaluate — score the response against defined criteria
  3. Refine — use the evaluation feedback to produce a better version

Here is a minimal implementation:

import json

from openai import OpenAI

client = OpenAI()

def reflect_and_improve(task: str, max_rounds: int = 3, threshold: float = 8.0) -> str:
    """Generate, evaluate, and refine output over multiple rounds."""

    # Round 1: initial generation
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": task}],
    )
    current_output = response.choices[0].message.content

    for round_num in range(max_rounds):
        # Evaluate the current output
        eval_prompt = f"""Rate the following output on a scale of 1-10 for:
- Accuracy (factual correctness)
- Completeness (covers all aspects)
- Clarity (easy to understand)

Output to evaluate:
{current_output}

Original task: {task}

Return JSON: {{"score": <average>, "weaknesses": ["..."], "suggestions": ["..."]}}"""

        eval_response = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": eval_prompt}],
            response_format={"type": "json_object"},
        )
        eval_data = json.loads(eval_response.choices[0].message.content)

        print(f"Round {round_num + 1}: Score = {eval_data['score']}")

        if eval_data["score"] >= threshold:
            print("Quality threshold met.")
            return current_output

        # Refine based on feedback
        refine_prompt = f"""Improve this output based on the feedback below.

Original task: {task}
Current output: {current_output}
Weaknesses: {eval_data['weaknesses']}
Suggestions: {eval_data['suggestions']}

Write an improved version that addresses every weakness."""

        refined = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": refine_prompt}],
        )
        current_output = refined.choices[0].message.content

    return current_output

Separating the Critic from the Generator

A stronger pattern uses two separate system prompts — one optimized for generation, another for harsh evaluation. This mitigates the self-serving bias in which a model rates its own output too generously.

GENERATOR_SYSTEM = """You are an expert technical writer.
Produce detailed, accurate, well-structured content."""

CRITIC_SYSTEM = """You are a demanding editor and fact-checker.
Find every flaw, gap, and inaccuracy. Be harsh but constructive.
Never rate above 7 unless the output is genuinely excellent."""

By giving the critic a deliberately strict persona, you force the generator to actually earn a passing score rather than coasting through on inflated self-assessments.


Multi-Dimensional Scoring

Production reflection agents score across multiple dimensions rather than using a single number. A scoring rubric might look like this:

RUBRIC = {
    "accuracy": "Are all facts, numbers, and claims correct?",
    "completeness": "Does the output address every part of the task?",
    "clarity": "Is the writing clear and free of jargon?",
    "actionability": "Can the reader act on this immediately?",
    "structure": "Is the output well-organized with logical flow?",
}

def multi_dim_evaluate(output: str, task: str) -> dict:
    """Score `output` on every rubric dimension in a single LLM call."""
    import json

    prompt = "Score this output 1-10 on each dimension:\n"
    for dim, question in RUBRIC.items():
        prompt += f"- {dim}: {question}\n"
    prompt += f"\nOutput: {output}\nTask: {task}"
    prompt += '\nReturn JSON with each dimension score and an "overall" average.'
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)
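Once per-dimension scores come back, not every dimension has to count equally. A small sketch of weighted aggregation — the weights below are illustrative assumptions, not a standard:

```python
# Hypothetical weights: accuracy matters most for this use case.
WEIGHTS = {
    "accuracy": 0.30,
    "completeness": 0.25,
    "clarity": 0.20,
    "actionability": 0.15,
    "structure": 0.10,
}

def weighted_overall(scores: dict[str, float], weights: dict[str, float] = WEIGHTS) -> float:
    """Combine per-dimension scores into one weighted overall score."""
    total = sum(weights.values())
    return sum(scores[dim] * w for dim, w in weights.items()) / total

scores = {"accuracy": 9, "completeness": 7, "clarity": 8, "actionability": 6, "structure": 8}
overall = weighted_overall(scores)  # 7.75 with the weights above
```

Weighting lets the threshold check reflect what actually matters: an output that nails accuracy but rambles can still pass, while a polished but wrong one cannot.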

When Reflection Helps (and When It Hurts)

Reflection adds latency and cost — each round means additional LLM calls. Use it when:

  • High stakes: the output will be shown to customers or used in decisions
  • Complex tasks: multi-step reasoning where errors compound
  • Code generation: where the critic can actually run tests to verify correctness

Skip it when the task is simple, latency-sensitive, or when the first-pass quality is already high enough for your use case.

FAQ

How many reflection rounds are typically needed?

Most tasks converge after 2-3 rounds. Research shows diminishing returns beyond 3 rounds for text generation tasks, though coding tasks can benefit from up to 5 rounds when combined with test execution.

Should the generator and critic use the same model?

Not necessarily. A common pattern uses a stronger model (GPT-4o) as the critic and a faster model (GPT-4o-mini) as the generator. This keeps costs down while maintaining evaluation quality. Some teams even use a smaller fine-tuned model for the critic.

How do you prevent infinite loops where the critic is never satisfied?

Always set a max_rounds limit. Additionally, track scores across rounds — if the score plateaus or decreases for two consecutive rounds, break early. The agent should recognize when further refinement is not productive.
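The plateau rule can be expressed as a small helper — a sketch, with the two-round patience window taken from the answer above:

```python
def should_stop(scores: list[float], patience: int = 2) -> bool:
    """Return True once the last `patience` rounds show no improvement
    over the best score seen before them."""
    if len(scores) <= patience:
        return False
    best_before = max(scores[:-patience])
    return all(s <= best_before for s in scores[-patience:])
```

Call it after each evaluation with the list of scores so far, and break out of the refinement loop as soon as it returns True — whichever fires first, this check or `max_rounds`.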


#ReflectionAgents #Reflexion #SelfImprovement #AgenticAI #LLMArchitecture #AIEngineering #PythonAI #PromptEngineering


CallSphere Team

Expert insights on AI voice agents and customer communication automation.
