
Building a Debate Agent System: Two AI Agents That Argue to Find Better Answers

Build a multi-agent debate system where pro and con agents construct opposing arguments while a judge agent evaluates quality, driving convergence toward more accurate and nuanced answers.

Why AI Debates Produce Better Answers

A single LLM answering a question tends to commit to one perspective early and then reinforce it. This leads to confirmation bias, missed nuances, and overconfident conclusions. The debate architecture fixes this by forcing two agents to argue opposing sides while a third agent judges the quality of their arguments.

Research from Anthropic, Google DeepMind, and others has shown that multi-agent debate consistently improves accuracy on reasoning, math, and factual tasks compared to single-agent approaches. The mechanism is simple: adversarial pressure exposes weak reasoning that self-reflection alone would miss.

Architecture Overview

The system has three agent roles:

  1. Pro Agent — argues in favor of a position
  2. Con Agent — argues against the same position
  3. Judge Agent — evaluates arguments, identifies the strongest points, and synthesizes a final answer

Two Pydantic models record each round's arguments and scores, plus the overall result:
from pydantic import BaseModel
from openai import OpenAI
import json

client = OpenAI()

class DebateRound(BaseModel):
    round_number: int
    pro_argument: str
    con_argument: str
    judge_feedback: str
    pro_score: float
    con_score: float

class DebateResult(BaseModel):
    question: str
    rounds: list[DebateRound]
    final_answer: str
    confidence: float

The Debater Agents

Each debater receives the question, its assigned side, and the history of previous rounds so it can respond to the opponent:

def create_debater_message(
    question: str,
    side: str,
    history: list[DebateRound],
    round_num: int,
) -> str:
    """Generate an argument for one side of the debate."""
    history_text = ""
    for r in history:
        history_text += f"\n--- Round {r.round_number} ---\n"
        history_text += f"Pro: {r.pro_argument}\n"
        history_text += f"Con: {r.con_argument}\n"
        history_text += f"Judge: {r.judge_feedback}\n"

    side_instruction = {
        "pro": "Argue IN FAVOR of the position. Build on your previous points and directly counter the opponent's strongest arguments.",
        "con": "Argue AGAINST the position. Build on your previous points and directly counter the opponent's strongest arguments.",
    }

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": f"""You are a skilled debater arguing the {side} side.
Rules:
- Make specific, evidence-based arguments
- Directly address your opponent's strongest points
- Acknowledge valid opposing points but explain why your side is stronger
- Do NOT strawman the opponent
- Be concise: 150-200 words per round

{side_instruction[side]}"""},
            {"role": "user", "content": (
                f"Question: {question}\n"
                f"Debate history: {history_text}\n"
                f"Round {round_num}: Present your {side} argument."
            )},
        ],
    )
    return response.choices[0].message.content

The Judge Agent

The judge evaluates both sides after each round, scores them, and provides feedback that guides the next round:


def judge_round(
    question: str,
    pro_argument: str,
    con_argument: str,
    history: list[DebateRound],
) -> dict:
    """Judge evaluates both arguments and provides scores."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": """You are an impartial debate judge.
Evaluate both arguments on:
1. Logical validity (is the reasoning sound?)
2. Evidence quality (are claims supported?)
3. Responsiveness (does it address the opponent's points?)
4. Persuasiveness (how compelling is the overall argument?)

Score each side 0-10. Identify:
- The single strongest point from each side
- The single weakest point from each side
- What each side should address in the next round

Be genuinely impartial. Do not favor either side by default."""},
            {"role": "user", "content": (
                f"Question: {question}\n"
                f"Pro argument: {pro_argument}\n"
                f"Con argument: {con_argument}\n"
                "Evaluate and return JSON with: pro_score, con_score, "
                "feedback, strongest_pro_point, strongest_con_point."
            )},
        ],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)
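
Even with `response_format={"type": "json_object"}`, the model controls the JSON values, so a score can come back missing, as a string, or out of range. A small defensive clamp (a hypothetical helper, not part of the code above) keeps downstream arithmetic safe:

```python
def clamp_score(value, lo: float = 0.0, hi: float = 10.0) -> float:
    """Coerce a judge score to a float in [lo, hi]; fall back to the midpoint."""
    try:
        return max(lo, min(hi, float(value)))
    except (TypeError, ValueError):
        # Missing or non-numeric value: return a neutral midpoint score
        return (lo + hi) / 2
```

In the debate loop, `clamp_score(judgment.get("pro_score"))` would replace the bare `judgment.get("pro_score", 5.0)` lookup.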

The Debate Loop

The orchestrator runs a fixed number of rounds, passing the accumulated history into every debater and judge call:

def run_debate(question: str, num_rounds: int = 3) -> DebateResult:
    """Run a full multi-round debate and produce a final answer."""
    rounds: list[DebateRound] = []

    for round_num in range(1, num_rounds + 1):
        # Both sides argue
        pro_arg = create_debater_message(question, "pro", rounds, round_num)
        con_arg = create_debater_message(question, "con", rounds, round_num)

        # Judge evaluates
        judgment = judge_round(question, pro_arg, con_arg, rounds)

        round_result = DebateRound(
            round_number=round_num,
            pro_argument=pro_arg,
            con_argument=con_arg,
            judge_feedback=judgment.get("feedback", ""),
            pro_score=judgment.get("pro_score", 5.0),
            con_score=judgment.get("con_score", 5.0),
        )
        rounds.append(round_result)
        print(f"Round {round_num}: Pro={round_result.pro_score:.1f} Con={round_result.con_score:.1f}")

    # Synthesize final answer from the full debate
    final = synthesize_debate(question, rounds)
    return DebateResult(
        question=question,
        rounds=rounds,
        final_answer=final["answer"],
        confidence=final["confidence"],
    )
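
Each round above makes three model calls (pro, con, judge), and one synthesis call runs at the end, so cost grows linearly with rounds. A small helper (hypothetical, for budgeting only) makes the arithmetic explicit:

```python
def num_llm_calls(num_rounds: int, num_debaters: int = 2) -> int:
    """Calls per debate: every round runs each debater plus the judge,
    and one synthesis call runs at the end."""
    return num_rounds * (num_debaters + 1) + 1

# The default three-round, two-sided debate makes 10 calls.
```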

Synthesis: Combining the Best of Both Sides

The final answer should not simply pick a winner. Instead, it synthesizes the strongest points from both sides into a nuanced conclusion:

def synthesize_debate(question: str, rounds: list[DebateRound]) -> dict:
    """Produce a final answer that incorporates the best arguments."""
    debate_summary = "\n".join(
        f"Round {r.round_number}: Pro({r.pro_score}) said: {r.pro_argument[:200]}... "
        f"Con({r.con_score}) said: {r.con_argument[:200]}..."
        for r in rounds
    )

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": """Synthesize the debate into a final answer.
- Incorporate the strongest validated points from both sides
- Acknowledge genuine uncertainty where the debate was inconclusive
- Provide a clear conclusion with appropriate caveats
- Rate your confidence (0.0-1.0) based on how decisive the debate was"""},
            {"role": "user", "content": (
                f"Question: {question}\nDebate summary:\n{debate_summary}"
            )},
        ],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)

Convergence and Quality

Well-designed debates converge over 2-4 rounds. Watch for: (1) debaters repeating arguments without new substance (time to stop), (2) scores stabilizing (the strongest arguments have been found), or (3) the judge identifying that both sides agree on key points (consensus reached). Set a maximum round limit of 4-5 to control costs.
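
These stopping signals can be made mechanical. The sketch below checks score stability and argument repetition; the thresholds (`eps`, the 0.8 overlap cutoff) are illustrative assumptions, not tuned values:

```python
def jaccard(a: str, b: str) -> float:
    """Word-level overlap between two arguments (0 = disjoint, 1 = identical)."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(1, len(wa | wb))

def should_stop(pro_scores: list[float], con_scores: list[float],
                pro_args: list[str], con_args: list[str],
                eps: float = 0.5, overlap_cutoff: float = 0.8) -> bool:
    """Stop when scores have stabilized or a debater is repeating itself."""
    # Signal 1: both sides' scores unchanged (within eps) over the last 3 rounds
    if len(pro_scores) >= 3:
        stable = all(
            abs(s[-1] - s[-2]) <= eps and abs(s[-2] - s[-3]) <= eps
            for s in (pro_scores, con_scores)
        )
        if stable:
            return True
    # Signal 2: the latest argument largely repeats the previous one
    if len(pro_args) >= 2:
        if (jaccard(pro_args[-1], pro_args[-2]) > overlap_cutoff
                or jaccard(con_args[-1], con_args[-2]) > overlap_cutoff):
            return True
    return False
```

In `run_debate`, this check would run after appending each round and break out of the loop early instead of always exhausting `num_rounds`.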

FAQ

Does the debate always improve answer quality?

For factual and reasoning tasks, yes — studies consistently show improvement over single-agent baselines. For creative tasks, debates can be overly analytical and suppress divergent thinking. For opinion-based questions, debates produce more nuanced answers but may feel indecisive.

Can you use more than two debaters?

Yes. A "panel" format with 3-4 agents each defending a different position works well for questions with more than two viable answers. The judge then evaluates across all positions. Be aware that costs scale linearly with the number of debaters.
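
The panel variant mostly changes prompt construction. As an illustrative sketch (`panel_side_instructions` is a hypothetical helper, not part of the code above), each position gets a system instruction naming the rival positions it must rebut:

```python
def panel_side_instructions(positions: list[str]) -> dict[str, str]:
    """Build one debater instruction per position in a multi-way panel."""
    return {
        p: (
            f"Argue that '{p}' is the best answer. "
            "Directly rebut the strongest competing positions: "
            + ", ".join(q for q in positions if q != p)
        )
        for p in positions
    }
```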

How do you prevent the debaters from agreeing too quickly?

Assign strong contrarian system prompts and penalize the judge for scoring both sides equally in early rounds. Some implementations use a "devil's advocate" instruction that forces the con agent to find flaws even when it might privately agree with the pro side.
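
One minimal way to implement the devil's-advocate variant is to append a contrarian clause to the con debater's system prompt. The wording below is an example, not a canonical prompt:

```python
DEVILS_ADVOCATE = (
    "Even if you privately find the opposing side persuasive, your role is to "
    "stress-test it. In every round, name at least one unstated assumption and "
    "one concrete failure mode of the pro position."
)

def con_system_prompt(base: str, force_contrarian: bool = True) -> str:
    """Optionally append a devil's-advocate clause to the con debater's prompt."""
    return f"{base}\n\n{DEVILS_ADVOCATE}" if force_contrarian else base
```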


#DebateAgents #MultiAgentSystems #AdversarialAI #AIDebate #AgenticAI #PythonAI #ReasoningImprovement #AgentArchitecture


CallSphere Team

Expert insights on AI voice agents and customer communication automation.
