Consensus Algorithms for Multi-Agent Systems: Voting, Averaging, and Byzantine Fault Tolerance

Why Agents Need Consensus

When multiple AI agents collaborate on a task, they frequently produce different answers. One agent might classify a support ticket as "billing," another as "account access," and a third as "technical." Without a structured way to reconcile these disagreements, your system either picks arbitrarily or fails entirely.

Consensus algorithms give agents a structured way to reach agreement. Drawing on distributed systems theory and ensemble methods, these patterns let you build multi-agent pipelines that are often more accurate than a single agent and that stay resilient when individual agents fail.

Pattern 1: Majority Voting

The simplest consensus mechanism asks each agent for a discrete answer and picks the one chosen most often. This works best when agents produce categorical outputs like classifications, yes/no decisions, or label assignments.

from collections import Counter
from dataclasses import dataclass
from typing import Any

@dataclass
class AgentVote:
    agent_id: str
    choice: str
    confidence: float

class MajorityVotingConsensus:
    def __init__(self, quorum: int = 3):
        self.quorum = quorum

    def resolve(self, votes: list[AgentVote]) -> dict[str, Any]:
        if len(votes) < self.quorum:
            raise ValueError(
                f"Need {self.quorum} votes, got {len(votes)}"
            )

        counts = Counter(v.choice for v in votes)
        winner, winner_count = counts.most_common(1)[0]
        total = len(votes)

        return {
            "decision": winner,
            "agreement_ratio": winner_count / total,
            "vote_distribution": dict(counts),
            "unanimous": winner_count == total,
        }

# Usage
consensus = MajorityVotingConsensus(quorum=3)
votes = [
    AgentVote("classifier-1", "billing", 0.85),
    AgentVote("classifier-2", "billing", 0.72),
    AgentVote("classifier-3", "account_access", 0.65),
]
result = consensus.resolve(votes)
# decision: "billing", agreement_ratio: 0.667

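The agreement_ratio field is critical for downstream logic. A 3-to-0 unanimous vote carries far more weight than a 2-to-1 split. Define thresholds for it: for example, escalate to a human reviewer when agreement drops below 0.6. Note that resolve returns the most common choice even when it falls short of a strict majority, and that Counter.most_common orders tied choices by first appearance, so a tie is silently decided by vote order unless you check for it.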
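The threshold idea above can be sketched as a small routing function. This is a minimal illustration, not part of the class above: the name route_decision, the "auto" and "human_review" labels, and the sample result dictionaries are all assumptions made for the example.

```python
def route_decision(result: dict, min_agreement: float = 0.6) -> str:
    """Return 'auto' to accept the vote or 'human_review' to escalate.

    `result` is assumed to have the shape returned by
    MajorityVotingConsensus.resolve (only agreement_ratio is read).
    """
    if result["agreement_ratio"] < min_agreement:
        return "human_review"
    return "auto"


# Hypothetical results for a 3-agent vote
split_vote = {"decision": "billing", "agreement_ratio": 2 / 3}
three_way = {"decision": "billing", "agreement_ratio": 1 / 3}

print(route_decision(split_vote))  # auto
print(route_decision(three_way))   # human_review
```

With three agents the ratio can only be 1.0, 0.667, or 0.333, so a 0.6 threshold escalates only when all three agents disagree. Larger pools make the threshold more meaningful.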

Pattern 2: Weighted Averaging

When agents produce numeric outputs (scores, probabilities, estimates), weighted averaging lets you combine them while giving more influence to agents with higher confidence or better historical accuracy.

from typing import Any

class WeightedAverageConsensus:
    def __init__(self, agent_weights: dict[str, float] | None = None):
        # Per-agent multipliers from historical accuracy; unknown agents get 1.0
        self.agent_weights = agent_weights or {}

    def resolve(
        self, estimates: list[dict[str, Any]]
    ) -> dict[str, Any]:
        # Each estimate holds "agent_id" (str), "value" and "confidence" (floats)
        total_weight = 0.0
        weighted_sum = 0.0

        for est in estimates:
            agent_id = est["agent_id"]
            value = est["value"]
            confidence = est["confidence"]
            historical_weight = self.agent_weights.get(agent_id, 1.0)

            weight = confidence * historical_weight
            weighted_sum += value * weight
            total_weight += weight

        # Guard against an empty list or all-zero weights
        if total_weight <= 0:
            raise ValueError("Total weight must be positive")

        consensus_value = weighted_sum / total_weight
        # Unweighted spread of the raw estimates around the weighted mean
        variance = sum(
            ((e["value"] - consensus_value) ** 2) for e in estimates
        ) / len(estimates)

        return {
            "consensus_value": round(consensus_value, 4),
            "variance": round(variance, 4),
            "num_agents": len(estimates),
        }

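# Agents with proven track records get higher weight
consensus = WeightedAverageConsensus(
    agent_weights={"estimator-a": 1.5, "estimator-b": 1.0, "estimator-c": 0.7}
)
estimates = [
    {"agent_id": "estimator-a", "value": 0.8, "confidence": 0.9},
    {"agent_id": "estimator-b", "value": 0.6, "confidence": 0.8},
    {"agent_id": "estimator-c", "value": 0.9, "confidence": 0.5},
]
result = consensus.resolve(estimates)
# consensus_value: 0.75, variance: 0.0158, num_agents: 3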
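One way to obtain the agent_weights passed above is to derive them from each agent's track record. The sketch below is an assumption, not something the pattern prescribes: the helper name weights_from_accuracy, the (correct, total) history format, and the floor parameter are all hypothetical.

```python
def weights_from_accuracy(
    history: dict[str, tuple[int, int]], floor: float = 0.1
) -> dict[str, float]:
    """Map agent_id -> weight from (correct, total) counts.

    Accuracies are rescaled so that an agent with average accuracy
    gets weight 1.0. `floor` keeps weak or unseen agents from
    dropping to zero weight.
    """
    if not history:
        return {}
    accuracy = {
        agent: max(correct / total, floor) if total else floor
        for agent, (correct, total) in history.items()
    }
    mean_accuracy = sum(accuracy.values()) / len(accuracy)
    return {
        agent: round(acc / mean_accuracy, 3)
        for agent, acc in accuracy.items()
    }


# Hypothetical evaluation history: (correct answers, total answers)
history = {
    "estimator-a": (90, 100),
    "estimator-b": (60, 100),
    "estimator-c": (30, 100),
}
weights = weights_from_accuracy(history)
print(weights)  # {'estimator-a': 1.5, 'estimator-b': 1.0, 'estimator-c': 0.5}
```

Normalizing around 1.0 keeps the weights comparable to the default of 1.0 that WeightedAverageConsensus applies to agents it has no record for.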

Pattern 3: Byzantine Fault Tolerance

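In real deployments, agents can fail in unpredictable ways: returning garbage, hallucinating confidently, or being compromised. Byzantine fault tolerance (BFT) addresses these failures by requiring a supermajority of agents to agree. The class below is a simplified, BFT-inspired aggregator rather than a full BFT protocol: it filters outliers using the median and the median absolute deviation (MAD), then reports consensus only if no more than f responses were excluded.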

import statistics

class ByzantineFaultTolerantConsensus:
    """Median/MAD outlier filter sized for up to f faulty agents out of 3f+1 total."""

    def __init__(self, max_faulty: int = 1):
        self.max_faulty = max_faulty
        self.min_agents = 3 * max_faulty + 1

    def resolve(self, responses: list[dict]) -> dict:
        if len(responses) < self.min_agents:
            raise ValueError(
                f"Need >= {self.min_agents} agents for f={self.max_faulty}"
            )

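        values = [r["value"] for r in responses]
        median = statistics.median(values)
        # Median absolute deviation: a spread estimate that resists outliers
        mad = statistics.median(
            [abs(v - median) for v in values]
        )
        # MAD is 0 when most values coincide; fall back to 10% of |median|
        threshold = 3 * mad if mad > 0 else 0.1 * abs(median)

        # Split responses into those near the median and the outliers
        trusted = [
            r for r in responses
            if abs(r["value"] - median) <= threshold
        ]
        excluded = [
            r for r in responses
            if abs(r["value"] - median) > threshold
        ]

        # More than f outliers means the agents disagree too much to trust
        if len(trusted) < len(responses) - self.max_faulty:
            return {"status": "no_consensus", "excluded": excluded}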

        consensus_val = statistics.mean(r["value"] for r in trusted)
        return {
            "status": "consensus",
            "value": round(consensus_val, 4),
            "trusted_agents": len(trusted),
            "excluded_agents": [e["agent_id"] for e in excluded],
        }

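The key insight is 3f + 1: to tolerate one faulty agent, you need at least four agents total. To tolerate two, you need seven. This lower bound comes from the Byzantine generals result in distributed systems theory (Lamport, Shostak, and Pease), which covers nodes that exchange unsigned messages with no trusted coordinator. The class above collects every response in one place, so it borrows 3f + 1 as a conservative sizing rule rather than implementing the message-passing protocol the bound was proved for.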
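The sizing arithmetic can be checked with two small helpers. The function names below are hypothetical and exist only to illustrate the 3f + 1 relationship in both directions.

```python
def min_agents_for(faulty: int) -> int:
    """Smallest pool size that satisfies n >= 3f + 1."""
    return 3 * faulty + 1


def max_faulty_for(agents: int) -> int:
    """Largest f such that 3f + 1 <= agents (never negative)."""
    return max((agents - 1) // 3, 0)


for f in (1, 2, 3):
    print(f"f={f} needs {min_agents_for(f)} agents")
# f=1 needs 4 agents
# f=2 needs 7 agents
# f=3 needs 10 agents

for n in (3, 4, 6, 7):
    print(f"{n} agents tolerate f={max_faulty_for(n)}")
# 3 agents tolerate f=0
# 4 agents tolerate f=1
# 6 agents tolerate f=1
# 7 agents tolerate f=2
```

Note that going from four agents to six buys no extra tolerance under this rule. Capacity only increases at 4, 7, 10, and so on.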

Choosing the Right Pattern

Use majority voting for classification tasks with discrete outputs. Use weighted averaging for numeric estimates where agent reliability varies. Use BFT when agent outputs cannot be trusted unconditionally — such as when agents call external APIs that might return errors, or when you run heterogeneous models with different failure modes.

FAQ

When should I use consensus instead of just picking the best single agent?

Use consensus whenever the cost of a wrong answer exceeds the cost of running multiple agents. A 3-agent majority vote with mid-tier models can match or outperform a single top-tier model at lower total cost on some workloads, so benchmark both on your own data before committing. Classification tasks benefit most, because the agreement rate gives you a built-in confidence signal.

How do I handle ties in majority voting?

Common strategies include: adding more agents until the tie breaks, falling back to the agent with the highest confidence score, or escalating to a human reviewer. Never resolve ties randomly in production — you lose reproducibility and auditability.
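A deterministic version of these strategies might look like the sketch below. It is an illustration under assumptions: the function name resolve_with_tiebreak, the (agent_id, choice, confidence) tuple format, and the "method" labels are hypothetical. It tries the highest single confidence among the tied choices first and escalates when that is also tied.

```python
from collections import Counter


def resolve_with_tiebreak(votes: list[tuple[str, str, float]]) -> dict:
    """Resolve (agent_id, choice, confidence) votes without randomness."""
    if not votes:
        raise ValueError("At least one vote is required")

    counts = Counter(choice for _, choice, _ in votes)
    top_count = max(counts.values())
    # Sort tied choices so the result never depends on vote order
    tied = sorted(c for c, n in counts.items() if n == top_count)
    if len(tied) == 1:
        return {"decision": tied[0], "method": "vote_count"}

    # Tie on counts: compare the most confident agent behind each choice
    best_conf = {
        choice: max(conf for _, ch, conf in votes if ch == choice)
        for choice in tied
    }
    top_conf = max(best_conf.values())
    leaders = [c for c in tied if best_conf[c] == top_conf]
    if len(leaders) == 1:
        return {"decision": leaders[0], "method": "confidence"}

    # Still tied: hand off to a human instead of guessing
    return {"decision": None, "method": "escalate", "tied": leaders}


print(resolve_with_tiebreak([
    ("classifier-1", "billing", 0.7),
    ("classifier-2", "technical", 0.9),
]))
# {'decision': 'technical', 'method': 'confidence'}
```

Because every branch is a pure function of the votes, the same inputs always yield the same decision, which keeps the audit trail reproducible.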

Does BFT work for text generation, not just numeric outputs?

Yes, but you need a similarity metric to replace numeric distance. Use embedding cosine similarity or ROUGE scores to identify outliers. If one agent generates text that is semantically distant from all others, treat it as a Byzantine failure and exclude it before selecting the most representative output.
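The outlier step for text can be sketched as follows. To stay self-contained, this example uses token-set Jaccard overlap as a stand-in for embedding cosine similarity or ROUGE; in practice you would swap in a real similarity function. The names jaccard and filter_text_outliers, the 0.3 threshold, and the sample outputs are assumptions made for the illustration.

```python
def jaccard(a: str, b: str) -> float:
    """Token-set overlap in [0, 1]; a stand-in for a semantic similarity."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def filter_text_outliers(
    outputs: dict[str, str], min_similarity: float = 0.3
) -> dict:
    """Exclude agents whose text is dissimilar to all the others."""
    if len(outputs) < 2:
        raise ValueError("Need at least two outputs to compare")

    # Average similarity of each agent's text to every other agent's text
    avg_sim = {}
    for agent, text in outputs.items():
        sims = [jaccard(text, t) for a, t in outputs.items() if a != agent]
        avg_sim[agent] = sum(sims) / len(sims)

    trusted = [a for a, s in avg_sim.items() if s >= min_similarity]
    excluded = [a for a in outputs if a not in trusted]
    # Most representative output: the trusted text closest to the rest
    representative = max(trusted, key=avg_sim.get) if trusted else None
    return {
        "representative": representative,
        "trusted": trusted,
        "excluded": excluded,
    }


outputs = {
    "agent-1": "reset the password from the account settings page",
    "agent-2": "reset your password from the account settings page",
    "agent-3": "reset the password on the account settings page",
    "agent-4": "the invoice total includes regional sales tax",
}
result = filter_text_outliers(outputs)
print(result["excluded"])        # ['agent-4']
print(result["representative"])  # agent-1
```

Word overlap misses paraphrases that share meaning but not vocabulary, which is why the answer above recommends embedding-based similarity for real workloads.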
