Building a Mixture-of-Agents System: Combining Multiple LLMs for Superior Output
Learn how to build a Mixture-of-Agents (MoA) architecture that combines outputs from multiple LLMs using a proposer-aggregator pattern to produce higher quality results than any single model.
What Is Mixture-of-Agents?
Mixture-of-Agents (MoA) is an architecture where multiple LLMs independently generate responses to a query, and an aggregator model synthesizes their outputs into a single, superior response. Research from Together AI demonstrated that MoA can achieve state-of-the-art performance on benchmarks like AlpacaEval, surpassing even the strongest individual models.
The core insight is that LLMs are stronger together: each model brings different strengths, knowledge patterns, and reasoning approaches. An aggregator that sees all of their outputs can keep the best reasoning from each, catch errors that some models made but others avoided, and produce a more comprehensive, accurate response.
The Proposer-Aggregator Pattern
The architecture has two layers. Proposer agents independently generate candidate responses. The aggregator agent receives all proposals and produces the final output.
import asyncio
from dataclasses import dataclass
from typing import Any


@dataclass
class ProposerConfig:
    name: str
    model: str
    temperature: float = 0.7
    system_prompt: str = "You are a helpful assistant."


@dataclass
class Proposal:
    source: str
    content: str
    model: str


class MixtureOfAgents:
    def __init__(
        self,
        proposers: list[ProposerConfig],
        aggregator_model: str = "gpt-4o",
        num_layers: int = 1,
    ):
        self.proposers = proposers
        self.aggregator_model = aggregator_model
        self.num_layers = num_layers

    async def _call_llm(
        self, model: str, messages: list[dict], temperature: float
    ) -> str:
        """Replace with your actual LLM client."""
        # Example using the openai client:
        # response = await client.chat.completions.create(
        #     model=model, messages=messages, temperature=temperature
        # )
        # return response.choices[0].message.content
        raise NotImplementedError("Wire up your LLM client here")

    async def _get_proposal(
        self, config: ProposerConfig, query: str
    ) -> Proposal:
        messages = [
            {"role": "system", "content": config.system_prompt},
            {"role": "user", "content": query},
        ]
        content = await self._call_llm(
            config.model, messages, config.temperature
        )
        return Proposal(
            source=config.name, content=content, model=config.model
        )

    async def _aggregate(
        self, query: str, proposals: list[Proposal]
    ) -> str:
        proposal_text = "\n\n".join(
            f"--- Response from {p.source} ({p.model}) ---\n{p.content}"
            for p in proposals
        )
        agg_prompt = (
            "You have been given several AI-generated responses to "
            "the same query. Synthesize them into a single, superior "
            "response that:\n"
            "1. Combines the best reasoning and insights from each\n"
            "2. Corrects any errors present in individual responses\n"
            "3. Fills gaps where one response covers something others missed\n"
            "4. Maintains a coherent, well-structured narrative\n\n"
            f"Original query: {query}\n\n"
            f"Responses to synthesize:\n{proposal_text}"
        )
        messages = [
            {"role": "system", "content": "You are an expert synthesizer."},
            {"role": "user", "content": agg_prompt},
        ]
        return await self._call_llm(self.aggregator_model, messages, 0.3)

    async def run(self, query: str) -> dict[str, Any]:
        current_query = query
        for layer in range(self.num_layers):
            proposals = await asyncio.gather(
                *[self._get_proposal(p, current_query) for p in self.proposers]
            )
            if layer < self.num_layers - 1:
                # Intermediate layer: the aggregated output becomes
                # the input for the next layer of proposers
                current_query = await self._aggregate(query, proposals)
            else:
                final = await self._aggregate(query, proposals)
        return {
            "final_response": final,
            "num_proposals": len(proposals),
            "models_used": [p.model for p in proposals],
            "layers": self.num_layers,
        }
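Before wiring a real client into _call_llm, the fan-out-then-aggregate flow is easy to smoke-test with stub coroutines. This standalone sketch mirrors what _get_proposal and _aggregate do, with canned strings instead of API calls (every name here is illustrative):

```python
import asyncio

async def stub_propose(name: str, query: str) -> str:
    # Stand-in for one proposer's LLM call.
    await asyncio.sleep(0)
    return f"{name}: answer to {query!r}"

async def stub_aggregate(query: str, proposals: list[str]) -> str:
    # Stand-in for the aggregator's synthesis call.
    return f"synthesis of {len(proposals)} proposals for {query!r}"

async def main() -> str:
    names = ("analytical", "creative", "practical")
    # All proposers run concurrently, like the asyncio.gather call in run().
    proposals = await asyncio.gather(
        *[stub_propose(n, "What is MoA?") for n in names]
    )
    return await stub_aggregate("What is MoA?", list(proposals))

print(asyncio.run(main()))
# → synthesis of 3 proposals for 'What is MoA?'
```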
Multi-Layer MoA
The num_layers parameter enables stacking. In a 2-layer MoA, the aggregated output from layer 1 becomes the input for the proposers in layer 2, whose outputs are aggregated again. Each layer refines the response further, but returns diminish quickly: in practice, two to three layers capture most of the benefit.
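The layered data flow is easier to see with a toy trace, where each "model" simply wraps its input in its own name and the "aggregator" concatenates (purely illustrative, no LLMs involved):

```python
import asyncio

async def propose(model: str, text: str) -> str:
    # Toy "model": tags its input so the data flow shows up in the output.
    return f"{model}({text})"

async def aggregate(proposals: list[str]) -> str:
    # Toy "aggregator": concatenates instead of synthesizing.
    return "+".join(proposals)

async def run_layers(query: str, num_layers: int) -> str:
    current = query
    for _ in range(num_layers):
        proposals = await asyncio.gather(
            propose("m1", current), propose("m2", current)
        )
        current = await aggregate(list(proposals))
    return current

print(asyncio.run(run_layers("q", 2)))
# → m1(m1(q)+m2(q))+m2(m1(q)+m2(q))
```

The layer-1 synthesis appears verbatim inside each layer-2 proposal, which is exactly the refinement loop run() implements.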
Configuring Diverse Proposers
The power of MoA comes from diversity. If all proposers use the same model with the same temperature, you get redundant outputs. Configure proposers with different models, temperatures, and system prompts.
proposers = [
ProposerConfig(
name="analytical",
model="gpt-4o",
temperature=0.3,
system_prompt="You are a precise analytical thinker. Focus on accuracy and logical reasoning.",
),
ProposerConfig(
name="creative",
model="claude-sonnet-4-20250514",
temperature=0.8,
system_prompt="You are a creative problem solver. Consider unconventional angles.",
),
ProposerConfig(
name="practical",
model="gemini-1.5-pro",
temperature=0.5,
system_prompt="You are a pragmatic engineer. Focus on implementation details.",
),
]
moa = MixtureOfAgents(
proposers=proposers,
aggregator_model="gpt-4o",
num_layers=2,
)
Cost and Latency Management
MoA multiplies your LLM costs: each layer issues one call per proposer plus one aggregation call. Mitigate this with three strategies.
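The arithmetic, matching the loop in run() above, is simple enough to spell out:

```python
def moa_call_count(num_proposers: int, num_layers: int = 1) -> int:
    # Each layer makes one call per proposer plus one aggregation call.
    return num_layers * (num_proposers + 1)

print(moa_call_count(3))     # → 4 (single layer: 3 proposers + 1 aggregator)
print(moa_call_count(3, 2))  # → 8 (a second layer doubles the bill)
```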
Tiered proposers: Use cheaper models (GPT-4o-mini, Claude Haiku) as proposers and reserve the expensive model for aggregation only. The aggregator benefits from seeing diverse reasoning without each proposal needing top-tier quality.
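A tiered setup might look like the following. The model names are examples only, and the ProposerConfig dataclass is repeated so the snippet runs standalone:

```python
from dataclasses import dataclass

@dataclass
class ProposerConfig:
    name: str
    model: str
    temperature: float = 0.7
    system_prompt: str = "You are a helpful assistant."

# Cheap, fast models generate the proposals...
tiered_proposers = [
    ProposerConfig(name="fast-a", model="gpt-4o-mini", temperature=0.3),
    ProposerConfig(name="fast-b", model="claude-3-5-haiku-latest", temperature=0.8),
    ProposerConfig(name="fast-c", model="gemini-1.5-flash", temperature=0.5),
]
# ...and only the single aggregation call pays for a top-tier model.
aggregator_model = "gpt-4o"

print([p.model for p in tiered_proposers])
```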
Parallel execution: All proposals run concurrently with asyncio.gather, so latency equals the slowest proposer rather than the sum. The aggregation step adds one more round-trip.
Selective MoA: Use a router that invokes MoA only for complex queries. Simple factual questions can go directly to a single model. Score query complexity based on length, ambiguity, or domain, and only fan out to multiple proposers above a threshold.
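One rough sketch of such a router, using crude lexical heuristics. The keyword list, weights, and threshold are placeholders you would tune against real traffic (or replace with a small classifier model):

```python
COMPLEX_KEYWORDS = ("compare", "trade-off", "design", "why", "explain")

def complexity_score(query: str) -> float:
    # Crude proxies for complexity: length, multiple questions, analytic verbs.
    q = query.lower()
    score = 0.0
    if len(q.split()) > 30:
        score += 0.4
    if q.count("?") > 1:
        score += 0.3
    score += sum(0.3 for w in COMPLEX_KEYWORDS if w in q)
    return score

def should_use_moa(query: str, threshold: float = 0.5) -> bool:
    # Fan out to the full mixture only when the query looks hard enough.
    return complexity_score(query) >= threshold

print(should_use_moa("What year was Python released?"))                 # → False
print(should_use_moa("Compare the design trade-offs of SQL vs NoSQL"))  # → True
```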
FAQ
How many proposers should I use?
Three is the sweet spot for most applications. Two proposers often agree, giving the aggregator little to work with. Five or more adds cost without proportional quality gains unless the task is highly ambiguous. Start with three models from different providers to maximize diversity.
Does MoA work for code generation, or only for text?
MoA works excellently for code generation. Different models make different kinds of mistakes — one might miss an edge case, another might use a deprecated API. The aggregator can combine the correct logic from one proposal with the proper API usage from another. For code, add a "test the code" verification step after aggregation.
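A minimal version of that verification step, assuming you are comfortable executing the generated code on your machine (a subprocess limits blast radius from crashes and hangs but is not a security sandbox):

```python
import subprocess
import sys
import tempfile

def passes_smoke_test(code: str, test_snippet: str, timeout: float = 10.0) -> bool:
    # Write the aggregated code plus a caller-supplied assertion to a temp
    # file and run it in a fresh interpreter; exit code 0 means it passed.
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code + "\n" + test_snippet + "\n")
        path = f.name
    try:
        proc = subprocess.run(
            [sys.executable, path], capture_output=True, timeout=timeout
        )
        return proc.returncode == 0
    except subprocess.TimeoutExpired:
        return False

generated = "def add(a, b):\n    return a + b"
print(passes_smoke_test(generated, "assert add(2, 3) == 5"))  # → True
print(passes_smoke_test(generated, "assert add(2, 3) == 6"))  # → False
```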
Can I use MoA with open-source models to avoid API costs entirely?
Absolutely. Run three different open-source models (Llama, Mistral, Qwen) locally and use the strongest as the aggregator. This is one of MoA's most compelling use cases — three medium-quality open-source models combined often outperform a single large proprietary model, at zero API cost.
CallSphere Team
Expert insights on AI voice agents and customer communication automation.