Building a Mixture-of-Agents System: Combining Multiple LLMs for Superior Output
Learn how to build a Mixture-of-Agents (MoA) architecture that combines outputs from multiple LLMs using a proposer-aggregator pattern to produce higher quality results than any single model.
What Is Mixture-of-Agents?
Mixture-of-Agents (MoA) is an architecture where multiple LLMs independently generate responses to a query, and an aggregator model synthesizes their outputs into a single, superior response. Research from Together AI demonstrated that MoA can achieve state-of-the-art performance on benchmarks like AlpacaEval, surpassing even the strongest individual models.
The core insight is that LLMs are stronger together: each model brings different strengths, knowledge patterns, and reasoning approaches. An aggregator that sees all of their outputs can keep the best reasoning from each, catch errors that some models made but others avoided, and produce a more comprehensive, accurate response.
The Proposer-Aggregator Pattern
The architecture has two layers. Proposer agents independently generate candidate responses. The aggregator agent receives all proposals and produces the final output.
import asyncio
from dataclasses import dataclass
from typing import Any


@dataclass
class ProposerConfig:
    name: str
    model: str
    temperature: float = 0.7
    system_prompt: str = "You are a helpful assistant."


@dataclass
class Proposal:
    source: str
    content: str
    model: str


class MixtureOfAgents:
    def __init__(
        self,
        proposers: list[ProposerConfig],
        aggregator_model: str = "gpt-4o",
        num_layers: int = 1,
    ):
        self.proposers = proposers
        self.aggregator_model = aggregator_model
        self.num_layers = num_layers

    async def _call_llm(
        self, model: str, messages: list[dict], temperature: float
    ) -> str:
        """Replace with your actual LLM client."""
        # Example using the openai client:
        # response = await client.chat.completions.create(
        #     model=model, messages=messages, temperature=temperature
        # )
        # return response.choices[0].message.content
        raise NotImplementedError("Wire up your LLM client here")

    async def _get_proposal(
        self, config: ProposerConfig, query: str
    ) -> Proposal:
        messages = [
            {"role": "system", "content": config.system_prompt},
            {"role": "user", "content": query},
        ]
        content = await self._call_llm(
            config.model, messages, config.temperature
        )
        return Proposal(
            source=config.name, content=content, model=config.model
        )

    async def _aggregate(
        self, query: str, proposals: list[Proposal]
    ) -> str:
        proposal_text = "\n\n".join(
            f"--- Response from {p.source} ({p.model}) ---\n{p.content}"
            for p in proposals
        )
        agg_prompt = (
            "You have been given several AI-generated responses to "
            "the same query. Synthesize them into a single, superior "
            "response that:\n"
            "1. Combines the best reasoning and insights from each\n"
            "2. Corrects any errors present in individual responses\n"
            "3. Fills gaps where one response covers something others missed\n"
            "4. Maintains a coherent, well-structured narrative\n\n"
            f"Original query: {query}\n\n"
            f"Responses to synthesize:\n{proposal_text}"
        )
        messages = [
            {"role": "system", "content": "You are an expert synthesizer."},
            {"role": "user", "content": agg_prompt},
        ]
        return await self._call_llm(self.aggregator_model, messages, 0.3)

    async def run(self, query: str) -> dict[str, Any]:
        current_query = query
        for layer in range(self.num_layers):
            proposals = await asyncio.gather(
                *[self._get_proposal(p, current_query) for p in self.proposers]
            )
            if layer < self.num_layers - 1:
                # Intermediate layer: the aggregated output becomes
                # the input for the next layer of proposers
                current_query = await self._aggregate(query, proposals)
            else:
                final = await self._aggregate(query, proposals)
        return {
            "final_response": final,
            "num_proposals": len(proposals),
            "models_used": [p.model for p in proposals],
            "layers": self.num_layers,
        }
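Before wiring a real client into _call_llm, the fan-out-then-aggregate flow is easy to smoke-test with stub coroutines. This standalone sketch mirrors what _get_proposal and _aggregate do, with canned strings instead of API calls (every name here is illustrative):

```python
import asyncio

async def stub_propose(name: str, query: str) -> str:
    # Stand-in for one proposer's LLM call.
    await asyncio.sleep(0)
    return f"{name}: answer to {query!r}"

async def stub_aggregate(query: str, proposals: list[str]) -> str:
    # Stand-in for the aggregator's synthesis call.
    return f"synthesis of {len(proposals)} proposals for {query!r}"

async def main() -> str:
    names = ("analytical", "creative", "practical")
    # All proposers run concurrently, like the asyncio.gather call in run().
    proposals = await asyncio.gather(
        *[stub_propose(n, "What is MoA?") for n in names]
    )
    return await stub_aggregate("What is MoA?", list(proposals))

print(asyncio.run(main()))
# → synthesis of 3 proposals for 'What is MoA?'
```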
Multi-Layer MoA
The num_layers parameter enables stacking. In a 2-layer MoA, the aggregated output from layer 1 becomes the input for the proposers in layer 2, whose outputs are aggregated again. Each layer refines the response further, but returns diminish quickly: in practice, two to three layers capture most of the benefit.
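The layered data flow is easier to see with a toy trace, where each "model" simply wraps its input in its own name and the "aggregator" concatenates (purely illustrative, no LLMs involved):

```python
import asyncio

async def propose(model: str, text: str) -> str:
    # Toy "model": tags its input so the data flow shows up in the output.
    return f"{model}({text})"

async def aggregate(proposals: list[str]) -> str:
    # Toy "aggregator": concatenates instead of synthesizing.
    return "+".join(proposals)

async def run_layers(query: str, num_layers: int) -> str:
    current = query
    for _ in range(num_layers):
        proposals = await asyncio.gather(
            propose("m1", current), propose("m2", current)
        )
        current = await aggregate(list(proposals))
    return current

print(asyncio.run(run_layers("q", 2)))
# → m1(m1(q)+m2(q))+m2(m1(q)+m2(q))
```

The layer-1 synthesis appears verbatim inside each layer-2 proposal, which is exactly the refinement loop run() implements.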
Configuring Diverse Proposers
The power of MoA comes from diversity. If all proposers use the same model with the same temperature, you get redundant outputs. Configure proposers with different models, temperatures, and system prompts.
proposers = [
ProposerConfig(
name="analytical",
model="gpt-4o",
temperature=0.3,
system_prompt="You are a precise analytical thinker. Focus on accuracy and logical reasoning.",
),
ProposerConfig(
name="creative",
model="claude-sonnet-4-20250514",
temperature=0.8,
system_prompt="You are a creative problem solver. Consider unconventional angles.",
),
ProposerConfig(
name="practical",
model="gemini-1.5-pro",
temperature=0.5,
system_prompt="You are a pragmatic engineer. Focus on implementation details.",
),
]
moa = MixtureOfAgents(
proposers=proposers,
aggregator_model="gpt-4o",
num_layers=2,
)
Cost and Latency Management
MoA multiplies your LLM costs: each layer issues one call per proposer plus one aggregation call. Mitigate this with three strategies.
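The arithmetic, matching the loop in run() above, is simple enough to spell out:

```python
def moa_call_count(num_proposers: int, num_layers: int = 1) -> int:
    # Each layer makes one call per proposer plus one aggregation call.
    return num_layers * (num_proposers + 1)

print(moa_call_count(3))     # → 4 (single layer: 3 proposers + 1 aggregator)
print(moa_call_count(3, 2))  # → 8 (a second layer doubles the bill)
```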
Tiered proposers: Use cheaper models (GPT-4o-mini, Claude Haiku) as proposers and reserve the expensive model for aggregation only. The aggregator benefits from seeing diverse reasoning without each proposal needing top-tier quality.
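A tiered setup might look like the following. The model names are examples only, and the ProposerConfig dataclass is repeated so the snippet runs standalone:

```python
from dataclasses import dataclass

@dataclass
class ProposerConfig:
    name: str
    model: str
    temperature: float = 0.7
    system_prompt: str = "You are a helpful assistant."

# Cheap, fast models generate the proposals...
tiered_proposers = [
    ProposerConfig(name="fast-a", model="gpt-4o-mini", temperature=0.3),
    ProposerConfig(name="fast-b", model="claude-3-5-haiku-latest", temperature=0.8),
    ProposerConfig(name="fast-c", model="gemini-1.5-flash", temperature=0.5),
]
# ...and only the single aggregation call pays for a top-tier model.
aggregator_model = "gpt-4o"

print([p.model for p in tiered_proposers])
```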
Parallel execution: All proposals run concurrently with asyncio.gather, so latency equals the slowest proposer rather than the sum. The aggregation step adds one more round-trip.
Selective MoA: Use a router that invokes MoA only for complex queries. Simple factual questions can go directly to a single model. Score query complexity based on length, ambiguity, or domain, and only fan out to multiple proposers above a threshold.
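One rough sketch of such a router, using crude lexical heuristics. The keyword list, weights, and threshold are placeholders you would tune against real traffic (or replace with a small classifier model):

```python
COMPLEX_KEYWORDS = ("compare", "trade-off", "design", "why", "explain")

def complexity_score(query: str) -> float:
    # Crude proxies for complexity: length, multiple questions, analytic verbs.
    q = query.lower()
    score = 0.0
    if len(q.split()) > 30:
        score += 0.4
    if q.count("?") > 1:
        score += 0.3
    score += sum(0.3 for w in COMPLEX_KEYWORDS if w in q)
    return score

def should_use_moa(query: str, threshold: float = 0.5) -> bool:
    # Fan out to the full mixture only when the query looks hard enough.
    return complexity_score(query) >= threshold

print(should_use_moa("What year was Python released?"))                 # → False
print(should_use_moa("Compare the design trade-offs of SQL vs NoSQL"))  # → True
```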
FAQ
How many proposers should I use?
Three is the sweet spot for most applications. Two proposers often agree, giving the aggregator little to work with. Five or more adds cost without proportional quality gains unless the task is highly ambiguous. Start with three models from different providers to maximize diversity.
Does MoA work for code generation, or only for text?
MoA works excellently for code generation. Different models make different kinds of mistakes — one might miss an edge case, another might use a deprecated API. The aggregator can combine the correct logic from one proposal with the proper API usage from another. For code, add a "test the code" verification step after aggregation.
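A minimal version of that verification step, assuming you are comfortable executing the generated code on your machine (a subprocess limits blast radius from crashes and hangs but is not a security sandbox):

```python
import subprocess
import sys
import tempfile

def passes_smoke_test(code: str, test_snippet: str, timeout: float = 10.0) -> bool:
    # Write the aggregated code plus a caller-supplied assertion to a temp
    # file and run it in a fresh interpreter; exit code 0 means it passed.
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code + "\n" + test_snippet + "\n")
        path = f.name
    try:
        proc = subprocess.run(
            [sys.executable, path], capture_output=True, timeout=timeout
        )
        return proc.returncode == 0
    except subprocess.TimeoutExpired:
        return False

generated = "def add(a, b):\n    return a + b"
print(passes_smoke_test(generated, "assert add(2, 3) == 5"))  # → True
print(passes_smoke_test(generated, "assert add(2, 3) == 6"))  # → False
```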
Can I use MoA with open-source models to avoid API costs entirely?
Absolutely. Run three different open-source models (Llama, Mistral, Qwen) locally and use the strongest as the aggregator. This is one of MoA's most compelling use cases — three medium-quality open-source models combined often outperform a single large proprietary model, at zero API cost.
CallSphere Team
Expert insights on AI voice agents and customer communication automation.