
Mixture of Experts (MoE) Models: How Modern LLMs Scale Efficiently

A technical deep-dive into Mixture of Experts architecture, explaining how MoE models like Mixtral, DeepSeek, and Grok achieve massive parameter counts with efficient inference. Covers routing mechanisms, training strategies, and practical implications for AI engineers.

The Scaling Problem MoE Solves

Dense transformer models have a fundamental scaling limitation: every token processed passes through every parameter. A 70B parameter model uses all 70 billion parameters for every single token, regardless of whether the input is simple arithmetic or complex legal reasoning. This means compute cost scales linearly with model size.

Mixture of Experts (MoE) breaks this constraint. An MoE model can have 400B total parameters but only activate 50B for any given token. The result: the knowledge capacity of a massive model with the inference cost of a much smaller one.
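As a quick back-of-envelope check, using the common rule of thumb of roughly 2 FLOPs per active parameter per token (the 400B/50B split is the hypothetical example above):

```python
# Rule-of-thumb per-token compute: ~2 FLOPs per *active* parameter
def flops_per_token(active_params: float) -> float:
    return 2 * active_params

dense = flops_per_token(400e9)  # dense model: all 400B parameters active
moe = flops_per_token(50e9)     # MoE: only 50B of 400B active
print(f"{dense / moe:.0f}x less compute per token")  # -> 8x less compute per token
```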

How MoE Architecture Works

The Standard Transformer Block

In a standard (dense) transformer, each layer contains:

  1. Self-attention mechanism
  2. Feed-forward network (FFN) -- two linear layers with an activation function

The FFN is where most parameters live and most computation happens. MoE replaces the single FFN with multiple "expert" FFNs and a router that decides which experts to use.
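A quick sizing sketch shows why (assuming the classic 4h FFN width and plain multi-head attention; real models vary):

```python
# For hidden size h, a two-layer FFN of width 4h holds 8h^2 weights,
# versus ~4h^2 for the attention projections (Wq, Wk, Wv, Wo)
h = 4096
attn_params = 4 * h * h
ffn_params = 2 * h * (4 * h)
print(ffn_params / attn_params)  # -> 2.0
```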

The MoE Layer

Input Token
    |
    v
[Self-Attention] -- same as dense transformer
    |
    v
[Router Network] -- small neural network
    |
    +---> Expert 1 (FFN)  [score: 0.45] ✓ Selected
    +---> Expert 2 (FFN)  [score: 0.38] ✓ Selected
    +---> Expert 3 (FFN)  [score: 0.09]
    +---> Expert 4 (FFN)  [score: 0.05]
    +---> Expert 5 (FFN)  [score: 0.02]
    +---> Expert 6 (FFN)  [score: 0.01]
    +---> Expert 7 (FFN)  [score: 0.00]
    +---> Expert 8 (FFN)  [score: 0.00]
    |
    v
[Weighted Sum of Selected Expert Outputs]
    |
    v
Output

The Router (Gating Network)

The router is a small linear layer that scores every expert from the token's hidden state; the top-k scores are then normalized with a softmax to produce the mixing weights:

import torch
import torch.nn as nn
import torch.nn.functional as F

class FFNExpert(nn.Module):
    """A standard two-layer feed-forward network used as one expert."""
    def __init__(self, hidden_dim: int, ffn_dim: int):
        super().__init__()
        self.up = nn.Linear(hidden_dim, ffn_dim)
        self.down = nn.Linear(ffn_dim, hidden_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(F.gelu(self.up(x)))

class TopKRouter(nn.Module):
    def __init__(self, hidden_dim: int, num_experts: int, top_k: int = 2):
        super().__init__()
        self.gate = nn.Linear(hidden_dim, num_experts, bias=False)
        self.top_k = top_k
        self.num_experts = num_experts

    def forward(self, x: torch.Tensor):
        # x shape: (batch_size, seq_len, hidden_dim)
        logits = self.gate(x)  # (batch_size, seq_len, num_experts)

        # Select top-k experts and normalize their scores into weights
        top_k_logits, top_k_indices = torch.topk(logits, self.top_k, dim=-1)
        top_k_weights = F.softmax(top_k_logits, dim=-1)

        return top_k_weights, top_k_indices

class MoELayer(nn.Module):
    def __init__(self, hidden_dim: int, ffn_dim: int, num_experts: int, top_k: int = 2):
        super().__init__()
        self.router = TopKRouter(hidden_dim, num_experts, top_k)
        self.experts = nn.ModuleList([
            FFNExpert(hidden_dim, ffn_dim) for _ in range(num_experts)
        ])

    def forward(self, x: torch.Tensor):
        weights, indices = self.router(x)
        # weights: (batch, seq, top_k), indices: (batch, seq, top_k)

        output = torch.zeros_like(x)
        for k in range(self.router.top_k):
            expert_idx = indices[:, :, k]  # Which expert for each token
            expert_weight = weights[:, :, k].unsqueeze(-1)

            for i, expert in enumerate(self.experts):
                mask = (expert_idx == i)
                if mask.any():
                    expert_input = x[mask]
                    expert_output = expert(expert_input)
                    output[mask] += expert_weight[mask] * expert_output

        return output

Key MoE Models in 2026

Mixtral 8x7B and 8x22B (Mistral AI)

The model that popularized MoE for open-source LLMs. Mixtral 8x7B has 46.7B total parameters but only activates 12.9B per token (2 of 8 experts).

Model           Total Params   Active Params   Experts   Top-K
Mixtral 8x7B    46.7B          12.9B           8         2
Mixtral 8x22B   141B           39B             8         2
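Those totals can be sanity-checked from Mixtral's published dimensions (hidden size 4096, FFN width 14336, 32 layers, grouped-query attention with 8 KV heads). The breakdown below is a back-of-envelope sketch, not an official accounting:

```python
# Back-of-envelope check of Mixtral 8x7B's parameter counts
h, ffn, layers, n_exp, top_k = 4096, 14336, 32, 8, 2
kv_dim = 1024  # grouped-query attention: 8 KV heads of head dim 128

expert = 3 * h * ffn                          # SwiGLU FFN: gate/up/down
attn = layers * (2 * h * h + 2 * h * kv_dim)  # Wq, Wo + Wk, Wv per layer
embed = 2 * 32_000 * h                        # token embeddings + LM head

total = layers * n_exp * expert + attn + embed
active = layers * top_k * expert + attn + embed
print(f"total ~{total / 1e9:.1f}B, active ~{active / 1e9:.1f}B")
# -> total ~46.7B, active ~12.9B
```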

DeepSeek-V3 (DeepSeek AI)

DeepSeek-V3 uses a more granular MoE with 256 fine-grained experts and an auxiliary-loss-free load balancing strategy:

Model         Total Params   Active Params   Experts          Top-K
DeepSeek-V3   671B           37B             256 + 1 shared   8

Grok-2 (xAI)

Grok-2 uses MoE architecture, though xAI has not published full architectural details. Based on inference behavior, it is estimated to use 8-16 experts with top-2 routing.

The Load Balancing Problem

A naive router tends to collapse: it learns to send most tokens to a small number of experts while the rest go unused. This "expert collapse" wastes parameters and reduces model quality.

Auxiliary Loss for Load Balancing

The standard solution adds a load-balancing loss term during training:

def load_balancing_loss(router_logits: torch.Tensor, num_experts: int) -> torch.Tensor:
    """
    Switch-Transformer-style auxiliary loss that encourages equal
    utilization of all experts. Minimized (value 1.0) when routing
    is perfectly uniform.
    router_logits: (batch_size * seq_len, num_experts)
    """
    routing_probs = F.softmax(router_logits, dim=-1)

    # P_i: mean routing probability mass given to each expert
    prob_per_expert = routing_probs.mean(dim=0)  # (num_experts,)

    # f_i: fraction of tokens whose top-1 choice is each expert
    top1 = routing_probs.argmax(dim=-1)
    tokens_per_expert = F.one_hot(top1, num_experts).float().mean(dim=0)

    # num_experts * sum(f_i * P_i) >= 1, with equality at perfect balance
    return num_experts * torch.sum(tokens_per_expert * prob_per_expert)

DeepSeek's Auxiliary-Loss-Free Approach

DeepSeek-V3 introduced a bias term in the router that is adjusted dynamically during training to maintain balance, avoiding the quality degradation that auxiliary losses can cause:

class DeepSeekRouter(nn.Module):
    def __init__(self, hidden_dim, num_experts, top_k=8):
        super().__init__()
        self.gate = nn.Linear(hidden_dim, num_experts, bias=False)
        # Bias for load balancing; updated by a separate rule during
        # training rather than by gradient descent, so it is a buffer
        self.register_buffer("expert_bias", torch.zeros(num_experts))
        self.top_k = top_k

    def forward(self, x):
        logits = self.gate(x)
        # The bias influences which experts get selected...
        _, top_k_indices = torch.topk(logits + self.expert_bias, self.top_k, dim=-1)
        # ...but the mixing weights come from the unbiased logits
        weights = F.softmax(logits.gather(-1, top_k_indices), dim=-1)
        return weights, top_k_indices
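The bias itself is adjusted by a simple rule between gradient steps. The exact form below is an illustrative assumption (nudge underloaded experts up and overloaded experts down by a fixed step), not DeepSeek's published procedure:

```python
import torch

def update_expert_bias(expert_bias: torch.Tensor,
                       expert_counts: torch.Tensor,
                       step_size: float = 0.001) -> torch.Tensor:
    """Nudge the router bias toward balanced routing (assumed update form)."""
    avg = expert_counts.float().mean()
    error = avg - expert_counts.float()  # positive => expert is underloaded
    return expert_bias + step_size * torch.sign(error)

counts = torch.tensor([90, 10, 50, 50])  # tokens routed to each expert this batch
bias = update_expert_bias(torch.zeros(4), counts)
# Overloaded expert 0 gets a lower bias; underloaded expert 1 a higher one
```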

Inference Efficiency

Memory Bandwidth is the Bottleneck

For MoE inference, the key performance factor is not computation but memory bandwidth. All expert weights must be stored in memory (or on disk), but only active experts need to be loaded for each token.

Dense 70B model (FP16 weights):
  - Parameters loaded per token: 70B * 2 bytes = 140 GB
  - All parameters always active

MoE 8x7B (Mixtral, FP16 weights):
  - Total parameters: 46.7B * 2 bytes = 93 GB (stored)
  - Parameters loaded per token: 12.9B * 2 bytes = 26 GB (active)
  - ~3.6x less memory bandwidth per token than a dense model of the same total size
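Plugging in an assumed HBM bandwidth of ~3.35 TB/s (roughly H100-class) gives a rough batch-1 decode ceiling. Real decoding also streams the KV cache, so treat these as upper bounds:

```python
# Rough batch-1 decode ceiling from weight-streaming bandwidth alone
bandwidth_gb_s = 3350                   # assumed HBM bandwidth, GB/s
dense_tokens_s = bandwidth_gb_s / 140   # dense 70B at FP16
moe_tokens_s = bandwidth_gb_s / 26      # Mixtral's active set at FP16
print(f"dense ~{dense_tokens_s:.0f} tok/s, MoE ~{moe_tokens_s:.0f} tok/s")
```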

Expert Offloading

For running large MoE models on consumer hardware, expert offloading keeps inactive experts on disk or CPU RAM and loads them on demand:

from collections import OrderedDict

class OffloadedMoELayer:
    def __init__(self, experts, device="cuda", max_gpu_experts=2):
        self.device = device
        self.max_gpu_experts = max_gpu_experts
        # Keep all experts on CPU; move them to GPU only when routed to
        self.cpu_experts = [e.cpu() for e in experts]
        # LRU cache of experts currently resident on the GPU
        self.gpu_cache = OrderedDict()

    def forward(self, x, expert_indices):
        unique_experts = expert_indices.unique().tolist()

        # Load needed experts to GPU, refreshing their LRU position
        for idx in unique_experts:
            if idx in self.gpu_cache:
                self.gpu_cache.move_to_end(idx)
            else:
                self.gpu_cache[idx] = self.cpu_experts[idx].to(self.device)

        # Run the usual weighted-sum MoE computation with GPU experts
        # (_compute is elided here; it mirrors MoELayer.forward above)
        output = self._compute(x, expert_indices)

        # Evict least recently used experts if over the cache budget
        while len(self.gpu_cache) > self.max_gpu_experts:
            self.gpu_cache.popitem(last=False)

        return output

Practical Implications for AI Engineers

1. Cost Efficiency

MoE models offer better quality-per-dollar for API consumers because providers can serve more concurrent requests with the same GPU fleet. A 400B MoE model that activates 50B parameters per token can serve 8x more concurrent requests than a dense 400B model on the same hardware.

2. Latency Characteristics

MoE models have similar latency to dense models of the same active parameter count. Mixtral 8x7B (12.9B active) has latency comparable to a 13B dense model, not a 47B model.

3. Specialization Emergence

Research shows that MoE experts naturally specialize during training. In Mixtral, different experts handle different types of content: some specialize in code, others in formal writing, others in multilingual content. This specialization happens without explicit guidance.

4. Fine-Tuning Considerations

Fine-tuning MoE models is more complex than dense models:

  • Full fine-tuning: Expensive, requires updating all experts
  • LoRA on all experts: Applies adapter to every expert FFN
  • LoRA on router + selected experts: Most efficient, fine-tune only the experts most relevant to your domain
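All three LoRA options boil down to wrapping some subset of linear layers with low-rank adapters. A minimal hand-rolled sketch (the `LoRALinear` name and hyperparameters are illustrative, not a standard API):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a frozen linear layer with a trainable low-rank update."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # lora_b starts at zero, so the wrapped layer initially
        # computes exactly the same output as the frozen base
        return self.base(x) + self.scale * (x @ self.lora_a.T @ self.lora_b.T)
```

For the "router + selected experts" strategy, you would wrap only the router's gate and the up/down projections of the experts relevant to your domain, leaving everything else frozen.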

Key Takeaways

MoE represents the current best approach for scaling LLM capability while controlling inference costs. The architecture allows models to store far more knowledge than they compute over for any single token, giving them the capacity of a very large model with the speed of a much smaller one. For AI engineers, the practical implication is that MoE models offer the best quality-per-dollar ratio, and understanding their architecture helps in making informed decisions about model selection, fine-tuning strategy, and deployment planning.
