
Mixture of Experts (MoE) Models: How Modern LLMs Scale Efficiently

A technical deep-dive into Mixture of Experts architecture, explaining how MoE models like Mixtral, DeepSeek, and Grok achieve massive parameter counts with efficient inference. Covers routing mechanisms, training strategies, and practical implications for AI engineers.

The Scaling Problem MoE Solves

Dense transformer models have a fundamental scaling limitation: every token processed passes through every parameter. A 70B parameter model uses all 70 billion parameters for every single token, regardless of whether the input is simple arithmetic or complex legal reasoning. This means compute cost scales linearly with model size.

Mixture of Experts (MoE) breaks this constraint. An MoE model can have 400B total parameters but only activate 50B for any given token. The result: the knowledge capacity of a massive model with the inference cost of a much smaller one.
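As a quick back-of-envelope check, using the common rule of thumb of roughly 2 FLOPs per active parameter per token (the 400B/50B split is the hypothetical example above):

```python
# Rule-of-thumb per-token compute: ~2 FLOPs per *active* parameter
def flops_per_token(active_params: float) -> float:
    return 2 * active_params

dense = flops_per_token(400e9)  # dense model: all 400B parameters active
moe = flops_per_token(50e9)     # MoE: only 50B of 400B active
print(f"{dense / moe:.0f}x less compute per token")  # -> 8x less compute per token
```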

How MoE Architecture Works

The Standard Transformer Block

In a standard (dense) transformer, each layer contains:

  1. Self-attention mechanism
  2. Feed-forward network (FFN) -- two linear layers with an activation function

The FFN is where most parameters live and most computation happens. MoE replaces the single FFN with multiple "expert" FFNs and a router that decides which experts to use.
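A quick sizing sketch shows why (assuming the classic 4h FFN width and plain multi-head attention; real models vary):

```python
# For hidden size h, a two-layer FFN of width 4h holds 8h^2 weights,
# versus ~4h^2 for the attention projections (Wq, Wk, Wv, Wo)
h = 4096
attn_params = 4 * h * h
ffn_params = 2 * h * (4 * h)
print(ffn_params / attn_params)  # -> 2.0
```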

The MoE Layer

Input Token
    |
    v
[Self-Attention] -- same as dense transformer
    |
    v
[Router Network] -- small neural network
    |
    +---> Expert 1 (FFN)  [score: 0.45] ✓ Selected
    +---> Expert 2 (FFN)  [score: 0.38] ✓ Selected
    +---> Expert 3 (FFN)  [score: 0.09]
    +---> Expert 4 (FFN)  [score: 0.05]
    +---> Expert 5 (FFN)  [score: 0.02]
    +---> Expert 6 (FFN)  [score: 0.01]
    +---> Expert 7 (FFN)  [score: 0.00]
    +---> Expert 8 (FFN)  [score: 0.00]
    |
    v
[Weighted Sum of Selected Expert Outputs]
    |
    v
Output

The Router (Gating Network)

The router is a small linear layer that scores every expert from the token's hidden state; the top-k scores are then normalized with a softmax to produce the mixing weights:

import torch
import torch.nn as nn
import torch.nn.functional as F

class FFNExpert(nn.Module):
    """A standard two-layer feed-forward network used as one expert."""
    def __init__(self, hidden_dim: int, ffn_dim: int):
        super().__init__()
        self.up = nn.Linear(hidden_dim, ffn_dim)
        self.down = nn.Linear(ffn_dim, hidden_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(F.gelu(self.up(x)))

class TopKRouter(nn.Module):
    def __init__(self, hidden_dim: int, num_experts: int, top_k: int = 2):
        super().__init__()
        self.gate = nn.Linear(hidden_dim, num_experts, bias=False)
        self.top_k = top_k
        self.num_experts = num_experts

    def forward(self, x: torch.Tensor):
        # x shape: (batch_size, seq_len, hidden_dim)
        logits = self.gate(x)  # (batch_size, seq_len, num_experts)

        # Select top-k experts and normalize their scores into weights
        top_k_logits, top_k_indices = torch.topk(logits, self.top_k, dim=-1)
        top_k_weights = F.softmax(top_k_logits, dim=-1)

        return top_k_weights, top_k_indices

class MoELayer(nn.Module):
    def __init__(self, hidden_dim: int, ffn_dim: int, num_experts: int, top_k: int = 2):
        super().__init__()
        self.router = TopKRouter(hidden_dim, num_experts, top_k)
        self.experts = nn.ModuleList([
            FFNExpert(hidden_dim, ffn_dim) for _ in range(num_experts)
        ])

    def forward(self, x: torch.Tensor):
        weights, indices = self.router(x)
        # weights: (batch, seq, top_k), indices: (batch, seq, top_k)

        output = torch.zeros_like(x)
        for k in range(self.router.top_k):
            expert_idx = indices[:, :, k]  # Which expert for each token
            expert_weight = weights[:, :, k].unsqueeze(-1)

            for i, expert in enumerate(self.experts):
                mask = (expert_idx == i)
                if mask.any():
                    expert_input = x[mask]
                    expert_output = expert(expert_input)
                    output[mask] += expert_weight[mask] * expert_output

        return output

Key MoE Models in 2026

Mixtral 8x7B and 8x22B (Mistral AI)

The model that popularized MoE for open-source LLMs. Mixtral 8x7B has 46.7B total parameters but only activates 12.9B per token (2 of 8 experts).

Model           Total Params   Active Params   Experts   Top-K
Mixtral 8x7B    46.7B          12.9B           8         2
Mixtral 8x22B   141B           39B             8         2
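Those totals can be sanity-checked from Mixtral's published dimensions (hidden size 4096, FFN width 14336, 32 layers, grouped-query attention with 8 KV heads). The breakdown below is a back-of-envelope sketch, not an official accounting:

```python
# Back-of-envelope check of Mixtral 8x7B's parameter counts
h, ffn, layers, n_exp, top_k = 4096, 14336, 32, 8, 2
kv_dim = 1024  # grouped-query attention: 8 KV heads of head dim 128

expert = 3 * h * ffn                          # SwiGLU FFN: gate/up/down
attn = layers * (2 * h * h + 2 * h * kv_dim)  # Wq, Wo + Wk, Wv per layer
embed = 2 * 32_000 * h                        # token embeddings + LM head

total = layers * n_exp * expert + attn + embed
active = layers * top_k * expert + attn + embed
print(f"total ~{total / 1e9:.1f}B, active ~{active / 1e9:.1f}B")
# -> total ~46.7B, active ~12.9B
```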

DeepSeek-V3 (DeepSeek AI)

DeepSeek-V3 uses a more granular MoE with 256 fine-grained experts and an auxiliary-loss-free load balancing strategy:

Model         Total Params   Active Params   Experts          Top-K
DeepSeek-V3   671B           37B             256 + 1 shared   8

Grok-2 (xAI)

Grok-2 uses MoE architecture, though xAI has not published full architectural details. Based on inference behavior, it is estimated to use 8-16 experts with top-2 routing.

The Load Balancing Problem

A naive router tends to collapse: it learns to send most tokens to a small number of experts while the rest go unused. This "expert collapse" wastes parameters and reduces model quality.

Auxiliary Loss for Load Balancing

The standard solution adds a load-balancing loss term during training:

def load_balancing_loss(router_logits: torch.Tensor, num_experts: int) -> torch.Tensor:
    """
    Switch-Transformer-style auxiliary loss that encourages equal
    utilization of all experts. Minimized (value 1.0) when routing
    is perfectly uniform.
    router_logits: (batch_size * seq_len, num_experts)
    """
    routing_probs = F.softmax(router_logits, dim=-1)

    # P_i: mean routing probability mass given to each expert
    prob_per_expert = routing_probs.mean(dim=0)  # (num_experts,)

    # f_i: fraction of tokens whose top-1 choice is each expert
    top1 = routing_probs.argmax(dim=-1)
    tokens_per_expert = F.one_hot(top1, num_experts).float().mean(dim=0)

    # num_experts * sum(f_i * P_i) >= 1, with equality at perfect balance
    return num_experts * torch.sum(tokens_per_expert * prob_per_expert)

DeepSeek's Auxiliary-Loss-Free Approach

DeepSeek-V3 introduced a bias term in the router that is adjusted dynamically during training to maintain balance, avoiding the quality degradation that auxiliary losses can cause:

class DeepSeekRouter(nn.Module):
    def __init__(self, hidden_dim, num_experts, top_k=8):
        super().__init__()
        self.gate = nn.Linear(hidden_dim, num_experts, bias=False)
        # Bias for load balancing; updated by a separate rule during
        # training rather than by gradient descent, so it is a buffer
        self.register_buffer("expert_bias", torch.zeros(num_experts))
        self.top_k = top_k

    def forward(self, x):
        logits = self.gate(x)
        # The bias influences which experts get selected...
        _, top_k_indices = torch.topk(logits + self.expert_bias, self.top_k, dim=-1)
        # ...but the mixing weights come from the unbiased logits
        weights = F.softmax(logits.gather(-1, top_k_indices), dim=-1)
        return weights, top_k_indices
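The bias itself is adjusted by a simple rule between gradient steps. The exact form below is an illustrative assumption (nudge underloaded experts up and overloaded experts down by a fixed step), not DeepSeek's published procedure:

```python
import torch

def update_expert_bias(expert_bias: torch.Tensor,
                       expert_counts: torch.Tensor,
                       step_size: float = 0.001) -> torch.Tensor:
    """Nudge the router bias toward balanced routing (assumed update form)."""
    avg = expert_counts.float().mean()
    error = avg - expert_counts.float()  # positive => expert is underloaded
    return expert_bias + step_size * torch.sign(error)

counts = torch.tensor([90, 10, 50, 50])  # tokens routed to each expert this batch
bias = update_expert_bias(torch.zeros(4), counts)
# Overloaded expert 0 gets a lower bias; underloaded expert 1 a higher one
```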

Inference Efficiency

Memory Bandwidth is the Bottleneck

For MoE inference, the key performance factor is not computation but memory bandwidth. All expert weights must be stored in memory (or on disk), but only active experts need to be loaded for each token.

Dense 70B model (FP16 weights):
  - Parameters loaded per token: 70B * 2 bytes = 140 GB
  - All parameters always active

MoE 8x7B (Mixtral, FP16 weights):
  - Total parameters: 46.7B * 2 bytes = 93 GB (stored)
  - Parameters loaded per token: 12.9B * 2 bytes = 26 GB (active)
  - ~3.6x less memory bandwidth per token than a dense model of the same total size
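Plugging in an assumed HBM bandwidth of ~3.35 TB/s (roughly H100-class) gives a rough batch-1 decode ceiling. Real decoding also streams the KV cache, so treat these as upper bounds:

```python
# Rough batch-1 decode ceiling from weight-streaming bandwidth alone
bandwidth_gb_s = 3350                   # assumed HBM bandwidth, GB/s
dense_tokens_s = bandwidth_gb_s / 140   # dense 70B at FP16
moe_tokens_s = bandwidth_gb_s / 26      # Mixtral's active set at FP16
print(f"dense ~{dense_tokens_s:.0f} tok/s, MoE ~{moe_tokens_s:.0f} tok/s")
```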

Expert Offloading

For running large MoE models on consumer hardware, expert offloading keeps inactive experts on disk or CPU RAM and loads them on demand:

from collections import OrderedDict

class OffloadedMoELayer:
    def __init__(self, experts, device="cuda", max_gpu_experts=2):
        self.device = device
        self.max_gpu_experts = max_gpu_experts
        # Keep all experts on CPU; move them to GPU only when routed to
        self.cpu_experts = [e.cpu() for e in experts]
        # LRU cache of experts currently resident on the GPU
        self.gpu_cache = OrderedDict()

    def forward(self, x, expert_indices):
        unique_experts = expert_indices.unique().tolist()

        # Load needed experts to GPU, refreshing their LRU position
        for idx in unique_experts:
            if idx in self.gpu_cache:
                self.gpu_cache.move_to_end(idx)
            else:
                self.gpu_cache[idx] = self.cpu_experts[idx].to(self.device)

        # Run the usual weighted-sum MoE computation with GPU experts
        # (_compute is elided here; it mirrors MoELayer.forward above)
        output = self._compute(x, expert_indices)

        # Evict least recently used experts if over the cache budget
        while len(self.gpu_cache) > self.max_gpu_experts:
            self.gpu_cache.popitem(last=False)

        return output

Practical Implications for AI Engineers

1. Cost Efficiency

MoE models offer better quality-per-dollar for API consumers because providers can serve more concurrent requests with the same GPU fleet. A 400B MoE model that activates 50B parameters per token can serve 8x more concurrent requests than a dense 400B model on the same hardware.

2. Latency Characteristics

MoE models have similar latency to dense models of the same active parameter count. Mixtral 8x7B (12.9B active) has latency comparable to a 13B dense model, not a 47B model.

3. Specialization Emergence

Research shows that MoE experts naturally specialize during training. In Mixtral, different experts handle different types of content: some specialize in code, others in formal writing, others in multilingual content. This specialization happens without explicit guidance.

4. Fine-Tuning Considerations

Fine-tuning MoE models is more complex than dense models:

  • Full fine-tuning: Expensive, requires updating all experts
  • LoRA on all experts: Applies adapter to every expert FFN
  • LoRA on router + selected experts: Most efficient, fine-tune only the experts most relevant to your domain
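All three LoRA options boil down to wrapping some subset of linear layers with low-rank adapters. A minimal hand-rolled sketch (the `LoRALinear` name and hyperparameters are illustrative, not a standard API):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a frozen linear layer with a trainable low-rank update."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # lora_b starts at zero, so the wrapped layer initially
        # computes exactly the same output as the frozen base
        return self.base(x) + self.scale * (x @ self.lora_a.T @ self.lora_b.T)
```

For the "router + selected experts" strategy, you would wrap only the router's gate and the up/down projections of the experts relevant to your domain, leaving everything else frozen.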

Key Takeaways

MoE represents the current best approach for scaling LLM capability while controlling inference costs. The architecture allows models to store far more knowledge than they compute over for any single token, giving them the capacity of a very large model with the speed of a much smaller one. For AI engineers, the practical implication is that MoE models offer the best quality-per-dollar ratio, and understanding their architecture helps in making informed decisions about model selection, fine-tuning strategy, and deployment planning.
