Beyond Transformers: Mamba, RWKV, and State-Space Models Challenging the Dominant Architecture
Technical comparison of emerging transformer alternatives including Mamba's selective state spaces, RWKV's linear attention, and hybrid architectures that combine the best of both worlds.
The Transformer Bottleneck
Transformers have dominated language modeling since 2017, but their quadratic attention mechanism creates a fundamental scaling problem. Processing a sequence of length N requires O(N^2) computation and memory for the self-attention step. This means doubling the context length quadruples the cost. At 128K+ token context windows, this cost becomes prohibitive for many applications.
Several alternative architectures are emerging that achieve linear or near-linear scaling with sequence length while approaching transformer-quality performance.
Mamba and Selective State Spaces
Mamba, introduced by Albert Gu and Tri Dao in December 2023, is the most prominent transformer alternative. It builds on Structured State Space Models (S4) with a critical innovation: selective state spaces that allow the model to dynamically filter information based on input.
How Mamba Works
Traditional state space models process sequences through a fixed linear recurrence:
h_t = A * h_{t-1} + B * x_t (state update)
y_t = C * h_t (output)
where A, B, and C are fixed matrices, identical for every token. Mamba makes B, C, and the discretization step size functions of the input, allowing the model to selectively retain or forget information depending on the current token.
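The recurrence above can be sketched in a few lines of NumPy. This is a toy illustration, not Mamba's actual implementation: it uses a scalar input, random placeholder weights, and a simple Euler discretization instead of the zero-order hold used in the paper. The point is the structure: B, C, and dt are recomputed from each input.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4                                  # toy hidden-state size

# Learned parameters (random placeholders for illustration):
A = -0.5 * np.eye(d)                   # fixed state matrix (kept stable)
w_B = rng.normal(size=d)               # projects input -> B
w_C = rng.normal(size=d)               # projects input -> C
w_dt = rng.normal()                    # projects input -> step size

def selective_step(h, x):
    """One selective-SSM recurrence step for a scalar input x.
    Unlike a classic SSM, B, C, and the step dt all depend on x."""
    dt = np.log1p(np.exp(w_dt * x))    # softplus keeps dt > 0
    B = w_B * x                        # input-dependent B
    C = w_C * x                        # input-dependent C
    h = h + dt * (A @ h + B * x)       # Euler step of h' = A h + B x
    y = C @ h                          # output projection
    return h, y

h = np.zeros(d)
for x in [1.0, -0.3, 0.7]:             # a tiny input sequence
    h, y = selective_step(h, x)
```

Because dt shrinks toward zero for some inputs and grows for others, the model can effectively skip (forget) or integrate (remember) individual tokens, which fixed-matrix SSMs cannot do.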
Performance Characteristics
- Linear time complexity: O(N) instead of O(N^2), enabling efficient processing of very long sequences
- No KV cache: Mamba uses a fixed-size state instead of a growing KV cache, making inference memory constant regardless of sequence length
- Hardware-efficient: The selective scan operation is implemented as a custom CUDA kernel that achieves high GPU utilization
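The "no KV cache" point is easy to make concrete with back-of-envelope arithmetic. The model shapes below (layer count, head count, state size) are illustrative 7B-class values in fp16, not the dimensions of any specific released model:

```python
def kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=8, head_dim=128, bytes_per=2):
    """Transformer KV cache: grows linearly with sequence length.
    Factor of 2 covers keys and values."""
    return 2 * seq_len * n_layers * n_kv_heads * head_dim * bytes_per

def ssm_state_bytes(n_layers=32, d_state=16, d_inner=4096, bytes_per=2):
    """Mamba-style fixed recurrent state: independent of sequence length."""
    return n_layers * d_state * d_inner * bytes_per

for n in (4_096, 131_072):
    print(f"{n:>7} tokens: KV cache {kv_cache_bytes(n) / 2**30:.2f} GiB "
          f"vs SSM state {ssm_state_bytes() / 2**20:.1f} MiB")
```

At 128K tokens the hypothetical KV cache reaches tens of gigabytes, while the SSM state stays at a few megabytes regardless of context length. That gap is what makes constant-memory inference attractive for long contexts.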
Mamba-2 and Improvements
Mamba-2, released in mid-2024, reformulated the selective state space as a form of structured matrix computation, connecting it theoretically to attention. This enabled:
- 2-8x faster training than the original Mamba
- Better parallelization across GPUs during training
- Clearer theoretical understanding of what the model learns
RWKV: Linear Attention for Language
RWKV (pronounced "RwaKuv") combines the parallelizable training of transformers with the efficient inference of RNNs. It replaces softmax attention, whose all-pairs token comparisons are responsible for transformers' quadratic cost, with a linear recurrence over decayed key-value summaries.
Architecture
RWKV uses two key mechanisms:
- Time mixing: A linear interpolation between the current input and previous states, weighted by learned decay factors
- Channel mixing: A feed-forward layer similar to transformers but applied with recurrent state
During training, RWKV processes all tokens in parallel (like a transformer). During inference, it operates as an RNN, processing one token at a time with constant memory and compute.
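The parallel/recurrent equivalence can be demonstrated with a stripped-down version of RWKV's decayed weighted average (omitting the current-token bonus term and the numerical-stability tricks of the real WKV kernel; decay values are toy placeholders). Both forms produce identical outputs:

```python
import numpy as np

rng = np.random.default_rng(1)
T, d = 6, 4
k = rng.normal(size=(T, d))            # keys
v = rng.normal(size=(T, d))            # values
w = 0.9 * np.ones(d)                   # per-channel decay in (0, 1)

# Parallel form (training): each position is a decay-weighted
# average over the whole past, computable for all t at once.
out_parallel = np.zeros((T, d))
for t in range(T):
    decay = w[None, :] ** (t - np.arange(t + 1))[:, None]   # decay^(t-i)
    num = (decay * np.exp(k[: t + 1]) * v[: t + 1]).sum(axis=0)
    den = (decay * np.exp(k[: t + 1])).sum(axis=0)
    out_parallel[t] = num / den

# Recurrent form (inference): a constant-size running
# numerator/denominator, updated one token at a time.
num = np.zeros(d)
den = np.zeros(d)
out_recurrent = np.zeros((T, d))
for t in range(T):
    num = w * num + np.exp(k[t]) * v[t]
    den = w * den + np.exp(k[t])
    out_recurrent[t] = num / den
```

The recurrent form carries only `num` and `den` between steps, which is why RWKV inference needs constant memory no matter how long the sequence grows.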
RWKV v5/v6 (Eagle and Finch)
The recent RWKV generations, introduced together in the "Eagle and Finch" paper, move toward data-dependent recurrence similar in spirit to Mamba's selective mechanism:
- Eagle (RWKV-5): replaces scalar decay with matrix-valued recurrent states, improving expressivity
- Finch (RWKV-6): adds data-dependent decay and token shift, making the recurrence input-dependent
- Models available up to 14B parameters with competitive performance against similarly sized transformers
Hybrid Architectures
The most practical approach emerging in 2025-2026 is hybrid architectures that combine transformer attention layers with linear-complexity layers.
Jamba (AI21)
Jamba interleaves Mamba layers with transformer attention layers and adds mixture-of-experts (MoE) for parameter efficiency. The result:
- 256K token context window with manageable memory
- Attention layers handle tasks requiring precise token-level recall
- Mamba layers handle long-range dependencies efficiently
- MoE keeps active parameter count reasonable
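The interleaving itself is just a layer schedule. The sketch below is a hypothetical plan in the spirit of Jamba, with one attention layer per block of eight and MoE MLPs on alternating layers; the actual ratios and expert counts in Jamba may differ:

```python
def hybrid_layer_plan(n_layers=32, attn_every=8, moe_every=2):
    """Build a (mixer, mlp) schedule: mostly Mamba layers with periodic
    attention, and MoE feed-forward blocks on alternating layers.
    Parameters are illustrative, not Jamba's published configuration."""
    plan = []
    for i in range(n_layers):
        mixer = "attention" if i % attn_every == attn_every - 1 else "mamba"
        mlp = "moe" if i % moe_every == 1 else "dense"
        plan.append((mixer, mlp))
    return plan

plan = hybrid_layer_plan()
print(sum(1 for mixer, _ in plan if mixer == "attention"), "attention layers")
```

With 4 attention layers out of 32, the quadratic cost applies to only an eighth of the network, while the occasional attention layers preserve exact-recall ability.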
NVIDIA's Hybrid Approach
NVIDIA has explored architectures that use Mamba for the majority of layers with strategically placed attention layers for tasks requiring exact retrieval (like copying specific strings from the context). This gives near-linear scaling for most of the model while preserving the capabilities that pure state-space models struggle with.
Where Non-Transformer Models Struggle
Despite their efficiency advantages, transformer alternatives have consistent weaknesses:
- In-context learning: Transformers excel at learning new patterns from examples provided in the prompt. SSMs are weaker at this, likely because attention's O(N^2) comparison mechanism is genuinely useful for matching patterns across the context.
- Exact recall: Tasks like "What was the third word in the second paragraph?" require precise attention to specific positions. Linear models compress the entire past into a fixed-size state, which blurs exact positional information.
- Established ecosystem: The transformer ecosystem (optimization tools, deployment frameworks, fine-tuning methods) is vastly more mature.
Practical Implications
For most application developers, the architecture underlying the LLM is invisible: you call an API and get text back. Architecture matters when:
- Self-hosting long-context models: Linear models require dramatically less memory for long sequences
- Edge deployment: Mamba's constant-memory inference fits devices with limited RAM
- Streaming applications: RNN-style inference (one token at a time, constant compute) suits real-time applications
- Cost optimization: Linear scaling means 10x longer contexts cost 10x more, not 100x more
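The last point is simple arithmetic, but worth making explicit:

```python
def cost_multiplier(context_scale, exponent):
    """Compute growth when context length scales by `context_scale`,
    for a layer whose cost scales as N**exponent
    (1 = linear-time layer, 2 = quadratic attention)."""
    return context_scale ** exponent

print(cost_multiplier(10, 1))   # linear layers: 10x the compute
print(cost_multiplier(10, 2))   # quadratic attention: 100x the compute
```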
The future likely involves hybrid architectures that combine attention where it matters most with linear layers for efficiency. Pure transformer dominance is ending, but transformers are not going away.
Sources: Mamba Paper - arXiv:2312.00752 | RWKV Project | Jamba Architecture - AI21 Labs