The AI Compute Scaling Laws Debate: Are Bigger Models Still Better in 2026?
Examine the evolving debate around compute scaling laws — whether the Chinchilla ratios still hold, the rise of inference-time compute, and what the latest research says about model scaling.
The Original Promise of Scaling Laws
In 2020, Kaplan et al. at OpenAI published "Scaling Laws for Neural Language Models," demonstrating a remarkably predictable relationship: model performance improves as a power law of model size, dataset size, and compute budget. Double the compute, get a predictable improvement in loss.
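The power-law relationship can be sketched in a few lines. The constants below are illustrative stand-ins, not the fitted values from the Kaplan et al. paper; the point is the functional form, under which every doubling of compute buys the same multiplicative reduction in loss.

```python
def loss_from_compute(compute, c0=1.0, alpha=0.05):
    """Predicted loss as a power law of compute budget (arbitrary units).

    c0 and alpha are made-up illustrative constants, not fitted values.
    """
    return (c0 / compute) ** alpha

# Doubling compute always shrinks loss by the same factor, 2 ** -alpha,
# regardless of where you start on the curve:
improvement = loss_from_compute(2.0) / loss_from_compute(1.0)
same_improvement = loss_from_compute(200.0) / loss_from_compute(100.0)
```

This scale-invariance is what made the laws so useful for planning: labs could extrapolate from small training runs to predict the loss of much larger ones.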
This paper launched the scaling era. Labs raced to train ever-larger models, confident that more compute would translate directly to more capability. GPT-3 (175B parameters), PaLM (540B), and eventually GPT-4 (rumored to be a mixture of experts with trillions of parameters) were all justified by scaling law projections.
The Chinchilla Correction
In 2022, DeepMind's Chinchilla paper challenged the Kaplan scaling ratios. It showed that most large models were undertrained — they had too many parameters relative to their training data. Chinchilla demonstrated that a 70B parameter model trained on 1.4T tokens outperformed a 280B model trained on 300B tokens, despite using the same total compute.
The Chinchilla-optimal ratio — roughly 20 tokens per parameter — became the new standard. Llama 2 (70B trained on 2T tokens) and Mistral's models followed this guidance closely.
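The rule of thumb can be turned into a back-of-the-envelope calculator. This sketch uses the common approximation that training compute is roughly 6 x parameters x tokens, combined with the ~20 tokens-per-parameter ratio; both are rough heuristics, not exact results.

```python
import math

def chinchilla_optimal(compute_flops, tokens_per_param=20.0):
    """Split a compute budget into parameters and tokens.

    Uses the rough approximations C ~= 6 * N * D (training FLOPs) and
    D ~= 20 * N (the Chinchilla rule of thumb). Heuristic only.
    """
    n_params = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Sanity check against the paper's headline configuration:
# 70B parameters trained on 1.4T tokens.
budget = 6.0 * 70e9 * 1.4e12
params, tokens = chinchilla_optimal(budget)  # ~70e9 params, ~1.4e12 tokens
```

Running the numbers the other way shows why earlier models were undertrained: at a fixed budget, shifting parameters into tokens (up to the ~20:1 ratio) lowers final loss.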
Where the Debate Stands in 2026
The "Scaling Is Hitting Walls" Camp
Several signals suggest diminishing returns from pure scale:
- Flattening frontier gains: GPT-4 to GPT-4o improvements were modest compared to the GPT-3 to GPT-4 leap
- Data exhaustion: The supply of high-quality text data on the internet is finite. Estimates suggest we may exhaust unique high-quality web text by 2028 at current training rates
- Benchmark saturation: Models are approaching human-level performance on many benchmarks, making further improvements harder to measure
- Prohibitive costs: Training runs costing $100M+ are economically sustainable only for the largest companies
The "Scaling Still Works" Camp
Other researchers argue that scaling is far from exhausted:
- New data modalities: Video, audio, code execution traces, and tool-use trajectories provide vast new training data sources
- Synthetic data: LLM-generated training data (when properly filtered and decontaminated) extends the effective data supply
- Architecture improvements: Mixture of Experts (MoE) enables larger total parameters while keeping inference cost constant
- Multi-epoch training: Recent research shows that training on the same data for multiple epochs, with proper data ordering and curriculum learning, continues to improve models
The Inference-Time Compute Paradigm
The most significant shift in 2025-2026 is the move from training-time scaling to inference-time scaling. OpenAI's o1, o3, and DeepSeek's R1 demonstrate that giving a model more time to "think" at inference time — through chain-of-thought reasoning, search, and verification — can achieve capabilities that would require orders of magnitude more training compute.
This changes the economics fundamentally:
Training compute: Spent once, amortized over all users
Inference compute: Spent per query, scales with usage
The question becomes: is it more cost-effective to train a larger model or to give a smaller model more inference-time compute? For many tasks, the answer is increasingly the latter.
Test-Time Training
An emerging approach that blurs the line: adapting the model's weights at inference time using the specific test input. This is not full fine-tuning — it is a lightweight, temporary update that improves performance on the specific input without permanently changing the model. Early results on math and coding benchmarks are promising.
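The mechanic can be sketched with a toy one-parameter model. Everything here is a stand-in: the "self-supervised loss" is a trivial reconstruction objective, and a real system would use a neural network and a richer auxiliary task. The point is the shape of the procedure: take a temporary gradient step on the test input, predict, then restore the original weights.

```python
def ttt_predict(w, x, lr=0.1):
    """Predict x -> w * x after one temporary gradient step of test-time training.

    The stand-in self-supervised objective is (w*x - x)**2, i.e. nudging the
    model toward reconstructing the test input from itself. The original
    weight is restored afterward, so the base model is never changed.
    """
    w_backup = w
    grad = 2.0 * x * (w * x - x)   # d/dw of (w*x - x)**2
    w = w - lr * grad              # lightweight, temporary update
    prediction = w * x
    w = w_backup                   # restore: no permanent change to the model
    return prediction
```

Even in this toy, the adapted prediction moves toward the target for the specific input without any permanent weight change, which is the core idea behind the approach.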
The Mixture of Experts Factor
MoE architectures have changed how we think about model size. A model with 8 experts of 70B parameters each has 560B total parameters but, with one expert routed per token, activates only 70B per token. This means:
- Training cost scales with total parameters (you still need to train all experts)
- Inference cost scales with active parameters (much cheaper per query)
- Scaling laws need to be re-derived for MoE architectures, as the original Kaplan and Chinchilla results assumed dense models
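The parameter accounting from the bullets above can be made concrete. This sketch simplifies by ignoring shared components (attention, embeddings, the router itself) and assumes a fixed number of experts routed per token, which is an assumption, not a universal MoE design.

```python
def moe_param_counts(n_experts, params_per_expert, experts_per_token=1):
    """Total vs. active parameter counts for a simplified MoE.

    Ignores shared layers (attention, embeddings, router); real models
    would add those to both counts.
    """
    total = n_experts * params_per_expert       # drives training cost
    active = experts_per_token * params_per_expert  # drives per-query cost
    return total, active

# The example from the text: 8 experts of 70B each, one expert per token.
total, active = moe_param_counts(8, 70e9, experts_per_token=1)
```

Routing to two experts per token (as some MoE models do) doubles the active count while leaving the total unchanged, which is exactly the training-cost vs. inference-cost decoupling the bullets describe.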
What This Means for Practitioners
- Do not wait for bigger models to solve your problems: If your current model cannot do it, a 2x larger model probably will not either. Invest in better prompting, fine-tuning, and agentic architectures.
- Consider inference-time compute: Giving your model a reasoning step or self-verification loop may be more cost-effective than upgrading to a larger model.
- Watch the small model space: Models like Phi-3, Gemma 2, and Mistral's smaller offerings are closing the gap with larger models for many practical tasks.
- Data quality over data quantity: The Chinchilla lesson extends beyond pre-training. For fine-tuning, 1,000 high-quality examples often outperform 100,000 noisy ones.