7 ML Fundamentals Questions That Top AI Companies Still Ask in 2026
Real machine learning fundamentals interview questions from OpenAI, Google DeepMind, Meta, and xAI in 2026. Covers attention mechanisms, KV cache, distributed training, MoE, speculative decoding, and emerging architectures.
ML Fundamentals in 2026: Not Your Textbook Questions
A common misconception: "With LLM APIs available, companies don't ask ML fundamentals anymore." Wrong. They still do — but the questions have evolved. Nobody asks you to derive backpropagation anymore. Instead, they ask about modern transformer internals — the building blocks of every model powering today's AI products.
These 7 questions test whether you understand why modern architectures work, not just how to use them.
Standard Self-Attention
# Scaled Dot-Product Attention
Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) * V
# Where:
# Q = query matrix (n x d_k)
# K = key matrix (n x d_k)
# V = value matrix (n x d_v)
# n = sequence length
# d_k = key dimension
Complexity: O(n^2 * d) — quadratic in sequence length. For a 128K token context, the attention matrix is 128K x 128K = 16 billion elements. This is the bottleneck.
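The formula above can be sketched directly in NumPy — a toy single-head version (real implementations batch over heads and use fused kernels):

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating for numerical stability
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # (n, n) attention matrix — the O(n^2) part
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V                  # (n, d_v)

# Toy example: 4 tokens, d_k = d_v = 8
rng = np.random.default_rng(0)
n, d = 4, 8
Q, K, V = rng.normal(size=(3, n, d))
out = attention(Q, K, V)
assert out.shape == (n, d)
```

Because each row of attention weights sums to 1, passing a constant V returns that constant — a handy sanity check when implementing this in an interview.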
Multi-Head Attention
Split Q, K, V into h heads, each with dimension d_k/h. Each head attends independently, then concatenate:
MultiHead(Q, K, V) = Concat(head_1, ..., head_h) * W_O
where head_i = Attention(Q*W_Qi, K*W_Ki, V*W_Vi)
Why multiple heads? Different heads learn different attention patterns — some attend to local context, some to long-range dependencies, some to syntactic structure.
Modern Approaches to Reduce Complexity
| Method | Complexity | How It Works |
|---|---|---|
| Flash Attention | O(n^2) but 2-4x faster | Fuses attention computation into a single GPU kernel, avoids materializing the n x n attention matrix in HBM. Memory: O(n) instead of O(n^2). |
| Grouped-Query Attention (GQA) | O(n^2) but less memory | Share K,V heads across multiple Q heads. If 32 Q heads share 8 KV heads, KV cache is 4x smaller. |
| Multi-Query Attention (MQA) | O(n^2) but minimal KV cache | All Q heads share a single K,V head. Maximum memory savings, slight quality tradeoff. |
| Sliding Window Attention | O(n * w) where w = window | Each token attends only to w nearby tokens. Used in Mistral. Stacked layers give effective receptive field of L*w. |
| Linear Attention | O(n * d) | Replace softmax with kernel approximation: Attention = phi(Q) * (phi(K)^T * V). Avoids materializing n x n matrix entirely. |
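The linear-attention trick in the last row fits in a few lines of NumPy. This is a toy sketch assuming the elu(x)+1 feature map used in some linear-transformer work; the key point is that the n x n matrix is never formed:

```python
import numpy as np

def phi(x):
    # Positive feature map: elu(x) + 1 (one common kernel choice)
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V):
    """Kernelized attention without the n x n matrix.

    Computes phi(Q) @ (phi(K)^T @ V) with row-wise normalization.
    Cost: O(n * d_k * d_v) instead of O(n^2 * d).
    """
    Qf, Kf = phi(Q), phi(K)
    KV = Kf.T @ V               # (d_k, d_v) summary of keys and values
    Z = Qf @ Kf.sum(axis=0)     # per-row normalizer, stands in for softmax's denominator
    return (Qf @ KV) / Z[:, None]

rng = np.random.default_rng(1)
n, d = 6, 4
Q, K, V = rng.normal(size=(3, n, d))
out = linear_attention(Q, K, V)
assert out.shape == (n, d)
```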
The Nuance That Gets You Hired
"Flash Attention doesn't reduce the theoretical O(n^2) complexity — it reduces the IO complexity. Standard attention reads/writes the n x n matrix to GPU HBM multiple times. Flash Attention tiles the computation so it stays in fast SRAM, reducing HBM reads by 5-20x. This is why it gives 2-4x wall-clock speedup despite the same FLOP count. The lesson: in modern deep learning, memory bandwidth is often the bottleneck, not compute."
The KV Cache Problem
During autoregressive generation, each new token needs to attend to ALL previous tokens. Without caching:
- Token 1: Compute K,V for token 1
- Token 2: Recompute K,V for tokens 1,2
- Token 3: Recompute K,V for tokens 1,2,3
- ...
- Token n: Recompute K,V for all n tokens → O(n^2) total
With KV cache: Store computed K,V for previous tokens. Each new token only computes its own K,V and attends to the cached values → O(n) per token.
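A minimal sketch of decode-time caching, in a toy single-head NumPy setup (the class and helper names are illustrative, not from any library):

```python
import numpy as np

def attend(q, K, V):
    # Single-query attention over all cached keys/values
    s = q @ K.T / np.sqrt(q.shape[-1])
    w = np.exp(s - s.max())
    w = w / w.sum()
    return w @ V

class KVCache:
    """Append this step's K,V, then attend over the whole cache.

    Each step computes K,V only for the new token — O(n) total
    instead of O(n^2) recomputation.
    """
    def __init__(self):
        self.K, self.V = [], []

    def step(self, q, k, v):
        self.K.append(k)
        self.V.append(v)
        return attend(q, np.array(self.K), np.array(self.V))

rng = np.random.default_rng(2)
d = 8
cache = KVCache()
for t in range(5):
    q, k, v = rng.normal(size=(3, d))
    out = cache.step(q, k, v)
assert len(cache.K) == 5 and out.shape == (d,)
```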
Memory Cost
KV cache size per token = 2 * n_layers * n_kv_heads * d_head * bytes_per_param
Example (LLaMA 70B, FP16):
= 2 * 80 layers * 8 KV heads * 128 dim * 2 bytes
= 327,680 bytes per token
= ~320 KB per token
For 128K context: 320 KB * 128K = 40 GB just for KV cache!
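The arithmetic above can be checked with a small helper (a sketch; the factor of 2 accounts for storing both K and V):

```python
def kv_cache_bytes_per_token(n_layers, n_kv_heads, d_head, bytes_per_param):
    # 2x for the K tensor and the V tensor
    return 2 * n_layers * n_kv_heads * d_head * bytes_per_param

# LLaMA 70B in FP16: 80 layers, 8 KV heads (GQA), head dim 128, 2 bytes/param
per_token = kv_cache_bytes_per_token(80, 8, 128, 2)
assert per_token == 327_680            # = 320 KB per token

total_gb = per_token * 128_000 / 1e9   # ~40 GB for a 128K-token context
```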
How GQA Helps
Standard Multi-Head Attention: 64 query heads, 64 key heads, 64 value heads
Grouped-Query Attention: 64 query heads, 8 key heads, 8 value heads (groups of 8 queries share 1 KV pair)
KV cache reduction: 64/8 = 8x smaller. Note that the 70B example above already assumed 8 KV heads — LLaMA 70B ships with GQA. With full MHA (64 KV heads), the 128K-context cache would be 320 GB instead of 40 GB.
MHA: Q Q Q Q Q Q Q Q | K K K K K K K K | V V V V V V V V
↕ ↕ ↕ ↕ ↕ ↕ ↕ ↕ ↕ ↕ ↕ ↕ ↕ ↕ ↕ ↕ ↕ ↕ ↕ ↕ ↕ ↕ ↕ ↕
GQA: Q Q Q Q Q Q Q Q | K K | V V
↕ ↕ ↕ ↕ ↕ ↕ ↕ ↕ ↕ ↕ ↕ ↕
(groups of 4 share one KV pair)
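One common way to implement GQA is to broadcast the shared KV heads up to the query-head count before computing attention, as in this hypothetical NumPy helper:

```python
import numpy as np

def expand_kv(kv, n_q_heads):
    """Broadcast shared KV heads to match the query heads.

    kv: (n_kv_heads, n, d). Each KV head serves a contiguous group of
    n_q_heads // n_kv_heads query heads, so we repeat along axis 0.
    """
    n_kv_heads = kv.shape[0]
    group = n_q_heads // n_kv_heads
    return np.repeat(kv, group, axis=0)   # (n_q_heads, n, d)

K = np.zeros((8, 16, 128))   # 8 KV heads, 16 tokens, head dim 128
K64 = expand_kv(K, 64)       # matched to 64 query heads
assert K64.shape == (64, 16, 128)
```

Only the small (8-head) tensors are stored in the KV cache; the expansion is done on the fly, which is where the memory saving comes from.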
The Nuance That Gets You Hired
"KV cache is the reason batch size during inference is usually memory-bound, not compute-bound. Each request in a batch needs its own KV cache, so serving 100 concurrent users means 100x the KV cache memory. This is why GQA was essential for scaling — it directly increases the number of concurrent users a single GPU can serve. PagedAttention (vLLM) takes this further by managing KV cache as virtual memory pages, allowing non-contiguous allocation and reducing memory waste from variable-length sequences by up to 55%."
The Scale of the Problem
GPT-4 class models have ~1.8 trillion parameters. At FP16, that's 3.6 TB of weights alone. A top-end H100 has 80 GB memory. You need at minimum 45 GPUs just to hold the model — and training requires 2-3x more memory for optimizer states and gradients.
Parallelism Strategies
1. Data Parallelism (DP)
- Replicate the model on N GPUs
- Each GPU processes a different data batch
- All-reduce gradients across GPUs after each step
- Limitation: Model must fit on one GPU (doesn't solve our problem)
2. Fully Sharded Data Parallelism (FSDP / ZeRO)
- Shard optimizer states (ZeRO Stage 1), gradients (Stage 2), AND parameters (Stage 3) across GPUs
- Each GPU holds only 1/N of everything
- All-gather parameters before forward/backward, reduce-scatter gradients after
- Memory per GPU: O(model_size / N) instead of O(model_size)
3. Tensor Parallelism (TP)
- Split individual layers across GPUs
- Example: A 16384-dim linear layer on 8 GPUs → each GPU computes 2048-dim slice
- Requires fast interconnect (NVLink) — every layer needs communication
4. Pipeline Parallelism (PP)
- Split model layers into stages: GPU 1 has layers 1-20, GPU 2 has layers 21-40, etc.
- Micro-batching: Split batch into micro-batches, pipeline them through stages
- Bubble overhead: Some GPUs idle while waiting for micro-batches → ~20-30% efficiency loss
5. In Practice: 3D Parallelism
3D Parallelism = TP (within node) + PP (across nodes) + FSDP (across replicas)
Example: Training 1T model on 1024 GPUs
- 8-way TP within each 8-GPU node (NVLink, fast)
- 16-way PP across 16 nodes (InfiniBand)
- 8 FSDP replicas for data parallelism
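The grid sizes multiply, which a quick sketch makes concrete (the helper names are illustrative):

```python
def total_gpus(tp, pp, dp):
    """Devices in a TP x PP x DP layout — the three sizes multiply."""
    return tp * pp * dp

def params_per_gpu(total_params, tp, pp, dp, fsdp=True):
    """Naive parameter shard per device.

    TP and PP always split the weights; FSDP additionally shards each
    replica's copy across the data-parallel dimension.
    """
    shard = total_params / (tp * pp)
    return shard / dp if fsdp else shard

# The example layout above: 8-way TP x 16-way PP x 8 FSDP replicas
assert total_gpus(8, 16, 8) == 1024
# 1T parameters spread over the full 3D grid
assert params_per_gpu(1e12, 8, 16, 8) == 1e12 / 1024
```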
The Nuance That Gets You Hired
"The key insight is matching parallelism strategy to hardware topology. Tensor parallelism needs the highest bandwidth (NVLink at 900 GB/s within a node). Pipeline parallelism can tolerate lower bandwidth (InfiniBand at 400 Gb/s across nodes). FSDP communication is mostly gradients, which can overlap with computation. A common mistake is applying tensor parallelism across nodes — the latency kills throughput. Always TP within a node, PP across nodes."
Also mention: "For fine-tuning (not pre-training), FSDP alone is usually sufficient. Combined with QLoRA, you can fine-tune a 70B model on 4 GPUs. Pre-training at frontier scale is where you need the full 3D parallelism stack."
The Core Difference
Batch Normalization (BN):
- Normalizes across the batch dimension for each feature
- For a feature at position (i,j): compute mean and variance across all samples in the batch
- Requires a batch of samples → depends on batch size
Layer Normalization (LN):
- Normalizes across the feature dimension for each sample
- For a sample: compute mean and variance across all features in that sample
- Independent of batch size → works with batch size 1
Why Transformers Use LayerNorm
- Variable sequence lengths: Batch norm would compute statistics across padded sequences, polluting the normalization with padding tokens
- Autoregressive generation: At inference, batch size is effectively 1 (generating one token at a time). BN's running statistics from training wouldn't match.
- Sequence position independence: LN normalizes each position independently — the normalization of token at position 5 doesn't depend on what's at position 100
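The axis difference is easy to see in NumPy (training-mode statistics only, no learnable scale/shift):

```python
import numpy as np

x = np.random.default_rng(3).normal(size=(4, 10))  # (batch, features)

# BatchNorm: statistics per feature, computed across the batch (axis 0)
bn = (x - x.mean(axis=0)) / x.std(axis=0)

# LayerNorm: statistics per sample, computed across features (axis 1)
ln = (x - x.mean(axis=1, keepdims=True)) / x.std(axis=1, keepdims=True)

# LN works fine for a single sample; BN's batch statistics degenerate there
assert np.allclose(ln.mean(axis=1), 0.0) and np.allclose(ln.std(axis=1), 1.0)
```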
Modern Variant: RMSNorm
Most current models (LLaMA, Mistral, Gemma) use RMSNorm instead of LayerNorm:
# LayerNorm: subtract mean, divide by std
LayerNorm(x) = (x - mean(x)) / std(x) * gamma + beta
# RMSNorm: skip mean subtraction, divide by RMS only
RMSNorm(x) = x / RMS(x) * gamma
where RMS(x) = sqrt(mean(x^2))
RMSNorm is ~10-15% faster (no mean computation) with negligible quality difference.
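Both norms in a few lines of NumPy (a sketch with a small eps for stability; note that for zero-mean inputs the two coincide, which is part of why dropping the mean costs so little):

```python
import numpy as np

def layer_norm(x, gamma=1.0, beta=0.0, eps=1e-6):
    # Subtract mean, divide by std, then scale and shift
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps) * gamma + beta

def rms_norm(x, gamma=1.0, eps=1e-6):
    # No mean subtraction: rescale by the root-mean-square only
    rms = np.sqrt((x ** 2).mean(axis=-1, keepdims=True) + eps)
    return x / rms * gamma

x = np.random.default_rng(4).normal(size=(2, 8))
xc = x - x.mean(axis=-1, keepdims=True)   # zero-mean input
assert np.allclose(layer_norm(xc), rms_norm(xc), atol=1e-3)
```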
The Nuance That Gets You Hired
"The placement of LayerNorm also matters. Original Transformer used Post-LN (normalize after attention/FFN). Modern models use Pre-LN (normalize before attention/FFN). Pre-LN enables better gradient flow and more stable training at scale, which is why it's universal in models trained after 2020. The tradeoff: Pre-LN can slightly underperform Post-LN at convergence, but it trains much more stably without careful learning rate warmup."
Core Concept
MoE replaces the dense FFN (feed-forward network) in each transformer layer with multiple expert FFNs and a router that selects which experts process each token.
Input Token → Router → Top-K Experts (e.g., 2 of 16) → Weighted Sum → Output
Standard FFN: All parameters activated for every token
MoE FFN: Only K/N parameters activated per token (e.g., 2/16 = 12.5%)
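A toy top-k router in NumPy, using per-token Python loops for clarity (real implementations batch the dispatch and add a load-balancing loss):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def moe_layer(x, router_w, experts, k=2):
    """Route each token to its top-k experts, weighting by router prob.

    x: (n_tokens, d); router_w: (d, n_experts);
    experts: list of callables, each mapping a (d,) vector to (d,).
    """
    probs = softmax(x @ router_w)              # (n_tokens, n_experts)
    topk = np.argsort(-probs, axis=-1)[:, :k]  # indices of chosen experts
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        w = probs[t, topk[t]]
        w = w / w.sum()                        # renormalize over the top-k
        for j, e_idx in enumerate(topk[t]):
            out[t] += w[j] * experts[e_idx](x[t])
    return out

rng = np.random.default_rng(5)
d, n_experts = 8, 4
experts = [(lambda W: (lambda v: W @ v))(rng.normal(size=(d, d)))
           for _ in range(n_experts)]
x = rng.normal(size=(3, d))
y = moe_layer(x, rng.normal(size=(d, n_experts)), experts, k=2)
assert y.shape == x.shape
```

Only k of the n_experts expert matrices touch each token — the rest of the parameters sit idle for that token, which is the entire efficiency argument.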
Why MoE Dominates in 2026
The scaling insight: You can have a 1T total parameter model that only uses 100B parameters per token. This gives you the knowledge capacity of a massive model with the inference cost of a smaller one.
| Model | Total Params | Active Params/Token | Experts |
|---|---|---|---|
| Mixtral 8x7B | 46.7B | 12.9B | 8 experts, top-2 |
| LLaMA 4 Maverick | 400B | 17B | 128 experts |
| GPT-4 (rumored) | ~1.8T | ~280B | 16 experts, top-2 |
Key Design Decisions
- Number of experts: 8-128. More experts = more capacity, but harder to train (load balancing)
- Top-K routing: Usually K=2. Top-1 is faster but less stable. Top-2 gives good quality with reasonable cost.
- Load balancing loss: Without it, the router sends all tokens to 1-2 "popular" experts. Add auxiliary loss to encourage uniform expert utilization.
- Expert capacity factor: Max tokens per expert per batch. Overflow tokens are dropped (lossy) or sent to a shared expert.
The Nuance That Gets You Hired
"The main challenge with MoE is training instability and expert collapse — where most experts become unused. The solutions are: (1) auxiliary load balancing loss (penalize when expert utilization is uneven), (2) expert parallelism (place different experts on different GPUs, so each GPU handles fewer experts with more tokens), and (3) shared experts (1-2 experts that process every token, ensuring a baseline quality even if routing is suboptimal). DeepSeek-V3 pioneered the 'shared + routed' pattern that's now standard."
Also: "MoE models are harder to serve because the total parameter count determines memory requirements, not the active count. A 400B-total MoE model needs all 400B parameters loaded into GPU memory even though only a fraction of them are activated per token. This is why MoE inference benefits heavily from tensor parallelism across many GPUs."
The Bottleneck It Solves
Autoregressive LLM generation is memory-bandwidth bound, not compute-bound. Generating one token requires loading the entire model from memory, but only does a tiny amount of computation. The GPU is mostly waiting for data to arrive from memory.
How Speculative Decoding Works
Step 1: Draft model (small, fast) generates K candidate tokens
"The capital of France is Paris, a beautiful"
Step 2: Target model (large, accurate) verifies ALL K tokens in one forward pass
Accepts: "The capital of France is Paris" (5 tokens)
Rejects: "a beautiful" (diverges at token 6)
Step 3: Accept verified tokens, resample from target distribution at rejection point
Output: "The capital of France is Paris, which is"
(5 accepted + 1 resampled = 6 tokens from one target pass)
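The draft-verify loop can be sketched with greedy verification — a simplification: real speculative decoding verifies all K drafts in one batched target forward pass and uses rejection sampling to match the target distribution exactly, and the callables here are hypothetical toy models:

```python
def speculative_step(draft_next, target_next, prefix, k):
    """One draft-then-verify step (greedy toy version).

    draft_next / target_next: callables mapping a token sequence to
    the next token. Here the target is queried sequentially; in a real
    system all k positions are checked in a single batched pass.
    """
    # Step 1: draft model proposes k tokens autoregressively
    proposal = list(prefix)
    for _ in range(k):
        proposal.append(draft_next(proposal))
    drafted = proposal[len(prefix):]

    # Steps 2-3: target verifies; accept the longest matching prefix,
    # then emit the target's own token at the first mismatch
    accepted = []
    seq = list(prefix)
    for tok in drafted:
        t = target_next(seq)
        if t == tok:
            accepted.append(tok)
            seq.append(tok)
        else:
            accepted.append(t)             # target's correction
            return accepted
    accepted.append(target_next(seq))      # bonus token after full acceptance
    return accepted

# Toy "models": the draft counts up, the target counts up but caps at 3
draft = lambda s: s[-1] + 1
target = lambda s: min(s[-1] + 1, 3)
out = speculative_step(draft, target, [0], k=5)
assert out == [1, 2, 3, 3]   # 3 accepted + 1 corrected, from one verify sweep
```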
Why This Is Faster
- Without speculation: 6 tokens = 6 forward passes through the large model
- With speculation: 6 tokens = 1 draft pass + 1 verification pass
- Speedup depends on acceptance rate: If the draft model agrees with the target 80% of the time, you get ~3-4x speedup
- Quality guarantee: The output distribution is mathematically identical to the target model (no quality loss!)
Key Design Decisions
| Factor | Choice | Impact |
|---|---|---|
| Draft model size | 1-7B (vs. 70B+ target) | Smaller = faster drafting, but lower acceptance rate |
| Speculation length K | 3-8 tokens | Higher K = more speedup if accepted, more waste if rejected |
| Draft model type | Same family (distilled) vs. N-gram | Same family has higher acceptance rate |
The Nuance That Gets You Hired
"There are two emerging variants worth mentioning: (1) Self-speculative decoding — use the model's own early-exit layers as the draft model, avoiding the need for a separate small model. (2) Medusa — add multiple parallel prediction heads to the model, each predicting 1, 2, 3... tokens ahead. These can be verified in a single tree-attention pass. Medusa is gaining traction because it doesn't require a separate draft model and is easier to deploy."
Also: "The acceptance rate varies dramatically by task. For code generation (highly predictable syntax), acceptance rates can be 90%+. For creative writing (high entropy), acceptance rates drop to 40-50%. Smart implementations adaptively adjust the speculation length K based on recent acceptance rates."
Why This Question Is Asked
Transformers have dominated since 2017, but their quadratic attention cost is a fundamental limitation. Interviewers (especially at research-focused companies) want to know if you're thinking about what comes next.
State Space Models (SSMs) / Mamba
Core idea: Replace attention with a linear recurrence that processes sequences in O(n) time and O(1) memory per step.
Transformers: Every token attends to every other token → O(n^2)
SSMs/Mamba: Each token updates a fixed-size hidden state → O(n)
Mamba's key innovation — Selective State Spaces:
- Traditional SSMs have fixed state transition matrices (can't selectively remember/forget)
- Mamba makes the state transition matrices input-dependent — the model can learn to selectively attend to important tokens and ignore irrelevant ones
- This gives attention-like selectivity with linear complexity
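A fixed-matrix SSM scan in NumPy shows the O(1)-state recurrence (Mamba's selectivity would make A, B, C functions of the input; fixed matrices keep the sketch minimal):

```python
import numpy as np

def ssm_scan(A, B, C, xs):
    """Linear state-space recurrence: h_t = A h_{t-1} + B x_t, y_t = C h_t.

    The state h has a fixed size, so per-step cost and memory are O(1)
    in sequence length — no KV cache that grows with context.
    """
    h = np.zeros(A.shape[0])
    ys = []
    for x in xs:
        h = A @ h + B @ x
        ys.append(C @ h)
    return np.array(ys)

rng = np.random.default_rng(6)
d_state, d_in, d_out, n = 4, 2, 3, 10
A = 0.9 * np.eye(d_state)             # decaying memory of past inputs
B = rng.normal(size=(d_state, d_in))
C = rng.normal(size=(d_out, d_state))
ys = ssm_scan(A, B, C, rng.normal(size=(n, d_in)))
assert ys.shape == (n, d_out)
```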
SSM vs. Transformer Comparison
| Aspect | Transformer | Mamba/SSM |
|---|---|---|
| Training complexity | O(n^2) | O(n) |
| Inference (per token) | O(n) — attends to all history | O(1) — fixed state update |
| Inference memory | O(n) — KV cache grows | O(1) — fixed state size |
| Long-range reasoning | Excellent (direct attention) | Good but weaker (compressed state) |
| Throughput on long seqs | Drops significantly | Stays constant |
The Hybrid Trend
The 2025-2026 frontier is hybrid architectures that combine attention and SSM layers:
- Jamba (AI21): Alternating transformer and Mamba layers
- Griffin (Google): Recurrent layer (SSM) + local attention
- Mamba-2: Improved SSM that can be computed as structured matrix multiplication (hardware-friendly)
The Nuance That Gets You Hired
"The honest assessment: pure SSMs still underperform transformers on tasks requiring precise in-context retrieval — 'find the needle in the haystack.' Attention can directly look up any token in history; SSMs must compress everything into a fixed-size state, so information gets lossy. This is why hybrids are winning — use attention layers for the information retrieval heavy-lifting, and SSM layers for efficient sequence processing in between. My prediction: the 2027-era frontier models will be hybrids, not pure transformers or pure SSMs."
Research-specific follow-up: "RWKV (an RNN-transformer hybrid) is another contender. It reformulates attention as a linear recurrence, giving O(n) training and O(1) inference while maintaining attention-like expressiveness. The competition between Mamba, RWKV, and hybrid approaches is the most active area of architecture research right now."
Quick Reference Card
| Concept | One-Line Summary |
|---|---|
| Self-Attention | Every token attends to every other: O(n^2) but extremely expressive |
| Flash Attention | Same math, 2-4x faster by staying in SRAM, O(n) memory |
| GQA | Share KV heads across query groups, 4-8x KV cache reduction |
| KV Cache | Store computed K,V to avoid recomputation, main inference memory bottleneck |
| FSDP | Shard all params/grads/optimizer across GPUs for distributed training |
| 3D Parallelism | TP within node + PP across nodes + FSDP for replicas |
| RMSNorm | Simplified LayerNorm (no mean subtraction), 10-15% faster |
| MoE | Multiple expert FFNs + router, 10x capacity at 1x compute |
| Speculative Decoding | Small model drafts, large model verifies in one pass, 2-4x speedup |
| Mamba/SSMs | Linear-time sequence modeling, O(1) inference memory, weaker on retrieval |
Frequently Asked Questions
Do I need to implement transformers from scratch for interviews?
At research-focused companies (OpenAI, Google DeepMind, Anthropic), yes — you should be able to implement multi-head attention in PyTorch from basic tensor operations. At application-focused companies, understanding the concepts and trade-offs is sufficient.
How deep should I go on the math?
Know the key equations (attention formula, softmax, normalization). Be able to reason about complexity (O(n^2) for attention, O(n) for SSMs). You don't need to derive backprop or prove convergence.
Are SSMs going to replace transformers?
Not in the near term. Hybrids are more likely. Transformers are too good at in-context learning and retrieval. But SSMs will likely handle the bulk of sequence processing in hybrid architectures, with attention reserved for information-critical layers.
Written by
CallSphere Team
Expert insights on AI voice agents and customer communication automation.