7 ML Fundamentals Questions That Top AI Companies Still Ask in 2026
Real machine learning fundamentals interview questions from OpenAI, Google DeepMind, Meta, and xAI in 2026. Covers attention mechanisms, KV cache, distributed training, MoE, speculative decoding, and emerging architectures.
ML Fundamentals in 2026: Not Your Textbook Questions
A common misconception: "With LLM APIs available, companies don't ask ML fundamentals anymore." Wrong. They still do — but the questions have evolved. Nobody asks you to derive backpropagation anymore. Instead, they ask about modern transformer internals — the building blocks of every model powering today's AI products.
These 7 questions test whether you understand why modern architectures work, not just how to use them.
Standard Self-Attention
# Scaled Dot-Product Attention
Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) * V
# Where:
# Q = query matrix (n x d_k)
# K = key matrix (n x d_k)
# V = value matrix (n x d_v)
# n = sequence length
# d_k = key dimension
Complexity: O(n^2 * d) — quadratic in sequence length. For a 128K token context, the attention matrix is 128K x 128K = 16 billion elements. This is the bottleneck.
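The formula above can be sketched directly in NumPy — a toy single-head version (real implementations batch over heads and use fused kernels):

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating for numerical stability
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # (n, n) attention matrix — the O(n^2) part
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V                  # (n, d_v)

# Toy example: 4 tokens, d_k = d_v = 8
rng = np.random.default_rng(0)
n, d = 4, 8
Q, K, V = rng.normal(size=(3, n, d))
out = attention(Q, K, V)
assert out.shape == (n, d)
```

Because each row of attention weights sums to 1, passing a constant V returns that constant — a handy sanity check when implementing this in an interview.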
Multi-Head Attention
Split Q, K, V into h heads, each with dimension d_k/h. Each head attends independently, then concatenate:
MultiHead(Q, K, V) = Concat(head_1, ..., head_h) * W_O
where head_i = Attention(Q*W_Qi, K*W_Ki, V*W_Vi)
Why multiple heads? Different heads learn different attention patterns — some attend to local context, some to long-range dependencies, some to syntactic structure.
Modern Approaches to Reduce Complexity
| Method | Complexity | How It Works |
|---|---|---|
| Flash Attention | O(n^2) but 2-4x faster | Fuses attention computation into a single GPU kernel, avoids materializing the n x n attention matrix in HBM. Memory: O(n) instead of O(n^2). |
| Grouped-Query Attention (GQA) | O(n^2) but less memory | Share K,V heads across multiple Q heads. If 32 Q heads share 8 KV heads, KV cache is 4x smaller. |
| Multi-Query Attention (MQA) | O(n^2) but minimal KV cache | All Q heads share a single K,V head. Maximum memory savings, slight quality tradeoff. |
| Sliding Window Attention | O(n * w) where w = window | Each token attends only to w nearby tokens. Used in Mistral. Stacked layers give effective receptive field of L*w. |
| Linear Attention | O(n * d) | Replace softmax with kernel approximation: Attention = phi(Q) * (phi(K)^T * V). Avoids materializing n x n matrix entirely. |
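The linear-attention trick in the last row fits in a few lines of NumPy. This is a toy sketch assuming the elu(x)+1 feature map used in some linear-transformer work; the key point is that the n x n matrix is never formed:

```python
import numpy as np

def phi(x):
    # Positive feature map: elu(x) + 1 (one common kernel choice)
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V):
    """Kernelized attention without the n x n matrix.

    Computes phi(Q) @ (phi(K)^T @ V) with row-wise normalization.
    Cost: O(n * d_k * d_v) instead of O(n^2 * d).
    """
    Qf, Kf = phi(Q), phi(K)
    KV = Kf.T @ V               # (d_k, d_v) summary of keys and values
    Z = Qf @ Kf.sum(axis=0)     # per-row normalizer, stands in for softmax's denominator
    return (Qf @ KV) / Z[:, None]

rng = np.random.default_rng(1)
n, d = 6, 4
Q, K, V = rng.normal(size=(3, n, d))
out = linear_attention(Q, K, V)
assert out.shape == (n, d)
```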
The Nuance That Gets You Hired
"Flash Attention doesn't reduce the theoretical O(n^2) complexity — it reduces the IO complexity. Standard attention reads/writes the n x n matrix to GPU HBM multiple times. Flash Attention tiles the computation so it stays in fast SRAM, reducing HBM reads by 5-20x. This is why it gives 2-4x wall-clock speedup despite the same FLOP count. The lesson: in modern deep learning, memory bandwidth is often the bottleneck, not compute."
The KV Cache Problem
During autoregressive generation, each new token needs to attend to ALL previous tokens. Without caching:
- Token 1: Compute K,V for token 1
- Token 2: Recompute K,V for tokens 1,2
- Token 3: Recompute K,V for tokens 1,2,3
- ...
- Token n: Recompute K,V for all n tokens → O(n^2) total
With KV cache: Store computed K,V for previous tokens. Each new token only computes its own K,V and attends to the cached values → O(n) per token.
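A minimal sketch of decode-time caching, in a toy single-head NumPy setup (the class and helper names are illustrative, not from any library):

```python
import numpy as np

def attend(q, K, V):
    # Single-query attention over all cached keys/values
    s = q @ K.T / np.sqrt(q.shape[-1])
    w = np.exp(s - s.max())
    w = w / w.sum()
    return w @ V

class KVCache:
    """Append this step's K,V, then attend over the whole cache.

    Each step computes K,V only for the new token — O(n) total
    instead of O(n^2) recomputation.
    """
    def __init__(self):
        self.K, self.V = [], []

    def step(self, q, k, v):
        self.K.append(k)
        self.V.append(v)
        return attend(q, np.array(self.K), np.array(self.V))

rng = np.random.default_rng(2)
d = 8
cache = KVCache()
for t in range(5):
    q, k, v = rng.normal(size=(3, d))
    out = cache.step(q, k, v)
assert len(cache.K) == 5 and out.shape == (d,)
```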
Memory Cost
KV cache size per token = 2 * n_layers * n_kv_heads * d_head * bytes_per_param
Example (LLaMA 70B, FP16):
= 2 * 80 layers * 8 KV heads * 128 dim * 2 bytes
= 327,680 bytes per token
= ~320 KB per token
For 128K context: 320 KB * 128K = 40 GB just for KV cache!
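The arithmetic above can be checked with a small helper (a sketch; the factor of 2 accounts for storing both K and V):

```python
def kv_cache_bytes_per_token(n_layers, n_kv_heads, d_head, bytes_per_param):
    # 2x for the K tensor and the V tensor
    return 2 * n_layers * n_kv_heads * d_head * bytes_per_param

# LLaMA 70B in FP16: 80 layers, 8 KV heads (GQA), head dim 128, 2 bytes/param
per_token = kv_cache_bytes_per_token(80, 8, 128, 2)
assert per_token == 327_680            # = 320 KB per token

total_gb = per_token * 128_000 / 1e9   # ~40 GB for a 128K-token context
```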
How GQA Helps
Standard Multi-Head Attention: 64 query heads, 64 key heads, 64 value heads
Grouped-Query Attention: 64 query heads, 8 key heads, 8 value heads (groups of 8 queries share 1 KV pair)
KV cache reduction: 64/8 = 8x smaller. Note that the 70B example above already assumed 8 KV heads — LLaMA 70B ships with GQA. With full MHA (64 KV heads), the 128K-context cache would be 320 GB instead of 40 GB.
MHA: Q Q Q Q Q Q Q Q | K K K K K K K K | V V V V V V V V
↕ ↕ ↕ ↕ ↕ ↕ ↕ ↕ ↕ ↕ ↕ ↕ ↕ ↕ ↕ ↕ ↕ ↕ ↕ ↕ ↕ ↕ ↕ ↕
GQA: Q Q Q Q Q Q Q Q | K K | V V
↕ ↕ ↕ ↕ ↕ ↕ ↕ ↕ ↕ ↕ ↕ ↕
(groups of 4 share one KV pair)
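One common way to implement GQA is to broadcast the shared KV heads up to the query-head count before computing attention, as in this hypothetical NumPy helper:

```python
import numpy as np

def expand_kv(kv, n_q_heads):
    """Broadcast shared KV heads to match the query heads.

    kv: (n_kv_heads, n, d). Each KV head serves a contiguous group of
    n_q_heads // n_kv_heads query heads, so we repeat along axis 0.
    """
    n_kv_heads = kv.shape[0]
    group = n_q_heads // n_kv_heads
    return np.repeat(kv, group, axis=0)   # (n_q_heads, n, d)

K = np.zeros((8, 16, 128))   # 8 KV heads, 16 tokens, head dim 128
K64 = expand_kv(K, 64)       # matched to 64 query heads
assert K64.shape == (64, 16, 128)
```

Only the small (8-head) tensors are stored in the KV cache; the expansion is done on the fly, which is where the memory saving comes from.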
The Nuance That Gets You Hired
"KV cache is the reason batch size during inference is usually memory-bound, not compute-bound. Each request in a batch needs its own KV cache, so serving 100 concurrent users means 100x the KV cache memory. This is why GQA was essential for scaling — it directly increases the number of concurrent users a single GPU can serve. PagedAttention (vLLM) takes this further by managing KV cache as virtual memory pages, allowing non-contiguous allocation and reducing memory waste from variable-length sequences by up to 55%."
The Scale of the Problem
GPT-4 class models have ~1.8 trillion parameters. At FP16, that's 3.6 TB of weights alone. A top-end H100 has 80 GB memory. You need at minimum 45 GPUs just to hold the model — and training requires 2-3x more memory for optimizer states and gradients.
Parallelism Strategies
1. Data Parallelism (DP)
- Replicate the model on N GPUs
- Each GPU processes a different data batch
- All-reduce gradients across GPUs after each step
- Limitation: Model must fit on one GPU (doesn't solve our problem)
2. Fully Sharded Data Parallelism (FSDP / ZeRO)
- Shard optimizer states (ZeRO Stage 1), gradients (Stage 2), AND parameters (Stage 3) across GPUs
- Each GPU holds only 1/N of everything
- All-gather parameters before forward/backward, reduce-scatter gradients after
- Memory per GPU: O(model_size / N) instead of O(model_size)
3. Tensor Parallelism (TP)
- Split individual layers across GPUs
- Example: A 16384-dim linear layer on 8 GPUs → each GPU computes 2048-dim slice
- Requires fast interconnect (NVLink) — every layer needs communication
4. Pipeline Parallelism (PP)
- Split model layers into stages: GPU 1 has layers 1-20, GPU 2 has layers 21-40, etc.
- Micro-batching: Split batch into micro-batches, pipeline them through stages
- Bubble overhead: Some GPUs idle while waiting for micro-batches → ~20-30% efficiency loss
5. In Practice: 3D Parallelism
3D Parallelism = TP (within node) + PP (across nodes) + FSDP (across replicas)
Example: Training 1T model on 1024 GPUs
- 8-way TP within each 8-GPU node (NVLink, fast)
- 16-way PP across 16 nodes (InfiniBand)
- 8 FSDP replicas for data parallelism
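The grid sizes multiply, which a quick sketch makes concrete (the helper names are illustrative):

```python
def total_gpus(tp, pp, dp):
    """Devices in a TP x PP x DP layout — the three sizes multiply."""
    return tp * pp * dp

def params_per_gpu(total_params, tp, pp, dp, fsdp=True):
    """Naive parameter shard per device.

    TP and PP always split the weights; FSDP additionally shards each
    replica's copy across the data-parallel dimension.
    """
    shard = total_params / (tp * pp)
    return shard / dp if fsdp else shard

# The example layout above: 8-way TP x 16-way PP x 8 FSDP replicas
assert total_gpus(8, 16, 8) == 1024
# 1T parameters spread over the full 3D grid
assert params_per_gpu(1e12, 8, 16, 8) == 1e12 / 1024
```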
The Nuance That Gets You Hired
"The key insight is matching parallelism strategy to hardware topology. Tensor parallelism needs the highest bandwidth (NVLink at 900 GB/s within a node). Pipeline parallelism can tolerate lower bandwidth (InfiniBand at 400 Gb/s across nodes). FSDP communication is mostly gradients, which can overlap with computation. A common mistake is applying tensor parallelism across nodes — the latency kills throughput. Always TP within a node, PP across nodes."
Also mention: "For fine-tuning (not pre-training), FSDP alone is usually sufficient. Combined with QLoRA, you can fine-tune a 70B model on 4 GPUs. Pre-training at frontier scale is where you need the full 3D parallelism stack."
The Core Difference
Batch Normalization (BN):
- Normalizes across the batch dimension for each feature
- For a feature at position (i,j): compute mean and variance across all samples in the batch
- Requires a batch of samples → depends on batch size
Layer Normalization (LN):
- Normalizes across the feature dimension for each sample
- For a sample: compute mean and variance across all features in that sample
- Independent of batch size → works with batch size 1
Why Transformers Use LayerNorm
- Variable sequence lengths: Batch norm would compute statistics across padded sequences, polluting the normalization with padding tokens
- Autoregressive generation: At inference, batch size is effectively 1 (generating one token at a time). BN's running statistics from training wouldn't match.
- Sequence position independence: LN normalizes each position independently — the normalization of token at position 5 doesn't depend on what's at position 100
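The axis difference is easy to see in NumPy (training-mode statistics only, no learnable scale/shift):

```python
import numpy as np

x = np.random.default_rng(3).normal(size=(4, 10))  # (batch, features)

# BatchNorm: statistics per feature, computed across the batch (axis 0)
bn = (x - x.mean(axis=0)) / x.std(axis=0)

# LayerNorm: statistics per sample, computed across features (axis 1)
ln = (x - x.mean(axis=1, keepdims=True)) / x.std(axis=1, keepdims=True)

# LN works fine for a single sample; BN's batch statistics degenerate there
assert np.allclose(ln.mean(axis=1), 0.0) and np.allclose(ln.std(axis=1), 1.0)
```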
Modern Variant: RMSNorm
Most current models (LLaMA, Mistral, Gemma) use RMSNorm instead of LayerNorm:
# LayerNorm: subtract mean, divide by std
LayerNorm(x) = (x - mean(x)) / std(x) * gamma + beta
# RMSNorm: skip mean subtraction, divide by RMS only
RMSNorm(x) = x / RMS(x) * gamma
where RMS(x) = sqrt(mean(x^2))
RMSNorm is ~10-15% faster (no mean computation) with negligible quality difference.
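Both norms in a few lines of NumPy (a sketch with a small eps for stability; note that for zero-mean inputs the two coincide, which is part of why dropping the mean costs so little):

```python
import numpy as np

def layer_norm(x, gamma=1.0, beta=0.0, eps=1e-6):
    # Subtract mean, divide by std, then scale and shift
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps) * gamma + beta

def rms_norm(x, gamma=1.0, eps=1e-6):
    # No mean subtraction: rescale by the root-mean-square only
    rms = np.sqrt((x ** 2).mean(axis=-1, keepdims=True) + eps)
    return x / rms * gamma

x = np.random.default_rng(4).normal(size=(2, 8))
xc = x - x.mean(axis=-1, keepdims=True)   # zero-mean input
assert np.allclose(layer_norm(xc), rms_norm(xc), atol=1e-3)
```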
The Nuance That Gets You Hired
"The placement of LayerNorm also matters. Original Transformer used Post-LN (normalize after attention/FFN). Modern models use Pre-LN (normalize before attention/FFN). Pre-LN enables better gradient flow and more stable training at scale, which is why it's universal in models trained after 2020. The tradeoff: Pre-LN can slightly underperform Post-LN at convergence, but it trains much more stably without careful learning rate warmup."
Core Concept
MoE replaces the dense FFN (feed-forward network) in each transformer layer with multiple expert FFNs and a router that selects which experts process each token.
Input Token → Router → Top-K Experts (e.g., 2 of 16) → Weighted Sum → Output
Standard FFN: All parameters activated for every token
MoE FFN: Only K/N parameters activated per token (e.g., 2/16 = 12.5%)
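A toy top-k router in NumPy, using per-token Python loops for clarity (real implementations batch the dispatch and add a load-balancing loss):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def moe_layer(x, router_w, experts, k=2):
    """Route each token to its top-k experts, weighting by router prob.

    x: (n_tokens, d); router_w: (d, n_experts);
    experts: list of callables, each mapping a (d,) vector to (d,).
    """
    probs = softmax(x @ router_w)              # (n_tokens, n_experts)
    topk = np.argsort(-probs, axis=-1)[:, :k]  # indices of chosen experts
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        w = probs[t, topk[t]]
        w = w / w.sum()                        # renormalize over the top-k
        for j, e_idx in enumerate(topk[t]):
            out[t] += w[j] * experts[e_idx](x[t])
    return out

rng = np.random.default_rng(5)
d, n_experts = 8, 4
experts = [(lambda W: (lambda v: W @ v))(rng.normal(size=(d, d)))
           for _ in range(n_experts)]
x = rng.normal(size=(3, d))
y = moe_layer(x, rng.normal(size=(d, n_experts)), experts, k=2)
assert y.shape == x.shape
```

Only k of the n_experts expert matrices touch each token — the rest of the parameters sit idle for that token, which is the entire efficiency argument.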
Why MoE Dominates in 2026
The scaling insight: You can have a 1T total parameter model that only uses 100B parameters per token. This gives you the knowledge capacity of a massive model with the inference cost of a smaller one.
| Model | Total Params | Active Params/Token | Experts |
|---|---|---|---|
| Mixtral 8x7B | 46.7B | 12.9B | 8 experts, top-2 |
| LLaMA 4 Maverick | 400B | 17B | 128 experts |
| GPT-4 (rumored) | ~1.8T | ~280B | 16 experts, top-2 |
Key Design Decisions
- Number of experts: 8-128. More experts = more capacity, but harder to train (load balancing)
- Top-K routing: Usually K=2. Top-1 is faster but less stable. Top-2 gives good quality with reasonable cost.
- Load balancing loss: Without it, the router sends all tokens to 1-2 "popular" experts. Add auxiliary loss to encourage uniform expert utilization.
- Expert capacity factor: Max tokens per expert per batch. Overflow tokens are dropped (lossy) or sent to a shared expert.
The Nuance That Gets You Hired
"The main challenge with MoE is training instability and expert collapse — where most experts become unused. The solutions are: (1) auxiliary load balancing loss (penalize when expert utilization is uneven), (2) expert parallelism (place different experts on different GPUs, so each GPU handles fewer experts with more tokens), and (3) shared experts (1-2 experts that process every token, ensuring a baseline quality even if routing is suboptimal). DeepSeek-V3 pioneered the 'shared + routed' pattern that's now standard."
Also: "MoE models are harder to serve because the total parameter count determines memory requirements, not the active count. A 400B-total MoE model needs all 400B parameters loaded into GPU memory even though only a fraction of them are activated per token. This is why MoE inference benefits heavily from tensor parallelism across many GPUs."
The Bottleneck It Solves
Autoregressive LLM generation is memory-bandwidth bound, not compute-bound. Generating one token requires loading the entire model from memory, but only does a tiny amount of computation. The GPU is mostly waiting for data to arrive from memory.
How Speculative Decoding Works
Step 1: Draft model (small, fast) generates K candidate tokens
"The capital of France is Paris, a beautiful"
Step 2: Target model (large, accurate) verifies ALL K tokens in one forward pass
Accepts: "The capital of France is Paris" (5 tokens)
Rejects: "a beautiful" (diverges at token 6)
Step 3: Accept verified tokens, resample from target distribution at rejection point
Output: "The capital of France is Paris, which is"
(5 accepted + 1 resampled = 6 tokens from one target pass)
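The draft-verify loop can be sketched with greedy verification — a simplification: real speculative decoding verifies all K drafts in one batched target forward pass and uses rejection sampling to match the target distribution exactly, and the callables here are hypothetical toy models:

```python
def speculative_step(draft_next, target_next, prefix, k):
    """One draft-then-verify step (greedy toy version).

    draft_next / target_next: callables mapping a token sequence to
    the next token. Here the target is queried sequentially; in a real
    system all k positions are checked in a single batched pass.
    """
    # Step 1: draft model proposes k tokens autoregressively
    proposal = list(prefix)
    for _ in range(k):
        proposal.append(draft_next(proposal))
    drafted = proposal[len(prefix):]

    # Steps 2-3: target verifies; accept the longest matching prefix,
    # then emit the target's own token at the first mismatch
    accepted = []
    seq = list(prefix)
    for tok in drafted:
        t = target_next(seq)
        if t == tok:
            accepted.append(tok)
            seq.append(tok)
        else:
            accepted.append(t)             # target's correction
            return accepted
    accepted.append(target_next(seq))      # bonus token after full acceptance
    return accepted

# Toy "models": the draft counts up, the target counts up but caps at 3
draft = lambda s: s[-1] + 1
target = lambda s: min(s[-1] + 1, 3)
out = speculative_step(draft, target, [0], k=5)
assert out == [1, 2, 3, 3]   # 3 accepted + 1 corrected, from one verify sweep
```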
Why This Is Faster
- Without speculation: 6 tokens = 6 forward passes through the large model
- With speculation: 6 tokens = 1 draft pass + 1 verification pass
- Speedup depends on acceptance rate: If the draft model agrees with the target 80% of the time, you get ~3-4x speedup
- Quality guarantee: The output distribution is mathematically identical to the target model (no quality loss!)
Key Design Decisions
| Factor | Choice | Impact |
|---|---|---|
| Draft model size | 1-7B (vs. 70B+ target) | Smaller = faster drafting, but lower acceptance rate |
| Speculation length K | 3-8 tokens | Higher K = more speedup if accepted, more waste if rejected |
| Draft model type | Same family (distilled) vs. N-gram | Same family has higher acceptance rate |
The Nuance That Gets You Hired
"There are two emerging variants worth mentioning: (1) Self-speculative decoding — use the model's own early-exit layers as the draft model, avoiding the need for a separate small model. (2) Medusa — add multiple parallel prediction heads to the model, each predicting 1, 2, 3... tokens ahead. These can be verified in a single tree-attention pass. Medusa is gaining traction because it doesn't require a separate draft model and is easier to deploy."
Also: "The acceptance rate varies dramatically by task. For code generation (highly predictable syntax), acceptance rates can be 90%+. For creative writing (high entropy), acceptance rates drop to 40-50%. Smart implementations adaptively adjust the speculation length K based on recent acceptance rates."
Why This Question Is Asked
Transformers have dominated since 2017, but their quadratic attention cost is a fundamental limitation. Interviewers (especially at research-focused companies) want to know if you're thinking about what comes next.
State Space Models (SSMs) / Mamba
Core idea: Replace attention with a linear recurrence that processes sequences in O(n) time and O(1) memory per step.
Transformers: Every token attends to every other token → O(n^2)
SSMs/Mamba: Each token updates a fixed-size hidden state → O(n)
Mamba's key innovation — Selective State Spaces:
- Traditional SSMs have fixed state transition matrices (can't selectively remember/forget)
- Mamba makes the state transition matrices input-dependent — the model can learn to selectively attend to important tokens and ignore irrelevant ones
- This gives attention-like selectivity with linear complexity
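A fixed-matrix SSM scan in NumPy shows the O(1)-state recurrence (Mamba's selectivity would make A, B, C functions of the input; fixed matrices keep the sketch minimal):

```python
import numpy as np

def ssm_scan(A, B, C, xs):
    """Linear state-space recurrence: h_t = A h_{t-1} + B x_t, y_t = C h_t.

    The state h has a fixed size, so per-step cost and memory are O(1)
    in sequence length — no KV cache that grows with context.
    """
    h = np.zeros(A.shape[0])
    ys = []
    for x in xs:
        h = A @ h + B @ x
        ys.append(C @ h)
    return np.array(ys)

rng = np.random.default_rng(6)
d_state, d_in, d_out, n = 4, 2, 3, 10
A = 0.9 * np.eye(d_state)             # decaying memory of past inputs
B = rng.normal(size=(d_state, d_in))
C = rng.normal(size=(d_out, d_state))
ys = ssm_scan(A, B, C, rng.normal(size=(n, d_in)))
assert ys.shape == (n, d_out)
```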
SSM vs. Transformer Comparison
| Aspect | Transformer | Mamba/SSM |
|---|---|---|
| Training complexity | O(n^2) | O(n) |
| Inference (per token) | O(n) — attends to all history | O(1) — fixed state update |
| Inference memory | O(n) — KV cache grows | O(1) — fixed state size |
| Long-range reasoning | Excellent (direct attention) | Good but weaker (compressed state) |
| Throughput on long seqs | Drops significantly | Stays constant |
The Hybrid Trend
The 2025-2026 frontier is hybrid architectures that combine attention and SSM layers:
- Jamba (AI21): Alternating transformer and Mamba layers
- Griffin (Google): Recurrent layer (SSM) + local attention
- Mamba-2: Improved SSM that can be computed as structured matrix multiplication (hardware-friendly)
The Nuance That Gets You Hired
"The honest assessment: pure SSMs still underperform transformers on tasks requiring precise in-context retrieval — 'find the needle in the haystack.' Attention can directly look up any token in history; SSMs must compress everything into a fixed-size state, so information gets lossy. This is why hybrids are winning — use attention layers for the information retrieval heavy-lifting, and SSM layers for efficient sequence processing in between. My prediction: the 2027-era frontier models will be hybrids, not pure transformers or pure SSMs."
Research-specific follow-up: "RWKV (an RNN-transformer hybrid) is another contender. It reformulates attention as a linear recurrence, giving O(n) training and O(1) inference while maintaining attention-like expressiveness. The competition between Mamba, RWKV, and hybrid approaches is the most active area of architecture research right now."
Quick Reference Card
| Concept | One-Line Summary |
|---|---|
| Self-Attention | Every token attends to every other: O(n^2) but extremely expressive |
| Flash Attention | Same math, 2-4x faster by staying in SRAM, O(n) memory |
| GQA | Share KV heads across query groups, 4-8x KV cache reduction |
| KV Cache | Store computed K,V to avoid recomputation, main inference memory bottleneck |
| FSDP | Shard all params/grads/optimizer across GPUs for distributed training |
| 3D Parallelism | TP within node + PP across nodes + FSDP for replicas |
| RMSNorm | Simplified LayerNorm (no mean subtraction), 10-15% faster |
| MoE | Multiple expert FFNs + router, 10x capacity at 1x compute |
| Speculative Decoding | Small model drafts, large model verifies in one pass, 2-4x speedup |
| Mamba/SSMs | Linear-time sequence modeling, O(1) inference memory, weaker on retrieval |
Frequently Asked Questions
Do I need to implement transformers from scratch for interviews?
At research-focused companies (OpenAI, Google DeepMind, Anthropic), yes — you should be able to implement multi-head attention in PyTorch from basic tensor operations. At application-focused companies, understanding the concepts and trade-offs is sufficient.
How deep should I go on the math?
Know the key equations (attention formula, softmax, normalization). Be able to reason about complexity (O(n^2) for attention, O(n) for SSMs). You don't need to derive backprop or prove convergence.
Are SSMs going to replace transformers?
Not in the near term. Hybrids are more likely. Transformers are too good at in-context learning and retrieval. But SSMs will likely handle the bulk of sequence processing in hybrid architectures, with attention reserved for information-critical layers.
Written by
CallSphere Team
Expert insights on AI voice agents and customer communication automation.