8 LLM & RAG Interview Questions That OpenAI, Anthropic & Google Actually Ask
Real LLM and RAG interview questions from top AI labs in 2026. Covers fine-tuning vs RAG decisions, production RAG pipelines, evaluation, PEFT methods, positional embeddings, and safety guardrails with expert answers.
LLM & RAG: The Technical Core of Every AI Interview in 2026
If you're interviewing for any AI engineering role in 2026, you will be asked about Large Language Models and Retrieval-Augmented Generation. These questions separate candidates who've built production systems from those who've only read tutorials.
These 8 questions come from real interview loops at OpenAI, Anthropic, Google, and top AI startups. Each includes what the interviewer is actually testing, a structured answer framework, and the nuances that top candidates mention.
What They're Really Testing
This is the most frequently asked LLM question in 2026. Interviewers want a decision framework, not a textbook definition. The wrong answer is "it depends" without specifics.
The Decision Framework
| Factor | RAG | Fine-Tuning | Both |
|---|---|---|---|
| Knowledge source | External, frequently changing docs | Static domain knowledge | Changing docs + domain behavior |
| What you're changing | What the model knows | How the model behaves | Both |
| Data requirement | Just documents (no labels) | 100-10K labeled examples | Both |
| Latency | +50-200ms (retrieval step) | No extra latency | +50-200ms |
| Cost | Vector DB + embeddings | Training compute (one-time) | Both |
| Hallucination risk | Lower (grounded in docs) | Higher (no grounding) | Lowest |
When to Use Each
RAG first (80% of enterprise use cases):
- Customer support over company docs
- Legal/compliance Q&A over policies
- Any task where answers must cite sources
- Data changes frequently (weekly or more)
Fine-tuning when:
- You need a specific output format consistently (JSON, SQL, code)
- Domain-specific tone or style (medical, legal, financial writing)
- Task specialization (classification, extraction, structured output)
- Latency is critical and you can't afford the retrieval step
Both for premium use cases:
- Fine-tuned model that's better at reading retrieved context
- Domain-adapted embeddings + domain-adapted generator
- Example: medical Q&A with fine-tuned model + RAG over medical literature
The Nuance That Gets You Hired
Most candidates stop at the table above. Top candidates add: "In practice, I start with RAG because it requires no training data, is easier to debug (you can inspect retrieved chunks), and is easier to update (just re-index documents). I only add fine-tuning when RAG alone doesn't achieve the required output quality or format consistency. This is also the cheapest path — you avoid expensive training compute until you've proven the use case."
Also mention: "The emerging pattern is RAG with a fine-tuned embedding model — you keep the generator general-purpose but fine-tune the retriever on your domain's query-document pairs. This gives you 80% of fine-tuning's quality improvement at 20% of the cost."
What They're Really Testing
Evaluation is the hardest unsolved problem in LLM engineering. They want to see a multi-layered evaluation strategy, not just "we use BLEU score."
Answer Framework: Three Evaluation Layers
Layer 1 — Automated Metrics (Fast, Cheap, Continuous)
- Task-specific metrics: Accuracy for classification, F1 for extraction, exact match for structured output
- LLM-as-Judge: Use a stronger model to evaluate weaker model outputs. Score on dimensions: factual accuracy, relevance, completeness, harmlessness
- Reference-free metrics: Perplexity, semantic similarity between question and answer
- Hallucination detection: NLI model checks if generated claims are entailed by the source context
Layer 2 — Human Evaluation (Gold Standard, Expensive, Periodic)
- Side-by-side comparison: Show evaluators outputs from model A and B, ask which is better
- Likert scale rating: Rate on 1-5 for specific dimensions (helpfulness, accuracy, tone)
- Red-teaming: Dedicated adversarial evaluation — try to break the system
Layer 3 — Production Monitoring (Real User Signal)
- Implicit feedback: Thumbs up/down, regeneration rate, conversation length, task completion rate
- Drift detection: Monitor output distribution changes — if the model suddenly generates 30% longer responses, something changed
- Regression alerts: Compare daily metrics against rolling baselines
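The rolling-baseline comparison above can be sketched in a few lines. This is a minimal illustration, not a production monitoring system; the function name, the 3-sigma threshold, and the response-length example are illustrative choices.

```python
from statistics import mean, stdev

def regression_alert(history, today, window=7, threshold=3.0):
    """Flag today's metric if it deviates more than `threshold`
    standard deviations from the rolling baseline of the last `window` days."""
    baseline = history[-window:]
    mu, sigma = mean(baseline), stdev(baseline)
    if sigma == 0:
        return today != mu
    return abs(today - mu) / sigma > threshold

# Example: average response length per day, then a sudden jump
daily_avg_tokens = [210, 205, 198, 215, 202, 208, 211]
regression_alert(daily_avg_tokens, 212)  # normal day -> False
regression_alert(daily_avg_tokens, 290)  # sudden ~40% jump -> True
```

In practice you would run one such check per metric (length, refusal rate, latency) and page a human only when several fire together, to keep the alert noise down.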
The Evaluation Pipeline
New Model Version
→ Offline Eval (automated benchmarks + LLM-as-Judge)
→ Human Eval (sample of 200-500 examples)
→ Shadow Mode (run alongside production, compare outputs)
→ Canary Deployment (5% traffic)
→ Full Rollout
The Nuance That Gets You Hired
"The biggest pitfall with LLM-as-Judge is position bias — the judge model tends to prefer the first response shown. Always randomize the order and run evaluation twice with swapped positions. Also, LLM judges are sycophantic — they'll rate longer, more verbose answers higher even when concise answers are better. Calibrate by including known-good and known-bad examples."
Also: "In practice, I've found that user behavior signals (regeneration rate, time spent reading) are more predictive of real quality than any automated metric. The best eval system combines all three layers."
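The position-bias mitigation described above (run the judge twice with swapped order, accept a winner only if both orderings agree) can be sketched as follows. `judge` is a hypothetical callable standing in for an LLM-as-judge API call; the interface is an assumption for illustration.

```python
def debiased_verdict(judge, question, answer_a, answer_b):
    """Run the judge twice with answer order swapped; only accept a
    winner if both orderings agree, otherwise call it a tie.
    `judge(question, first, second)` is a stand-in for an LLM call
    that returns "first" or "second"."""
    v1 = judge(question, answer_a, answer_b)   # A shown first
    v2 = judge(question, answer_b, answer_a)   # B shown first
    if v1 == "first" and v2 == "second":
        return "A"
    if v1 == "second" and v2 == "first":
        return "B"
    return "tie"  # inconsistent verdicts = position bias detected

# A judge that always prefers whichever answer appears first:
position_biased = lambda q, first, second: "first"
debiased_verdict(position_biased, "q?", "ans A", "ans B")  # -> "tie"
```

A judge with genuine preferences survives the swap; a purely position-biased one collapses to "tie", which is exactly the failure mode you want surfaced rather than silently counted.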
The Core Comparison
| Aspect | Sparse (BM25) | Dense (Embeddings) |
|---|---|---|
| How it works | Term frequency + inverse doc frequency | Neural embedding similarity |
| Strengths | Exact keyword matching, rare terms, zero-shot | Semantic understanding, paraphrase handling |
| Weaknesses | No semantic understanding, vocabulary mismatch | Misses exact terms, needs training data |
| Latency | ~5ms (inverted index) | ~20-50ms (ANN search) |
| Infrastructure | Elasticsearch/Lucene | Vector DB (Pinecone, Weaviate, pgvector) |
Why Hybrid Is Almost Always Better
Query: "How do I fix error code E4521?"
BM25 Result: Finds doc with exact "E4521" mention (correct)
Dense Result: Finds general docs about "error resolution" (wrong)
Query: "My screen goes black when I plug in the charger"
BM25 Result: No relevant match (no keyword overlap) (miss)
Dense Result: Finds "display issues when connecting power" (correct)
Hybrid approach: Run both, combine with Reciprocal Rank Fusion (RRF):
score(doc) = sum over retrieval methods of 1 / (k + rank of doc in that method's list), typically with k = 60
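The RRF formula is short enough to implement directly. A minimal sketch, assuming each retriever returns a best-first list of document IDs; k = 60 is the constant from the original RRF paper.

```python
def reciprocal_rank_fusion(ranked_lists, k=60):
    """Combine multiple ranked result lists with Reciprocal Rank Fusion.
    Each list is ordered best-first; k dampens the influence of top ranks."""
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["doc_E4521", "doc_errors", "doc_setup"]
dense_hits = ["doc_display", "doc_E4521", "doc_errors"]
reciprocal_rank_fusion([bm25_hits, dense_hits])
# doc_E4521 wins: 1/61 + 1/62 beats any score earned from a single list
```

Note that RRF needs only ranks, not raw scores, which is why it works well for fusing BM25 and dense retrieval: their score scales are incomparable, but their rankings are not.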
The Nuance That Gets You Hired
"Dense retrieval quality depends heavily on the embedding model. General-purpose models (OpenAI text-embedding-3, Cohere embed-v4) work well for common domains, but for specialized domains (legal, medical, code), you often need to fine-tune the embedding model on domain-specific query-document pairs. The cheapest approach is hard negative mining — find documents that BM25 ranks highly but aren't relevant, and use those as negative examples during embedding training."
Core Concepts
PEFT (Parameter-Efficient Fine-Tuning) modifies only a small fraction of model parameters while keeping the base model frozen.
LoRA (Low-Rank Adaptation):
- Injects trainable low-rank matrices into attention layers: W' = W + BA where B is (d x r) and A is (r x d), with r << d
- Typical rank r = 8-64, modifying <1% of parameters
- At inference: Merge BA into W (zero additional latency)
QLoRA:
- LoRA + 4-bit quantized base model
- Reduces memory by ~4x, enabling fine-tuning of 70B models on a single 48GB GPU
- Uses NF4 (Normal Float 4-bit) quantization + double quantization
Decision Framework
| Scenario | Method | Why |
|---|---|---|
| Limited GPU budget | QLoRA | Fine-tune 70B on 1 GPU |
| Need to serve multiple fine-tuned variants | LoRA | Swap adapters at inference, one base model |
| Maximum quality, unlimited compute | Full fine-tune | Updates all parameters, best performance |
| Quick experiments / iteration | LoRA | 10-100x faster than full fine-tune |
| Catastrophic forgetting is a concern | LoRA | Frozen base preserves general knowledge |
The Nuance That Gets You Hired
"The key insight is that LoRA works because the weight updates during fine-tuning have low intrinsic rank — even full fine-tuning only modifies weights along a low-dimensional subspace. LoRA exploits this directly. In practice, I use rank 16-32 for most tasks and only go higher for complex multi-task fine-tuning."
Follow-up they often ask: "What about RLHF-style fine-tuning?" Answer: "DPO (Direct Preference Optimization) has largely replaced PPO-based RLHF in 2025-2026 because it's simpler (no reward model needed), more stable, and often achieves similar quality. GRPO (Group Relative Policy Optimization) is a newer variant, used in DeepSeek-R1, which drops the separate value (critic) model that PPO requires."
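Since DPO comes up so often as a follow-up, it helps to be able to write its loss down. A minimal per-pair sketch of the DPO objective, assuming you already have summed token log-probabilities of each response under the policy and the frozen reference model; the example numbers are illustrative.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair: -log(sigmoid(beta * margin)),
    where the margin is how much more the policy has shifted toward the
    chosen response than toward the rejected one, relative to the reference."""
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# Policy that moved toward the chosen response relative to the reference:
loss_good = dpo_loss(-10.0, -30.0, -20.0, -25.0)
# Policy that moved toward the rejected response:
loss_bad = dpo_loss(-30.0, -10.0, -25.0, -20.0)
assert loss_good < loss_bad
```

The key structural point to mention: the loss depends only on log-probabilities, so no reward model and no sampling during training are needed, which is where the stability advantage over PPO comes from.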
Why This Is Asked
RoPE is the dominant positional encoding in modern LLMs (GPT-4, Claude, LLaMA, Gemini). Understanding it shows you know transformer internals, not just API usage.
The Core Idea
Traditional absolute positional encodings add a fixed vector to each token embedding based on its position. The problem: the model can't easily generalize to sequence lengths it hasn't seen.
RoPE encodes position by rotating query and key vectors in 2D subspaces. For position m, it applies a rotation of angle m*theta to each pair of dimensions:
RoPE(x, m) = [x1*cos(m*θ1) - x2*sin(m*θ1),
              x1*sin(m*θ1) + x2*cos(m*θ1),
              x3*cos(m*θ2) - x4*sin(m*θ2),
              ...]
Why It's Better
- Relative position: The dot product between RoPE-encoded q and k depends only on their relative distance (m-n), not absolute positions
- Extrapolation: With tricks like NTK-aware scaling or YaRN, RoPE models can handle sequences much longer than training length
- Decay property: Attention naturally decays with distance (tokens far apart attend less), which matches linguistic intuition
The Nuance That Gets You Hired
"The key breakthrough for long-context models is theta scaling. The original RoPE uses theta=10000. By increasing theta (e.g., to 500000 in LLaMA 3.1), you reduce the rotation speed per position, allowing the model to handle much longer sequences. Combined with continued pre-training on long documents, this is how models went from 4K to 128K+ context windows. YaRN further improves this by applying different scaling factors to different frequency bands — high-frequency dimensions need less scaling because they already encode fine-grained local patterns."
The Three Architectures
| Architecture | Example Models | Use Case |
|---|---|---|
| Encoder-only | BERT, RoBERTa | Classification, NER, sentence embeddings |
| Decoder-only | GPT-4, Claude, LLaMA | Text generation, chat, code, reasoning |
| Encoder-decoder | T5, BART | Translation, summarization |
Why Decoder-Only Won
- Simplicity: One architecture, one training objective (next-token prediction), scales predictably
- Emergent abilities: Scaling decoder-only models unlocked reasoning, coding, and instruction following — capabilities that didn't emerge in encoder-only models
- Unification: Decoder-only handles ALL tasks — classification (generate "yes/no"), extraction (generate the extracted text), translation (generate in target language). No need for task-specific architectures.
- Training efficiency: Causal language modeling uses every token as a training example. Masked language modeling (BERT-style) only trains on 15% of tokens.
When Encoder-Only Still Wins
- Embedding/retrieval: BERT-style models produce better sentence embeddings for search because they attend bidirectionally
- Classification at scale: When you need to classify millions of documents per second, a small BERT model (110M params) is 100x cheaper than prompting a GPT-4 class model
- Token-level tasks: NER, POS tagging where you need a label for each token
The Nuance That Gets You Hired
"The interesting nuance is that decoder-only models can be adapted for bidirectional understanding by fine-tuning them as embedding models (e.g., GritLM, SFR-Embedding). These 'decoder-as-encoder' models are increasingly competitive with BERT-style models for retrieval while also being usable for generation. We might see encoder-only models fully deprecated in 2-3 years."
The Problem
Context windows are finite (even 200K tokens fill up). A customer support conversation might go 50+ turns with tool calls, retrieved documents, and system prompts. How do you manage this?
Answer Framework
1. Context Window Budget Allocation
Total Context: 128K tokens
├── System Prompt: 2K (fixed)
├── Tool Definitions: 3K (fixed)
├── Retrieved Context: 8K (per-turn, refreshed)
├── Conversation History: 100K (managed)
└── Generation Budget: 15K (reserved for output)
2. History Management Strategies
- Sliding window: Keep last N turns. Simple, but loses early context.
- Summarization: Periodically summarize older turns into a compressed representation. Keep summary + recent turns.
- Hierarchical memory:
- Hot: Last 5 turns (verbatim)
- Warm: Turns 6-20 (summarized)
- Cold: Earlier (stored in vector DB, retrieved on demand)
3. Token Counting
- Count tokens BEFORE sending to the model (use tiktoken or model-specific tokenizer)
- Maintain a running token count; trigger compression when approaching 80% of context window
- Always reserve enough tokens for the expected output length
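The three rules above can be combined into one history manager. A minimal sketch: the whitespace tokenizer is a stand-in (in production you'd count with tiktoken or the model's own tokenizer), and the function name and numbers are illustrative.

```python
def manage_history(turns, max_tokens, reserve_output=100, compress_at=0.8):
    """Keep the most recent turns that fit within compress_at of the
    budget left after reserving output tokens; older turns are returned
    as a count so the caller can summarize them instead of discarding."""
    count_tokens = lambda text: len(text.split())  # stand-in for a real tokenizer
    budget = int((max_tokens - reserve_output) * compress_at)
    kept, used = [], 0
    for turn in reversed(turns):          # walk newest-first
        cost = count_tokens(turn)
        if used + cost > budget:
            break
        kept.append(turn)
        used += cost
    dropped = len(turns) - len(kept)
    return list(reversed(kept)), dropped  # dropped turns -> summarize, don't delete

turns = ["user: my screen goes black " * 50,  # one very long early turn
         "bot: try this",
         "user: still broken"]
kept, to_summarize = manage_history(turns, max_tokens=400)
# keeps the 2 recent short turns; flags the long early turn for summarization
```

Note the return contract: the caller gets the number of evicted turns rather than silently losing them, which is what makes the summarization and pinning strategies above implementable on top.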
The Nuance That Gets You Hired
"The critical insight is that not all history is equal. In a support conversation, the customer's initial problem description and any error codes are high-value context that should never be summarized away, even if they're 30 turns old. I'd implement a pinning mechanism — certain messages are marked as high-value and always kept verbatim, while lower-value turns (confirmations, pleasantries) are summarized first."
Also: "With models supporting 1M+ tokens (Gemini, Claude), token budget management is less about fitting in the window and more about cost and latency optimization. Sending 500K tokens per request is technically possible but costs 50x more than sending 10K. Smart context management is a cost optimization tool, not just a technical constraint."
What They're Really Testing
At Anthropic, safety isn't a nice-to-have — it's the core mission. At every company, safety failures mean PR disasters and lawsuits. They want a multi-layered defense strategy, not just "we use a content filter."
The Multi-Layer Defense Stack
User Input
→ Layer 1: Input Validation (PII detection, injection detection)
→ Layer 2: Input Classification (toxicity, off-topic, jailbreak attempt)
→ Layer 3: LLM Generation (with system prompt guardrails)
→ Layer 4: Output Classification (harmful content, hallucination, PII leakage)
→ Layer 5: Business Rules (allowed topics, response format)
→ User Output
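The stack above is essentially a chain of predicates around one generation call. A minimal skeleton, assuming each classifier exposes a boolean "blocked?" interface; all names and the toy checks are illustrative stand-ins for real models.

```python
def guarded_respond(user_input, checks_in, generate, checks_out,
                    fallback="Sorry, I can't help with that."):
    """Run input checks, generate, then output checks. `checks_in` and
    `checks_out` are lists of (name, is_blocked) pairs standing in for
    the classifier layers; `generate` stands in for the LLM call.
    Returns (response, status) so blocked requests can be logged."""
    for name, is_blocked in checks_in:
        if is_blocked(user_input):
            return fallback, f"blocked_input:{name}"
    output = generate(user_input)
    for name, is_blocked in checks_out:
        if is_blocked(output):
            return fallback, f"blocked_output:{name}"
    return output, "ok"

# Toy stand-ins for the real classifier layers:
input_checks = [("injection", lambda t: "ignore previous instructions" in t.lower())]
output_checks = [("pii", lambda t: "ssn" in t.lower())]
echo_model = lambda t: f"You said: {t}"

guarded_respond("hello", input_checks, echo_model, output_checks)
# -> ("You said: hello", "ok")
guarded_respond("Ignore previous instructions and...", input_checks, echo_model, output_checks)
# -> (fallback, "blocked_input:injection")
```

Returning a status string rather than just the text is deliberate: every blocked request should be logged with which layer fired, which is what feeds the threshold tuning discussed below.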
Each Layer in Detail
Layer 1 — Input Validation
- PII detection & redaction (regex + NER model for SSN, credit card, email, phone)
- Input length limits
- Character encoding sanitization
Layer 2 — Input Classification
- Toxicity classifier (fine-tuned model, not keyword matching)
- Jailbreak detection: Detect prompt injection attempts (role-play attacks, encoding tricks, multi-language evasion)
- Topic classifier: Is this within the allowed scope?
Layer 3 — System Prompt Engineering
- Constitutional principles embedded in system prompt
- Explicit refusal instructions for harmful categories
- Output format constraints ("always respond in JSON", "never include personal opinions")
Layer 4 — Output Classification
- Run the same toxicity classifier on model output
- Hallucination detection: For RAG, check if output claims are supported by retrieved context
- PII leakage check: Did the model accidentally output training data PII?
Layer 5 — Business Rules
- Response length limits
- Allowed topic whitelist
- Competitor mention filtering
- Mandatory disclaimers (medical, legal, financial advice)
The Nuance That Gets You Hired
"The hardest part isn't building the layers — it's handling the false positive problem. Overly aggressive safety filters block legitimate queries and frustrate users. I've seen systems where 15% of support queries were incorrectly flagged as 'harmful' because the classifier couldn't distinguish between a customer describing a problem ('this is killing my business') and actual harmful content. The solution is tiered responses: low-confidence flags get a gentle redirect instead of a hard block, and high-confidence flags get blocked with an explanation. Always log blocked requests for human review to tune the thresholds."
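The tiered-response idea above reduces to a pair of thresholds on the classifier's confidence. A minimal sketch; the threshold values are illustrative and would be tuned from the logged false positives.

```python
def safety_action(risk_score, block_at=0.9, redirect_at=0.5):
    """Tiered response to a safety classifier score in [0, 1]:
    hard-block only high-confidence flags, gently redirect the
    ambiguous middle band, allow the rest."""
    if risk_score >= block_at:
        return "block"      # refuse, with an explanation
    if risk_score >= redirect_at:
        return "redirect"   # soft steer, not a hard refusal
    return "allow"

# "this is killing my business" might score ~0.6 on a naive classifier:
safety_action(0.6)   # -> "redirect", not a hard block
safety_action(0.95)  # -> "block"
safety_action(0.1)   # -> "allow"
```

The two thresholds give you two independent knobs: lowering `redirect_at` catches more edge cases without frustrating users, while `block_at` can stay high because hard blocks are the costly failure mode.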
At Anthropic specifically: "I'd reference Constitutional AI — the model should be trained to follow a set of principles (be helpful, be harmless, be honest) and use self-critique during generation to check its own outputs against these principles, rather than relying solely on external classifiers."
Quick Reference: LLM Interview Cheat Sheet
| Concept | One-Sentence Summary |
|---|---|
| RAG | Retrieve relevant docs, inject into prompt, generate grounded answer |
| LoRA | Low-rank weight updates (1% of params) that merge at inference for zero overhead |
| QLoRA | LoRA + 4-bit quantized base = fine-tune 70B on one GPU |
| RoPE | Rotary position encoding — relative position through rotation, extrapolates to longer sequences |
| DPO | Direct preference optimization — simpler than RLHF, no reward model needed |
| GQA | Grouped-query attention — share KV heads to reduce cache size and speed up inference |
| Continuous Batching | Dynamically add/remove requests from a batch during generation for max GPU utilization |
| Speculative Decoding | Small model drafts tokens, large model verifies in parallel — 2-3x speedup |
Frequently Asked Questions
Which LLM questions are most commonly asked?
RAG vs. fine-tuning is asked in nearly every AI interview. Evaluation and safety guardrails are the second most common. Positional encodings and architecture choices are more common at research-heavy companies (OpenAI, Anthropic, Google DeepMind).
Do I need to know the math behind transformers?
For AI engineering roles: understand the concepts and be able to explain attention, positional encoding, and training objectives intuitively. For research roles: yes, you should be comfortable with the full mathematical formulation.
How do I demonstrate production experience with LLMs?
Talk about evaluation (how you measured quality), cost optimization (how you reduced inference costs), and failure modes (what went wrong and how you fixed it). These signal real-world experience more than knowing the latest paper.
Written by
CallSphere Team
Expert insights on AI voice agents and customer communication automation.