8 LLM & RAG Interview Questions That OpenAI, Anthropic & Google Actually Ask
Real LLM and RAG interview questions from top AI labs in 2026. Covers fine-tuning vs RAG decisions, production RAG pipelines, evaluation, PEFT methods, positional embeddings, and safety guardrails with expert answers.
LLM & RAG: The Technical Core of Every AI Interview in 2026
If you're interviewing for any AI engineering role in 2026, you will be asked about Large Language Models and Retrieval-Augmented Generation. These questions separate candidates who've built production systems from those who've only read tutorials.
These 8 questions come from real interview loops at OpenAI, Anthropic, Google, and top AI startups. Each includes what the interviewer is actually testing, a structured answer framework, and the nuances that top candidates mention.
What They're Really Testing
This is the most frequently asked LLM question in 2026. Interviewers want a decision framework, not a textbook definition. The wrong answer is "it depends" without specifics.
The Decision Framework
| Factor | RAG | Fine-Tuning | Both |
|---|---|---|---|
| Knowledge source | External, frequently changing docs | Static domain knowledge | Changing docs + domain behavior |
| What you're changing | What the model knows | How the model behaves | Both |
| Data requirement | Just documents (no labels) | 100-10K labeled examples | Both |
| Latency | +50-200ms (retrieval step) | No extra latency | +50-200ms |
| Cost | Vector DB + embeddings | Training compute (one-time) | Both |
| Hallucination risk | Lower (grounded in docs) | Higher (no grounding) | Lowest |
When to Use Each
RAG first (80% of enterprise use cases):
- Customer support over company docs
- Legal/compliance Q&A over policies
- Any task where answers must cite sources
- Data changes frequently (weekly or more)
Fine-tuning when:
- You need a specific output format consistently (JSON, SQL, code)
- Domain-specific tone or style (medical, legal, financial writing)
- Task specialization (classification, extraction, structured output)
- Latency is critical and you can't afford the retrieval step
Both for premium use cases:
- Fine-tuned model that's better at reading retrieved context
- Domain-adapted embeddings + domain-adapted generator
- Example: medical Q&A with fine-tuned model + RAG over medical literature
The Nuance That Gets You Hired
Most candidates stop at the table above. Top candidates add: "In practice, I start with RAG because it requires no training data, is easier to debug (you can inspect retrieved chunks), and is easier to update (just re-index documents). I only add fine-tuning when RAG alone doesn't achieve the required output quality or format consistency. This is also the cheapest path — you avoid expensive training compute until you've proven the use case."
Also mention: "The emerging pattern is RAG with a fine-tuned embedding model — you keep the generator general-purpose but fine-tune the retriever on your domain's query-document pairs. This gives you 80% of fine-tuning's quality improvement at 20% of the cost."
What They're Really Testing
Evaluation is the hardest unsolved problem in LLM engineering. They want to see a multi-layered evaluation strategy, not just "we use BLEU score."
Answer Framework: Three Evaluation Layers
Layer 1 — Automated Metrics (Fast, Cheap, Continuous)
- Task-specific metrics: Accuracy for classification, F1 for extraction, exact match for structured output
- LLM-as-Judge: Use a stronger model to evaluate weaker model outputs. Score on dimensions: factual accuracy, relevance, completeness, harmlessness
- Reference-free metrics: Perplexity, semantic similarity between question and answer
- Hallucination detection: NLI model checks if generated claims are entailed by the source context
Layer 2 — Human Evaluation (Gold Standard, Expensive, Periodic)
- Side-by-side comparison: Show evaluators outputs from model A and B, ask which is better
- Likert scale rating: Rate on 1-5 for specific dimensions (helpfulness, accuracy, tone)
- Red-teaming: Dedicated adversarial evaluation — try to break the system
Layer 3 — Production Monitoring (Real User Signal)
- Implicit feedback: Thumbs up/down, regeneration rate, conversation length, task completion rate
- Drift detection: Monitor output distribution changes — if the model suddenly generates 30% longer responses, something changed
- Regression alerts: Compare daily metrics against rolling baselines
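The rolling-baseline comparison above can be sketched in a few lines. This is a minimal illustration, not a production monitoring system; the function name, the 3-sigma threshold, and the response-length example are illustrative choices.

```python
from statistics import mean, stdev

def regression_alert(history, today, window=7, threshold=3.0):
    """Flag today's metric if it deviates more than `threshold`
    standard deviations from the rolling baseline of the last `window` days."""
    baseline = history[-window:]
    mu, sigma = mean(baseline), stdev(baseline)
    if sigma == 0:
        return today != mu
    return abs(today - mu) / sigma > threshold

# Example: average response length per day, then a sudden jump
daily_avg_tokens = [210, 205, 198, 215, 202, 208, 211]
regression_alert(daily_avg_tokens, 212)  # normal day -> False
regression_alert(daily_avg_tokens, 290)  # sudden ~40% jump -> True
```

In practice you would run one such check per metric (length, refusal rate, latency) and page a human only when several fire together, to keep the alert noise down.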
The Evaluation Pipeline
New Model Version
→ Offline Eval (automated benchmarks + LLM-as-Judge)
→ Human Eval (sample of 200-500 examples)
→ Shadow Mode (run alongside production, compare outputs)
→ Canary Deployment (5% traffic)
→ Full Rollout
The Nuance That Gets You Hired
"The biggest pitfall with LLM-as-Judge is position bias — the judge model tends to prefer the first response shown. Always randomize the order and run evaluation twice with swapped positions. Also, LLM judges are sycophantic — they'll rate longer, more verbose answers higher even when concise answers are better. Calibrate by including known-good and known-bad examples."
Also: "In practice, I've found that user behavior signals (regeneration rate, time spent reading) are more predictive of real quality than any automated metric. The best eval system combines all three layers."
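The position-bias mitigation described above (run the judge twice with swapped order, accept a winner only if both orderings agree) can be sketched as follows. `judge` is a hypothetical callable standing in for an LLM-as-judge API call; the interface is an assumption for illustration.

```python
def debiased_verdict(judge, question, answer_a, answer_b):
    """Run the judge twice with answer order swapped; only accept a
    winner if both orderings agree, otherwise call it a tie.
    `judge(question, first, second)` is a stand-in for an LLM call
    that returns "first" or "second"."""
    v1 = judge(question, answer_a, answer_b)   # A shown first
    v2 = judge(question, answer_b, answer_a)   # B shown first
    if v1 == "first" and v2 == "second":
        return "A"
    if v1 == "second" and v2 == "first":
        return "B"
    return "tie"  # inconsistent verdicts = position bias detected

# A judge that always prefers whichever answer appears first:
position_biased = lambda q, first, second: "first"
debiased_verdict(position_biased, "q?", "ans A", "ans B")  # -> "tie"
```

A judge with genuine preferences survives the swap; a purely position-biased one collapses to "tie", which is exactly the failure mode you want surfaced rather than silently counted.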
The Core Comparison
| Aspect | Sparse (BM25) | Dense (Embeddings) |
|---|---|---|
| How it works | Term frequency + inverse doc frequency | Neural embedding similarity |
| Strengths | Exact keyword matching, rare terms, zero-shot | Semantic understanding, paraphrase handling |
| Weaknesses | No semantic understanding, vocabulary mismatch | Misses exact terms, needs training data |
| Latency | ~5ms (inverted index) | ~20-50ms (ANN search) |
| Infrastructure | Elasticsearch/Lucene | Vector DB (Pinecone, Weaviate, pgvector) |
Why Hybrid Is Almost Always Better
Query: "How do I fix error code E4521?"
BM25 Result: Finds doc with exact "E4521" mention (correct)
Dense Result: Finds general docs about "error resolution" (wrong)
Query: "My screen goes black when I plug in the charger"
BM25 Result: No relevant match (no keyword overlap) (miss)
Dense Result: Finds "display issues when connecting power" (correct)
Hybrid approach: Run both, combine with Reciprocal Rank Fusion (RRF):
score(doc) = sum over retrieval methods of 1 / (k + rank of doc in that method's list), typically with k = 60
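The RRF formula is short enough to implement directly. A minimal sketch, assuming each retriever returns a best-first list of document IDs; k = 60 is the constant from the original RRF paper.

```python
def reciprocal_rank_fusion(ranked_lists, k=60):
    """Combine multiple ranked result lists with Reciprocal Rank Fusion.
    Each list is ordered best-first; k dampens the influence of top ranks."""
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["doc_E4521", "doc_errors", "doc_setup"]
dense_hits = ["doc_display", "doc_E4521", "doc_errors"]
reciprocal_rank_fusion([bm25_hits, dense_hits])
# doc_E4521 wins: 1/61 + 1/62 beats any score earned from a single list
```

Note that RRF needs only ranks, not raw scores, which is why it works well for fusing BM25 and dense retrieval: their score scales are incomparable, but their rankings are not.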
The Nuance That Gets You Hired
"Dense retrieval quality depends heavily on the embedding model. General-purpose models (OpenAI text-embedding-3, Cohere embed-v4) work well for common domains, but for specialized domains (legal, medical, code), you often need to fine-tune the embedding model on domain-specific query-document pairs. The cheapest approach is hard negative mining — find documents that BM25 ranks highly but aren't relevant, and use those as negative examples during embedding training."
Core Concepts
PEFT (Parameter-Efficient Fine-Tuning) modifies only a small fraction of model parameters while keeping the base model frozen.
LoRA (Low-Rank Adaptation):
- Injects trainable low-rank matrices into attention layers: W' = W + BA where B is (d x r) and A is (r x d), with r << d
- Typical rank r = 8-64, modifying <1% of parameters
- At inference: Merge BA into W (zero additional latency)
QLoRA:
- LoRA + 4-bit quantized base model
- Reduces memory by ~4x, enabling fine-tuning of 70B models on a single 48GB GPU
- Uses NF4 (Normal Float 4-bit) quantization + double quantization
Decision Framework
| Scenario | Method | Why |
|---|---|---|
| Limited GPU budget | QLoRA | Fine-tune 70B on 1 GPU |
| Need to serve multiple fine-tuned variants | LoRA | Swap adapters at inference, one base model |
| Maximum quality, unlimited compute | Full fine-tune | Updates all parameters, best performance |
| Quick experiments / iteration | LoRA | 10-100x faster than full fine-tune |
| Catastrophic forgetting is a concern | LoRA | Frozen base preserves general knowledge |
The Nuance That Gets You Hired
"The key insight is that LoRA works because the weight updates during fine-tuning have low intrinsic rank — even full fine-tuning only modifies weights along a low-dimensional subspace. LoRA exploits this directly. In practice, I use rank 16-32 for most tasks and only go higher for complex multi-task fine-tuning."
Follow-up they often ask: "What about RLHF-style fine-tuning?" Answer: "DPO (Direct Preference Optimization) has largely replaced PPO-based RLHF in 2025-2026 because it's simpler (no reward model needed), more stable, and often achieves similar quality. GRPO (Group Relative Policy Optimization) is a newer variant, used in DeepSeek-R1, which drops the separate value (critic) model that PPO requires."
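Since DPO comes up so often as a follow-up, it helps to be able to write its loss down. A minimal per-pair sketch of the DPO objective, assuming you already have summed token log-probabilities of each response under the policy and the frozen reference model; the example numbers are illustrative.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair: -log(sigmoid(beta * margin)),
    where the margin is how much more the policy has shifted toward the
    chosen response than toward the rejected one, relative to the reference."""
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# Policy that moved toward the chosen response relative to the reference:
loss_good = dpo_loss(-10.0, -30.0, -20.0, -25.0)
# Policy that moved toward the rejected response:
loss_bad = dpo_loss(-30.0, -10.0, -25.0, -20.0)
assert loss_good < loss_bad
```

The key structural point to mention: the loss depends only on log-probabilities, so no reward model and no sampling during training are needed, which is where the stability advantage over PPO comes from.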
Why This Is Asked
RoPE is the dominant positional encoding in modern LLMs (GPT-4, Claude, LLaMA, Gemini). Understanding it shows you know transformer internals, not just API usage.
The Core Idea
Traditional absolute positional encodings add a fixed vector to each token embedding based on its position. The problem: the model can't easily generalize to sequence lengths it hasn't seen.
RoPE encodes position by rotating query and key vectors in 2D subspaces. For position m, it applies a rotation of angle m*theta to each pair of dimensions:
RoPE(x, m) = [x1*cos(m*θ1) - x2*sin(m*θ1),
              x1*sin(m*θ1) + x2*cos(m*θ1),
              x3*cos(m*θ2) - x4*sin(m*θ2),
              ...]
Why It's Better
- Relative position: The dot product between RoPE-encoded q and k depends only on their relative distance (m-n), not absolute positions
- Extrapolation: With tricks like NTK-aware scaling or YaRN, RoPE models can handle sequences much longer than training length
- Decay property: Attention naturally decays with distance (tokens far apart attend less), which matches linguistic intuition
The Nuance That Gets You Hired
"The key breakthrough for long-context models is theta scaling. The original RoPE uses theta=10000. By increasing theta (e.g., to 500000 in LLaMA 3.1), you reduce the rotation speed per position, allowing the model to handle much longer sequences. Combined with continued pre-training on long documents, this is how models went from 4K to 128K+ context windows. YaRN further improves this by applying different scaling factors to different frequency bands — high-frequency dimensions need less scaling because they already encode fine-grained local patterns."
The Three Architectures
| Architecture | Example Models | Use Case |
|---|---|---|
| Encoder-only | BERT, RoBERTa | Classification, NER, sentence embeddings |
| Decoder-only | GPT-4, Claude, LLaMA | Text generation, chat, code, reasoning |
| Encoder-decoder | T5, BART | Translation, summarization |
Why Decoder-Only Won
- Simplicity: One architecture, one training objective (next-token prediction), scales predictably
- Emergent abilities: Scaling decoder-only models unlocked reasoning, coding, and instruction following — capabilities that didn't emerge in encoder-only models
- Unification: Decoder-only handles ALL tasks — classification (generate "yes/no"), extraction (generate the extracted text), translation (generate in target language). No need for task-specific architectures.
- Training efficiency: Causal language modeling uses every token as a training example. Masked language modeling (BERT-style) only trains on 15% of tokens.
When Encoder-Only Still Wins
- Embedding/retrieval: BERT-style models produce better sentence embeddings for search because they attend bidirectionally
- Classification at scale: When you need to classify millions of documents per second, a small BERT model (110M params) is 100x cheaper than prompting a GPT-4 class model
- Token-level tasks: NER, POS tagging where you need a label for each token
The Nuance That Gets You Hired
"The interesting nuance is that decoder-only models can be adapted for bidirectional understanding by fine-tuning them as embedding models (e.g., GritLM, SFR-Embedding). These 'decoder-as-encoder' models are increasingly competitive with BERT-style models for retrieval while also being usable for generation. We might see encoder-only models fully deprecated in 2-3 years."
The Problem
Context windows are finite (even 200K tokens fill up). A customer support conversation might go 50+ turns with tool calls, retrieved documents, and system prompts. How do you manage this?
Answer Framework
1. Context Window Budget Allocation
Total Context: 128K tokens
├── System Prompt: 2K (fixed)
├── Tool Definitions: 3K (fixed)
├── Retrieved Context: 8K (per-turn, refreshed)
├── Conversation History: 100K (managed)
└── Generation Budget: 15K (reserved for output)
2. History Management Strategies
- Sliding window: Keep last N turns. Simple, but loses early context.
- Summarization: Periodically summarize older turns into a compressed representation. Keep summary + recent turns.
- Hierarchical memory:
- Hot: Last 5 turns (verbatim)
- Warm: Turns 6-20 (summarized)
- Cold: Earlier (stored in vector DB, retrieved on demand)
3. Token Counting
- Count tokens BEFORE sending to the model (use tiktoken or model-specific tokenizer)
- Maintain a running token count; trigger compression when approaching 80% of context window
- Always reserve enough tokens for the expected output length
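The three rules above can be combined into one history manager. A minimal sketch: the whitespace tokenizer is a stand-in (in production you'd count with tiktoken or the model's own tokenizer), and the function name and numbers are illustrative.

```python
def manage_history(turns, max_tokens, reserve_output=100, compress_at=0.8):
    """Keep the most recent turns that fit within compress_at of the
    budget left after reserving output tokens; older turns are returned
    as a count so the caller can summarize them instead of discarding."""
    count_tokens = lambda text: len(text.split())  # stand-in for a real tokenizer
    budget = int((max_tokens - reserve_output) * compress_at)
    kept, used = [], 0
    for turn in reversed(turns):          # walk newest-first
        cost = count_tokens(turn)
        if used + cost > budget:
            break
        kept.append(turn)
        used += cost
    dropped = len(turns) - len(kept)
    return list(reversed(kept)), dropped  # dropped turns -> summarize, don't delete

turns = ["user: my screen goes black " * 50,  # one very long early turn
         "bot: try this",
         "user: still broken"]
kept, to_summarize = manage_history(turns, max_tokens=400)
# keeps the 2 recent short turns; flags the long early turn for summarization
```

Note the return contract: the caller gets the number of evicted turns rather than silently losing them, which is what makes the summarization and pinning strategies above implementable on top.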
The Nuance That Gets You Hired
"The critical insight is that not all history is equal. In a support conversation, the customer's initial problem description and any error codes are high-value context that should never be summarized away, even if they're 30 turns old. I'd implement a pinning mechanism — certain messages are marked as high-value and always kept verbatim, while lower-value turns (confirmations, pleasantries) are summarized first."
Also: "With models supporting 1M+ tokens (Gemini, Claude), token budget management is less about fitting in the window and more about cost and latency optimization. Sending 500K tokens per request is technically possible but costs 50x more than sending 10K. Smart context management is a cost optimization tool, not just a technical constraint."
What They're Really Testing
At Anthropic, safety isn't a nice-to-have — it's the core mission. At every company, safety failures mean PR disasters and lawsuits. They want a multi-layered defense strategy, not just "we use a content filter."
The Multi-Layer Defense Stack
User Input
→ Layer 1: Input Validation (PII detection, injection detection)
→ Layer 2: Input Classification (toxicity, off-topic, jailbreak attempt)
→ Layer 3: LLM Generation (with system prompt guardrails)
→ Layer 4: Output Classification (harmful content, hallucination, PII leakage)
→ Layer 5: Business Rules (allowed topics, response format)
→ User Output
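The stack above is essentially a chain of predicates around one generation call. A minimal skeleton, assuming each classifier exposes a boolean "blocked?" interface; all names and the toy checks are illustrative stand-ins for real models.

```python
def guarded_respond(user_input, checks_in, generate, checks_out,
                    fallback="Sorry, I can't help with that."):
    """Run input checks, generate, then output checks. `checks_in` and
    `checks_out` are lists of (name, is_blocked) pairs standing in for
    the classifier layers; `generate` stands in for the LLM call.
    Returns (response, status) so blocked requests can be logged."""
    for name, is_blocked in checks_in:
        if is_blocked(user_input):
            return fallback, f"blocked_input:{name}"
    output = generate(user_input)
    for name, is_blocked in checks_out:
        if is_blocked(output):
            return fallback, f"blocked_output:{name}"
    return output, "ok"

# Toy stand-ins for the real classifier layers:
input_checks = [("injection", lambda t: "ignore previous instructions" in t.lower())]
output_checks = [("pii", lambda t: "ssn" in t.lower())]
echo_model = lambda t: f"You said: {t}"

guarded_respond("hello", input_checks, echo_model, output_checks)
# -> ("You said: hello", "ok")
guarded_respond("Ignore previous instructions and...", input_checks, echo_model, output_checks)
# -> (fallback, "blocked_input:injection")
```

Returning a status string rather than just the text is deliberate: every blocked request should be logged with which layer fired, which is what feeds the threshold tuning discussed below.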
Each Layer in Detail
Layer 1 — Input Validation
- PII detection & redaction (regex + NER model for SSN, credit card, email, phone)
- Input length limits
- Character encoding sanitization
Layer 2 — Input Classification
- Toxicity classifier (fine-tuned model, not keyword matching)
- Jailbreak detection: Detect prompt injection attempts (role-play attacks, encoding tricks, multi-language evasion)
- Topic classifier: Is this within the allowed scope?
Layer 3 — System Prompt Engineering
- Constitutional principles embedded in system prompt
- Explicit refusal instructions for harmful categories
- Output format constraints ("always respond in JSON", "never include personal opinions")
Layer 4 — Output Classification
- Run the same toxicity classifier on model output
- Hallucination detection: For RAG, check if output claims are supported by retrieved context
- PII leakage check: Did the model accidentally output training data PII?
Layer 5 — Business Rules
- Response length limits
- Allowed topic whitelist
- Competitor mention filtering
- Mandatory disclaimers (medical, legal, financial advice)
The Nuance That Gets You Hired
"The hardest part isn't building the layers — it's handling the false positive problem. Overly aggressive safety filters block legitimate queries and frustrate users. I've seen systems where 15% of support queries were incorrectly flagged as 'harmful' because the classifier couldn't distinguish between a customer describing a problem ('this is killing my business') and actual harmful content. The solution is tiered responses: low-confidence flags get a gentle redirect instead of a hard block, and high-confidence flags get blocked with an explanation. Always log blocked requests for human review to tune the thresholds."
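The tiered-response idea above reduces to a pair of thresholds on the classifier's confidence. A minimal sketch; the threshold values are illustrative and would be tuned from the logged false positives.

```python
def safety_action(risk_score, block_at=0.9, redirect_at=0.5):
    """Tiered response to a safety classifier score in [0, 1]:
    hard-block only high-confidence flags, gently redirect the
    ambiguous middle band, allow the rest."""
    if risk_score >= block_at:
        return "block"      # refuse, with an explanation
    if risk_score >= redirect_at:
        return "redirect"   # soft steer, not a hard refusal
    return "allow"

# "this is killing my business" might score ~0.6 on a naive classifier:
safety_action(0.6)   # -> "redirect", not a hard block
safety_action(0.95)  # -> "block"
safety_action(0.1)   # -> "allow"
```

The two thresholds give you two independent knobs: lowering `redirect_at` catches more edge cases without frustrating users, while `block_at` can stay high because hard blocks are the costly failure mode.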
At Anthropic specifically: "I'd reference Constitutional AI — the model should be trained to follow a set of principles (be helpful, be harmless, be honest) and use self-critique during generation to check its own outputs against these principles, rather than relying solely on external classifiers."
Quick Reference: LLM Interview Cheat Sheet
| Concept | One-Sentence Summary |
|---|---|
| RAG | Retrieve relevant docs, inject into prompt, generate grounded answer |
| LoRA | Low-rank weight updates (1% of params) that merge at inference for zero overhead |
| QLoRA | LoRA + 4-bit quantized base = fine-tune 70B on one GPU |
| RoPE | Rotary position encoding — relative position through rotation, extrapolates to longer sequences |
| DPO | Direct preference optimization — simpler than RLHF, no reward model needed |
| GQA | Grouped-query attention — share KV heads to reduce cache size and speed up inference |
| Continuous Batching | Dynamically add/remove requests from a batch during generation for max GPU utilization |
| Speculative Decoding | Small model drafts tokens, large model verifies in parallel — 2-3x speedup |
Frequently Asked Questions
Which LLM questions are most commonly asked?
RAG vs. fine-tuning is asked in nearly every AI interview. Evaluation and safety guardrails are the second most common. Positional encodings and architecture choices are more common at research-heavy companies (OpenAI, Anthropic, Google DeepMind).
Do I need to know the math behind transformers?
For AI engineering roles: understand the concepts and be able to explain attention, positional encoding, and training objectives intuitively. For research roles: yes, you should be comfortable with the full mathematical formulation.
How do I demonstrate production experience with LLMs?
Talk about evaluation (how you measured quality), cost optimization (how you reduced inference costs), and failure modes (what went wrong and how you fixed it). These signal real-world experience more than knowing the latest paper.
Written by
CallSphere Team
Expert insights on AI voice agents and customer communication automation.