AI Interview Prep

8 LLM & RAG Interview Questions That OpenAI, Anthropic & Google Actually Ask

Real LLM and RAG interview questions from top AI labs in 2026. Covers fine-tuning vs RAG decisions, production RAG pipelines, evaluation, PEFT methods, positional embeddings, and safety guardrails with expert answers.

LLM & RAG: The Technical Core of Every AI Interview in 2026

If you're interviewing for any AI engineering role in 2026, you will be asked about Large Language Models and Retrieval-Augmented Generation. These questions separate candidates who've built production systems from those who've only read tutorials.

These 8 questions come from real interview loops at OpenAI, Anthropic, Google, and top AI startups. Each includes what the interviewer is actually testing, a structured answer framework, and the nuances that top candidates mention.


HARD Anthropic OpenAI Google
Q1: When Would You Use RAG vs. Fine-Tuning vs. Both?

What They're Really Testing

This is the most frequently asked LLM question in 2026. They want a decision framework, not a textbook definition. The wrong answer is "it depends" without specifics.

The Decision Framework

| Factor | RAG | Fine-Tuning | Both |
| --- | --- | --- | --- |
| Knowledge source | External, frequently changing docs | Static domain knowledge | Changing docs + domain behavior |
| What you're changing | What the model knows | How the model behaves | Both |
| Data requirement | Just documents (no labels) | 100-10K labeled examples | Documents + labeled examples |
| Latency | +50-200ms (retrieval step) | No extra latency | +50-200ms |
| Cost | Vector DB + embeddings | Training compute (one-time) | Vector DB + training compute |
| Hallucination risk | Lower (grounded in docs) | Higher (no grounding) | Lowest |

When to Use Each

RAG first (80% of enterprise use cases):

  • Customer support over company docs
  • Legal/compliance Q&A over policies
  • Any task where answers must cite sources
  • Data changes frequently (weekly or more)

Fine-tuning when:

  • You need a specific output format consistently (JSON, SQL, code)
  • Domain-specific tone or style (medical, legal, financial writing)
  • Task specialization (classification, extraction, structured output)
  • Latency is critical and you can't afford the retrieval step

Both for premium use cases:

  • Fine-tuned model that's better at reading retrieved context
  • Domain-adapted embeddings + domain-adapted generator
  • Example: medical Q&A with fine-tuned model + RAG over medical literature

The Nuance That Gets You Hired

Most candidates stop at the table above. Top candidates add: "In practice, I start with RAG because it requires no training data, is easier to debug (you can inspect retrieved chunks), and is easier to update (just re-index documents). I only add fine-tuning when RAG alone doesn't achieve the required output quality or format consistency. This is also the cheapest path — you avoid expensive training compute until you've proven the use case."

Also mention: "The emerging pattern is RAG with a fine-tuned embedding model — you keep the generator general-purpose but fine-tune the retriever on your domain's query-document pairs. This gives you 80% of fine-tuning's quality improvement at 20% of the cost."


HARD OpenAI Anthropic Microsoft
Q2: How Do You Evaluate LLM Outputs in Production?

What They're Really Testing

Evaluation is the hardest unsolved problem in LLM engineering. They want to see a multi-layered evaluation strategy, not just "we use BLEU score."

Answer Framework: Three Evaluation Layers

Layer 1 — Automated Metrics (Fast, Cheap, Continuous)

  • Task-specific metrics: Accuracy for classification, F1 for extraction, exact match for structured output
  • LLM-as-Judge: Use a stronger model to evaluate weaker model outputs. Score on dimensions: factual accuracy, relevance, completeness, harmlessness
  • Reference-free metrics: Perplexity, semantic similarity between question and answer
  • Hallucination detection: NLI model checks if generated claims are entailed by the source context

Layer 2 — Human Evaluation (Gold Standard, Expensive, Periodic)

  • Side-by-side comparison: Show evaluators outputs from model A and B, ask which is better
  • Likert scale rating: Rate on 1-5 for specific dimensions (helpfulness, accuracy, tone)
  • Red-teaming: Dedicated adversarial evaluation — try to break the system

Layer 3 — Production Monitoring (Real User Signal)

  • Implicit feedback: Thumbs up/down, regeneration rate, conversation length, task completion rate
  • Drift detection: Monitor output distribution changes — if the model suddenly generates 30% longer responses, something changed
  • Regression alerts: Compare daily metrics against rolling baselines
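The drift and regression checks above can be sketched as a rolling-baseline monitor. This is a minimal illustration, not a production implementation: the window size, warm-up count, and the 30% tolerance are all illustrative knobs, and a real system would track full distributions per metric, not just the mean.

```python
from collections import deque

class DriftMonitor:
    """Flag a metric value that drifts from its rolling baseline.

    Keeps a rolling window of recent values; a new value is flagged when
    it deviates from the window mean by more than rel_tol (0.3 = 30%).
    """
    def __init__(self, window=1000, rel_tol=0.3, warmup=30):
        self.values = deque(maxlen=window)
        self.rel_tol = rel_tol
        self.warmup = warmup

    def observe(self, value):
        drifted = False
        if len(self.values) >= self.warmup:   # need a baseline first
            baseline = sum(self.values) / len(self.values)
            drifted = abs(value - baseline) > self.rel_tol * baseline
        self.values.append(value)
        return drifted
```

Feeding it, say, per-response token counts would fire an alert the moment responses suddenly run 30% longer than the recent baseline.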

The Evaluation Pipeline

New Model Version
    → Offline Eval (automated benchmarks + LLM-as-Judge)
        → Human Eval (sample of 200-500 examples)
            → Shadow Mode (run alongside production, compare outputs)
                → Canary Deployment (5% traffic)
                    → Full Rollout

The Nuance That Gets You Hired

"The biggest pitfall with LLM-as-Judge is position bias — the judge model tends to prefer the first response shown. Always randomize the order and run evaluation twice with swapped positions. Also, LLM judges are sycophantic — they'll rate longer, more verbose answers higher even when concise answers are better. Calibrate by including known-good and known-bad examples."
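The position-swap fix can be made mechanical. A minimal sketch, assuming `judge` is any callable wrapping your judge-model API that returns which position it preferred:

```python
def debiased_judgment(judge, question, answer_a, answer_b):
    """Run the judge twice with swapped order to cancel position bias.

    judge(question, first, second) returns "first", "second", or "tie"
    for whichever response it prefers in the order shown.
    """
    to_answer_1 = {"first": "A", "second": "B", "tie": "tie"}
    to_answer_2 = {"first": "B", "second": "A", "tie": "tie"}

    pick_1 = to_answer_1[judge(question, answer_a, answer_b)]  # A shown first
    pick_2 = to_answer_2[judge(question, answer_b, answer_a)]  # B shown first

    # A win only counts if it survives both orderings; a flip between
    # runs means the preference was positional, not real.
    return pick_1 if pick_1 == pick_2 else "tie"
```

A purely position-biased judge collapses to "tie" under this scheme, while a judge with a genuine preference keeps it in both orderings.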

Also: "In practice, I've found that user behavior signals (regeneration rate, time spent reading) are more predictive of real quality than any automated metric. The best eval system combines all three layers."


MEDIUM Widely Asked
Q3: Explain the Trade-Offs Between Sparse and Dense Retrieval in RAG

The Core Comparison

| Aspect | Sparse (BM25) | Dense (Embeddings) |
| --- | --- | --- |
| How it works | Term frequency + inverse doc frequency | Neural embedding similarity |
| Strengths | Exact keyword matching, rare terms, zero-shot | Semantic understanding, paraphrase handling |
| Weaknesses | No semantic understanding, vocabulary mismatch | Misses exact terms, needs training data |
| Latency | ~5ms (inverted index) | ~20-50ms (ANN search) |
| Infrastructure | Elasticsearch/Lucene | Vector DB (Pinecone, Weaviate, pgvector) |

Why Hybrid Is Almost Always Better

Query: "How do I fix error code E4521?"

BM25 Result:  Finds doc with exact "E4521" mention       (correct)
Dense Result: Finds generic docs about "error resolution"  (wrong)

Query: "My screen goes black when I plug in the charger"

BM25 Result:  No relevant match (no keyword overlap)       (miss)
Dense Result: Finds "display issues when connecting power"  (correct)

Hybrid approach: Run both, combine with Reciprocal Rank Fusion (RRF):

score(doc) = sum(1 / (k + rank_in_list)) for each retrieval method
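The RRF formula above is only a few lines of code. A minimal sketch (doc IDs are assumed hashable; k=60 is the constant from the original RRF paper):

```python
def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse ranked doc-id lists from multiple retrievers via RRF.

    Each document scores sum(1 / (k + rank)) across the lists it
    appears in, with rank starting at 1 in each list.
    """
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Note how a document ranked moderately by both retrievers can outscore one ranked first by only a single retriever, which is exactly the behavior you want from hybrid search.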
The Nuance That Gets You Hired

"Dense retrieval quality depends heavily on the embedding model. General-purpose models (OpenAI's text-embedding-3, Cohere embed-v4) work well for common domains, but for specialized domains (legal, medical, code), you often need to fine-tune the embedding model on domain-specific query-document pairs. The cheapest approach is hard negative mining — find documents that BM25 ranks highly but aren't relevant, and use those as negative examples during embedding training."


MEDIUM OpenAI Meta Google
Q4: What Are PEFT Methods (LoRA, QLoRA)? When Would You Use Them Over Full Fine-Tuning?

Core Concepts

PEFT (Parameter-Efficient Fine-Tuning) modifies only a small fraction of model parameters while keeping the base model frozen.

LoRA (Low-Rank Adaptation):

  • Injects trainable low-rank matrices into attention layers: W' = W + BA where B is (d x r) and A is (r x d), with r << d
  • Typical rank r = 8-64, modifying <1% of parameters
  • At inference: Merge BA into W (zero additional latency)
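The W' = W + BA decomposition and the zero-latency merge can be sketched with NumPy. Toy sizes here (at d=16 the adapter is not <1% of parameters; it is at real model scale), and the standard alpha/r scaling is included:

```python
import numpy as np

d, r, alpha = 16, 4, 8          # hidden size, LoRA rank, scaling factor
rng = np.random.default_rng(0)

W = rng.normal(size=(d, d))     # frozen pretrained weight
A = rng.normal(size=(r, d))     # trainable down-projection (r x d)
B = rng.normal(size=(d, r))     # trainable up-projection (d x r); zero-init before training

def lora_forward(x, W, A, B):
    # Frozen path plus low-rank adapter path, scaled by alpha / r.
    return W @ x + (alpha / r) * (B @ (A @ x))

# Merge for serving: fold BA into W once -- zero extra inference latency.
W_merged = W + (alpha / r) * (B @ A)
```

The merged weight produces exactly the same outputs as running base + adapter separately, which is why LoRA adds no latency at inference time.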

QLoRA:

  • LoRA + 4-bit quantized base model
  • Reduces memory by ~4x, enabling fine-tuning of 70B models on a single 48GB GPU
  • Uses NF4 (Normal Float 4-bit) quantization + double quantization

Decision Framework

| Scenario | Method | Why |
| --- | --- | --- |
| Limited GPU budget | QLoRA | Fine-tune 70B on 1 GPU |
| Need to serve multiple fine-tuned variants | LoRA | Swap adapters at inference, one base model |
| Maximum quality, unlimited compute | Full fine-tune | Updates all parameters, best performance |
| Quick experiments / iteration | LoRA | 10-100x faster than full fine-tune |
| Catastrophic forgetting is a concern | LoRA | Frozen base preserves general knowledge |

The Nuance That Gets You Hired

"The key insight is that LoRA works because the weight updates during fine-tuning have low intrinsic rank — even full fine-tuning only modifies weights along a low-dimensional subspace. LoRA exploits this directly. In practice, I use rank 16-32 for most tasks and only go higher for complex multi-task fine-tuning."

Follow-up they often ask: "What about RLHF-style fine-tuning?" Answer: "DPO (Direct Preference Optimization) has largely replaced PPO-based RLHF in 2025-2026 because it's simpler (no reward model needed), more stable, and often achieves similar quality. GRPO (Group Relative Policy Optimization) is the newest variant, used in DeepSeek-R1, which doesn't even need a reference model."


HARD OpenAI Anthropic
Q5: How Does Rotary Positional Embedding (RoPE) Work?

Why This Is Asked

RoPE is the dominant positional encoding in modern LLMs (GPT-4, Claude, LLaMA, Gemini). Understanding it shows you know transformer internals, not just API usage.

The Core Idea

Traditional absolute positional encodings add a fixed vector to each token embedding based on its position. The problem: the model can't easily generalize to sequence lengths it hasn't seen.

RoPE encodes position by rotating query and key vectors in 2D subspaces. For position m, it rotates the i-th pair of dimensions by an angle m*θi:

RoPE(x, m) = [x1*cos(m*θ1) - x2*sin(m*θ1),
               x1*sin(m*θ1) + x2*cos(m*θ1),
               x3*cos(m*θ2) - x4*sin(m*θ2),
               ...]

Why It's Better

  1. Relative position: The dot product between RoPE-encoded q and k depends only on their relative distance (m-n), not absolute positions
  2. Extrapolation: With tricks like NTK-aware scaling or YaRN, RoPE models can handle sequences much longer than training length
  3. Decay property: Attention naturally decays with distance (tokens far apart attend less), which matches linguistic intuition
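Property 1 is easy to verify numerically. A minimal NumPy sketch of RoPE with the standard θi = base^(-2i/d) frequencies, checking that the q·k score depends only on the relative offset:

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Rotate each 2D pair of x by pos * theta_i, theta_i = base**(-2i/d)."""
    d = x.shape[0]
    freqs = base ** (-np.arange(0, d, 2) / d)   # one frequency per pair
    angles = pos * freqs
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x1 * np.cos(angles) - x2 * np.sin(angles)
    out[1::2] = x1 * np.sin(angles) + x2 * np.cos(angles)
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)

# Same relative offset (2) at different absolute positions -> same score.
score_near = rope(q, 3) @ rope(k, 1)
score_far = rope(q, 12) @ rope(k, 10)
```

This is the relative-position property in action: shifting both tokens by the same amount leaves their attention score unchanged.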
The Nuance That Gets You Hired

"The key breakthrough for long-context models is theta scaling. The original RoPE uses theta=10000. By increasing theta (e.g., to 500000 in LLaMA 3.1), you reduce the rotation speed per position, allowing the model to handle much longer sequences. Combined with continued pre-training on long documents, this is how models went from 4K to 128K+ context windows. YaRN further improves this by applying different scaling factors to different frequency bands — high-frequency dimensions need less scaling because they already encode fine-grained local patterns."


MEDIUM Widely Asked
Q6: Explain Encoder-Only vs. Decoder-Only vs. Encoder-Decoder. Why Did the Industry Standardize on Causal Decoder-Only?

The Three Architectures

| Architecture | Example Models | Use Case |
| --- | --- | --- |
| Encoder-only | BERT, RoBERTa | Classification, NER, sentence embeddings |
| Decoder-only | GPT-4, Claude, LLaMA | Text generation, chat, code, reasoning |
| Encoder-decoder | T5, BART | Translation, summarization |

Why Decoder-Only Won

  1. Simplicity: One architecture, one training objective (next-token prediction), scales predictably
  2. Emergent abilities: Scaling decoder-only models unlocked reasoning, coding, and instruction following — capabilities that didn't emerge in encoder-only models
  3. Unification: Decoder-only handles ALL tasks — classification (generate "yes/no"), extraction (generate the extracted text), translation (generate in target language). No need for task-specific architectures.
  4. Training efficiency: Causal language modeling uses every token as a training example. Masked language modeling (BERT-style) only trains on 15% of tokens.

When Encoder-Only Still Wins

  • Embedding/retrieval: BERT-style models produce better sentence embeddings for search because they attend bidirectionally
  • Classification at scale: When you need to classify millions of documents per second, a small BERT model (110M params) is 100x cheaper than prompting a GPT-4 class model
  • Token-level tasks: NER, POS tagging where you need a label for each token

The Nuance That Gets You Hired

"The interesting nuance is that decoder-only models can be adapted for bidirectional understanding by fine-tuning them as embedding models (e.g., GritLM, SFR-Embedding). These 'decoder-as-encoder' models are increasingly competitive with BERT-style models for retrieval while also being usable for generation. We might see encoder-only models fully deprecated in 2-3 years."


MEDIUM Anthropic OpenAI
Q7: Design Token Budget Management for a Multi-Turn Conversational System

The Problem

Context windows are finite (even 200K tokens fill up). A customer support conversation might go 50+ turns with tool calls, retrieved documents, and system prompts. How do you manage this?

Answer Framework

1. Context Window Budget Allocation

Total Context: 128K tokens
├── System Prompt:       2K  (fixed)
├── Tool Definitions:    3K  (fixed)
├── Retrieved Context:   8K  (per-turn, refreshed)
├── Conversation History: 100K (managed)
└── Generation Budget:   15K  (reserved for output)

2. History Management Strategies

  • Sliding window: Keep last N turns. Simple, but loses early context.
  • Summarization: Periodically summarize older turns into a compressed representation. Keep summary + recent turns.
  • Hierarchical memory:
    • Hot: Last 5 turns (verbatim)
    • Warm: Turns 6-20 (summarized)
    • Cold: Earlier (stored in vector DB, retrieved on demand)

3. Token Counting

  • Count tokens BEFORE sending to the model (use tiktoken or a model-specific tokenizer)
  • Maintain a running token count; trigger compression when approaching 80% of context window
  • Always reserve enough tokens for the expected output length
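A minimal sketch of the running-count-plus-threshold logic above, using a plain sliding window for brevity (in practice `count_tokens` should be backed by tiktoken or the model's own tokenizer, and eviction would trigger summarization rather than outright dropping):

```python
def manage_history(messages, count_tokens, max_context=128_000,
                   reserved_output=15_000, compress_at=0.8):
    """Evict oldest turns once the running token count crosses the threshold.

    Compression triggers at compress_at * (max_context - reserved_output),
    i.e. well before the window is actually full, and output tokens are
    always reserved up front.
    """
    budget = int((max_context - reserved_output) * compress_at)
    kept = list(messages)
    total = sum(count_tokens(m) for m in kept)
    while total > budget and len(kept) > 1:
        total -= count_tokens(kept.pop(0))   # sliding window: drop oldest turn
    return kept
```

Swapping the `pop(0)` line for a call into a summarizer turns this same skeleton into the summarization strategy from the list above.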
The Nuance That Gets You Hired

"The critical insight is that not all history is equal. In a support conversation, the customer's initial problem description and any error codes are high-value context that should never be summarized away, even if they're 30 turns old. I'd implement a pinning mechanism — certain messages are marked as high-value and always kept verbatim, while lower-value turns (confirmations, pleasantries) are summarized first."

Also: "With models supporting 1M+ tokens (Gemini, Claude), token budget management is less about fitting in the window and more about cost and latency optimization. Sending 500K tokens per request is technically possible but costs 50x more than sending 10K. Smart context management is a cost optimization tool, not just a technical constraint."


HARD Anthropic Microsoft
Q8: How Do You Implement Safety Guardrails in an LLM Application?

What They're Really Testing

At Anthropic, safety isn't a nice-to-have — it's the core mission. At every company, safety failures mean PR disasters and lawsuits. They want a multi-layered defense strategy, not just "we use a content filter."

The Multi-Layer Defense Stack

User Input
  → Layer 1: Input Validation (PII detection, injection detection)
    → Layer 2: Input Classification (toxicity, off-topic, jailbreak attempt)
      → Layer 3: LLM Generation (with system prompt guardrails)
        → Layer 4: Output Classification (harmful content, hallucination, PII leakage)
          → Layer 5: Business Rules (allowed topics, response format)
            → User Output

Each Layer in Detail

Layer 1 — Input Validation

  • PII detection & redaction (regex + NER model for SSN, credit card, email, phone)
  • Input length limits
  • Character encoding sanitization
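The regex half of Layer 1 might look like this. The patterns are illustrative only: real PII detection pairs regexes like these with an NER model and locale-aware number formats.

```python
import re

# Structured identifiers are regex-friendly; names and addresses are not
# and need the NER model mentioned above.
PII_PATTERNS = {
    "SSN":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "PHONE": re.compile(r"\b(?:\+?1[-.\s]?)?\(?\d{3}\)?[-.\s]\d{3}[-.\s]\d{4}\b"),
}

def redact_pii(text):
    """Replace each detected identifier with a typed placeholder."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text
```

Redacting before the text ever reaches the model (or the logs) is the point of putting this at Layer 1 rather than later in the stack.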

Layer 2 — Input Classification

  • Toxicity classifier (fine-tuned model, not keyword matching)
  • Jailbreak detection: Detect prompt injection attempts (role-play attacks, encoding tricks, multi-language evasion)
  • Topic classifier: Is this within the allowed scope?

Layer 3 — System Prompt Engineering

  • Constitutional principles embedded in system prompt
  • Explicit refusal instructions for harmful categories
  • Output format constraints ("always respond in JSON", "never include personal opinions")

Layer 4 — Output Classification

  • Run the same toxicity classifier on model output
  • Hallucination detection: For RAG, check if output claims are supported by retrieved context
  • PII leakage check: Did the model accidentally output training data PII?

Layer 5 — Business Rules

  • Response length limits
  • Allowed topic whitelist
  • Competitor mention filtering
  • Mandatory disclaimers (medical, legal, financial advice)

The Nuance That Gets You Hired

"The hardest part isn't building the layers — it's handling the false positive problem. Overly aggressive safety filters block legitimate queries and frustrate users. I've seen systems where 15% of support queries were incorrectly flagged as 'harmful' because the classifier couldn't distinguish between a customer describing a problem ('this is killing my business') and actual harmful content. The solution is tiered responses: low-confidence flags get a gentle redirect instead of a hard block, and high-confidence flags get blocked with an explanation. Always log blocked requests for human review to tune the thresholds."
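The tiered-response idea can be sketched as a thin layer over any set of classifiers. The thresholds here are illustrative and should be tuned from the logged human-review data described above:

```python
def apply_guardrail(text, classifiers, block_at=0.9, redirect_at=0.5):
    """Tiered guardrail: act on the highest-confidence flag.

    Each classifier is a callable returning a risk score in [0, 1].
    High-confidence flags hard-block, mid-confidence flags get a gentle
    redirect, everything else passes through.
    """
    risk = max(clf(text) for clf in classifiers)
    if risk >= block_at:
        return ("block", risk)      # log for human review
    if risk >= redirect_at:
        return ("redirect", risk)   # soft response, also logged
    return ("allow", risk)
```

The two thresholds are exactly what you tune from the review logs: lowering `redirect_at` catches more edge cases without ever hard-blocking a legitimate customer.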

At Anthropic specifically: "I'd reference Constitutional AI — the model should be trained to follow a set of principles (be helpful, be harmless, be honest) and use self-critique during generation to check its own outputs against these principles, rather than relying solely on external classifiers."


Quick Reference: LLM Interview Cheat Sheet

| Concept | One-Sentence Summary |
| --- | --- |
| RAG | Retrieve relevant docs, inject into prompt, generate grounded answer |
| LoRA | Low-rank weight updates (<1% of params) that merge at inference for zero overhead |
| QLoRA | LoRA + 4-bit quantized base = fine-tune 70B on one GPU |
| RoPE | Rotary position encoding — relative position through rotation, extrapolates to longer sequences |
| DPO | Direct preference optimization — simpler than RLHF, no reward model needed |
| GQA | Grouped-query attention — share KV heads to reduce cache size and speed up inference |
| Continuous batching | Dynamically add/remove requests from a batch during generation for max GPU utilization |
| Speculative decoding | Small model drafts tokens, large model verifies in parallel — 2-3x speedup |

Frequently Asked Questions

Which LLM questions are most commonly asked?

RAG vs. fine-tuning is asked in nearly every AI interview. Evaluation and safety guardrails are the second most common. Positional encodings and architecture choices are more common at research-heavy companies (OpenAI, Anthropic, Google DeepMind).

Do I need to know the math behind transformers?

For AI engineering roles: understand the concepts and be able to explain attention, positional encoding, and training objectives intuitively. For research roles: yes, you should be comfortable with the full mathematical formulation.

How do I demonstrate production experience with LLMs?

Talk about evaluation (how you measured quality), cost optimization (how you reduced inference costs), and failure modes (what went wrong and how you fixed it). These signal real-world experience more than knowing the latest paper.

Written by

CallSphere Team

Expert insights on AI voice agents and customer communication automation.
