7 MLOps & AI Deployment Interview Questions for 2026
Real MLOps and AI deployment interview questions from Google, Amazon, Meta, and Microsoft in 2026. Covers CI/CD for ML, model monitoring, quantization, continuous batching, serving infrastructure, and evaluation frameworks.
MLOps in 2026: From "Nice to Have" to "Core Interview Topic"
Two years ago, MLOps questions were optional — asked at infrastructure-heavy companies but skipped at AI labs. In 2026, every AI role includes MLOps because every company is deploying models to production. If you can't get a model from a notebook to a scalable service, you're not a complete AI engineer.
These 7 questions cover the real deployment challenges companies face today.
What They're Really Testing
They want to see that you understand ML CI/CD is fundamentally different from software CI/CD. In software, if the code compiles and tests pass, you're good. In ML, the code can work perfectly but the model can still be garbage.
Pipeline Architecture
Code Change → Linting + Unit Tests
│
▼
Data Validation (schema checks, distribution checks)
│
▼
Model Training (on standardized environment)
│
▼
Model Evaluation
├── Offline Metrics (accuracy, F1, perplexity)
├── Regression Tests (known inputs → expected outputs)
├── Fairness Checks (performance across demographic groups)
└── Performance Benchmarks (latency, throughput, memory)
│
▼
Model Registry (version, tag, artifact store)
│
▼
Staging Deployment → Integration Tests
│
▼
Canary (5% traffic) → Monitor metrics
│
▼
Full Rollout (auto if metrics pass, manual gate option)
Key Differences from Software CI/CD
| Aspect | Software CI/CD | ML CI/CD |
|---|---|---|
| What changes | Code only | Code + data + model weights |
| Tests | Unit + integration tests | + model quality tests + data quality tests |
| Artifact | Docker image | Docker image + model weights + config |
| Rollback trigger | Errors, crashes | + metric degradation, data drift |
| Pipeline trigger | Code push | + data change, scheduled retraining |
Key Talking Points
- Data versioning (DVC, LakeFS) is as important as code versioning. You need to reproduce any past training run.
- Model registry (MLflow, Weights & Biases) tracks model lineage: which data + code + hyperparameters produced this model.
- Canary deployment for ML: Route 5% of traffic to new model, compare key metrics against baseline. Auto-rollback if metrics degrade by >X%.
- Shadow deployment: Run new model in parallel, log predictions but serve old model's predictions. Compare offline before switching.
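The canary logic above can be sketched as a small gate function. This is an illustrative sketch, not a real library API: metric names, the "higher is better" assumption, and the 5% degradation threshold are all placeholders to tune.

```python
def canary_gate(baseline_metrics: dict, canary_metrics: dict,
                max_degradation: float = 0.05) -> str:
    """Compare canary model metrics against the baseline and decide the action.

    Metrics are assumed to be "higher is better" (e.g., accuracy, CTR).
    """
    for name, baseline in baseline_metrics.items():
        canary = canary_metrics[name]
        # Relative degradation of the canary vs. the current production model
        if baseline > 0 and (baseline - canary) / baseline > max_degradation:
            return f"rollback: {name} degraded {(baseline - canary) / baseline:.1%}"
    return "promote"

# Canary CTR dropped 10% vs. baseline, beyond the 5% threshold -> rollback
print(canary_gate({"ctr": 0.20, "accuracy": 0.91},
                  {"ctr": 0.18, "accuracy": 0.90}))
```

In a real pipeline this comparison would also include a statistical significance check, since 5% of traffic can produce noisy metrics.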
Three Types of Drift
1. Data Drift (Covariate Shift)
- The input distribution changes: e.g., your model was trained on US English, but suddenly gets 30% Spanish queries
- Detection: Compare feature distributions between training data and production inputs using KL divergence, PSI (Population Stability Index), or KS test
2. Concept Drift
- The relationship between inputs and outputs changes: e.g., what users consider a "good recommendation" shifts during holiday season
- Detection: Monitor prediction-to-outcome correlation over time
3. Model Performance Drift
- Model accuracy degrades even without data drift: e.g., the world changes (new products, new slang) and the model's knowledge becomes stale
- Detection: Monitor key business metrics (click-through rate, conversion, CSAT) and compare against rolling baselines
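As a concrete instance of the detection methods above, PSI between a training sample and a production sample of one feature is short to compute with numpy. The 10-bin quantile binning and the 0.2 alert threshold are common conventions, not fixed rules.

```python
import numpy as np

def population_stability_index(expected, actual, bins=10):
    """PSI between a reference (training) and a production sample of one feature."""
    # Bin edges from the reference distribution's quantiles
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    e_frac = np.histogram(expected, edges)[0] / len(expected)
    # Clip production values into range so out-of-range points land in edge bins
    a_frac = np.histogram(np.clip(actual, edges[0], edges[-1]), edges)[0] / len(actual)
    # Floor the fractions to avoid log(0) on empty bins
    e_frac = np.clip(e_frac, 1e-6, None)
    a_frac = np.clip(a_frac, 1e-6, None)
    return float(np.sum((a_frac - e_frac) * np.log(a_frac / e_frac)))

rng = np.random.default_rng(0)
train = rng.normal(0, 1, 10_000)
drifted = rng.normal(0.5, 1, 10_000)  # mean shift in production inputs
# A 0.5-sigma mean shift lands comfortably above the common 0.2 alert threshold
print(population_stability_index(train, drifted))
```

The same function, run per feature on a schedule, is the core of the "statistical alert" tier described below.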
Production Monitoring Stack
Production Traffic
│
├── Input Monitoring
│ ├── Feature distribution tracking
│ ├── Missing value rates
│ ├── Schema validation
│ └── Volume monitoring (QPS anomalies)
│
├── Output Monitoring
│ ├── Prediction distribution (confidence scores)
│ ├── Class balance (is the model suddenly predicting one class 99%?)
│ ├── Latency (p50, p95, p99)
│ └── Error rates
│
└── Outcome Monitoring
├── Business metrics correlation
├── Human feedback aggregation
└── Delayed label comparison (when ground truth becomes available)
Key Talking Points
- "The most dangerous drift is silent drift — the model keeps producing outputs with high confidence, but the outputs are wrong because the world has changed. This is why you can't just monitor model confidence; you need ground-truth labels (even sampled/delayed) to catch real degradation."
- "I set up two types of alerts: statistical (distribution has shifted by >X) and business (conversion rate dropped >Y%). Statistical alerts catch drift early; business alerts catch impact."
- Mention tools: Evidently AI, WhyLabs, Arize, or custom Prometheus + Grafana dashboards for monitoring.
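The two alert tiers from the talking points can be wired into one check. This is a sketch with made-up thresholds and metric names; real systems would pull these values from the monitoring stack above.

```python
def evaluate_alerts(psi: float, baseline_cvr: float, current_cvr: float,
                    psi_threshold: float = 0.2,
                    cvr_drop_threshold: float = 0.05) -> list:
    """Return the alerts that fired: statistical (drift) and/or business (impact)."""
    alerts = []
    # Statistical tier: input distribution has shifted beyond threshold
    if psi > psi_threshold:
        alerts.append(("statistical", f"feature drift: PSI={psi:.2f}"))
    # Business tier: conversion rate dropped more than the allowed fraction
    if baseline_cvr > 0 and (baseline_cvr - current_cvr) / baseline_cvr > cvr_drop_threshold:
        alerts.append(("business", "conversion rate dropped beyond threshold"))
    return alerts

# Drift detected while the business metric is still healthy -> early warning only
print(evaluate_alerts(psi=0.35, baseline_cvr=0.04, current_cvr=0.039))
```

The asymmetry is the point: a statistical alert with no business alert is an early warning worth investigating before users feel the impact.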
Why Quantization Matters
A 70B-parameter model in FP16 requires 140 GB of GPU memory for the weights alone, nearly two 80 GB H100s before you even count the KV cache. Quantization compresses model weights to lower precision, reducing memory footprint and speeding up inference.
Quantization Formats
| Format | Bits | Memory (70B) | Quality Loss | Speed Gain |
|---|---|---|---|---|
| FP32 | 32 | 280 GB | Baseline | Baseline |
| FP16/BF16 | 16 | 140 GB | None | 2x |
| FP8 | 8 | 70 GB | Minimal | 3-4x |
| INT8 | 8 | 70 GB | Very small | 3-4x |
| INT4 (GPTQ/AWQ) | 4 | 35 GB | Small-moderate | 5-7x |
| NF4 (QLoRA) | 4 | 35 GB | Small | 5-7x (training) |
Key Techniques
Post-Training Quantization (PTQ):
- Quantize after training with a small calibration dataset
- GPTQ: Layer-by-layer quantization minimizing reconstruction error
- AWQ: Activation-Aware — protects salient weights (high activation channels) from aggressive quantization
Quantization-Aware Training (QAT):
- Simulate quantization during training so the model learns to be robust
- Higher quality but requires full training pipeline
Dynamic vs. Static Quantization:
- Static: Compute scale factors once using calibration data. Faster inference.
- Dynamic: Compute scale factors per batch at runtime. Better quality, slight overhead.
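A minimal sketch of static symmetric INT8 quantization: one scale factor is derived from calibration data, then reused at inference time. Real libraries add per-channel scales and calibrated clipping; this only shows the core round-trip.

```python
import numpy as np

def compute_scale(calibration: np.ndarray) -> float:
    """Static quantization: derive one scale factor from calibration data."""
    return float(np.max(np.abs(calibration))) / 127.0  # map the range onto int8

def quantize(x: np.ndarray, scale: float) -> np.ndarray:
    return np.clip(np.round(x / scale), -127, 127).astype(np.int8)

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
weights = rng.normal(0, 0.02, 1024).astype(np.float32)
scale = compute_scale(weights)
restored = dequantize(quantize(weights, scale), scale)
# For in-range values, round-trip error is at most half a quantization step
print(np.max(np.abs(restored - weights)) <= scale / 2 + 1e-8)
```

Dynamic quantization would call `compute_scale` per batch at runtime instead of once up front, which is exactly the quality/overhead trade-off described above.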
Key Talking Points
- "The rule of thumb: INT8 is nearly lossless for most models. INT4 degrades quality by 1-3% on benchmarks but halves the memory again. For production, INT8 is the sweet spot unless you're extremely memory-constrained."
- "FP8 (E4M3/E5M2) is the emerging standard on H100s and newer GPUs. It has native hardware support, so you get the memory savings of INT8 with better numerical properties for training."
- "AWQ > GPTQ in most benchmarks because it identifies which weight channels have high activation magnitudes and keeps those at higher precision. This preserves the model's most important computation paths."
- "Quantization + speculative decoding stack: quantize both draft and target models, getting compound speedups."
Static Batching (The Old Way)
Request A (10 tokens) ████████████████████░░░░░░░░░░ (waits)
Request B (30 tokens) ████████████████████████████████████████████████████████████
Request C (5 tokens) ██████████░░░░░░░░░░░░░░░░░░░░ (waits a LOT)
All three must wait for the longest request (B) to finish.
The batch slots for A and C sit idle after their tokens are generated.
Continuous Batching (The Modern Way)
Iteration 1: Process [A, B, C] together
Iteration 2: A finishes → replace with new Request D
Process [D, B, C] together
Iteration 3: C finishes → replace with Request E
Process [D, B, E] together
Key insight: As soon as one request in the batch finishes generating, a new request takes its slot. The GPU is never idle waiting for the longest request.
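That key insight can be demonstrated with a toy iteration-level scheduler. This is a simulation of the scheduling idea only, not how any real engine is implemented: each step decodes one token per active request, and a finished request's slot is refilled immediately.

```python
from collections import deque

def continuous_batching(requests: dict, batch_size: int) -> int:
    """Toy continuous-batching scheduler.

    `requests` maps request id -> number of tokens to generate.
    Returns the total number of decode iterations used.
    """
    queue = deque(requests.items())
    active = {}  # request id -> tokens remaining
    steps = 0
    while queue or active:
        # Iteration-level scheduling: refill free slots before each step
        while queue and len(active) < batch_size:
            rid, tokens = queue.popleft()
            active[rid] = tokens
        steps += 1
        # One decode step for every active request
        for rid in list(active):
            active[rid] -= 1
            if active[rid] == 0:
                del active[rid]  # slot freed for the next waiting request
    return steps

# A=10, B=30, C=5 tokens with batch size 3: done in 30 steps (B's length),
# and a fourth 25-token request D slips into C's freed slot at no extra cost
print(continuous_batching({"A": 10, "B": 30, "C": 5}, batch_size=3))
print(continuous_batching({"A": 10, "B": 30, "C": 5, "D": 25}, batch_size=3))
```

Under static batching the fourth request would need a second full batch; here it rides along in the slot C vacated at step 5.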
Performance Impact
| Metric | Static Batching | Continuous Batching |
|---|---|---|
| GPU Utilization | 30-50% | 80-95% |
| Throughput | Baseline | 2-3x higher |
| Latency variance | Very high (short reqs wait for long) | Low (each req finishes independently) |
How vLLM Implements This
vLLM combines continuous batching with PagedAttention:
- KV cache managed as virtual memory pages (not contiguous blocks)
- New requests can be inserted without pre-allocating maximum sequence length
- Memory waste reduced by ~55% vs. static allocation
Key Talking Points
- "The key implementation challenge is iteration-level scheduling — the serving engine must decide at every decoding step which requests are in the current batch. This requires an efficient scheduler that can handle thousands of concurrent requests."
- "Continuous batching pairs well with prefix caching — if multiple requests share the same system prompt, they share the KV cache for that prefix. This is common in production (all requests to a customer support bot share the same system prompt)."
- "Mention specific frameworks: vLLM (PagedAttention, most popular), TGI (HuggingFace), TensorRT-LLM (NVIDIA, best raw performance), SGLang (RadixAttention, fast structured generation)."
End-to-End ML Pipeline
Data Sources (S3/DB) → Ingestion (Airflow/Prefect) → Validation (Great Expectations) → Transformation (Feature Store) → Training (GPU cluster, spot) → Evaluation (eval suite + gates) → Registry (MLflow) → Serving (K8s + vLLM/TGI)
Component Choices
| Component | Tool Options | Key Consideration |
|---|---|---|
| Orchestration | Airflow, Prefect, Kubeflow Pipelines | DAG management, retry logic, scheduling |
| Data Validation | Great Expectations, Pandera | Schema + distribution checks before training |
| Feature Store | Feast, Tecton, Vertex AI | Offline/online feature consistency |
| Training | SageMaker, Vertex AI, bare K8s + spot GPUs | Cost optimization via spot instances |
| Experiment Tracking | W&B, MLflow, Neptune | Hyperparameter search, metric comparison |
| Model Registry | MLflow, SageMaker Model Registry | Versioning, staging, approval workflows |
| Serving | vLLM, TGI, Triton, SageMaker Endpoints | Auto-scaling, A/B testing, shadow mode |
Pipeline Triggers
- Scheduled: Retrain weekly/monthly on new data
- Data-driven: Trigger when new data exceeds threshold (e.g., 10K new labeled examples)
- Drift-driven: Trigger when monitoring detects data drift or performance degradation
- Manual: Data scientist triggers after experiment validates improvement
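The four trigger types above can be combined into one gate. The thresholds here (10K examples, 0.2 drift score, weekly cadence) mirror the bullets and are illustrative defaults, not recommendations.

```python
from datetime import datetime, timedelta

def should_retrain(new_labeled_examples: int, drift_score: float,
                   last_trained: datetime, manual_request: bool = False,
                   *, min_new_examples: int = 10_000,
                   drift_threshold: float = 0.2,
                   max_age: timedelta = timedelta(days=7)) -> list:
    """Return the reasons (possibly empty) for kicking off the training pipeline."""
    reasons = []
    if datetime.now() - last_trained > max_age:
        reasons.append("scheduled")    # weekly cadence exceeded
    if new_labeled_examples >= min_new_examples:
        reasons.append("data-driven")  # enough fresh labels accumulated
    if drift_score > drift_threshold:
        reasons.append("drift-driven") # monitoring fired
    if manual_request:
        reasons.append("manual")       # data scientist requested a run
    return reasons

# Enough new labels, no drift, trained 2 days ago -> data-driven only
print(should_retrain(new_labeled_examples=12_000, drift_score=0.05,
                     last_trained=datetime.now() - timedelta(days=2)))
```

Returning the reasons rather than a boolean helps downstream: a drift-driven run might warrant extra evaluation scrutiny compared to a routine scheduled one.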
Key Talking Points
- "The hardest part isn't building the pipeline — it's building the evaluation gates. Every pipeline stage needs a go/no-go decision: Is the data quality good enough to train? Is the model quality good enough to deploy? These gates prevent bad models from reaching production."
- "Cost optimization is critical: Use spot/preemptible instances for training (3-5x cheaper), with checkpointing for fault tolerance. For serving, right-size GPU instances — don't use an A100 for a model that fits on a T4."
- At Amazon: tie to Leadership Principles — "Frugality" means cost-optimized infrastructure, "Bias for Action" means automated pipelines over manual deployments.
Offline Evaluation
Metrics:
- NDCG (Normalized Discounted Cumulative Gain): Measures ranking quality — are the best items at the top?
- MAP (Mean Average Precision): Average precision across all relevant items
- MRR (Mean Reciprocal Rank): How far down is the first relevant result?
Methodology:
- Hold-out test set from recent data (not randomly sampled — temporal split to avoid leakage)
- Compute metrics on the test set for both old and new model
- Statistical significance testing (paired t-test or bootstrap confidence intervals)
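The ranking metrics above follow directly from their definitions. This sketch uses binary relevance for MRR and graded relevance for NDCG, with relevance lists given in rank order.

```python
import math

def mrr(ranked_relevance: list) -> float:
    """Mean reciprocal rank over queries; each list is binary relevance in rank order."""
    total = 0.0
    for rels in ranked_relevance:
        # Reciprocal rank of the first relevant result, 0 if none
        total += next((1 / (i + 1) for i, r in enumerate(rels) if r), 0.0)
    return total / len(ranked_relevance)

def ndcg(rels: list, k: int) -> float:
    """NDCG@k for one query with graded relevance scores in rank order."""
    dcg = sum(r / math.log2(i + 2) for i, r in enumerate(rels[:k]))
    # Ideal DCG: the same grades sorted best-first
    ideal = sum(r / math.log2(i + 2)
                for i, r in enumerate(sorted(rels, reverse=True)[:k]))
    return dcg / ideal if ideal > 0 else 0.0

print(mrr([[0, 1, 0], [1, 0, 0]]))  # (1/2 + 1) / 2 = 0.75
print(ndcg([3, 2, 0, 1], k=4))      # < 1.0: the grade-1 item sits below a grade-0
```

For the significance step, the per-query metric values from old and new models are exactly the paired samples a paired t-test or bootstrap operates on.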
Online Evaluation (A/B Testing)
Production Traffic
│
├── 50% → Control (current model)
│ Measure: CTR, engagement, revenue
│
└── 50% → Treatment (new model)
Measure: CTR, engagement, revenue
→ Statistical test after N days/users → Ship or revert
Interleaving (The Meta Approach)
Instead of splitting users between models, interleave results from both models in a single result list for each user:
Position 1: Model A's top result
Position 2: Model B's top result
Position 3: Model A's 2nd result
Position 4: Model B's 2nd result
...
Count which model's results get more clicks → more sensitive than traditional A/B testing (requires 10x fewer users for the same statistical power).
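The alternation described above can be sketched as follows, tagging each position with its source model so clicks can be credited. Production systems typically use team-draft interleaving, which randomizes who picks first each round; this sketch shows the simpler fixed alternation with duplicate-skipping.

```python
def interleave(results_a: list, results_b: list, k: int) -> list:
    """Alternate picks from two rankings into one list, skipping duplicates.

    Returns (item, source_model) pairs.
    """
    merged, seen = [], set()
    pools = {"A": list(results_a), "B": list(results_b)}
    turn = 0
    while len(merged) < k and any(pools.values()):
        model = "A" if turn % 2 == 0 else "B"
        turn += 1
        pool = pools[model]
        while pool:
            item = pool.pop(0)
            if item not in seen:  # skip results the other model already placed
                seen.add(item)
                merged.append((item, model))
                break
    return merged

def credit_clicks(merged: list, clicked: set) -> dict:
    """Count clicks per source model from one interleaved impression."""
    counts = {"A": 0, "B": 0}
    for item, model in merged:
        if item in clicked:
            counts[model] += 1
    return counts

page = interleave(["x", "y", "z"], ["y", "w", "z"], k=4)
print(page)                              # [('x','A'), ('y','B'), ('z','A'), ('w','B')]
print(credit_clicks(page, {"y", "w"}))   # {'A': 0, 'B': 2}
```

Aggregating these per-impression credits across users gives the click preference that the interleaving comparison is built on.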
Key Talking Points
- "Offline metrics can disagree with online metrics. A model with better NDCG might have worse user engagement because it optimizes for relevance without considering diversity (users get bored seeing similar results)."
- "Guard against novelty effects: Users might click more on a new ranking initially because it's different, not because it's better. Run experiments for at least 2 weeks."
- "Long-term metrics matter: A ranking change might boost short-term CTR but reduce long-term retention. Track both."
The Serving Stack
API Gateway (rate limiting, auth)
→ Load Balancer (route to least-loaded GPU)
→ Serving Framework (vLLM / TGI / TensorRT-LLM)
→ GPU Inference (model loaded in GPU memory)
→ Response Streaming (SSE / WebSocket)
Framework Comparison
| Feature | vLLM | TGI (HuggingFace) | TensorRT-LLM (NVIDIA) |
|---|---|---|---|
| Key Innovation | PagedAttention | Production-ready, easy deploy | Kernel-level optimization |
| Performance | High | Good | Highest (NVIDIA-specific) |
| Ease of Use | pip install | Docker image | Complex build process |
| Hardware | Any GPU | Any GPU | NVIDIA only |
| Continuous Batching | Yes | Yes | Yes |
| Quantization | GPTQ, AWQ, FP8 | GPTQ, bitsandbytes | INT8, INT4, FP8 (native) |
| Best For | General use, flexibility | Quick deployment | Maximum throughput |
Auto-Scaling Strategy
- Metric: Scale on GPU utilization + request queue depth (not CPU, which is misleading for GPU workloads)
- Scale-up: When queue depth > threshold for > 30 seconds
- Scale-down: When GPU utilization < 20% for > 5 minutes (aggressive cooldown to save costs)
- Minimum replicas: Always keep 1+ warm (cold start for loading model weights = 30-120 seconds)
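The scaling rules above can be expressed as one decision function. The window lengths and thresholds mirror the bullets and are starting points to tune, not fixed values.

```python
def scaling_decision(queue_depth_samples: list, gpu_util_samples: list,
                     replicas: int, *, queue_threshold: int = 10,
                     scaleup_window_s: int = 30, util_floor: float = 0.20,
                     scaledown_window_s: int = 300, min_replicas: int = 1,
                     sample_interval_s: int = 1) -> str:
    """Return 'scale_up', 'scale_down', or 'hold'.

    Sample lists hold the most recent measurements, one per `sample_interval_s`.
    """
    up_n = scaleup_window_s // sample_interval_s
    down_n = scaledown_window_s // sample_interval_s
    # Scale up: queue stayed deep for the whole 30 s window
    recent_q = queue_depth_samples[-up_n:]
    if len(recent_q) >= up_n and all(q > queue_threshold for q in recent_q):
        return "scale_up"
    # Scale down: GPU stayed idle for the whole 5 min window, above the warm minimum
    recent_u = gpu_util_samples[-down_n:]
    if (replicas > min_replicas and len(recent_u) >= down_n
            and all(u < util_floor for u in recent_u)):
        return "scale_down"
    return "hold"

# 30 seconds of deep queue -> add a replica
print(scaling_decision([15] * 30, [0.9] * 30, replicas=2))
```

Note the asymmetry: scale-up reacts in seconds to protect latency, while scale-down waits minutes and never drops below the warm minimum, because reloading model weights on a cold start takes 30-120 seconds.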
Key Talking Points
- "In practice, I'd start with vLLM for most use cases — it has the best developer experience and PagedAttention gives you 90%+ of TensorRT-LLM's throughput with much less complexity."
- "For maximum throughput at scale (millions of requests/day), TensorRT-LLM with custom CUDA kernels and FP8 quantization on H100s is the gold standard."
- "Multi-model serving: If you need to serve multiple models, consider frameworks that support model multiplexing — load multiple LoRA adapters on a single base model rather than running separate instances."
- "Discuss cost: GPU inference is expensive. A single H100 is ~$2-3/hr. At 50 tokens/sec of output, that's roughly $0.0014 per 100 tokens ($2.5/hr ÷ 3,600 s ÷ 50 tok/s). Compare to API pricing ($0.01-0.06 per 100 tokens) to decide build-vs-buy."
Frequently Asked Questions
How important is MLOps knowledge for AI engineering interviews?
It's now a core competency, not optional. Even AI labs like OpenAI and Anthropic ask about deployment, monitoring, and evaluation because they ship models to millions of users. At applied AI companies (Amazon, Microsoft, Google), it's often 25-30% of the interview signal.
Do I need to know specific tools like vLLM or MLflow?
Knowing specific tools demonstrates practical experience. But concepts matter more — if you can explain continuous batching, quantization trade-offs, and monitoring strategies, the specific tool names are secondary.
What's the difference between MLOps and traditional DevOps?
MLOps adds three dimensions: (1) data management (versioning, quality, drift), (2) model management (training, evaluation, registry), and (3) experiment tracking (hyperparameters, metrics, reproducibility). DevOps principles (CI/CD, monitoring, infrastructure-as-code) still apply but are extended for ML-specific challenges.
Written by
CallSphere Team
Expert insights on AI voice agents and customer communication automation.