
7 MLOps & AI Deployment Interview Questions for 2026

Real MLOps and AI deployment interview questions from Google, Amazon, Meta, and Microsoft in 2026. Covers CI/CD for ML, model monitoring, quantization, continuous batching, serving infrastructure, and evaluation frameworks.

MLOps in 2026: From "Nice to Have" to "Core Interview Topic"

Two years ago, MLOps questions were optional — asked at infrastructure-heavy companies but skipped at AI labs. In 2026, every AI role includes MLOps because every company is deploying models to production. If you can't get a model from a notebook to a scalable service, you're not a complete AI engineer.

These 7 questions cover the real deployment challenges companies face today.


MEDIUM Google Amazon Microsoft
Q1: Design a CI/CD Pipeline for ML Models

What They're Really Testing

They want to see that you understand ML CI/CD is fundamentally different from software CI/CD. In software, if the code compiles and tests pass, you're good. In ML, the code can work perfectly but the model can still be garbage.

Pipeline Architecture

Code Change → Linting + Unit Tests
                  │
                  ▼
           Data Validation (schema checks, distribution checks)
                  │
                  ▼
           Model Training (on standardized environment)
                  │
                  ▼
           Model Evaluation
           ├── Offline Metrics (accuracy, F1, perplexity)
           ├── Regression Tests (known inputs → expected outputs)
           ├── Fairness Checks (performance across demographic groups)
           └── Performance Benchmarks (latency, throughput, memory)
                  │
                  ▼
           Model Registry (version, tag, artifact store)
                  │
                  ▼
           Staging Deployment → Integration Tests
                  │
                  ▼
           Canary (5% traffic) → Monitor metrics
                  │
                  ▼
           Full Rollout (auto if metrics pass, manual gate option)

Key Differences from Software CI/CD

Aspect             Software CI/CD       ML CI/CD
What changes       Code only            Code + data + model weights
Tests              Unit + integration   Unit + integration, plus model quality and data quality tests
Artifact           Docker image         Docker image + model weights + config
Rollback trigger   Errors, crashes      Errors, crashes, plus metric degradation and data drift
Pipeline trigger   Code push            Code push, plus data changes and scheduled retraining
Key Talking Points
  • Data versioning (DVC, LakeFS) is as important as code versioning. You need to reproduce any past training run.
  • Model registry (MLflow, Weights & Biases) tracks model lineage: which data + code + hyperparameters produced this model.
  • Canary deployment for ML: Route 5% of traffic to new model, compare key metrics against baseline. Auto-rollback if metrics degrade by >X%.
  • Shadow deployment: Run new model in parallel, log predictions but serve old model's predictions. Compare offline before switching.
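The canary step above reduces to a small comparison function. A minimal sketch of an auto-rollback gate, with hypothetical metric names and an illustrative 5% threshold (a real system would pull both sides from a metrics store):

```python
def canary_decision(baseline, canary, max_degradation=0.05):
    """Compare canary metrics against the baseline model's metrics.

    Both arguments are dicts of metric name -> value, where higher is
    better. Returns 'rollback' if any metric degraded by more than the
    allowed fraction, otherwise 'promote'.
    """
    for name, base_value in baseline.items():
        if base_value <= 0:
            continue  # skip degenerate metrics to avoid division by zero
        drop = (base_value - canary.get(name, 0.0)) / base_value
        if drop > max_degradation:
            return "rollback"  # degraded past the threshold
    return "promote"
```

For example, a CTR drop from 0.10 to 0.09 is a 10% relative degradation, which trips the 5% gate.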

MEDIUM Widely Asked
Q2: How Do You Monitor Models in Production? What Is Data Drift?

Three Types of Drift

1. Data Drift (Covariate Shift)

  • The input distribution changes: e.g., your model was trained on US English, but suddenly gets 30% Spanish queries
  • Detection: Compare feature distributions between training data and production inputs using KL divergence, PSI (Population Stability Index), or KS test

2. Concept Drift

  • The relationship between inputs and outputs changes: e.g., what users consider a "good recommendation" shifts during holiday season
  • Detection: Monitor prediction-to-outcome correlation over time

3. Model Performance Drift

  • Model accuracy degrades even without data drift: e.g., the world changes (new products, new slang) and the model's knowledge becomes stale
  • Detection: Monitor key business metrics (click-through rate, conversion, CSAT) and compare against rolling baselines
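The PSI mentioned under data drift is simple enough to compute by hand. A minimal sketch over pre-binned proportions; the 0.1/0.25 cutoffs are a common rule of thumb, not a standard:

```python
import math

def psi(expected, actual, eps=1e-6):
    """Population Stability Index between two binned distributions.

    `expected` and `actual` are lists of bin proportions that each sum
    to 1 (e.g. the training distribution vs. this week's production
    traffic). Rule of thumb: < 0.1 stable, 0.1-0.25 moderate shift,
    > 0.25 major shift.
    """
    total = 0.0
    for e, a in zip(expected, actual):
        e, a = max(e, eps), max(a, eps)  # guard against empty bins
        total += (a - e) * math.log(a / e)
    return total
```

Identical distributions give a PSI of 0; a shift from an 80/20 split to 50/50 already lands well above the 0.25 "major shift" mark.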

Production Monitoring Stack

Production Traffic
    │
    ├── Input Monitoring
    │   ├── Feature distribution tracking
    │   ├── Missing value rates
    │   ├── Schema validation
    │   └── Volume monitoring (QPS anomalies)
    │
    ├── Output Monitoring
    │   ├── Prediction distribution (confidence scores)
    │   ├── Class balance (is the model suddenly predicting one class 99%?)
    │   ├── Latency (p50, p95, p99)
    │   └── Error rates
    │
    └── Outcome Monitoring
        ├── Business metrics correlation
        ├── Human feedback aggregation
        └── Delayed label comparison (when ground truth becomes available)
Key Talking Points
  • "The most dangerous drift is silent drift — the model keeps producing outputs with high confidence, but the outputs are wrong because the world has changed. This is why you can't just monitor model confidence; you need ground-truth labels (even sampled/delayed) to catch real degradation."
  • "I set up two types of alerts: statistical (distribution has shifted by >X) and business (conversion rate dropped >Y%). Statistical alerts catch drift early; business alerts catch impact."
  • Mention tools: Evidently AI, WhyLabs, Arize, or custom Prometheus + Grafana dashboards for monitoring.

HARD OpenAI Anthropic Meta
Q3: Explain Quantization for LLM Deployment (INT8, INT4, FP8)

Why Quantization Matters

A 70B parameter model in FP16 requires 140 GB of GPU memory for the weights alone, which already fills two 80 GB H100s before you account for KV cache and activations. Quantization compresses model weights to lower precision, reducing memory and speeding up inference.

Quantization Formats

Format             Bits   Memory (70B)   Quality Loss     Speed Gain
FP32               32     280 GB         Baseline         Baseline
FP16/BF16          16     140 GB         None             2x
FP8                8      70 GB          Minimal          3-4x
INT8               8      70 GB          Very small       3-4x
INT4 (GPTQ/AWQ)    4      35 GB          Small-moderate   5-7x
NF4 (QLoRA)        4      35 GB          Small            5-7x (training)

Key Techniques

Post-Training Quantization (PTQ):


  • Quantize after training with a small calibration dataset
  • GPTQ: Layer-by-layer quantization minimizing reconstruction error
  • AWQ: Activation-Aware — protects salient weights (high activation channels) from aggressive quantization

Quantization-Aware Training (QAT):

  • Simulate quantization during training so the model learns to be robust
  • Higher quality but requires full training pipeline

Dynamic vs. Static Quantization:

  • Static: Compute scale factors once using calibration data. Faster inference.
  • Dynamic: Compute scale factors per batch at runtime. Better quality, slight overhead.
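The arithmetic behind static quantization is small enough to whiteboard. A toy sketch of symmetric per-tensor INT8 quantization in plain Python (real kernels quantize per-channel on tensors, but the round trip is the same):

```python
def quantize_int8(weights):
    """Symmetric per-tensor INT8 quantization of a list of floats.

    Returns (int8_values, scale). The scale maps the largest absolute
    weight to 127; every weight is then rounded into [-128, 127].
    """
    scale = (max(abs(w) for w in weights) / 127.0) or 1.0  # avoid /0 on all-zero input
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Map INT8 values back to floats; error is at most scale/2 per weight."""
    return [x * scale for x in q]
```

For `[0.5, -1.27, 0.01]` the scale is 1.27/127 = 0.01, giving integers `[50, -127, 1]` that dequantize back with negligible error, which is why INT8 is often described as nearly lossless.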
Key Talking Points
  • "The rule of thumb: INT8 is nearly lossless for most models. INT4 degrades quality by 1-3% on benchmarks but halves the memory again. For production, INT8 is the sweet spot unless you're extremely memory-constrained."
  • "FP8 (E4M3/E5M2) is the emerging standard on H100s and newer GPUs. It has native hardware support, so you get the memory savings of INT8 with better numerical properties for training."
  • "AWQ > GPTQ in most benchmarks because it identifies which weight channels have high activation magnitudes and keeps those at higher precision. This preserves the model's most important computation paths."
  • "Quantization + speculative decoding stack: quantize both draft and target models, getting compound speedups."

MEDIUM OpenAI Anthropic
Q4: Describe Continuous Batching for LLM Serving. Why Is It Better?

Static Batching (The Old Way)

Request A (10 tokens)  ████████████████████░░░░░░░░░░  (waits)
Request B (30 tokens)  ████████████████████████████████████████████████████████████
Request C (5 tokens)   ██████████░░░░░░░░░░░░░░░░░░░░  (waits a LOT)

All 3 must wait for the longest request (B) to finish.
GPU is idle for A and C after they complete.

Continuous Batching (The Modern Way)

Iteration 1: Process [A, B, C] together
Iteration 2: A finishes → replace with new Request D
             Process [D, B, C] together
Iteration 3: C finishes → replace with Request E
             Process [D, B, E] together

Key insight: As soon as one request in the batch finishes generating, a new request takes its slot. The GPU is never idle waiting for the longest request.
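The scheduling loop above can be written as a toy simulator. Request IDs and token counts here are made up; a real engine like vLLM makes this decision per decode step on the GPU:

```python
from collections import deque

def continuous_batching(requests, batch_size):
    """Toy iteration-level scheduler.

    `requests` maps request id -> number of tokens to generate.
    Returns the batch processed at each decode step, showing how a
    finished request's slot is refilled immediately from the queue.
    """
    waiting = deque(requests)
    active = {}  # request id -> tokens still to generate
    trace = []
    while waiting or active:
        # fill any free slots from the queue (iteration-level scheduling)
        while waiting and len(active) < batch_size:
            rid = waiting.popleft()
            active[rid] = requests[rid]
        trace.append(sorted(active))
        for rid in list(active):  # one decode step for every active request
            active[rid] -= 1
            if active[rid] == 0:
                del active[rid]  # slot is now free for the next request
    return trace
```

With four requests and a batch size of three, request D joins the batch the moment C finishes, rather than waiting for the whole batch to drain.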

Performance Impact

Metric             Static Batching                            Continuous Batching
GPU Utilization    30-50%                                     80-95%
Throughput         Baseline                                   2-3x higher
Latency variance   Very high (short requests wait for long)   Low (each request finishes independently)

How vLLM Implements This

vLLM combines continuous batching with PagedAttention:

  • KV cache managed as virtual memory pages (not contiguous blocks)
  • New requests can be inserted without pre-allocating maximum sequence length
  • Memory waste reduced by ~55% vs. static allocation
Key Talking Points
  • "The key implementation challenge is iteration-level scheduling — the serving engine must decide at every decoding step which requests are in the current batch. This requires an efficient scheduler that can handle thousands of concurrent requests."
  • "Continuous batching pairs well with prefix caching — if multiple requests share the same system prompt, they share the KV cache for that prefix. This is common in production (all requests to a customer support bot share the same system prompt)."
  • "Mention specific frameworks: vLLM (PagedAttention, most popular), TGI (HuggingFace), TensorRT-LLM (NVIDIA, best raw performance), SGLang (frontier research)."

HARD Amazon Google Microsoft
Q5: How Would You Implement an Automated ML Pipeline?

End-to-End ML Pipeline

Data Sources → Ingestion → Validation → Transformation → Training → Evaluation → Registry → Serving
     │             │            │             │              │            │           │          │
     ▼             ▼            ▼             ▼              ▼            ▼           ▼          ▼
  S3/DB      Airflow/       Great         Feature       GPU Cluster   Eval Suite  MLflow     K8s +
             Prefect     Expectations     Store          (spot)       + gates              vLLM/TGI

Component Choices

Component             Tool Options                                 Key Consideration
Orchestration         Airflow, Prefect, Kubeflow Pipelines         DAG management, retry logic, scheduling
Data Validation       Great Expectations, Pandera                  Schema + distribution checks before training
Feature Store         Feast, Tecton, Vertex AI                     Offline/online feature consistency
Training              SageMaker, Vertex AI, bare K8s + spot GPUs   Cost optimization via spot instances
Experiment Tracking   W&B, MLflow, Neptune                         Hyperparameter search, metric comparison
Model Registry        MLflow, SageMaker Model Registry             Versioning, staging, approval workflows
Serving               vLLM, TGI, Triton, SageMaker Endpoints       Auto-scaling, A/B testing, shadow mode

Pipeline Triggers

  • Scheduled: Retrain weekly/monthly on new data
  • Data-driven: Trigger when new data exceeds threshold (e.g., 10K new labeled examples)
  • Drift-driven: Trigger when monitoring detects data drift or performance degradation
  • Manual: Data scientist triggers after experiment validates improvement
Key Talking Points
  • "The hardest part isn't building the pipeline — it's building the evaluation gates. Every pipeline stage needs a go/no-go decision: Is the data quality good enough to train? Is the model quality good enough to deploy? These gates prevent bad models from reaching production."
  • "Cost optimization is critical: Use spot/preemptible instances for training (3-5x cheaper), with checkpointing for fault tolerance. For serving, right-size GPU instances — don't use an A100 for a model that fits on a T4."
  • At Amazon: tie to Leadership Principles — "Frugality" means cost-optimized infrastructure, "Bias for Action" means automated pipelines over manual deployments.
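The evaluation gates described in the talking points can be as simple as a threshold check that fails closed. A sketch with hypothetical metric names; a real pipeline would pull these from the experiment tracker (MLflow, W&B):

```python
def deployment_gate(metrics, thresholds):
    """Go/no-go gate: every thresholded metric must clear its floor.

    `metrics` and `thresholds` are dicts keyed by metric name (higher
    is better). A metric missing from `metrics` fails closed, so a
    broken eval job cannot silently promote a model.
    """
    failures = [name for name, floor in thresholds.items()
                if metrics.get(name, float("-inf")) < floor]
    return (len(failures) == 0, failures)
```

The same shape works at every stage: data-quality gates before training, model-quality gates before registry promotion, and canary gates before full rollout.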

MEDIUM Meta
Q6: Design an Evaluation Framework for Testing Ranking Models in Production

Offline Evaluation

Metrics:

  • NDCG (Normalized Discounted Cumulative Gain): Measures ranking quality — are the best items at the top?
  • MAP (Mean Average Precision): Average precision across all relevant items
  • MRR (Mean Reciprocal Rank): How far down is the first relevant result?
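NDCG is worth being able to compute on a whiteboard. A sketch using the linear-gain formulation (some teams use the 2^rel − 1 gain instead; the position discount is the same):

```python
import math

def dcg(relevances):
    """Discounted cumulative gain with the standard log2 position discount."""
    return sum(rel / math.log2(pos + 2) for pos, rel in enumerate(relevances))

def ndcg(relevances, k=None):
    """NDCG: DCG of the actual ranking divided by DCG of the ideal ranking.

    `relevances` are the relevance labels in the order the model ranked
    the items; `k` optionally truncates to the top-k positions.
    """
    ranked = relevances[:k]
    ideal = sorted(relevances, reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(ranked) / ideal_dcg if ideal_dcg > 0 else 0.0
```

A perfectly ordered list scores 1.0; burying the only relevant item at position 3 cuts the score in half, which is exactly the "are the best items at the top?" intuition.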

Methodology:

  • Hold-out test set from recent data (not randomly sampled — temporal split to avoid leakage)
  • Compute metrics on the test set for both old and new model
  • Statistical significance testing (paired t-test or bootstrap confidence intervals)

Online Evaluation (A/B Testing)

Production Traffic
    │
    ├── 50% → Control (current model)
    │         Measure: CTR, engagement, revenue
    │
    └── 50% → Treatment (new model)
              Measure: CTR, engagement, revenue

    → Statistical test after N days/users → Ship or revert

Interleaving (The Meta Approach)

Instead of splitting users between models, interleave results from both models in a single result list for each user:

Position 1: Model A's top result
Position 2: Model B's top result
Position 3: Model A's 2nd result
Position 4: Model B's 2nd result
...

Count which model's results get more clicks → more sensitive than traditional A/B testing (requires 10x fewer users for the same statistical power).
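The alternating scheme above amounts to a merge that skips duplicates and remembers which model gets credit for each slot. A simplified sketch (production systems typically use team-draft interleaving with randomized turn order):

```python
def interleave(ranking_a, ranking_b):
    """Alternate results from two rankers into one list.

    Returns (merged_results, credit), where credit[i] records which
    model contributed position i, so clicks can be attributed later.
    Duplicate items are shown once, credited to whoever placed them first.
    """
    merged, credit, seen = [], [], set()
    ia = ib = 0
    turn_a = True
    while ia < len(ranking_a) or ib < len(ranking_b):
        if turn_a and ia < len(ranking_a):
            item, ia, model = ranking_a[ia], ia + 1, "A"
        elif ib < len(ranking_b):
            item, ib, model = ranking_b[ib], ib + 1, "B"
        else:
            item, ia, model = ranking_a[ia], ia + 1, "A"
        turn_a = not turn_a
        if item not in seen:  # each item appears only once in the page
            seen.add(item)
            merged.append(item)
            credit.append(model)
    return merged, credit
```

At evaluation time you count clicks per credited model across sessions; whichever model's contributions attract more clicks wins.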

Key Talking Points
  • "Offline metrics can disagree with online metrics. A model with better NDCG might have worse user engagement because it optimizes for relevance without considering diversity (users get bored seeing similar results)."
  • "Guard against novelty effects: Users might click more on a new ranking initially because it's different, not because it's better. Run experiments for at least 2 weeks."
  • "Long-term metrics matter: A ranking change might boost short-term CTR but reduce long-term retention. Track both."

MEDIUM Amazon Google Microsoft
Q7: Explain Model Serving Infrastructure (vLLM, TGI, TensorRT-LLM)

The Serving Stack

API Gateway (rate limiting, auth)
    → Load Balancer (route to least-loaded GPU)
        → Serving Framework (vLLM / TGI / TensorRT-LLM)
            → GPU Inference (model loaded in GPU memory)
                → Response Streaming (SSE / WebSocket)

Framework Comparison

Feature               vLLM                       TGI (HuggingFace)                TensorRT-LLM (NVIDIA)
Key Innovation        PagedAttention             Production-ready, easy deploy    Kernel-level optimization
Performance           High                       Good                             Highest (NVIDIA-specific)
Ease of Use           pip install                Docker image                     Complex build process
Hardware              Any GPU                    Any GPU                          NVIDIA only
Continuous Batching   Yes                        Yes                              Yes
Quantization          GPTQ, AWQ, FP8             GPTQ, bitsandbytes               INT8, INT4, FP8 (native)
Best For              General use, flexibility   Quick deployment                 Maximum throughput

Auto-Scaling Strategy

  • Metric: Scale on GPU utilization + request queue depth (not CPU, which is misleading for GPU workloads)
  • Scale-up: When queue depth > threshold for > 30 seconds
  • Scale-down: When GPU utilization < 20% for > 5 minutes (aggressive cooldown to save costs)
  • Minimum replicas: Always keep 1+ warm (cold start for loading model weights = 30-120 seconds)
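These rules translate directly into a decision function. The thresholds below are the illustrative numbers from the bullets above, not tuned recommendations:

```python
def scaling_decision(queue_depth, queue_secs, gpu_util, idle_secs, replicas,
                     queue_threshold=10, scale_up_after=30,
                     idle_util=0.20, scale_down_after=300, min_replicas=1):
    """Return the desired replica count for a GPU serving fleet.

    Scale up when the request queue has been deep for long enough;
    scale down only after a sustained idle period, and never below the
    warm minimum (cold starts mean reloading model weights).
    """
    if queue_depth > queue_threshold and queue_secs > scale_up_after:
        return replicas + 1  # sustained backlog: add a replica
    if gpu_util < idle_util and idle_secs > scale_down_after and replicas > min_replicas:
        return replicas - 1  # sustained idleness: shed a replica
    return replicas
```

In practice you would evaluate this on every autoscaler tick and feed the result to your orchestrator (e.g. a Kubernetes HPA with custom metrics).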
Key Talking Points
  • "In practice, I'd start with vLLM for most use cases — it has the best developer experience and PagedAttention gives you 90%+ of TensorRT-LLM's throughput with much less complexity."
  • "For maximum throughput at scale (millions of requests/day), TensorRT-LLM with custom CUDA kernels and FP8 quantization on H100s is the gold standard."
  • "Multi-model serving: If you need to serve multiple models, consider frameworks that support model multiplexing — load multiple LoRA adapters on a single base model rather than running separate instances."
  • "Discuss cost: GPU inference is expensive. A single H100 is ~$2-3/hr. At 50 tokens/sec output, that's ~$0.004 per 100 tokens. Compare to API pricing ($0.01-0.06 per 100 tokens) to decide build-vs-buy."

Frequently Asked Questions

How important is MLOps knowledge for AI engineering interviews?

It's now a core competency, not optional. Even AI labs like OpenAI and Anthropic ask about deployment, monitoring, and evaluation because they ship models to millions of users. At applied AI companies (Amazon, Microsoft, Google), it's often 25-30% of the interview signal.

Do I need to know specific tools like vLLM or MLflow?

Knowing specific tools demonstrates practical experience. But concepts matter more — if you can explain continuous batching, quantization trade-offs, and monitoring strategies, the specific tool names are secondary.

What's the difference between MLOps and traditional DevOps?

MLOps adds three dimensions: (1) data management (versioning, quality, drift), (2) model management (training, evaluation, registry), and (3) experiment tracking (hyperparameters, metrics, reproducibility). DevOps principles (CI/CD, monitoring, infrastructure-as-code) still apply but are extended for ML-specific challenges.


Written by

CallSphere Team

Expert insights on AI voice agents and customer communication automation.
