7 MLOps & AI Deployment Interview Questions for 2026
Real MLOps and AI deployment interview questions from Google, Amazon, Meta, and Microsoft in 2026. Covers CI/CD for ML, model monitoring, quantization, continuous batching, serving infrastructure, and evaluation frameworks.
MLOps in 2026: From "Nice to Have" to "Core Interview Topic"
Two years ago, MLOps questions were optional — asked at infrastructure-heavy companies but skipped at AI labs. In 2026, every AI role includes MLOps because every company is deploying models to production. If you can't get a model from a notebook to a scalable service, you're not a complete AI engineer.
These 7 questions cover the real deployment challenges companies face today.
What They're Really Testing
They want to see that you understand ML CI/CD is fundamentally different from software CI/CD. In software, if the code compiles and tests pass, you're good. In ML, the code can work perfectly but the model can still be garbage.
Pipeline Architecture
Code Change → Linting + Unit Tests
│
▼
Data Validation (schema checks, distribution checks)
│
▼
Model Training (on standardized environment)
│
▼
Model Evaluation
├── Offline Metrics (accuracy, F1, perplexity)
├── Regression Tests (known inputs → expected outputs)
├── Fairness Checks (performance across demographic groups)
└── Performance Benchmarks (latency, throughput, memory)
│
▼
Model Registry (version, tag, artifact store)
│
▼
Staging Deployment → Integration Tests
│
▼
Canary (5% traffic) → Monitor metrics
│
▼
Full Rollout (auto if metrics pass, manual gate option)
Key Differences from Software CI/CD
| Aspect | Software CI/CD | ML CI/CD |
|---|---|---|
| What changes | Code only | Code + data + model weights |
| Tests | Unit + integration tests | + model quality tests + data quality tests |
| Artifact | Docker image | Docker image + model weights + config |
| Rollback trigger | Errors, crashes | + metric degradation, data drift |
| Pipeline trigger | Code push | + data change, scheduled retraining |
Key Talking Points
- Data versioning (DVC, LakeFS) is as important as code versioning. You need to reproduce any past training run.
- Model registry (MLflow, Weights & Biases) tracks model lineage: which data + code + hyperparameters produced this model.
- Canary deployment for ML: Route 5% of traffic to new model, compare key metrics against baseline. Auto-rollback if metrics degrade by >X%.
- Shadow deployment: Run new model in parallel, log predictions but serve old model's predictions. Compare offline before switching.
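The canary logic above can be sketched as a small gate function. This is an illustrative sketch, not a real library API: metric names, the "higher is better" assumption, and the 5% degradation threshold are all placeholders to tune.

```python
def canary_gate(baseline_metrics: dict, canary_metrics: dict,
                max_degradation: float = 0.05) -> str:
    """Compare canary model metrics against the baseline and decide the action.

    Metrics are assumed to be "higher is better" (e.g., accuracy, CTR).
    """
    for name, baseline in baseline_metrics.items():
        canary = canary_metrics[name]
        # Relative degradation of the canary vs. the current production model
        if baseline > 0 and (baseline - canary) / baseline > max_degradation:
            return f"rollback: {name} degraded {(baseline - canary) / baseline:.1%}"
    return "promote"

# Canary CTR dropped 10% vs. baseline, beyond the 5% threshold -> rollback
print(canary_gate({"ctr": 0.20, "accuracy": 0.91},
                  {"ctr": 0.18, "accuracy": 0.90}))
```

In a real pipeline this comparison would also include a statistical significance check, since 5% of traffic can produce noisy metrics.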
Three Types of Drift
1. Data Drift (Covariate Shift)
- The input distribution changes: e.g., your model was trained on US English, but suddenly gets 30% Spanish queries
- Detection: Compare feature distributions between training data and production inputs using KL divergence, PSI (Population Stability Index), or KS test
2. Concept Drift
- The relationship between inputs and outputs changes: e.g., what users consider a "good recommendation" shifts during holiday season
- Detection: Monitor prediction-to-outcome correlation over time
3. Model Performance Drift
- Model accuracy degrades even without data drift: e.g., the world changes (new products, new slang) and the model's knowledge becomes stale
- Detection: Monitor key business metrics (click-through rate, conversion, CSAT) and compare against rolling baselines
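As a concrete instance of the detection methods above, PSI between a training sample and a production sample of one feature is short to compute with numpy. The 10-bin quantile binning and the 0.2 alert threshold are common conventions, not fixed rules.

```python
import numpy as np

def population_stability_index(expected, actual, bins=10):
    """PSI between a reference (training) and a production sample of one feature."""
    # Bin edges from the reference distribution's quantiles
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    e_frac = np.histogram(expected, edges)[0] / len(expected)
    # Clip production values into range so out-of-range points land in edge bins
    a_frac = np.histogram(np.clip(actual, edges[0], edges[-1]), edges)[0] / len(actual)
    # Floor the fractions to avoid log(0) on empty bins
    e_frac = np.clip(e_frac, 1e-6, None)
    a_frac = np.clip(a_frac, 1e-6, None)
    return float(np.sum((a_frac - e_frac) * np.log(a_frac / e_frac)))

rng = np.random.default_rng(0)
train = rng.normal(0, 1, 10_000)
drifted = rng.normal(0.5, 1, 10_000)  # mean shift in production inputs
# A 0.5-sigma mean shift lands comfortably above the common 0.2 alert threshold
print(population_stability_index(train, drifted))
```

The same function, run per feature on a schedule, is the core of the "statistical alert" tier described below.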
Production Monitoring Stack
Production Traffic
│
├── Input Monitoring
│ ├── Feature distribution tracking
│ ├── Missing value rates
│ ├── Schema validation
│ └── Volume monitoring (QPS anomalies)
│
├── Output Monitoring
│ ├── Prediction distribution (confidence scores)
│ ├── Class balance (is the model suddenly predicting one class 99%?)
│ ├── Latency (p50, p95, p99)
│ └── Error rates
│
└── Outcome Monitoring
├── Business metrics correlation
├── Human feedback aggregation
└── Delayed label comparison (when ground truth becomes available)
Key Talking Points
- "The most dangerous drift is silent drift — the model keeps producing outputs with high confidence, but the outputs are wrong because the world has changed. This is why you can't just monitor model confidence; you need ground-truth labels (even sampled/delayed) to catch real degradation."
- "I set up two types of alerts: statistical (distribution has shifted by >X) and business (conversion rate dropped >Y%). Statistical alerts catch drift early; business alerts catch impact."
- Mention tools: Evidently AI, WhyLabs, Arize, or custom Prometheus + Grafana dashboards for monitoring.
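The two alert tiers from the talking points can be wired into one check. This is a sketch with made-up thresholds and metric names; real systems would pull these values from the monitoring stack above.

```python
def evaluate_alerts(psi: float, baseline_cvr: float, current_cvr: float,
                    psi_threshold: float = 0.2,
                    cvr_drop_threshold: float = 0.05) -> list:
    """Return the alerts that fired: statistical (drift) and/or business (impact)."""
    alerts = []
    # Statistical tier: input distribution has shifted beyond threshold
    if psi > psi_threshold:
        alerts.append(("statistical", f"feature drift: PSI={psi:.2f}"))
    # Business tier: conversion rate dropped more than the allowed fraction
    if baseline_cvr > 0 and (baseline_cvr - current_cvr) / baseline_cvr > cvr_drop_threshold:
        alerts.append(("business", "conversion rate dropped beyond threshold"))
    return alerts

# Drift detected while the business metric is still healthy -> early warning only
print(evaluate_alerts(psi=0.35, baseline_cvr=0.04, current_cvr=0.039))
```

The asymmetry is the point: a statistical alert with no business alert is an early warning worth investigating before users feel the impact.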
Why Quantization Matters
A 70B-parameter model in FP16 requires 140 GB of GPU memory for the weights alone, nearly two 80 GB H100s before you even count the KV cache. Quantization compresses model weights to lower precision, reducing memory footprint and speeding up inference.
Quantization Formats
| Format | Bits | Memory (70B) | Quality Loss | Speed Gain |
|---|---|---|---|---|
| FP32 | 32 | 280 GB | Baseline | Baseline |
| FP16/BF16 | 16 | 140 GB | None | 2x |
| FP8 | 8 | 70 GB | Minimal | 3-4x |
| INT8 | 8 | 70 GB | Very small | 3-4x |
| INT4 (GPTQ/AWQ) | 4 | 35 GB | Small-moderate | 5-7x |
| NF4 (QLoRA) | 4 | 35 GB | Small | 5-7x (training) |
Key Techniques
Post-Training Quantization (PTQ):
- Quantize after training with a small calibration dataset
- GPTQ: Layer-by-layer quantization minimizing reconstruction error
- AWQ: Activation-Aware — protects salient weights (high activation channels) from aggressive quantization
Quantization-Aware Training (QAT):
- Simulate quantization during training so the model learns to be robust
- Higher quality but requires full training pipeline
Dynamic vs. Static Quantization:
- Static: Compute scale factors once using calibration data. Faster inference.
- Dynamic: Compute scale factors per batch at runtime. Better quality, slight overhead.
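A minimal sketch of static symmetric INT8 quantization: one scale factor is derived from calibration data, then reused at inference time. Real libraries add per-channel scales and calibrated clipping; this only shows the core round-trip.

```python
import numpy as np

def compute_scale(calibration: np.ndarray) -> float:
    """Static quantization: derive one scale factor from calibration data."""
    return float(np.max(np.abs(calibration))) / 127.0  # map the range onto int8

def quantize(x: np.ndarray, scale: float) -> np.ndarray:
    return np.clip(np.round(x / scale), -127, 127).astype(np.int8)

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
weights = rng.normal(0, 0.02, 1024).astype(np.float32)
scale = compute_scale(weights)
restored = dequantize(quantize(weights, scale), scale)
# For in-range values, round-trip error is at most half a quantization step
print(np.max(np.abs(restored - weights)) <= scale / 2 + 1e-8)
```

Dynamic quantization would call `compute_scale` per batch at runtime instead of once up front, which is exactly the quality/overhead trade-off described above.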
Key Talking Points
- "The rule of thumb: INT8 is nearly lossless for most models. INT4 degrades quality by 1-3% on benchmarks but halves the memory again. For production, INT8 is the sweet spot unless you're extremely memory-constrained."
- "FP8 (E4M3/E5M2) is the emerging standard on H100s and newer GPUs. It has native hardware support, so you get the memory savings of INT8 with better numerical properties for training."
- "AWQ > GPTQ in most benchmarks because it identifies which weight channels have high activation magnitudes and keeps those at higher precision. This preserves the model's most important computation paths."
- "Quantization + speculative decoding stack: quantize both draft and target models, getting compound speedups."
Static Batching (The Old Way)
Request A (10 tokens) ████████████████████░░░░░░░░░░ (waits)
Request B (30 tokens) ████████████████████████████████████████████████████████████
Request C (5 tokens) ██████████░░░░░░░░░░░░░░░░░░░░ (waits a LOT)
All three must wait for the longest request (B) to finish.
The batch slots for A and C sit idle after their tokens are generated.
Continuous Batching (The Modern Way)
Iteration 1: Process [A, B, C] together
Iteration 2: A finishes → replace with new Request D
Process [D, B, C] together
Iteration 3: C finishes → replace with Request E
Process [D, B, E] together
Key insight: As soon as one request in the batch finishes generating, a new request takes its slot. The GPU is never idle waiting for the longest request.
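That key insight can be demonstrated with a toy iteration-level scheduler. This is a simulation of the scheduling idea only, not how any real engine is implemented: each step decodes one token per active request, and a finished request's slot is refilled immediately.

```python
from collections import deque

def continuous_batching(requests: dict, batch_size: int) -> int:
    """Toy continuous-batching scheduler.

    `requests` maps request id -> number of tokens to generate.
    Returns the total number of decode iterations used.
    """
    queue = deque(requests.items())
    active = {}  # request id -> tokens remaining
    steps = 0
    while queue or active:
        # Iteration-level scheduling: refill free slots before each step
        while queue and len(active) < batch_size:
            rid, tokens = queue.popleft()
            active[rid] = tokens
        steps += 1
        # One decode step for every active request
        for rid in list(active):
            active[rid] -= 1
            if active[rid] == 0:
                del active[rid]  # slot freed for the next waiting request
    return steps

# A=10, B=30, C=5 tokens with batch size 3: done in 30 steps (B's length),
# and a fourth 25-token request D slips into C's freed slot at no extra cost
print(continuous_batching({"A": 10, "B": 30, "C": 5}, batch_size=3))
print(continuous_batching({"A": 10, "B": 30, "C": 5, "D": 25}, batch_size=3))
```

Under static batching the fourth request would need a second full batch; here it rides along in the slot C vacated at step 5.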
Performance Impact
| Metric | Static Batching | Continuous Batching |
|---|---|---|
| GPU Utilization | 30-50% | 80-95% |
| Throughput | Baseline | 2-3x higher |
| Latency variance | Very high (short reqs wait for long) | Low (each req finishes independently) |
How vLLM Implements This
vLLM combines continuous batching with PagedAttention:
- KV cache managed as virtual memory pages (not contiguous blocks)
- New requests can be inserted without pre-allocating maximum sequence length
- Memory waste reduced by ~55% vs. static allocation
Key Talking Points
- "The key implementation challenge is iteration-level scheduling — the serving engine must decide at every decoding step which requests are in the current batch. This requires an efficient scheduler that can handle thousands of concurrent requests."
- "Continuous batching pairs well with prefix caching — if multiple requests share the same system prompt, they share the KV cache for that prefix. This is common in production (all requests to a customer support bot share the same system prompt)."
- "Mention specific frameworks: vLLM (PagedAttention, most popular), TGI (HuggingFace), TensorRT-LLM (NVIDIA, best raw performance), SGLang (RadixAttention, fast structured generation)."
End-to-End ML Pipeline
Data Sources (S3/DB) → Ingestion (Airflow/Prefect) → Validation (Great Expectations) → Transformation (Feature Store) → Training (GPU cluster, spot) → Evaluation (eval suite + gates) → Registry (MLflow) → Serving (K8s + vLLM/TGI)
Component Choices
| Component | Tool Options | Key Consideration |
|---|---|---|
| Orchestration | Airflow, Prefect, Kubeflow Pipelines | DAG management, retry logic, scheduling |
| Data Validation | Great Expectations, Pandera | Schema + distribution checks before training |
| Feature Store | Feast, Tecton, Vertex AI | Offline/online feature consistency |
| Training | SageMaker, Vertex AI, bare K8s + spot GPUs | Cost optimization via spot instances |
| Experiment Tracking | W&B, MLflow, Neptune | Hyperparameter search, metric comparison |
| Model Registry | MLflow, SageMaker Model Registry | Versioning, staging, approval workflows |
| Serving | vLLM, TGI, Triton, SageMaker Endpoints | Auto-scaling, A/B testing, shadow mode |
Pipeline Triggers
- Scheduled: Retrain weekly/monthly on new data
- Data-driven: Trigger when new data exceeds threshold (e.g., 10K new labeled examples)
- Drift-driven: Trigger when monitoring detects data drift or performance degradation
- Manual: Data scientist triggers after experiment validates improvement
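The four trigger types above can be combined into one gate. The thresholds here (10K examples, 0.2 drift score, weekly cadence) mirror the bullets and are illustrative defaults, not recommendations.

```python
from datetime import datetime, timedelta

def should_retrain(new_labeled_examples: int, drift_score: float,
                   last_trained: datetime, manual_request: bool = False,
                   *, min_new_examples: int = 10_000,
                   drift_threshold: float = 0.2,
                   max_age: timedelta = timedelta(days=7)) -> list:
    """Return the reasons (possibly empty) for kicking off the training pipeline."""
    reasons = []
    if datetime.now() - last_trained > max_age:
        reasons.append("scheduled")    # weekly cadence exceeded
    if new_labeled_examples >= min_new_examples:
        reasons.append("data-driven")  # enough fresh labels accumulated
    if drift_score > drift_threshold:
        reasons.append("drift-driven") # monitoring fired
    if manual_request:
        reasons.append("manual")       # data scientist requested a run
    return reasons

# Enough new labels, no drift, trained 2 days ago -> data-driven only
print(should_retrain(new_labeled_examples=12_000, drift_score=0.05,
                     last_trained=datetime.now() - timedelta(days=2)))
```

Returning the reasons rather than a boolean helps downstream: a drift-driven run might warrant extra evaluation scrutiny compared to a routine scheduled one.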
Key Talking Points
- "The hardest part isn't building the pipeline — it's building the evaluation gates. Every pipeline stage needs a go/no-go decision: Is the data quality good enough to train? Is the model quality good enough to deploy? These gates prevent bad models from reaching production."
- "Cost optimization is critical: Use spot/preemptible instances for training (3-5x cheaper), with checkpointing for fault tolerance. For serving, right-size GPU instances — don't use an A100 for a model that fits on a T4."
- At Amazon: tie to Leadership Principles — "Frugality" means cost-optimized infrastructure, "Bias for Action" means automated pipelines over manual deployments.
Offline Evaluation
Metrics:
- NDCG (Normalized Discounted Cumulative Gain): Measures ranking quality — are the best items at the top?
- MAP (Mean Average Precision): Average precision across all relevant items
- MRR (Mean Reciprocal Rank): How far down is the first relevant result?
Methodology:
- Hold-out test set from recent data (not randomly sampled — temporal split to avoid leakage)
- Compute metrics on the test set for both old and new model
- Statistical significance testing (paired t-test or bootstrap confidence intervals)
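The ranking metrics above follow directly from their definitions. This sketch uses binary relevance for MRR and graded relevance for NDCG, with relevance lists given in rank order.

```python
import math

def mrr(ranked_relevance: list) -> float:
    """Mean reciprocal rank over queries; each list is binary relevance in rank order."""
    total = 0.0
    for rels in ranked_relevance:
        # Reciprocal rank of the first relevant result, 0 if none
        total += next((1 / (i + 1) for i, r in enumerate(rels) if r), 0.0)
    return total / len(ranked_relevance)

def ndcg(rels: list, k: int) -> float:
    """NDCG@k for one query with graded relevance scores in rank order."""
    dcg = sum(r / math.log2(i + 2) for i, r in enumerate(rels[:k]))
    # Ideal DCG: the same grades sorted best-first
    ideal = sum(r / math.log2(i + 2)
                for i, r in enumerate(sorted(rels, reverse=True)[:k]))
    return dcg / ideal if ideal > 0 else 0.0

print(mrr([[0, 1, 0], [1, 0, 0]]))  # (1/2 + 1) / 2 = 0.75
print(ndcg([3, 2, 0, 1], k=4))      # < 1.0: the grade-1 item sits below a grade-0
```

For the significance step, the per-query metric values from old and new models are exactly the paired samples a paired t-test or bootstrap operates on.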
Online Evaluation (A/B Testing)
Production Traffic
│
├── 50% → Control (current model)
│ Measure: CTR, engagement, revenue
│
└── 50% → Treatment (new model)
Measure: CTR, engagement, revenue
→ Statistical test after N days/users → Ship or revert
Interleaving (The Meta Approach)
Instead of splitting users between models, interleave results from both models in a single result list for each user:
Position 1: Model A's top result
Position 2: Model B's top result
Position 3: Model A's 2nd result
Position 4: Model B's 2nd result
...
Count which model's results get more clicks → more sensitive than traditional A/B testing (requires 10x fewer users for the same statistical power).
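The alternation described above can be sketched as follows, tagging each position with its source model so clicks can be credited. Production systems typically use team-draft interleaving, which randomizes who picks first each round; this sketch shows the simpler fixed alternation with duplicate-skipping.

```python
def interleave(results_a: list, results_b: list, k: int) -> list:
    """Alternate picks from two rankings into one list, skipping duplicates.

    Returns (item, source_model) pairs.
    """
    merged, seen = [], set()
    pools = {"A": list(results_a), "B": list(results_b)}
    turn = 0
    while len(merged) < k and any(pools.values()):
        model = "A" if turn % 2 == 0 else "B"
        turn += 1
        pool = pools[model]
        while pool:
            item = pool.pop(0)
            if item not in seen:  # skip results the other model already placed
                seen.add(item)
                merged.append((item, model))
                break
    return merged

def credit_clicks(merged: list, clicked: set) -> dict:
    """Count clicks per source model from one interleaved impression."""
    counts = {"A": 0, "B": 0}
    for item, model in merged:
        if item in clicked:
            counts[model] += 1
    return counts

page = interleave(["x", "y", "z"], ["y", "w", "z"], k=4)
print(page)                              # [('x','A'), ('y','B'), ('z','A'), ('w','B')]
print(credit_clicks(page, {"y", "w"}))   # {'A': 0, 'B': 2}
```

Aggregating these per-impression credits across users gives the click preference that the interleaving comparison is built on.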
Key Talking Points
- "Offline metrics can disagree with online metrics. A model with better NDCG might have worse user engagement because it optimizes for relevance without considering diversity (users get bored seeing similar results)."
- "Guard against novelty effects: Users might click more on a new ranking initially because it's different, not because it's better. Run experiments for at least 2 weeks."
- "Long-term metrics matter: A ranking change might boost short-term CTR but reduce long-term retention. Track both."
The Serving Stack
API Gateway (rate limiting, auth)
→ Load Balancer (route to least-loaded GPU)
→ Serving Framework (vLLM / TGI / TensorRT-LLM)
→ GPU Inference (model loaded in GPU memory)
→ Response Streaming (SSE / WebSocket)
Framework Comparison
| Feature | vLLM | TGI (HuggingFace) | TensorRT-LLM (NVIDIA) |
|---|---|---|---|
| Key Innovation | PagedAttention | Production-ready, easy deploy | Kernel-level optimization |
| Performance | High | Good | Highest (NVIDIA-specific) |
| Ease of Use | pip install | Docker image | Complex build process |
| Hardware | Any GPU | Any GPU | NVIDIA only |
| Continuous Batching | Yes | Yes | Yes |
| Quantization | GPTQ, AWQ, FP8 | GPTQ, bitsandbytes | INT8, INT4, FP8 (native) |
| Best For | General use, flexibility | Quick deployment | Maximum throughput |
Auto-Scaling Strategy
- Metric: Scale on GPU utilization + request queue depth (not CPU, which is misleading for GPU workloads)
- Scale-up: When queue depth > threshold for > 30 seconds
- Scale-down: When GPU utilization < 20% for > 5 minutes (aggressive cooldown to save costs)
- Minimum replicas: Always keep 1+ warm (cold start for loading model weights = 30-120 seconds)
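The scaling rules above can be expressed as one decision function. The window lengths and thresholds mirror the bullets and are starting points to tune, not fixed values.

```python
def scaling_decision(queue_depth_samples: list, gpu_util_samples: list,
                     replicas: int, *, queue_threshold: int = 10,
                     scaleup_window_s: int = 30, util_floor: float = 0.20,
                     scaledown_window_s: int = 300, min_replicas: int = 1,
                     sample_interval_s: int = 1) -> str:
    """Return 'scale_up', 'scale_down', or 'hold'.

    Sample lists hold the most recent measurements, one per `sample_interval_s`.
    """
    up_n = scaleup_window_s // sample_interval_s
    down_n = scaledown_window_s // sample_interval_s
    # Scale up: queue stayed deep for the whole 30 s window
    recent_q = queue_depth_samples[-up_n:]
    if len(recent_q) >= up_n and all(q > queue_threshold for q in recent_q):
        return "scale_up"
    # Scale down: GPU stayed idle for the whole 5 min window, above the warm minimum
    recent_u = gpu_util_samples[-down_n:]
    if (replicas > min_replicas and len(recent_u) >= down_n
            and all(u < util_floor for u in recent_u)):
        return "scale_down"
    return "hold"

# 30 seconds of deep queue -> add a replica
print(scaling_decision([15] * 30, [0.9] * 30, replicas=2))
```

Note the asymmetry: scale-up reacts in seconds to protect latency, while scale-down waits minutes and never drops below the warm minimum, because reloading model weights on a cold start takes 30-120 seconds.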
Key Talking Points
- "In practice, I'd start with vLLM for most use cases — it has the best developer experience and PagedAttention gives you 90%+ of TensorRT-LLM's throughput with much less complexity."
- "For maximum throughput at scale (millions of requests/day), TensorRT-LLM with custom CUDA kernels and FP8 quantization on H100s is the gold standard."
- "Multi-model serving: If you need to serve multiple models, consider frameworks that support model multiplexing — load multiple LoRA adapters on a single base model rather than running separate instances."
- "Discuss cost: GPU inference is expensive. A single H100 is ~$2-3/hr. At 50 tokens/sec of output, that's roughly $0.0014 per 100 tokens ($2.5/hr ÷ 3,600 s ÷ 50 tok/s). Compare to API pricing ($0.01-0.06 per 100 tokens) to decide build-vs-buy."
Frequently Asked Questions
How important is MLOps knowledge for AI engineering interviews?
It's now a core competency, not optional. Even AI labs like OpenAI and Anthropic ask about deployment, monitoring, and evaluation because they ship models to millions of users. At applied AI companies (Amazon, Microsoft, Google), it's often 25-30% of the interview signal.
Do I need to know specific tools like vLLM or MLflow?
Knowing specific tools demonstrates practical experience. But concepts matter more — if you can explain continuous batching, quantization trade-offs, and monitoring strategies, the specific tool names are secondary.
What's the difference between MLOps and traditional DevOps?
MLOps adds three dimensions: (1) data management (versioning, quality, drift), (2) model management (training, evaluation, registry), and (3) experiment tracking (hyperparameters, metrics, reproducibility). DevOps principles (CI/CD, monitoring, infrastructure-as-code) still apply but are extended for ML-specific challenges.
Written by
CallSphere Team
Expert insights on AI voice agents and customer communication automation.