Distillation Attacks and Model Extraction: How Attackers Steal LLMs and How to Defend
Understanding how model extraction attacks work against commercial LLMs, the legal and technical landscape, and defense strategies including watermarking, rate limiting, and output perturbation.
The Model Theft Problem
Training a frontier LLM costs tens to hundreds of millions of dollars. Yet the knowledge encoded in that model can be extracted through its API at a fraction of the cost. Model extraction -- also called model stealing or distillation attacks -- is a growing concern for AI providers and enterprises alike.
In early 2026, this moved from academic concern to real-world controversy when multiple open-source models were found to have been trained primarily on outputs from proprietary models, violating terms of service and raising intellectual property questions.
How Distillation Attacks Work
The basic attack is straightforward:
- Generate a large dataset of prompts covering the target model's capabilities
- Query the target model's API to get responses for each prompt
- Train a smaller model on these (prompt, response) pairs to mimic the target
```python
# Simplified distillation attack
prompts = generate_diverse_prompts(count=1_000_000)

# Query the target model
training_data = []
for prompt in prompts:
    response = target_api.generate(prompt)
    training_data.append({"input": prompt, "output": response})

# Train student model
student_model.fine_tune(training_data)
```
The student model learns to approximate the teacher's behavior without access to the teacher's weights, training data, or architecture details.
Attack Sophistication Levels
**Level 1: Naive Distillation.** Query the API with random prompts. Cheap but inefficient -- many prompts produce generic responses that do not transfer useful knowledge.
**Level 2: Active Learning.** Strategically select prompts that maximize information extraction: query near decision boundaries, generate adversarial examples, and focus on capability areas where the student is weakest.
**Level 3: Logit Extraction.** If the API exposes token probabilities (logprobs), the attacker gains a much richer training signal: full probability distributions transfer more knowledge than single text completions.
**Level 4: Reinforcement from Comparisons.** Use the target model as a reward signal: generate multiple responses with the student model, have the target model rank them, and use the rankings as a training signal (similar to RLHF).
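The jump from Level 2 to Level 3 is worth making concrete: with full distributions, the student can be trained against a KL-divergence loss rather than plain cross-entropy on sampled text. A minimal pure-Python sketch (the distributions are made up for illustration):

```python
import math

def kl_divergence(teacher_probs, student_probs):
    """KL(teacher || student): the per-position distillation loss."""
    return sum(p * math.log(p / q)
               for p, q in zip(teacher_probs, student_probs) if p > 0)

# Hypothetical top-5 next-token distributions over the same candidates.
teacher = [0.70, 0.15, 0.08, 0.05, 0.02]
student = [0.40, 0.30, 0.15, 0.10, 0.05]

# Full-distribution signal: every candidate token carries gradient.
loss_full = kl_divergence(teacher, student)

# Text-only signal: the sampled completion collapses to a one-hot target,
# i.e. cross-entropy on the argmax token alone.
hard_target = [1.0 if p == max(teacher) else 0.0 for p in teacher]
loss_hard = kl_divergence(hard_target, student)
```

The hard-label loss sees only which token won; the soft loss also encodes how close the runners-up were, which is exactly the knowledge that transfers in distillation.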
Cost of Extraction
| Target Model | Estimated Training Cost | Extraction API Cost (approximate) |
|---|---|---|
| GPT-4 class | $100M+ | $50K-500K (depending on quality target) |
| Claude Sonnet class | $50M+ | $30K-200K |
| Specialized fine-tuned model | $10K-1M | $1K-50K |
The economics are stark: against these estimates, extraction costs roughly two to three orders of magnitude less than original training.
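The API-cost column can be sanity-checked with back-of-envelope arithmetic. The sketch below assumes purely illustrative numbers (one million prompts, ~100 input and ~500 output tokens each, $3 / $15 per million tokens); real costs scale with the quality target, query count, and provider pricing:

```python
def extraction_api_cost(num_prompts, avg_prompt_tokens, avg_response_tokens,
                        price_in_per_mtok, price_out_per_mtok):
    """Back-of-envelope USD cost of building a distillation dataset.
    Prices are per million tokens; all inputs here are assumptions."""
    input_cost = num_prompts * avg_prompt_tokens / 1e6 * price_in_per_mtok
    output_cost = num_prompts * avg_response_tokens / 1e6 * price_out_per_mtok
    return input_cost + output_cost

# Hypothetical single pass: 1M prompts at illustrative frontier-API prices.
cost = extraction_api_cost(1_000_000, 100, 500, 3.0, 15.0)
```

A single pass at these assumed prices lands under $10K; the higher figures in the table correspond to larger datasets, longer outputs, and multiple rounds of active-learning queries.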
Defense Strategies
1. Rate Limiting and Usage Monitoring
The most basic defense: limit how many tokens a single user or API key can consume.
```python
# Detect extraction patterns
EXTRACTION_SIGNALS = [
    "high_volume_diverse_prompts",     # Many different topics rapidly
    "systematic_prompt_variation",     # Same prompt with minor tweaks
    "unusual_output_length_patterns",  # Always requesting max tokens
    "no_conversational_context",       # Each request is independent
    "automated_request_patterns",      # Uniform timing between requests
]
```
When extraction signals are detected, throttle the account or require additional verification.
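One of these signals, uniform request timing, is easy to quantify: scripted extraction tends to have near-constant gaps between requests, while human traffic is bursty. A sketch using the coefficient of variation of inter-request gaps (the threshold and function names are hypothetical, not a production detector):

```python
import statistics

def timing_uniformity(timestamps):
    """Coefficient of variation (stdev / mean) of inter-request gaps.
    Near zero means suspiciously regular, machine-like timing."""
    gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
    mean_gap = statistics.mean(gaps)
    return statistics.pstdev(gaps) / mean_gap if mean_gap else 0.0

def looks_automated(timestamps, cv_threshold=0.1):
    """Flag accounts whose request timing is too uniform to be human."""
    return timing_uniformity(timestamps) < cv_threshold

bot_times = [i * 2.0 for i in range(50)]              # one request every 2s
human_times = [0, 1.2, 7.5, 8.0, 31.0, 33.5, 90.0]    # bursty, irregular
```

In practice this would be one feature among several; a careful attacker can jitter request timing, so it should never be the only signal.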
2. Output Perturbation
Introduce subtle, imperceptible modifications to model outputs that degrade the quality of distilled models:
- Return token probabilities (where exposed) with small random perturbations, so low-confidence tokens carry noisy rather than exact values
- Occasionally rephrase outputs in ways that introduce noise for training but are imperceptible to users
- Vary output format and style in ways that make training data inconsistent
The challenge: perturbation must not degrade the experience for legitimate users.
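As an illustration of the first bullet, the sketch below adds small multiplicative noise to an exposed probability distribution and renormalizes. With a small epsilon the top-ranked token keeps its rank, so sampled text is unchanged for users, while the reported probabilities become noisy distillation targets (all parameter values are illustrative):

```python
import random

def perturb_distribution(probs, epsilon=0.02, seed=None):
    """Multiply each token probability by (1 ± epsilon) noise, then
    renormalize. Rankings are typically preserved; exact values are not."""
    rng = random.Random(seed)
    noisy = [p * (1 + rng.uniform(-epsilon, epsilon)) for p in probs]
    total = sum(noisy)
    return [p / total for p in noisy]

original = [0.70, 0.20, 0.10]
perturbed = perturb_distribution(original, epsilon=0.02, seed=1)
```

The epsilon budget is exactly the trade-off named above: too small and the distilled model barely suffers; too large and legitimate logprob consumers (e.g. calibration or scoring pipelines) see degraded values.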
3. Watermarking
Embed detectable patterns in model outputs that survive distillation:
- Statistical watermarks: Subtly bias token selection in ways that are undetectable per-response but statistically detectable across thousands of responses
- Semantic watermarks: Encode patterns in the reasoning structure that transfer to distilled models
- Proof of provenance: Enable model providers to demonstrate that a competitor's model was trained on their outputs
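A minimal sketch of the first idea, in the spirit of green-list schemes from the watermarking literature: a keyed hash partitions the vocabulary per context, generation prefers "green" tokens, and detection computes a z-score over many tokens. Everything below is a toy illustration with a toy vocabulary, not a production scheme:

```python
import hashlib
import math

def is_green(prev_token, token, key="secret"):
    """Keyed pseudorandom partition: ~half the vocabulary is 'green'
    for any given preceding token."""
    digest = hashlib.sha256(f"{key}:{prev_token}:{token}".encode()).digest()
    return digest[0] < 128

def biased_pick(prev_token, candidates, key="secret"):
    """Watermarked sampling: prefer a green continuation when one exists."""
    greens = [c for c in candidates if is_green(prev_token, c, key)]
    return greens[0] if greens else candidates[0]

def detect_z(tokens, key="secret", fraction=0.5):
    """z-score of the green-token rate vs. the unwatermarked expectation.
    Any single response looks normal; thousands of tokens give a strong
    statistical signal -- including in a model distilled from the outputs."""
    n = len(tokens) - 1
    hits = sum(is_green(a, b, key) for a, b in zip(tokens, tokens[1:]))
    return (hits - fraction * n) / math.sqrt(n * fraction * (1 - fraction))

# Toy vocabulary and a watermarked 101-token sequence.
vocab = [f"tok{i}" for i in range(20)]
seq = ["tok0"]
for _ in range(100):
    seq.append(biased_pick(seq[-1], vocab))
```

Because the partition is keyed, only the provider can run `detect_z`, which is what makes the proof-of-provenance claim in the third bullet possible.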
4. Terms of Service and Legal Action
All major API providers prohibit using outputs to train competing models. In 2025-2026, several legal actions have been filed based on these terms. However, enforcement is challenging:
- Proving a model was trained on specific API outputs is technically difficult
- Jurisdiction varies globally
- Open-source model training data provenance is often opaque
5. Reducing Information Leakage
- Remove logprobs from API responses unless specifically needed (many providers now do this by default)
- Limit output length to prevent extraction of long-form reasoning chains
- Fingerprint outputs with unique per-user patterns that enable tracing
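The last bullet can be sketched as deterministic, per-user stylistic choices: a keyed hash of the user id picks between interchangeable phrasings, so the same account always receives the same pattern, and leaked or distilled outputs can be traced back to an API key. The word pairs and function names below are hypothetical:

```python
import hashlib

# Hypothetical pairs of interchangeable phrasings.
VARIANTS = [("utilize", "use"), ("thus", "therefore"), ("begin", "commence")]

def fingerprint(text, user_id, key="provider-secret"):
    """Rewrite text with a per-user choice for each variant pair.
    The choice pattern is invisible to readers but keyed to the account."""
    seed = hashlib.sha256(f"{key}:{user_id}".encode()).digest()
    words = text.split()
    for i, (a, b) in enumerate(VARIANTS):
        keep = a if seed[i] % 2 == 0 else b
        drop = b if keep == a else a
        words = [keep if w == drop else w for w in words]
    return " ".join(words)
```

A real implementation would need far more variant sites and robustness to paraphrasing, but the principle is the same: many independent low-entropy choices compound into a high-confidence trace.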
The Ethical Dimension
The distillation debate touches on fundamental questions:
- Should AI outputs be copyrightable? If so, training on them without permission may constitute infringement
- Does knowledge distillation differ ethically from a student learning from a textbook?
- Should open-source models that were distilled from proprietary models be treated differently?
There are no settled answers, but the industry is moving toward stronger protections and clearer norms around attribution and consent.