Distillation Attacks and Model Extraction: How Attackers Steal LLMs and How to Defend
Understanding how model extraction attacks work against commercial LLMs, the legal and technical landscape, and defense strategies including watermarking, rate limiting, and output perturbation.
The Model Theft Problem
Training a frontier LLM costs tens to hundreds of millions of dollars. Yet the knowledge encoded in that model can be extracted through its API at a fraction of the cost. Model extraction -- also called model stealing or distillation attacks -- is a growing concern for AI providers and enterprises alike.
In early 2026, this moved from academic concern to real-world controversy when multiple open-source models were found to have been trained primarily on outputs from proprietary models, violating terms of service and raising intellectual property questions.
How Distillation Attacks Work
The basic attack is straightforward:
- Generate a large dataset of prompts covering the target model's capabilities
- Query the target model's API to get responses for each prompt
- Train a smaller model on these (prompt, response) pairs to mimic the target
```python
# Simplified distillation attack
prompts = generate_diverse_prompts(count=1_000_000)

# Query the target model
training_data = []
for prompt in prompts:
    response = target_api.generate(prompt)
    training_data.append({"input": prompt, "output": response})

# Train student model
student_model.fine_tune(training_data)
```
The student model learns to approximate the teacher's behavior without access to the teacher's weights, training data, or architecture details.
Attack Sophistication Levels
**Level 1: Naive Distillation.** Query the API with random prompts. Cheap but inefficient -- many prompts produce generic responses that do not transfer useful knowledge.
**Level 2: Active Learning.** Strategically select prompts that maximize information extraction: query near decision boundaries, generate adversarial examples, and focus on capability areas where the student is weakest.
**Level 3: Logit Extraction.** If the API exposes token probabilities (logprobs), the attacker gains a much richer training signal: full probability distributions transfer more knowledge than single text completions.
**Level 4: Reinforcement from Comparisons.** Use the target model as a reward signal: generate multiple responses with the student model, have the target model rank them, and use the rankings as a training signal (similar to RLHF).
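The jump from Level 2 to Level 3 is worth making concrete: with full distributions, the student can be trained against a KL-divergence loss rather than plain cross-entropy on sampled text. A minimal pure-Python sketch (the distributions are made up for illustration):

```python
import math

def kl_divergence(teacher_probs, student_probs):
    """KL(teacher || student): the per-position distillation loss."""
    return sum(p * math.log(p / q)
               for p, q in zip(teacher_probs, student_probs) if p > 0)

# Hypothetical top-5 next-token distributions over the same candidates.
teacher = [0.70, 0.15, 0.08, 0.05, 0.02]
student = [0.40, 0.30, 0.15, 0.10, 0.05]

# Full-distribution signal: every candidate token carries gradient.
loss_full = kl_divergence(teacher, student)

# Text-only signal: the sampled completion collapses to a one-hot target,
# i.e. cross-entropy on the argmax token alone.
hard_target = [1.0 if p == max(teacher) else 0.0 for p in teacher]
loss_hard = kl_divergence(hard_target, student)
```

The hard-label loss sees only which token won; the soft loss also encodes how close the runners-up were, which is exactly the knowledge that transfers in distillation.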
Cost of Extraction
| Target Model | Estimated Training Cost | Extraction API Cost (approximate) |
|---|---|---|
| GPT-4 class | $100M+ | $50K-500K (depending on quality target) |
| Claude Sonnet class | $50M+ | $30K-200K |
| Specialized fine-tuned model | $10K-1M | $1K-50K |
The economics are stark: against these estimates, extraction costs roughly two to three orders of magnitude less than original training.
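The API-cost column can be sanity-checked with back-of-envelope arithmetic. The sketch below assumes purely illustrative numbers (one million prompts, ~100 input and ~500 output tokens each, $3 / $15 per million tokens); real costs scale with the quality target, query count, and provider pricing:

```python
def extraction_api_cost(num_prompts, avg_prompt_tokens, avg_response_tokens,
                        price_in_per_mtok, price_out_per_mtok):
    """Back-of-envelope USD cost of building a distillation dataset.
    Prices are per million tokens; all inputs here are assumptions."""
    input_cost = num_prompts * avg_prompt_tokens / 1e6 * price_in_per_mtok
    output_cost = num_prompts * avg_response_tokens / 1e6 * price_out_per_mtok
    return input_cost + output_cost

# Hypothetical single pass: 1M prompts at illustrative frontier-API prices.
cost = extraction_api_cost(1_000_000, 100, 500, 3.0, 15.0)
```

A single pass at these assumed prices lands under $10K; the higher figures in the table correspond to larger datasets, longer outputs, and multiple rounds of active-learning queries.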
Defense Strategies
1. Rate Limiting and Usage Monitoring
The most basic defense: limit how many tokens a single user or API key can consume.
```python
# Detect extraction patterns
EXTRACTION_SIGNALS = [
    "high_volume_diverse_prompts",     # Many different topics rapidly
    "systematic_prompt_variation",     # Same prompt with minor tweaks
    "unusual_output_length_patterns",  # Always requesting max tokens
    "no_conversational_context",       # Each request is independent
    "automated_request_patterns",      # Uniform timing between requests
]
```
When extraction signals are detected, throttle the account or require additional verification.
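One of these signals, uniform request timing, is easy to quantify: scripted extraction tends to have near-constant gaps between requests, while human traffic is bursty. A sketch using the coefficient of variation of inter-request gaps (the threshold and function names are hypothetical, not a production detector):

```python
import statistics

def timing_uniformity(timestamps):
    """Coefficient of variation (stdev / mean) of inter-request gaps.
    Near zero means suspiciously regular, machine-like timing."""
    gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
    mean_gap = statistics.mean(gaps)
    return statistics.pstdev(gaps) / mean_gap if mean_gap else 0.0

def looks_automated(timestamps, cv_threshold=0.1):
    """Flag accounts whose request timing is too uniform to be human."""
    return timing_uniformity(timestamps) < cv_threshold

bot_times = [i * 2.0 for i in range(50)]              # one request every 2s
human_times = [0, 1.2, 7.5, 8.0, 31.0, 33.5, 90.0]    # bursty, irregular
```

In practice this would be one feature among several; a careful attacker can jitter request timing, so it should never be the only signal.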
2. Output Perturbation
Introduce subtle, imperceptible modifications to model outputs that degrade the quality of distilled models:
- Return token probabilities (where exposed) with small random perturbations, so low-confidence tokens carry noisy rather than exact values
- Occasionally rephrase outputs in ways that introduce noise for training but are imperceptible to users
- Vary output format and style in ways that make training data inconsistent
The challenge: perturbation must not degrade the experience for legitimate users.
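As an illustration of the first bullet, the sketch below adds small multiplicative noise to an exposed probability distribution and renormalizes. With a small epsilon the top-ranked token keeps its rank, so sampled text is unchanged for users, while the reported probabilities become noisy distillation targets (all parameter values are illustrative):

```python
import random

def perturb_distribution(probs, epsilon=0.02, seed=None):
    """Multiply each token probability by (1 ± epsilon) noise, then
    renormalize. Rankings are typically preserved; exact values are not."""
    rng = random.Random(seed)
    noisy = [p * (1 + rng.uniform(-epsilon, epsilon)) for p in probs]
    total = sum(noisy)
    return [p / total for p in noisy]

original = [0.70, 0.20, 0.10]
perturbed = perturb_distribution(original, epsilon=0.02, seed=1)
```

The epsilon budget is exactly the trade-off named above: too small and the distilled model barely suffers; too large and legitimate logprob consumers (e.g. calibration or scoring pipelines) see degraded values.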
3. Watermarking
Embed detectable patterns in model outputs that survive distillation:
- Statistical watermarks: Subtly bias token selection in ways that are undetectable per-response but statistically detectable across thousands of responses
- Semantic watermarks: Encode patterns in the reasoning structure that transfer to distilled models
- Proof of provenance: Enable model providers to demonstrate that a competitor's model was trained on their outputs
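A minimal sketch of the first idea, in the spirit of green-list schemes from the watermarking literature: a keyed hash partitions the vocabulary per context, generation prefers "green" tokens, and detection computes a z-score over many tokens. Everything below is a toy illustration with a toy vocabulary, not a production scheme:

```python
import hashlib
import math

def is_green(prev_token, token, key="secret"):
    """Keyed pseudorandom partition: ~half the vocabulary is 'green'
    for any given preceding token."""
    digest = hashlib.sha256(f"{key}:{prev_token}:{token}".encode()).digest()
    return digest[0] < 128

def biased_pick(prev_token, candidates, key="secret"):
    """Watermarked sampling: prefer a green continuation when one exists."""
    greens = [c for c in candidates if is_green(prev_token, c, key)]
    return greens[0] if greens else candidates[0]

def detect_z(tokens, key="secret", fraction=0.5):
    """z-score of the green-token rate vs. the unwatermarked expectation.
    Any single response looks normal; thousands of tokens give a strong
    statistical signal -- including in a model distilled from the outputs."""
    n = len(tokens) - 1
    hits = sum(is_green(a, b, key) for a, b in zip(tokens, tokens[1:]))
    return (hits - fraction * n) / math.sqrt(n * fraction * (1 - fraction))

# Toy vocabulary and a watermarked 101-token sequence.
vocab = [f"tok{i}" for i in range(20)]
seq = ["tok0"]
for _ in range(100):
    seq.append(biased_pick(seq[-1], vocab))
```

Because the partition is keyed, only the provider can run `detect_z`, which is what makes the proof-of-provenance claim in the third bullet possible.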
4. Terms of Service and Legal Action
All major API providers prohibit using outputs to train competing models. In 2025-2026, several legal actions have been filed based on these terms. However, enforcement is challenging:
- Proving a model was trained on specific API outputs is technically difficult
- Jurisdiction varies globally
- Open-source model training data provenance is often opaque
5. Reducing Information Leakage
- Remove logprobs from API responses unless specifically needed (many providers now do this by default)
- Limit output length to prevent extraction of long-form reasoning chains
- Fingerprint outputs with unique per-user patterns that enable tracing
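The last bullet can be sketched as deterministic, per-user stylistic choices: a keyed hash of the user id picks between interchangeable phrasings, so the same account always receives the same pattern, and leaked or distilled outputs can be traced back to an API key. The word pairs and function names below are hypothetical:

```python
import hashlib

# Hypothetical pairs of interchangeable phrasings.
VARIANTS = [("utilize", "use"), ("thus", "therefore"), ("begin", "commence")]

def fingerprint(text, user_id, key="provider-secret"):
    """Rewrite text with a per-user choice for each variant pair.
    The choice pattern is invisible to readers but keyed to the account."""
    seed = hashlib.sha256(f"{key}:{user_id}".encode()).digest()
    words = text.split()
    for i, (a, b) in enumerate(VARIANTS):
        keep = a if seed[i] % 2 == 0 else b
        drop = b if keep == a else a
        words = [keep if w == drop else w for w in words]
    return " ".join(words)
```

A real implementation would need far more variant sites and robustness to paraphrasing, but the principle is the same: many independent low-entropy choices compound into a high-confidence trace.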
The Ethical Dimension
The distillation debate touches on fundamental questions:
- Should AI outputs be copyrightable? If so, training on them without permission may constitute infringement
- Does knowledge distillation differ ethically from a student learning from a textbook?
- Should open-source models that were distilled from proprietary models be treated differently?
There are no settled answers, but the industry is moving toward stronger protections and clearer norms around attribution and consent.