Fine-Tuning LLMs for Agentic Tasks: When and How to Customize Foundation Models
When fine-tuning beats prompting for AI agents: dataset creation from agent traces, SFT and DPO training approaches, evaluation methodology, and cost-benefit analysis for agentic fine-tuning.
When Fine-Tuning Beats Prompting for Agents
Prompt engineering is the first tool you should reach for when building AI agents. It is faster, cheaper, and easier to iterate on. But there are specific situations where fine-tuning a foundation model delivers dramatically better results for agentic tasks:
Consistent formatting under pressure. When your agent must always produce valid JSON with specific field names, or always follow a particular tool-calling convention, fine-tuning bakes this format into the model's weights rather than relying on instructions that can be ignored under complex reasoning load.
Domain-specific tool selection. An agent operating in a specialized domain (medical coding, financial compliance, industrial control) may need to select from 50+ domain-specific tools. Fine-tuning teaches the model which tool to use for which situation far more reliably than cramming all tool descriptions into the context.
Latency-sensitive deployments. Fine-tuning a smaller model (7B-13B parameters) to match the agentic capabilities of a larger model (70B+) can reduce inference latency by 3-5x while maintaining task-specific accuracy. If your agent needs sub-second response times, this is often the only viable path.
Volume economics. When you are running millions of agent interactions per month, the per-token cost of a smaller fine-tuned model (often 10-20x cheaper than frontier models) compounds into massive savings.
Creating Training Datasets from Agent Traces
The highest-quality training data for agentic fine-tuning comes from your own agent's successful interactions. Here is a systematic approach to collecting and curating this data.
from dataclasses import dataclass, field
from typing import Optional
from datetime import datetime, timezone
import json

@dataclass
class AgentTrace:
    trace_id: str
    task: str
    messages: list[dict]
    tool_calls: list[dict]
    outcome: str  # "success", "failure", "partial"
    human_rating: Optional[float] = None  # 1-5
    # datetime.utcnow() is deprecated; use a timezone-aware UTC timestamp
    timestamp: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )
    metadata: dict = field(default_factory=dict)
class TraceCollector:
"""Collects and curates agent traces for fine-tuning."""
def __init__(self, storage):
self.storage = storage
async def log_trace(self, trace: AgentTrace):
await self.storage.insert({
"trace_id": trace.trace_id,
"task": trace.task,
"messages": trace.messages,
"tool_calls": trace.tool_calls,
"outcome": trace.outcome,
"human_rating": trace.human_rating,
"timestamp": trace.timestamp.isoformat(),
"metadata": trace.metadata,
})
async def export_training_data(
self,
min_rating: float = 4.0,
outcome_filter: str = "success",
max_samples: int = 10000,
) -> list[dict]:
"""Export high-quality traces as training examples."""
traces = await self.storage.query(
filters={
"outcome": outcome_filter,
"human_rating": {"$gte": min_rating},
},
limit=max_samples,
sort_by="human_rating",
sort_order="desc",
)
training_examples = []
for trace in traces:
example = self._trace_to_training_example(trace)
if example:
training_examples.append(example)
return training_examples
def _trace_to_training_example(
self, trace: dict
) -> Optional[dict]:
"""Convert a trace into a chat-format training example."""
messages = trace.get("messages", [])
if len(messages) < 2:
return None
# Filter to keep system prompt + user/assistant turns
training_messages = []
for msg in messages:
role = msg.get("role")
if role in ("system", "user", "assistant", "tool"):
training_messages.append({
"role": role,
"content": msg.get("content", ""),
})
# Include tool calls in assistant messages
if role == "assistant" and msg.get("tool_calls"):
training_messages[-1]["tool_calls"] = (
msg["tool_calls"]
)
return {"messages": training_messages}
class DatasetCurator:
"""Curates and prepares datasets for fine-tuning."""
def __init__(self, llm_client):
self.llm = llm_client
    async def deduplicate(
        self, examples: list[dict]
    ) -> list[dict]:
        """Remove exact-duplicate training examples via content hashing.
        True near-duplicate detection needs embedding or MinHash
        similarity; this hash-based pass only catches verbatim repeats."""
        unique = []
        seen_hashes = set()
        for ex in examples:
            content_hash = self._content_hash(ex)
            if content_hash not in seen_hashes:
                seen_hashes.add(content_hash)
                unique.append(ex)
        return unique
async def augment_with_negatives(
self, positive_examples: list[dict]
) -> list[dict]:
"""Generate contrastive negative examples for DPO."""
augmented = []
for example in positive_examples:
# Generate a plausible but incorrect alternative
negative = await self._generate_negative(example)
augmented.append({
"prompt": self._extract_prompt(example),
"chosen": self._extract_response(example),
"rejected": negative,
})
return augmented
async def _generate_negative(
self, example: dict
) -> str:
"""Generate a plausible but incorrect response."""
prompt = self._extract_prompt(example)
correct = self._extract_response(example)
response = await self.llm.chat(messages=[{
"role": "user",
"content": (
f"Given this prompt and the correct response, "
f"generate a plausible but INCORRECT alternative "
f"response. The incorrect response should have a "
f"subtle error: wrong tool selection, incorrect "
f"parameter, or flawed reasoning.\n\n"
f"Prompt: {prompt}\n\n"
f"Correct response: {correct}\n\n"
f"Generate an incorrect alternative:"
),
}])
return response.content
def _content_hash(self, example: dict) -> str:
import hashlib
content = json.dumps(
example, sort_keys=True, default=str
)
return hashlib.md5(content.encode()).hexdigest()
def _extract_prompt(self, example: dict) -> str:
messages = example.get("messages", [])
user_msgs = [
m["content"] for m in messages if m["role"] == "user"
]
return user_msgs[0] if user_msgs else ""
def _extract_response(self, example: dict) -> str:
messages = example.get("messages", [])
assistant_msgs = [
m["content"] for m in messages
if m["role"] == "assistant"
]
return assistant_msgs[-1] if assistant_msgs else ""
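To make the trace-to-example mapping concrete, here is a minimal standalone sketch that mirrors the filtering logic of `_trace_to_training_example` above. The sample trace, role names beyond the standard four, and the `create_event` tool are illustrative:

```python
from typing import Optional

def trace_to_example(messages: list[dict]) -> Optional[dict]:
    """Keep only roles the chat template understands, carrying
    tool_calls along on assistant turns (mirrors the method above)."""
    if len(messages) < 2:
        return None
    kept = []
    for msg in messages:
        if msg.get("role") in ("system", "user", "assistant", "tool"):
            entry = {"role": msg["role"], "content": msg.get("content", "")}
            if msg["role"] == "assistant" and msg.get("tool_calls"):
                entry["tool_calls"] = msg["tool_calls"]
            kept.append(entry)
    return {"messages": kept}

raw_trace = [
    {"role": "system", "content": "You are a scheduling agent."},
    {"role": "user", "content": "Book a meeting for 3pm."},
    {"role": "assistant", "content": "", "tool_calls": [
        {"name": "create_event", "arguments": {"time": "15:00"}},
    ]},
    {"role": "debug", "content": "internal log line"},  # dropped on export
]
example = trace_to_example(raw_trace)
```

Note that internal roles like `debug` are silently dropped, so exported examples contain only turns the training framework can tokenize.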
Supervised Fine-Tuning (SFT)
SFT is the most straightforward fine-tuning approach: you show the model examples of correct behavior and train it to reproduce that behavior. For agentic tasks, SFT teaches the model the correct tool-calling patterns, output formats, and reasoning chains.
import json
from pathlib import Path
class SFTDatasetPreparator:
"""Prepares datasets for Supervised Fine-Tuning."""
def __init__(self, tokenizer, max_seq_length: int = 4096):
self.tokenizer = tokenizer
self.max_seq_length = max_seq_length
def prepare_chat_dataset(
self, examples: list[dict], output_path: str
):
"""Convert examples to the chat format for SFT."""
processed = []
for ex in examples:
messages = ex.get("messages", [])
# Validate token length
formatted = self.tokenizer.apply_chat_template(
messages, tokenize=False
)
tokens = self.tokenizer.encode(formatted)
if len(tokens) > self.max_seq_length:
# Truncate conversation, keeping system + last turns
messages = self._truncate_conversation(
messages, self.max_seq_length
)
processed.append({"messages": messages})
# Write as JSONL
with open(output_path, "w") as f:
for item in processed:
f.write(json.dumps(item) + "\n")
return {
"total_examples": len(processed),
"output_path": output_path,
}
def prepare_tool_calling_dataset(
self, examples: list[dict], output_path: str
):
"""Prepare dataset specifically for tool-calling fine-tuning.
Each example includes the system prompt with tool definitions,
user query, and correct tool call(s) as the target."""
processed = []
for ex in examples:
messages = ex.get("messages", [])
tools = ex.get("tools", [])
# Ensure tools are included in the system message
system_msg = next(
(m for m in messages if m["role"] == "system"),
None,
)
if system_msg and tools:
system_msg["content"] += (
"\n\nAVAILABLE TOOLS:\n"
+ json.dumps(tools, indent=2)
)
processed.append({
"messages": messages,
"tools": tools,
})
with open(output_path, "w") as f:
for item in processed:
f.write(json.dumps(item) + "\n")
return {"total_examples": len(processed)}
def _truncate_conversation(
self, messages: list[dict], max_tokens: int
) -> list[dict]:
"""Keep system message + most recent turns."""
system = [m for m in messages if m["role"] == "system"]
non_system = [m for m in messages if m["role"] != "system"]
# Keep the last N turns that fit
result = list(system)
for msg in reversed(non_system):
candidate = system + [msg] + [
m for m in result if m["role"] != "system"
]
formatted = self.tokenizer.apply_chat_template(
candidate, tokenize=False
)
if len(self.tokenizer.encode(formatted)) <= max_tokens:
result.insert(len(system), msg)
else:
break
return result
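The truncation strategy (keep the system message, then add the most recent turns until the budget is exhausted) can be sketched standalone with a stand-in token counter. The whitespace-split counter and the sample conversation are illustrative; a real tokenizer will count differently:

```python
def truncate_keep_recent(
    messages: list[dict], max_tokens: int, count_tokens
) -> list[dict]:
    """Keep the system message plus as many trailing turns as fit."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    kept: list[dict] = []
    for msg in reversed(rest):  # walk from most recent to oldest
        candidate = system + [msg] + kept
        if sum(count_tokens(m["content"]) for m in candidate) <= max_tokens:
            kept.insert(0, msg)
        else:
            break
    return system + kept

count = lambda text: len(text.split())  # crude stand-in for a real tokenizer
msgs = [
    {"role": "system", "content": "be brief"},
    {"role": "user", "content": "one two three four five"},
    {"role": "assistant", "content": "six seven"},
    {"role": "user", "content": "eight nine"},
]
out = truncate_keep_recent(msgs, max_tokens=8, count_tokens=count)
```

Here the oldest user turn is dropped because including it would exceed the budget, while the system prompt and the two most recent turns survive.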
SFT Training Configuration
# Example training configuration for SFT with LoRA
sft_config = {
"model_name": "meta-llama/Llama-3-8B-Instruct",
"dataset_path": "./agent_sft_dataset.jsonl",
"output_dir": "./agent-llama-8b-sft",
# LoRA configuration (parameter-efficient fine-tuning)
"lora": {
"r": 64, # LoRA rank
"lora_alpha": 128, # scaling factor
"target_modules": [
"q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj",
],
"lora_dropout": 0.05,
},
# Training hyperparameters
"training": {
"num_epochs": 3,
"batch_size": 4,
"gradient_accumulation_steps": 4,
"learning_rate": 2e-5,
"warmup_ratio": 0.1,
"weight_decay": 0.01,
"max_seq_length": 4096,
"lr_scheduler": "cosine",
},
# Evaluation
"eval_split": 0.1,
"eval_steps": 100,
"save_steps": 200,
}
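One practical sanity check before launching a run is the effective batch size and total optimizer steps the config implies, since those drive warmup and checkpoint spacing. A small helper, using the illustrative numbers from the config above and an assumed 5,000-example dataset:

```python
import math

def training_steps(
    num_examples: int, batch_size: int, grad_accum: int, epochs: int
) -> int:
    """Total optimizer steps for a run."""
    effective_batch = batch_size * grad_accum   # 4 * 4 = 16 here
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return steps_per_epoch * epochs

# With the config above and a 5,000-example dataset:
steps = training_steps(5000, batch_size=4, grad_accum=4, epochs=3)
```

With `warmup_ratio: 0.1`, roughly the first tenth of those steps are spent ramping the learning rate, so very small datasets can finish warmup barely before training ends.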
Direct Preference Optimization (DPO)
DPO aligns the model's outputs with human preferences without requiring a separate reward model. For agentic tasks, DPO teaches the model to prefer correct tool usage, accurate reasoning, and safe behavior over plausible but incorrect alternatives.
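Under the hood, DPO treats the log-probability ratio between the policy and a frozen reference model as an implicit reward. A minimal numeric sketch of the sigmoid-loss variant (illustrative, not a training-framework implementation; log-probabilities here are per-response sums):

```python
import math

def dpo_sigmoid_loss(
    logp_chosen: float, logp_rejected: float,
    ref_logp_chosen: float, ref_logp_rejected: float,
    beta: float = 0.1,
) -> float:
    """-log(sigmoid(margin)) where margin is the implicit reward gap."""
    # Implicit reward = beta * log-ratio against the frozen reference model
    chosen_reward = beta * (logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (logp_rejected - ref_logp_rejected)
    margin = chosen_reward - rejected_reward
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When the policy still matches the reference on both responses, the margin is zero and the loss sits at log 2; raising the chosen response's likelihood relative to the reference drives the loss down.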
class DPODatasetPreparator:
"""Prepares datasets for Direct Preference Optimization."""
def prepare(
self,
preference_pairs: list[dict],
output_path: str,
):
"""Each pair has: prompt, chosen (good), rejected (bad)."""
processed = []
for pair in preference_pairs:
processed.append({
"prompt": pair["prompt"],
"chosen": pair["chosen"],
"rejected": pair["rejected"],
})
with open(output_path, "w") as f:
for item in processed:
f.write(json.dumps(item) + "\n")
return {"total_pairs": len(processed)}
@staticmethod
def create_preference_pairs_from_traces(
successful_traces: list[dict],
failed_traces: list[dict],
) -> list[dict]:
"""Create DPO pairs from successful vs failed traces.
Match traces by similar tasks and use successful as
'chosen' and failed as 'rejected'."""
pairs = []
for success in successful_traces:
# Find a failed trace with a similar task
best_match = None
best_similarity = 0
for failure in failed_traces:
sim = _task_similarity(
success["task"], failure["task"]
)
if sim > best_similarity:
best_similarity = sim
best_match = failure
if best_match and best_similarity > 0.7:
pairs.append({
"prompt": success["task"],
"chosen": _extract_agent_response(success),
"rejected": _extract_agent_response(best_match),
})
return pairs
# DPO training configuration
dpo_config = {
"model_name": "./agent-llama-8b-sft", # start from SFT model
"dataset_path": "./agent_dpo_dataset.jsonl",
"output_dir": "./agent-llama-8b-dpo",
"dpo": {
"beta": 0.1, # KL penalty coefficient
"loss_type": "sigmoid", # or "hinge"
"label_smoothing": 0.0,
},
"training": {
"num_epochs": 1, # DPO needs fewer epochs
"batch_size": 2,
"learning_rate": 5e-6, # lower LR for DPO
"warmup_ratio": 0.1,
"max_seq_length": 4096,
},
}
RLHF: Reinforcement Learning from Human Feedback
RLHF is more complex than SFT or DPO but can produce the most aligned models. It involves training a reward model on human preferences, then using reinforcement learning (typically PPO) to optimize the agent's behavior against that reward model.
class RewardModelTrainer:
"""Trains a reward model for RLHF from human preferences."""
def prepare_reward_dataset(
self,
comparisons: list[dict],
output_path: str,
):
"""Each comparison: prompt, response_a, response_b,
preference (a or b)."""
processed = []
for comp in comparisons:
if comp["preference"] == "a":
chosen = comp["response_a"]
rejected = comp["response_b"]
else:
chosen = comp["response_b"]
rejected = comp["response_a"]
processed.append({
"prompt": comp["prompt"],
"chosen": chosen,
"rejected": rejected,
})
with open(output_path, "w") as f:
for item in processed:
f.write(json.dumps(item) + "\n")
return {"total_comparisons": len(processed)}
# RLHF pipeline configuration
rlhf_config = {
"phases": {
"sft": {
"model": "meta-llama/Llama-3-8B-Instruct",
"dataset": "./agent_sft_dataset.jsonl",
"epochs": 3,
},
"reward_model": {
"model": "meta-llama/Llama-3-8B-Instruct",
"dataset": "./reward_comparisons.jsonl",
"epochs": 1,
},
"ppo": {
"policy_model": "./agent-llama-8b-sft",
"reward_model": "./agent-reward-model",
"ppo_epochs": 4,
"kl_penalty": 0.02,
"clip_range": 0.2,
"batch_size": 64,
"mini_batch_size": 8,
},
},
}
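The `clip_range: 0.2` in the PPO phase refers to the clipped surrogate objective, which caps how far a single update can push the policy. A minimal sketch for one action, with `ratio` being the new-to-old policy probability ratio and `advantage` the estimated advantage (a simplified scalar illustration, not a full PPO step):

```python
def ppo_clipped_objective(
    ratio: float, advantage: float, clip_range: float = 0.2
) -> float:
    """Pessimistic bound: min(r * A, clip(r, 1-eps, 1+eps) * A)."""
    clipped = max(1.0 - clip_range, min(ratio, 1.0 + clip_range))
    return min(ratio * advantage, clipped * advantage)
```

The `kl_penalty` in the config plays a complementary role: it discourages the policy from drifting too far from the SFT model overall, while clipping limits the size of any individual update.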
Evaluation Methodology for Fine-Tuned Agents
Evaluating a fine-tuned agentic model requires task-specific benchmarks, not just general language model benchmarks.
@dataclass
class AgentEvalResult:
task_name: str
success_rate: float
avg_tool_accuracy: float
avg_format_compliance: float
avg_turns_to_complete: float
avg_latency_ms: float
cost_per_task: float
class AgentEvaluator:
"""Evaluates fine-tuned agents on agentic benchmarks."""
def __init__(self, eval_tasks: list[dict]):
self.tasks = eval_tasks
async def evaluate(
self, agent, model_name: str
) -> list[AgentEvalResult]:
results = []
for task in self.tasks:
successes = 0
tool_accuracies = []
format_scores = []
turn_counts = []
latencies = []
            import time  # hoisted out of the per-case loop
            for test_case in task["test_cases"]:
                start = time.perf_counter()  # monotonic; better for latency
                result = await agent.execute(
                    test_case["input"]
                )
                latency = (time.perf_counter() - start) * 1000
latencies.append(latency)
# Check success
if self._check_success(
result, test_case["expected"]
):
successes += 1
# Check tool accuracy
tool_acc = self._check_tool_calls(
result.get("tool_calls", []),
test_case.get("expected_tools", []),
)
tool_accuracies.append(tool_acc)
# Check format compliance
fmt = self._check_format(
result.get("output", ""),
task.get("format_requirements", {}),
)
format_scores.append(fmt)
turn_counts.append(
result.get("turns", 1)
)
n = len(task["test_cases"])
results.append(AgentEvalResult(
task_name=task["name"],
success_rate=successes / n if n else 0,
avg_tool_accuracy=(
sum(tool_accuracies) / len(tool_accuracies)
if tool_accuracies else 0
),
avg_format_compliance=(
sum(format_scores) / len(format_scores)
if format_scores else 0
),
avg_turns_to_complete=(
sum(turn_counts) / len(turn_counts)
if turn_counts else 0
),
avg_latency_ms=(
sum(latencies) / len(latencies)
if latencies else 0
),
cost_per_task=self._estimate_cost(
model_name, turn_counts
),
))
return results
def _check_success(
self, result: dict, expected: dict
) -> bool:
# Compare key fields
for key, value in expected.items():
if result.get(key) != value:
return False
return True
def _check_tool_calls(
self, actual: list, expected: list
) -> float:
if not expected:
return 1.0 if not actual else 0.0
correct = sum(
1 for a, e in zip(actual, expected)
if a.get("name") == e.get("name")
)
return correct / len(expected)
def _check_format(
self, output: str, requirements: dict
) -> float:
if not requirements:
return 1.0
checks_passed = 0
total_checks = len(requirements)
if requirements.get("json_valid"):
try:
json.loads(output)
checks_passed += 1
except (json.JSONDecodeError, ValueError):
pass
if requirements.get("max_length"):
if len(output) <= requirements["max_length"]:
checks_passed += 1
return checks_passed / total_checks if total_checks else 1.0
def _estimate_cost(
self, model: str, turn_counts: list[int]
) -> float:
avg_turns = (
sum(turn_counts) / len(turn_counts)
if turn_counts else 1
)
cost_per_1k_tokens = {
"gpt-4o": 0.005,
"claude-3-5-sonnet": 0.003,
"llama-3-8b-ft": 0.0002,
"llama-3-70b-ft": 0.001,
}
rate = cost_per_1k_tokens.get(model, 0.001)
avg_tokens_per_turn = 500
return avg_turns * avg_tokens_per_turn * rate / 1000
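To show how the individual format checks combine into a single compliance score, here is a standalone sketch mirroring `_check_format` above; the sample requirement values are illustrative:

```python
import json

def format_score(output: str, requirements: dict) -> float:
    """Fraction of format checks passed (mirrors _check_format)."""
    if not requirements:
        return 1.0
    passed = 0
    if requirements.get("json_valid"):
        try:
            json.loads(output)
            passed += 1
        except (json.JSONDecodeError, ValueError):
            pass
    if requirements.get("max_length") is not None:
        passed += len(output) <= requirements["max_length"]
    return passed / len(requirements)

# Valid JSON, but 16 characters against a 10-character limit:
score = format_score('{"status": "ok"}', {"json_valid": True, "max_length": 10})
```

Partial credit like this is deliberate: it lets you see which dimension of formatting regressed after a fine-tune rather than collapsing everything to pass/fail.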
Cost-Benefit Analysis
The decision to fine-tune should be driven by economics as much as capability:
Fine-tuning costs:
- Dataset creation and curation: 40-100 engineer-hours
- Compute for training: $50-500 for LoRA on 7B-13B models, $2,000-10,000 for full fine-tuning on 70B+
- Evaluation and iteration: 20-40 engineer-hours per iteration
- Ongoing maintenance: Re-tuning quarterly as base models update
Fine-tuning benefits (compared to prompting a frontier model):
- 5-20x lower inference cost per token
- 2-5x lower latency
- Higher consistency on format-heavy tasks (95%+ compliance vs 80-90%)
- Better tool selection accuracy on domain-specific tools (+10-30%)
- Can run on-premises for data-sensitive applications
Break-even calculation: If your frontier model costs $0.01/1K tokens and a fine-tuned 8B model costs $0.0005/1K tokens, you save $0.0095 per 1K tokens. If fine-tuning costs $5,000 total (compute + engineering), you break even at approximately 526 million tokens — roughly 2-3 months for a high-volume agent deployment processing 5,000 interactions per day.
FAQ
Should I fine-tune a small model or continue prompting a frontier model?
Start with prompting a frontier model to establish your quality baseline and collect training data. Fine-tune when: (1) you have at least 1,000 high-quality training examples, (2) the task is well-defined enough that a smaller model can learn it, and (3) cost or latency requirements justify the investment. Many teams find that fine-tuning a 7B-13B model to 90% of frontier quality at 10% of the cost is the right tradeoff for production agents handling routine tasks, while keeping a frontier model as a fallback for complex edge cases.
How much training data do I need for agentic fine-tuning?
The minimum viable dataset depends on task complexity. For simple format compliance (always output JSON with specific fields), 200-500 examples often suffice. For tool-calling accuracy across 10+ tools, 1,000-5,000 examples per tool are needed. For complex multi-step reasoning, 5,000-20,000 examples provide solid results. Quality matters far more than quantity — 1,000 carefully curated examples outperform 10,000 noisy ones. Always start with the smallest effective dataset and scale up only if evaluation metrics demand it.
What is the difference between SFT, RLHF, and DPO for agentic tasks?
SFT teaches the model what good behavior looks like by showing examples. It is the simplest approach and sufficient for most agentic use cases (format compliance, tool calling, domain knowledge). DPO teaches the model to prefer good behavior over bad by showing contrastive pairs — it is particularly useful for reducing undesirable behaviors (hallucination, unsafe tool use) that SFT alone cannot eliminate. RLHF is the most powerful but most complex: it trains a separate reward model and uses RL to optimize behavior. Use RLHF only when you have complex reward signals that cannot be captured by pairwise comparisons (e.g., optimizing for multi-turn task completion rate).
How do I prevent catastrophic forgetting when fine-tuning for agentic tasks?
Catastrophic forgetting — where fine-tuning on a narrow task degrades general capabilities — is a real risk. Three mitigations: (1) Use LoRA instead of full fine-tuning, which modifies only a small fraction of parameters and preserves most base knowledge. (2) Mix your agentic training data with general instruction-following data (10-20% of the training mix) to maintain broad capabilities. (3) Evaluate on both your agentic benchmarks and general benchmarks (MMLU, HumanEval) to detect capability regression early. If you see regression, reduce the learning rate or add more general data to the training mix.
#FineTuning #LLMTraining #AgenticAI #SFT #DPO #RLHF #MachineLearning #AIEngineering
Written by
CallSphere Team
Expert insights on AI voice agents and customer communication automation.