Fine-Tuning LLMs for Agentic Tasks: When and How to Customize Foundation Models
When fine-tuning beats prompting for AI agents: dataset creation from agent traces, SFT and DPO training approaches, evaluation methodology, and cost-benefit analysis for agentic fine-tuning.
When Fine-Tuning Beats Prompting for Agents
Prompt engineering is the first tool you should reach for when building AI agents. It is faster, cheaper, and easier to iterate on. But there are specific situations where fine-tuning a foundation model delivers dramatically better results for agentic tasks:
Consistent formatting under pressure. When your agent must always produce valid JSON with specific field names, or always follow a particular tool-calling convention, fine-tuning bakes this format into the model's weights rather than relying on instructions that can be ignored under complex reasoning load.
Domain-specific tool selection. An agent operating in a specialized domain (medical coding, financial compliance, industrial control) may need to select from 50+ domain-specific tools. Fine-tuning teaches the model which tool to use for which situation far more reliably than cramming all tool descriptions into the context.
Latency-sensitive deployments. Fine-tuning a smaller model (7B-13B parameters) to match the agentic capabilities of a larger model (70B+) can reduce inference latency by 3-5x while maintaining task-specific accuracy. If your agent needs sub-second response times, this is often the only viable path.
Volume economics. When you are running millions of agent interactions per month, the per-token cost of a smaller fine-tuned model (often 10-20x cheaper than frontier models) compounds into massive savings.
Creating Training Datasets from Agent Traces
The highest-quality training data for agentic fine-tuning comes from your own agent's successful interactions. Here is a systematic approach to collecting and curating this data.
from dataclasses import dataclass, field
from typing import Optional
from datetime import datetime, timezone
import json

@dataclass
class AgentTrace:
    trace_id: str
    task: str
    messages: list[dict]
    tool_calls: list[dict]
    outcome: str  # "success", "failure", "partial"
    human_rating: Optional[float] = None  # 1-5
    # datetime.utcnow() is deprecated; use a timezone-aware UTC timestamp
    timestamp: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )
    metadata: dict = field(default_factory=dict)
class TraceCollector:
"""Collects and curates agent traces for fine-tuning."""
def __init__(self, storage):
self.storage = storage
async def log_trace(self, trace: AgentTrace):
await self.storage.insert({
"trace_id": trace.trace_id,
"task": trace.task,
"messages": trace.messages,
"tool_calls": trace.tool_calls,
"outcome": trace.outcome,
"human_rating": trace.human_rating,
"timestamp": trace.timestamp.isoformat(),
"metadata": trace.metadata,
})
async def export_training_data(
self,
min_rating: float = 4.0,
outcome_filter: str = "success",
max_samples: int = 10000,
) -> list[dict]:
"""Export high-quality traces as training examples."""
traces = await self.storage.query(
filters={
"outcome": outcome_filter,
"human_rating": {"$gte": min_rating},
},
limit=max_samples,
sort_by="human_rating",
sort_order="desc",
)
training_examples = []
for trace in traces:
example = self._trace_to_training_example(trace)
if example:
training_examples.append(example)
return training_examples
def _trace_to_training_example(
self, trace: dict
) -> Optional[dict]:
"""Convert a trace into a chat-format training example."""
messages = trace.get("messages", [])
if len(messages) < 2:
return None
# Filter to keep system prompt + user/assistant turns
training_messages = []
for msg in messages:
role = msg.get("role")
if role in ("system", "user", "assistant", "tool"):
training_messages.append({
"role": role,
"content": msg.get("content", ""),
})
# Include tool calls in assistant messages
if role == "assistant" and msg.get("tool_calls"):
training_messages[-1]["tool_calls"] = (
msg["tool_calls"]
)
return {"messages": training_messages}
class DatasetCurator:
"""Curates and prepares datasets for fine-tuning."""
def __init__(self, llm_client):
self.llm = llm_client
    async def deduplicate(
        self, examples: list[dict]
    ) -> list[dict]:
        """Remove exact-duplicate training examples via content hashing.
        True near-duplicate detection needs embedding or MinHash
        similarity; this hash-based pass only catches verbatim repeats."""
        unique = []
        seen_hashes = set()
        for ex in examples:
            content_hash = self._content_hash(ex)
            if content_hash not in seen_hashes:
                seen_hashes.add(content_hash)
                unique.append(ex)
        return unique
async def augment_with_negatives(
self, positive_examples: list[dict]
) -> list[dict]:
"""Generate contrastive negative examples for DPO."""
augmented = []
for example in positive_examples:
# Generate a plausible but incorrect alternative
negative = await self._generate_negative(example)
augmented.append({
"prompt": self._extract_prompt(example),
"chosen": self._extract_response(example),
"rejected": negative,
})
return augmented
async def _generate_negative(
self, example: dict
) -> str:
"""Generate a plausible but incorrect response."""
prompt = self._extract_prompt(example)
correct = self._extract_response(example)
response = await self.llm.chat(messages=[{
"role": "user",
"content": (
f"Given this prompt and the correct response, "
f"generate a plausible but INCORRECT alternative "
f"response. The incorrect response should have a "
f"subtle error: wrong tool selection, incorrect "
f"parameter, or flawed reasoning.\n\n"
f"Prompt: {prompt}\n\n"
f"Correct response: {correct}\n\n"
f"Generate an incorrect alternative:"
),
}])
return response.content
def _content_hash(self, example: dict) -> str:
import hashlib
content = json.dumps(
example, sort_keys=True, default=str
)
return hashlib.md5(content.encode()).hexdigest()
def _extract_prompt(self, example: dict) -> str:
messages = example.get("messages", [])
user_msgs = [
m["content"] for m in messages if m["role"] == "user"
]
return user_msgs[0] if user_msgs else ""
def _extract_response(self, example: dict) -> str:
messages = example.get("messages", [])
assistant_msgs = [
m["content"] for m in messages
if m["role"] == "assistant"
]
return assistant_msgs[-1] if assistant_msgs else ""
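To make the trace-to-example mapping concrete, here is a minimal standalone sketch that mirrors the filtering logic of `_trace_to_training_example` above. The sample trace, role names beyond the standard four, and the `create_event` tool are illustrative:

```python
from typing import Optional

def trace_to_example(messages: list[dict]) -> Optional[dict]:
    """Keep only roles the chat template understands, carrying
    tool_calls along on assistant turns (mirrors the method above)."""
    if len(messages) < 2:
        return None
    kept = []
    for msg in messages:
        if msg.get("role") in ("system", "user", "assistant", "tool"):
            entry = {"role": msg["role"], "content": msg.get("content", "")}
            if msg["role"] == "assistant" and msg.get("tool_calls"):
                entry["tool_calls"] = msg["tool_calls"]
            kept.append(entry)
    return {"messages": kept}

raw_trace = [
    {"role": "system", "content": "You are a scheduling agent."},
    {"role": "user", "content": "Book a meeting for 3pm."},
    {"role": "assistant", "content": "", "tool_calls": [
        {"name": "create_event", "arguments": {"time": "15:00"}},
    ]},
    {"role": "debug", "content": "internal log line"},  # dropped on export
]
example = trace_to_example(raw_trace)
```

Note that internal roles like `debug` are silently dropped, so exported examples contain only turns the training framework can tokenize.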
Supervised Fine-Tuning (SFT)
SFT is the most straightforward fine-tuning approach: you show the model examples of correct behavior and train it to reproduce that behavior. For agentic tasks, SFT teaches the model the correct tool-calling patterns, output formats, and reasoning chains.
import json
from pathlib import Path
class SFTDatasetPreparator:
"""Prepares datasets for Supervised Fine-Tuning."""
def __init__(self, tokenizer, max_seq_length: int = 4096):
self.tokenizer = tokenizer
self.max_seq_length = max_seq_length
def prepare_chat_dataset(
self, examples: list[dict], output_path: str
):
"""Convert examples to the chat format for SFT."""
processed = []
for ex in examples:
messages = ex.get("messages", [])
# Validate token length
formatted = self.tokenizer.apply_chat_template(
messages, tokenize=False
)
tokens = self.tokenizer.encode(formatted)
if len(tokens) > self.max_seq_length:
# Truncate conversation, keeping system + last turns
messages = self._truncate_conversation(
messages, self.max_seq_length
)
processed.append({"messages": messages})
# Write as JSONL
with open(output_path, "w") as f:
for item in processed:
f.write(json.dumps(item) + "\n")
return {
"total_examples": len(processed),
"output_path": output_path,
}
def prepare_tool_calling_dataset(
self, examples: list[dict], output_path: str
):
"""Prepare dataset specifically for tool-calling fine-tuning.
Each example includes the system prompt with tool definitions,
user query, and correct tool call(s) as the target."""
processed = []
for ex in examples:
messages = ex.get("messages", [])
tools = ex.get("tools", [])
# Ensure tools are included in the system message
system_msg = next(
(m for m in messages if m["role"] == "system"),
None,
)
if system_msg and tools:
system_msg["content"] += (
"\n\nAVAILABLE TOOLS:\n"
+ json.dumps(tools, indent=2)
)
processed.append({
"messages": messages,
"tools": tools,
})
with open(output_path, "w") as f:
for item in processed:
f.write(json.dumps(item) + "\n")
return {"total_examples": len(processed)}
def _truncate_conversation(
self, messages: list[dict], max_tokens: int
) -> list[dict]:
"""Keep system message + most recent turns."""
system = [m for m in messages if m["role"] == "system"]
non_system = [m for m in messages if m["role"] != "system"]
# Keep the last N turns that fit
result = list(system)
for msg in reversed(non_system):
candidate = system + [msg] + [
m for m in result if m["role"] != "system"
]
formatted = self.tokenizer.apply_chat_template(
candidate, tokenize=False
)
if len(self.tokenizer.encode(formatted)) <= max_tokens:
result.insert(len(system), msg)
else:
break
return result
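The truncation strategy (keep the system message, then add the most recent turns until the budget is exhausted) can be sketched standalone with a stand-in token counter. The whitespace-split counter and the sample conversation are illustrative; a real tokenizer will count differently:

```python
def truncate_keep_recent(
    messages: list[dict], max_tokens: int, count_tokens
) -> list[dict]:
    """Keep the system message plus as many trailing turns as fit."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    kept: list[dict] = []
    for msg in reversed(rest):  # walk from most recent to oldest
        candidate = system + [msg] + kept
        if sum(count_tokens(m["content"]) for m in candidate) <= max_tokens:
            kept.insert(0, msg)
        else:
            break
    return system + kept

count = lambda text: len(text.split())  # crude stand-in for a real tokenizer
msgs = [
    {"role": "system", "content": "be brief"},
    {"role": "user", "content": "one two three four five"},
    {"role": "assistant", "content": "six seven"},
    {"role": "user", "content": "eight nine"},
]
out = truncate_keep_recent(msgs, max_tokens=8, count_tokens=count)
```

Here the oldest user turn is dropped because including it would exceed the budget, while the system prompt and the two most recent turns survive.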
SFT Training Configuration
# Example training configuration for SFT with LoRA
sft_config = {
"model_name": "meta-llama/Llama-3-8B-Instruct",
"dataset_path": "./agent_sft_dataset.jsonl",
"output_dir": "./agent-llama-8b-sft",
# LoRA configuration (parameter-efficient fine-tuning)
"lora": {
"r": 64, # LoRA rank
"lora_alpha": 128, # scaling factor
"target_modules": [
"q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj",
],
"lora_dropout": 0.05,
},
# Training hyperparameters
"training": {
"num_epochs": 3,
"batch_size": 4,
"gradient_accumulation_steps": 4,
"learning_rate": 2e-5,
"warmup_ratio": 0.1,
"weight_decay": 0.01,
"max_seq_length": 4096,
"lr_scheduler": "cosine",
},
# Evaluation
"eval_split": 0.1,
"eval_steps": 100,
"save_steps": 200,
}
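One practical sanity check before launching a run is the effective batch size and total optimizer steps the config implies, since those drive warmup and checkpoint spacing. A small helper, using the illustrative numbers from the config above and an assumed 5,000-example dataset:

```python
import math

def training_steps(
    num_examples: int, batch_size: int, grad_accum: int, epochs: int
) -> int:
    """Total optimizer steps for a run."""
    effective_batch = batch_size * grad_accum   # 4 * 4 = 16 here
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return steps_per_epoch * epochs

# With the config above and a 5,000-example dataset:
steps = training_steps(5000, batch_size=4, grad_accum=4, epochs=3)
```

With `warmup_ratio: 0.1`, roughly the first tenth of those steps are spent ramping the learning rate, so very small datasets can finish warmup barely before training ends.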
Direct Preference Optimization (DPO)
DPO aligns the model's outputs with human preferences without requiring a separate reward model. For agentic tasks, DPO teaches the model to prefer correct tool usage, accurate reasoning, and safe behavior over plausible but incorrect alternatives.
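Under the hood, DPO treats the log-probability ratio between the policy and a frozen reference model as an implicit reward. A minimal numeric sketch of the sigmoid-loss variant (illustrative, not a training-framework implementation; log-probabilities here are per-response sums):

```python
import math

def dpo_sigmoid_loss(
    logp_chosen: float, logp_rejected: float,
    ref_logp_chosen: float, ref_logp_rejected: float,
    beta: float = 0.1,
) -> float:
    """-log(sigmoid(margin)) where margin is the implicit reward gap."""
    # Implicit reward = beta * log-ratio against the frozen reference model
    chosen_reward = beta * (logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (logp_rejected - ref_logp_rejected)
    margin = chosen_reward - rejected_reward
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When the policy still matches the reference on both responses, the margin is zero and the loss sits at log 2; raising the chosen response's likelihood relative to the reference drives the loss down.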
class DPODatasetPreparator:
"""Prepares datasets for Direct Preference Optimization."""
def prepare(
self,
preference_pairs: list[dict],
output_path: str,
):
"""Each pair has: prompt, chosen (good), rejected (bad)."""
processed = []
for pair in preference_pairs:
processed.append({
"prompt": pair["prompt"],
"chosen": pair["chosen"],
"rejected": pair["rejected"],
})
with open(output_path, "w") as f:
for item in processed:
f.write(json.dumps(item) + "\n")
return {"total_pairs": len(processed)}
@staticmethod
def create_preference_pairs_from_traces(
successful_traces: list[dict],
failed_traces: list[dict],
) -> list[dict]:
"""Create DPO pairs from successful vs failed traces.
Match traces by similar tasks and use successful as
'chosen' and failed as 'rejected'."""
pairs = []
for success in successful_traces:
# Find a failed trace with a similar task
best_match = None
best_similarity = 0
for failure in failed_traces:
sim = _task_similarity(
success["task"], failure["task"]
)
if sim > best_similarity:
best_similarity = sim
best_match = failure
if best_match and best_similarity > 0.7:
pairs.append({
"prompt": success["task"],
"chosen": _extract_agent_response(success),
"rejected": _extract_agent_response(best_match),
})
return pairs
# DPO training configuration
dpo_config = {
"model_name": "./agent-llama-8b-sft", # start from SFT model
"dataset_path": "./agent_dpo_dataset.jsonl",
"output_dir": "./agent-llama-8b-dpo",
"dpo": {
"beta": 0.1, # KL penalty coefficient
"loss_type": "sigmoid", # or "hinge"
"label_smoothing": 0.0,
},
"training": {
"num_epochs": 1, # DPO needs fewer epochs
"batch_size": 2,
"learning_rate": 5e-6, # lower LR for DPO
"warmup_ratio": 0.1,
"max_seq_length": 4096,
},
}
RLHF: Reinforcement Learning from Human Feedback
RLHF is more complex than SFT or DPO but can produce the most aligned models. It involves training a reward model on human preferences, then using reinforcement learning (typically PPO) to optimize the agent's behavior against that reward model.
class RewardModelTrainer:
"""Trains a reward model for RLHF from human preferences."""
def prepare_reward_dataset(
self,
comparisons: list[dict],
output_path: str,
):
"""Each comparison: prompt, response_a, response_b,
preference (a or b)."""
processed = []
for comp in comparisons:
if comp["preference"] == "a":
chosen = comp["response_a"]
rejected = comp["response_b"]
else:
chosen = comp["response_b"]
rejected = comp["response_a"]
processed.append({
"prompt": comp["prompt"],
"chosen": chosen,
"rejected": rejected,
})
with open(output_path, "w") as f:
for item in processed:
f.write(json.dumps(item) + "\n")
return {"total_comparisons": len(processed)}
# RLHF pipeline configuration
rlhf_config = {
"phases": {
"sft": {
"model": "meta-llama/Llama-3-8B-Instruct",
"dataset": "./agent_sft_dataset.jsonl",
"epochs": 3,
},
"reward_model": {
"model": "meta-llama/Llama-3-8B-Instruct",
"dataset": "./reward_comparisons.jsonl",
"epochs": 1,
},
"ppo": {
"policy_model": "./agent-llama-8b-sft",
"reward_model": "./agent-reward-model",
"ppo_epochs": 4,
"kl_penalty": 0.02,
"clip_range": 0.2,
"batch_size": 64,
"mini_batch_size": 8,
},
},
}
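The `clip_range: 0.2` in the PPO phase refers to the clipped surrogate objective, which caps how far a single update can push the policy. A minimal sketch for one action, with `ratio` being the new-to-old policy probability ratio and `advantage` the estimated advantage (a simplified scalar illustration, not a full PPO step):

```python
def ppo_clipped_objective(
    ratio: float, advantage: float, clip_range: float = 0.2
) -> float:
    """Pessimistic bound: min(r * A, clip(r, 1-eps, 1+eps) * A)."""
    clipped = max(1.0 - clip_range, min(ratio, 1.0 + clip_range))
    return min(ratio * advantage, clipped * advantage)
```

The `kl_penalty` in the config plays a complementary role: it discourages the policy from drifting too far from the SFT model overall, while clipping limits the size of any individual update.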
Evaluation Methodology for Fine-Tuned Agents
Evaluating a fine-tuned agentic model requires task-specific benchmarks, not just general language model benchmarks.
@dataclass
class AgentEvalResult:
task_name: str
success_rate: float
avg_tool_accuracy: float
avg_format_compliance: float
avg_turns_to_complete: float
avg_latency_ms: float
cost_per_task: float
class AgentEvaluator:
"""Evaluates fine-tuned agents on agentic benchmarks."""
def __init__(self, eval_tasks: list[dict]):
self.tasks = eval_tasks
async def evaluate(
self, agent, model_name: str
) -> list[AgentEvalResult]:
results = []
for task in self.tasks:
successes = 0
tool_accuracies = []
format_scores = []
turn_counts = []
latencies = []
            import time  # hoisted out of the per-case loop
            for test_case in task["test_cases"]:
                start = time.perf_counter()  # monotonic; better for latency
                result = await agent.execute(
                    test_case["input"]
                )
                latency = (time.perf_counter() - start) * 1000
latencies.append(latency)
# Check success
if self._check_success(
result, test_case["expected"]
):
successes += 1
# Check tool accuracy
tool_acc = self._check_tool_calls(
result.get("tool_calls", []),
test_case.get("expected_tools", []),
)
tool_accuracies.append(tool_acc)
# Check format compliance
fmt = self._check_format(
result.get("output", ""),
task.get("format_requirements", {}),
)
format_scores.append(fmt)
turn_counts.append(
result.get("turns", 1)
)
n = len(task["test_cases"])
results.append(AgentEvalResult(
task_name=task["name"],
success_rate=successes / n if n else 0,
avg_tool_accuracy=(
sum(tool_accuracies) / len(tool_accuracies)
if tool_accuracies else 0
),
avg_format_compliance=(
sum(format_scores) / len(format_scores)
if format_scores else 0
),
avg_turns_to_complete=(
sum(turn_counts) / len(turn_counts)
if turn_counts else 0
),
avg_latency_ms=(
sum(latencies) / len(latencies)
if latencies else 0
),
cost_per_task=self._estimate_cost(
model_name, turn_counts
),
))
return results
def _check_success(
self, result: dict, expected: dict
) -> bool:
# Compare key fields
for key, value in expected.items():
if result.get(key) != value:
return False
return True
def _check_tool_calls(
self, actual: list, expected: list
) -> float:
if not expected:
return 1.0 if not actual else 0.0
correct = sum(
1 for a, e in zip(actual, expected)
if a.get("name") == e.get("name")
)
return correct / len(expected)
def _check_format(
self, output: str, requirements: dict
) -> float:
if not requirements:
return 1.0
checks_passed = 0
total_checks = len(requirements)
if requirements.get("json_valid"):
try:
json.loads(output)
checks_passed += 1
except (json.JSONDecodeError, ValueError):
pass
if requirements.get("max_length"):
if len(output) <= requirements["max_length"]:
checks_passed += 1
return checks_passed / total_checks if total_checks else 1.0
def _estimate_cost(
self, model: str, turn_counts: list[int]
) -> float:
avg_turns = (
sum(turn_counts) / len(turn_counts)
if turn_counts else 1
)
cost_per_1k_tokens = {
"gpt-4o": 0.005,
"claude-3-5-sonnet": 0.003,
"llama-3-8b-ft": 0.0002,
"llama-3-70b-ft": 0.001,
}
rate = cost_per_1k_tokens.get(model, 0.001)
avg_tokens_per_turn = 500
return avg_turns * avg_tokens_per_turn * rate / 1000
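To show how the individual format checks combine into a single compliance score, here is a standalone sketch mirroring `_check_format` above; the sample requirement values are illustrative:

```python
import json

def format_score(output: str, requirements: dict) -> float:
    """Fraction of format checks passed (mirrors _check_format)."""
    if not requirements:
        return 1.0
    passed = 0
    if requirements.get("json_valid"):
        try:
            json.loads(output)
            passed += 1
        except (json.JSONDecodeError, ValueError):
            pass
    if requirements.get("max_length") is not None:
        passed += len(output) <= requirements["max_length"]
    return passed / len(requirements)

# Valid JSON, but 16 characters against a 10-character limit:
score = format_score('{"status": "ok"}', {"json_valid": True, "max_length": 10})
```

Partial credit like this is deliberate: it lets you see which dimension of formatting regressed after a fine-tune rather than collapsing everything to pass/fail.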
Cost-Benefit Analysis
The decision to fine-tune should be driven by economics as much as capability:
Fine-tuning costs:
- Dataset creation and curation: 40-100 engineer-hours
- Compute for training: $50-500 for LoRA on 7B-13B models, $2,000-10,000 for full fine-tuning on 70B+
- Evaluation and iteration: 20-40 engineer-hours per iteration
- Ongoing maintenance: Re-tuning quarterly as base models update
Fine-tuning benefits (compared to prompting a frontier model):
- 5-20x lower inference cost per token
- 2-5x lower latency
- Higher consistency on format-heavy tasks (95%+ compliance vs 80-90%)
- Better tool selection accuracy on domain-specific tools (+10-30%)
- Can run on-premises for data-sensitive applications
Break-even calculation: If your frontier model costs $0.01/1K tokens and a fine-tuned 8B model costs $0.0005/1K tokens, you save $0.0095 per 1K tokens. If fine-tuning costs $5,000 total (compute + engineering), you break even at approximately 526 million tokens — roughly 2-3 months for a high-volume agent deployment processing 5,000 interactions per day.
FAQ
Should I fine-tune a small model or continue prompting a frontier model?
Start with prompting a frontier model to establish your quality baseline and collect training data. Fine-tune when: (1) you have at least 1,000 high-quality training examples, (2) the task is well-defined enough that a smaller model can learn it, and (3) cost or latency requirements justify the investment. Many teams find that fine-tuning a 7B-13B model to 90% of frontier quality at 10% of the cost is the right tradeoff for production agents handling routine tasks, while keeping a frontier model as a fallback for complex edge cases.
How much training data do I need for agentic fine-tuning?
The minimum viable dataset depends on task complexity. For simple format compliance (always output JSON with specific fields), 200-500 examples often suffice. For tool-calling accuracy across 10+ tools, 1,000-5,000 examples per tool are needed. For complex multi-step reasoning, 5,000-20,000 examples provide solid results. Quality matters far more than quantity — 1,000 carefully curated examples outperform 10,000 noisy ones. Always start with the smallest effective dataset and scale up only if evaluation metrics demand it.
What is the difference between SFT, RLHF, and DPO for agentic tasks?
SFT teaches the model what good behavior looks like by showing examples. It is the simplest approach and sufficient for most agentic use cases (format compliance, tool calling, domain knowledge). DPO teaches the model to prefer good behavior over bad by showing contrastive pairs — it is particularly useful for reducing undesirable behaviors (hallucination, unsafe tool use) that SFT alone cannot eliminate. RLHF is the most powerful but most complex: it trains a separate reward model and uses RL to optimize behavior. Use RLHF only when you have complex reward signals that cannot be captured by pairwise comparisons (e.g., optimizing for multi-turn task completion rate).
How do I prevent catastrophic forgetting when fine-tuning for agentic tasks?
Catastrophic forgetting — where fine-tuning on a narrow task degrades general capabilities — is a real risk. Three mitigations: (1) Use LoRA instead of full fine-tuning, which modifies only a small fraction of parameters and preserves most base knowledge. (2) Mix your agentic training data with general instruction-following data (10-20% of the training mix) to maintain broad capabilities. (3) Evaluate on both your agentic benchmarks and general benchmarks (MMLU, HumanEval) to detect capability regression early. If you see regression, reduce the learning rate or add more general data to the training mix.
#FineTuning #LLMTraining #AgenticAI #SFT #DPO #RLHF #MachineLearning #AIEngineering
Written by
CallSphere Team
Expert insights on AI voice agents and customer communication automation.