
Continuous Fine-Tuning: Updating Models with New Data Without Catastrophic Forgetting

Learn how to incrementally update fine-tuned models with new data while preserving existing capabilities, using replay buffers, evaluation gates, elastic weight consolidation, and model versioning strategies.

The Catastrophic Forgetting Problem

You fine-tuned a model on customer support data and it works well. Three months later, you have new data covering product features that launched after the initial training. You fine-tune again on just the new data. The model now handles the new features but has forgotten how to handle the original scenarios.

This is catastrophic forgetting — when training on new data overwrites the patterns learned from previous data. It is the central challenge of continuous fine-tuning. Solving it requires careful data management, training strategies, and automated evaluation gates.
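The strategies below attack forgetting on the data and process side. A complementary weight-side technique is elastic weight consolidation (EWC), which adds a quadratic penalty to the training loss for moving weights that mattered to previous data. Here is a toy sketch of just the penalty term, using plain Python lists as stand-in weights; the `fisher` values (approximating per-weight importance via the diagonal Fisher information) and `lam` strength are illustrative assumptions, and in practice this term is added to your framework's loss:

```python
def ewc_penalty(theta, theta_old, fisher, lam=100.0):
    """EWC penalty: lam * sum_i F_i * (theta_i - theta_old_i)^2.

    Weights with high Fisher importance (they mattered to the old task)
    pay a larger cost for the same amount of drift.
    """
    return lam * sum(f * (t - t0) ** 2 for f, t, t0 in zip(fisher, theta, theta_old))

# The first weight is important (F=0.9) and drifted by 0.5;
# the second is unimportant (F=0.01) and barely contributes.
print(ewc_penalty([1.5, 0.2], [1.0, 0.0], fisher=[0.9, 0.01]))  # ≈ 22.54
```

The intuition matches replay: both bias training toward preserving what the old data taught, one through the data mix, the other through the loss.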

Strategy 1: Replay Buffers

The simplest and often most effective approach is to always include a sample of old training data whenever you train on new data. This is called experience replay, a technique borrowed from reinforcement learning.

import json
import random
from pathlib import Path
from dataclasses import dataclass

@dataclass
class ReplayBuffer:
    """Manages historical training data for replay during continuous fine-tuning."""
    buffer_path: str
    max_size: int = 10_000

    def __post_init__(self):
        self.buffer_file = Path(self.buffer_path)
        if not self.buffer_file.exists():
            self.buffer_file.write_text("")

    def add_examples(self, examples: list[dict]):
        """Add new training examples to the replay buffer."""
        existing = self._load_all()
        existing.extend(examples)

        # If buffer exceeds max size, keep a diverse sample
        if len(existing) > self.max_size:
            existing = self._diverse_sample(existing, self.max_size)

        with open(self.buffer_path, "w") as f:
            for ex in existing:
                f.write(json.dumps(ex) + "\n")

        print(f"Buffer size: {len(existing)} examples")

    def sample(self, n: int, seed: int = 42) -> list[dict]:
        """Sample n examples from the buffer for replay."""
        all_examples = self._load_all()
        rng = random.Random(seed)  # local RNG: avoid reseeding global random state
        return rng.sample(all_examples, min(n, len(all_examples)))

    def _load_all(self) -> list[dict]:
        examples = []
        if self.buffer_file.exists():
            with open(self.buffer_path, "r") as f:
                for line in f:
                    line = line.strip()
                    if line:
                        examples.append(json.loads(line))
        return examples

    def _diverse_sample(self, examples: list[dict], n: int) -> list[dict]:
        """Sample while maintaining topic diversity."""
        # Group by first few words of user message as a proxy for topic
        groups = {}
        for ex in examples:
            user_msg = ex["messages"][-2]["content"] if len(ex["messages"]) >= 2 else ""
            key = " ".join(user_msg.split()[:5])
            groups.setdefault(key, []).append(ex)

        # Round-robin sample from each group, dropping groups as they empty
        sampled = []
        group_lists = list(groups.values())
        random.shuffle(group_lists)
        idx = 0
        while len(sampled) < n and group_lists:
            group = group_lists[idx % len(group_lists)]
            sampled.append(group.pop(random.randint(0, len(group) - 1)))
            if not group:
                # Popping the empty group shifts later groups left, so idx
                # already points at the next group; do not advance it.
                group_lists.pop(idx % len(group_lists))
            else:
                idx += 1
        return sampled

Building the Training Mix

When training on new data, combine it with replay data in a controlled ratio.

def build_training_mix(
    new_data_path: str,
    replay_buffer: ReplayBuffer,
    replay_ratio: float = 0.3,
    output_path: str = "mixed_training_data.jsonl",
) -> dict:
    """Combine new data with replay buffer data."""
    # Load new data
    new_examples = []
    with open(new_data_path, "r") as f:
        for line in f:
            new_examples.append(json.loads(line.strip()))

    # replay_ratio is the share of the *combined* set that should be old data,
    # so we need new * r / (1 - r) replay examples
    replay_count = int(len(new_examples) * replay_ratio / (1 - replay_ratio))
    replay_examples = replay_buffer.sample(replay_count)

    # Combine and shuffle
    combined = new_examples + replay_examples
    random.shuffle(combined)

    # Write combined dataset
    with open(output_path, "w") as f:
        for ex in combined:
            f.write(json.dumps(ex) + "\n")

    # Add new data to replay buffer for future rounds
    replay_buffer.add_examples(new_examples)

    return {
        "new_examples": len(new_examples),
        "replay_examples": len(replay_examples),
        "total": len(combined),
        "replay_ratio": f"{len(replay_examples) / len(combined):.1%}",
    }

# Usage
buffer = ReplayBuffer("./replay_buffer.jsonl", max_size=10_000)

# First training round: the buffer is still empty, so the mix is all new data
mix_info = build_training_mix(
    "new_product_features.jsonl",
    buffer,
    replay_ratio=0.3,
)
print(f"Training mix: {mix_info}")

Strategy 2: Evaluation Gates

Never deploy a continuously fine-tuned model without verifying it did not regress on existing capabilities. Evaluation gates are automated checks that block deployment if quality drops.


from dataclasses import dataclass

@dataclass
class EvalGate:
    test_sets: dict[str, str]       # name -> test filepath
    min_scores: dict[str, float]    # name -> minimum acceptable score
    regression_tolerance: float = 0.02

def run_eval_gate(gate: EvalGate, model: str, previous_scores: dict, eval_fn) -> dict:
    """Run evaluation gate and decide whether to deploy."""
    failures = []
    for name, test_path in gate.test_sets.items():
        score = eval_fn(model, test_path)
        if score < gate.min_scores.get(name, 0.0):
            failures.append(f"{name}: {score:.3f} below minimum")
        prev = previous_scores.get(name)
        # Compare against None explicitly: a previous score of 0.0 is falsy
        if prev is not None and score < prev - gate.regression_tolerance:
            failures.append(f"{name}: regressed from {prev:.3f} to {score:.3f}")

    return {"passed": len(failures) == 0, "failures": failures}

Model Versioning

Every continuous fine-tuning iteration produces a new model version. Track versions, their training data, and evaluation scores in a simple registry.

import json
from datetime import datetime
from pathlib import Path

class ModelRegistry:
    """Track model versions for continuous fine-tuning."""

    def __init__(self, registry_path: str = "model_registry.json"):
        self.registry_path = registry_path
        path = Path(registry_path)
        self.versions = json.loads(path.read_text()) if path.exists() else []

    def register(self, model_id: str, parent_id: str, eval_scores: dict) -> dict:
        version = {
            "model_id": model_id,
            "parent_model_id": parent_id,
            "version": len(self.versions) + 1,
            "created_at": datetime.now().isoformat(),
            "eval_scores": eval_scores,
            "status": "candidate",
        }
        self.versions.append(version)
        Path(self.registry_path).write_text(json.dumps(self.versions, indent=2))
        return version

    def promote(self, model_id: str):
        for v in self.versions:
            if v["status"] == "production":
                v["status"] = "retired"
            if v["model_id"] == model_id:
                v["status"] = "production"
        Path(self.registry_path).write_text(json.dumps(self.versions, indent=2))
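One payoff of recording `parent_model_id` is cheap lineage tracking for audits and rollback: you can walk the chain from the production model back to the original base. A small sketch over registry entries shaped like those the `ModelRegistry` writes (the model IDs here are hypothetical):

```python
def lineage(versions: list[dict], model_id: str) -> list[str]:
    """Follow parent_model_id links from model_id back to the original base."""
    by_id = {v["model_id"]: v for v in versions}
    chain = []
    current = model_id
    while current in by_id:
        chain.append(current)
        current = by_id[current]["parent_model_id"]
    chain.append(current)  # the original base model, which has no registry entry
    return chain

versions = [
    {"model_id": "ft:v1", "parent_model_id": "gpt-4o-mini", "status": "retired"},
    {"model_id": "ft:v2", "parent_model_id": "ft:v1", "status": "production"},
]
print(lineage(versions, "ft:v2"))  # ['ft:v2', 'ft:v1', 'gpt-4o-mini']
```

If an evaluation gate fails several iterations in, the chain tells you exactly which earlier version to roll back to.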

FAQ

How often should I retrain a fine-tuned model with new data?

Retrain every 2-4 weeks for actively evolving domains, but only when you have enough new data to matter. Monitor production performance — when accuracy drops or new categories appear that the model cannot handle, that is your signal to retrain.

What replay ratio works best for preventing catastrophic forgetting?

Start with 30% replay (old data) and 70% new data. Increase replay to 50% if you see regression on old capabilities. Decrease to 20% if the model is slow to learn new patterns. Let your evaluation gates determine the final ratio.
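The arithmetic behind those ratios mirrors `build_training_mix` above: to make replay a given share `r` of the combined set, sample `new * r / (1 - r)` old examples. For instance:

```python
def replay_count(num_new: int, replay_ratio: float) -> int:
    """Replay examples needed so old data is replay_ratio of the combined set."""
    return int(num_new * replay_ratio / (1 - replay_ratio))

print(replay_count(1000, 0.3))  # 428 replay examples -> ~30% of the 1428 total
print(replay_count(1000, 0.5))  # 1000 replay examples -> a 50/50 mix
```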

Can I use continuous fine-tuning with the OpenAI fine-tuning API?

Yes. Use a previously fine-tuned model as the base for a new training job. Combine new JSONL data with replay examples, upload the combined file, and specify the previous model ID as the base. Pair with automated evaluation to catch regressions before switching production traffic.
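A minimal sketch of that workflow with the `openai` Python SDK. This is an outline, not a definitive recipe: the model ID in the comment is a placeholder, and `is_finetuned_model` is a hypothetical sanity-check helper, not part of the SDK.

```python
def is_finetuned_model(model_id: str) -> bool:
    """Fine-tuned OpenAI model IDs carry the 'ft:' prefix."""
    return model_id.startswith("ft:")

def continue_finetune(previous_model_id: str, mixed_data_path: str) -> str:
    """Upload the replay-mixed JSONL and start a fine-tune on top of the prior model.

    Sketch only: requires the openai SDK and OPENAI_API_KEY in the environment.
    """
    from openai import OpenAI
    assert is_finetuned_model(previous_model_id), "base should be the prior fine-tune"
    client = OpenAI()
    upload = client.files.create(file=open(mixed_data_path, "rb"), purpose="fine-tune")
    job = client.fine_tuning.jobs.create(
        model=previous_model_id,  # e.g. "ft:gpt-4o-mini-2024-07-18:acme::abc123"
        training_file=upload.id,
    )
    return job.id
```

Run your evaluation gate against the resulting model and only then promote it in the registry.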


#ContinuousLearning #CatastrophicForgetting #FineTuning #ModelVersioning #MLOps #AgenticAI #LearnAI #AIEngineering

CallSphere Team

Expert insights on AI voice agents and customer communication automation.
