Building Self-Improving Agent Teams: Agents That Learn from Each Other's Successes and Failures
Design agent teams that improve collectively through experience sharing, collective memory, skill transfer, and performance benchmarking. Includes Python implementations for experience replay and cross-agent learning.
Why Agent Teams Should Learn Continuously
Most multi-agent systems are static — agents run with fixed prompts and strategies, never adapting based on outcomes. This means the same mistakes get repeated and successful strategies are never propagated across the team. A self-improving agent team changes this by capturing experiences, identifying what works, and sharing those insights across all agents.
The core insight is that every agent execution produces a training signal: did the output meet quality expectations? How long did it take? Did downstream agents accept or reject the result? By capturing and analyzing these signals, the entire team gets better over time — without any manual prompt tuning.
The Experience Capture System
Every agent interaction produces an experience record that captures the input, output, evaluation, and context.
```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional
import time
import uuid


@dataclass
class Experience:
    experience_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    agent_id: str = ""
    task_type: str = ""
    input_data: Dict[str, Any] = field(default_factory=dict)
    output_data: Dict[str, Any] = field(default_factory=dict)
    success: bool = False
    quality_score: float = 0.0  # 0.0 to 1.0
    duration_ms: float = 0.0
    feedback: Optional[str] = None
    timestamp: float = field(default_factory=time.time)
    context: Dict[str, Any] = field(default_factory=dict)


class ExperienceStore:
    def __init__(self):
        self._experiences: List[Experience] = []

    def record(self, exp: Experience):
        self._experiences.append(exp)

    def get_successes(
        self, task_type: str, min_score: float = 0.8
    ) -> List[Experience]:
        return [
            e for e in self._experiences
            if e.task_type == task_type
            and e.success
            and e.quality_score >= min_score
        ]

    def get_failures(self, task_type: str) -> List[Experience]:
        return [
            e for e in self._experiences
            if e.task_type == task_type and not e.success
        ]

    def get_agent_stats(self, agent_id: str) -> Dict:
        agent_exps = [
            e for e in self._experiences if e.agent_id == agent_id
        ]
        if not agent_exps:
            return {"total": 0, "success_rate": 0, "avg_score": 0}
        successes = sum(1 for e in agent_exps if e.success)
        return {
            "total": len(agent_exps),
            "success_rate": successes / len(agent_exps),
            "avg_score": sum(e.quality_score for e in agent_exps)
            / len(agent_exps),
            "avg_duration_ms": sum(e.duration_ms for e in agent_exps)
            / len(agent_exps),
        }
```
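The aggregation inside get_agent_stats is easy to sanity-check in isolation. This sketch mirrors its logic on a minimal stand-in record type (the Record class and the sample scores are illustrative, not part of the system above):

```python
from dataclasses import dataclass


@dataclass
class Record:  # minimal stand-in for Experience
    success: bool
    quality_score: float


def agent_stats(records):
    # Same empty-case guard and averaging as get_agent_stats
    if not records:
        return {"total": 0, "success_rate": 0, "avg_score": 0}
    return {
        "total": len(records),
        "success_rate": sum(r.success for r in records) / len(records),
        "avg_score": sum(r.quality_score for r in records) / len(records),
    }


stats = agent_stats([Record(True, 0.9), Record(False, 0.4), Record(True, 0.8)])
# success_rate is 2/3; avg_score is roughly 0.7
```

Note that booleans sum as integers in Python, which is what makes the one-line success rate work.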
Cross-Agent Learning: Extracting Lessons
The lesson extractor analyzes experiences across all agents to identify patterns of success and failure.
```python
from openai import AsyncOpenAI

client = AsyncOpenAI()


class LessonExtractor:
    def __init__(self, store: ExperienceStore):
        self.store = store

    async def extract_lessons(
        self, task_type: str, batch_size: int = 10
    ) -> List[str]:
        successes = self.store.get_successes(task_type)[-batch_size:]
        failures = self.store.get_failures(task_type)[-batch_size:]
        prompt = "Analyze these agent experiences and extract lessons.\n\n"
        prompt += "SUCCESSES:\n"
        for exp in successes:
            prompt += (
                f"- Agent {exp.agent_id}: score={exp.quality_score}, "
                f"input={exp.input_data}, approach={exp.context}\n"
            )
        prompt += "\nFAILURES:\n"
        for exp in failures:
            prompt += (
                f"- Agent {exp.agent_id}: feedback={exp.feedback}, "
                f"input={exp.input_data}, approach={exp.context}\n"
            )
        prompt += (
            "\nExtract 3-5 actionable lessons. For each lesson, "
            "state what to do and what to avoid."
        )
        response = await client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {
                    "role": "system",
                    "content": (
                        "You analyze agent performance data and extract "
                        "concise, actionable lessons."
                    ),
                },
                {"role": "user", "content": prompt},
            ],
        )
        lessons_text = response.choices[0].message.content
        return [l.strip() for l in lessons_text.split("\n") if l.strip()]
```
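The final split on newlines keeps every non-empty line, including any bullet markers or numbering the model happens to emit. A small normalizer can strip that formatting before lessons are stored; normalize_lessons is a hypothetical helper, and its regex assumes common bullet styles only:

```python
import re


def normalize_lessons(raw: str) -> list[str]:
    lessons = []
    for line in raw.split("\n"):
        # Strip leading bullet markers ("-", "*") and numbering ("1.", "2)")
        cleaned = re.sub(r"^\s*(?:[-*]|\d+[.)])\s+", "", line).strip()
        if cleaned:
            lessons.append(cleaned)
    return lessons


normalize_lessons("1. Keep prompts short\n- Avoid vague criteria\n\n")
# ["Keep prompts short", "Avoid vague criteria"]
```

Normalizing here also makes duplicate detection easier later, since the same lesson phrased with different bullet styles collapses to one string.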
Dynamic Prompt Enhancement
Once lessons are extracted, they get injected into agent system prompts — making the agent team genuinely self-improving.
```python
class AdaptiveAgent:
    def __init__(
        self,
        agent_id: str,
        base_system_prompt: str,
        store: ExperienceStore,
        extractor: LessonExtractor,
    ):
        self.agent_id = agent_id
        self.base_prompt = base_system_prompt
        self.store = store
        self.extractor = extractor
        self.learned_lessons: List[str] = []
        self.executions_since_update = 0

    async def maybe_update_lessons(self, task_type: str):
        self.executions_since_update += 1
        if self.executions_since_update >= 10:
            self.learned_lessons = await self.extractor.extract_lessons(
                task_type
            )
            self.executions_since_update = 0

    def get_enhanced_prompt(self) -> str:
        if not self.learned_lessons:
            return self.base_prompt
        lessons_block = "\n".join(
            f"- {lesson}" for lesson in self.learned_lessons
        )
        return (
            f"{self.base_prompt}\n\n"
            f"LESSONS FROM PAST EXPERIENCE:\n{lessons_block}"
        )

    async def execute(self, task_type: str, input_data: Dict) -> Dict:
        await self.maybe_update_lessons(task_type)
        start = time.time()
        response = await client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": self.get_enhanced_prompt()},
                {"role": "user", "content": str(input_data)},
            ],
        )
        output = {"result": response.choices[0].message.content}
        duration = (time.time() - start) * 1000
        self.store.record(Experience(
            agent_id=self.agent_id,
            task_type=task_type,
            input_data=input_data,
            output_data=output,
            success=True,  # Provisional; an evaluator should revise this
            quality_score=0.0,  # Updated later by evaluation
            duration_ms=duration,
        ))
        return output
```
Performance Benchmarking Dashboard
Track how the team improves over time by monitoring key metrics across sliding windows.
```python
class TeamBenchmark:
    def __init__(self, store: ExperienceStore, agent_ids: List[str]):
        self.store = store
        self.agent_ids = agent_ids

    def report(self, window_size: int = 50) -> Dict:
        team_stats = {}
        for agent_id in self.agent_ids:
            # Score each agent on its most recent window_size experiences
            # so the report reflects current, not lifetime, performance
            recent = [
                e for e in self.store._experiences
                if e.agent_id == agent_id
            ][-window_size:]
            if not recent:
                team_stats[agent_id] = {
                    "total": 0, "success_rate": 0, "avg_score": 0
                }
                continue
            team_stats[agent_id] = {
                "total": len(recent),
                "success_rate": sum(
                    1 for e in recent if e.success
                ) / len(recent),
                "avg_score": sum(
                    e.quality_score for e in recent
                ) / len(recent),
                "avg_duration_ms": sum(
                    e.duration_ms for e in recent
                ) / len(recent),
            }
        overall_scores = [
            s["avg_score"] for s in team_stats.values() if s["total"] > 0
        ]
        return {
            "agent_stats": team_stats,
            "team_avg_score": (
                sum(overall_scores) / len(overall_scores)
                if overall_scores else 0
            ),
            "team_size": len(self.agent_ids),
            "total_experiences": sum(
                s["total"] for s in team_stats.values()
            ),
        }
```
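To see whether the team is actually improving, compare scores across consecutive windows rather than a single lifetime average. A minimal sketch (window_averages is a hypothetical helper, not part of TeamBenchmark):

```python
def window_averages(scores: list[float], window: int = 50) -> list[float]:
    # Average quality score per consecutive window, oldest first; a
    # rising sequence suggests the lesson loop is paying off
    return [
        sum(scores[i:i + window]) / len(scores[i:i + window])
        for i in range(0, len(scores), window)
    ]


trend = window_averages([0.5] * 50 + [0.8] * 50, window=50)
# roughly [0.5, 0.8]: the second window outperforms the first
```

Plotting these window averages per agent gives the simplest possible improvement dashboard.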
Skill Transfer Between Agents
When one agent develops expertise in a specific task type, transfer its lessons to other agents handling similar tasks.
```python
async def transfer_skills(
    source_agent: AdaptiveAgent,
    target_agent: AdaptiveAgent,
    task_type: str,
):
    source_successes = source_agent.store.get_successes(task_type)
    if not source_successes:
        return
    # Extract what made the source agent successful
    lessons = await source_agent.extractor.extract_lessons(task_type)
    target_agent.learned_lessons.extend(lessons)
    print(
        f"Transferred {len(lessons)} lessons from "
        f"{source_agent.agent_id} to {target_agent.agent_id}"
    )
```
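One caveat: extend() will happily append duplicates if the same lesson is transferred twice. An order-preserving merge avoids that; merge_lessons is a hypothetical helper you could call in place of the extend above:

```python
def merge_lessons(existing: list[str], incoming: list[str]) -> list[str]:
    # Append only lessons the target agent does not already have,
    # preserving the order in which lessons were learned
    merged = list(existing)
    for lesson in incoming:
        if lesson not in merged:
            merged.append(lesson)
    return merged


merge_lessons(["validate inputs"], ["validate inputs", "cite sources"])
# ["validate inputs", "cite sources"]
```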
The team evolves: individual agents learn from their own experiences, the lesson extractor identifies cross-agent patterns, and skill transfer propagates successful strategies — creating a feedback loop where the team's collective performance rises continuously.
FAQ
How do I evaluate quality_score automatically?
Use a dedicated evaluator agent that scores outputs against acceptance criteria. For code generation, run the code and check if tests pass. For text generation, use an LLM-as-judge approach with a rubric. For classification tasks, compare against ground truth labels. The key is automating this so every experience gets scored without human involvement.
Won't the learned lessons accumulate and bloat the system prompt?
Yes, if unchecked. Implement a lesson relevance decay: lessons that are older than N executions or that have not correlated with improved scores get pruned. Keep the active lesson set small (5-10 lessons) and periodically consolidate overlapping lessons into summary rules.
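One way to implement that decay, sketched with a hypothetical Lesson record whose age and score_delta fields you would maintain in the evaluation loop:

```python
from dataclasses import dataclass


@dataclass
class Lesson:
    text: str
    age: int            # executions since this lesson was extracted
    score_delta: float  # avg quality change observed while it was active


def prune_lessons(lessons: list[Lesson], max_age: int = 100,
                  max_active: int = 7) -> list[Lesson]:
    # Drop stale lessons, then keep only those with the strongest
    # observed correlation with improved scores
    fresh = [l for l in lessons if l.age <= max_age]
    fresh.sort(key=lambda l: l.score_delta, reverse=True)
    return fresh[:max_active]
```

The thresholds here (100 executions, 7 active lessons) are illustrative defaults; tune them against your own token budget and task volume.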
How do I prevent negative transfer — where lessons from one context hurt performance in another?
Tag lessons with the task type and input characteristics where they were learned. Only inject lessons that match the current task's context. Additionally, A/B test lesson sets: occasionally run without the learned lessons and compare performance. If lessons are hurting, reset and re-extract from recent data.
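Tagging can be as simple as storing (task_type, lesson) pairs and filtering at prompt-build time; lessons_for_task is a hypothetical helper showing the filter:

```python
def lessons_for_task(tagged_lessons: list[tuple[str, str]],
                     task_type: str) -> list[str]:
    # Only inject lessons learned on the same task type
    return [text for tag, text in tagged_lessons if tag == task_type]


tagged = [("summarize", "Keep it short"), ("classify", "Check label set")]
lessons_for_task(tagged, "summarize")
# ["Keep it short"]
```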