Building Self-Improving Agent Teams: Agents That Learn from Each Other's Successes and Failures
Design agent teams that improve collectively through experience sharing, collective memory, skill transfer, and performance benchmarking. Includes Python implementations for experience replay and cross-agent learning.
Why Agent Teams Should Learn Continuously
Most multi-agent systems are static — agents run with fixed prompts and strategies, never adapting based on outcomes. This means the same mistakes get repeated and successful strategies are never propagated across the team. A self-improving agent team changes this by capturing experiences, identifying what works, and sharing those insights across all agents.
The core insight is that every agent execution produces a training signal: did the output meet quality expectations? How long did it take? Did downstream agents accept or reject the result? By capturing and analyzing these signals, the entire team gets better over time — without any manual prompt tuning.
The Experience Capture System
Every agent interaction produces an experience record that captures the input, output, evaluation, and context.
```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional
import time
import uuid


@dataclass
class Experience:
    experience_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    agent_id: str = ""
    task_type: str = ""
    input_data: Dict[str, Any] = field(default_factory=dict)
    output_data: Dict[str, Any] = field(default_factory=dict)
    success: bool = False
    quality_score: float = 0.0  # 0.0 to 1.0
    duration_ms: float = 0.0
    feedback: Optional[str] = None
    timestamp: float = field(default_factory=time.time)
    context: Dict[str, Any] = field(default_factory=dict)


class ExperienceStore:
    def __init__(self):
        self._experiences: List[Experience] = []

    def record(self, exp: Experience):
        self._experiences.append(exp)

    def get_successes(
        self, task_type: str, min_score: float = 0.8
    ) -> List[Experience]:
        return [
            e for e in self._experiences
            if e.task_type == task_type
            and e.success
            and e.quality_score >= min_score
        ]

    def get_failures(self, task_type: str) -> List[Experience]:
        return [
            e for e in self._experiences
            if e.task_type == task_type and not e.success
        ]

    def get_agent_stats(self, agent_id: str) -> Dict:
        agent_exps = [
            e for e in self._experiences if e.agent_id == agent_id
        ]
        if not agent_exps:
            return {"total": 0, "success_rate": 0, "avg_score": 0}
        successes = sum(1 for e in agent_exps if e.success)
        return {
            "total": len(agent_exps),
            "success_rate": successes / len(agent_exps),
            "avg_score": sum(e.quality_score for e in agent_exps)
            / len(agent_exps),
            "avg_duration_ms": sum(e.duration_ms for e in agent_exps)
            / len(agent_exps),
        }
```
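The aggregation inside get_agent_stats is easy to sanity-check in isolation. This sketch mirrors its logic on a minimal stand-in record type (the Record class and the sample scores are illustrative, not part of the system above):

```python
from dataclasses import dataclass


@dataclass
class Record:  # minimal stand-in for Experience
    success: bool
    quality_score: float


def agent_stats(records):
    # Same empty-case guard and averaging as get_agent_stats
    if not records:
        return {"total": 0, "success_rate": 0, "avg_score": 0}
    return {
        "total": len(records),
        "success_rate": sum(r.success for r in records) / len(records),
        "avg_score": sum(r.quality_score for r in records) / len(records),
    }


stats = agent_stats([Record(True, 0.9), Record(False, 0.4), Record(True, 0.8)])
# success_rate is 2/3; avg_score is roughly 0.7
```

Note that booleans sum as integers in Python, which is what makes the one-line success rate work.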
Cross-Agent Learning: Extracting Lessons
The lesson extractor analyzes experiences across all agents to identify patterns of success and failure.
```python
from openai import AsyncOpenAI

client = AsyncOpenAI()


class LessonExtractor:
    def __init__(self, store: ExperienceStore):
        self.store = store

    async def extract_lessons(
        self, task_type: str, batch_size: int = 10
    ) -> List[str]:
        successes = self.store.get_successes(task_type)[-batch_size:]
        failures = self.store.get_failures(task_type)[-batch_size:]
        prompt = "Analyze these agent experiences and extract lessons.\n\n"
        prompt += "SUCCESSES:\n"
        for exp in successes:
            prompt += (
                f"- Agent {exp.agent_id}: score={exp.quality_score}, "
                f"input={exp.input_data}, approach={exp.context}\n"
            )
        prompt += "\nFAILURES:\n"
        for exp in failures:
            prompt += (
                f"- Agent {exp.agent_id}: feedback={exp.feedback}, "
                f"input={exp.input_data}, approach={exp.context}\n"
            )
        prompt += (
            "\nExtract 3-5 actionable lessons. For each lesson, "
            "state what to do and what to avoid."
        )
        response = await client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {
                    "role": "system",
                    "content": (
                        "You analyze agent performance data and extract "
                        "concise, actionable lessons."
                    ),
                },
                {"role": "user", "content": prompt},
            ],
        )
        lessons_text = response.choices[0].message.content
        return [l.strip() for l in lessons_text.split("\n") if l.strip()]
```
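The final split on newlines keeps every non-empty line, including any bullet markers or numbering the model happens to emit. A small normalizer can strip that formatting before lessons are stored; normalize_lessons is a hypothetical helper, and its regex assumes common bullet styles only:

```python
import re


def normalize_lessons(raw: str) -> list[str]:
    lessons = []
    for line in raw.split("\n"):
        # Strip leading bullet markers ("-", "*") and numbering ("1.", "2)")
        cleaned = re.sub(r"^\s*(?:[-*]|\d+[.)])\s+", "", line).strip()
        if cleaned:
            lessons.append(cleaned)
    return lessons


normalize_lessons("1. Keep prompts short\n- Avoid vague criteria\n\n")
# ["Keep prompts short", "Avoid vague criteria"]
```

Normalizing here also makes duplicate detection easier later, since the same lesson phrased with different bullet styles collapses to one string.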
Dynamic Prompt Enhancement
Once lessons are extracted, they get injected into agent system prompts — making the agent team genuinely self-improving.
```python
class AdaptiveAgent:
    def __init__(
        self,
        agent_id: str,
        base_system_prompt: str,
        store: ExperienceStore,
        extractor: LessonExtractor,
    ):
        self.agent_id = agent_id
        self.base_prompt = base_system_prompt
        self.store = store
        self.extractor = extractor
        self.learned_lessons: List[str] = []
        self.executions_since_update = 0

    async def maybe_update_lessons(self, task_type: str):
        self.executions_since_update += 1
        if self.executions_since_update >= 10:
            self.learned_lessons = await self.extractor.extract_lessons(
                task_type
            )
            self.executions_since_update = 0

    def get_enhanced_prompt(self) -> str:
        if not self.learned_lessons:
            return self.base_prompt
        lessons_block = "\n".join(
            f"- {lesson}" for lesson in self.learned_lessons
        )
        return (
            f"{self.base_prompt}\n\n"
            f"LESSONS FROM PAST EXPERIENCE:\n{lessons_block}"
        )

    async def execute(self, task_type: str, input_data: Dict) -> Dict:
        await self.maybe_update_lessons(task_type)
        start = time.time()
        response = await client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": self.get_enhanced_prompt()},
                {"role": "user", "content": str(input_data)},
            ],
        )
        output = {"result": response.choices[0].message.content}
        duration = (time.time() - start) * 1000
        self.store.record(Experience(
            agent_id=self.agent_id,
            task_type=task_type,
            input_data=input_data,
            output_data=output,
            success=True,  # Provisional; an evaluator should revise this
            quality_score=0.0,  # Updated later by evaluation
            duration_ms=duration,
        ))
        return output
```
Performance Benchmarking Dashboard
Track how the team improves over time by monitoring key metrics across sliding windows.
```python
class TeamBenchmark:
    def __init__(self, store: ExperienceStore, agent_ids: List[str]):
        self.store = store
        self.agent_ids = agent_ids

    def report(self, window_size: int = 50) -> Dict:
        team_stats = {}
        for agent_id in self.agent_ids:
            # Score each agent on its most recent window_size experiences
            # so the report reflects current, not lifetime, performance
            recent = [
                e for e in self.store._experiences
                if e.agent_id == agent_id
            ][-window_size:]
            if not recent:
                team_stats[agent_id] = {
                    "total": 0, "success_rate": 0, "avg_score": 0
                }
                continue
            team_stats[agent_id] = {
                "total": len(recent),
                "success_rate": sum(
                    1 for e in recent if e.success
                ) / len(recent),
                "avg_score": sum(
                    e.quality_score for e in recent
                ) / len(recent),
                "avg_duration_ms": sum(
                    e.duration_ms for e in recent
                ) / len(recent),
            }
        overall_scores = [
            s["avg_score"] for s in team_stats.values() if s["total"] > 0
        ]
        return {
            "agent_stats": team_stats,
            "team_avg_score": (
                sum(overall_scores) / len(overall_scores)
                if overall_scores else 0
            ),
            "team_size": len(self.agent_ids),
            "total_experiences": sum(
                s["total"] for s in team_stats.values()
            ),
        }
```
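To see whether the team is actually improving, compare scores across consecutive windows rather than a single lifetime average. A minimal sketch (window_averages is a hypothetical helper, not part of TeamBenchmark):

```python
def window_averages(scores: list[float], window: int = 50) -> list[float]:
    # Average quality score per consecutive window, oldest first; a
    # rising sequence suggests the lesson loop is paying off
    return [
        sum(scores[i:i + window]) / len(scores[i:i + window])
        for i in range(0, len(scores), window)
    ]


trend = window_averages([0.5] * 50 + [0.8] * 50, window=50)
# roughly [0.5, 0.8]: the second window outperforms the first
```

Plotting these window averages per agent gives the simplest possible improvement dashboard.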
Skill Transfer Between Agents
When one agent develops expertise in a specific task type, transfer its lessons to other agents handling similar tasks.
```python
async def transfer_skills(
    source_agent: AdaptiveAgent,
    target_agent: AdaptiveAgent,
    task_type: str,
):
    source_successes = source_agent.store.get_successes(task_type)
    if not source_successes:
        return
    # Extract what made the source agent successful
    lessons = await source_agent.extractor.extract_lessons(task_type)
    target_agent.learned_lessons.extend(lessons)
    print(
        f"Transferred {len(lessons)} lessons from "
        f"{source_agent.agent_id} to {target_agent.agent_id}"
    )
```
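One caveat: extend() will happily append duplicates if the same lesson is transferred twice. An order-preserving merge avoids that; merge_lessons is a hypothetical helper you could call in place of the extend above:

```python
def merge_lessons(existing: list[str], incoming: list[str]) -> list[str]:
    # Append only lessons the target agent does not already have,
    # preserving the order in which lessons were learned
    merged = list(existing)
    for lesson in incoming:
        if lesson not in merged:
            merged.append(lesson)
    return merged


merge_lessons(["validate inputs"], ["validate inputs", "cite sources"])
# ["validate inputs", "cite sources"]
```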
The team evolves: individual agents learn from their own experiences, the lesson extractor identifies cross-agent patterns, and skill transfer propagates successful strategies — creating a feedback loop where the team's collective performance rises continuously.
FAQ
How do I evaluate quality_score automatically?
Use a dedicated evaluator agent that scores outputs against acceptance criteria. For code generation, run the code and check if tests pass. For text generation, use an LLM-as-judge approach with a rubric. For classification tasks, compare against ground truth labels. The key is automating this so every experience gets scored without human involvement.
Won't the learned lessons accumulate and bloat the system prompt?
Yes, if unchecked. Implement a lesson relevance decay: lessons that are older than N executions or that have not correlated with improved scores get pruned. Keep the active lesson set small (5-10 lessons) and periodically consolidate overlapping lessons into summary rules.
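One way to implement that decay, sketched with a hypothetical Lesson record whose age and score_delta fields you would maintain in the evaluation loop:

```python
from dataclasses import dataclass


@dataclass
class Lesson:
    text: str
    age: int            # executions since this lesson was extracted
    score_delta: float  # avg quality change observed while it was active


def prune_lessons(lessons: list[Lesson], max_age: int = 100,
                  max_active: int = 7) -> list[Lesson]:
    # Drop stale lessons, then keep only those with the strongest
    # observed correlation with improved scores
    fresh = [l for l in lessons if l.age <= max_age]
    fresh.sort(key=lambda l: l.score_delta, reverse=True)
    return fresh[:max_active]
```

The thresholds here (100 executions, 7 active lessons) are illustrative defaults; tune them against your own token budget and task volume.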
How do I prevent negative transfer — where lessons from one context hurt performance in another?
Tag lessons with the task type and input characteristics where they were learned. Only inject lessons that match the current task's context. Additionally, A/B test lesson sets: occasionally run without the learned lessons and compare performance. If lessons are hurting, reset and re-extract from recent data.
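Tagging can be as simple as storing (task_type, lesson) pairs and filtering at prompt-build time; lessons_for_task is a hypothetical helper showing the filter:

```python
def lessons_for_task(tagged_lessons: list[tuple[str, str]],
                     task_type: str) -> list[str]:
    # Only inject lessons learned on the same task type
    return [text for tag, text in tagged_lessons if tag == task_type]


tagged = [("summarize", "Keep it short"), ("classify", "Check label set")]
lessons_for_task(tagged, "summarize")
# ["Keep it short"]
```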