AI Agent Memory Systems: Short-Term, Long-Term, and Episodic Storage
A comprehensive technical guide to implementing memory systems for AI agents, covering working memory (context window management), long-term memory (vector stores and databases), episodic memory (experience replay), and the architecture patterns that make agents truly persistent.
Why Memory Matters for AI Agents
Without memory, every interaction with an AI agent starts from zero. The agent cannot learn from past mistakes, remember user preferences, or build on previous work. In production, this means lost context, repeated questions, and an inability to improve over time.
AI agent memory systems draw inspiration from human cognitive science, implementing three types of memory that serve different purposes:
- Working memory (short-term): The active context the agent reasons over right now
- Long-term memory: Persistent knowledge that survives across sessions
- Episodic memory: Records of past experiences the agent can recall and learn from
Working Memory: Managing the Context Window
The LLM context window is the agent's working memory. It has a fixed capacity (typically 128K-200K tokens for current frontier models), and managing it effectively is the first challenge.
Conversation Summarization
When conversations exceed a threshold, summarize older messages to free up space:
```python
class WorkingMemoryManager:
    def __init__(self, llm_client, max_tokens: int = 100_000,
                 summary_threshold: int = 80_000):
        self.llm = llm_client
        self.max_tokens = max_tokens
        self.summary_threshold = summary_threshold
        self.messages = []
        self.summary = ""

    def estimate_tokens(self, messages: list) -> int:
        # Rough heuristic: ~4 characters per token for English text
        return sum(len(m["content"]) // 4 for m in messages)

    async def add_message(self, message: dict):
        self.messages.append(message)
        if self.estimate_tokens(self.messages) > self.summary_threshold:
            await self._compress()

    async def _compress(self):
        """Summarize older messages, keep recent ones intact."""
        # Keep the most recent messages (last ~20%)
        split_point = len(self.messages) // 5 * 4
        old_messages = self.messages[:split_point]
        recent_messages = self.messages[split_point:]

        # Summarize old messages, folding in any previous summary so
        # earlier context is not lost across repeated compressions
        old_text = "\n".join(
            f"{m['role']}: {m['content']}" for m in old_messages
        )
        if self.summary:
            old_text = f"[Earlier summary: {self.summary}]\n{old_text}"

        response = await self.llm.messages.create(
            model="claude-haiku-4-20250514",  # Use a small model for summaries
            max_tokens=1000,
            messages=[{
                "role": "user",
                "content": f"Summarize this conversation, preserving all key "
                           f"facts, decisions, and action items:\n\n{old_text}"
            }],
        )
        self.summary = response.content[0].text
        self.messages = recent_messages

    def get_context(self) -> list[dict]:
        """Return the full context for the next LLM call."""
        context = []
        if self.summary:
            context.append({
                "role": "user",
                "content": f"[Previous conversation summary: {self.summary}]"
            })
        context.extend(self.messages)
        return context
```
Sliding Window with Importance Scoring
Not all messages are equally important. Score messages by relevance and drop the least important ones first:
```python
class ImportanceBasedMemory:
    def __init__(self, llm_client, max_messages: int = 50):
        self.llm = llm_client
        self.max_messages = max_messages
        self.messages = []  # List of (message, importance_score) tuples

    async def add_message(self, message: dict):
        # Score importance
        importance = await self._score_importance(message)
        self.messages.append((message, importance))

        # If over the limit, remove the least important (non-recent) message
        if len(self.messages) > self.max_messages:
            # Never remove the last 10 messages (recency matters)
            removable = self.messages[:-10]
            removable.sort(key=lambda x: x[1])
            # Remove the least important one
            least_important = removable[0]
            self.messages.remove(least_important)

    async def _score_importance(self, message: dict) -> float:
        """Score message importance: decisions, facts, and preferences score high."""
        content = message["content"]
        score = 0.5  # Default

        # Heuristic scoring (fast, no LLM call needed)
        importance_signals = [
            ("decided", 0.9), ("agreed", 0.9), ("confirmed", 0.8),
            ("my name is", 0.95), ("i prefer", 0.85), ("deadline", 0.8),
            ("error", 0.7), ("bug", 0.7), ("requirement", 0.8),
        ]
        for signal, signal_score in importance_signals:
            if signal in content.lower():
                score = max(score, signal_score)
        return score
```
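The heuristic scorer can be exercised on its own. A quick standalone check, using the same signal table (the sample messages are made up):

```python
importance_signals = [
    ("decided", 0.9), ("agreed", 0.9), ("confirmed", 0.8),
    ("my name is", 0.95), ("i prefer", 0.85), ("deadline", 0.8),
    ("error", 0.7), ("bug", 0.7), ("requirement", 0.8),
]

def score_importance(content: str) -> float:
    score = 0.5  # Default for messages matching no signal
    for signal, signal_score in importance_signals:
        if signal in content.lower():
            score = max(score, signal_score)  # Strongest signal wins
    return score

print(score_importance("My name is Ada and I prefer short answers"))  # 0.95
print(score_importance("ok, sounds good"))                            # 0.5
```

Keyword heuristics like this are crude but cheap; an LLM-based scorer is more accurate and can be swapped in behind the same interface when latency and cost allow.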
Long-Term Memory: Persistent Knowledge Store
Long-term memory persists across sessions. It stores facts, preferences, and knowledge that the agent has learned about the user or domain.
Vector-Based Long-Term Memory
```python
import uuid
from datetime import datetime

from qdrant_client import QdrantClient, models
from sentence_transformers import SentenceTransformer

class LongTermMemory:
    def __init__(self, qdrant_url: str, collection: str = "agent_memory"):
        self.client = QdrantClient(qdrant_url)
        self.collection = collection
        self.embedder = SentenceTransformer("BAAI/bge-small-en-v1.5")

        # Create collection if it does not exist
        try:
            self.client.get_collection(collection)
        except Exception:
            self.client.create_collection(
                collection_name=collection,
                vectors_config=models.VectorParams(
                    size=384, distance=models.Distance.COSINE
                ),
            )

    async def store(self, content: str, metadata: dict = None):
        """Store a memory with metadata."""
        embedding = self.embedder.encode(content).tolist()
        # UUIDs avoid the collisions and run-to-run instability of hash()-based IDs
        point_id = str(uuid.uuid4())
        self.client.upsert(
            collection_name=self.collection,
            points=[
                models.PointStruct(
                    id=point_id,
                    vector=embedding,
                    payload={
                        "content": content,
                        "timestamp": datetime.now().isoformat(),
                        "access_count": 0,
                        **(metadata or {}),
                    },
                )
            ],
        )

    async def recall(self, query: str, top_k: int = 5,
                     min_score: float = 0.7) -> list[dict]:
        """Retrieve relevant memories."""
        query_embedding = self.embedder.encode(query).tolist()
        results = self.client.query_points(
            collection_name=self.collection,
            query=query_embedding,
            limit=top_k,
            score_threshold=min_score,
        )
        memories = []
        for point in results.points:
            memories.append({
                "content": point.payload["content"],
                "score": point.score,
                "timestamp": point.payload["timestamp"],
            })
            # Update access count for memory importance tracking
            self.client.set_payload(
                collection_name=self.collection,
                points=[point.id],
                payload={"access_count": point.payload.get("access_count", 0) + 1},
            )
        return memories

    async def forget(self, memory_id: str):
        """Explicitly remove a memory."""
        self.client.delete(
            collection_name=self.collection,
            points_selector=models.PointIdsList(points=[memory_id]),
        )
```
Structured Long-Term Memory with PostgreSQL
For memories that have clear structure (user preferences, facts, relationships), a relational database is more appropriate:
```python
import json

import asyncpg

class StructuredMemory:
    def __init__(self, db_pool: asyncpg.Pool):
        self.pool = db_pool

    async def init_schema(self):
        async with self.pool.acquire() as conn:
            await conn.execute("""
                CREATE TABLE IF NOT EXISTS agent_memories (
                    id SERIAL PRIMARY KEY,
                    user_id VARCHAR(255) NOT NULL,
                    memory_type VARCHAR(50) NOT NULL,
                    key VARCHAR(255) NOT NULL,
                    value JSONB NOT NULL,
                    confidence FLOAT DEFAULT 1.0,
                    created_at TIMESTAMPTZ DEFAULT NOW(),
                    updated_at TIMESTAMPTZ DEFAULT NOW(),
                    access_count INT DEFAULT 0,
                    UNIQUE(user_id, memory_type, key)
                );
                CREATE INDEX IF NOT EXISTS idx_memories_user
                    ON agent_memories(user_id, memory_type);
            """)

    async def remember(self, user_id: str, memory_type: str,
                       key: str, value: dict, confidence: float = 1.0):
        async with self.pool.acquire() as conn:
            # asyncpg's default JSONB codec expects a JSON string, not a dict
            await conn.execute("""
                INSERT INTO agent_memories (user_id, memory_type, key, value, confidence)
                VALUES ($1, $2, $3, $4, $5)
                ON CONFLICT (user_id, memory_type, key)
                DO UPDATE SET value = $4, confidence = $5, updated_at = NOW()
            """, user_id, memory_type, key, json.dumps(value), confidence)

    async def recall_by_type(self, user_id: str, memory_type: str) -> list[dict]:
        async with self.pool.acquire() as conn:
            rows = await conn.fetch("""
                SELECT key, value, confidence, updated_at
                FROM agent_memories
                WHERE user_id = $1 AND memory_type = $2
                ORDER BY confidence DESC, updated_at DESC
            """, user_id, memory_type)
            # Decode the JSONB column back into dicts
            return [
                {**dict(row), "value": json.loads(row["value"])}
                for row in rows
            ]

# Usage:
# await memory.remember("user_123", "preference", "communication_style",
#                       {"value": "concise", "context": "User asked to be brief"})
# await memory.remember("user_123", "fact", "company",
#                       {"value": "Acme Corp", "role": "CTO"})
```
Episodic Memory: Learning From Experience
Episodic memory records complete agent interactions -- including what worked and what failed -- so the agent can consult past experiences when tackling similar tasks.
```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Episode:
    episode_id: str
    task_description: str
    steps: list[dict] = field(default_factory=list)
    outcome: Optional[str] = None  # "success", "failure", "partial"
    lessons_learned: list[str] = field(default_factory=list)
    started_at: str = ""
    completed_at: str = ""

class EpisodicMemory:
    def __init__(self, storage, embedder):
        self.storage = storage
        self.embedder = embedder

    async def record_episode(self, episode: Episode):
        """Store a complete episode for future reference."""
        # Create a searchable embedding from task + outcome + lessons
        search_text = (
            f"Task: {episode.task_description}. "
            f"Outcome: {episode.outcome}. "
            f"Lessons: {' '.join(episode.lessons_learned)}"
        )
        embedding = self.embedder.encode(search_text)
        await self.storage.store({
            "id": episode.episode_id,
            "embedding": embedding,
            "data": {
                "task": episode.task_description,
                "steps": episode.steps,
                "outcome": episode.outcome,
                "lessons": episode.lessons_learned,
                "started_at": episode.started_at,
                "completed_at": episode.completed_at,
            },
        })

    async def recall_similar_episodes(
        self, current_task: str, top_k: int = 3
    ) -> list[Episode]:
        """Find past episodes similar to the current task."""
        query_embedding = self.embedder.encode(current_task)
        results = await self.storage.search(query_embedding, top_k=top_k)
        return [self._to_episode(r) for r in results]

    async def get_lessons_for_task(self, task: str) -> list[str]:
        """Extract lessons learned from similar past tasks."""
        episodes = await self.recall_similar_episodes(task, top_k=5)
        lessons = []
        for ep in episodes:
            if ep.outcome == "failure":
                lessons.extend(
                    f"[From failed attempt] {l}" for l in ep.lessons_learned
                )
            elif ep.outcome == "success":
                lessons.extend(
                    f"[From successful attempt] {l}" for l in ep.lessons_learned
                )
        return lessons
```
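What gets embedded matters as much as how it is stored: concatenating task, outcome, and lessons means a future query about a similar task also surfaces how the earlier attempt ended. A standalone sketch of that search-text construction, using a hypothetical episode (the task and lesson are made up):

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Episode:
    episode_id: str
    task_description: str
    outcome: Optional[str] = None
    lessons_learned: list[str] = field(default_factory=list)

ep = Episode(
    episode_id="ep-001",  # Hypothetical example data
    task_description="Migrate the billing cron job to an event queue",
    outcome="failure",
    lessons_learned=["Verify consumer idempotency before cutover"],
)

# Same concatenation as record_episode: task + outcome + lessons
search_text = (
    f"Task: {ep.task_description}. "
    f"Outcome: {ep.outcome}. "
    f"Lessons: {' '.join(ep.lessons_learned)}"
)
print(search_text)
```

Embedding only the task description would still find similar episodes, but including outcome and lessons lets queries like "what went wrong with queue migrations" match directly.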
Integrating Episodic Memory Into the Agent Loop
```python
import json

class MemoryAugmentedAgent:
    def __init__(self, llm, working_memory, long_term_memory, episodic_memory):
        self.llm = llm
        self.working = working_memory
        self.long_term = long_term_memory
        self.episodic = episodic_memory

    async def handle_request(self, user_id: str, request: str) -> str:
        # Step 1: Recall relevant long-term memories
        user_context = await self.long_term.recall(request, top_k=5)

        # Step 2: Recall relevant past episodes
        past_lessons = await self.episodic.get_lessons_for_task(request)

        # Step 3: Build enriched context
        memory_context = ""
        if user_context:
            memory_context += "Relevant memories:\n"
            memory_context += "\n".join(m["content"] for m in user_context)
        if past_lessons:
            memory_context += "\nLessons from past experiences:\n"
            memory_context += "\n".join(past_lessons)

        # Step 4: Add to working memory and generate a response.
        # The Anthropic Messages API only accepts "user" and "assistant"
        # roles inside `messages`, so memory context is injected as a
        # bracketed user message.
        if memory_context:
            await self.working.add_message({
                "role": "user",
                "content": f"[Memory context]\n{memory_context}"
            })
        await self.working.add_message({"role": "user", "content": request})

        response = await self.llm.messages.create(
            model="claude-sonnet-4-20250514",
            messages=self.working.get_context(),
            max_tokens=4096,
        )
        result = response.content[0].text

        # Step 5: Extract and store new memories
        await self._extract_and_store_memories(user_id, request, result)
        return result

    async def _extract_and_store_memories(self, user_id, request, response):
        """Extract storable facts from the interaction."""
        extraction = await self.llm.messages.create(
            model="claude-haiku-4-20250514",
            max_tokens=500,
            messages=[{
                "role": "user",
                "content": f"""Extract any facts worth remembering from this interaction.
Return JSON array of {{"type": "preference|fact|instruction", "content": "..."}}.
Return empty array if nothing worth storing.

User: {request}
Assistant: {response}"""
            }],
        )
        try:
            memories = json.loads(extraction.content[0].text)
            for mem in memories:
                await self.long_term.store(
                    content=mem["content"],
                    metadata={"user_id": user_id, "type": mem["type"]}
                )
        except (json.JSONDecodeError, KeyError):
            pass  # Failed to extract -- not critical
```
Memory Architecture Patterns
| Pattern | Working Memory | Long-Term | Episodic | Best For |
|---|---|---|---|---|
| Stateless | Context window only | None | None | Simple Q&A |
| Session-based | Context + summary | None | None | Chat applications |
| Personalized | Context + summary | Vector store | None | User-facing assistants |
| Full memory | Context + summary | Vector + structured | Experience replay | Complex agents |
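One way to make the table operational is a small lookup that records which memory layers each pattern enables, so an agent factory can wire up only what a deployment needs. A sketch (the pattern names mirror the table; the structure itself is illustrative, not a prescribed API):

```python
# Which memory layers each architecture pattern enables (mirrors the table)
PATTERNS = {
    "stateless":     {"working": True, "long_term": False, "episodic": False},
    "session_based": {"working": True, "long_term": False, "episodic": False},
    "personalized":  {"working": True, "long_term": True,  "episodic": False},
    "full_memory":   {"working": True, "long_term": True,  "episodic": True},
}

def layers_for(pattern: str) -> list[str]:
    """Return the enabled memory layers for a pattern, in table order."""
    return [layer for layer, enabled in PATTERNS[pattern].items() if enabled]

print(layers_for("personalized"))  # ['working', 'long_term']
print(layers_for("full_memory"))   # ['working', 'long_term', 'episodic']
```

Starting with the simplest pattern that meets the requirement and promoting upward (for example, from session-based to personalized once user preferences start repeating) keeps infrastructure cost proportional to actual need.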
Key Takeaways
Memory transforms AI agents from stateless responders into persistent, learning systems. Working memory management (summarization, importance scoring) keeps the context window effective. Long-term memory (vector + structured storage) enables personalization and knowledge retention. Episodic memory (experience recording and replay) allows agents to learn from their own successes and failures. The right combination depends on your use case: simple chatbots need only working memory management, while complex autonomous agents benefit from all three layers working together.