
AI Agent Memory Systems: Short-Term, Long-Term, and Episodic Storage

A comprehensive technical guide to implementing memory systems for AI agents, covering working memory (context window management), long-term memory (vector stores and databases), episodic memory (experience replay), and the architecture patterns that make agents truly persistent.

Why Memory Matters for AI Agents

Without memory, every interaction with an AI agent starts from zero. The agent cannot learn from past mistakes, remember user preferences, or build on previous work. In production, this means lost context, repeated questions, and an inability to improve over time.

AI agent memory systems draw inspiration from human cognitive science, implementing three types of memory that serve different purposes:

  1. Working memory (short-term): The active context the agent reasons over right now
  2. Long-term memory: Persistent knowledge that survives across sessions
  3. Episodic memory: Records of past experiences the agent can recall and learn from

Working Memory: Managing the Context Window

The LLM context window is the agent's working memory. Its capacity is fixed (typically 128K to 200K tokens for current frontier models), and managing it effectively is the first challenge.

Conversation Summarization

When conversations exceed a threshold, summarize older messages to free up space:

class WorkingMemoryManager:
    def __init__(self, llm_client, max_tokens: int = 100_000, summary_threshold: int = 80_000):
        self.llm = llm_client
        self.max_tokens = max_tokens
        self.summary_threshold = summary_threshold
        self.messages = []
        self.summary = ""

    def estimate_tokens(self, messages: list) -> int:
        return sum(len(m["content"]) // 4 for m in messages)  # Rough estimate

    async def add_message(self, message: dict):
        self.messages.append(message)

        if self.estimate_tokens(self.messages) > self.summary_threshold:
            await self._compress()

    async def _compress(self):
        """Summarize older messages, keep recent ones intact"""
        # Keep the most recent messages (last 20%)
        split_point = len(self.messages) // 5 * 4
        old_messages = self.messages[:split_point]
        recent_messages = self.messages[split_point:]

        # Summarize old messages
        old_text = "\n".join(
            f"{m['role']}: {m['content']}" for m in old_messages
        )
        response = await self.llm.messages.create(
            model="claude-haiku-4-20250514",  # Use small model for summaries
            max_tokens=1000,
            messages=[{
                "role": "user",
                "content": f"Summarize this conversation, preserving all key "
                           f"facts, decisions, and action items:\n\n{old_text}"
            }],
        )

        self.summary = response.content[0].text
        self.messages = recent_messages

    def get_context(self) -> list[dict]:
        """Return the full context for the next LLM call"""
        context = []
        if self.summary:
            context.append({
                "role": "user",
                "content": f"[Previous conversation summary: {self.summary}]"
            })
        context.extend(self.messages)
        return context

Sliding Window with Importance Scoring

Not all messages are equally important. Score messages by relevance and drop the least important ones first:

class ImportanceBasedMemory:
    def __init__(self, llm_client, max_messages: int = 50):
        self.llm = llm_client
        self.max_messages = max_messages
        self.messages = []  # (message, importance_score)

    async def add_message(self, message: dict):
        # Score importance
        importance = await self._score_importance(message)
        self.messages.append((message, importance))

        # If over the limit, drop the least important messages,
        # but never remove the last 10 (recency matters)
        while len(self.messages) > self.max_messages:
            removable = self.messages[:-10]
            if not removable:
                break
            least_important = min(removable, key=lambda x: x[1])
            self.messages.remove(least_important)

    async def _score_importance(self, message: dict) -> float:
        """Score message importance: decisions, facts, and preferences score high"""
        content = message["content"]
        score = 0.5  # Default

        # Heuristic scoring (fast, no LLM call needed)
        importance_signals = [
            ("decided", 0.9), ("agreed", 0.9), ("confirmed", 0.8),
            ("my name is", 0.95), ("i prefer", 0.85), ("deadline", 0.8),
            ("error", 0.7), ("bug", 0.7), ("requirement", 0.8),
        ]
        for signal, signal_score in importance_signals:
            if signal in content.lower():
                score = max(score, signal_score)

        return score
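The keyword heuristic above can be exercised standalone. This sketch mirrors the signal list from the class (the specific keywords and scores are illustrative starting points; tune them for your domain):

```python
def score_importance(content: str, default: float = 0.5) -> float:
    """Keyword-based importance heuristic: decisions, identity, and
    deadlines outrank small talk. Returns a score in [0, 1]."""
    importance_signals = [
        ("decided", 0.9), ("agreed", 0.9), ("confirmed", 0.8),
        ("my name is", 0.95), ("i prefer", 0.85), ("deadline", 0.8),
        ("error", 0.7), ("bug", 0.7), ("requirement", 0.8),
    ]
    score = default
    lowered = content.lower()
    for signal, signal_score in importance_signals:
        if signal in lowered:
            score = max(score, signal_score)
    return score

print(score_importance("My name is Alice and I prefer short answers"))  # 0.95
print(score_importance("ok, sounds good"))                              # 0.5
```

Because this is pure string matching, it costs microseconds per message, which is why it is preferable to an LLM call on the hot path.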

Long-Term Memory: Persistent Knowledge Store

Long-term memory persists across sessions. It stores facts, preferences, and knowledge that the agent has learned about the user or domain.

Vector-Based Long-Term Memory

from datetime import datetime
from qdrant_client import QdrantClient, models
from sentence_transformers import SentenceTransformer

class LongTermMemory:
    def __init__(self, qdrant_url: str, collection: str = "agent_memory"):
        self.client = QdrantClient(qdrant_url)
        self.collection = collection
        self.embedder = SentenceTransformer("BAAI/bge-small-en-v1.5")

        # Create collection if it does not exist
        try:
            self.client.get_collection(collection)
        except Exception:
            self.client.create_collection(
                collection_name=collection,
                vectors_config=models.VectorParams(
                    size=384, distance=models.Distance.COSINE
                ),
            )

    async def store(self, content: str, metadata: dict = None):
        """Store a memory with metadata"""
        embedding = self.embedder.encode(content).tolist()
        point_id = hash(content + str(datetime.now())) % (2**63)

        self.client.upsert(
            collection_name=self.collection,
            points=[
                models.PointStruct(
                    id=point_id,
                    vector=embedding,
                    payload={
                        "content": content,
                        "timestamp": datetime.now().isoformat(),
                        "access_count": 0,
                        **(metadata or {}),
                    },
                )
            ],
        )

    async def recall(self, query: str, top_k: int = 5, min_score: float = 0.7) -> list[dict]:
        """Retrieve relevant memories"""
        query_embedding = self.embedder.encode(query).tolist()
        results = self.client.query_points(
            collection_name=self.collection,
            query=query_embedding,
            limit=top_k,
            score_threshold=min_score,
        )

        memories = []
        for point in results.points:
            memories.append({
                "content": point.payload["content"],
                "score": point.score,
                "timestamp": point.payload["timestamp"],
            })
            # Update access count for memory importance tracking
            self.client.set_payload(
                collection_name=self.collection,
                points=[point.id],
                payload={"access_count": point.payload.get("access_count", 0) + 1},
            )

        return memories

    async def forget(self, memory_id: int):
        """Explicitly remove a memory"""
        self.client.delete(
            collection_name=self.collection,
            points_selector=models.PointIdsList(points=[memory_id]),
        )
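Raw similarity scores treat a memory from five minutes ago and one from five months ago identically. A common refinement (illustrative, not part of the Qdrant API) is to blend similarity with an exponential recency decay when re-ranking recalled memories:

```python
import math
from datetime import datetime, timezone

def recency_weighted_score(similarity: float, stored_at: datetime,
                           now: datetime, half_life_days: float = 30.0) -> float:
    """Blend vector similarity with exponential time decay so stale
    memories rank below equally relevant fresh ones."""
    age_days = max((now - stored_at).total_seconds() / 86_400, 0.0)
    decay = 0.5 ** (age_days / half_life_days)
    return similarity * decay

now = datetime(2025, 6, 1, tzinfo=timezone.utc)
fresh = recency_weighted_score(0.8, datetime(2025, 6, 1, tzinfo=timezone.utc), now)
old = recency_weighted_score(0.8, datetime(2025, 5, 2, tzinfo=timezone.utc), now)
print(round(fresh, 3), round(old, 3))  # 0.8 0.4
```

With a 30-day half-life, a 30-day-old memory needs twice the similarity of a fresh one to rank equally; shorten the half-life for fast-moving domains.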

Structured Long-Term Memory with PostgreSQL

For memories that have clear structure (user preferences, facts, relationships), a relational database is more appropriate:

import json

import asyncpg

class StructuredMemory:
    def __init__(self, db_pool: asyncpg.Pool):
        self.pool = db_pool

    async def init_schema(self):
        async with self.pool.acquire() as conn:
            await conn.execute("""
                CREATE TABLE IF NOT EXISTS agent_memories (
                    id SERIAL PRIMARY KEY,
                    user_id VARCHAR(255) NOT NULL,
                    memory_type VARCHAR(50) NOT NULL,
                    key VARCHAR(255) NOT NULL,
                    value JSONB NOT NULL,
                    confidence FLOAT DEFAULT 1.0,
                    created_at TIMESTAMPTZ DEFAULT NOW(),
                    updated_at TIMESTAMPTZ DEFAULT NOW(),
                    access_count INT DEFAULT 0,
                    UNIQUE(user_id, memory_type, key)
                );
                CREATE INDEX IF NOT EXISTS idx_memories_user
                    ON agent_memories(user_id, memory_type);
            """)

    async def remember(self, user_id: str, memory_type: str,
                       key: str, value: dict, confidence: float = 1.0):
        async with self.pool.acquire() as conn:
            # asyncpg expects JSONB parameters as JSON text unless a custom
            # codec is registered, so serialize the dict explicitly
            await conn.execute("""
                INSERT INTO agent_memories (user_id, memory_type, key, value, confidence)
                VALUES ($1, $2, $3, $4, $5)
                ON CONFLICT (user_id, memory_type, key)
                DO UPDATE SET value = $4, confidence = $5, updated_at = NOW()
            """, user_id, memory_type, key, json.dumps(value), confidence)

    async def recall_by_type(self, user_id: str, memory_type: str) -> list[dict]:
        async with self.pool.acquire() as conn:
            rows = await conn.fetch("""
                SELECT key, value, confidence, updated_at
                FROM agent_memories
                WHERE user_id = $1 AND memory_type = $2
                ORDER BY confidence DESC, updated_at DESC
            """, user_id, memory_type)
            # JSONB comes back as JSON text by default; decode it for callers
            return [
                {**dict(row), "value": json.loads(row["value"])}
                for row in rows
            ]

# Usage:
# await memory.remember("user_123", "preference", "communication_style",
#                       {"value": "concise", "context": "User asked to be brief"})
# await memory.remember("user_123", "fact", "company",
#                       {"value": "Acme Corp", "role": "CTO"})
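Note that the upsert above always overwrites on conflict. A design choice worth making explicit is whether a lower-confidence observation should be allowed to replace a higher-confidence memory. A minimal gating policy, sketched as a pure function (in SQL this could become a `WHERE` clause on the `DO UPDATE`):

```python
def should_overwrite(existing_confidence: float, new_confidence: float,
                     margin: float = 0.0) -> bool:
    """Overwrite a stored memory only when the new evidence is at least
    as confident (optionally by a margin, to damp flip-flopping)."""
    return new_confidence >= existing_confidence + margin

print(should_overwrite(0.6, 0.9))  # True: stronger evidence wins
print(should_overwrite(0.9, 0.6))  # False: keep the confident memory
```

A small positive margin prevents two sources of equal confidence from repeatedly flipping a memory back and forth.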

Episodic Memory: Learning From Experience

Episodic memory records complete agent interactions -- including what worked and what failed -- so the agent can learn from past experiences.

from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Episode:
    episode_id: str
    task_description: str
    steps: list[dict] = field(default_factory=list)
    outcome: Optional[str] = None  # "success", "failure", "partial"
    lessons_learned: list[str] = field(default_factory=list)
    started_at: str = ""
    completed_at: str = ""

class EpisodicMemory:
    def __init__(self, storage, embedder):
        self.storage = storage
        self.embedder = embedder

    async def record_episode(self, episode: Episode):
        """Store a complete episode for future reference"""
        # Create searchable embedding from task + outcome + lessons
        search_text = (
            f"Task: {episode.task_description}. "
            f"Outcome: {episode.outcome}. "
            f"Lessons: {' '.join(episode.lessons_learned)}"
        )
        embedding = self.embedder.encode(search_text)

        await self.storage.store({
            "id": episode.episode_id,
            "embedding": embedding,
            "data": {
                "task": episode.task_description,
                "steps": episode.steps,
                "outcome": episode.outcome,
                "lessons": episode.lessons_learned,
                "duration": episode.completed_at,
            },
        })

    async def recall_similar_episodes(
        self, current_task: str, top_k: int = 3
    ) -> list[Episode]:
        """Find past episodes similar to the current task"""
        query_embedding = self.embedder.encode(current_task)
        results = await self.storage.search(query_embedding, top_k=top_k)
        return [self._to_episode(r) for r in results]

    async def get_lessons_for_task(self, task: str) -> list[str]:
        """Extract lessons learned from similar past tasks"""
        episodes = await self.recall_similar_episodes(task, top_k=5)
        lessons = []
        for ep in episodes:
            if ep.outcome == "failure":
                lessons.extend(
                    [f"[From failed attempt] {l}" for l in ep.lessons_learned]
                )
            elif ep.outcome == "success":
                lessons.extend(
                    [f"[From successful attempt] {l}" for l in ep.lessons_learned]
                )
        return lessons
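Episode stores its timestamps as ISO-8601 strings, so duration has to be derived rather than read directly. A small helper (illustrative, not part of the class above) recovers it for analytics or for flagging unusually slow runs:

```python
from datetime import datetime

def episode_duration_seconds(started_at: str, completed_at: str) -> float:
    """Compute an episode's wall-clock duration from ISO-8601 timestamps."""
    start = datetime.fromisoformat(started_at)
    end = datetime.fromisoformat(completed_at)
    return (end - start).total_seconds()

print(episode_duration_seconds("2025-06-01T10:00:00", "2025-06-01T10:02:30"))  # 150.0
```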

Integrating Episodic Memory Into the Agent Loop

import json

class MemoryAugmentedAgent:
    def __init__(self, llm, working_memory, long_term_memory, episodic_memory):
        self.llm = llm
        self.working = working_memory
        self.long_term = long_term_memory
        self.episodic = episodic_memory

    async def handle_request(self, user_id: str, request: str) -> str:
        # Step 1: Recall relevant long-term memories
        user_context = await self.long_term.recall(request, top_k=5)

        # Step 2: Recall relevant past episodes
        past_lessons = await self.episodic.get_lessons_for_task(request)

        # Step 3: Build enriched context
        memory_context = ""
        if user_context:
            memory_context += "Relevant memories:\n"
            memory_context += "\n".join(m["content"] for m in user_context)
        if past_lessons:
            memory_context += "\nLessons from past experiences:\n"
            memory_context += "\n".join(past_lessons)

        # Step 4: Add to working memory and generate response
        # Anthropic's Messages API accepts only "user"/"assistant" roles in
        # the messages list, so inject memory context as a user turn
        if memory_context:
            await self.working.add_message({
                "role": "user",
                "content": f"[Memory context]\n{memory_context}"
            })

        await self.working.add_message({"role": "user", "content": request})

        response = await self.llm.messages.create(
            model="claude-sonnet-4-20250514",
            messages=self.working.get_context(),
            max_tokens=4096,
        )

        result = response.content[0].text

        # Step 5: Extract and store new memories
        await self._extract_and_store_memories(user_id, request, result)

        return result

    async def _extract_and_store_memories(self, user_id, request, response):
        """Extract storable facts from the interaction"""
        extraction = await self.llm.messages.create(
            model="claude-haiku-4-20250514",
            max_tokens=500,
            messages=[{
                "role": "user",
                "content": f"""Extract any facts worth remembering from this interaction.
Return JSON array of {{"type": "preference|fact|instruction", "content": "..."}}.
Return empty array if nothing worth storing.

User: {request}
Assistant: {response}"""
            }],
        )

        try:
            memories = json.loads(extraction.content[0].text)
            for mem in memories:
                await self.long_term.store(
                    content=mem["content"],
                    metadata={"user_id": user_id, "type": mem["type"]}
                )
        except (json.JSONDecodeError, KeyError):
            pass  # Failed to extract -- not critical

Memory Architecture Patterns

| Pattern | Working Memory | Long-Term | Episodic | Best For |
|---|---|---|---|---|
| Stateless | Context window only | None | None | Simple Q&A |
| Session-based | Context + summary | None | None | Chat applications |
| Personalized | Context + summary | Vector store | None | User-facing assistants |
| Full memory | Context + summary | Vector + structured | Experience replay | Complex agents |

Key Takeaways

Memory transforms AI agents from stateless responders into persistent, learning systems. Working memory management (summarization, importance scoring) keeps the context window effective. Long-term memory (vector + structured storage) enables personalization and knowledge retention. Episodic memory (experience recording and replay) allows agents to learn from their own successes and failures. The right combination depends on your use case: simple chatbots need only working memory management, while complex autonomous agents benefit from all three layers working together.
