Hierarchical Memory for AI Agents: Working Memory, Short-Term, and Long-Term Tiers
Learn how to design a three-tier memory architecture for AI agents with working memory, short-term buffers, and long-term stores, including promotion rules, eviction policies, and retrieval priority.
Why a Single Memory Store Falls Short
Most agent frameworks treat memory as a flat list. Every fact, observation, and user message lives in one undifferentiated pool. This works for toy demos, but in production the agent slows down as the memory grows, retrieval quality degrades, and context windows overflow with irrelevant details.
Human cognition solves this with hierarchical memory. Working memory holds the immediate task context. Short-term memory retains recent interactions. Long-term memory stores consolidated knowledge built up over days and weeks. An AI agent benefits from the same layered approach.
The Three-Tier Model
The hierarchy consists of three tiers, each with distinct capacity, retention, and retrieval characteristics.
Working Memory holds the current task context. It is small, fast, and completely replaced when the agent switches tasks. Think of it as the agent's scratchpad.
Short-Term Memory retains recent conversation turns and observations. It has a fixed window size and uses a FIFO eviction policy. Items that prove important get promoted to long-term storage.
Long-Term Memory stores consolidated facts, user preferences, and learned patterns. It persists across sessions and uses semantic search for retrieval.
In code, the three tiers map onto simple in-process structures:

```python
from collections import deque
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class MemoryItem:
    content: str
    timestamp: datetime
    importance: float = 0.5
    access_count: int = 0
    metadata: dict = field(default_factory=dict)


class HierarchicalMemory:
    def __init__(
        self,
        working_capacity: int = 5,
        short_term_capacity: int = 50,
    ):
        self.working: list[MemoryItem] = []
        self.short_term: deque[MemoryItem] = deque(maxlen=short_term_capacity)
        self.long_term: list[MemoryItem] = []
        self.working_capacity = working_capacity
        self.promotion_threshold = 0.7

    def add_to_working(self, content: str, importance: float = 0.5) -> None:
        item = MemoryItem(
            content=content,
            timestamp=datetime.now(),
            importance=importance,
        )
        self.working.append(item)
        # Overflow spills the oldest working-memory item into short-term.
        if len(self.working) > self.working_capacity:
            evicted = self.working.pop(0)
            self.short_term.append(evicted)

    def promote_to_long_term(self, item: MemoryItem) -> bool:
        """Promote important short-term memories."""
        if item.importance >= self.promotion_threshold:
            self.long_term.append(item)
            return True
        return False

    def sweep_short_term(self) -> list[MemoryItem]:
        """Review short-term memories for promotion."""
        promoted = []
        remaining = deque(maxlen=self.short_term.maxlen)
        for item in self.short_term:
            if self.promote_to_long_term(item):
                promoted.append(item)
            else:
                remaining.append(item)
        self.short_term = remaining
        return promoted
```
Promotion Rules
Promotion from short-term to long-term should not be arbitrary. Three signals determine whether a memory deserves long-term storage.
Importance score — memories tagged with high importance during creation (user preferences, explicit instructions) are promoted immediately.
Access frequency — if the agent retrieves a short-term memory multiple times, it is clearly useful and should be promoted.
Recency-weighted relevance — memories that remain relevant after multiple conversation turns have proven their staying power.
Combining the three signals into a single check:

```python
def should_promote(self, item: MemoryItem) -> bool:
    importance_signal = item.importance >= self.promotion_threshold
    access_signal = item.access_count >= 3
    age_seconds = (datetime.now() - item.timestamp).total_seconds()
    # Survived five minutes of conversation and was used at least once.
    survived_long = age_seconds > 300 and item.access_count > 0
    return importance_signal or access_signal or survived_long
```
Eviction Policies
Each tier needs a different eviction strategy. Working memory uses strict replacement — when a new task begins, the entire working memory is flushed. Short-term memory uses FIFO with a promotion check: before an item is evicted, the system evaluates whether it should be promoted. Long-term memory uses importance-decay eviction — items that have not been accessed in a long time and have low importance are candidates for removal.
```python
def evict_long_term(self, max_items: int = 1000) -> None:
    if len(self.long_term) <= max_items:
        return
    # Keep the items with the highest importance-times-usage score.
    self.long_term.sort(
        key=lambda m: m.importance * (m.access_count + 1),
        reverse=True,
    )
    self.long_term = self.long_term[:max_items]
```
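The strict-replacement policy for working memory can be sketched as a standalone helper. This is an illustrative sketch, not part of the class above: the `flush_working` name and the trimmed `Item` stub are assumptions, and the threshold mirrors the promotion threshold used earlier.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Item:
    content: str
    importance: float = 0.5


def flush_working(working: list, short_term: deque, threshold: float = 0.7) -> list:
    """On task switch, empty working memory entirely.

    High-importance items are returned for long-term storage; the rest
    spill into short-term, where normal FIFO eviction applies.
    """
    promoted = []
    for item in working:
        if item.importance >= threshold:
            promoted.append(item)
        else:
            short_term.append(item)
    working.clear()
    return promoted
```

The key property is that nothing survives in the working tier itself: after a task switch, every slot is free for the new task's context.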
Retrieval Priority
When the agent needs to recall information, it searches the tiers in order: working memory first (exact match, no embedding needed), then short-term (recency-weighted), then long-term (semantic search). This mirrors the human pattern where recent, immediately relevant memories surface first.
```python
def retrieve(self, query: str, top_k: int = 5) -> list[MemoryItem]:
    results = []
    needle = query.lower()
    # Tier 1: working memory -- exact substring match, no embedding needed.
    for item in self.working:
        if needle in item.content.lower():
            item.access_count += 1
            results.append(item)
    # Tier 2: short-term -- recency bias, newest first.
    for item in sorted(self.short_term, key=lambda m: m.timestamp, reverse=True):
        if needle in item.content.lower():
            item.access_count += 1
            results.append(item)
    # Tier 3: long-term -- would use embedding similarity in
    # production; simplified to substring match here for clarity.
    for item in self.long_term:
        if needle in item.content.lower():
            item.access_count += 1
            results.append(item)
    return results[:top_k]
```
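The long-term tier's comment gestures at semantic search. As a dependency-free stand-in, here is a bag-of-words cosine-similarity ranker; the function names are illustrative, and a production system would replace `bow_vector` with learned embeddings from a real embedding model.

```python
import math
from collections import Counter


def bow_vector(text: str) -> Counter:
    """Bag-of-words term counts; a crude stand-in for a learned embedding."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def semantic_search(query: str, memories: list[str], top_k: int = 2) -> list[str]:
    """Rank long-term memories by similarity to the query, best first."""
    q = bow_vector(query)
    ranked = sorted(memories, key=lambda m: cosine(q, bow_vector(m)), reverse=True)
    return ranked[:top_k]
```

Unlike substring matching, this retrieves memories that share vocabulary with the query even when no exact phrase overlaps, which is the property that matters for the long-term tier.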
FAQ
Why not just use a vector database for everything?
A vector database is excellent for long-term semantic retrieval, but it adds latency. Working memory and short-term memory benefit from in-process data structures that return results in microseconds. The hierarchical approach lets you use the right storage engine for each tier.
How do I decide the capacity for each tier?
Working memory should match the context needed for a single task — typically 3 to 10 items. Short-term memory should cover a full conversation session, usually 30 to 100 items. Long-term capacity depends on your storage budget, but start with 1,000 items and add eviction when you exceed it.
Can I persist all three tiers across agent restarts?
Working memory is ephemeral by design and should be rebuilt from the current task state. Short-term memory can be serialized to a session store like Redis with a TTL. Long-term memory should always be persisted to a database or vector store.
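The session-store-with-TTL pattern can be sketched with the standard library alone. This is a minimal sketch, not a Redis client: `dump_session` and `load_session` are hypothetical names, and an expiry timestamp inside the payload stands in for Redis's native key TTL.

```python
import json
import time
from dataclasses import asdict, dataclass


@dataclass
class Item:
    content: str
    importance: float = 0.5


def dump_session(items: list, ttl_seconds: float = 3600) -> str:
    """Serialize short-term memory with an expiry stamp (stands in for a TTL)."""
    return json.dumps({
        "expires_at": time.time() + ttl_seconds,
        "items": [asdict(i) for i in items],
    })


def load_session(blob: str) -> list:
    """Restore the session only if it has not expired; otherwise start fresh."""
    data = json.loads(blob)
    if time.time() >= data["expires_at"]:
        return []
    return [Item(**d) for d in data["items"]]
```

With a real Redis deployment, the expiry check disappears: the key is simply gone after the TTL elapses, and the agent starts the session with an empty short-term buffer.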
CallSphere Team
Expert insights on AI voice agents and customer communication automation.