Hierarchical Memory for AI Agents: Working Memory, Short-Term, and Long-Term Tiers
Learn how to design a three-tier memory architecture for AI agents with working memory, short-term buffers, and long-term stores, including promotion rules, eviction policies, and retrieval priority.
Why a Single Memory Store Falls Short
Most agent frameworks treat memory as a flat list. Every fact, observation, and user message lives in one undifferentiated pool. This works for toy demos, but in production the agent slows down as the memory grows, retrieval quality degrades, and context windows overflow with irrelevant details.
Human cognition solves this with hierarchical memory. Working memory holds the immediate task context. Short-term memory retains recent interactions. Long-term memory stores consolidated knowledge built up over days and weeks. An AI agent benefits from the same layered approach.
The Three-Tier Model
The hierarchy consists of three tiers, each with distinct capacity, retention, and retrieval characteristics.
Working Memory holds the current task context. It is small, fast, and completely replaced when the agent switches tasks. Think of it as the agent's scratchpad.
Short-Term Memory retains recent conversation turns and observations. It has a fixed window size and uses a FIFO eviction policy. Items that prove important get promoted to long-term storage.
Long-Term Memory stores consolidated facts, user preferences, and learned patterns. It persists across sessions and uses semantic search for retrieval.
In code, the three tiers map onto simple in-process structures:

```python
from collections import deque
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class MemoryItem:
    content: str
    timestamp: datetime
    importance: float = 0.5
    access_count: int = 0
    metadata: dict = field(default_factory=dict)


class HierarchicalMemory:
    def __init__(
        self,
        working_capacity: int = 5,
        short_term_capacity: int = 50,
    ):
        self.working: list[MemoryItem] = []
        self.short_term: deque[MemoryItem] = deque(maxlen=short_term_capacity)
        self.long_term: list[MemoryItem] = []
        self.working_capacity = working_capacity
        self.promotion_threshold = 0.7

    def add_to_working(self, content: str, importance: float = 0.5) -> None:
        item = MemoryItem(
            content=content,
            timestamp=datetime.now(),
            importance=importance,
        )
        self.working.append(item)
        # Overflow spills the oldest working-memory item into short-term.
        if len(self.working) > self.working_capacity:
            evicted = self.working.pop(0)
            self.short_term.append(evicted)

    def promote_to_long_term(self, item: MemoryItem) -> bool:
        """Promote important short-term memories."""
        if item.importance >= self.promotion_threshold:
            self.long_term.append(item)
            return True
        return False

    def sweep_short_term(self) -> list[MemoryItem]:
        """Review short-term memories for promotion."""
        promoted = []
        remaining = deque(maxlen=self.short_term.maxlen)
        for item in self.short_term:
            if self.promote_to_long_term(item):
                promoted.append(item)
            else:
                remaining.append(item)
        self.short_term = remaining
        return promoted
```
Promotion Rules
Promotion from short-term to long-term should not be arbitrary. Three signals determine whether a memory deserves long-term storage.
Importance score — memories tagged with high importance during creation (user preferences, explicit instructions) are promoted immediately.
Access frequency — if the agent retrieves a short-term memory multiple times, it is clearly useful and should be promoted.
Recency-weighted relevance — memories that remain relevant after multiple conversation turns have proven their staying power.
Combining the three signals into a single check:

```python
def should_promote(self, item: MemoryItem) -> bool:
    importance_signal = item.importance >= self.promotion_threshold
    access_signal = item.access_count >= 3
    age_seconds = (datetime.now() - item.timestamp).total_seconds()
    # Survived five minutes of conversation and was used at least once.
    survived_long = age_seconds > 300 and item.access_count > 0
    return importance_signal or access_signal or survived_long
```
Eviction Policies
Each tier needs a different eviction strategy. Working memory uses strict replacement — when a new task begins, the entire working memory is flushed. Short-term memory uses FIFO with a promotion check: before an item is evicted, the system evaluates whether it should be promoted. Long-term memory uses importance-decay eviction — items that have not been accessed in a long time and have low importance are candidates for removal.
```python
def evict_long_term(self, max_items: int = 1000) -> None:
    if len(self.long_term) <= max_items:
        return
    # Keep the items with the highest importance-times-usage score.
    self.long_term.sort(
        key=lambda m: m.importance * (m.access_count + 1),
        reverse=True,
    )
    self.long_term = self.long_term[:max_items]
```
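The strict-replacement policy for working memory can be sketched as a standalone helper. This is an illustrative sketch, not part of the class above: the `flush_working` name and the trimmed `Item` stub are assumptions, and the threshold mirrors the promotion threshold used earlier.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Item:
    content: str
    importance: float = 0.5


def flush_working(working: list, short_term: deque, threshold: float = 0.7) -> list:
    """On task switch, empty working memory entirely.

    High-importance items are returned for long-term storage; the rest
    spill into short-term, where normal FIFO eviction applies.
    """
    promoted = []
    for item in working:
        if item.importance >= threshold:
            promoted.append(item)
        else:
            short_term.append(item)
    working.clear()
    return promoted
```

The key property is that nothing survives in the working tier itself: after a task switch, every slot is free for the new task's context.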
Retrieval Priority
When the agent needs to recall information, it searches the tiers in order: working memory first (exact match, no embedding needed), then short-term (recency-weighted), then long-term (semantic search). This mirrors the human pattern where recent, immediately relevant memories surface first.
```python
def retrieve(self, query: str, top_k: int = 5) -> list[MemoryItem]:
    results = []
    needle = query.lower()
    # Tier 1: working memory -- exact substring match, no embedding needed.
    for item in self.working:
        if needle in item.content.lower():
            item.access_count += 1
            results.append(item)
    # Tier 2: short-term -- recency bias, newest first.
    for item in sorted(self.short_term, key=lambda m: m.timestamp, reverse=True):
        if needle in item.content.lower():
            item.access_count += 1
            results.append(item)
    # Tier 3: long-term -- would use embedding similarity in
    # production; simplified to substring match here for clarity.
    for item in self.long_term:
        if needle in item.content.lower():
            item.access_count += 1
            results.append(item)
    return results[:top_k]
```
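The long-term tier's comment gestures at semantic search. As a dependency-free stand-in, here is a bag-of-words cosine-similarity ranker; the function names are illustrative, and a production system would replace `bow_vector` with learned embeddings from a real embedding model.

```python
import math
from collections import Counter


def bow_vector(text: str) -> Counter:
    """Bag-of-words term counts; a crude stand-in for a learned embedding."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def semantic_search(query: str, memories: list[str], top_k: int = 2) -> list[str]:
    """Rank long-term memories by similarity to the query, best first."""
    q = bow_vector(query)
    ranked = sorted(memories, key=lambda m: cosine(q, bow_vector(m)), reverse=True)
    return ranked[:top_k]
```

Unlike substring matching, this retrieves memories that share vocabulary with the query even when no exact phrase overlaps, which is the property that matters for the long-term tier.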
FAQ
Why not just use a vector database for everything?
A vector database is excellent for long-term semantic retrieval, but it adds latency. Working memory and short-term memory benefit from in-process data structures that return results in microseconds. The hierarchical approach lets you use the right storage engine for each tier.
How do I decide the capacity for each tier?
Working memory should match the context needed for a single task — typically 3 to 10 items. Short-term memory should cover a full conversation session, usually 30 to 100 items. Long-term capacity depends on your storage budget, but start with 1,000 items and add eviction when you exceed it.
Can I persist all three tiers across agent restarts?
Working memory is ephemeral by design and should be rebuilt from the current task state. Short-term memory can be serialized to a session store like Redis with a TTL. Long-term memory should always be persisted to a database or vector store.
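The session-store-with-TTL pattern can be sketched with the standard library alone. This is a minimal sketch, not a Redis client: `dump_session` and `load_session` are hypothetical names, and an expiry timestamp inside the payload stands in for Redis's native key TTL.

```python
import json
import time
from dataclasses import asdict, dataclass


@dataclass
class Item:
    content: str
    importance: float = 0.5


def dump_session(items: list, ttl_seconds: float = 3600) -> str:
    """Serialize short-term memory with an expiry stamp (stands in for a TTL)."""
    return json.dumps({
        "expires_at": time.time() + ttl_seconds,
        "items": [asdict(i) for i in items],
    })


def load_session(blob: str) -> list:
    """Restore the session only if it has not expired; otherwise start fresh."""
    data = json.loads(blob)
    if time.time() >= data["expires_at"]:
        return []
    return [Item(**d) for d in data["items"]]
```

With a real Redis deployment, the expiry check disappears: the key is simply gone after the TTL elapses, and the agent starts the session with an empty short-term buffer.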
CallSphere Team
Expert insights on AI voice agents and customer communication automation.