Semantic Memory for AI Agents: Using Embeddings to Remember Relevant Facts
Learn how to build a semantic memory system for AI agents using text embeddings, similarity thresholds, and memory consolidation to retrieve the most relevant facts from past interactions.
What Is Semantic Memory?
In cognitive science, semantic memory is the store of general knowledge and facts — distinct from episodic memory (specific events) and procedural memory (how to do things). For AI agents, semantic memory is a retrieval system that finds stored information based on meaning rather than exact keywords.
The core idea is simple: convert text into numerical vectors (embeddings) that capture semantic meaning, then use vector similarity to find the most relevant stored facts when the agent needs them. A query about "monthly subscription cost" should retrieve a memory stored as "The plan is priced at $49/month" even though the words barely overlap.
Generating Embeddings
Embeddings are produced by specialized models that map text to high-dimensional vectors. Similar meanings produce vectors that are close together in this space.
import openai
import numpy as np
from typing import List

client = openai.OpenAI()

def embed_text(text: str) -> List[float]:
    """Generate an embedding vector for a single text string."""
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text,
    )
    return response.data[0].embedding

def embed_batch(texts: List[str]) -> List[List[float]]:
    """Generate embeddings for multiple texts in one API call."""
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=texts,
    )
    return [item.embedding for item in response.data]

def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Compute cosine similarity between two vectors."""
    a_arr, b_arr = np.array(a), np.array(b)
    return float(np.dot(a_arr, b_arr) / (np.linalg.norm(a_arr) * np.linalg.norm(b_arr)))
The text-embedding-3-small model produces 1536-dimensional vectors and costs fractions of a cent per thousand tokens. For higher retrieval accuracy, text-embedding-3-large produces 3072-dimensional vectors at a higher price per token.
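To build intuition for how cosine similarity behaves before wiring up an embedding API, here is a small self-contained check on hand-made toy vectors (no API call needed; the vectors are illustrative, not real embeddings):

```python
import numpy as np

def cosine_similarity(a, b):
    a_arr, b_arr = np.array(a, dtype=float), np.array(b, dtype=float)
    return float(np.dot(a_arr, b_arr) / (np.linalg.norm(a_arr) * np.linalg.norm(b_arr)))

# Vectors pointing the same direction score 1.0 regardless of magnitude;
# orthogonal vectors score 0.0; nearby directions score just under 1.0.
print(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))  # 1.0 (same direction)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))            # 0.0 (orthogonal)
print(cosine_similarity([1.0, 1.0], [1.0, 0.9]))            # close to 1.0
```

Because cosine similarity ignores vector magnitude, it compares direction only, which is exactly the property that makes it a good proxy for semantic closeness between embeddings.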
Building a Semantic Memory Store
Here is a complete semantic memory implementation that stores facts with their embeddings and retrieves them by similarity.
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional, Tuple

@dataclass
class SemanticMemory:
    content: str
    embedding: List[float]
    category: str
    importance: float = 0.5  # 0.0 to 1.0
    access_count: int = 0
    created_at: datetime = field(default_factory=datetime.utcnow)
    last_accessed: datetime = field(default_factory=datetime.utcnow)
class SemanticMemoryStore:
    def __init__(self, similarity_threshold: float = 0.7):
        self.memories: List[SemanticMemory] = []
        self.threshold = similarity_threshold

    def add(self, content: str, category: str, importance: float = 0.5):
        embedding = embed_text(content)
        # Check for near-duplicates before adding
        similar = self._find_similar(embedding, threshold=0.92)
        if similar:
            # Update the existing memory instead of creating a duplicate
            existing = similar[0][0]
            existing.content = content
            existing.embedding = embedding
            existing.importance = max(existing.importance, importance)
            existing.last_accessed = datetime.utcnow()
            return existing
        memory = SemanticMemory(
            content=content,
            embedding=embedding,
            category=category,
            importance=importance,
        )
        self.memories.append(memory)
        return memory
    def recall(
        self,
        query: str,
        top_k: int = 5,
        category: Optional[str] = None,
    ) -> List[Tuple[SemanticMemory, float]]:
        """Retrieve the most relevant memories for a query."""
        query_embedding = embed_text(query)
        results = self._find_similar(
            query_embedding, threshold=self.threshold, category=category
        )
        # Update access metadata on the memories we return
        for memory, score in results[:top_k]:
            memory.access_count += 1
            memory.last_accessed = datetime.utcnow()
        return results[:top_k]

    def _find_similar(
        self,
        embedding: List[float],
        threshold: float = 0.7,
        category: Optional[str] = None,
    ) -> List[Tuple[SemanticMemory, float]]:
        scored = []
        for mem in self.memories:
            if category and mem.category != category:
                continue
            score = cosine_similarity(embedding, mem.embedding)
            if score >= threshold:
                scored.append((mem, score))
        scored.sort(key=lambda x: x[1], reverse=True)
        return scored
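To see the add-then-recall flow end to end without calling an embedding API, the sketch below swaps in a toy bag-of-words embedder over a tiny fixed vocabulary (hypothetical, for illustration only) and runs the same cosine-ranking logic the store uses:

```python
import math
from typing import List, Tuple

VOCAB = ["plan", "price", "month", "support", "email", "phone"]

def toy_embed(text: str) -> List[float]:
    """Toy embedder: word-count vector over a fixed vocabulary.
    Stands in for a real embedding model, for illustration only."""
    words = text.lower().split()
    return [float(words.count(w)) for w in VOCAB]

def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

# Mimic the store: keep (content, embedding) pairs, recall by similarity.
memories: List[Tuple[str, List[float]]] = []
for fact in ["plan price is 49 per month", "support is by email and phone"]:
    memories.append((fact, toy_embed(fact)))

query = toy_embed("what is the month price of the plan")
ranked = sorted(memories, key=lambda m: cosine(query, m[1]), reverse=True)
print(ranked[0][0])  # the pricing fact ranks first
```

With real embeddings the mechanics are identical; the model simply replaces the word-count vector, so queries match facts by meaning rather than shared tokens.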
Relevance-Weighted Retrieval
Raw cosine similarity is a good start, but production systems often combine similarity with recency and importance for a composite relevance score.
import math

def compute_relevance(
    similarity: float,
    memory: SemanticMemory,
    recency_weight: float = 0.2,
    importance_weight: float = 0.15,
) -> float:
    """Combine similarity, recency, and importance into a single score."""
    hours_ago = (datetime.utcnow() - memory.last_accessed).total_seconds() / 3600
    recency_score = math.exp(-0.01 * hours_ago)  # exponential decay
    return (
        (1 - recency_weight - importance_weight) * similarity
        + recency_weight * recency_score
        + importance_weight * memory.importance
    )
This formula ensures that recent, important memories rank higher when similarity scores are close.
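As a concrete check of the formula, the snippet below recomputes the composite score by hand for a memory last accessed 24 hours ago (the similarity and importance values are made up for illustration):

```python
import math

similarity = 0.80
importance = 0.9
hours_ago = 24.0
recency_weight, importance_weight = 0.2, 0.15

# exp(-0.01 * 24) is roughly 0.787, so recency still carries most of
# its weight after one day and decays gradually after that.
recency_score = math.exp(-0.01 * hours_ago)
relevance = (
    (1 - recency_weight - importance_weight) * similarity
    + recency_weight * recency_score
    + importance_weight * importance
)
print(round(relevance, 3))  # 0.812
```

The decay constant (here 0.01 per hour) is a tuning knob: larger values make the agent favor very recent memories more aggressively.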
Memory Consolidation
Over time, a semantic memory store accumulates redundant or overlapping entries. Consolidation merges similar memories to keep the store efficient.
def consolidate_memories(
    store: SemanticMemoryStore,
    merge_threshold: float = 0.88,
) -> int:
    """Merge highly similar memories to reduce redundancy."""
    merged_count = 0
    skip_indices = set()
    for i, mem_a in enumerate(store.memories):
        if i in skip_indices:
            continue
        for j, mem_b in enumerate(store.memories[i + 1:], start=i + 1):
            if j in skip_indices:
                continue
            sim = cosine_similarity(mem_a.embedding, mem_b.embedding)
            if sim >= merge_threshold:
                # Keep the content of the more important memory;
                # accumulate importance and access counts either way
                if mem_b.importance > mem_a.importance:
                    mem_a.content = mem_b.content
                    mem_a.embedding = mem_b.embedding
                mem_a.importance = max(mem_a.importance, mem_b.importance)
                mem_a.access_count += mem_b.access_count
                skip_indices.add(j)
                merged_count += 1
    store.memories = [
        m for i, m in enumerate(store.memories) if i not in skip_indices
    ]
    return merged_count
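The merge pass can be sanity-checked on plain vectors. The snippet below runs the same pairwise-similarity loop over three hand-made embeddings, two of which are nearly identical (illustrative values, not real embeddings):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

embeddings = [
    [1.0, 0.0, 0.1],    # fact A
    [1.0, 0.02, 0.12],  # near-duplicate of A -> should be absorbed
    [0.0, 1.0, 0.0],    # unrelated fact -> should survive
]

merge_threshold = 0.88
skip = set()
for i in range(len(embeddings)):
    if i in skip:
        continue
    for j in range(i + 1, len(embeddings)):
        if j not in skip and cosine(embeddings[i], embeddings[j]) >= merge_threshold:
            skip.add(j)

kept = [e for k, e in enumerate(embeddings) if k not in skip]
print(len(kept))  # 2: the near-duplicate was merged away
```

Note the loop is O(n²) in the number of memories; for large stores, an approximate nearest-neighbor index keeps consolidation tractable.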
FAQ
How do I choose the right similarity threshold?
Start with 0.7 for general retrieval and tune based on your data. Lower thresholds (0.5-0.6) cast a wider net but include more noise. Higher thresholds (0.8+) are more precise but may miss relevant matches. Test with real queries from your domain and adjust.
Are there alternatives to OpenAI embeddings?
Yes. Open-source models like sentence-transformers/all-MiniLM-L6-v2 run locally with no API costs. Cohere and Voyage AI also offer embedding APIs. The choice depends on your latency, cost, and accuracy requirements.
How do I handle memory that becomes outdated?
Attach a timestamp and optionally a TTL (time-to-live) to each memory. Periodically sweep for expired entries. For facts that change — like a user's address — use the duplicate detection logic to overwrite the old entry rather than creating a conflicting one.
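A minimal TTL sweep might look like the sketch below; the TimedMemory dataclass and its field names are illustrative assumptions, not part of the store defined above:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional

@dataclass
class TimedMemory:
    content: str
    created_at: datetime
    ttl: Optional[timedelta] = None  # None means the memory never expires

def sweep_expired(memories: List[TimedMemory], now: datetime) -> List[TimedMemory]:
    """Drop memories whose TTL has elapsed; keep permanent and fresh ones."""
    return [
        m for m in memories
        if m.ttl is None or m.created_at + m.ttl > now
    ]

now = datetime(2024, 6, 1, 12, 0)
memories = [
    TimedMemory("permanent fact", now - timedelta(days=30)),
    TimedMemory("stale promo code", now - timedelta(days=10), ttl=timedelta(days=7)),
    TimedMemory("fresh address", now - timedelta(hours=1), ttl=timedelta(days=90)),
]
alive = sweep_expired(memories, now)
print([m.content for m in alive])  # ['permanent fact', 'fresh address']
```

Running the sweep on a schedule (or lazily at recall time) keeps expired facts from ever reaching the agent's context.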
CallSphere Team
Expert insights on AI voice agents and customer communication automation.