
Semantic Memory for AI Agents: Using Embeddings to Remember Relevant Facts

Learn how to build a semantic memory system for AI agents using text embeddings, similarity thresholds, and memory consolidation to retrieve the most relevant facts from past interactions.

What Is Semantic Memory?

In cognitive science, semantic memory is the store of general knowledge and facts — distinct from episodic memory (specific events) and procedural memory (how to do things). For AI agents, semantic memory is a retrieval system that finds stored information based on meaning rather than exact keywords.

The core idea is simple: convert text into numerical vectors (embeddings) that capture semantic meaning, then use vector similarity to find the most relevant stored facts when the agent needs them. A query about "monthly subscription cost" should retrieve a memory stored as "The plan is priced at $49/month" even though the words barely overlap.

Generating Embeddings

Embeddings are produced by specialized models that map text to high-dimensional vectors. Similar meanings produce vectors that are close together in this space.

import openai
import numpy as np
from typing import List

client = openai.OpenAI()

def embed_text(text: str) -> List[float]:
    """Generate an embedding vector for a single text string."""
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text,
    )
    return response.data[0].embedding

def embed_batch(texts: List[str]) -> List[List[float]]:
    """Generate embeddings for multiple texts in one API call."""
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=texts,
    )
    return [item.embedding for item in response.data]

def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Compute cosine similarity between two vectors."""
    a_arr, b_arr = np.array(a), np.array(b)
    return float(np.dot(a_arr, b_arr) / (np.linalg.norm(a_arr) * np.linalg.norm(b_arr)))

The text-embedding-3-small model produces 1536-dimensional vectors and costs fractions of a cent per thousand tokens. If you need higher accuracy, text-embedding-3-large produces 3072-dimensional vectors at a higher price.
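As a quick sanity check, cosine similarity behaves as expected on toy vectors (real embeddings are high-dimensional, but the geometry is the same): vectors pointing in the same direction score 1.0 regardless of magnitude, and orthogonal vectors score 0.0.

```python
import numpy as np
from typing import List

def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Compute cosine similarity between two vectors."""
    a_arr, b_arr = np.array(a), np.array(b)
    return float(np.dot(a_arr, b_arr) / (np.linalg.norm(a_arr) * np.linalg.norm(b_arr)))

# Same direction, different magnitude -> 1.0
print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))  # 1.0
# Orthogonal vectors -> 0.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0
```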

Building a Semantic Memory Store

Here is a complete semantic memory implementation that stores facts with their embeddings and retrieves them by similarity.

from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional, Tuple

@dataclass
class SemanticMemory:
    content: str
    embedding: List[float]
    category: str
    importance: float = 0.5  # 0.0 to 1.0
    access_count: int = 0
    created_at: datetime = field(default_factory=datetime.utcnow)
    last_accessed: datetime = field(default_factory=datetime.utcnow)

class SemanticMemoryStore:
    def __init__(self, similarity_threshold: float = 0.7):
        self.memories: List[SemanticMemory] = []
        self.threshold = similarity_threshold

    def add(self, content: str, category: str, importance: float = 0.5):
        embedding = embed_text(content)

        # Check for duplicates before adding
        similar = self._find_similar(embedding, threshold=0.92)
        if similar:
            # Update existing memory instead of creating duplicate
            existing = similar[0][0]
            existing.content = content
            existing.embedding = embedding
            existing.last_accessed = datetime.utcnow()
            return existing

        memory = SemanticMemory(
            content=content,
            embedding=embedding,
            category=category,
            importance=importance,
        )
        self.memories.append(memory)
        return memory

    def recall(
        self,
        query: str,
        top_k: int = 5,
        category: Optional[str] = None,
    ) -> List[Tuple[SemanticMemory, float]]:
        """Retrieve the most relevant memories for a query."""
        query_embedding = embed_text(query)
        results = self._find_similar(
            query_embedding, threshold=self.threshold, category=category
        )

        # Update access metadata
        for memory, score in results[:top_k]:
            memory.access_count += 1
            memory.last_accessed = datetime.utcnow()

        return results[:top_k]

    def _find_similar(
        self,
        embedding: List[float],
        threshold: float = 0.7,
        category: Optional[str] = None,
    ) -> List[Tuple[SemanticMemory, float]]:
        scored = []
        for mem in self.memories:
            if category and mem.category != category:
                continue
            score = cosine_similarity(embedding, mem.embedding)
            if score >= threshold:
                scored.append((mem, score))
        scored.sort(key=lambda x: x[1], reverse=True)
        return scored
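To see the recall flow end to end without an API key, here is a self-contained sketch that swaps in a toy bag-of-words embedder (the `VOCAB` list and `toy_embed` function are illustrative stand-ins, not part of the store above). The query and memories share almost no exact wording beyond a few vocabulary words, yet the pricing fact still ranks first.

```python
import numpy as np

# Tiny fixed vocabulary standing in for a real embedding model
VOCAB = ["plan", "priced", "month", "support", "email", "phone", "cost"]

def toy_embed(text: str) -> list:
    """Bag-of-words vector over VOCAB (stands in for the embeddings API)."""
    words = text.lower().split()
    return [float(words.count(w)) for w in VOCAB]

def cosine(a, b):
    a, b = np.array(a), np.array(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

memories = [
    "The plan is priced at 49 per month",
    "Support is available by email and phone",
]
index = [(m, toy_embed(m)) for m in memories]

# Score every memory against the query and take the best match
query_vec = toy_embed("what does the plan cost per month")
scored = sorted(((cosine(query_vec, e), m) for m, e in index), reverse=True)
best_score, best = scored[0]
print(best)  # The plan is priced at 49 per month
```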

Relevance-Weighted Retrieval

Raw cosine similarity is a good start, but production systems often combine similarity with recency and importance for a composite relevance score.

import math

def compute_relevance(
    similarity: float,
    memory: SemanticMemory,
    recency_weight: float = 0.2,
    importance_weight: float = 0.15,
) -> float:
    """Combine similarity, recency, and importance into a single score."""
    hours_ago = (datetime.utcnow() - memory.last_accessed).total_seconds() / 3600
    recency_score = math.exp(-0.01 * hours_ago)  # exponential decay

    return (
        (1 - recency_weight - importance_weight) * similarity
        + recency_weight * recency_score
        + importance_weight * memory.importance
    )

This formula ensures that recent, important memories rank higher when similarity scores are close.
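Plugging in numbers makes the weighting concrete. With the default weights (0.65 similarity, 0.2 recency, 0.15 importance), a memory last accessed 10 hours ago with importance 0.9 and similarity 0.80 works out as follows:

```python
import math

similarity, importance, hours_ago = 0.80, 0.9, 10
recency_weight, importance_weight = 0.2, 0.15

recency_score = math.exp(-0.01 * hours_ago)  # ~0.905 after 10 hours

relevance = (
    (1 - recency_weight - importance_weight) * similarity  # 0.65 * 0.80
    + recency_weight * recency_score                       # 0.20 * 0.905
    + importance_weight * importance                       # 0.15 * 0.90
)
print(round(relevance, 3))  # 0.836
```

The importance term lifts this memory above a competitor with the same 0.80 similarity but default 0.5 importance, which is exactly the tie-breaking behavior you want.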

Memory Consolidation

Over time, a semantic memory store accumulates redundant or overlapping entries. Consolidation merges similar memories to keep the store efficient.

def consolidate_memories(
    store: SemanticMemoryStore,
    merge_threshold: float = 0.88,
) -> int:
    """Merge highly similar memories to reduce redundancy."""
    merged_count = 0
    skip_indices = set()

    for i, mem_a in enumerate(store.memories):
        if i in skip_indices:
            continue
        for j, mem_b in enumerate(store.memories[i + 1:], start=i + 1):
            if j in skip_indices:
                continue
            sim = cosine_similarity(mem_a.embedding, mem_b.embedding)
            if sim >= merge_threshold:
                # Keep the content of the more important memory
                if mem_b.importance > mem_a.importance:
                    mem_a.content = mem_b.content
                    mem_a.embedding = mem_b.embedding
                mem_a.importance = max(mem_a.importance, mem_b.importance)
                mem_a.access_count += mem_b.access_count
                skip_indices.add(j)
                merged_count += 1

    store.memories = [
        m for i, m in enumerate(store.memories) if i not in skip_indices
    ]
    return merged_count
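To see which pairs would actually merge, here is the same pairwise comparison run standalone on toy 3-dimensional vectors (the sentences and vectors are made up for illustration). Only the two near-duplicate pricing facts clear the 0.88 threshold; the unrelated support fact survives untouched.

```python
import numpy as np

def cosine(a, b):
    a, b = np.array(a), np.array(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

embeddings = {
    "plan costs 49 per month":  [1.0, 0.1, 0.0],
    "monthly plan price is 49": [0.95, 0.15, 0.05],
    "support hours are 9 to 5": [0.0, 0.2, 1.0],
}
items = list(embeddings.items())
merge_threshold = 0.88

# Every pair (i, j) with i < j whose similarity clears the threshold
pairs = [
    (a, b)
    for i, (a, va) in enumerate(items)
    for b, vb in items[i + 1:]
    if cosine(va, vb) >= merge_threshold
]
print(pairs)  # [('plan costs 49 per month', 'monthly plan price is 49')]
```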

FAQ

How do I choose the right similarity threshold?

Start with 0.7 for general retrieval and tune based on your data. Lower thresholds (0.5-0.6) cast a wider net but include more noise. Higher thresholds (0.8+) are more precise but may miss relevant matches. Test with real queries from your domain and adjust.
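One practical way to tune is to hand-label a handful of query/memory pairs as relevant or not, then sweep thresholds and watch precision trade off against recall (the similarity scores and labels below are made up for illustration):

```python
# (similarity, is_relevant) pairs from hand-labeled query/memory matches
labeled = [(0.91, True), (0.82, True), (0.74, True), (0.68, False),
           (0.63, True), (0.55, False), (0.41, False)]

total_relevant = sum(rel for _, rel in labeled)
for threshold in (0.5, 0.6, 0.7, 0.8):
    kept = [rel for sim, rel in labeled if sim >= threshold]
    precision = sum(kept) / len(kept)       # fraction of retrieved that are relevant
    recall = sum(kept) / total_relevant     # fraction of relevant that are retrieved
    print(f"{threshold:.1f}: precision={precision:.2f} recall={recall:.2f}")
```

On this toy data, 0.7 retrieves only relevant memories but misses the 0.63 match; 0.5 catches everything relevant at the cost of extra noise.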

Are there alternatives to OpenAI embeddings?

Yes. Open-source models like sentence-transformers/all-MiniLM-L6-v2 run locally with no API costs. Cohere and Voyage AI also offer embedding APIs. The choice depends on your latency, cost, and accuracy requirements.

How do I handle memory that becomes outdated?

Attach a timestamp and optionally a TTL (time-to-live) to each memory. Periodically sweep for expired entries. For facts that change — like a user's address — use the duplicate detection logic to overwrite the old entry rather than creating a conflicting one.
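A minimal sketch of such a TTL sweep, assuming each memory carries a `created_at` timestamp and an optional `ttl_seconds` field (the field names and dict representation here are illustrative, not part of the store above):

```python
from datetime import datetime, timedelta

def sweep_expired(memories, now=None):
    """Drop memories whose TTL has elapsed; memories with no TTL are kept forever."""
    now = now or datetime.utcnow()
    return [
        m for m in memories
        if m.get("ttl_seconds") is None
        or now - m["created_at"] < timedelta(seconds=m["ttl_seconds"])
    ]

now = datetime.utcnow()
memories = [
    {"content": "user address", "created_at": now - timedelta(days=2), "ttl_seconds": 86400},
    {"content": "plan price",   "created_at": now - timedelta(days=2), "ttl_seconds": None},
]
fresh = sweep_expired(memories, now=now)
print([m["content"] for m in fresh])  # ['plan price']
```

Run this periodically (for example, alongside consolidation) so stale facts never reach the retrieval step.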


#SemanticMemory #Embeddings #VectorSearch #AIAgents #AgenticAI #LearnAI #AIEngineering
