Semantic Memory for AI Agents: Using Embeddings to Remember Relevant Facts
Learn how to build a semantic memory system for AI agents using text embeddings, similarity thresholds, and memory consolidation to retrieve the most relevant facts from past interactions.
What Is Semantic Memory?
In cognitive science, semantic memory is the store of general knowledge and facts — distinct from episodic memory (specific events) and procedural memory (how to do things). For AI agents, semantic memory is a retrieval system that finds stored information based on meaning rather than exact keywords.
The core idea is simple: convert text into numerical vectors (embeddings) that capture semantic meaning, then use vector similarity to find the most relevant stored facts when the agent needs them. A query about "monthly subscription cost" should retrieve a memory stored as "The plan is priced at $49/month" even though the words barely overlap.
Generating Embeddings
Embeddings are produced by specialized models that map text to high-dimensional vectors. Similar meanings produce vectors that are close together in this space.
import openai
import numpy as np
from typing import List

client = openai.OpenAI()

def embed_text(text: str) -> List[float]:
    """Generate an embedding vector for a single text string."""
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text,
    )
    return response.data[0].embedding

def embed_batch(texts: List[str]) -> List[List[float]]:
    """Generate embeddings for multiple texts in one API call."""
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=texts,
    )
    return [item.embedding for item in response.data]

def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Compute cosine similarity between two vectors."""
    a_arr, b_arr = np.array(a), np.array(b)
    return float(np.dot(a_arr, b_arr) / (np.linalg.norm(a_arr) * np.linalg.norm(b_arr)))
The text-embedding-3-small model produces 1536-dimensional vectors and costs fractions of a cent per thousand tokens. For higher retrieval accuracy, text-embedding-3-large produces 3072-dimensional vectors at a higher price per token.
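To build intuition for how cosine similarity behaves before wiring up an embedding API, here is a small self-contained check on hand-made toy vectors (no API call needed; the vectors are illustrative, not real embeddings):

```python
import numpy as np

def cosine_similarity(a, b):
    a_arr, b_arr = np.array(a, dtype=float), np.array(b, dtype=float)
    return float(np.dot(a_arr, b_arr) / (np.linalg.norm(a_arr) * np.linalg.norm(b_arr)))

# Vectors pointing the same direction score 1.0 regardless of magnitude;
# orthogonal vectors score 0.0; nearby directions score just under 1.0.
print(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))  # 1.0 (same direction)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))            # 0.0 (orthogonal)
print(cosine_similarity([1.0, 1.0], [1.0, 0.9]))            # close to 1.0
```

Because cosine similarity ignores vector magnitude, it compares direction only, which is exactly the property that makes it a good proxy for semantic closeness between embeddings.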
Building a Semantic Memory Store
Here is a complete semantic memory implementation that stores facts with their embeddings and retrieves them by similarity.
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional, Tuple

@dataclass
class SemanticMemory:
    content: str
    embedding: List[float]
    category: str
    importance: float = 0.5  # 0.0 to 1.0
    access_count: int = 0
    created_at: datetime = field(default_factory=datetime.utcnow)
    last_accessed: datetime = field(default_factory=datetime.utcnow)
class SemanticMemoryStore:
    def __init__(self, similarity_threshold: float = 0.7):
        self.memories: List[SemanticMemory] = []
        self.threshold = similarity_threshold

    def add(self, content: str, category: str, importance: float = 0.5):
        embedding = embed_text(content)
        # Check for near-duplicates before adding
        similar = self._find_similar(embedding, threshold=0.92)
        if similar:
            # Update the existing memory instead of creating a duplicate
            existing = similar[0][0]
            existing.content = content
            existing.embedding = embedding
            existing.importance = max(existing.importance, importance)
            existing.last_accessed = datetime.utcnow()
            return existing
        memory = SemanticMemory(
            content=content,
            embedding=embedding,
            category=category,
            importance=importance,
        )
        self.memories.append(memory)
        return memory
    def recall(
        self,
        query: str,
        top_k: int = 5,
        category: Optional[str] = None,
    ) -> List[Tuple[SemanticMemory, float]]:
        """Retrieve the most relevant memories for a query."""
        query_embedding = embed_text(query)
        results = self._find_similar(
            query_embedding, threshold=self.threshold, category=category
        )
        # Update access metadata on the memories we return
        for memory, score in results[:top_k]:
            memory.access_count += 1
            memory.last_accessed = datetime.utcnow()
        return results[:top_k]

    def _find_similar(
        self,
        embedding: List[float],
        threshold: float = 0.7,
        category: Optional[str] = None,
    ) -> List[Tuple[SemanticMemory, float]]:
        scored = []
        for mem in self.memories:
            if category and mem.category != category:
                continue
            score = cosine_similarity(embedding, mem.embedding)
            if score >= threshold:
                scored.append((mem, score))
        scored.sort(key=lambda x: x[1], reverse=True)
        return scored
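To see the add-then-recall flow end to end without calling an embedding API, the sketch below swaps in a toy bag-of-words embedder over a tiny fixed vocabulary (hypothetical, for illustration only) and runs the same cosine-ranking logic the store uses:

```python
import math
from typing import List, Tuple

VOCAB = ["plan", "price", "month", "support", "email", "phone"]

def toy_embed(text: str) -> List[float]:
    """Toy embedder: word-count vector over a fixed vocabulary.
    Stands in for a real embedding model, for illustration only."""
    words = text.lower().split()
    return [float(words.count(w)) for w in VOCAB]

def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

# Mimic the store: keep (content, embedding) pairs, recall by similarity.
memories: List[Tuple[str, List[float]]] = []
for fact in ["plan price is 49 per month", "support is by email and phone"]:
    memories.append((fact, toy_embed(fact)))

query = toy_embed("what is the month price of the plan")
ranked = sorted(memories, key=lambda m: cosine(query, m[1]), reverse=True)
print(ranked[0][0])  # the pricing fact ranks first
```

With real embeddings the mechanics are identical; the model simply replaces the word-count vector, so queries match facts by meaning rather than shared tokens.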
Relevance-Weighted Retrieval
Raw cosine similarity is a good start, but production systems often combine similarity with recency and importance for a composite relevance score.
import math

def compute_relevance(
    similarity: float,
    memory: SemanticMemory,
    recency_weight: float = 0.2,
    importance_weight: float = 0.15,
) -> float:
    """Combine similarity, recency, and importance into a single score."""
    hours_ago = (datetime.utcnow() - memory.last_accessed).total_seconds() / 3600
    recency_score = math.exp(-0.01 * hours_ago)  # exponential decay
    return (
        (1 - recency_weight - importance_weight) * similarity
        + recency_weight * recency_score
        + importance_weight * memory.importance
    )
This formula ensures that recent, important memories rank higher when similarity scores are close.
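As a concrete check of the formula, the snippet below recomputes the composite score by hand for a memory last accessed 24 hours ago (the similarity and importance values are made up for illustration):

```python
import math

similarity = 0.80
importance = 0.9
hours_ago = 24.0
recency_weight, importance_weight = 0.2, 0.15

# exp(-0.01 * 24) is roughly 0.787, so recency still carries most of
# its weight after one day and decays gradually after that.
recency_score = math.exp(-0.01 * hours_ago)
relevance = (
    (1 - recency_weight - importance_weight) * similarity
    + recency_weight * recency_score
    + importance_weight * importance
)
print(round(relevance, 3))  # 0.812
```

The decay constant (here 0.01 per hour) is a tuning knob: larger values make the agent favor very recent memories more aggressively.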
Memory Consolidation
Over time, a semantic memory store accumulates redundant or overlapping entries. Consolidation merges similar memories to keep the store efficient.
def consolidate_memories(
    store: SemanticMemoryStore,
    merge_threshold: float = 0.88,
) -> int:
    """Merge highly similar memories to reduce redundancy."""
    merged_count = 0
    skip_indices = set()
    for i, mem_a in enumerate(store.memories):
        if i in skip_indices:
            continue
        for j, mem_b in enumerate(store.memories[i + 1:], start=i + 1):
            if j in skip_indices:
                continue
            sim = cosine_similarity(mem_a.embedding, mem_b.embedding)
            if sim >= merge_threshold:
                # Keep the content of the more important memory;
                # accumulate importance and access counts either way
                if mem_b.importance > mem_a.importance:
                    mem_a.content = mem_b.content
                    mem_a.embedding = mem_b.embedding
                mem_a.importance = max(mem_a.importance, mem_b.importance)
                mem_a.access_count += mem_b.access_count
                skip_indices.add(j)
                merged_count += 1
    store.memories = [
        m for i, m in enumerate(store.memories) if i not in skip_indices
    ]
    return merged_count
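The merge pass can be sanity-checked on plain vectors. The snippet below runs the same pairwise-similarity loop over three hand-made embeddings, two of which are nearly identical (illustrative values, not real embeddings):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

embeddings = [
    [1.0, 0.0, 0.1],    # fact A
    [1.0, 0.02, 0.12],  # near-duplicate of A -> should be absorbed
    [0.0, 1.0, 0.0],    # unrelated fact -> should survive
]

merge_threshold = 0.88
skip = set()
for i in range(len(embeddings)):
    if i in skip:
        continue
    for j in range(i + 1, len(embeddings)):
        if j not in skip and cosine(embeddings[i], embeddings[j]) >= merge_threshold:
            skip.add(j)

kept = [e for k, e in enumerate(embeddings) if k not in skip]
print(len(kept))  # 2: the near-duplicate was merged away
```

Note the loop is O(n²) in the number of memories; for large stores, an approximate nearest-neighbor index keeps consolidation tractable.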
FAQ
How do I choose the right similarity threshold?
Start with 0.7 for general retrieval and tune based on your data. Lower thresholds (0.5-0.6) cast a wider net but include more noise. Higher thresholds (0.8+) are more precise but may miss relevant matches. Test with real queries from your domain and adjust.
Are there alternatives to OpenAI embeddings?
Yes. Open-source models like sentence-transformers/all-MiniLM-L6-v2 run locally with no API costs. Cohere and Voyage AI also offer embedding APIs. The choice depends on your latency, cost, and accuracy requirements.
How do I handle memory that becomes outdated?
Attach a timestamp and optionally a TTL (time-to-live) to each memory. Periodically sweep for expired entries. For facts that change — like a user's address — use the duplicate detection logic to overwrite the old entry rather than creating a conflicting one.
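A minimal TTL sweep might look like the sketch below; the TimedMemory dataclass and its field names are illustrative assumptions, not part of the store defined above:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional

@dataclass
class TimedMemory:
    content: str
    created_at: datetime
    ttl: Optional[timedelta] = None  # None means the memory never expires

def sweep_expired(memories: List[TimedMemory], now: datetime) -> List[TimedMemory]:
    """Drop memories whose TTL has elapsed; keep permanent and fresh ones."""
    return [
        m for m in memories
        if m.ttl is None or m.created_at + m.ttl > now
    ]

now = datetime(2024, 6, 1, 12, 0)
memories = [
    TimedMemory("permanent fact", now - timedelta(days=30)),
    TimedMemory("stale promo code", now - timedelta(days=10), ttl=timedelta(days=7)),
    TimedMemory("fresh address", now - timedelta(hours=1), ttl=timedelta(days=90)),
]
alive = sweep_expired(memories, now)
print([m.content for m in alive])  # ['permanent fact', 'fresh address']
```

Running the sweep on a schedule (or lazily at recall time) keeps expired facts from ever reaching the agent's context.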
CallSphere Team
Expert insights on AI voice agents and customer communication automation.