
Keyword Extraction and Topic Modeling for Agent Knowledge Organization

Learn keyword extraction with TF-IDF and KeyBERT, topic modeling with BERTopic and LDA, and how to build agent knowledge organization systems that automatically categorize and cluster documents.

Why Agents Need Keyword and Topic Understanding

An AI agent managing a knowledge base of thousands of documents needs to organize, search, and retrieve information efficiently. Keyword extraction identifies the most representative terms in a document. Topic modeling discovers latent themes across a collection of documents. Together, they give agents the ability to automatically tag content, cluster related documents, and route queries to the most relevant knowledge source.

Keyword Extraction with TF-IDF

TF-IDF (Term Frequency-Inverse Document Frequency) remains one of the most reliable keyword extraction methods. It identifies terms that are frequent in a specific document but rare across the corpus — exactly the terms that distinguish one document from another.

from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

def extract_keywords_tfidf(
    documents: list[str],
    doc_index: int,
    top_n: int = 10,
) -> list[tuple[str, float]]:
    """Extract top keywords for a specific document using TF-IDF."""
    vectorizer = TfidfVectorizer(
        stop_words="english",
        ngram_range=(1, 2),
        max_features=10000,
    )
    tfidf_matrix = vectorizer.fit_transform(documents)
    feature_names = vectorizer.get_feature_names_out()

    doc_vector = tfidf_matrix[doc_index].toarray().flatten()
    top_indices = np.argsort(doc_vector)[-top_n:][::-1]

    return [
        (feature_names[i], round(doc_vector[i], 4))
        for i in top_indices
        if doc_vector[i] > 0
    ]

documents = [
    "Neural networks use backpropagation for gradient-based optimization.",
    "Kubernetes orchestrates container deployments across clusters.",
    "BERT embeddings capture contextual word representations.",
]

keywords = extract_keywords_tfidf(documents, doc_index=0, top_n=5)
# [('backpropagation', 0.4721), ('gradient', 0.3891), ...]
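To make the scoring concrete, here is the textbook TF-IDF formula computed by hand on a toy tokenized corpus (the token lists below are an invented simplification of the documents above). Note that scikit-learn's TfidfVectorizer uses a smoothed IDF, log((1 + N) / (1 + df)) + 1, and L2-normalizes each row, so its scores differ numerically from this plain form:

```python
import math

# Toy pre-tokenized corpus: 3 documents, scoring terms in document 0.
# Textbook formula: tfidf(t, d) = tf(t, d) * log(N / df(t))
docs = [
    ["neural", "networks", "backpropagation", "gradient", "optimization"],
    ["kubernetes", "container", "deployments", "clusters"],
    ["bert", "embeddings", "contextual", "word", "representations"],
]

def tfidf(term: str, doc: list[str], corpus: list[list[str]]) -> float:
    tf = doc.count(term) / len(doc)           # term frequency within the document
    df = sum(1 for d in corpus if term in d)  # number of documents containing the term
    idf = math.log(len(corpus) / df)          # rare across corpus -> higher weight
    return tf * idf

score = tfidf("backpropagation", docs[0], docs)
# "backpropagation" appears in 1 of 3 docs: tf = 1/5, idf = ln(3) ~ 1.0986
print(round(score, 4))  # 0.2197
```

A term appearing in every document gets idf = log(1) = 0, which is exactly why TF-IDF suppresses generic vocabulary.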

Keyword Extraction with KeyBERT

KeyBERT uses sentence embeddings to find keywords that are semantically closest to the overall document meaning. It produces more contextually relevant keywords than TF-IDF, especially for short texts.

from keybert import KeyBERT

kw_model = KeyBERT(model="all-MiniLM-L6-v2")

def extract_keywords_bert(
    text: str,
    top_n: int = 10,
    diversity: float = 0.5,
) -> list[tuple[str, float]]:
    """Extract keywords using semantic similarity."""
    keywords = kw_model.extract_keywords(
        text,
        keyphrase_ngram_range=(1, 2),
        stop_words="english",
        top_n=top_n,
        use_mmr=True,           # Maximal Marginal Relevance
        diversity=diversity,     # 0 = most similar, 1 = most diverse
    )
    return keywords

text = """Reinforcement learning agents learn optimal policies through
trial and error, maximizing cumulative reward in an environment.
Policy gradient methods like PPO and SAC are widely used for
continuous control tasks in robotics and game playing."""

keywords = extract_keywords_bert(text, top_n=5)
# [('reinforcement learning', 0.72), ('policy gradient', 0.65),
#  ('cumulative reward', 0.58), ('continuous control', 0.54),
#  ('optimal policies', 0.51)]

The diversity parameter controls the trade-off between relevance and variety. Set it higher when you want keywords that cover different aspects of the document rather than clustering around a single theme.
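To show what the diversity knob actually does, here is a minimal sketch of Maximal Marginal Relevance re-ranking over toy 2-D embeddings (the vectors and candidate phrases are invented for illustration; KeyBERT's internal implementation differs in detail):

```python
import numpy as np

def mmr(doc_emb, cand_embs, candidates, top_n=2, diversity=0.5):
    """Greedy MMR: trade off similarity to the document against
    similarity to already-selected candidates."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    relevance = [cos(doc_emb, c) for c in cand_embs]
    selected = [int(np.argmax(relevance))]  # seed with the most relevant candidate
    while len(selected) < top_n:
        best_idx, best_score = None, -np.inf
        for i in range(len(candidates)):
            if i in selected:
                continue
            # redundancy: closeness to anything already chosen
            redundancy = max(cos(cand_embs[i], cand_embs[j]) for j in selected)
            score = (1 - diversity) * relevance[i] - diversity * redundancy
            if score > best_score:
                best_idx, best_score = i, score
        selected.append(best_idx)
    return [candidates[i] for i in selected]

# Two near-duplicate candidates plus one distinct one
doc = np.array([1.0, 0.0])
cands = [np.array([0.9, 0.1]), np.array([0.88, 0.12]), np.array([0.3, 0.9])]
names = ["neural networks", "neural nets", "robotics"]

print(mmr(doc, cands, names, diversity=0.1))  # ['neural networks', 'neural nets']
print(mmr(doc, cands, names, diversity=0.9))  # ['neural networks', 'robotics']
```

Low diversity keeps the two near-duplicates because both score high on relevance; high diversity penalizes the duplicate and pulls in the distinct candidate instead.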

Topic Modeling with BERTopic

BERTopic is a widely used modern approach to topic modeling. It chains sentence embeddings, dimensionality reduction (UMAP), and density-based clustering (HDBSCAN) to discover topics automatically.

from bertopic import BERTopic

def discover_topics(
    documents: list[str],
    min_topic_size: int = 5,
) -> tuple[BERTopic, list[int]]:
    """Discover topics in a document collection."""
    topic_model = BERTopic(
        language="english",
        min_topic_size=min_topic_size,
        verbose=False,
    )
    topics, probabilities = topic_model.fit_transform(documents)
    return topic_model, topics

# Example with a collection of support tickets
tickets = [
    "Cannot log in after password reset",
    "Login page shows 500 error",
    "Password reset email never arrived",
    "Invoice amount is incorrect",
    "Charged twice for same subscription",
    "Need a refund for duplicate charge",
    "App crashes on Android 14",
    "Mobile app freezes when uploading photos",
    "App not compatible with my phone",
]

model, topics = discover_topics(tickets, min_topic_size=2)

# View discovered topics
topic_info = model.get_topic_info()
print(topic_info[["Topic", "Count", "Name"]].head())

# Get the topic for a new document
new_topic, _ = model.transform(["My login credentials are not working"])

Classical Topic Modeling with LDA

Latent Dirichlet Allocation (LDA) is a probabilistic model that represents each document as a mixture of topics, where each topic is a distribution over words. It is lighter weight than BERTopic and works well when you need interpretable, fixed-size topic distributions.

from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

def lda_topics(
    documents: list[str],
    n_topics: int = 5,
    top_words: int = 8,
) -> list[dict]:
    """Discover topics using LDA."""
    vectorizer = CountVectorizer(
        stop_words="english",
        max_features=5000,
    )
    doc_term_matrix = vectorizer.fit_transform(documents)
    feature_names = vectorizer.get_feature_names_out()

    lda = LatentDirichletAllocation(
        n_components=n_topics,
        random_state=42,
        max_iter=20,
    )
    lda.fit(doc_term_matrix)

    topics = []
    for idx, topic_dist in enumerate(lda.components_):
        top_indices = topic_dist.argsort()[-top_words:][::-1]
        words = [feature_names[i] for i in top_indices]
        topics.append({"topic_id": idx, "keywords": words})

    return topics
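The defining property of LDA — each document as a mixture of topics — does not appear in the function above, which only returns topic keywords. A short sketch of recovering per-document topic distributions with transform(); the tiny corpus is invented, so the fitted distributions are illustrative only:

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "login password reset account locked",
    "password email reset link expired",
    "invoice billing charge refund",
    "billing refund duplicate charge",
]

vectorizer = CountVectorizer()
dtm = vectorizer.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=42, max_iter=50)
lda.fit(dtm)

# transform() returns one row per document: a probability
# distribution over the n_components topics (each row sums to 1)
doc_topics = lda.transform(dtm)
for text, dist in zip(docs, doc_topics):
    print(f"{text[:30]:30s} -> {[round(p, 2) for p in dist]}")
```

These per-document distributions are what make LDA useful for soft routing: a document can belong 70/30 to two topics rather than being forced into a single cluster.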

Building an Agent Knowledge Organizer

Here is a complete system that an agent can use to automatically organize and retrieve documents by topic.

from dataclasses import dataclass
from keybert import KeyBERT
from bertopic import BERTopic

@dataclass
class TaggedDocument:
    text: str
    keywords: list[str]
    topic_id: int
    topic_label: str

class KnowledgeOrganizer:
    def __init__(self):
        self.keyword_model = KeyBERT(model="all-MiniLM-L6-v2")
        self.topic_model = None
        self.documents: list[TaggedDocument] = []

    def index_documents(self, texts: list[str]) -> list[TaggedDocument]:
        self.topic_model = BERTopic(min_topic_size=3, verbose=False)
        topics, _ = self.topic_model.fit_transform(texts)

        tagged = []
        for text, topic_id in zip(texts, topics):
            keywords = self.keyword_model.extract_keywords(
                text, top_n=5, stop_words="english"
            )
            topic_words = self.topic_model.get_topic(topic_id)
            topic_label = topic_words[0][0] if topic_words and topic_id != -1 else "misc"

            doc = TaggedDocument(
                text=text,
                keywords=[kw for kw, _ in keywords],
                topic_id=topic_id,
                topic_label=topic_label,
            )
            tagged.append(doc)

        self.documents = tagged
        return tagged

    def find_related(self, query: str, top_n: int = 5) -> list[TaggedDocument]:
        if self.topic_model is None:
            raise RuntimeError("Call index_documents() before find_related()")
        topics, _ = self.topic_model.transform([query])
        return [
            doc for doc in self.documents
            if doc.topic_id == topics[0]
        ][:top_n]

FAQ

What is the difference between keyword extraction and topic modeling?

Keyword extraction operates on individual documents, identifying the most important terms within a single text. Topic modeling operates on a collection of documents, discovering shared themes that span multiple documents. Keywords describe what a single document is about. Topics describe what groups of documents have in common.

How do I choose between BERTopic and LDA for my agent?

Use BERTopic when you need high-quality, semantically coherent topics, especially for short texts; a GPU speeds up the embedding step but is not required. Use LDA when you need lightweight, interpretable, fixed-size topic distributions or deterministic results for reproducibility, and your documents are long enough for word co-occurrence statistics to be meaningful. BERTopic generally produces better topics but needs more compute and memory.

How do I handle new documents that arrive after the initial topic model is trained?

BERTopic assigns topics to new documents through its transform() method — pass new documents to get their topic assignments without retraining. To fold new data in, train a separate model on the new documents and combine it with the existing one using BERTopic's merge_models(). Schedule a full retrain when topic drift becomes noticeable.


#KeywordExtraction #TopicModeling #BERTopic #KeyBERT #NLP #AIAgents #AgenticAI #LearnAI #AIEngineering

CallSphere Team

Expert insights on AI voice agents and customer communication automation.
