Keyword Extraction and Topic Modeling for Agent Knowledge Organization
Learn keyword extraction with TF-IDF and KeyBERT, topic modeling with BERTopic and LDA, and how to build agent knowledge organization systems that automatically categorize and cluster documents.
Why Agents Need Keyword and Topic Understanding
An AI agent managing a knowledge base of thousands of documents needs to organize, search, and retrieve information efficiently. Keyword extraction identifies the most representative terms in a document. Topic modeling discovers latent themes across a collection of documents. Together, they give agents the ability to automatically tag content, cluster related documents, and route queries to the most relevant knowledge source.
Keyword Extraction with TF-IDF
TF-IDF (Term Frequency-Inverse Document Frequency) remains one of the most reliable keyword extraction methods. It identifies terms that are frequent in a specific document but rare across the corpus — exactly the terms that distinguish one document from another.
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

def extract_keywords_tfidf(
    documents: list[str],
    doc_index: int,
    top_n: int = 10,
) -> list[tuple[str, float]]:
    """Extract top keywords for a specific document using TF-IDF."""
    vectorizer = TfidfVectorizer(
        stop_words="english",
        ngram_range=(1, 2),
        max_features=10000,
    )
    tfidf_matrix = vectorizer.fit_transform(documents)
    feature_names = vectorizer.get_feature_names_out()
    doc_vector = tfidf_matrix[doc_index].toarray().flatten()
    top_indices = np.argsort(doc_vector)[-top_n:][::-1]
    return [
        (feature_names[i], round(doc_vector[i], 4))
        for i in top_indices
        if doc_vector[i] > 0
    ]

documents = [
    "Neural networks use backpropagation for gradient-based optimization.",
    "Kubernetes orchestrates container deployments across clusters.",
    "BERT embeddings capture contextual word representations.",
]
keywords = extract_keywords_tfidf(documents, doc_index=0, top_n=5)
# [('backpropagation', 0.4721), ('gradient', 0.3891), ...]
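To make the scoring concrete, here is a from-scratch sketch of the TF-IDF computation using raw term frequency and the smoothed IDF that scikit-learn applies by default, idf(t) = ln((1 + n) / (1 + df)) + 1. It is illustrative only: scikit-learn additionally L2-normalizes each document vector, so absolute values will differ.

```python
import math

def tfidf_scores(documents: list[list[str]], doc_index: int) -> dict[str, float]:
    """Toy TF-IDF over pre-tokenized documents, mirroring scikit-learn's
    smoothed IDF: idf(t) = ln((1 + n_docs) / (1 + df(t))) + 1."""
    n_docs = len(documents)
    doc = documents[doc_index]
    scores = {}
    for term in set(doc):
        tf = doc.count(term)                    # raw term frequency
        df = sum(term in d for d in documents)  # document frequency
        idf = math.log((1 + n_docs) / (1 + df)) + 1
        scores[term] = tf * idf
    return scores

docs = [
    ["neural", "networks", "backpropagation"],
    ["kubernetes", "container", "clusters"],
    ["neural", "embeddings", "word"],
]
scores = tfidf_scores(docs, doc_index=0)
# "backpropagation" (unique to one document) outranks "neural"
# (shared with another document)
```

The key intuition survives the simplification: terms shared across documents are discounted, terms unique to one document are boosted.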
Keyword Extraction with KeyBERT
KeyBERT uses sentence embeddings to find keywords that are semantically closest to the overall document meaning. It produces more contextually relevant keywords than TF-IDF, especially for short texts.
from keybert import KeyBERT

kw_model = KeyBERT(model="all-MiniLM-L6-v2")

def extract_keywords_bert(
    text: str,
    top_n: int = 10,
    diversity: float = 0.5,
) -> list[tuple[str, float]]:
    """Extract keywords using semantic similarity."""
    keywords = kw_model.extract_keywords(
        text,
        keyphrase_ngram_range=(1, 2),
        stop_words="english",
        top_n=top_n,
        use_mmr=True,  # Maximal Marginal Relevance
        diversity=diversity,  # 0 = most similar, 1 = most diverse
    )
    return keywords

text = """Reinforcement learning agents learn optimal policies through
trial and error, maximizing cumulative reward in an environment.
Policy gradient methods like PPO and SAC are widely used for
continuous control tasks in robotics and game playing."""

keywords = extract_keywords_bert(text, top_n=5)
# [('reinforcement learning', 0.72), ('policy gradient', 0.65),
#  ('cumulative reward', 0.58), ('continuous control', 0.54),
#  ('optimal policies', 0.51)]
The diversity parameter controls the trade-off between relevance and variety. Set it higher when you want keywords that cover different aspects of the document rather than clustering around a single theme.
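Under the hood, use_mmr runs a Maximal Marginal Relevance selection loop: pick the candidate most similar to the document, then repeatedly pick the candidate with the best balance of relevance minus similarity to what is already selected. Here is a minimal numpy sketch of that loop (function and variable names are illustrative, not KeyBERT's internals):

```python
import numpy as np

def mmr_select(doc_vec, cand_vecs, candidates, top_n=5, diversity=0.5):
    """Toy Maximal Marginal Relevance: choose the most relevant candidate
    first, then penalize candidates similar to ones already chosen."""
    def cos(a, b):
        return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

    relevance = [cos(v, doc_vec) for v in cand_vecs]
    selected = [int(np.argmax(relevance))]  # seed with the most relevant
    while len(selected) < min(top_n, len(candidates)):
        best_i, best_score = None, float("-inf")
        for i in range(len(candidates)):
            if i in selected:
                continue
            # Redundancy: closest similarity to anything already picked
            redundancy = max(cos(cand_vecs[i], cand_vecs[j]) for j in selected)
            score = (1 - diversity) * relevance[i] - diversity * redundancy
            if score > best_score:
                best_i, best_score = i, score
        selected.append(best_i)
    return [candidates[i] for i in selected]

doc = np.array([1.0, 0.0])
cands = [np.array([1.0, 0.0]), np.array([0.99, 0.1]), np.array([0.0, 1.0])]
names = ["policy gradient", "policy gradients", "robotics"]
mmr_select(doc, cands, names, top_n=2, diversity=0.9)
# -> ['policy gradient', 'robotics']  (high diversity skips the near-duplicate)
```

With diversity near 0 the same call would pick the near-duplicate "policy gradients" second, because raw relevance dominates the score.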
Topic Modeling with BERTopic
BERTopic has become the de facto modern approach to topic modeling. It combines sentence embeddings, dimensionality reduction (UMAP), and density-based clustering (HDBSCAN) to discover topics automatically, without requiring you to fix the number of topics in advance.
from bertopic import BERTopic

def discover_topics(
    documents: list[str],
    min_topic_size: int = 5,
) -> tuple[BERTopic, list[int]]:
    """Discover topics in a document collection."""
    topic_model = BERTopic(
        language="english",
        min_topic_size=min_topic_size,
        verbose=False,
    )
    topics, probabilities = topic_model.fit_transform(documents)
    return topic_model, topics

# Example with a collection of support tickets
tickets = [
    "Cannot log in after password reset",
    "Login page shows 500 error",
    "Password reset email never arrived",
    "Invoice amount is incorrect",
    "Charged twice for same subscription",
    "Need a refund for duplicate charge",
    "App crashes on Android 14",
    "Mobile app freezes when uploading photos",
    "App not compatible with my phone",
]
model, topics = discover_topics(tickets, min_topic_size=2)

# View discovered topics
topic_info = model.get_topic_info()
print(topic_info[["Topic", "Count", "Name"]].head())

# Get the topic for a new document
new_topic, _ = model.transform(["My login credentials are not working"])
Classical Topic Modeling with LDA
Latent Dirichlet Allocation (LDA) is a probabilistic model that represents each document as a mixture of topics, where each topic is a distribution over words. It is lighter weight than BERTopic and works well when you need interpretable, fixed-size topic distributions.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

def lda_topics(
    documents: list[str],
    n_topics: int = 5,
    top_words: int = 8,
) -> list[dict]:
    """Discover topics using LDA."""
    vectorizer = CountVectorizer(
        stop_words="english",
        max_features=5000,
    )
    doc_term_matrix = vectorizer.fit_transform(documents)
    feature_names = vectorizer.get_feature_names_out()
    lda = LatentDirichletAllocation(
        n_components=n_topics,
        random_state=42,
        max_iter=20,
    )
    lda.fit(doc_term_matrix)
    topics = []
    for idx, topic_dist in enumerate(lda.components_):
        top_indices = topic_dist.argsort()[-top_words:][::-1]
        words = [feature_names[i] for i in top_indices]
        topics.append({"topic_id": idx, "keywords": words})
    return topics
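LDA's other key output is the per-document topic distribution, the "mixture of topics" described above: lda.transform (or fit_transform) returns one row per document, and each row sums to 1. A self-contained sketch (the tiny corpus and variable names are illustrative, and with so little data the topics themselves will be noisy):

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "login password reset error",
    "password login credentials error",
    "invoice refund charge billing",
    "refund charge duplicate billing",
]
vec = CountVectorizer()
dtm = vec.fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=42, max_iter=50)

# Shape (n_docs, n_topics); each row is that document's topic mixture
doc_topics = lda.fit_transform(dtm)
```

An agent can use these fixed-size rows directly as features, for example routing a ticket to the team that owns its argmax topic.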
Building an Agent Knowledge Organizer
Here is a complete system that an agent can use to automatically organize and retrieve documents by topic.
from dataclasses import dataclass
from keybert import KeyBERT
from bertopic import BERTopic

@dataclass
class TaggedDocument:
    text: str
    keywords: list[str]
    topic_id: int
    topic_label: str

class KnowledgeOrganizer:
    def __init__(self):
        self.keyword_model = KeyBERT(model="all-MiniLM-L6-v2")
        self.topic_model = None
        self.documents: list[TaggedDocument] = []

    def index_documents(self, texts: list[str]) -> list[TaggedDocument]:
        self.topic_model = BERTopic(min_topic_size=3, verbose=False)
        topics, _ = self.topic_model.fit_transform(texts)
        tagged = []
        for text, topic_id in zip(texts, topics):
            keywords = self.keyword_model.extract_keywords(
                text, top_n=5, stop_words="english"
            )
            label = self.topic_model.get_topic(topic_id)
            topic_label = label[0][0] if label and topic_id != -1 else "misc"
            doc = TaggedDocument(
                text=text,
                keywords=[kw for kw, _ in keywords],
                topic_id=topic_id,
                topic_label=topic_label,
            )
            tagged.append(doc)
        self.documents = tagged
        return tagged

    def find_related(self, query: str, top_n: int = 5) -> list[TaggedDocument]:
        topic, _ = self.topic_model.transform([query])
        return [
            doc for doc in self.documents
            if doc.topic_id == topic[0]
        ][:top_n]
FAQ
What is the difference between keyword extraction and topic modeling?
Keyword extraction operates on individual documents, identifying the most important terms within a single text. Topic modeling operates on a collection of documents, discovering shared themes that span multiple documents. Keywords describe what a single document is about. Topics describe what groups of documents have in common.
How do I choose between BERTopic and LDA for my agent?
Use BERTopic when you need high-quality, semantically coherent topics and can afford the heavier compute (a GPU speeds up the embedding step but is not strictly required). Use LDA when you need lightweight, interpretable, fixed-size topic distributions, or when you want deterministic results for reproducibility with a fixed random seed. BERTopic generally produces better topics but requires more compute and memory.
How do I handle new documents that arrive after the initial topic model is trained?
BERTopic supports incremental topic assignment through its transform() method — pass new documents to get their topic assignments without retraining. For periodic retraining, use BERTopic's merge_models() to combine an existing model with a model trained on new data. Schedule full retraining when topic drift becomes noticeable.
CallSphere Team
Expert insights on AI voice agents and customer communication automation.