
Named Entity Recognition for AI Agents: Extracting People, Places, and Organizations

Learn how to implement Named Entity Recognition in AI agent pipelines using spaCy and LLMs, covering entity types, custom entity training, and real-time extraction strategies.

Why Agents Need Named Entity Recognition

When an AI agent receives a message like "Schedule a meeting with Sarah Chen at the Austin office next Tuesday," it must extract three distinct pieces of structured information: a person (Sarah Chen), a location (Austin office), and a date (next Tuesday). Named Entity Recognition (NER) is the NLP technique that performs this extraction automatically.

Without NER, an agent would have to rely entirely on the LLM to parse every incoming message from scratch. While LLMs are capable of entity extraction, dedicated NER pipelines are faster, cheaper, and more predictable for high-volume workloads. The best agent architectures combine both approaches — fast NER for common entities and LLM fallback for ambiguous cases.

Entity Types Every Agent Developer Should Know

The standard NER taxonomy covers these categories:

  • PERSON — individual names (Sarah Chen, Dr. Patel)
  • ORG — companies, agencies, institutions (Acme Corp, FDA)
  • GPE — geopolitical entities like countries, cities, states (Austin, France)
  • DATE — absolute or relative dates (March 15, next Tuesday)
  • MONEY — monetary values ($500, 12.5 million euros)
  • PRODUCT — named products (iPhone 16, Tesla Model Y)

Most pre-trained NER models handle these out of the box. Custom entities — like internal project names, medical codes, or proprietary terms — require additional training.
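
In an agent, these labels typically feed a downstream tool call. A minimal sketch of that mapping (the scheduling payload and its field names are hypothetical, not part of any library):

```python
def to_tool_args(entities: dict[str, list[str]]) -> dict:
    """Map standard NER labels onto a hypothetical scheduling tool's arguments."""
    return {
        # Take the first extracted value of each type; fall back to None.
        "attendee": (entities.get("PERSON") or [None])[0],
        "location": (entities.get("GPE") or entities.get("FAC") or [None])[0],
        "when": (entities.get("DATE") or [None])[0],
    }

extracted = {"PERSON": ["Sarah Chen"], "GPE": ["Austin"], "DATE": ["next Tuesday"]}
print(to_tool_args(extracted))
# {'attendee': 'Sarah Chen', 'location': 'Austin', 'when': 'next Tuesday'}
```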

NER with spaCy: The Fast Path

spaCy provides production-grade NER that runs in milliseconds per document. Here is a complete extraction pipeline.

import spacy

nlp = spacy.load("en_core_web_trf")  # Transformer-based model; requires the spacy-transformers package

def extract_entities(text: str) -> dict[str, list[str]]:
    """Extract named entities grouped by type."""
    doc = nlp(text)
    entities: dict[str, list[str]] = {}

    for ent in doc.ents:
        if ent.label_ not in entities:
            entities[ent.label_] = []
        if ent.text not in entities[ent.label_]:
            entities[ent.label_].append(ent.text)

    return entities

message = "Tell John at Microsoft to review the Q3 report by March 20th."
result = extract_entities(message)
# {'PERSON': ['John'], 'ORG': ['Microsoft'], 'DATE': ['Q3', 'March 20th']}

Training Custom Entities

When your agent operates in a specialized domain, you need custom entity types. Here is how to train spaCy to recognize a custom PRODUCT_CODE entity.

import spacy
from spacy.training import Example
import random

nlp = spacy.blank("en")
ner = nlp.add_pipe("ner")
ner.add_label("PRODUCT_CODE")

train_data = [
    ("Order SKU-4829 immediately", {"entities": [(6, 14, "PRODUCT_CODE")]}),
    ("Check stock for SKU-1100", {"entities": [(16, 24, "PRODUCT_CODE")]}),
    ("We need SKU-7753 and SKU-2201", {
        "entities": [(8, 16, "PRODUCT_CODE"), (21, 29, "PRODUCT_CODE")]
    }),
]

optimizer = nlp.initialize()  # begin_training() is deprecated in spaCy v3
for epoch in range(30):
    random.shuffle(train_data)
    for text, annotations in train_data:
        example = Example.from_dict(nlp.make_doc(text), annotations)
        nlp.update([example], sgd=optimizer)

doc = nlp("Ship SKU-9981 to warehouse B")
for ent in doc.ents:
    print(f"{ent.text} -> {ent.label_}")
# SKU-9981 -> PRODUCT_CODE
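
Hand-counting character offsets like those above is error-prone, and a single off-by-one error silently corrupts training. A small helper (a sketch, not part of spaCy) can derive the annotation dict with `str.find`:

```python
def spans_for(text: str, mentions: list[str], label: str) -> dict:
    """Build a spaCy-style annotation dict by locating each mention in order."""
    entities = []
    search_from = 0
    for mention in mentions:
        idx = text.find(mention, search_from)
        if idx == -1:
            raise ValueError(f"{mention!r} not found in {text!r}")
        entities.append((idx, idx + len(mention), label))
        search_from = idx + len(mention)  # avoid re-matching the same span
    return {"entities": entities}

print(spans_for("We need SKU-7753 and SKU-2201", ["SKU-7753", "SKU-2201"], "PRODUCT_CODE"))
# {'entities': [(8, 16, 'PRODUCT_CODE'), (21, 29, 'PRODUCT_CODE')]}
```

Generating offsets this way keeps the training data consistent even when example sentences are edited later.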

LLM-Based NER for Complex Cases

For ambiguous text or zero-shot entity types, LLMs provide flexible extraction without training data.

import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def llm_extract_entities(text: str, entity_types: list[str]) -> dict:
    """Use an LLM for zero-shot entity extraction."""
    prompt = f"""Extract the following entity types from the text.
Return JSON only. Entity types: {', '.join(entity_types)}

Text: {text}"""

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
        temperature=0,
    )
    # Parse the JSON string so the function actually returns a dict.
    return json.loads(response.choices[0].message.content)

result = llm_extract_entities(
    "Dr. Amara Osei from Nairobi General prescribed amoxicillin 500mg",
    ["PERSON", "FACILITY", "MEDICATION", "DOSAGE"]
)
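
Whatever model you call, treat its output as untrusted until validated. One way to sanity-check the returned JSON before the agent acts on it (a sketch; the key names simply mirror the example above):

```python
import json

def validate_extraction(raw: str, allowed_types: list[str]) -> dict[str, list[str]]:
    """Parse LLM output, drop unrequested keys, and normalize values to lists."""
    data = json.loads(raw)
    cleaned: dict[str, list[str]] = {}
    for key, value in data.items():
        if key not in allowed_types:
            continue  # the model invented a key the prompt did not ask for
        cleaned[key] = value if isinstance(value, list) else [value]
    return cleaned

raw = '{"PERSON": "Dr. Amara Osei", "FACILITY": ["Nairobi General"], "NOTES": "n/a"}'
print(validate_extraction(raw, ["PERSON", "FACILITY", "MEDICATION", "DOSAGE"]))
# {'PERSON': ['Dr. Amara Osei'], 'FACILITY': ['Nairobi General']}
```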

Integrating NER into an Agent Pipeline

The most effective pattern is a two-tier approach: fast spaCy extraction first, with LLM fallback for unrecognized patterns.

import spacy

class NERProcessor:
    def __init__(self, required_types: set[str] | None = None):
        self.nlp = spacy.load("en_core_web_sm")
        # Entity types the agent expects to find in every message.
        self.required_types = required_types or {"PERSON", "ORG", "DATE"}

    def process(self, text: str) -> dict:
        doc = self.nlp(text)
        entities: dict[str, list[str]] = {}

        for ent in doc.ents:
            entities.setdefault(ent.label_, []).append(ent.text)

        # spaCy's default pipeline does not expose per-entity confidence
        # scores, so fall back to the LLM when an expected type is missing.
        missing = sorted(self.required_types - entities.keys())
        return {
            "entities": entities,
            "needs_llm_review": missing,  # entity types for the LLM tier
        }
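
Wiring the two tiers together might look like the following sketch, assuming `needs_llm_review` carries whatever the LLM tier should re-extract; the `fake_llm` stub stands in for a real call such as `llm_extract_entities`:

```python
def run_pipeline(text: str, spacy_result: dict, llm_extract) -> dict:
    """Merge tier-one entities with LLM output for the flagged items."""
    if not spacy_result["needs_llm_review"]:
        return spacy_result["entities"]  # fast path: no LLM call at all
    extra = llm_extract(text, spacy_result["needs_llm_review"])
    merged = {label: list(vals) for label, vals in spacy_result["entities"].items()}
    for label, values in extra.items():
        bucket = merged.setdefault(label, [])
        bucket.extend(v for v in values if v not in bucket)  # dedupe on merge
    return merged

# Stub in place of a real LLM call, for illustration only.
def fake_llm(text, types):
    return {"MEDICATION": ["amoxicillin"]}

result = run_pipeline(
    "prescribed amoxicillin",
    {"entities": {"PERSON": ["Dr. Osei"]}, "needs_llm_review": ["MEDICATION"]},
    fake_llm,
)
print(result)
# {'PERSON': ['Dr. Osei'], 'MEDICATION': ['amoxicillin']}
```

The fast path returns without any network call, which is what keeps the common case cheap.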

FAQ

When should I use spaCy NER versus LLM-based extraction?

Use spaCy for high-throughput scenarios where you need consistent, fast extraction of standard entity types. Use LLM-based extraction when you need zero-shot recognition of novel entity types, when the text is highly ambiguous, or when you cannot invest in training data for a custom model.
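
That decision can be encoded as a one-line router (a sketch; `STANDARD_TYPES` is just the taxonomy listed earlier in this article, not a spaCy constant):

```python
STANDARD_TYPES = {"PERSON", "ORG", "GPE", "DATE", "MONEY", "PRODUCT"}

def choose_extractor(needed_types: set[str]) -> str:
    """Pick the fast path unless a requested type falls outside the standard taxonomy."""
    return "spacy" if needed_types <= STANDARD_TYPES else "llm"

print(choose_extractor({"PERSON", "ORG"}))         # spacy
print(choose_extractor({"PERSON", "MEDICATION"}))  # llm
```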

How do I handle entities that span multiple tokens or contain special characters?

spaCy handles multi-token entities natively through its span-based architecture. During training, define entity boundaries using character offsets that encompass the full span. For special characters like hyphens or periods in entity names, ensure your tokenizer does not split them incorrectly by adding custom tokenization rules.

Can I combine multiple NER models in a single agent pipeline?

Yes. A common pattern is to run a general-purpose model for standard entities (PERSON, ORG, GPE) and a domain-specific model for specialized entities (MEDICATION, LEGAL_CLAUSE). Merge the results and use a deduplication step to handle overlapping spans, keeping the prediction with higher confidence.
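
The overlap-resolution step can be sketched in a few lines, assuming each model emits spans with character offsets and a confidence score (the span dict shape here is an illustrative assumption, not a spaCy structure):

```python
def dedupe_spans(spans: list[dict]) -> list[dict]:
    """Resolve overlapping predictions by keeping the higher-confidence span."""
    kept: list[dict] = []
    # Visit spans from most to least confident; accept only non-overlapping ones.
    for span in sorted(spans, key=lambda s: -s["confidence"]):
        if all(span["end"] <= k["start"] or span["start"] >= k["end"] for k in kept):
            kept.append(span)
    return sorted(kept, key=lambda s: s["start"])

spans = [
    {"start": 0, "end": 11, "label": "MEDICATION", "confidence": 0.95},
    {"start": 0, "end": 5, "label": "PRODUCT", "confidence": 0.60},   # overlaps above
    {"start": 20, "end": 27, "label": "DOSAGE", "confidence": 0.90},
]
print([s["label"] for s in dedupe_spans(spans)])
# ['MEDICATION', 'DOSAGE']
```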


#NER #NLP #SpaCy #EntityExtraction #Python #AIAgents #AgenticAI #LearnAI #AIEngineering

CallSphere Team
