Extracting Entities from Documents: Names, Dates, Addresses, and Custom Types
Build production-grade entity extraction with LLMs. Learn schema design for names, dates, addresses, and custom entity types, plus batch extraction techniques and accuracy optimization strategies.
Entity Extraction with LLMs vs. Traditional NER
Traditional Named Entity Recognition (NER) models like spaCy's en_core_web_lg are fast and work well for standard entity types: person names, organizations, locations. But they struggle with domain-specific entities (medical codes, legal citations, product SKUs) and they cannot extract structured attributes for each entity.
LLM-based extraction handles arbitrary entity types, extracts attributes, and understands context that statistical models miss. The tradeoff is cost and latency: an LLM call takes 500ms-2s versus 5ms for spaCy. For most business applications, the accuracy gain justifies the cost.
Designing Entity Schemas
Define a dedicated Pydantic model for each entity type you need:
```python
from pydantic import BaseModel, Field
from typing import List, Optional, Literal


class PersonEntity(BaseModel):
    full_name: str
    first_name: Optional[str] = None
    last_name: Optional[str] = None
    title: Optional[str] = Field(default=None, description="Mr, Mrs, Dr, etc.")
    role: Optional[str] = Field(default=None, description="Job title or role")
    organization: Optional[str] = None


class DateEntity(BaseModel):
    raw_text: str = Field(description="Original date text from document")
    normalized: Optional[str] = Field(
        default=None,
        description="ISO format YYYY-MM-DD when possible"
    )
    date_type: Literal["exact", "relative", "range", "approximate"]


class AddressEntity(BaseModel):
    full_address: str
    street: Optional[str] = None
    city: Optional[str] = None
    state: Optional[str] = None
    postal_code: Optional[str] = None
    country: Optional[str] = Field(default="US")


class MoneyEntity(BaseModel):
    amount: float
    currency: str = Field(default="USD")
    raw_text: str = Field(description="Original text, e.g., '$1.2 million'")
```
Keeping the raw_text field alongside normalized values is essential for auditing. When a downstream process questions an extracted value, you can trace it back to the exact source text.
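For instance, a small normalization helper (a hypothetical utility, not part of the models above) can turn shorthand like "$2.5M" into a numeric amount while preserving the source string for the audit trail:

```python
import re

# Multipliers for common money shorthand suffixes.
_MULTIPLIERS = {"k": 1_000, "m": 1_000_000, "b": 1_000_000_000}


def normalize_money(raw_text: str) -> dict:
    """Parse shorthand like '$2.5M' into a numeric amount,
    keeping the original text for auditing."""
    match = re.search(
        r"\$?\s*([\d,.]+)\s*(k|m|b|thousand|million|billion)?",
        raw_text, re.IGNORECASE,
    )
    if not match:
        raise ValueError(f"Unrecognized money format: {raw_text!r}")
    amount = float(match.group(1).replace(",", ""))
    # Map 'M', 'million', etc. to a multiplier via the first letter.
    suffix = (match.group(2) or "").lower()[:1]
    amount *= _MULTIPLIERS.get(suffix, 1)
    return {"amount": amount, "currency": "USD", "raw_text": raw_text}
```

If a reviewer questions the `2500000.0` in your database, the stored `raw_text` shows it came from "$2.5M" in the source document.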
The Extraction Prompt
A well-structured prompt dramatically improves extraction quality:
```python
from openai import OpenAI
import instructor

client = instructor.from_openai(OpenAI())


class DocumentEntities(BaseModel):
    people: List[PersonEntity]
    dates: List[DateEntity]
    addresses: List[AddressEntity]
    monetary_values: List[MoneyEntity]


def extract_entities(text: str) -> DocumentEntities:
    return client.chat.completions.create(
        model="gpt-4o",
        response_model=DocumentEntities,
        max_retries=2,
        messages=[
            {
                "role": "system",
                "content": (
                    "You are a precise document entity extractor. "
                    "Extract ALL entities of each type from the text. "
                    "If an entity attribute is not explicitly stated, use null. "
                    "Never infer or guess values not present in the text."
                )
            },
            {"role": "user", "content": text}
        ],
    )
```
The instruction "never infer or guess" is critical. Without it, the model tends to hallucinate plausible-sounding addresses or fill in missing first/last name splits incorrectly.
Custom Entity Types
Define domain-specific entities for your use case. Here is an example for legal document extraction:
```python
class LegalCitation(BaseModel):
    case_name: str
    citation: str = Field(description="e.g., '123 F.3d 456'")
    court: Optional[str] = None
    year: Optional[int] = None


class ContractClause(BaseModel):
    clause_type: Literal[
        "termination", "liability", "indemnification",
        "confidentiality", "payment_terms", "warranty", "other"
    ]
    summary: str
    parties_involved: List[str]
    key_conditions: List[str]
```
The Literal type constrains the model to a fixed set of values, which prevents it from inventing clause types that your downstream system cannot handle.
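As a standalone sketch (re-declaring a trimmed ContractClause so it runs on its own), you can watch Pydantic reject an out-of-set value; with instructor, that same ValidationError is what feeds the error message back to the model on retry:

```python
from typing import List, Literal
from pydantic import BaseModel, ValidationError


class ContractClause(BaseModel):
    # Trimmed re-declaration for a self-contained demo.
    clause_type: Literal["termination", "liability", "other"]
    summary: str
    parties_involved: List[str]
    key_conditions: List[str]


def is_valid_clause_type(value: str) -> bool:
    """Return False when Pydantic rejects a clause_type
    outside the Literal set."""
    try:
        ContractClause(clause_type=value, summary="",
                       parties_involved=[], key_conditions=[])
        return True
    except ValidationError:
        return False
```

A value like "force_majeure" fails validation, so it can never reach your downstream system unchecked.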
Batch Extraction for Multiple Documents
When processing many documents, use async calls for throughput:
```python
import asyncio
from openai import AsyncOpenAI

async_client = instructor.from_openai(AsyncOpenAI())


async def extract_batch(documents: List[str]) -> List[DocumentEntities]:
    tasks = [
        async_client.chat.completions.create(
            model="gpt-4o",
            response_model=DocumentEntities,
            max_retries=2,
            messages=[
                {"role": "system", "content": "Extract all entities from the text."},
                {"role": "user", "content": doc}
            ],
        )
        for doc in documents
    ]
    # Process in batches of 10 to respect rate limits
    results = []
    for i in range(0, len(tasks), 10):
        batch = tasks[i:i + 10]
        results.extend(await asyncio.gather(*batch))
    return results
```
Improving Accuracy with Few-Shot Examples
Include examples in your prompt to calibrate the model:
```python
FEW_SHOT_EXAMPLE = """
Text: "Dr. Sarah Chen, Chief Medical Officer at Valley Health (123 Oak St,
Portland, OR 97201), approved a $2.5M equipment purchase on March 15, 2025."

Expected extraction:
- Person: Dr. Sarah Chen, role=Chief Medical Officer, org=Valley Health
- Address: 123 Oak St, Portland, OR 97201
- Money: $2,500,000 USD (raw: "$2.5M")
- Date: 2025-03-15, type=exact (raw: "March 15, 2025")
"""


def extract_with_examples(text: str) -> DocumentEntities:
    return client.chat.completions.create(
        model="gpt-4o",
        response_model=DocumentEntities,
        messages=[
            {
                "role": "system",
                "content": f"Extract entities precisely. Example:\n{FEW_SHOT_EXAMPLE}"
            },
            {"role": "user", "content": text}
        ],
    )
```
Few-shot examples improve extraction accuracy by 10-20% on complex documents, especially for ambiguous cases like distinguishing between a person's location and a company's headquarters.
FAQ
How do I handle entities that span sentence boundaries?
Use overlapping chunking when splitting documents, with at least 1-2 sentences of overlap. After extraction, deduplicate entities by comparing normalized names. If an entity appears in the overlap region of two chunks, you will get it from both and can merge the attributes.
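A minimal sketch of that chunk-and-merge flow, using hypothetical helpers that operate on sentence lists and plain entity dicts:

```python
def chunk_with_overlap(sentences, chunk_size=8, overlap=2):
    """Split a sentence list into chunks sharing `overlap` sentences,
    so a boundary-spanning entity appears whole in at least one chunk."""
    step = chunk_size - overlap
    return [sentences[i:i + chunk_size] for i in range(0, len(sentences), step)]


def dedupe_people(people):
    """Merge duplicate extractions by normalized full name, filling
    null attributes from whichever copy has a value."""
    merged = {}
    for p in people:
        key = p["full_name"].strip().lower()
        if key in merged:
            for field, value in p.items():
                if merged[key].get(field) is None and value is not None:
                    merged[key][field] = value
        else:
            merged[key] = dict(p)
    return list(merged.values())
```

Merging by filling nulls means a chunk that saw "Sarah Chen" with no role and a chunk that saw her role both contribute to one complete record.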
When should I use spaCy instead of an LLM for entity extraction?
Use spaCy when you need sub-10ms latency, are extracting only standard entity types (person, org, location), and are processing millions of documents where LLM costs would be prohibitive. Use LLMs when you need custom entity types, attribute extraction, or when context-dependent interpretation is important.
How do I measure extraction accuracy?
Create a gold-standard dataset of 100+ manually annotated documents. For each entity type, compute precision (extracted entities that are correct), recall (real entities that were found), and F1 score. Track accuracy separately per entity type, as some types are harder than others.
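All three metrics can be computed directly from sets of (entity_type, normalized_value) pairs; a minimal sketch:

```python
def prf(predicted, gold):
    """Precision, recall, and F1 over sets of
    (entity_type, normalized_value) tuples."""
    predicted, gold = set(predicted), set(gold)
    tp = len(predicted & gold)  # true positives
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

Normalizing values before comparison (lowercasing names, ISO dates) matters here, otherwise trivial formatting differences count as misses.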