Extracting Entities from Documents: Names, Dates, Addresses, and Custom Types
Build production-grade entity extraction with LLMs. Learn schema design for names, dates, addresses, and custom entity types, plus batch extraction techniques and accuracy optimization strategies.
Entity Extraction with LLMs vs. Traditional NER
Traditional Named Entity Recognition (NER) models like spaCy's en_core_web_lg are fast and work well for standard entity types: person names, organizations, locations. But they struggle with domain-specific entities (medical codes, legal citations, product SKUs) and they cannot extract structured attributes for each entity.
LLM-based extraction handles arbitrary entity types, extracts attributes, and understands context that statistical models miss. The tradeoff is cost and latency: an LLM call takes 500ms-2s versus 5ms for spaCy. For most business applications, the accuracy gain justifies the cost.
Designing Entity Schemas
Define a dedicated Pydantic model for each entity type you need:
```python
from pydantic import BaseModel, Field
from typing import List, Optional, Literal


class PersonEntity(BaseModel):
    full_name: str
    first_name: Optional[str] = None
    last_name: Optional[str] = None
    title: Optional[str] = Field(default=None, description="Mr, Mrs, Dr, etc.")
    role: Optional[str] = Field(default=None, description="Job title or role")
    organization: Optional[str] = None


class DateEntity(BaseModel):
    raw_text: str = Field(description="Original date text from document")
    normalized: Optional[str] = Field(
        default=None,
        description="ISO format YYYY-MM-DD when possible"
    )
    date_type: Literal["exact", "relative", "range", "approximate"]


class AddressEntity(BaseModel):
    full_address: str
    street: Optional[str] = None
    city: Optional[str] = None
    state: Optional[str] = None
    postal_code: Optional[str] = None
    country: Optional[str] = Field(default="US")


class MoneyEntity(BaseModel):
    amount: float
    currency: str = Field(default="USD")
    raw_text: str = Field(description="Original text, e.g., '$1.2 million'")
```
Keeping the raw_text field alongside normalized values is essential for auditing. When a downstream process questions an extracted value, you can trace it back to the exact source text.
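For instance, a small normalization helper (a hypothetical utility, not part of the models above) can turn shorthand like "$2.5M" into a numeric amount while preserving the source string for the audit trail:

```python
import re

# Multipliers for common money shorthand suffixes.
_MULTIPLIERS = {"k": 1_000, "m": 1_000_000, "b": 1_000_000_000}


def normalize_money(raw_text: str) -> dict:
    """Parse shorthand like '$2.5M' into a numeric amount,
    keeping the original text for auditing."""
    match = re.search(
        r"\$?\s*([\d,.]+)\s*(k|m|b|thousand|million|billion)?",
        raw_text, re.IGNORECASE,
    )
    if not match:
        raise ValueError(f"Unrecognized money format: {raw_text!r}")
    amount = float(match.group(1).replace(",", ""))
    # Map 'M', 'million', etc. to a multiplier via the first letter.
    suffix = (match.group(2) or "").lower()[:1]
    amount *= _MULTIPLIERS.get(suffix, 1)
    return {"amount": amount, "currency": "USD", "raw_text": raw_text}
```

If a reviewer questions the `2500000.0` in your database, the stored `raw_text` shows it came from "$2.5M" in the source document.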
The Extraction Prompt
A well-structured prompt dramatically improves extraction quality:
```python
from openai import OpenAI
import instructor

client = instructor.from_openai(OpenAI())


class DocumentEntities(BaseModel):
    people: List[PersonEntity]
    dates: List[DateEntity]
    addresses: List[AddressEntity]
    monetary_values: List[MoneyEntity]


def extract_entities(text: str) -> DocumentEntities:
    return client.chat.completions.create(
        model="gpt-4o",
        response_model=DocumentEntities,
        max_retries=2,
        messages=[
            {
                "role": "system",
                "content": (
                    "You are a precise document entity extractor. "
                    "Extract ALL entities of each type from the text. "
                    "If an entity attribute is not explicitly stated, use null. "
                    "Never infer or guess values not present in the text."
                )
            },
            {"role": "user", "content": text}
        ],
    )
```
The instruction "never infer or guess" is critical. Without it, the model tends to hallucinate plausible-sounding addresses or fill in missing first/last name splits incorrectly.
Custom Entity Types
Define domain-specific entities for your use case. Here is an example for legal document extraction:
```python
class LegalCitation(BaseModel):
    case_name: str
    citation: str = Field(description="e.g., '123 F.3d 456'")
    court: Optional[str] = None
    year: Optional[int] = None


class ContractClause(BaseModel):
    clause_type: Literal[
        "termination", "liability", "indemnification",
        "confidentiality", "payment_terms", "warranty", "other"
    ]
    summary: str
    parties_involved: List[str]
    key_conditions: List[str]
```
The Literal type constrains the model to a fixed set of values, which prevents it from inventing clause types that your downstream system cannot handle.
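As a standalone sketch (re-declaring a trimmed ContractClause so it runs on its own), you can watch Pydantic reject an out-of-set value; with instructor, that same ValidationError is what feeds the error message back to the model on retry:

```python
from typing import List, Literal
from pydantic import BaseModel, ValidationError


class ContractClause(BaseModel):
    # Trimmed re-declaration for a self-contained demo.
    clause_type: Literal["termination", "liability", "other"]
    summary: str
    parties_involved: List[str]
    key_conditions: List[str]


def is_valid_clause_type(value: str) -> bool:
    """Return False when Pydantic rejects a clause_type
    outside the Literal set."""
    try:
        ContractClause(clause_type=value, summary="",
                       parties_involved=[], key_conditions=[])
        return True
    except ValidationError:
        return False
```

A value like "force_majeure" fails validation, so it can never reach your downstream system unchecked.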
Batch Extraction for Multiple Documents
When processing many documents, use async calls for throughput:
```python
import asyncio
from openai import AsyncOpenAI

async_client = instructor.from_openai(AsyncOpenAI())


async def extract_batch(documents: List[str]) -> List[DocumentEntities]:
    tasks = [
        async_client.chat.completions.create(
            model="gpt-4o",
            response_model=DocumentEntities,
            max_retries=2,
            messages=[
                {"role": "system", "content": "Extract all entities from the text."},
                {"role": "user", "content": doc}
            ],
        )
        for doc in documents
    ]
    # Process in batches of 10 to respect rate limits
    results = []
    for i in range(0, len(tasks), 10):
        batch = tasks[i:i + 10]
        results.extend(await asyncio.gather(*batch))
    return results
```
Improving Accuracy with Few-Shot Examples
Include examples in your prompt to calibrate the model:
```python
FEW_SHOT_EXAMPLE = """
Text: "Dr. Sarah Chen, Chief Medical Officer at Valley Health (123 Oak St,
Portland, OR 97201), approved a $2.5M equipment purchase on March 15, 2025."

Expected extraction:
- Person: Dr. Sarah Chen, role=Chief Medical Officer, org=Valley Health
- Address: 123 Oak St, Portland, OR 97201
- Money: $2,500,000 USD (raw: "$2.5M")
- Date: 2025-03-15, type=exact (raw: "March 15, 2025")
"""


def extract_with_examples(text: str) -> DocumentEntities:
    return client.chat.completions.create(
        model="gpt-4o",
        response_model=DocumentEntities,
        messages=[
            {
                "role": "system",
                "content": f"Extract entities precisely. Example:\n{FEW_SHOT_EXAMPLE}"
            },
            {"role": "user", "content": text}
        ],
    )
```
Few-shot examples improve extraction accuracy by 10-20% on complex documents, especially for ambiguous cases like distinguishing between a person's location and a company's headquarters.
FAQ
How do I handle entities that span sentence boundaries?
Use overlapping chunking when splitting documents, with at least 1-2 sentences of overlap. After extraction, deduplicate entities by comparing normalized names. If an entity appears in the overlap region of two chunks, you will get it from both and can merge the attributes.
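A minimal sketch of that chunk-and-merge flow, using hypothetical helpers that operate on sentence lists and plain entity dicts:

```python
def chunk_with_overlap(sentences, chunk_size=8, overlap=2):
    """Split a sentence list into chunks sharing `overlap` sentences,
    so a boundary-spanning entity appears whole in at least one chunk."""
    step = chunk_size - overlap
    return [sentences[i:i + chunk_size] for i in range(0, len(sentences), step)]


def dedupe_people(people):
    """Merge duplicate extractions by normalized full name, filling
    null attributes from whichever copy has a value."""
    merged = {}
    for p in people:
        key = p["full_name"].strip().lower()
        if key in merged:
            for field, value in p.items():
                if merged[key].get(field) is None and value is not None:
                    merged[key][field] = value
        else:
            merged[key] = dict(p)
    return list(merged.values())
```

Merging by filling nulls means a chunk that saw "Sarah Chen" with no role and a chunk that saw her role both contribute to one complete record.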
When should I use spaCy instead of an LLM for entity extraction?
Use spaCy when you need sub-10ms latency, are extracting only standard entity types (person, org, location), and are processing millions of documents where LLM costs would be prohibitive. Use LLMs when you need custom entity types, attribute extraction, or when context-dependent interpretation is important.
How do I measure extraction accuracy?
Create a gold-standard dataset of 100+ manually annotated documents. For each entity type, compute precision (extracted entities that are correct), recall (real entities that were found), and F1 score. Track accuracy separately per entity type, as some types are harder than others.
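All three metrics can be computed directly from sets of (entity_type, normalized_value) pairs; a minimal sketch:

```python
def prf(predicted, gold):
    """Precision, recall, and F1 over sets of
    (entity_type, normalized_value) tuples."""
    predicted, gold = set(predicted), set(gold)
    tp = len(predicted & gold)  # true positives
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

Normalizing values before comparison (lowercasing names, ISO dates) matters here, otherwise trivial formatting differences count as misses.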