Building a Document Comparison Agent: AI-Powered Contract and Document Diff

Build an AI agent that extracts text from documents, aligns corresponding sections, detects meaningful differences between versions, and generates clear summaries highlighting what changed and why it matters.

Beyond Simple Text Diff

Standard diff tools compare text line by line. They will tell you that line 47 changed from "30 days" to "45 days" — but they will not tell you this is a payment terms extension that affects your cash flow. A document comparison agent understands context. It groups changes by section, classifies their significance (cosmetic, substantive, material), and explains the business impact of each change.

This is especially valuable for contract review, policy updates, regulatory filings, and any document where the meaning of changes matters as much as their location.
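
To make the limitation concrete, here is a minimal sketch of what a plain line diff reports for the payment terms example. The two clause texts are invented for illustration.

```python
import difflib

# Hypothetical clause text, invented for this illustration.
original = [
    "4. Payment Terms",
    "Invoices are due within 30 days of receipt.",
]
revised = [
    "4. Payment Terms",
    "Invoices are due within 45 days of receipt.",
]

# unified_diff reports which lines changed, but nothing about what the change means.
diff_lines = list(
    difflib.unified_diff(original, revised, fromfile="original", tofile="revised", lineterm="")
)
print("\n".join(diff_lines))
```

The output marks the old line with `-` and the new line with `+`. Deciding that this is a payment terms extension is left entirely to the reader, which is the gap the agent is meant to fill.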

Text Extraction Tool

Documents arrive in various formats. This tool extracts clean text from PDFs, DOCX files, and plain text:

from pathlib import Path
from agents import Agent, Runner, function_tool

@function_tool
def extract_text(file_path: str) -> str:
    """Extract text content from a document file.
    Supports .txt, .pdf, and .docx formats."""
    path = Path(file_path)
    suffix = path.suffix.lower()

    try:
        if suffix == ".txt":
            return path.read_text(encoding="utf-8")

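        elif suffix == ".pdf":
            # Requires PyMuPDF. Releases that predate the "pymupdf" module
            # name expose the same API via "import fitz".
            import pymupdf
            # The context manager closes the file even if a page fails to parse.
            with pymupdf.open(file_path) as doc:
                pages = [page.get_text() for page in doc]
            return "\n\n".join(pages)

        elif suffix == ".docx":
            # Requires python-docx. Note: doc.paragraphs covers body
            # paragraphs only; text inside tables is not included.
            from docx import Document
            doc = Document(file_path)
            paragraphs = [p.text for p in doc.paragraphs if p.text.strip()]
            return "\n\n".join(paragraphs)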

        else:
            return f"Unsupported format: {suffix}"

    except Exception as e:
        return f"Extraction error: {e}"

Section Alignment Tool

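Contracts and legal documents are organized into numbered sections. The tools below load each version into a shared in-memory store under a label such as 'original' or 'revised' and estimate its section count from common heading patterns, so that the later tools can compare versions by label: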

import re
import difflib

_documents: dict[str, str] = {}

@function_tool
def load_document(label: str, file_path: str) -> str:
    """Load and store a document for comparison. Use labels like
    'original' and 'revised'."""
    path = Path(file_path)
    if path.suffix.lower() != ".txt":
        # Only plain text is read here; other formats go through extract_text.
        return f"Use extract_text for {path.suffix} files, then call store_text."

    text = path.read_text(encoding="utf-8")

    _documents[label] = text
    word_count = len(text.split())
    section_count = len(re.split(r"\n(?=\d+\.|Section |Article |ARTICLE )", text))
    return f"Loaded '{label}': {word_count} words, ~{section_count} sections."

@function_tool
def store_text(label: str, text: str) -> str:
    """Store already-extracted text under a label for comparison."""
    _documents[label] = text
    return f"Stored '{label}': {len(text.split())} words."
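
The loader only counts sections. As a rough sketch of how sections could actually be split and paired between versions, the helper below reuses the same heading pattern. The names `split_sections` and `align_sections` are hypothetical and not part of the original tools. The sketch assumes each heading sits on its own line and that headings are unique within a document.

```python
import re
from typing import Optional

# Same heading pattern the loader uses for its section estimate.
SECTION_BREAK = re.compile(r"\n(?=\d+\.|Section |Article |ARTICLE )")


def split_sections(text: str) -> dict[str, str]:
    """Map each section's first line (its heading) to the section body."""
    sections: dict[str, str] = {}
    for chunk in SECTION_BREAK.split(text):
        chunk = chunk.strip()
        if not chunk:
            continue
        heading, _, body = chunk.partition("\n")
        sections[heading.strip()] = body.strip()
    return sections


def align_sections(original: str, revised: str) -> list[tuple[str, Optional[str], Optional[str]]]:
    """Pair sections by identical heading; None marks a section missing on one side."""
    a, b = split_sections(original), split_sections(revised)
    headings = list(a) + [h for h in b if h not in a]
    return [(h, a.get(h), b.get(h)) for h in headings]
```

Pairing by exact heading is the simplest possible alignment. It breaks as soon as a section is renumbered or renamed, so treat it as a starting point rather than a complete solution.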

Difference Detection Tool

This tool finds the actual differences between two document versions:

@function_tool
def compute_diff(label_a: str, label_b: str) -> str:
    """Compute differences between two loaded documents.
    Returns additions, deletions, and modifications."""
    if label_a not in _documents or label_b not in _documents:
        available = ", ".join(_documents.keys())
        return f"Missing document. Available: {available}"

    lines_a = _documents[label_a].splitlines()
    lines_b = _documents[label_b].splitlines()

    differ = difflib.unified_diff(
        lines_a, lines_b,
        fromfile=label_a, tofile=label_b,
        lineterm="",
    )
    diff_lines = list(differ)

    if not diff_lines:
        return "Documents are identical."

    # Count changed lines, skipping the "+++"/"---" file header lines
    additions = sum(1 for line in diff_lines if line.startswith("+") and not line.startswith("+++"))
    deletions = sum(1 for line in diff_lines if line.startswith("-") and not line.startswith("---"))

    # Group diff lines into change blocks; each "@@" header starts a new block.
    # Lines before the first "@@" are the file headers and are not a change,
    # so current_change stays None until the first hunk begins.
    changes = []
    current_change = None
    for line in diff_lines:
        if line.startswith("@@"):
            if current_change:
                changes.append("\n".join(current_change))
            current_change = [line]
        elif current_change is not None:
            current_change.append(line)
    if current_change:
        changes.append("\n".join(current_change))

    output = (
        f"Diff Summary: {additions} additions, {deletions} deletions, "
        f"{len(changes)} change blocks\n\n"
    )
    # Show first 10 change blocks
    for i, change in enumerate(changes[:10]):
        output += f"--- Change {i+1} ---\n{change}\n\n"

    if len(changes) > 10:
        output += f"... and {len(changes) - 10} more change blocks."

    return output
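
A line-level diff shows the whole old and new line even when a single word changed. One possible refinement, sketched here with a hypothetical `word_changes` helper that is not part of the original tool, is to compare the words of a removed line against the words of the line that replaced it:

```python
import difflib


def word_changes(old_line: str, new_line: str) -> list[tuple[str, str, str]]:
    """Return (operation, old_words, new_words) for each differing word span."""
    old_words, new_words = old_line.split(), new_line.split()
    matcher = difflib.SequenceMatcher(None, old_words, new_words)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue  # unchanged words carry no information for the report
        changes.append((tag, " ".join(old_words[i1:i2]), " ".join(new_words[j1:j2])))
    return changes


print(word_changes(
    "Invoices are due within 30 days of receipt.",
    "Invoices are due within 45 days of receipt.",
))
# [('replace', '30', '45')]
```

Handing the agent the pair `30` and `45` instead of two full sentences keeps the change itself in focus and shortens the tool output.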

Similarity Scoring Tool

Quantify how different two documents are overall:

@function_tool
def similarity_score(label_a: str, label_b: str) -> str:
    """Calculate overall similarity between two documents."""
    if label_a not in _documents or label_b not in _documents:
        return "Missing document."

    text_a = _documents[label_a]
    text_b = _documents[label_b]

    # Sequence matcher for overall similarity
    ratio = difflib.SequenceMatcher(None, text_a, text_b).ratio()

    # Word-level comparison
    words_a = set(text_a.lower().split())
    words_b = set(text_b.lower().split())
    jaccard = len(words_a & words_b) / len(words_a | words_b) if (words_a | words_b) else 0

    return (
        f"Similarity between '{label_a}' and '{label_b}':\n"
        f"  Character-level similarity: {ratio:.1%}\n"
        f"  Word overlap (Jaccard): {jaccard:.1%}\n"
        f"  Unique to '{label_a}': {len(words_a - words_b)} words\n"
        f"  Unique to '{label_b}': {len(words_b - words_a)} words"
    )
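
The two metrics answer different questions, which a small standalone example makes visible. The sample strings are invented, and `jaccard` simply repeats the word-set calculation from the tool above.

```python
import difflib


def jaccard(text_a: str, text_b: str) -> float:
    """Word-set overlap: shared distinct words divided by all distinct words."""
    words_a, words_b = set(text_a.lower().split()), set(text_b.lower().split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0


a = "payment is due within 30 days"
b = "payment is due within 45 days"
reordered = "within 30 days payment is due"

char_ratio = difflib.SequenceMatcher(None, a, b).ratio()
print(f"one word changed: chars={char_ratio:.1%}, jaccard={jaccard(a, b):.1%}")

# Jaccard ignores word order, so a reordered sentence still scores 100%.
reordered_ratio = difflib.SequenceMatcher(None, a, reordered).ratio()
print(f"reordered: chars={reordered_ratio:.1%}, jaccard={jaccard(a, reordered):.1%}")
```

The character-level ratio is sensitive to order and position, while Jaccard only sees vocabulary. A large gap between the two suggests that text was moved rather than rewritten.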

Assembling the Document Comparison Agent

doc_agent = Agent(
    name="Document Comparator",
    instructions="""You are a document comparison agent specializing in contracts
and legal documents. When given two document versions:

1. For .txt files, call load_document with labels 'original' and 'revised'.
2. For other formats, call extract_text, then store the result with store_text under the same labels.
3. Call similarity_score for an overall comparison metric.
4. Call compute_diff to get the detailed differences.
5. Analyze each change block and classify it as:
   - Cosmetic: formatting, typos, rephrasing with same meaning
   - Substantive: meaningful change to terms, obligations, or rights
   - Material: high-impact change affecting financial terms, liability,
     termination, or indemnification
6. Produce a report with:
   - Executive Summary (overall similarity, number of material changes)
   - Material Changes (each with before/after text and impact analysis)
   - Substantive Changes (grouped by section)
   - Cosmetic Changes (brief list)
   - Risk Assessment (what the changes mean for the parties involved)""",
    tools=[extract_text, load_document, store_text, compute_diff, similarity_score],
)
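
The three-level classification in the instructions can also be expressed as a small data model, which is useful if you want to post-process or test the agent's findings. This is only a sketch: `Significance`, `ClassifiedChange`, and `group_for_report` are hypothetical names, and the original agent returns free text rather than these objects.

```python
from dataclasses import dataclass
from enum import Enum


class Significance(Enum):
    COSMETIC = "cosmetic"
    SUBSTANTIVE = "substantive"
    MATERIAL = "material"


@dataclass
class ClassifiedChange:
    section: str
    before: str
    after: str
    significance: Significance
    impact: str = ""


def group_for_report(changes: list[ClassifiedChange]) -> dict[str, list[ClassifiedChange]]:
    """Group changes in report order: material first, cosmetic last."""
    order = [Significance.MATERIAL, Significance.SUBSTANTIVE, Significance.COSMETIC]
    grouped: dict[str, list[ClassifiedChange]] = {level.value: [] for level in order}
    for change in changes:
        grouped[change.significance.value].append(change)
    return grouped
```

Grouping by significance mirrors the report layout the instructions ask for, with material changes at the top, where a reviewer looks first.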

Example Usage

result = Runner.run_sync(
    doc_agent,
    "Compare the original contract at /docs/contract_v1.pdf with the "
    "revised version at /docs/contract_v2.pdf. Focus on any changes to "
    "payment terms, liability clauses, and termination conditions.",
)
print(result.final_output)

In one illustrative run, the agent extracts text from both PDFs, computes a 94.2% similarity score, and identifies 12 change blocks. It classifies 2 as material (payment terms extended from 30 to 60 days, liability cap increased from $1M to $5M), 5 as substantive (including a new force majeure clause and updated data handling provisions), and 5 as cosmetic. The risk assessment highlights the cash flow impact of the extended payment terms.

FAQ

Can this agent handle scanned PDFs without selectable text?

Not directly — scanned PDFs require OCR. Add a preprocessing step using pytesseract or a cloud OCR service like Google Document AI. Extract the text via OCR first, then feed it to the comparison agent through store_text.
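
Before routing a file to OCR, it helps to know whether it needs it. The heuristic below is a sketch under stated assumptions: `looks_scanned` is a hypothetical helper, the 20-character threshold and the majority rule are arbitrary starting values, and `page_texts` is assumed to hold the per-page strings returned by `page.get_text()`.

```python
def looks_scanned(page_texts: list[str], min_chars_per_page: int = 20) -> bool:
    """Guess whether a PDF is scanned: most pages yield almost no extractable text."""
    if not page_texts:
        return True  # nothing extractable at all; OCR is the only option left
    sparse = sum(1 for text in page_texts if len(text.strip()) < min_chars_per_page)
    return sparse / len(page_texts) > 0.5


print(looks_scanned(["", "  ", ""]))
print(looks_scanned(["This page has plenty of selectable text in it."] * 3))
```

When the check returns True, run the OCR step and pass its output to store_text. Otherwise the normal extraction path is enough.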

How does the agent handle documents with completely different structures?

The diff tool works best when documents share a similar structure. For documents with reorganized sections, add a section-matching tool that uses semantic similarity (embeddings) to align sections by content rather than position before computing differences.
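
A content-based matcher could look roughly like this. To keep the sketch self-contained, `text_similarity` uses difflib's lexical ratio as a stand-in for embedding cosine similarity, so it is not semantic matching. In practice you would pass an embedding-based function through the `similarity` parameter. All names and the 0.6 threshold are assumptions.

```python
import difflib
from typing import Callable, Optional


def text_similarity(a: str, b: str) -> float:
    """Lexical stand-in for embedding cosine similarity (assumption for this sketch)."""
    return difflib.SequenceMatcher(None, a, b).ratio()


def match_sections(
    original: dict[str, str],
    revised: dict[str, str],
    similarity: Callable[[str, str], float] = text_similarity,
    threshold: float = 0.6,
) -> dict[str, Optional[str]]:
    """Greedily map each original section to its most similar unused revised section."""
    matches: dict[str, Optional[str]] = {}
    unused = set(revised)
    for name, body in original.items():
        best, best_score = None, threshold
        for candidate in sorted(unused):
            score = similarity(body, revised[candidate])
            if score > best_score:  # must beat the threshold and any earlier candidate
                best, best_score = candidate, score
        matches[name] = best  # None means no revised section was similar enough
        if best is not None:
            unused.discard(best)
    return matches
```

Each matched pair can then be diffed on its own, so that a section moved from position 4 to position 7 shows up as one small edit rather than a deletion plus an addition. Greedy matching is simple but order-dependent, and an optimal assignment would need a different algorithm.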

This agent provides a strong first pass that saves hours of manual review. However, for legally binding decisions, always have a qualified attorney review the agent's findings. The agent excels at surfacing changes that might be missed during manual review, not at replacing legal judgment.

