Building a Document Comparison Agent: AI-Powered Contract and Document Diff
Build an AI agent that extracts text from documents, aligns corresponding sections, detects meaningful differences between versions, and generates clear summaries highlighting what changed and why it matters.
Beyond Simple Text Diff
Standard diff tools compare text line by line. They will tell you that line 47 changed from "30 days" to "45 days" — but they will not tell you this is a payment terms extension that affects your cash flow. A document comparison agent understands context. It groups changes by section, classifies their significance (cosmetic, substantive, material), and explains the business impact of each change.
This is especially valuable for contract review, policy updates, regulatory filings, and any document where the meaning of changes matters as much as their location.
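To make the limitation concrete, here is a minimal sketch of what a plain line diff reports, using Python's standard difflib. The clause text is invented for illustration and does not come from a real contract.

```python
import difflib

# Hypothetical clause text, invented for illustration.
original = [
    "Payment is due within 30 days of invoice.",
    "Late fees accrue at 1% per month.",
]
revised = [
    "Payment is due within 45 days of invoice.",
    "Late fees accrue at 1% per month.",
]

# A unified diff marks removed lines with "-" and added lines with "+".
diff = list(
    difflib.unified_diff(
        original, revised, fromfile="original", tofile="revised", lineterm=""
    )
)
for line in diff:
    print(line)
```

The diff pinpoints the changed line but carries no notion that this is a payment term or that the change is material. That interpretation is the part the agent adds.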
Text Extraction Tool
Documents arrive in various formats. This tool extracts clean text from PDFs, DOCX files, and plain text:
from pathlib import Path

from agents import Agent, Runner, function_tool


@function_tool
def extract_text(file_path: str) -> str:
    """Extract text content from a document file.
    Supports .txt, .pdf, and .docx formats."""
    path = Path(file_path)
    suffix = path.suffix.lower()

    try:
        if suffix == ".txt":
            return path.read_text(encoding="utf-8")

        elif suffix == ".pdf":
            import pymupdf

            doc = pymupdf.open(file_path)
            pages = []
            for page in doc:
                pages.append(page.get_text())
            doc.close()
            return "\n\n".join(pages)

        elif suffix == ".docx":
            from docx import Document

            doc = Document(file_path)
            paragraphs = [p.text for p in doc.paragraphs if p.text.strip()]
            return "\n\n".join(paragraphs)

        else:
            return f"Unsupported format: {suffix}"

    except Exception as e:
        return f"Extraction error: {e}"
Section Alignment Tool
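Contracts and legal documents are structured into sections. These tools load each version into a shared in-memory store under a label such as 'original' or 'revised', and report a rough section count based on numbered headings and "Section"/"Article" markers. Later tools then compare versions by label: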
import re
import difflib
from pathlib import Path

# In-memory store shared by the comparison tools, keyed by label
_documents: dict[str, str] = {}


@function_tool
def load_document(label: str, file_path: str) -> str:
    """Load and store a document for comparison. Use labels like
    'original' and 'revised'."""
    path = Path(file_path)
    # Compare the lowercased suffix so ".TXT" is handled like ".txt"
    if path.suffix.lower() == ".txt":
        text = path.read_text(encoding="utf-8")
    else:
        # Other formats go through extract_text, then store_text
        return f"Use extract_text for {path.suffix} files, then call store_text."

    _documents[label] = text
    word_count = len(text.split())
    # Rough estimate: split before lines that look like section headings
    section_count = len(re.split(r"\n(?=\d+\.|Section |Article |ARTICLE )", text))
    return f"Loaded '{label}': {word_count} words, ~{section_count} sections."


@function_tool
def store_text(label: str, text: str) -> str:
    """Store already-extracted text under a label for comparison."""
    _documents[label] = text
    return f"Stored '{label}': {len(text.split())} words."
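The tools above only estimate a section count. As a rough sketch of how sections could actually be split and paired between versions, the helpers below reuse the same heading pattern and align sections by their titles. The names split_sections, normalize_heading, and align_sections are hypothetical and not part of the agent; the approach assumes headings use Arabic numerals and that section titles are mostly unchanged between versions.

```python
import difflib
import re
from typing import Optional

# Same heading pattern load_document uses to estimate the section count.
SECTION_BOUNDARY = re.compile(r"\n(?=\d+\.|Section |Article |ARTICLE )")


def split_sections(text: str) -> list[tuple[str, str]]:
    """Return (heading, full section text) pairs in document order."""
    sections = []
    for chunk in SECTION_BOUNDARY.split(text):
        chunk = chunk.strip()
        if chunk:
            sections.append((chunk.splitlines()[0].strip(), chunk))
    return sections


def normalize_heading(heading: str) -> str:
    """Drop the leading number so renumbered sections still match by title."""
    title = re.sub(r"^(?:Section |Article |ARTICLE )?\d+\.?\s*", "", heading).strip()
    return title.lower() or heading.lower()


def align_sections(
    original: str, revised: str
) -> list[tuple[Optional[str], Optional[str]]]:
    """Pair headings across versions; None marks an added or removed section."""
    heads_a = [h for h, _ in split_sections(original)]
    heads_b = [h for h, _ in split_sections(revised)]
    matcher = difflib.SequenceMatcher(
        None,
        [normalize_heading(h) for h in heads_a],
        [normalize_heading(h) for h in heads_b],
        autojunk=False,
    )
    pairs: list[tuple[Optional[str], Optional[str]]] = []
    for tag, a1, a2, b1, b2 in matcher.get_opcodes():
        if tag == "equal":
            pairs.extend(zip(heads_a[a1:a2], heads_b[b1:b2]))
        else:
            # Unmatched headings: removed from the original or added in the revision
            pairs.extend((h, None) for h in heads_a[a1:a2])
            pairs.extend((None, h) for h in heads_b[b1:b2])
    return pairs


# Hypothetical contract text, invented for illustration.
v1 = "1. Payment\nNet 30 days.\n2. Liability\nCapped at $1M.\n3. Termination\n30 days notice."
v2 = (
    "1. Payment\nNet 45 days.\n2. Liability\nCapped at $1M.\n"
    "3. Force Majeure\nNew clause.\n4. Termination\n30 days notice."
)
pairs = align_sections(v1, v2)
for a, b in pairs:
    print(f"{a!s:20} <-> {b}")
```

Matching on normalized titles lets an inserted clause shift the numbering without breaking the pairing. Sections whose titles were reworded would show up as one removal plus one addition, which is where content-based matching becomes useful.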
Difference Detection Tool
This tool finds the actual differences between two document versions:
@function_tool
def compute_diff(label_a: str, label_b: str) -> str:
    """Compute line-level differences between two loaded documents.
    Returns counts of added and deleted lines plus the first change blocks."""
    if label_a not in _documents or label_b not in _documents:
        available = ", ".join(_documents.keys())
        return f"Missing document. Available: {available}"

    lines_a = _documents[label_a].splitlines()
    lines_b = _documents[label_b].splitlines()

    diff_lines = list(difflib.unified_diff(
        lines_a, lines_b,
        fromfile=label_a, tofile=label_b,
        lineterm="",
    ))

    if not diff_lines:
        return "Documents are identical."

    # Skip the two file-header lines ("--- a", "+++ b") so they are neither
    # counted as changes nor collected into a spurious first change block
    body = diff_lines[2:]
    additions = sum(1 for line in body if line.startswith("+"))
    deletions = sum(1 for line in body if line.startswith("-"))

    # Group the remaining lines into change blocks; each starts at an "@@" header
    changes = []
    current_change = []
    for line in body:
        if line.startswith("@@") and current_change:
            changes.append("\n".join(current_change))
            current_change = []
        current_change.append(line)
    if current_change:
        changes.append("\n".join(current_change))

    output = (
        f"Diff Summary: {additions} additions, {deletions} deletions, "
        f"{len(changes)} change blocks\n\n"
    )

    # Show first 10 change blocks
    for i, change in enumerate(changes[:10]):
        output += f"--- Change {i+1} ---\n{change}\n\n"

    if len(changes) > 10:
        output += f"... and {len(changes) - 10} more change blocks."

    return output
Similarity Scoring Tool
Quantify how different two documents are overall:
@function_tool
def similarity_score(label_a: str, label_b: str) -> str:
    """Calculate overall similarity between two documents."""
    if label_a not in _documents or label_b not in _documents:
        return "Missing document."

    text_a = _documents[label_a]
    text_b = _documents[label_b]

    # Sequence matcher for overall similarity
    ratio = difflib.SequenceMatcher(None, text_a, text_b).ratio()

    # Word-level comparison
    words_a = set(text_a.lower().split())
    words_b = set(text_b.lower().split())
    jaccard = len(words_a & words_b) / len(words_a | words_b) if (words_a | words_b) else 0

    return (
        f"Similarity between '{label_a}' and '{label_b}':\n"
        f"  Character-level similarity: {ratio:.1%}\n"
        f"  Word overlap (Jaccard): {jaccard:.1%}\n"
        f"  Unique to '{label_a}': {len(words_a - words_b)} words\n"
        f"  Unique to '{label_b}': {len(words_b - words_a)} words"
    )
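The two metrics answer different questions, which a small worked example may make clearer. The sketch below recomputes them outside the tool on two invented sentences that use exactly the same words in a different order. The standalone jaccard helper is a hypothetical extraction of the tool's logic, written only for illustration.

```python
import difflib


def jaccard(text_a: str, text_b: str) -> float:
    """Word-set overlap: |A & B| / |A | B|, ignoring order and repetition."""
    words_a = set(text_a.lower().split())
    words_b = set(text_b.lower().split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0


# Invented sentences: same vocabulary, opposite obligation.
a = "the buyer shall pay the seller"
b = "the seller shall pay the buyer"

word_overlap = jaccard(a, b)
char_ratio = difflib.SequenceMatcher(None, a, b).ratio()

print(f"Word overlap (Jaccard): {word_overlap:.1%}")
print(f"Character-level similarity: {char_ratio:.1%}")
```

Here the word sets are identical, so Jaccard reports full overlap even though the obligation is reversed, while the character-level ratio drops below 100% because the order differs. Neither number captures meaning, so they are best read as a coarse signal before the detailed diff.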
Assembling the Document Comparison Agent
doc_agent = Agent(
    name="Document Comparator",
    instructions="""You are a document comparison agent specializing in contracts
and legal documents. When given two document versions:

1. Extract text from both documents using extract_text.
2. Store them with store_text using labels 'original' and 'revised'.
3. Call similarity_score for an overall comparison metric.
4. Call compute_diff to get the detailed differences.
5. Analyze each change block and classify it as:
   - Cosmetic: formatting, typos, rephrasing with same meaning
   - Substantive: meaningful change to terms, obligations, or rights
   - Material: high-impact change affecting financial terms, liability,
     termination, or indemnification
6. Produce a report with:
   - Executive Summary (overall similarity, number of material changes)
   - Material Changes (each with before/after text and impact analysis)
   - Substantive Changes (grouped by section)
   - Cosmetic Changes (brief list)
   - Risk Assessment (what the changes mean for the parties involved)""",
    tools=[extract_text, load_document, store_text, compute_diff, similarity_score],
)
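The agent above returns its report as free text. If downstream code needs to sort or filter findings, one option is to hold each classified change in a small record. The sketch below is a hypothetical structure mirroring the three significance levels from the instructions. The names Significance, ClassifiedChange, and render_report are assumptions for illustration, and the sample changes are invented.

```python
from dataclasses import dataclass
from enum import IntEnum


class Significance(IntEnum):
    """Levels from the agent instructions, ordered by impact."""
    COSMETIC = 1
    SUBSTANTIVE = 2
    MATERIAL = 3


@dataclass
class ClassifiedChange:
    section: str
    before: str
    after: str
    significance: Significance
    impact: str = ""


def render_report(changes: list[ClassifiedChange]) -> str:
    """List changes grouped by significance, highest impact first."""
    lines = []
    for level in sorted(Significance, reverse=True):
        group = [c for c in changes if c.significance == level]
        if not group:
            continue
        lines.append(f"{level.name.title()} Changes ({len(group)})")
        for c in group:
            lines.append(f"  [{c.section}] {c.before!r} -> {c.after!r}")
            if c.impact:
                lines.append(f"    Impact: {c.impact}")
    return "\n".join(lines)


# Invented sample findings.
findings = [
    ClassifiedChange("Notices", "e-mail", "email", Significance.COSMETIC),
    ClassifiedChange(
        "Payment", "30 days", "45 days", Significance.MATERIAL,
        impact="Extends payment terms; affects cash flow.",
    ),
]
report = render_report(findings)
print(report)
```

Keeping significance as an ordered enum makes it straightforward to surface material changes first or to filter out cosmetic ones before human review.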
Example Usage
result = Runner.run_sync(
    doc_agent,
    "Compare the original contract at /docs/contract_v1.pdf with the "
    "revised version at /docs/contract_v2.pdf. Focus on any changes to "
    "payment terms, liability clauses, and termination conditions.",
)
print(result.final_output)
In an illustrative run, the agent extracts text from both PDFs and reports something like this: a 94.2% similarity score and 12 change blocks, of which 2 are material (payment terms extended from 30 to 60 days, liability cap increased from $1M to $5M), 5 are substantive (including a new force majeure clause and updated data handling provisions), and 5 are cosmetic. The risk assessment then highlights the cash flow impact of the extended payment terms. Actual figures and classifications depend on the documents and the model.
FAQ
Can this agent handle scanned PDFs without selectable text?
Not directly — scanned PDFs require OCR. Add a preprocessing step using pytesseract or a cloud OCR service like Google Document AI. Extract the text via OCR first, then feed it to the comparison agent through store_text.
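A minimal sketch of that preprocessing step follows, assuming PyMuPDF, Pillow, and pytesseract are installed along with the Tesseract binary. The helper names needs_ocr and ocr_pdf and the 20-character threshold are assumptions chosen for illustration, not tuned values.

```python
def needs_ocr(page_texts: list[str], min_chars_per_page: int = 20) -> bool:
    """Heuristic: treat a PDF as scanned if most pages yield almost no text."""
    if not page_texts:
        return True
    sparse = sum(1 for t in page_texts if len(t.strip()) < min_chars_per_page)
    return sparse > len(page_texts) / 2


def ocr_pdf(file_path: str, dpi: int = 300) -> str:
    """Render each page to an image and run Tesseract OCR on it."""
    # Third-party imports stay local so needs_ocr works without them installed.
    import io

    import pymupdf
    import pytesseract
    from PIL import Image

    doc = pymupdf.open(file_path)
    try:
        pages = []
        for page in doc:
            pix = page.get_pixmap(dpi=dpi)  # rasterize the page
            image = Image.open(io.BytesIO(pix.tobytes("png")))
            pages.append(pytesseract.image_to_string(image))
    finally:
        doc.close()
    return "\n\n".join(pages)


# Example check on per-page text as returned by page.get_text()
scanned_like = ["", "   ", "Some real paragraph of contract text here."]
text_like = ["A full page of extractable contract text."] * 3
print(needs_ocr(scanned_like), needs_ocr(text_like))
```

The OCR output can then be passed to store_text under the usual labels. OCR errors will surface as spurious differences, so it helps to run both versions through the same OCR settings.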
How does the agent handle documents with completely different structures?
The diff tool works best when documents share a similar structure. For documents with reorganized sections, add a section-matching tool that uses semantic similarity (embeddings) to align sections by content rather than position before computing differences.
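As a rough sketch of such a section-matching step, the code below pairs sections greedily by content similarity. To stay self-contained it uses bag-of-words cosine similarity as a lexical stand-in for embeddings; a real embedding model could be plugged in through the similarity parameter. The function names, the 0.5 threshold, and the sample sections are all assumptions for illustration.

```python
import math
import re
from collections import Counter
from typing import Callable


def lexical_similarity(text_a: str, text_b: str) -> float:
    """Cosine similarity over word counts (a stand-in for embedding similarity)."""
    u = Counter(re.findall(r"[a-z0-9]+", text_a.lower()))
    v = Counter(re.findall(r"[a-z0-9]+", text_b.lower()))
    dot = sum(u[w] * v[w] for w in u.keys() & v.keys())
    norm = math.sqrt(sum(c * c for c in u.values())) * math.sqrt(
        sum(c * c for c in v.values())
    )
    return dot / norm if norm else 0.0


def match_sections(
    sections_a: dict[str, str],
    sections_b: dict[str, str],
    threshold: float = 0.5,
    similarity: Callable[[str, str], float] = lexical_similarity,
) -> dict[str, str]:
    """Greedily pair each original section with its most similar unused revised section."""
    scored = sorted(
        (
            (similarity(body_a, body_b), name_a, name_b)
            for name_a, body_a in sections_a.items()
            for name_b, body_b in sections_b.items()
        ),
        reverse=True,
    )
    matches: dict[str, str] = {}
    used_b: set[str] = set()
    for score, name_a, name_b in scored:
        if score < threshold:
            break  # remaining pairs are too dissimilar to count as the same section
        if name_a in matches or name_b in used_b:
            continue
        matches[name_a] = name_b
        used_b.add(name_b)
    return matches


# Invented sections: the revision renames and reorders both clauses.
original = {
    "Payment": "Buyer shall pay all invoices within 30 days of receipt.",
    "Termination": "Either party may terminate this agreement with 30 days written notice.",
}
revised = {
    "Term and Termination": "Either party may terminate this agreement with 60 days written notice.",
    "Fees": "Buyer shall pay all invoices within 45 days of receipt.",
}
matches = match_sections(original, revised)
print(matches)
```

Once sections are paired, each pair can be stored under its own labels and passed to compute_diff, so the diff compares like with like. Sections left unmatched on either side are candidates for additions or removals.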
Is this suitable for comparing legal contracts in production?
This agent provides a strong first pass that saves hours of manual review. However, for legally binding decisions, always have a qualified attorney review the agent's findings. The agent excels at surfacing changes that might be missed during manual review, not at replacing legal judgment.
CallSphere Team