
AI-Powered Document Comparison: Redline Generation and Change Tracking with Vision

Build an AI agent that compares two versions of a document, identifies additions, deletions, and modifications, generates visual redlines, and produces annotated change summaries for legal, contract, and policy review workflows.

Why Document Comparison Needs AI

Traditional diff tools work character-by-character or line-by-line. That works for code but fails for documents. When a lawyer restructures a paragraph — moving sentences around, changing "shall" to "must," and splitting a clause into two — a naive diff shows the entire paragraph as deleted and re-added. What you actually want is a semantic understanding of what changed and whether those changes matter.

AI-powered document comparison works at the meaning level. It aligns paragraphs across document versions, detects rewording versus substantive changes, and generates human-readable summaries of what shifted and why it might matter.
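To see the failure mode concretely, here is a minimal stdlib sketch: two phrasings of the same delivery obligation (invented for illustration) that a naive diff treats as a wholesale delete-and-add:

```python
import difflib

old = "The vendor shall deliver goods within 30 days."
new = "Goods must be delivered by the vendor within thirty days."

# A line-level diff reports one deletion and one addition -- it cannot
# tell that both sentences express the same obligation.
for line in difflib.unified_diff([old], [new], lineterm=""):
    print(line)

# Even the character-level similarity ratio stays well below anything
# you could safely treat as "unchanged".
ratio = difflib.SequenceMatcher(None, old, new).ratio()
print(f"similarity: {ratio:.2f}")
```

Semantic alignment, covered below, is what recovers the connection between the two sentences.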

The Comparison Pipeline

The system works in four stages: text extraction from both documents, alignment of corresponding sections, change detection and classification, and output generation (redlines, annotations, summary).

Text Extraction and Segmentation

First, extract and segment both documents into comparable units:

import pdfplumber
from dataclasses import dataclass


@dataclass
class DocumentSection:
    index: int
    heading: str | None
    text: str
    page: int
    section_type: str  # "heading", "paragraph", "list", "table"


def extract_sections(pdf_path: str) -> list[DocumentSection]:
    """Extract structured sections from a PDF document."""
    sections = []
    current_idx = 0

    with pdfplumber.open(pdf_path) as pdf:
        for page_num, page in enumerate(pdf.pages):
            text = page.extract_text() or ""
            # Splitting on blank lines assumes extract_text preserves
            # paragraph gaps; tune the delimiter for your documents' layout.
            paragraphs = text.split("\n\n")

            for para in paragraphs:
                para = para.strip()
                if not para:
                    continue

                section_type = classify_section(para)
                heading = para if section_type == "heading" else None

                sections.append(DocumentSection(
                    index=current_idx,
                    heading=heading,
                    text=para,
                    page=page_num + 1,
                    section_type=section_type,
                ))
                current_idx += 1

    return sections


def classify_section(text: str) -> str:
    """Classify a text block as heading, paragraph, or list."""
    lines = text.strip().split("\n")

    if len(lines) == 1 and len(text) < 80 and text.isupper():
        return "heading"
    if all(line.strip().startswith(("-", "*", "•")) for line in lines):
        return "list"
    # Numbered headings like "2.1 Scope" start with a digit and a dot
    if len(lines) == 1 and text[:1].isdigit() and "." in text[:5]:
        return "heading"

    return "paragraph"
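Character-position checks for numbered headings are fragile; a small regex is easier to reason about. This sketch assumes headings are numbered like "2.1 Payment Terms" -- adapt the pattern to your documents:

```python
import re

# Matches "3. Term", "2.1 Payment Terms", "10.4.2 Notices" -- an assumed
# numbering convention, not a universal rule.
NUMBERED_HEADING = re.compile(r"^\d+(\.\d+)*\.?\s+\S")

def looks_like_numbered_heading(text: str) -> bool:
    return bool(NUMBERED_HEADING.match(text))

print(looks_like_numbered_heading("2.1 Payment Terms"))    # numbered heading
print(looks_like_numbered_heading("In 2024, the vendor"))  # ordinary prose
```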

Section Alignment with Semantic Similarity

Align sections between the two document versions using embedding-based similarity:


from openai import OpenAI
import numpy as np


def get_embeddings(texts: list[str]) -> list[list[float]]:
    """Get embeddings for a list of text sections."""
    client = OpenAI()
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=texts,
    )
    return [item.embedding for item in response.data]
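One practical wrinkle: the embeddings endpoint caps how many inputs a single request may carry (the exact limit depends on the model and account, so treat the batch size below as a placeholder). A simple batching wrapper keeps long documents within that cap; `get_embeddings_batched` is a hypothetical helper built on `get_embeddings` above:

```python
def batched(items: list, size: int) -> list[list]:
    """Split a list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def get_embeddings_batched(texts: list[str], batch_size: int = 100) -> list[list[float]]:
    """Embed long documents one chunk at a time, preserving order."""
    embeddings: list[list[float]] = []
    for chunk in batched(texts, batch_size):
        embeddings.extend(get_embeddings(chunk))  # get_embeddings defined above
    return embeddings
```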


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Compute cosine similarity between two vectors."""
    a_arr, b_arr = np.array(a), np.array(b)
    return float(
        np.dot(a_arr, b_arr) /
        (np.linalg.norm(a_arr) * np.linalg.norm(b_arr) + 1e-10)
    )


def align_sections(
    old_sections: list[DocumentSection],
    new_sections: list[DocumentSection],
    threshold: float = 0.75,
) -> list[dict]:
    """Align sections between old and new document versions."""
    old_texts = [s.text for s in old_sections]
    new_texts = [s.text for s in new_sections]

    old_embeds = get_embeddings(old_texts)
    new_embeds = get_embeddings(new_texts)

    alignments = []
    used_new = set()

    for i, old_embed in enumerate(old_embeds):
        best_score = 0.0
        best_j = -1

        for j, new_embed in enumerate(new_embeds):
            if j in used_new:
                continue
            score = cosine_similarity(old_embed, new_embed)
            if score > best_score:
                best_score = score
                best_j = j

        if best_score >= threshold:
            alignments.append({
                "old": old_sections[i],
                "new": new_sections[best_j],
                "similarity": best_score,
                "status": "modified" if best_score < 0.98 else "unchanged",
            })
            used_new.add(best_j)
        else:
            alignments.append({
                "old": old_sections[i],
                "new": None,
                "similarity": 0.0,
                "status": "deleted",
            })

    # Find sections only in the new version
    for j, section in enumerate(new_sections):
        if j not in used_new:
            alignments.append({
                "old": None,
                "new": section,
                "similarity": 0.0,
                "status": "added",
            })

    return alignments
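The greedy matching above is easy to sanity-check without calling the embeddings API. Here is the same loop run on tiny hand-made 2-D vectors standing in for real embeddings (a sketch; real embeddings have ~1,500 dimensions):

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-10))

old_embeds = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
new_embeds = [np.array([0.8, 0.3]), np.array([0.1, 0.9]), np.array([0.5, 0.5])]

used, statuses = set(), []
for old in old_embeds:
    # Pick the best unused new-section match for each old section
    best_score, best_j = max(
        (cosine(old, n), j) for j, n in enumerate(new_embeds) if j not in used
    )
    if best_score >= 0.75:
        used.add(best_j)
        statuses.append("modified" if best_score < 0.98 else "unchanged")
    else:
        statuses.append("deleted")
statuses += ["added"] * (len(new_embeds) - len(used))

print(statuses)  # first old section reworded, second kept, one new section
```

Note that greedy matching can mis-pair sections when a document is heavily reorganized; an optimal assignment (for example, `scipy.optimize.linear_sum_assignment` over the similarity matrix) is a sturdier choice in that case.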

Change Classification

Not all changes are equal. Distinguish between cosmetic rewording and substantive changes:

from enum import Enum


class ChangeType(Enum):
    COSMETIC = "cosmetic"       # Rewording without meaning change
    SUBSTANTIVE = "substantive"  # Meaning or obligation changed
    STRUCTURAL = "structural"    # Section moved or reorganized
    ADDITION = "addition"
    DELETION = "deletion"


def classify_change(alignment: dict) -> dict:
    """Classify the type and severity of a detected change."""
    if alignment["status"] == "added":
        return {**alignment, "change_type": ChangeType.ADDITION, "severity": "high"}
    if alignment["status"] == "deleted":
        return {**alignment, "change_type": ChangeType.DELETION, "severity": "high"}
    if alignment["status"] == "unchanged":
        return {**alignment, "change_type": None, "severity": "none"}

    # For modified sections, use LLM to classify
    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": (
                "Compare these two text versions and classify the change as "
                "'cosmetic' (rewording without meaning change), "
                "'substantive' (meaning, obligation, or number changed), "
                "or 'structural' (reorganized but same content). "
                "Respond with just the classification and a one-sentence explanation."
            )},
            {"role": "user", "content": (
                f"OLD: {alignment['old'].text}\n\n"
                f"NEW: {alignment['new'].text}"
            )},
        ],
    )

    classification = response.choices[0].message.content.lower()
    if "substantive" in classification:
        change_type = ChangeType.SUBSTANTIVE
        severity = "high"
    elif "structural" in classification:
        change_type = ChangeType.STRUCTURAL
        severity = "medium"
    else:
        change_type = ChangeType.COSMETIC
        severity = "low"

    return {
        **alignment,
        "change_type": change_type,
        "severity": severity,
        "explanation": response.choices[0].message.content,
    }
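For the report, reviewers usually want the riskiest changes first. A small, hypothetical ranking helper over the dicts returned by `classify_change`:

```python
SEVERITY_ORDER = {"high": 0, "medium": 1, "low": 2, "none": 3}

def rank_changes(classified_changes: list[dict]) -> list[dict]:
    """Order changes so high-severity items surface first (stable sort)."""
    return sorted(
        classified_changes,
        key=lambda c: SEVERITY_ORDER.get(c.get("severity", "none"), 3),
    )

changes = [
    {"status": "modified", "severity": "low"},
    {"status": "added", "severity": "high"},
    {"status": "modified", "severity": "medium"},
]
print([c["severity"] for c in rank_changes(changes)])  # high first
```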

Generating the Redline Output

Produce an HTML redline document showing additions in green and deletions in red:

import difflib
import html


def generate_redline_html(
    classified_changes: list[dict],
) -> str:
    """Generate an HTML redline document from classified changes."""
    html_parts = [
        "<html><head><style>",
        ".added { background: #d4edda; color: #155724; }",
        ".deleted { background: #f8d7da; color: #721c24; text-decoration: line-through; }",
        ".modified { background: #fff3cd; color: #856404; }",
        ".section { margin: 16px 0; padding: 12px; border-left: 4px solid #ccc; }",
        ".severity-high { border-left-color: #dc3545; }",
        ".severity-medium { border-left-color: #ffc107; }",
        ".severity-low { border-left-color: #28a745; }",
        "</style></head><body>",
    ]

    for change in classified_changes:
        severity = change.get("severity", "none")

        if change["status"] == "unchanged":
            text = html.escape(change["old"].text)
            html_parts.append(f'<div class="section">{text}</div>')
        elif change["status"] == "added":
            text = html.escape(change["new"].text)
            html_parts.append(
                f'<div class="section severity-{severity}">'
                f'<span class="added">{text}</span></div>'
            )
        elif change["status"] == "deleted":
            text = html.escape(change["old"].text)
            html_parts.append(
                f'<div class="section severity-{severity}">'
                f'<span class="deleted">{text}</span></div>'
            )
        elif change["status"] == "modified":
            # Escape before diffing so document text cannot inject markup
            old_words = html.escape(change["old"].text).split()
            new_words = html.escape(change["new"].text).split()
            diff = difflib.ndiff(old_words, new_words)

            diff_html = []
            for token in diff:
                if token.startswith("- "):
                    diff_html.append(f'<span class="deleted">{token[2:]}</span>')
                elif token.startswith("+ "):
                    diff_html.append(f'<span class="added">{token[2:]}</span>')
                elif token.startswith("  "):
                    diff_html.append(token[2:])

            html_parts.append(
                f'<div class="section severity-{severity}">'
                f'{" ".join(diff_html)}</div>'
            )

    html_parts.append("</body></html>")
    return "\n".join(html_parts)
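The same `ndiff` walk works for plain-text output when HTML is not wanted -- for example, pasting into an email. This sketch marks changes with CriticMarkup-style braces, a formatting choice rather than anything the pipeline depends on:

```python
import difflib

def redline_text(old: str, new: str) -> str:
    """Word-level redline as plain text: {--deleted--} and {++added++}."""
    out = []
    for token in difflib.ndiff(old.split(), new.split()):
        if token.startswith("- "):
            out.append("{--" + token[2:] + "--}")
        elif token.startswith("+ "):
            out.append("{++" + token[2:] + "++}")
        elif token.startswith("  "):
            out.append(token[2:])
        # "? " hint lines emitted by ndiff are skipped
    return " ".join(out)

print(redline_text("The vendor shall deliver goods",
                   "The vendor must deliver goods"))
# The vendor {--shall--} {++must++} deliver goods
```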

Change Summary Generation

Produce a high-level summary for reviewers who need the highlights without reading every redline:

def generate_change_summary(
    classified_changes: list[dict],
) -> str:
    """Generate a human-readable summary of all changes."""
    substantive = [c for c in classified_changes if c.get("change_type") == ChangeType.SUBSTANTIVE]
    additions = [c for c in classified_changes if c["status"] == "added"]
    deletions = [c for c in classified_changes if c["status"] == "deleted"]

    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": (
                "Summarize the key changes between two document versions. "
                "Focus on substantive changes that affect meaning, "
                "obligations, or numbers. Be concise and precise."
            )},
            {"role": "user", "content": (
                f"Substantive changes ({len(substantive)}):\n" +
                "\n".join(c.get("explanation", "") for c in substantive) +
                f"\n\nNew sections added: {len(additions)}" +
                f"\nSections removed: {len(deletions)}"
            )},
        ],
    )

    return response.choices[0].message.content

FAQ

How does semantic comparison differ from traditional diff tools?

Traditional diff tools operate at the character or line level — they see every reworded sentence as a delete-then-add. Semantic comparison uses embeddings to understand meaning, so it can recognize that "The vendor shall deliver goods within 30 days" and "Goods must be delivered by the vendor within thirty days" are the same clause with cosmetic rewording, not a deletion and addition.

Can this handle comparing documents in different formats (Word vs PDF)?

Yes, but you need format-specific extractors. Use python-docx for Word files and pdfplumber for PDFs. The key insight is that comparison happens at the extracted text level, not the file format level. Extract sections from both documents into the same DocumentSection structure, then the rest of the pipeline works identically regardless of source format.

How do you handle clause renumbering?

Clause renumbering is a common trap. When a new clause is inserted, all subsequent numbers shift, making every following clause appear "changed." Handle this by stripping clause numbers before comparison and treating numbering as metadata. After alignment, regenerate the numbering analysis as a separate section of the change report.


