Building a Document Intelligence Agent: OCR, Layout Analysis, and Information Extraction
Learn how to build an end-to-end document intelligence agent that combines Tesseract OCR, layout detection, zone classification, and structured information extraction to turn scanned documents into structured data automatically.
Why Document Intelligence Needs More Than OCR
Traditional OCR converts pixels to characters, but that is only the first step. Real document intelligence requires understanding the spatial layout — headers, paragraphs, tables, footnotes — and extracting structured information that downstream systems can consume. A document intelligence agent orchestrates these stages, deciding which regions need deeper analysis and which extraction strategy fits each zone.
The core pipeline follows four stages: image preprocessing, OCR with confidence scoring, layout analysis to identify semantic zones, and structured extraction that maps content to fields your application expects.
Setting Up the Foundation
Install the necessary libraries for the full pipeline:
pip install pytesseract Pillow layoutparser opencv-python-headless pydantic openai
Make sure Tesseract is installed on your system:
# Ubuntu/Debian
sudo apt-get install tesseract-ocr tesseract-ocr-eng
# macOS
brew install tesseract
Building the Document Preprocessing Layer
Raw scans often arrive skewed, poorly lit, or at inconsistent resolutions. Preprocessing normalizes images before OCR:
import cv2
import numpy as np
from PIL import Image
def preprocess_document(image_path: str) -> np.ndarray:
    """Prepare a document image for OCR and layout analysis."""
    img = cv2.imread(image_path)
    if img is None:
        raise FileNotFoundError(f"Could not read image: {image_path}")
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

    # Deskew: text pixels are dark on a light background, so invert and
    # threshold before fitting the minimum-area rectangle. Running
    # np.where(gray > 0) directly would select the background instead,
    # making the detected angle meaningless.
    inverted = cv2.bitwise_not(gray)
    _, mask = cv2.threshold(
        inverted, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU
    )
    coords = np.column_stack(np.where(mask > 0))
    angle = cv2.minAreaRect(coords)[-1]
    if angle < -45:
        angle = 90 + angle
    if abs(angle) > 0.5:
        h, w = gray.shape
        center = (w // 2, h // 2)
        matrix = cv2.getRotationMatrix2D(center, angle, 1.0)
        gray = cv2.warpAffine(
            gray, matrix, (w, h),
            flags=cv2.INTER_CUBIC,
            borderMode=cv2.BORDER_REPLICATE
        )

    # Adaptive thresholding handles uneven lighting across the scan
    binary = cv2.adaptiveThreshold(
        gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
        cv2.THRESH_BINARY, 11, 2
    )
    # Median blur removes salt-and-pepper scanner noise
    denoised = cv2.medianBlur(binary, 3)
    return denoised
OCR with Confidence Scoring
Tesseract provides word-level confidence scores through its detailed output mode. This lets the agent flag low-confidence regions for human review:
import pytesseract
from dataclasses import dataclass

@dataclass
class OCRResult:
    text: str
    confidence: float
    bbox: tuple  # (x, y, width, height)
    block_num: int
    line_num: int

def extract_with_confidence(image: np.ndarray) -> list[OCRResult]:
    """Run OCR and return word-level results with confidence."""
    data = pytesseract.image_to_data(
        image, output_type=pytesseract.Output.DICT
    )
    results = []
    for i in range(len(data["text"])):
        text = data["text"][i].strip()
        # conf arrives as an int or a float string depending on the
        # Tesseract version; -1 marks non-word boxes
        conf = int(float(data["conf"][i]))
        if text and conf > 0:
            results.append(OCRResult(
                text=text,
                confidence=conf / 100.0,
                bbox=(
                    data["left"][i], data["top"][i],
                    data["width"][i], data["height"][i]
                ),
                block_num=data["block_num"][i],
                line_num=data["line_num"][i],
            ))
    return results
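The block_num and line_num fields make it easy to reassemble word-level results into reading-order lines, which downstream extraction usually wants. A small helper along these lines (OCRResult is redefined here only so the snippet runs standalone):

```python
from dataclasses import dataclass
from itertools import groupby

@dataclass
class OCRResult:  # same shape as the dataclass defined above
    text: str
    confidence: float
    bbox: tuple  # (x, y, width, height)
    block_num: int
    line_num: int

def group_into_lines(results: list[OCRResult]) -> list[str]:
    """Join word-level results into lines using Tesseract's block/line ids."""
    key = lambda r: (r.block_num, r.line_num)
    lines = []
    for _, words in groupby(sorted(results, key=key), key=key):
        # Sort left-to-right within each line before joining
        ordered = sorted(words, key=lambda r: r.bbox[0])
        lines.append(" ".join(w.text for w in ordered))
    return lines
```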
Zone Classification with Layout Analysis
Layout analysis segments the page into semantic regions — title, body text, table, figure, footer — so the agent can apply the right extraction strategy per zone:
from enum import Enum

class ZoneType(Enum):
    HEADER = "header"
    BODY = "body"
    TABLE = "table"
    FOOTER = "footer"
    SIDEBAR = "sidebar"

def classify_zones(
    ocr_results: list[OCRResult],
    page_height: int
) -> dict[ZoneType, list[OCRResult]]:
    """Classify OCR results into semantic zones by position.

    This positional heuristic only fills HEADER, BODY, and FOOTER;
    detecting TABLE and SIDEBAR zones needs a layout model (see the FAQ).
    """
    zones: dict[ZoneType, list[OCRResult]] = {z: [] for z in ZoneType}
    for result in ocr_results:
        y_ratio = result.bbox[1] / page_height
        if y_ratio < 0.1:
            zones[ZoneType.HEADER].append(result)
        elif y_ratio > 0.9:
            zones[ZoneType.FOOTER].append(result)
        else:
            zones[ZoneType.BODY].append(result)
    return zones
The Agent Orchestrator
The agent ties all stages together, using an LLM to interpret extracted content and produce structured output:
from pydantic import BaseModel
from openai import OpenAI

class DocumentFields(BaseModel):
    title: str | None = None
    date: str | None = None
    author: str | None = None
    summary: str | None = None
    key_entities: list[str] = []
    confidence_score: float = 0.0

def run_document_agent(image_path: str) -> DocumentFields:
    """Full pipeline: preprocess, OCR, classify, extract."""
    preprocessed = preprocess_document(image_path)
    ocr_results = extract_with_confidence(preprocessed)
    h, _ = preprocessed.shape[:2]
    zones = classify_zones(ocr_results, h)

    header_text = " ".join(r.text for r in zones[ZoneType.HEADER])
    body_text = " ".join(r.text for r in zones[ZoneType.BODY])
    avg_conf = float(np.mean([r.confidence for r in ocr_results])) if ocr_results else 0.0

    client = OpenAI()
    response = client.beta.chat.completions.parse(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": (
                "Extract structured fields from this document text. "
                "Return title, date, author, summary, and key entities."
            )},
            {"role": "user", "content": (
                f"HEADER: {header_text}\n\nBODY: {body_text}"
            )},
        ],
        response_format=DocumentFields,
    )
    result = response.choices[0].message.parsed
    result.confidence_score = round(avg_conf, 3)
    return result
Handling Low-Confidence Regions
A production agent should flag uncertain results rather than silently producing bad data:
def identify_review_regions(
    ocr_results: list[OCRResult],
    threshold: float = 0.6
) -> list[dict]:
    """Flag regions where OCR confidence is below threshold."""
    flagged = []
    for result in ocr_results:
        if result.confidence < threshold:
            flagged.append({
                "text": result.text,
                "confidence": result.confidence,
                "bbox": result.bbox,
                "suggestion": "Route to human reviewer",
            })
    return flagged
This human-in-the-loop pattern is essential for any document processing system where accuracy is critical, such as legal or financial documents.
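When several adjacent words are flagged, it is usually better to hand the reviewer one image crop than a dozen word-level snippets. A sketch of a bounding-box union helper (boxes are the (x, y, width, height) tuples used above; the padding default is an assumption):

```python
def union_bbox(boxes: list[tuple], pad: int = 5) -> tuple:
    """Return one (x, y, w, h) box covering every input box, plus padding."""
    x1 = min(b[0] for b in boxes)
    y1 = min(b[1] for b in boxes)
    x2 = max(b[0] + b[2] for b in boxes)
    y2 = max(b[1] + b[3] for b in boxes)
    # Clamp the padded origin at the image edge
    x, y = max(x1 - pad, 0), max(y1 - pad, 0)
    return (x, y, x2 + pad - x, y2 + pad - y)
```

Cropping `preprocessed[y:y + h, x:x + w]` with the merged box then gives the reviewer the full phrase in context.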
FAQ
How accurate is Tesseract compared to cloud OCR services?
Tesseract v5 achieves 95-98% accuracy on clean printed text but drops to 70-85% on degraded scans, handwriting, or unusual fonts. Cloud services like Google Document AI and AWS Textract often outperform it on difficult inputs because they use deep learning models trained on massive datasets. However, Tesseract is free, runs locally, and handles most standard business documents well.
Can layout analysis work on multi-column documents?
Yes, but it requires more sophisticated approaches than simple Y-coordinate thresholding. Libraries like LayoutParser use deep learning models trained on document layout datasets (PubLayNet, DocBank) to detect columns, tables, and figures regardless of their position. For production systems, combining LayoutParser with Tesseract yields much better results on complex layouts.
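If you want a rough multi-column split without pulling in a detection model, clustering word boxes by horizontal gaps gets you surprisingly far on clean two-column scans. A naive sketch (the 50-pixel gap threshold is an assumption; real pages need tuning or a proper layout model):

```python
def split_into_columns(
    word_boxes: list[tuple], gap_threshold: int = 50
) -> list[list[tuple]]:
    """Cluster (x, y, w, h) word boxes into columns by x-axis gaps."""
    if not word_boxes:
        return []
    ordered = sorted(word_boxes, key=lambda b: b[0])
    columns = [[ordered[0]]]
    right_edge = ordered[0][0] + ordered[0][2]
    for box in ordered[1:]:
        if box[0] - right_edge > gap_threshold:
            columns.append([box])  # gap is wide: start a new column
        else:
            columns[-1].append(box)
        right_edge = max(right_edge, box[0] + box[2])
    return columns
```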
How should I handle documents in multiple languages?
Tesseract supports over 100 languages. Install the relevant language packs and either specify the language explicitly or use a language detection step first. For mixed-language documents, run OCR multiple times with different language hints and merge results by comparing confidence scores per region.