Building a Document Intelligence Agent: OCR, Layout Analysis, and Information Extraction
Learn how to build an end-to-end document intelligence agent that combines Tesseract OCR, layout detection, zone classification, and structured information extraction to turn scanned documents into structured data automatically.
Why Document Intelligence Needs More Than OCR
Traditional OCR converts pixels to characters, but that is only the first step. Real document intelligence requires understanding the spatial layout — headers, paragraphs, tables, footnotes — and extracting structured information that downstream systems can consume. A document intelligence agent orchestrates these stages, deciding which regions need deeper analysis and which extraction strategy fits each zone.
The core pipeline follows four stages: image preprocessing, OCR with confidence scoring, layout analysis to identify semantic zones, and structured extraction that maps content to fields your application expects.
Setting Up the Foundation
Install the necessary libraries for the full pipeline:
pip install pytesseract Pillow layoutparser opencv-python-headless pydantic openai
Make sure Tesseract is installed on your system:
# Ubuntu/Debian
sudo apt-get install tesseract-ocr tesseract-ocr-eng
# macOS
brew install tesseract
Building the Document Preprocessing Layer
Raw scans often arrive skewed, poorly lit, or at inconsistent resolutions. Preprocessing normalizes images before OCR:
import cv2
import numpy as np
from PIL import Image
def preprocess_document(image_path: str) -> np.ndarray:
    """Prepare a document image for OCR and layout analysis."""
    img = cv2.imread(image_path)
    if img is None:
        raise FileNotFoundError(f"Could not read image: {image_path}")
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

    # Deskew: text pixels are dark on a light background, so invert and
    # threshold before fitting the minimum-area rectangle. Running
    # np.where(gray > 0) directly would select the background instead,
    # making the detected angle meaningless.
    inverted = cv2.bitwise_not(gray)
    _, mask = cv2.threshold(
        inverted, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU
    )
    coords = np.column_stack(np.where(mask > 0))
    angle = cv2.minAreaRect(coords)[-1]
    if angle < -45:
        angle = 90 + angle
    if abs(angle) > 0.5:
        h, w = gray.shape
        center = (w // 2, h // 2)
        matrix = cv2.getRotationMatrix2D(center, angle, 1.0)
        gray = cv2.warpAffine(
            gray, matrix, (w, h),
            flags=cv2.INTER_CUBIC,
            borderMode=cv2.BORDER_REPLICATE
        )

    # Adaptive thresholding handles uneven lighting across the scan
    binary = cv2.adaptiveThreshold(
        gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
        cv2.THRESH_BINARY, 11, 2
    )
    # Median blur removes salt-and-pepper scanner noise
    denoised = cv2.medianBlur(binary, 3)
    return denoised
OCR with Confidence Scoring
Tesseract provides word-level confidence scores through its detailed output mode. This lets the agent flag low-confidence regions for human review:
import pytesseract
from dataclasses import dataclass

@dataclass
class OCRResult:
    text: str
    confidence: float
    bbox: tuple  # (x, y, width, height)
    block_num: int
    line_num: int

def extract_with_confidence(image: np.ndarray) -> list[OCRResult]:
    """Run OCR and return word-level results with confidence."""
    data = pytesseract.image_to_data(
        image, output_type=pytesseract.Output.DICT
    )
    results = []
    for i in range(len(data["text"])):
        text = data["text"][i].strip()
        # conf arrives as an int or a float string depending on the
        # Tesseract version; -1 marks non-word boxes
        conf = int(float(data["conf"][i]))
        if text and conf > 0:
            results.append(OCRResult(
                text=text,
                confidence=conf / 100.0,
                bbox=(
                    data["left"][i], data["top"][i],
                    data["width"][i], data["height"][i]
                ),
                block_num=data["block_num"][i],
                line_num=data["line_num"][i],
            ))
    return results
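The block_num and line_num fields make it easy to reassemble word-level results into reading-order lines, which downstream extraction usually wants. A small helper along these lines (OCRResult is redefined here only so the snippet runs standalone):

```python
from dataclasses import dataclass
from itertools import groupby

@dataclass
class OCRResult:  # same shape as the dataclass defined above
    text: str
    confidence: float
    bbox: tuple  # (x, y, width, height)
    block_num: int
    line_num: int

def group_into_lines(results: list[OCRResult]) -> list[str]:
    """Join word-level results into lines using Tesseract's block/line ids."""
    key = lambda r: (r.block_num, r.line_num)
    lines = []
    for _, words in groupby(sorted(results, key=key), key=key):
        # Sort left-to-right within each line before joining
        ordered = sorted(words, key=lambda r: r.bbox[0])
        lines.append(" ".join(w.text for w in ordered))
    return lines
```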
Zone Classification with Layout Analysis
Layout analysis segments the page into semantic regions — title, body text, table, figure, footer — so the agent can apply the right extraction strategy per zone:
from enum import Enum

class ZoneType(Enum):
    HEADER = "header"
    BODY = "body"
    TABLE = "table"
    FOOTER = "footer"
    SIDEBAR = "sidebar"

def classify_zones(
    ocr_results: list[OCRResult],
    page_height: int
) -> dict[ZoneType, list[OCRResult]]:
    """Classify OCR results into semantic zones by position.

    This positional heuristic only fills HEADER, BODY, and FOOTER;
    detecting TABLE and SIDEBAR zones needs a layout model (see the FAQ).
    """
    zones: dict[ZoneType, list[OCRResult]] = {z: [] for z in ZoneType}
    for result in ocr_results:
        y_ratio = result.bbox[1] / page_height
        if y_ratio < 0.1:
            zones[ZoneType.HEADER].append(result)
        elif y_ratio > 0.9:
            zones[ZoneType.FOOTER].append(result)
        else:
            zones[ZoneType.BODY].append(result)
    return zones
The Agent Orchestrator
The agent ties all stages together, using an LLM to interpret extracted content and produce structured output:
from pydantic import BaseModel
from openai import OpenAI

class DocumentFields(BaseModel):
    title: str | None = None
    date: str | None = None
    author: str | None = None
    summary: str | None = None
    key_entities: list[str] = []
    confidence_score: float = 0.0

def run_document_agent(image_path: str) -> DocumentFields:
    """Full pipeline: preprocess, OCR, classify, extract."""
    preprocessed = preprocess_document(image_path)
    ocr_results = extract_with_confidence(preprocessed)
    h, _ = preprocessed.shape[:2]
    zones = classify_zones(ocr_results, h)

    header_text = " ".join(r.text for r in zones[ZoneType.HEADER])
    body_text = " ".join(r.text for r in zones[ZoneType.BODY])
    avg_conf = float(np.mean([r.confidence for r in ocr_results])) if ocr_results else 0.0

    client = OpenAI()
    response = client.beta.chat.completions.parse(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": (
                "Extract structured fields from this document text. "
                "Return title, date, author, summary, and key entities."
            )},
            {"role": "user", "content": (
                f"HEADER: {header_text}\n\nBODY: {body_text}"
            )},
        ],
        response_format=DocumentFields,
    )
    result = response.choices[0].message.parsed
    result.confidence_score = round(avg_conf, 3)
    return result
Handling Low-Confidence Regions
A production agent should flag uncertain results rather than silently producing bad data:
def identify_review_regions(
    ocr_results: list[OCRResult],
    threshold: float = 0.6
) -> list[dict]:
    """Flag regions where OCR confidence is below threshold."""
    flagged = []
    for result in ocr_results:
        if result.confidence < threshold:
            flagged.append({
                "text": result.text,
                "confidence": result.confidence,
                "bbox": result.bbox,
                "suggestion": "Route to human reviewer",
            })
    return flagged
This human-in-the-loop pattern is essential for any document processing system where accuracy is critical, such as legal or financial documents.
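When several adjacent words are flagged, it is usually better to hand the reviewer one image crop than a dozen word-level snippets. A sketch of a bounding-box union helper (boxes are the (x, y, width, height) tuples used above; the padding default is an assumption):

```python
def union_bbox(boxes: list[tuple], pad: int = 5) -> tuple:
    """Return one (x, y, w, h) box covering every input box, plus padding."""
    x1 = min(b[0] for b in boxes)
    y1 = min(b[1] for b in boxes)
    x2 = max(b[0] + b[2] for b in boxes)
    y2 = max(b[1] + b[3] for b in boxes)
    # Clamp the padded origin at the image edge
    x, y = max(x1 - pad, 0), max(y1 - pad, 0)
    return (x, y, x2 + pad - x, y2 + pad - y)
```

Cropping `preprocessed[y:y + h, x:x + w]` with the merged box then gives the reviewer the full phrase in context.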
FAQ
How accurate is Tesseract compared to cloud OCR services?
Tesseract v5 achieves 95-98% accuracy on clean printed text but drops to 70-85% on degraded scans, handwriting, or unusual fonts. Cloud services like Google Document AI and AWS Textract often outperform it on difficult inputs because they use deep learning models trained on massive datasets. However, Tesseract is free, runs locally, and handles most standard business documents well.
Can layout analysis work on multi-column documents?
Yes, but it requires more sophisticated approaches than simple Y-coordinate thresholding. Libraries like LayoutParser use deep learning models trained on document layout datasets (PubLayNet, DocBank) to detect columns, tables, and figures regardless of their position. For production systems, combining LayoutParser with Tesseract yields much better results on complex layouts.
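If you want a rough multi-column split without pulling in a detection model, clustering word boxes by horizontal gaps gets you surprisingly far on clean two-column scans. A naive sketch (the 50-pixel gap threshold is an assumption; real pages need tuning or a proper layout model):

```python
def split_into_columns(
    word_boxes: list[tuple], gap_threshold: int = 50
) -> list[list[tuple]]:
    """Cluster (x, y, w, h) word boxes into columns by x-axis gaps."""
    if not word_boxes:
        return []
    ordered = sorted(word_boxes, key=lambda b: b[0])
    columns = [[ordered[0]]]
    right_edge = ordered[0][0] + ordered[0][2]
    for box in ordered[1:]:
        if box[0] - right_edge > gap_threshold:
            columns.append([box])  # gap is wide: start a new column
        else:
            columns[-1].append(box)
        right_edge = max(right_edge, box[0] + box[2])
    return columns
```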
How should I handle documents in multiple languages?
Tesseract supports over 100 languages. Install the relevant language packs and either specify the language explicitly or use a language detection step first. For mixed-language documents, run OCR multiple times with different language hints and merge results by comparing confidence scores per region.