Handwriting Recognition with AI Agents: Processing Handwritten Forms and Notes
Build an AI agent pipeline for handwriting recognition that processes handwritten forms and notes, extracts field values with confidence scoring, and routes low-confidence results to human reviewers for correction.
The Handwriting Problem
Despite decades of digitization, handwritten documents remain everywhere: patient intake forms, field inspection reports, warehouse inventory sheets, insurance claims, and school exams. These documents contain critical information locked in a format that traditional OCR struggles with.
Handwriting recognition (HTR — Handwritten Text Recognition) differs from printed text OCR in fundamental ways. Characters are connected, spacing is irregular, the same person writes the same letter differently depending on context, and individual writing styles vary enormously. Modern deep learning approaches have made HTR dramatically more capable, but building a production pipeline still requires careful engineering around confidence scoring, field extraction, and human review routing.
Setting Up the HTR Pipeline
pip install pytesseract opencv-python-headless Pillow torch torchvision transformers openai pydantic
Preprocessing Handwritten Documents
Handwritten forms need more aggressive preprocessing than printed documents:
import cv2
import numpy as np

def preprocess_handwriting(image_path: str) -> np.ndarray:
    """Preprocess a handwritten document for recognition."""
    img = cv2.imread(image_path)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    # Adaptive binarization works better for variable ink density.
    # THRESH_BINARY_INV makes ink white on black, which the morphological
    # line detection below and the projection profiles later both expect.
    binary = cv2.adaptiveThreshold(
        gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
        cv2.THRESH_BINARY_INV, 21, 10,
    )
    # Remove ruled lines (common in forms): opening with a wide, flat
    # kernel keeps only long horizontal strokes
    horizontal_kernel = cv2.getStructuringElement(
        cv2.MORPH_RECT, (40, 1)
    )
    detected_lines = cv2.morphologyEx(
        binary, cv2.MORPH_OPEN, horizontal_kernel
    )
    # Subtract the detected lines so only handwriting remains
    clean = cv2.subtract(binary, detected_lines)
    # Remove small noise blobs
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (2, 2))
    cleaned = cv2.morphologyEx(clean, cv2.MORPH_OPEN, kernel)
    return cleaned
Line and Word Segmentation
Before recognition, segment the document into individual lines and words:
from dataclasses import dataclass

@dataclass
class TextLine:
    image: np.ndarray
    bbox: tuple  # (x, y, w, h)
    line_number: int

@dataclass
class Word:
    image: np.ndarray
    bbox: tuple
    line_number: int
    word_index: int

def segment_lines(binary_image: np.ndarray) -> list[TextLine]:
    """Segment handwritten text into individual lines."""
    # Horizontal projection to find line boundaries
    h_projection = np.sum(binary_image, axis=1)
    lines = []
    in_line = False
    start = 0
    line_num = 0
    for y, val in enumerate(h_projection):
        if not in_line and val > 0:
            start = y
            in_line = True
        elif in_line and val == 0:
            if y - start > 10:  # Minimum line height
                line_img = binary_image[start:y, :]
                x_nonzero = np.where(np.sum(line_img, axis=0) > 0)[0]
                if len(x_nonzero) > 0:
                    x_start, x_end = x_nonzero[0], x_nonzero[-1]
                    lines.append(TextLine(
                        image=line_img[:, x_start:x_end + 1],
                        bbox=(x_start, start, x_end - x_start + 1, y - start),
                        line_number=line_num,
                    ))
                    line_num += 1
            in_line = False
    # Handle a final line that touches the bottom edge of the image
    if in_line and len(h_projection) - start > 10:
        line_img = binary_image[start:, :]
        x_nonzero = np.where(np.sum(line_img, axis=0) > 0)[0]
        if len(x_nonzero) > 0:
            lines.append(TextLine(
                image=line_img[:, x_nonzero[0]:x_nonzero[-1] + 1],
                bbox=(x_nonzero[0], start,
                      x_nonzero[-1] - x_nonzero[0] + 1,
                      len(h_projection) - start),
                line_number=line_num,
            ))
    return lines
def segment_words(line: TextLine) -> list[Word]:
    """Segment a text line into individual words."""
    v_projection = np.sum(line.image, axis=0)
    # First pass: measure the gaps between ink runs so the word-gap
    # threshold adapts to this writer's spacing
    gaps = []
    current_gap = 0
    for x, val in enumerate(v_projection):
        if val == 0:
            current_gap += 1
        else:
            if current_gap > 0:
                gaps.append((x - current_gap, current_gap))
            current_gap = 0
    gap_threshold = 15  # Default pixels between words
    if gaps:
        median_gap = np.median([g[1] for g in gaps])
        gap_threshold = max(median_gap * 1.5, 10)

    # Second pass: split only at gaps wider than the threshold
    words = []
    in_word = False
    start = 0
    word_idx = 0

    def emit(end: int):
        nonlocal word_idx
        if end - start > 5:  # Minimum word width
            words.append(Word(
                image=line.image[:, start:end],
                bbox=(line.bbox[0] + start, line.bbox[1],
                      end - start, line.bbox[3]),
                line_number=line.line_number,
                word_index=word_idx,
            ))
            word_idx += 1

    for x, val in enumerate(v_projection):
        if not in_word and val > 0:
            start = x
            in_word = True
        elif in_word and val == 0:
            # Only split if the blank run ahead is wide enough to be a
            # word gap; otherwise it is spacing within the word
            ahead = v_projection[x:]
            next_ink = int(np.argmax(ahead > 0)) if np.any(ahead > 0) else len(ahead)
            if next_ink > gap_threshold:
                emit(x)
                in_word = False
    # Emit the final word, which otherwise has no trailing gap to close it
    if in_word:
        emit(int(np.nonzero(v_projection)[0][-1]) + 1)
    return words
Multi-Engine Recognition with Confidence
Use multiple recognition approaches and compare results for higher accuracy:
import pytesseract
from PIL import Image

@dataclass
class RecognitionResult:
    text: str
    confidence: float
    engine: str

def recognize_with_tesseract(
    word_image: np.ndarray,
) -> RecognitionResult:
    """Recognize handwriting using Tesseract's LSTM engine."""
    pil_img = Image.fromarray(word_image)
    # PSM 8 = treat image as a single word, OEM 1 = LSTM engine
    data = pytesseract.image_to_data(
        pil_img,
        config="--psm 8 --oem 1",
        output_type=pytesseract.Output.DICT,
    )
    pairs = [(t, int(c)) for t, c in zip(data["text"], data["conf"])
             if t.strip() and int(c) > 0]
    text = " ".join(t for t, _ in pairs)
    conf = sum(c for _, c in pairs) / (100.0 * len(pairs)) if pairs else 0.0
    return RecognitionResult(
        text=text, confidence=conf, engine="tesseract"
    )
def recognize_with_vision_llm(
    word_image: np.ndarray,
) -> RecognitionResult:
    """Use a vision LLM for difficult handwriting."""
    import base64
    import io
    from openai import OpenAI

    pil_img = Image.fromarray(word_image)
    buffer = io.BytesIO()
    pil_img.save(buffer, format="PNG")
    b64_image = base64.b64encode(buffer.getvalue()).decode()
    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "user", "content": [
                {"type": "text", "text": (
                    "Read the handwritten text in this image. "
                    "Return ONLY the text, nothing else."
                )},
                {"type": "image_url", "image_url": {
                    "url": f"data:image/png;base64,{b64_image}"
                }},
            ]},
        ],
    )
    return RecognitionResult(
        text=response.choices[0].message.content.strip(),
        confidence=0.85,  # Fixed prior: the API returns no per-word confidence
        engine="gpt-4o-vision",
    )
Confidence-Based Routing
Route results based on confidence to either automated processing or human review:
from enum import Enum

class ReviewDecision(Enum):
    AUTO_ACCEPT = "auto_accept"
    HUMAN_REVIEW = "human_review"
    REJECT = "reject"

def decide_review_route(
    results: list[RecognitionResult],
    high_threshold: float = 0.85,
    low_threshold: float = 0.4,
) -> dict:
    """Decide whether to auto-accept, route for review, or reject."""
    best = max(results, key=lambda r: r.confidence)
    # Check agreement between engines
    texts = [r.text.lower().strip() for r in results if r.text]
    agreement = len(set(texts)) == 1 if texts else False
    if best.confidence >= high_threshold and agreement:
        return {
            "decision": ReviewDecision.AUTO_ACCEPT,
            "text": best.text,
            "confidence": best.confidence,
            "reason": "High confidence with engine agreement",
        }
    elif best.confidence < low_threshold:
        return {
            "decision": ReviewDecision.REJECT,
            "text": best.text,
            "confidence": best.confidence,
            "reason": "Confidence too low for reliable extraction",
        }
    else:
        return {
            "decision": ReviewDecision.HUMAN_REVIEW,
            "text": best.text,
            "confidence": best.confidence,
            "alternatives": [r.text for r in results],
            "reason": "Moderate confidence — needs human verification",
        }
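A result routed to human review is only useful if the reviewer gets the candidate text, the alternatives, and where the field sits on the page. A minimal sketch of a review-queue payload — `ReviewTask` and `to_review_payload` are illustrative names, not tied to any particular queue or UI library:

```python
import json
from dataclasses import dataclass, asdict, field

@dataclass
class ReviewTask:
    """Everything a human reviewer needs to verify one extracted field."""
    field_name: str
    best_text: str
    confidence: float
    alternatives: list = field(default_factory=list)
    bbox: tuple = (0, 0, 0, 0)  # (x, y, w, h) in page coordinates
    reason: str = ""

def to_review_payload(task: ReviewTask) -> str:
    """Serialize a task for a review queue (message broker, web UI, etc.)."""
    return json.dumps(asdict(task))
```

In practice the bbox lets the review UI crop and display the original image region next to the candidate transcriptions, which is what makes correction fast.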
Form Field Extraction
For structured forms, map recognized text to specific fields:
def extract_form_fields(
    image_path: str,
    field_definitions: list[dict],
) -> dict:
    """Extract named fields from a handwritten form."""
    preprocessed = preprocess_handwriting(image_path)
    results = {}
    for field_def in field_definitions:
        x, y, w, h = field_def["bbox"]
        field_image = preprocessed[y:y+h, x:x+w]
        tesseract_result = recognize_with_tesseract(field_image)
        if tesseract_result.confidence < 0.6:
            vision_result = recognize_with_vision_llm(field_image)
            route = decide_review_route([tesseract_result, vision_result])
        else:
            route = decide_review_route([tesseract_result])
        results[field_def["name"]] = {
            "value": route["text"],
            "confidence": route["confidence"],
            "review_status": route["decision"].value,
        }
    return results
FAQ
How accurate is modern handwriting recognition?
On clean, legible handwriting, modern HTR systems achieve 85-95% character-level accuracy and 75-90% word-level accuracy. Accuracy drops significantly with cursive writing, poor ink quality, or unusual handwriting styles. The key to production reliability is confidence scoring combined with human review for uncertain results rather than trying to achieve perfect automated accuracy.
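Those accuracy figures are conventionally reported as 1 − CER and 1 − WER, both derived from Levenshtein edit distance against a ground-truth transcription. A minimal sketch for benchmarking a recognizer (function names are illustrative):

```python
def levenshtein(a, b) -> int:
    """Edit distance between two sequences (strings, or lists of words)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def cer(predicted: str, reference: str) -> float:
    """Character error rate: edits per reference character."""
    return levenshtein(predicted, reference) / max(len(reference), 1)

def wer(predicted: str, reference: str) -> float:
    """Word error rate: edits per reference word."""
    ref_words = reference.split()
    return levenshtein(predicted.split(), ref_words) / max(len(ref_words), 1)
```

Tracking CER/WER per writer and per field type on a held-out labeled set is also the most direct way to calibrate the confidence thresholds used in the routing step.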
Should I use Tesseract or a deep learning model for handwriting?
Tesseract LSTM (OEM 1) handles neat handwriting reasonably well and runs locally without GPU. For messy or cursive handwriting, deep learning models like TrOCR (from Microsoft) or vision LLMs significantly outperform Tesseract. The best production approach uses Tesseract as a fast first pass and escalates to a vision LLM only when Tesseract confidence is low.
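TrOCR slots into the same pattern as the other engines. A sketch using the Hugging Face `transformers` API with the published `microsoft/trocr-base-handwritten` checkpoint — note the first call downloads the model weights, and a production version would load the processor and model once rather than per call:

```python
from PIL import Image
from transformers import TrOCRProcessor, VisionEncoderDecoderModel

def recognize_with_trocr(
    word_image,
    model_name: str = "microsoft/trocr-base-handwritten",
) -> str:
    """Recognize a single handwritten word/line image with TrOCR."""
    # Loading here keeps the sketch self-contained; cache these in production
    processor = TrOCRProcessor.from_pretrained(model_name)
    model = VisionEncoderDecoderModel.from_pretrained(model_name)
    pil_img = Image.fromarray(word_image).convert("RGB")
    pixel_values = processor(images=pil_img, return_tensors="pt").pixel_values
    generated_ids = model.generate(pixel_values)
    return processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
```

Unlike Tesseract, the base TrOCR API does not return a per-word confidence, so in a cascade it plays the same role as the vision LLM: a stronger, slower second opinion.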
How do I handle checkboxes and filled circles on handwritten forms?
Checkboxes and radio buttons need a different detection approach than text. Look for the pre-printed checkbox outline using template matching, then analyze the fill level inside the boundary. A filled ratio above 30-40% typically indicates a checked box. For ambiguous cases, use the same human review routing as low-confidence text.
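Once the region is binarized (ink as nonzero, as in the preprocessing step above), the fill-level check reduces to counting ink pixels inside the box. A sketch with numpy — the 0.35 threshold and the 2-pixel border inset (which discounts the printed outline) are assumptions to tune per form:

```python
import numpy as np

def checkbox_fill_ratio(box_region: np.ndarray, border: int = 2) -> float:
    """Fraction of ink (nonzero) pixels inside the box interior."""
    # Inset past the printed outline so it doesn't count as ink
    inner = box_region[border:-border, border:-border] if border > 0 else box_region
    if inner.size == 0:
        return 0.0
    return float(np.count_nonzero(inner)) / inner.size

def is_checked(box_region: np.ndarray, threshold: float = 0.35) -> bool:
    """A box is considered checked above the assumed fill-ratio threshold."""
    return checkbox_fill_ratio(box_region) >= threshold
```

Ratios that land near the threshold are exactly the ambiguous cases worth sending through the same `HUMAN_REVIEW` route as low-confidence text.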
CallSphere Team
Expert insights on AI voice agents and customer communication automation.