Handwriting Recognition with AI Agents: Processing Handwritten Forms and Notes
Build an AI agent pipeline for handwriting recognition that processes handwritten forms and notes, extracts field values with confidence scoring, and routes low-confidence results to human reviewers for correction.
The Handwriting Problem
Despite decades of digitization, handwritten documents remain everywhere: patient intake forms, field inspection reports, warehouse inventory sheets, insurance claims, and school exams. These documents contain critical information locked in a format that traditional OCR struggles with.
Handwriting recognition (HTR — Handwritten Text Recognition) differs from printed text OCR in fundamental ways. Characters are connected, spacing is irregular, the same person writes the same letter differently depending on context, and individual writing styles vary enormously. Modern deep learning approaches have made HTR dramatically more capable, but building a production pipeline still requires careful engineering around confidence scoring, field extraction, and human review routing.
Setting Up the HTR Pipeline
pip install pytesseract opencv-python-headless Pillow torch torchvision transformers openai pydantic
Preprocessing Handwritten Documents
Handwritten forms need more aggressive preprocessing than printed documents:
import cv2
import numpy as np

def preprocess_handwriting(image_path: str) -> np.ndarray:
    """Preprocess a handwritten document for recognition."""
    img = cv2.imread(image_path)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    # Adaptive binarization works better for variable ink density.
    # THRESH_BINARY_INV makes ink white on black, which the morphological
    # line detection below and the projection profiles later both expect.
    binary = cv2.adaptiveThreshold(
        gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
        cv2.THRESH_BINARY_INV, 21, 10,
    )
    # Remove ruled lines (common in forms): opening with a wide, flat
    # kernel keeps only long horizontal strokes
    horizontal_kernel = cv2.getStructuringElement(
        cv2.MORPH_RECT, (40, 1)
    )
    detected_lines = cv2.morphologyEx(
        binary, cv2.MORPH_OPEN, horizontal_kernel
    )
    # Subtract the detected lines so only handwriting remains
    clean = cv2.subtract(binary, detected_lines)
    # Remove small noise blobs
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (2, 2))
    cleaned = cv2.morphologyEx(clean, cv2.MORPH_OPEN, kernel)
    return cleaned
Line and Word Segmentation
Before recognition, segment the document into individual lines and words:
from dataclasses import dataclass

@dataclass
class TextLine:
    image: np.ndarray
    bbox: tuple  # (x, y, w, h)
    line_number: int

@dataclass
class Word:
    image: np.ndarray
    bbox: tuple
    line_number: int
    word_index: int

def segment_lines(binary_image: np.ndarray) -> list[TextLine]:
    """Segment handwritten text into individual lines."""
    # Horizontal projection to find line boundaries
    h_projection = np.sum(binary_image, axis=1)
    lines = []
    in_line = False
    start = 0
    line_num = 0
    for y, val in enumerate(h_projection):
        if not in_line and val > 0:
            start = y
            in_line = True
        elif in_line and val == 0:
            if y - start > 10:  # Minimum line height
                line_img = binary_image[start:y, :]
                x_nonzero = np.where(np.sum(line_img, axis=0) > 0)[0]
                if len(x_nonzero) > 0:
                    x_start, x_end = x_nonzero[0], x_nonzero[-1]
                    lines.append(TextLine(
                        image=line_img[:, x_start:x_end + 1],
                        bbox=(x_start, start, x_end - x_start + 1, y - start),
                        line_number=line_num,
                    ))
                    line_num += 1
            in_line = False
    # Handle a final line that touches the bottom edge of the image
    if in_line and len(h_projection) - start > 10:
        line_img = binary_image[start:, :]
        x_nonzero = np.where(np.sum(line_img, axis=0) > 0)[0]
        if len(x_nonzero) > 0:
            lines.append(TextLine(
                image=line_img[:, x_nonzero[0]:x_nonzero[-1] + 1],
                bbox=(x_nonzero[0], start,
                      x_nonzero[-1] - x_nonzero[0] + 1,
                      len(h_projection) - start),
                line_number=line_num,
            ))
    return lines
def segment_words(line: TextLine) -> list[Word]:
    """Segment a text line into individual words."""
    v_projection = np.sum(line.image, axis=0)
    # First pass: measure the gaps between ink runs so the word-gap
    # threshold adapts to this writer's spacing
    gaps = []
    current_gap = 0
    for x, val in enumerate(v_projection):
        if val == 0:
            current_gap += 1
        else:
            if current_gap > 0:
                gaps.append((x - current_gap, current_gap))
            current_gap = 0
    gap_threshold = 15  # Default pixels between words
    if gaps:
        median_gap = np.median([g[1] for g in gaps])
        gap_threshold = max(median_gap * 1.5, 10)

    # Second pass: split only at gaps wider than the threshold
    words = []
    in_word = False
    start = 0
    word_idx = 0

    def emit(end: int):
        nonlocal word_idx
        if end - start > 5:  # Minimum word width
            words.append(Word(
                image=line.image[:, start:end],
                bbox=(line.bbox[0] + start, line.bbox[1],
                      end - start, line.bbox[3]),
                line_number=line.line_number,
                word_index=word_idx,
            ))
            word_idx += 1

    for x, val in enumerate(v_projection):
        if not in_word and val > 0:
            start = x
            in_word = True
        elif in_word and val == 0:
            # Only split if the blank run ahead is wide enough to be a
            # word gap; otherwise it is spacing within the word
            ahead = v_projection[x:]
            next_ink = int(np.argmax(ahead > 0)) if np.any(ahead > 0) else len(ahead)
            if next_ink > gap_threshold:
                emit(x)
                in_word = False
    # Emit the final word, which otherwise has no trailing gap to close it
    if in_word:
        emit(int(np.nonzero(v_projection)[0][-1]) + 1)
    return words
Multi-Engine Recognition with Confidence
Use multiple recognition approaches and compare results for higher accuracy:
import pytesseract
from PIL import Image

@dataclass
class RecognitionResult:
    text: str
    confidence: float
    engine: str

def recognize_with_tesseract(
    word_image: np.ndarray,
) -> RecognitionResult:
    """Recognize handwriting using Tesseract's LSTM engine."""
    pil_img = Image.fromarray(word_image)
    # PSM 8 = treat image as a single word, OEM 1 = LSTM engine
    data = pytesseract.image_to_data(
        pil_img,
        config="--psm 8 --oem 1",
        output_type=pytesseract.Output.DICT,
    )
    pairs = [(t, int(c)) for t, c in zip(data["text"], data["conf"])
             if t.strip() and int(c) > 0]
    text = " ".join(t for t, _ in pairs)
    conf = sum(c for _, c in pairs) / (100.0 * len(pairs)) if pairs else 0.0
    return RecognitionResult(
        text=text, confidence=conf, engine="tesseract"
    )
def recognize_with_vision_llm(
    word_image: np.ndarray,
) -> RecognitionResult:
    """Use a vision LLM for difficult handwriting."""
    import base64
    import io
    from openai import OpenAI

    pil_img = Image.fromarray(word_image)
    buffer = io.BytesIO()
    pil_img.save(buffer, format="PNG")
    b64_image = base64.b64encode(buffer.getvalue()).decode()
    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "user", "content": [
                {"type": "text", "text": (
                    "Read the handwritten text in this image. "
                    "Return ONLY the text, nothing else."
                )},
                {"type": "image_url", "image_url": {
                    "url": f"data:image/png;base64,{b64_image}"
                }},
            ]},
        ],
    )
    return RecognitionResult(
        text=response.choices[0].message.content.strip(),
        confidence=0.85,  # Fixed prior: the API returns no per-word confidence
        engine="gpt-4o-vision",
    )
Confidence-Based Routing
Route results based on confidence to either automated processing or human review:
from enum import Enum

class ReviewDecision(Enum):
    AUTO_ACCEPT = "auto_accept"
    HUMAN_REVIEW = "human_review"
    REJECT = "reject"

def decide_review_route(
    results: list[RecognitionResult],
    high_threshold: float = 0.85,
    low_threshold: float = 0.4,
) -> dict:
    """Decide whether to auto-accept, route for review, or reject."""
    best = max(results, key=lambda r: r.confidence)
    # Check agreement between engines
    texts = [r.text.lower().strip() for r in results if r.text]
    agreement = len(set(texts)) == 1 if texts else False
    if best.confidence >= high_threshold and agreement:
        return {
            "decision": ReviewDecision.AUTO_ACCEPT,
            "text": best.text,
            "confidence": best.confidence,
            "reason": "High confidence with engine agreement",
        }
    elif best.confidence < low_threshold:
        return {
            "decision": ReviewDecision.REJECT,
            "text": best.text,
            "confidence": best.confidence,
            "reason": "Confidence too low for reliable extraction",
        }
    else:
        return {
            "decision": ReviewDecision.HUMAN_REVIEW,
            "text": best.text,
            "confidence": best.confidence,
            "alternatives": [r.text for r in results],
            "reason": "Moderate confidence — needs human verification",
        }
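A result routed to human review is only useful if the reviewer gets the candidate text, the alternatives, and where the field sits on the page. A minimal sketch of a review-queue payload — `ReviewTask` and `to_review_payload` are illustrative names, not tied to any particular queue or UI library:

```python
import json
from dataclasses import dataclass, asdict, field

@dataclass
class ReviewTask:
    """Everything a human reviewer needs to verify one extracted field."""
    field_name: str
    best_text: str
    confidence: float
    alternatives: list = field(default_factory=list)
    bbox: tuple = (0, 0, 0, 0)  # (x, y, w, h) in page coordinates
    reason: str = ""

def to_review_payload(task: ReviewTask) -> str:
    """Serialize a task for a review queue (message broker, web UI, etc.)."""
    return json.dumps(asdict(task))
```

In practice the bbox lets the review UI crop and display the original image region next to the candidate transcriptions, which is what makes correction fast.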
Form Field Extraction
For structured forms, map recognized text to specific fields:
def extract_form_fields(
    image_path: str,
    field_definitions: list[dict],
) -> dict:
    """Extract named fields from a handwritten form."""
    preprocessed = preprocess_handwriting(image_path)
    results = {}
    for field_def in field_definitions:
        x, y, w, h = field_def["bbox"]
        field_image = preprocessed[y:y+h, x:x+w]
        tesseract_result = recognize_with_tesseract(field_image)
        if tesseract_result.confidence < 0.6:
            vision_result = recognize_with_vision_llm(field_image)
            route = decide_review_route([tesseract_result, vision_result])
        else:
            route = decide_review_route([tesseract_result])
        results[field_def["name"]] = {
            "value": route["text"],
            "confidence": route["confidence"],
            "review_status": route["decision"].value,
        }
    return results
FAQ
How accurate is modern handwriting recognition?
On clean, legible handwriting, modern HTR systems achieve 85-95% character-level accuracy and 75-90% word-level accuracy. Accuracy drops significantly with cursive writing, poor ink quality, or unusual handwriting styles. The key to production reliability is confidence scoring combined with human review for uncertain results rather than trying to achieve perfect automated accuracy.
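Those accuracy figures are conventionally reported as 1 − CER and 1 − WER, both derived from Levenshtein edit distance against a ground-truth transcription. A minimal sketch for benchmarking a recognizer (function names are illustrative):

```python
def levenshtein(a, b) -> int:
    """Edit distance between two sequences (strings, or lists of words)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def cer(predicted: str, reference: str) -> float:
    """Character error rate: edits per reference character."""
    return levenshtein(predicted, reference) / max(len(reference), 1)

def wer(predicted: str, reference: str) -> float:
    """Word error rate: edits per reference word."""
    ref_words = reference.split()
    return levenshtein(predicted.split(), ref_words) / max(len(ref_words), 1)
```

Tracking CER/WER per writer and per field type on a held-out labeled set is also the most direct way to calibrate the confidence thresholds used in the routing step.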
Should I use Tesseract or a deep learning model for handwriting?
Tesseract LSTM (OEM 1) handles neat handwriting reasonably well and runs locally without GPU. For messy or cursive handwriting, deep learning models like TrOCR (from Microsoft) or vision LLMs significantly outperform Tesseract. The best production approach uses Tesseract as a fast first pass and escalates to a vision LLM only when Tesseract confidence is low.
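TrOCR slots into the same pattern as the other engines. A sketch using the Hugging Face `transformers` API with the published `microsoft/trocr-base-handwritten` checkpoint — note the first call downloads the model weights, and a production version would load the processor and model once rather than per call:

```python
from PIL import Image
from transformers import TrOCRProcessor, VisionEncoderDecoderModel

def recognize_with_trocr(
    word_image,
    model_name: str = "microsoft/trocr-base-handwritten",
) -> str:
    """Recognize a single handwritten word/line image with TrOCR."""
    # Loading here keeps the sketch self-contained; cache these in production
    processor = TrOCRProcessor.from_pretrained(model_name)
    model = VisionEncoderDecoderModel.from_pretrained(model_name)
    pil_img = Image.fromarray(word_image).convert("RGB")
    pixel_values = processor(images=pil_img, return_tensors="pt").pixel_values
    generated_ids = model.generate(pixel_values)
    return processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
```

Unlike Tesseract, the base TrOCR API does not return a per-word confidence, so in a cascade it plays the same role as the vision LLM: a stronger, slower second opinion.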
How do I handle checkboxes and filled circles on handwritten forms?
Checkboxes and radio buttons need a different detection approach than text. Look for the pre-printed checkbox outline using template matching, then analyze the fill level inside the boundary. A filled ratio above 30-40% typically indicates a checked box. For ambiguous cases, use the same human review routing as low-confidence text.
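Once the region is binarized (ink as nonzero, as in the preprocessing step above), the fill-level check reduces to counting ink pixels inside the box. A sketch with numpy — the 0.35 threshold and the 2-pixel border inset (which discounts the printed outline) are assumptions to tune per form:

```python
import numpy as np

def checkbox_fill_ratio(box_region: np.ndarray, border: int = 2) -> float:
    """Fraction of ink (nonzero) pixels inside the box interior."""
    # Inset past the printed outline so it doesn't count as ink
    inner = box_region[border:-border, border:-border] if border > 0 else box_region
    if inner.size == 0:
        return 0.0
    return float(np.count_nonzero(inner)) / inner.size

def is_checked(box_region: np.ndarray, threshold: float = 0.35) -> bool:
    """A box is considered checked above the assumed fill-ratio threshold."""
    return checkbox_fill_ratio(box_region) >= threshold
```

Ratios that land near the threshold are exactly the ambiguous cases worth sending through the same `HUMAN_REVIEW` route as low-confidence text.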
CallSphere Team
Expert insights on AI voice agents and customer communication automation.