Table Extraction from Images and PDFs with AI: Building Reliable Data Pipelines
Build an AI-powered table extraction pipeline that detects tables in images and PDFs, recognizes cell boundaries, infers structure, and outputs clean CSV data for downstream consumption.
The Table Extraction Challenge
Tables are one of the most information-dense structures in documents, yet they are among the hardest to extract reliably. A table in a PDF might be a true table object with embedded coordinates, a scanned image of a printed table, or text that is visually aligned but has no structural markup at all. Each case requires a different extraction strategy.
A reliable table extraction pipeline needs four stages: detection (finding tables on the page), structure recognition (identifying rows, columns, and cell boundaries), content extraction (reading the text in each cell), and output formatting (producing clean structured data).
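Before diving into each stage, it helps to see the overall shape. The sketch below wires the four stages together as plain callables; every name here is illustrative, not from any specific library:

```python
from dataclasses import dataclass, field

@dataclass
class TableResult:
    """Accumulates what each stage learns about one table."""
    bbox: tuple                                 # region found by detection
    grid: dict = field(default_factory=dict)    # rows/cols from structure recognition
    cells: list = field(default_factory=list)   # text from content extraction

def run_pipeline(page, detect, recognize, read, format_output):
    """Chain the four stages; each argument is one stage as a callable."""
    results = []
    for bbox in detect(page):
        result = TableResult(bbox=bbox)
        result.grid = recognize(page, bbox)
        result.cells = read(page, result.grid)
        results.append(format_output(result.cells))
    return results
```

Each stage only needs the output of the one before it, which makes the stages easy to swap (e.g. native PDF detection vs. image-based detection) without touching the rest.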
Setting Up the Pipeline
```bash
pip install "camelot-py[cv]" tabula-py pdfplumber img2table opencv-python-headless pandas pytesseract
```
For image-based table extraction, you also need Tesseract installed on your system.
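A missing Tesseract binary is the most common setup failure, so a quick availability check (stdlib only) can save debugging time later. The helper name is just a suggestion:

```python
import shutil

def tesseract_available() -> bool:
    """Return True if the tesseract binary is on PATH."""
    return shutil.which("tesseract") is not None
```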
Stage 1: Table Detection
The first step is locating tables within a document. For PDFs with embedded structure, pdfplumber excels:
```python
import pdfplumber
from dataclasses import dataclass


@dataclass
class DetectedTable:
    page_number: int
    bbox: tuple  # (x0, y0, x1, y1)
    row_count: int
    col_count: int
    source: str  # "native" or "image"


def detect_tables_native(pdf_path: str) -> list[DetectedTable]:
    """Detect tables in PDFs with embedded structure."""
    detected = []
    with pdfplumber.open(pdf_path) as pdf:
        for i, page in enumerate(pdf.pages):
            tables = page.find_tables()
            for table in tables:
                rows = table.extract()
                if rows and len(rows) > 1:
                    detected.append(DetectedTable(
                        page_number=i + 1,
                        bbox=table.bbox,
                        row_count=len(rows),
                        col_count=max(len(r) for r in rows),
                        source="native",
                    ))
    return detected
```
For scanned documents where tables exist only as images, use contour-based detection:
```python
import cv2
import numpy as np


def detect_tables_in_image(image_path: str) -> list[dict]:
    """Detect table regions in scanned document images."""
    img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    if img is None:
        raise FileNotFoundError(f"Could not read image: {image_path}")
    binary = cv2.adaptiveThreshold(
        img, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
        cv2.THRESH_BINARY_INV, 15, 5
    )
    # Detect horizontal lines
    h_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (40, 1))
    h_lines = cv2.morphologyEx(binary, cv2.MORPH_OPEN, h_kernel)
    # Detect vertical lines
    v_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (1, 40))
    v_lines = cv2.morphologyEx(binary, cv2.MORPH_OPEN, v_kernel)
    # Combine the line masks into one grid mask
    table_mask = cv2.add(h_lines, v_lines)
    contours, _ = cv2.findContours(
        table_mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE
    )
    tables = []
    for contour in contours:
        x, y, w, h = cv2.boundingRect(contour)
        if w > 100 and h > 50:  # Filter noise
            tables.append({
                "bbox": (x, y, x + w, y + h),
                "area": w * h,
            })
    return sorted(tables, key=lambda t: t["area"], reverse=True)
```
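Contour detection sometimes returns one candidate box nested inside another (a large cell inside its own table). A simple cleanup, assuming the list is sorted largest-area first as above, is to drop any box fully contained in one already kept. A sketch:

```python
def drop_nested_boxes(tables: list[dict]) -> list[dict]:
    """Discard candidate boxes fully contained in a larger kept box.

    Assumes input sorted by area, largest first.
    """
    kept = []
    for t in tables:
        x0, y0, x1, y1 = t["bbox"]
        contained = any(
            kx0 <= x0 and ky0 <= y0 and kx1 >= x1 and ky1 >= y1
            for kx0, ky0, kx1, ky1 in (k["bbox"] for k in kept)
        )
        if not contained:
            kept.append(t)
    return kept
```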
Stage 2: Structure Recognition
Once a table region is identified, the next step is figuring out the row-column structure:
```python
def extract_grid_structure(
    binary_image: np.ndarray,
    bbox: tuple
) -> dict:
    """Identify row and column boundaries within a table region."""
    x0, y0, x1, y1 = bbox
    table_region = binary_image[y0:y1, x0:x1]
    # Project horizontally to find row boundaries
    h_projection = np.sum(table_region, axis=1)
    row_boundaries = find_boundaries(h_projection)
    # Project vertically to find column boundaries
    v_projection = np.sum(table_region, axis=0)
    col_boundaries = find_boundaries(v_projection)
    return {
        "rows": row_boundaries,
        "cols": col_boundaries,
        "cell_count": (len(row_boundaries) - 1) * (len(col_boundaries) - 1),
    }


def find_boundaries(projection: np.ndarray) -> list[int]:
    """Find boundaries where content bands start, from a pixel projection."""
    threshold = np.max(projection) * 0.3
    in_gap = True
    boundaries = [0]
    for i, val in enumerate(projection):
        if in_gap and val > threshold:
            if i != boundaries[-1]:  # avoid a zero-height first band
                boundaries.append(i)
            in_gap = False
        elif not in_gap and val <= threshold:
            in_gap = True
    if boundaries[-1] != len(projection):
        boundaries.append(len(projection))
    return boundaries
```
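To make the projection idea concrete, here is a tiny synthetic example: two bands of ink separated by gaps produce a row projection whose peaks mark the content rows. The values are chosen purely for illustration:

```python
import numpy as np

# Synthetic binary table region: two "rows" of ink separated by blank gaps.
region = np.zeros((7, 4), dtype=np.uint8)
region[1:3, :] = 255   # first content band (rows 1-2)
region[4:6, :] = 255   # second content band (rows 4-5)

projection = region.sum(axis=1)          # row-wise ink totals
threshold = projection.max() * 0.3
content_rows = np.flatnonzero(projection > threshold)
print(content_rows.tolist())  # [1, 2, 4, 5] -- only the ink rows exceed the threshold
```

The same projection along axis 0 would locate the columns; real scans just have noisier profiles, which is what the 30% threshold is there to absorb.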
Stage 3: Cell Content Extraction
With the grid structure known, extract text from each cell using OCR:
```python
import pytesseract
from PIL import Image


def extract_cell_contents(
    image: np.ndarray,
    rows: list[int],
    cols: list[int],
    table_offset: tuple
) -> list[list[str]]:
    """Extract text from each cell in the detected grid."""
    ox, oy = table_offset[0], table_offset[1]
    table_data = []
    for r in range(len(rows) - 1):
        row_data = []
        for c in range(len(cols) - 1):
            cell = image[
                oy + rows[r]:oy + rows[r + 1],
                ox + cols[c]:ox + cols[c + 1]
            ]
            cell_pil = Image.fromarray(cell)
            text = pytesseract.image_to_string(
                cell_pil, config="--psm 6"  # treat each cell as a uniform block of text
            ).strip()
            row_data.append(text)
        table_data.append(row_data)
    return table_data
```
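Tesseract tends to do better when text does not touch the crop edge, so padding each cell with a white margin before OCR is a cheap, commonly used tweak. A minimal sketch of just the padding step (`pad_cell` is an illustrative helper, not part of any library):

```python
import numpy as np

def pad_cell(cell: np.ndarray, margin: int = 10, white: int = 255) -> np.ndarray:
    """Add a white border around a grayscale cell crop before OCR."""
    return np.pad(cell, pad_width=margin, mode="constant", constant_values=white)
```

Feed the padded array to `Image.fromarray` exactly as in the loop above.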
Stage 4: Output Formatting
Convert the extracted data to a clean DataFrame with header detection:
```python
import pandas as pd


def table_to_dataframe(
    raw_data: list[list[str]],
    has_header: bool = True
) -> pd.DataFrame:
    """Convert extracted table data to a pandas DataFrame."""
    if not raw_data:
        return pd.DataFrame()
    if has_header:
        headers = [
            cell.replace("\n", " ").strip()
            for cell in raw_data[0]
        ]
        df = pd.DataFrame(raw_data[1:], columns=headers)
    else:
        df = pd.DataFrame(raw_data)
    # Strip whitespace, then drop columns that have no content at all.
    # OCR yields empty strings rather than NaN, so convert them first.
    df = df.apply(lambda col: col.str.strip() if col.dtype == "object" else col)
    df = df.replace("", pd.NA).dropna(axis=1, how="all")
    return df
```
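A quick standalone illustration of the formatting step, with made-up cell data. Treating empty strings as missing values is what lets an all-empty OCR column actually drop:

```python
import pandas as pd

raw = [
    ["Item", "Qty", ""],    # header row with one empty trailing column
    ["Widget ", "3", ""],
    ["Gadget", "5", ""],
]
df = pd.DataFrame(raw[1:], columns=[h.strip() for h in raw[0]])
df = df.apply(lambda col: col.str.strip())          # drop stray OCR whitespace
df = df.replace("", pd.NA).dropna(axis=1, how="all")  # remove the empty column
print(df.shape)  # (2, 2)
```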
```python
import os

def export_tables(tables: list[pd.DataFrame], output_dir: str):
    """Export extracted tables to CSV files."""
    os.makedirs(output_dir, exist_ok=True)  # ensure the target directory exists
    for i, df in enumerate(tables):
        path = os.path.join(output_dir, f"table_{i + 1}.csv")
        df.to_csv(path, index=False)
        print(f"Exported {len(df)} rows to {path}")
```
Combining Native and Image Pipelines
A robust agent should automatically choose the right extraction strategy:
```python
def extract_tables_auto(pdf_path: str) -> list[pd.DataFrame]:
    """Automatically select the best extraction method."""
    native_tables = detect_tables_native(pdf_path)
    if native_tables:
        # Use pdfplumber for native PDF tables
        results = []
        with pdfplumber.open(pdf_path) as pdf:
            for page in pdf.pages:
                for table in page.find_tables():
                    rows = table.extract()
                    if rows:
                        results.append(table_to_dataframe(rows))
        return results
    else:
        # Fallback to image-based extraction
        print("No native tables found, using image-based extraction")
        return extract_tables_from_images(pdf_path)
```
FAQ
How do I handle merged cells in tables?
Merged cells are one of the hardest problems in table extraction. When a cell spans multiple rows or columns, the grid structure becomes irregular. The best approach is to detect merged cells by looking for cells where the boundary lines are absent, then use spanning metadata to reconstruct the logical structure. Libraries like img2table handle this better than raw contour detection.
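As a rough sketch of that reconstruction for the simple case of column spans: treat empty continuation cells as copies of the spanning cell to their left. This assumption breaks for genuinely empty cells, so in practice you would apply it only where a boundary line was detected as absent:

```python
def fill_column_spans(row: list[str]) -> list[str]:
    """Copy a spanning cell's value into the empty cells it covers."""
    filled = []
    for cell in row:
        if cell == "" and filled:
            filled.append(filled[-1])  # continuation of the previous cell
        else:
            filled.append(cell)
    return filled
```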
What accuracy can I expect from table extraction?
On clean, well-formatted tables with clear gridlines, extraction accuracy typically reaches 95%+ for both structure and content. Borderless tables drop to 70-85% accuracy because column alignment must be inferred from whitespace. Always validate extracted data by checking row/column counts against expectations and flagging anomalies.
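A minimal validation gate along those lines (the thresholds here are arbitrary placeholders; tune them to your documents):

```python
import pandas as pd

def validate_table(df: pd.DataFrame, expected_cols: int, min_rows: int = 1) -> list[str]:
    """Return a list of anomaly descriptions; an empty list means the table passed."""
    problems = []
    if df.shape[1] != expected_cols:
        problems.append(f"expected {expected_cols} columns, got {df.shape[1]}")
    if len(df) < min_rows:
        problems.append(f"only {len(df)} rows extracted")
    empty_ratio = df.isna().to_numpy().mean() if df.size else 1.0
    if empty_ratio > 0.5:
        problems.append(f"{empty_ratio:.0%} of cells are empty")
    return problems
```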
Can this pipeline handle tables that span multiple pages?
Yes, but it requires additional logic to detect continuation tables. Look for tables that start at the top of a page without a header row, or tables on consecutive pages with matching column counts and widths. Merge them by concatenating rows and deduplicating any repeated header rows.
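A sketch of that merge for two fragments that already share column names, assuming the continuation page may repeat the header as its first data row:

```python
import pandas as pd

def merge_continuation(first: pd.DataFrame, cont: pd.DataFrame) -> pd.DataFrame:
    """Append a continuation fragment, dropping a repeated header row if present."""
    if len(cont) and list(cont.iloc[0]) == list(first.columns):
        cont = cont.iloc[1:]  # the next page repeated the header as a data row
    return pd.concat([first, cont], ignore_index=True)
```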
#TableExtraction #PDFProcessing #DataPipelines #DocumentAI #ComputerVision #OCR #Python #AgenticAI
CallSphere Team
Expert insights on AI voice agents and customer communication automation.