Table Extraction from Images and PDFs with AI: Building Reliable Data Pipelines
Build an AI-powered table extraction pipeline that detects tables in images and PDFs, recognizes cell boundaries, infers structure, and outputs clean CSV data for downstream consumption.
The Table Extraction Challenge
Tables are one of the most information-dense structures in documents, yet they are among the hardest to extract reliably. A table in a PDF might be a true table object with embedded coordinates, a scanned image of a printed table, or text that is visually aligned but has no structural markup at all. Each case requires a different extraction strategy.
A reliable table extraction pipeline needs four stages: detection (finding tables on the page), structure recognition (identifying rows, columns, and cell boundaries), content extraction (reading the text in each cell), and output formatting (producing clean structured data).
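Before diving into each stage, it helps to see the overall shape. The sketch below wires the four stages together as plain callables; every name here is illustrative, not from any specific library:

```python
from dataclasses import dataclass, field

@dataclass
class TableResult:
    """Accumulates what each stage learns about one table."""
    bbox: tuple                                 # region found by detection
    grid: dict = field(default_factory=dict)    # rows/cols from structure recognition
    cells: list = field(default_factory=list)   # text from content extraction

def run_pipeline(page, detect, recognize, read, format_output):
    """Chain the four stages; each argument is one stage as a callable."""
    results = []
    for bbox in detect(page):
        result = TableResult(bbox=bbox)
        result.grid = recognize(page, bbox)
        result.cells = read(page, result.grid)
        results.append(format_output(result.cells))
    return results
```

Each stage only needs the output of the one before it, which makes the stages easy to swap (e.g. native PDF detection vs. image-based detection) without touching the rest.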
Setting Up the Pipeline
```bash
pip install "camelot-py[cv]" tabula-py pdfplumber img2table opencv-python-headless pandas pytesseract
```
For image-based table extraction, you also need Tesseract installed on your system.
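A missing Tesseract binary is the most common setup failure, so a quick availability check (stdlib only) can save debugging time later. The helper name is just a suggestion:

```python
import shutil

def tesseract_available() -> bool:
    """Return True if the tesseract binary is on PATH."""
    return shutil.which("tesseract") is not None
```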
Stage 1: Table Detection
The first step is locating tables within a document. For PDFs with embedded structure, pdfplumber excels:
```python
import pdfplumber
from dataclasses import dataclass


@dataclass
class DetectedTable:
    page_number: int
    bbox: tuple  # (x0, y0, x1, y1)
    row_count: int
    col_count: int
    source: str  # "native" or "image"


def detect_tables_native(pdf_path: str) -> list[DetectedTable]:
    """Detect tables in PDFs with embedded structure."""
    detected = []
    with pdfplumber.open(pdf_path) as pdf:
        for i, page in enumerate(pdf.pages):
            tables = page.find_tables()
            for table in tables:
                rows = table.extract()
                if rows and len(rows) > 1:
                    detected.append(DetectedTable(
                        page_number=i + 1,
                        bbox=table.bbox,
                        row_count=len(rows),
                        col_count=max(len(r) for r in rows),
                        source="native",
                    ))
    return detected
```
For scanned documents where tables exist only as images, use contour-based detection:
```python
import cv2
import numpy as np


def detect_tables_in_image(image_path: str) -> list[dict]:
    """Detect table regions in scanned document images."""
    img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    if img is None:
        raise FileNotFoundError(f"Could not read image: {image_path}")
    binary = cv2.adaptiveThreshold(
        img, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
        cv2.THRESH_BINARY_INV, 15, 5
    )
    # Detect horizontal lines
    h_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (40, 1))
    h_lines = cv2.morphologyEx(binary, cv2.MORPH_OPEN, h_kernel)
    # Detect vertical lines
    v_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (1, 40))
    v_lines = cv2.morphologyEx(binary, cv2.MORPH_OPEN, v_kernel)
    # Combine the line masks into one grid mask
    table_mask = cv2.add(h_lines, v_lines)
    contours, _ = cv2.findContours(
        table_mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE
    )
    tables = []
    for contour in contours:
        x, y, w, h = cv2.boundingRect(contour)
        if w > 100 and h > 50:  # Filter noise
            tables.append({
                "bbox": (x, y, x + w, y + h),
                "area": w * h,
            })
    return sorted(tables, key=lambda t: t["area"], reverse=True)
```
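Contour detection sometimes returns one candidate box nested inside another (a large cell inside its own table). A simple cleanup, assuming the list is sorted largest-area first as above, is to drop any box fully contained in one already kept. A sketch:

```python
def drop_nested_boxes(tables: list[dict]) -> list[dict]:
    """Discard candidate boxes fully contained in a larger kept box.

    Assumes input sorted by area, largest first.
    """
    kept = []
    for t in tables:
        x0, y0, x1, y1 = t["bbox"]
        contained = any(
            kx0 <= x0 and ky0 <= y0 and kx1 >= x1 and ky1 >= y1
            for kx0, ky0, kx1, ky1 in (k["bbox"] for k in kept)
        )
        if not contained:
            kept.append(t)
    return kept
```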
Stage 2: Structure Recognition
Once a table region is identified, the next step is figuring out the row-column structure:
```python
def extract_grid_structure(
    binary_image: np.ndarray,
    bbox: tuple
) -> dict:
    """Identify row and column boundaries within a table region."""
    x0, y0, x1, y1 = bbox
    table_region = binary_image[y0:y1, x0:x1]
    # Project horizontally to find row boundaries
    h_projection = np.sum(table_region, axis=1)
    row_boundaries = find_boundaries(h_projection)
    # Project vertically to find column boundaries
    v_projection = np.sum(table_region, axis=0)
    col_boundaries = find_boundaries(v_projection)
    return {
        "rows": row_boundaries,
        "cols": col_boundaries,
        "cell_count": (len(row_boundaries) - 1) * (len(col_boundaries) - 1),
    }


def find_boundaries(projection: np.ndarray) -> list[int]:
    """Find boundaries where content bands start, from a pixel projection."""
    threshold = np.max(projection) * 0.3
    in_gap = True
    boundaries = [0]
    for i, val in enumerate(projection):
        if in_gap and val > threshold:
            if i != boundaries[-1]:  # avoid a zero-height first band
                boundaries.append(i)
            in_gap = False
        elif not in_gap and val <= threshold:
            in_gap = True
    if boundaries[-1] != len(projection):
        boundaries.append(len(projection))
    return boundaries
```
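To make the projection idea concrete, here is a tiny synthetic example: two bands of ink separated by gaps produce a row projection whose peaks mark the content rows. The values are chosen purely for illustration:

```python
import numpy as np

# Synthetic binary table region: two "rows" of ink separated by blank gaps.
region = np.zeros((7, 4), dtype=np.uint8)
region[1:3, :] = 255   # first content band (rows 1-2)
region[4:6, :] = 255   # second content band (rows 4-5)

projection = region.sum(axis=1)          # row-wise ink totals
threshold = projection.max() * 0.3
content_rows = np.flatnonzero(projection > threshold)
print(content_rows.tolist())  # [1, 2, 4, 5] -- only the ink rows exceed the threshold
```

The same projection along axis 0 would locate the columns; real scans just have noisier profiles, which is what the 30% threshold is there to absorb.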
Stage 3: Cell Content Extraction
With the grid structure known, extract text from each cell using OCR:
```python
import pytesseract
from PIL import Image


def extract_cell_contents(
    image: np.ndarray,
    rows: list[int],
    cols: list[int],
    table_offset: tuple
) -> list[list[str]]:
    """Extract text from each cell in the detected grid."""
    ox, oy = table_offset[0], table_offset[1]
    table_data = []
    for r in range(len(rows) - 1):
        row_data = []
        for c in range(len(cols) - 1):
            cell = image[
                oy + rows[r]:oy + rows[r + 1],
                ox + cols[c]:ox + cols[c + 1]
            ]
            cell_pil = Image.fromarray(cell)
            text = pytesseract.image_to_string(
                cell_pil, config="--psm 6"  # treat each cell as a uniform block of text
            ).strip()
            row_data.append(text)
        table_data.append(row_data)
    return table_data
```
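Tesseract tends to do better when text does not touch the crop edge, so padding each cell with a white margin before OCR is a cheap, commonly used tweak. A minimal sketch of just the padding step (`pad_cell` is an illustrative helper, not part of any library):

```python
import numpy as np

def pad_cell(cell: np.ndarray, margin: int = 10, white: int = 255) -> np.ndarray:
    """Add a white border around a grayscale cell crop before OCR."""
    return np.pad(cell, pad_width=margin, mode="constant", constant_values=white)
```

Feed the padded array to `Image.fromarray` exactly as in the loop above.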
Stage 4: Output Formatting
Convert the extracted data to a clean DataFrame with header detection:
```python
import pandas as pd


def table_to_dataframe(
    raw_data: list[list[str]],
    has_header: bool = True
) -> pd.DataFrame:
    """Convert extracted table data to a pandas DataFrame."""
    if not raw_data:
        return pd.DataFrame()
    if has_header:
        headers = [
            cell.replace("\n", " ").strip()
            for cell in raw_data[0]
        ]
        df = pd.DataFrame(raw_data[1:], columns=headers)
    else:
        df = pd.DataFrame(raw_data)
    # Strip whitespace, then drop columns that have no content at all.
    # OCR yields empty strings rather than NaN, so convert them first.
    df = df.apply(lambda col: col.str.strip() if col.dtype == "object" else col)
    df = df.replace("", pd.NA).dropna(axis=1, how="all")
    return df
```
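A quick standalone illustration of the formatting step, with made-up cell data. Treating empty strings as missing values is what lets an all-empty OCR column actually drop:

```python
import pandas as pd

raw = [
    ["Item", "Qty", ""],    # header row with one empty trailing column
    ["Widget ", "3", ""],
    ["Gadget", "5", ""],
]
df = pd.DataFrame(raw[1:], columns=[h.strip() for h in raw[0]])
df = df.apply(lambda col: col.str.strip())          # drop stray OCR whitespace
df = df.replace("", pd.NA).dropna(axis=1, how="all")  # remove the empty column
print(df.shape)  # (2, 2)
```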
```python
import os

def export_tables(tables: list[pd.DataFrame], output_dir: str):
    """Export extracted tables to CSV files."""
    os.makedirs(output_dir, exist_ok=True)  # ensure the target directory exists
    for i, df in enumerate(tables):
        path = os.path.join(output_dir, f"table_{i + 1}.csv")
        df.to_csv(path, index=False)
        print(f"Exported {len(df)} rows to {path}")
```
Combining Native and Image Pipelines
A robust agent should automatically choose the right extraction strategy:
```python
def extract_tables_auto(pdf_path: str) -> list[pd.DataFrame]:
    """Automatically select the best extraction method."""
    native_tables = detect_tables_native(pdf_path)
    if native_tables:
        # Use pdfplumber for native PDF tables
        results = []
        with pdfplumber.open(pdf_path) as pdf:
            for page in pdf.pages:
                for table in page.find_tables():
                    rows = table.extract()
                    if rows:
                        results.append(table_to_dataframe(rows))
        return results
    else:
        # Fallback to image-based extraction
        print("No native tables found, using image-based extraction")
        return extract_tables_from_images(pdf_path)
```
FAQ
How do I handle merged cells in tables?
Merged cells are one of the hardest problems in table extraction. When a cell spans multiple rows or columns, the grid structure becomes irregular. The best approach is to detect merged cells by looking for cells where the boundary lines are absent, then use spanning metadata to reconstruct the logical structure. Libraries like img2table handle this better than raw contour detection.
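As a rough sketch of that reconstruction for the simple case of column spans: treat empty continuation cells as copies of the spanning cell to their left. This assumption breaks for genuinely empty cells, so in practice you would apply it only where a boundary line was detected as absent:

```python
def fill_column_spans(row: list[str]) -> list[str]:
    """Copy a spanning cell's value into the empty cells it covers."""
    filled = []
    for cell in row:
        if cell == "" and filled:
            filled.append(filled[-1])  # continuation of the previous cell
        else:
            filled.append(cell)
    return filled
```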
What accuracy can I expect from table extraction?
On clean, well-formatted tables with clear gridlines, extraction accuracy typically reaches 95%+ for both structure and content. Borderless tables drop to 70-85% accuracy because column alignment must be inferred from whitespace. Always validate extracted data by checking row/column counts against expectations and flagging anomalies.
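A minimal validation gate along those lines (the thresholds here are arbitrary placeholders; tune them to your documents):

```python
import pandas as pd

def validate_table(df: pd.DataFrame, expected_cols: int, min_rows: int = 1) -> list[str]:
    """Return a list of anomaly descriptions; an empty list means the table passed."""
    problems = []
    if df.shape[1] != expected_cols:
        problems.append(f"expected {expected_cols} columns, got {df.shape[1]}")
    if len(df) < min_rows:
        problems.append(f"only {len(df)} rows extracted")
    empty_ratio = df.isna().to_numpy().mean() if df.size else 1.0
    if empty_ratio > 0.5:
        problems.append(f"{empty_ratio:.0%} of cells are empty")
    return problems
```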
Can this pipeline handle tables that span multiple pages?
Yes, but it requires additional logic to detect continuation tables. Look for tables that start at the top of a page without a header row, or tables on consecutive pages with matching column counts and widths. Merge them by concatenating rows and deduplicating any repeated header rows.
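A sketch of that merge for two fragments that already share column names, assuming the continuation page may repeat the header as its first data row:

```python
import pandas as pd

def merge_continuation(first: pd.DataFrame, cont: pd.DataFrame) -> pd.DataFrame:
    """Append a continuation fragment, dropping a repeated header row if present."""
    if len(cont) and list(cont.iloc[0]) == list(first.columns):
        cont = cont.iloc[1:]  # the next page repeated the header as a data row
    return pd.concat([first, cont], ignore_index=True)
```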
#TableExtraction #PDFProcessing #DataPipelines #DocumentAI #ComputerVision #OCR #Python #AgenticAI
CallSphere Team
Expert insights on AI voice agents and customer communication automation.