Building an Image Analysis Agent: OCR, Object Detection, and Visual QA
Build a Python-based image analysis agent that performs OCR text extraction, object detection, and visual question answering. Includes preprocessing pipelines and structured output formatting.
What an Image Analysis Agent Does
An image analysis agent accepts an image and a natural language question, then uses a combination of computer vision tools — OCR, object detection, and visual question answering — to produce a structured answer. Unlike a simple API call to a vision model, an agent can decide which tools to apply based on the question, chain multiple analysis steps, and format results according to the user's needs.
Setting Up the Vision Toolbox
The agent needs three core capabilities. Start by installing the Python dependencies (note that pytesseract is only a wrapper; the Tesseract engine itself must also be installed through your system package manager):
pip install openai pillow pytesseract ultralytics
Each tool serves a distinct purpose:
- OCR (Tesseract) — extracts text from images, useful for documents, signs, and labels
- Object Detection (YOLO) — identifies and locates objects with bounding boxes
- Visual QA (GPT-4o) — answers open-ended questions about image content
Image Preprocessing Pipeline
Raw images often need preprocessing before analysis. Resizing, normalization, and format conversion improve accuracy across all tools:
from PIL import Image, ImageEnhance, ImageFilter
import io

def preprocess_image(
    image_bytes: bytes,
    max_dimension: int = 2048,
    enhance_for_ocr: bool = False,
) -> Image.Image:
    """Preprocess an image for analysis."""
    img = Image.open(io.BytesIO(image_bytes))

    # Convert to RGB if necessary
    if img.mode != "RGB":
        img = img.convert("RGB")

    # Resize if too large (preserves aspect ratio)
    if max(img.size) > max_dimension:
        ratio = max_dimension / max(img.size)
        new_size = (int(img.width * ratio), int(img.height * ratio))
        img = img.resize(new_size, Image.LANCZOS)

    # Enhance for OCR: sharpen and increase contrast
    if enhance_for_ocr:
        img = img.filter(ImageFilter.SHARPEN)
        enhancer = ImageEnhance.Contrast(img)
        img = enhancer.enhance(1.5)

    return img
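The resize arithmetic is easy to sanity-check in isolation. This standalone helper (target_size is illustrative, not part of the pipeline) mirrors the aspect-ratio math above:

```python
def target_size(width: int, height: int, max_dimension: int = 2048) -> tuple[int, int]:
    """Compute the resize target, preserving aspect ratio."""
    longest = max(width, height)
    if longest <= max_dimension:
        return (width, height)  # already small enough, leave untouched
    ratio = max_dimension / longest
    return (int(width * ratio), int(height * ratio))

print(target_size(4096, 1024))  # (2048, 512)
print(target_size(800, 600))    # unchanged: (800, 600)
```

Because the ratio is computed from the longest side, the shorter side always lands at or below the cap as well.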
Building the OCR Tool
Tesseract handles text extraction. Wrap it as an agent tool with structured output:
import pytesseract
from dataclasses import dataclass

@dataclass
class OCRResult:
    full_text: str
    confidence: float
    word_count: int
    blocks: list[dict]

def extract_text(img: Image.Image) -> OCRResult:
    """Extract text from an image using Tesseract OCR."""
    # Get detailed data including confidence scores
    data = pytesseract.image_to_data(
        img, output_type=pytesseract.Output.DICT
    )
    words = []
    confidences = []
    for i, text in enumerate(data["text"]):
        # conf may arrive as an int, float, or numeric string
        # depending on the pytesseract version
        conf = int(float(data["conf"][i]))
        if conf > 0 and text.strip():
            words.append(text.strip())
            confidences.append(conf)
    full_text = " ".join(words)
    avg_confidence = (
        sum(confidences) / len(confidences) if confidences else 0.0
    )

    # Build text blocks by grouping lines
    blocks = []
    current_block = []
    current_block_num = -1
    for i, text in enumerate(data["text"]):
        if not text.strip():
            continue
        block_num = data["block_num"][i]
        if block_num != current_block_num:
            if current_block:
                blocks.append({"text": " ".join(current_block)})
            current_block = [text.strip()]
            current_block_num = block_num
        else:
            current_block.append(text.strip())
    if current_block:
        blocks.append({"text": " ".join(current_block)})

    return OCRResult(
        full_text=full_text,
        confidence=avg_confidence,
        word_count=len(words),
        blocks=blocks,
    )
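The block-grouping pass can be tested without running Tesseract at all. This sketch replays the same loop over a hand-built stand-in for pytesseract's output dict (all values hypothetical):

```python
# Hypothetical stand-in for pytesseract.image_to_data(..., Output.DICT)
data = {
    "text": ["Invoice", "#123", "", "Total:", "$50"],
    "conf": [96, 91, -1, 88, 90],
    "block_num": [1, 1, 1, 2, 2],
}

blocks = []
current_block = []
current_block_num = -1
for i, text in enumerate(data["text"]):
    if not text.strip():
        continue  # empty entries mark layout, not words
    block_num = data["block_num"][i]
    if block_num != current_block_num:
        if current_block:
            blocks.append(" ".join(current_block))
        current_block = [text.strip()]
        current_block_num = block_num
    else:
        current_block.append(text.strip())
if current_block:
    blocks.append(" ".join(current_block))

print(blocks)  # ['Invoice #123', 'Total: $50']
```

Words sharing a block_num are joined; a new block_num flushes the accumulated block, and the trailing flush catches the final one.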
Object Detection with YOLO
The YOLO model identifies objects and their locations within an image:
from ultralytics import YOLO

@dataclass
class DetectedObject:
    label: str
    confidence: float
    bbox: tuple[int, int, int, int]  # x1, y1, x2, y2

def detect_objects(
    img: Image.Image, confidence_threshold: float = 0.5
) -> list[DetectedObject]:
    """Detect objects in an image using YOLOv8."""
    # Loaded per call for simplicity; cache at module level in production
    model = YOLO("yolov8n.pt")  # nano model for speed
    results = model(img, verbose=False)
    detected = []
    for result in results:
        for box in result.boxes:
            conf = float(box.conf[0])
            if conf >= confidence_threshold:
                x1, y1, x2, y2 = box.xyxy[0].tolist()
                label = result.names[int(box.cls[0])]
                detected.append(DetectedObject(
                    label=label,
                    confidence=round(conf, 3),
                    bbox=(int(x1), int(y1), int(x2), int(y2)),
                ))
    return detected
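Downstream code often wants counts rather than raw boxes, for example to answer "how many" questions. A minimal sketch over hypothetical detections, mirroring the DetectedObject shape above:

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class DetectedObject:
    label: str
    confidence: float
    bbox: tuple[int, int, int, int]

# Hypothetical detections, shaped like detect_objects output
detections = [
    DetectedObject("person", 0.91, (10, 20, 110, 220)),
    DetectedObject("person", 0.84, (150, 30, 240, 210)),
    DetectedObject("dog", 0.77, (60, 180, 160, 260)),
]

# Tally detections by label
counts = Counter(d.label for d in detections)
print(dict(counts))  # {'person': 2, 'dog': 1}
```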
The Agent: Routing Questions to Tools
The agent decides which tools to use based on the user's question. A keyword-based router works well for most cases:
import openai
import base64

class ImageAnalysisAgent:
    def __init__(self):
        self.client = openai.AsyncOpenAI()

    def _select_tools(self, question: str) -> list[str]:
        """Select which tools to run based on the question."""
        q = question.lower()
        tools = []
        if any(kw in q for kw in ["text", "read", "ocr", "written", "says"]):
            tools.append("ocr")
        if any(kw in q for kw in ["object", "detect", "find", "count", "how many"]):
            tools.append("detection")
        # Always include VQA as the reasoning backbone
        tools.append("vqa")
        return tools

    async def analyze(
        self, image_bytes: bytes, question: str
    ) -> dict:
        selected_tools = self._select_tools(question)
        context_parts = []
        img = preprocess_image(image_bytes)

        if "ocr" in selected_tools:
            ocr_result = extract_text(
                preprocess_image(image_bytes, enhance_for_ocr=True)
            )
            context_parts.append(
                f"OCR extracted text ({ocr_result.word_count} words, "
                f"confidence {ocr_result.confidence:.1f}%): "
                f"{ocr_result.full_text}"
            )

        if "detection" in selected_tools:
            objects = detect_objects(img)
            obj_summary = ", ".join(
                f"{o.label} ({o.confidence:.0%})" for o in objects
            )
            context_parts.append(
                f"Detected objects: {obj_summary or 'none'}"
            )

        # VQA with GPT-4o, enriched by tool outputs.
        # Note: the data URL assumes PNG input; match the MIME
        # type to the actual image format in production.
        b64 = base64.b64encode(image_bytes).decode()
        tool_context = "\n".join(context_parts)
        response = await self.client.chat.completions.create(
            model="gpt-4o",
            messages=[{
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": (
                            f"Tool analysis results:\n{tool_context}\n\n"
                            f"Question: {question}"
                        ),
                    },
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/png;base64,{b64}"},
                    },
                ],
            }],
        )
        return {
            "answer": response.choices[0].message.content,
            "tools_used": selected_tools,
        }
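The routing behavior can be exercised without an API key or an image. This standalone copy of the keyword rules (select_tools here is a free function, unlike the method above) shows which tools fire for different questions:

```python
def select_tools(question: str) -> list[str]:
    """Keyword router: map a question to the tools worth running."""
    q = question.lower()
    tools = []
    if any(kw in q for kw in ["text", "read", "ocr", "written", "says"]):
        tools.append("ocr")
    if any(kw in q for kw in ["object", "detect", "find", "count", "how many"]):
        tools.append("detection")
    tools.append("vqa")  # VQA always runs as the reasoning backbone
    return tools

print(select_tools("What is written on the sign?"))    # ['ocr', 'vqa']
print(select_tools("How many cars are visible?"))      # ['detection', 'vqa']
print(select_tools("Describe the mood of the scene"))  # ['vqa']
```

Note that substring matching is deliberately loose; a question that triggers an extra tool only costs a little latency, since the VQA step still arbitrates the final answer.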
Structured Output Formatting
For programmatic consumers, format the analysis results as structured JSON:
from pydantic import BaseModel

class ImageAnalysisResult(BaseModel):
    answer: str
    extracted_text: str | None = None
    detected_objects: list[dict] | None = None
    tools_used: list[str]
    confidence: float
FAQ
When should I use OCR versus a vision language model for text extraction?
Use Tesseract OCR when you need precise character-level extraction from clean documents, invoices, or printed text. Use a vision language model like GPT-4o when the text is embedded in complex scenes, handwritten, or when you also need to understand the context around the text. For best results, run both and let the agent cross-reference the outputs.
How do I handle images that are too large for the API?
Resize images to a maximum dimension of 2048 pixels while preserving the aspect ratio, as shown in the preprocessing function. For GPT-4o specifically, the API automatically handles resizing, but sending smaller images reduces latency and cost. If detail is critical for a specific region, crop that region and send it as a separate analysis request.
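The crop-and-resend pattern described above is a one-liner with Pillow; the image and coordinates here are hypothetical stand-ins:

```python
from PIL import Image

# Stand-in for a large uploaded image
img = Image.new("RGB", (4000, 3000), "white")

# Crop a region of interest: (left, upper, right, lower)
region = img.crop((1000, 500, 1800, 1100))
print(region.size)  # (800, 600)
```

The cropped region can then be fed through the same preprocessing and analysis pipeline as a separate request.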
Can this agent process multiple images in a single request?
Yes. Extend the analyze method to accept a list of image bytes. Process each image independently through the tool pipeline, then send all results along with all images to the VQA step. GPT-4o supports multiple images in a single message, so the reasoning model can compare and cross-reference across images.
#ImageAnalysis #OCR #ObjectDetection #VisualQA #ComputerVision #AgenticAI #LearnAI #AIEngineering
CallSphere Team
Expert insights on AI voice agents and customer communication automation.