
Building an Image Analysis Agent: OCR, Object Detection, and Visual QA

Build a Python-based image analysis agent that performs OCR text extraction, object detection, and visual question answering. Includes preprocessing pipelines and structured output formatting.

What an Image Analysis Agent Does

An image analysis agent accepts an image and a natural language question, then uses a combination of computer vision tools — OCR, object detection, and visual question answering — to produce a structured answer. Unlike a simple API call to a vision model, an agent can decide which tools to apply based on the question, chain multiple analysis steps, and format results according to the user's needs.

Setting Up the Vision Toolbox

The agent needs three core capabilities. Start by installing the Python dependencies:

pip install openai pillow pytesseract ultralytics

Note that pytesseract is only a wrapper: the Tesseract binary itself must be installed separately (for example, apt install tesseract-ocr on Debian/Ubuntu or brew install tesseract on macOS).

Each tool serves a distinct purpose:

  • OCR (Tesseract) — extracts text from images, useful for documents, signs, and labels
  • Object Detection (YOLO) — identifies and locates objects with bounding boxes
  • Visual QA (GPT-4o) — answers open-ended questions about image content

Image Preprocessing Pipeline

Raw images often need preprocessing before analysis. Resizing, normalization, and format conversion improve accuracy across all tools:

from PIL import Image, ImageEnhance, ImageFilter
import io


def preprocess_image(
    image_bytes: bytes,
    max_dimension: int = 2048,
    enhance_for_ocr: bool = False,
) -> Image.Image:
    """Preprocess an image for analysis."""
    img = Image.open(io.BytesIO(image_bytes))

    # Convert to RGB if necessary
    if img.mode != "RGB":
        img = img.convert("RGB")

    # Resize if too large (preserves aspect ratio)
    if max(img.size) > max_dimension:
        ratio = max_dimension / max(img.size)
        new_size = (int(img.width * ratio), int(img.height * ratio))
        img = img.resize(new_size, Image.LANCZOS)

    # Enhance for OCR: sharpen and increase contrast
    if enhance_for_ocr:
        img = img.filter(ImageFilter.SHARPEN)
        enhancer = ImageEnhance.Contrast(img)
        img = enhancer.enhance(1.5)

    return img
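To see the resize step concretely, the same ratio arithmetic can be run by hand for a hypothetical 4000x3000 input:

```python
# Reproduce the aspect-ratio math from preprocess_image for a
# hypothetical 4000x3000 image capped at 2048 pixels.
width, height = 4000, 3000
max_dimension = 2048

ratio = max_dimension / max(width, height)  # 2048 / 4000 = 0.512
new_size = (int(width * ratio), int(height * ratio))

print(new_size)  # (2048, 1536) -- the 4:3 aspect ratio is preserved
```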

Building the OCR Tool

Tesseract handles text extraction. Wrap it as an agent tool with structured output:


import pytesseract
from dataclasses import dataclass


@dataclass
class OCRResult:
    full_text: str
    confidence: float
    word_count: int
    blocks: list[dict]


def extract_text(img: Image.Image) -> OCRResult:
    """Extract text from an image using Tesseract OCR."""
    # Get detailed data including confidence scores
    data = pytesseract.image_to_data(
        img, output_type=pytesseract.Output.DICT
    )

    words = []
    confidences = []
    for i, text in enumerate(data["text"]):
        conf = float(data["conf"][i])  # conf is a string in some pytesseract versions
        if conf > 0 and text.strip():
            words.append(text.strip())
            confidences.append(conf)

    full_text = " ".join(words)
    avg_confidence = (
        sum(confidences) / len(confidences) if confidences else 0.0
    )

    # Build text blocks by grouping lines
    blocks = []
    current_block = []
    current_block_num = -1
    for i, text in enumerate(data["text"]):
        if not text.strip():
            continue
        block_num = data["block_num"][i]
        if block_num != current_block_num:
            if current_block:
                blocks.append({"text": " ".join(current_block)})
            current_block = [text.strip()]
            current_block_num = block_num
        else:
            current_block.append(text.strip())
    if current_block:
        blocks.append({"text": " ".join(current_block)})

    return OCRResult(
        full_text=full_text,
        confidence=avg_confidence,
        word_count=len(words),
        blocks=blocks,
    )
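The block-grouping loop can be exercised without a Tesseract install by feeding it a hand-made stand-in for the Output.DICT structure (the values below are invented for illustration):

```python
# Exercise the block-grouping logic with a fake pytesseract
# Output.DICT payload -- no Tesseract binary required.
fake_data = {
    "text": ["Invoice", "#1042", "", "Total:", "$99.00"],
    "conf": [96, 91, -1, 88, 93],
    "block_num": [1, 1, 1, 2, 2],
}

blocks = []
current_block = []
current_block_num = -1
for i, text in enumerate(fake_data["text"]):
    if not text.strip():
        continue  # Tesseract emits empty cells for layout gaps
    block_num = fake_data["block_num"][i]
    if block_num != current_block_num:
        if current_block:
            blocks.append({"text": " ".join(current_block)})
        current_block = [text.strip()]
        current_block_num = block_num
    else:
        current_block.append(text.strip())
if current_block:
    blocks.append({"text": " ".join(current_block)})

print(blocks)  # [{'text': 'Invoice #1042'}, {'text': 'Total: $99.00'}]
```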

Object Detection with YOLO

The YOLO model identifies objects and their locations within an image:

from ultralytics import YOLO


@dataclass
class DetectedObject:
    label: str
    confidence: float
    bbox: tuple[int, int, int, int]  # x1, y1, x2, y2


_yolo_model = None  # cache the model so the weights load only once


def detect_objects(
    img: Image.Image, confidence_threshold: float = 0.5
) -> list[DetectedObject]:
    """Detect objects in an image using YOLOv8."""
    global _yolo_model
    if _yolo_model is None:
        _yolo_model = YOLO("yolov8n.pt")  # nano model for speed
    results = _yolo_model(img, verbose=False)

    detected = []
    for result in results:
        for box in result.boxes:
            conf = float(box.conf[0])
            if conf >= confidence_threshold:
                x1, y1, x2, y2 = box.xyxy[0].tolist()
                label = result.names[int(box.cls[0])]
                detected.append(DetectedObject(
                    label=label,
                    confidence=round(conf, 3),
                    bbox=(int(x1), int(y1), int(x2), int(y2)),
                ))
    return detected
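Once detections come back as dataclasses, answering a "how many" question is a one-line aggregation. A self-contained sketch, with invented detections standing in for real YOLO output:

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class DetectedObject:
    label: str
    confidence: float
    bbox: tuple[int, int, int, int]


# Invented detections standing in for real YOLO output
detections = [
    DetectedObject("car", 0.91, (10, 20, 200, 120)),
    DetectedObject("car", 0.78, (220, 30, 400, 140)),
    DetectedObject("person", 0.85, (50, 60, 90, 180)),
]

# Count objects per label to answer "how many" questions
counts = Counter(obj.label for obj in detections)
print(dict(counts))  # {'car': 2, 'person': 1}
```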

The Agent: Routing Questions to Tools

The agent decides which tools to use based on the user's question. A keyword-based router works well for most cases:

import openai
import base64


class ImageAnalysisAgent:
    def __init__(self):
        self.client = openai.AsyncOpenAI()

    def _select_tools(self, question: str) -> list[str]:
        """Select which tools to run based on the question."""
        q = question.lower()
        tools = []
        if any(kw in q for kw in ["text", "read", "ocr", "written", "says"]):
            tools.append("ocr")
        if any(kw in q for kw in ["object", "detect", "find", "count", "how many"]):
            tools.append("detection")
        # Always include VQA as the reasoning backbone
        tools.append("vqa")
        return tools

    async def analyze(
        self, image_bytes: bytes, question: str
    ) -> dict:
        selected_tools = self._select_tools(question)
        context_parts = []

        img = preprocess_image(image_bytes)

        if "ocr" in selected_tools:
            ocr_result = extract_text(
                preprocess_image(image_bytes, enhance_for_ocr=True)
            )
            context_parts.append(
                f"OCR extracted text ({ocr_result.word_count} words, "
                f"confidence {ocr_result.confidence:.1f}%): "
                f"{ocr_result.full_text}"
            )

        if "detection" in selected_tools:
            objects = detect_objects(img)
            obj_summary = ", ".join(
                f"{o.label} ({o.confidence:.0%})" for o in objects
            )
            context_parts.append(
                f"Detected objects: {obj_summary or 'none'}"
            )

        # VQA with GPT-4o, enriched by tool outputs. The data URL
        # below declares PNG; adjust the MIME type if the source
        # bytes are JPEG or another format.
        b64 = base64.b64encode(image_bytes).decode()
        tool_context = "\n".join(context_parts)

        response = await self.client.chat.completions.create(
            model="gpt-4o",
            messages=[{
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": (
                            f"Tool analysis results:\n{tool_context}\n\n"
                            f"Question: {question}"
                        ),
                    },
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/png;base64,{b64}"},
                    },
                ],
            }],
        )
        return {
            "answer": response.choices[0].message.content,
            "tools_used": selected_tools,
        }
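The keyword router is pure string logic, so it can be unit-tested without an API key. A standalone copy of the same heuristic:

```python
# Standalone copy of the agent's keyword router, testable offline
def select_tools(question: str) -> list[str]:
    q = question.lower()
    tools = []
    if any(kw in q for kw in ["text", "read", "ocr", "written", "says"]):
        tools.append("ocr")
    if any(kw in q for kw in ["object", "detect", "find", "count", "how many"]):
        tools.append("detection")
    tools.append("vqa")  # VQA always runs as the reasoning backbone
    return tools


print(select_tools("Read the text on the sign"))   # ['ocr', 'vqa']
print(select_tools("How many cars are there?"))    # ['detection', 'vqa']
print(select_tools("Describe the overall scene"))  # ['vqa']
```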

Structured Output Formatting

For programmatic consumers, format the analysis results as structured JSON:

from pydantic import BaseModel


class ImageAnalysisResult(BaseModel):
    answer: str
    extracted_text: str | None = None
    detected_objects: list[dict] | None = None
    tools_used: list[str]
    confidence: float

FAQ

When should I use OCR versus a vision language model for text extraction?

Use Tesseract OCR when you need precise character-level extraction from clean documents, invoices, or printed text. Use a vision language model like GPT-4o when the text is embedded in complex scenes, handwritten, or when you also need to understand the context around the text. For best results, run both and let the agent cross-reference the outputs.

How do I handle images that are too large for the API?

Resize images to a maximum dimension of 2048 pixels while preserving the aspect ratio, as shown in the preprocessing function. For GPT-4o specifically, the API automatically handles resizing, but sending smaller images reduces latency and cost. If detail is critical for a specific region, crop that region and send it as a separate analysis request.
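A minimal Pillow sketch of that cropping step, using a blank in-memory image as a stand-in for a real photo:

```python
from PIL import Image

# Blank 4000x3000 image standing in for a real high-resolution photo
img = Image.new("RGB", (4000, 3000), color="white")

# Crop the region of interest (left, upper, right, lower) and send it
# as a separate full-resolution analysis request
region = img.crop((1000, 500, 2000, 1500))
print(region.size)  # (1000, 1000)
```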

Can this agent process multiple images in a single request?

Yes. Extend the analyze method to accept a list of image bytes. Process each image independently through the tool pipeline, then send all results along with all images to the VQA step. GPT-4o supports multiple images in a single message, so the reasoning model can compare and cross-reference across images.
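A sketch of how that multi-image message could be assembled in the Chat Completions content format (the byte strings below are placeholders for real image data):

```python
import base64

# Placeholder byte strings; real code would read actual image files
images = [b"fake-image-1", b"fake-image-2"]

# One text part followed by one image_url part per image
content = [
    {"type": "text", "text": "Compare these images: which has more text?"}
]
for img_bytes in images:
    b64 = base64.b64encode(img_bytes).decode()
    content.append({
        "type": "image_url",
        "image_url": {"url": f"data:image/png;base64,{b64}"},
    })

messages = [{"role": "user", "content": content}]
print(len(content))  # 1 text part + 2 image parts = 3
```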


#ImageAnalysis #OCR #ObjectDetection #VisualQA #ComputerVision #AgenticAI #LearnAI #AIEngineering


CallSphere Team

Expert insights on AI voice agents and customer communication automation.
