
Building a Visual QA Agent: Answering Natural Language Questions About Any Image

Build a visual question answering agent that understands images, routes questions to specialized analysis modules, performs multi-modal reasoning, and generates accurate natural language answers about any image content.

What Is Visual Question Answering?

Visual Question Answering (VQA) is the task of answering natural language questions about an image. "How many people are in this photo?" "What color is the car?" "Is the door open or closed?" "What brand is the laptop?" These questions seem trivial to humans but require an AI to combine visual perception with language understanding and commonsense reasoning.

A Visual QA agent goes beyond simple VQA models by routing questions to specialized tools when needed, maintaining conversation context across multiple questions about the same image, and providing explanations for its answers.

Agent Architecture

The agent has three main components:

  1. Question classifier — determines what type of analysis the question requires
  2. Specialized analyzers — focused tools for counting, color analysis, text reading, spatial reasoning, etc.
  3. Answer generator — synthesizes analyzer outputs into a natural language response

The Question Router

Not all questions need the same analysis. "What text is on the sign?" needs OCR. "How many cars?" needs object detection. "Is this a happy scene?" needs sentiment analysis. Route accordingly:

from enum import Enum
from dataclasses import dataclass
from openai import OpenAI


class QuestionType(Enum):
    COUNTING = "counting"          # "How many X?"
    COLOR = "color"                # "What color is X?"
    SPATIAL = "spatial"            # "Where is X?" "Is X near Y?"
    TEXT_READING = "text_reading"  # "What does the sign say?"
    IDENTIFICATION = "identification"  # "What is this?" "What brand?"
    COMPARISON = "comparison"      # "Which is bigger?"
    SCENE = "scene"                # "Describe the scene"
    YES_NO = "yes_no"             # "Is there a X?"
    GENERAL = "general"            # Anything else


@dataclass
class ClassifiedQuestion:
    original: str
    question_type: QuestionType
    target_objects: list[str]
    requires_tools: list[str]


def classify_question(question: str) -> ClassifiedQuestion:
    """Classify a visual question to determine analysis strategy."""
    client = OpenAI()

    response = client.chat.completions.create(
        model="gpt-4o",
        # Force parseable JSON so json.loads below cannot fail on
        # markdown-wrapped output
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": (
                "Classify this visual question. Return JSON with: "
                "question_type (counting, color, spatial, text_reading, "
                "identification, comparison, scene, yes_no, general), "
                "target_objects (list of objects mentioned), "
                "requires_tools (list from: object_detection, ocr, "
                "color_analysis, spatial_analysis, scene_description). "
                "Return ONLY valid JSON."
            )},
            {"role": "user", "content": question},
        ],
    )

    import json
    parsed = json.loads(response.choices[0].message.content)

    return ClassifiedQuestion(
        original=question,
        question_type=QuestionType(parsed["question_type"]),
        target_objects=parsed.get("target_objects", []),
        requires_tools=parsed.get("requires_tools", []),
    )
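The classifier above costs an API round trip and can fail when the network or the API is down. As a zero-cost fallback, a keyword heuristic covers the common cases. The sketch below returns the same string values the QuestionType enum uses; the patterns are illustrative, not exhaustive, and first match wins:

```python
import re

# Ordered (type, pattern) rules; first match wins. Patterns are
# illustrative starting points, not a complete taxonomy.
FALLBACK_RULES = [
    ("counting", r"\bhow many\b"),
    ("color", r"\bcolou?rs?\b"),
    ("text_reading", r"\b(say|says|text|written|read)\b"),
    ("spatial", r"\b(where|near|next to|behind|in front)\b"),
    ("comparison", r"\b(bigger|smaller|larger|taller|which)\b"),
    ("scene", r"\b(describe|scene)\b"),
    ("yes_no", r"^(is|are|was|were|does|do|can)\b"),
]


def classify_fallback(question: str) -> str:
    """Return a QuestionType value string for a question."""
    q = question.lower().strip()
    for qtype, pattern in FALLBACK_RULES:
        if re.search(pattern, q):
            return qtype
    return "general"
```

Wrap the LLM call in a try/except and fall back to this when it raises; the agent degrades gracefully instead of failing outright.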

Specialized Analysis Tools

Each tool handles a specific type of visual analysis:


import cv2
import numpy as np
from PIL import Image
import base64
import io


def tool_count_objects(
    image: np.ndarray,
    target_class: str,
) -> dict:
    """Count specific objects in an image using detection."""
    # run_detection wraps the YOLO detection pipeline from the previous
    # post; it returns a list of {"class", "score", "bbox"} dicts
    detections = run_detection(image)

    matching = [d for d in detections
                if d["class"].lower() == target_class.lower()]

    return {
        "count": len(matching),
        "target": target_class,
        "locations": [d["bbox"] for d in matching],
        "confidence": np.mean([d["score"] for d in matching]) if matching else 0,
    }


def tool_analyze_colors(
    image: np.ndarray,
    region: tuple | None = None,
) -> dict:
    """Analyze dominant colors in an image or region."""
    if region:
        x1, y1, x2, y2 = region
        roi = image[y1:y2, x1:x2]
    else:
        roi = image

    # Convert to RGB and reshape for clustering
    rgb = cv2.cvtColor(roi, cv2.COLOR_BGR2RGB)
    pixels = rgb.reshape(-1, 3).astype(np.float32)

    # K-means clustering for dominant colors
    criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 20, 1.0)
    k = 5
    _, labels, centers = cv2.kmeans(
        pixels, k, None, criteria, 10, cv2.KMEANS_RANDOM_CENTERS
    )

    # Count pixels per cluster
    counts = np.bincount(labels.flatten())
    dominant_idx = np.argsort(-counts)

    colors = []
    for idx in dominant_idx[:3]:
        r, g, b = centers[idx].astype(int)
        color_name = rgb_to_name(r, g, b)
        colors.append({
            "name": color_name,
            "rgb": (int(r), int(g), int(b)),
            "percentage": float(counts[idx] / len(labels) * 100),
        })

    return {"dominant_colors": colors}


def rgb_to_name(r: int, g: int, b: int) -> str:
    """Convert RGB values to a human-readable color name."""
    colors = {
        "red": (255, 0, 0), "green": (0, 128, 0),
        "blue": (0, 0, 255), "yellow": (255, 255, 0),
        "white": (255, 255, 255), "black": (0, 0, 0),
        "orange": (255, 165, 0), "purple": (128, 0, 128),
        "brown": (139, 69, 19), "gray": (128, 128, 128),
        "pink": (255, 192, 203),
    }

    min_dist = float("inf")
    closest = "unknown"

    for name, (cr, cg, cb) in colors.items():
        dist = np.sqrt((r - cr)**2 + (g - cg)**2 + (b - cb)**2)
        if dist < min_dist:
            min_dist = dist
            closest = name

    return closest
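Nearest-neighbor matching in raw RGB space can misname dark or desaturated colors, because Euclidean distance there tracks human perception poorly. A common alternative is to classify in HSV: value separates black, saturation separates white and gray, and hue bands cover the chromatic names. A coarse sketch using the standard library's colorsys (band thresholds are illustrative, and lightness-dependent names like brown and pink are omitted):

```python
import colorsys


def rgb_to_name_hsv(r: int, g: int, b: int) -> str:
    """Name a color via HSV bands instead of RGB distance."""
    h, s, v = colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)
    if v < 0.15:
        return "black"          # too dark to have a visible hue
    if s < 0.15:
        return "white" if v > 0.85 else "gray"  # achromatic
    deg = h * 360
    if deg < 15 or deg >= 330:
        return "red"
    if deg < 45:
        return "orange"
    if deg < 70:
        return "yellow"
    if deg < 170:
        return "green"
    if deg < 260:
        return "blue"
    return "purple"             # 260-330: violet/magenta range
```

Either version works for answering "what color is X?"; the HSV variant is more robust for shaded or washed-out regions.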


def tool_read_text(image: np.ndarray) -> dict:
    """Extract text from an image using OCR."""
    import pytesseract

    pil_img = Image.fromarray(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))
    text = pytesseract.image_to_string(pil_img).strip()

    return {
        "text_found": bool(text),
        "text": text,
        "word_count": len(text.split()) if text else 0,
    }

The Visual QA Agent

The agent orchestrates the full pipeline — receive a question, classify it, run the appropriate tools, and generate an answer:

class VisualQAAgent:
    """Agent that answers natural language questions about images."""

    def __init__(self):
        self.client = OpenAI()
        self.conversation_history: list[dict] = []
        self.current_image: np.ndarray | None = None
        self.image_b64: str | None = None

    def load_image(self, image_path: str):
        """Load an image for analysis."""
        self.current_image = cv2.imread(image_path)
        if self.current_image is None:
            raise FileNotFoundError(f"Could not read image: {image_path}")
        pil_img = Image.open(image_path)
        buffer = io.BytesIO()
        pil_img.save(buffer, format="PNG")
        self.image_b64 = base64.b64encode(
            buffer.getvalue()
        ).decode()
        self.conversation_history = []

    def ask(self, question: str) -> str:
        """Answer a question about the loaded image."""
        if self.current_image is None:
            return "No image loaded. Call load_image() first."

        # Classify the question
        classified = classify_question(question)

        # Run specialized tools
        tool_results = {}
        for tool_name in classified.requires_tools:
            if tool_name == "object_detection":
                for target in classified.target_objects:
                    result = tool_count_objects(
                        self.current_image, target
                    )
                    tool_results[f"detection_{target}"] = result

            elif tool_name == "color_analysis":
                tool_results["colors"] = tool_analyze_colors(
                    self.current_image
                )

            elif tool_name == "ocr":
                tool_results["text"] = tool_read_text(
                    self.current_image
                )

        # Generate answer using vision LLM + tool results
        return self._generate_answer(question, tool_results)

    def _generate_answer(
        self,
        question: str,
        tool_results: dict,
    ) -> str:
        """Generate a natural language answer."""
        messages = [
            {"role": "system", "content": (
                "You are a visual question answering assistant. "
                "Answer the question about the image using the provided "
                "analysis results when available. Be concise and accurate. "
                "If uncertain, say so. Do not hallucinate details."
            )},
        ]

        # Include conversation history for context
        messages.extend(self.conversation_history[-6:])

        user_content = [
            {"type": "image_url", "image_url": {
                "url": f"data:image/png;base64,{self.image_b64}",
            }},
            {"type": "text", "text": question},
        ]

        if tool_results:
            import json
            user_content.append({
                "type": "text",
                "text": f"Analysis results: {json.dumps(tool_results)}",
            })

        messages.append({"role": "user", "content": user_content})

        response = self.client.chat.completions.create(
            model="gpt-4o",
            messages=messages,
        )

        answer = response.choices[0].message.content

        # Update conversation history
        self.conversation_history.append(
            {"role": "user", "content": question}
        )
        self.conversation_history.append(
            {"role": "assistant", "content": answer}
        )

        return answer

Using the Agent

agent = VisualQAAgent()
agent.load_image("street_scene.jpg")

print(agent.ask("How many cars are in this image?"))
print(agent.ask("What color is the largest car?"))
print(agent.ask("Are there any pedestrians near it?"))
print(agent.ask("What does the street sign say?"))

The agent maintains conversation context, so follow-up questions like "Are there any pedestrians near it?" correctly resolve "it" to the car discussed in the previous answer.
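The context window comes from the `self.conversation_history[-6:]` slice in `_generate_answer`, which keeps the last three question/answer pairs. If you manage history externally, it is worth trimming at pair boundaries so the model never sees an answer without its question. A small helper sketching that (the orphan check matters only if history can start mid-turn):

```python
def trim_history(history: list[dict], max_turns: int = 3) -> list[dict]:
    """Keep the last `max_turns` question/answer pairs of a chat history."""
    trimmed = history[-2 * max_turns:]
    # If the window starts on an assistant message, the matching user
    # question was cut off; drop the orphan so turns stay paired.
    if trimmed and trimmed[0]["role"] == "assistant":
        trimmed = trimmed[1:]
    return trimmed
```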

Handling Edge Cases

def validate_answer_confidence(
    question: str,
    answer: str,
    tool_results: dict,
) -> dict:
    """Assess confidence in the generated answer."""
    has_tool_support = bool(tool_results)
    confidences = [
        r["confidence"]
        for r in tool_results.values()
        if isinstance(r, dict) and "confidence" in r
    ]
    # Guard against np.mean([]) returning NaN when no tool reports confidence
    tool_confidence = float(np.mean(confidences)) if confidences else 0.0

    hedging_words = ["might", "possibly", "uncertain", "not sure", "unclear"]
    has_hedging = any(w in answer.lower() for w in hedging_words)

    return {
        "answer": answer,
        "has_tool_support": has_tool_support,
        "tool_confidence": round(tool_confidence, 2),
        "answer_has_hedging": has_hedging,
        "overall_confidence": "high" if (
            has_tool_support and tool_confidence > 0.8 and not has_hedging
        ) else "medium" if has_tool_support else "low",
    }

FAQ

How do vision LLMs handle ambiguous or trick questions?

Modern vision LLMs like GPT-4o are reasonably good at recognizing ambiguous questions and qualifying their answers. For example, if asked "What time is it?" about a photo with a blurry clock, it will typically say the time is difficult to read rather than guessing. However, they can still hallucinate details, especially about small or partially occluded objects. Always validate critical answers with specialized tools rather than relying solely on the LLM.

Can the agent handle questions that require world knowledge beyond the image?

Yes, this is one of the strengths of using an LLM as the answer generator. Questions like "Is this car expensive?" or "Is this building Art Deco style?" require knowledge beyond what is in the pixels. The LLM brings world knowledge to bear, combining what it sees in the image with what it knows about car brands, architectural styles, and cultural context.

How do I optimize latency for real-time VQA applications?

The main bottleneck is the vision LLM call. Optimize by: (1) caching image embeddings so repeated questions about the same image skip re-encoding, (2) running specialized tools only when the question classifier says they are needed, (3) using smaller/faster models for simple questions (yes/no, counting) and reserving the full vision LLM for complex reasoning questions. Pre-computing a scene description when the image is first loaded also helps — many questions can be answered from the cached description without another LLM call.
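The pre-computed scene description can be memoized by image content, so repeated or follow-up questions about the same image never pay for a second description call. A minimal sketch (`cached_scene_description` and `describe_fn` are hypothetical names; `describe_fn` stands in for the expensive vision-LLM call):

```python
import hashlib

# Cache keyed by image content hash, so the same bytes loaded twice
# (e.g. from different paths) still hit the cache.
_scene_cache: dict[str, str] = {}


def cached_scene_description(image_bytes: bytes, describe_fn) -> str:
    """Compute the scene description at most once per unique image."""
    key = hashlib.sha256(image_bytes).hexdigest()
    if key not in _scene_cache:
        _scene_cache[key] = describe_fn(image_bytes)  # the expensive call
    return _scene_cache[key]
```

The same pattern applies to per-image tool outputs (detections, OCR text): hash once at load time and reuse results across questions.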


#VisualQA #MultiModalAI #ImageUnderstanding #VLM #QuestionAnswering #ComputerVision #AgenticAI #Python


CallSphere Team

Expert insights on AI voice agents and customer communication automation.
