Multi-Modal Prompting: Combining Text, Images, and Code in Single Prompts
Master multi-modal prompting techniques that combine text, images, and code inputs in a single prompt to unlock more capable and context-rich LLM interactions.
Beyond Text-Only Interactions
Multi-modal models like GPT-4o, Claude, and Gemini accept not just text but images, documents, and structured data in a single prompt. This opens up use cases that were impossible with text-only prompting — analyzing screenshots, interpreting charts, debugging UI layouts, and reasoning over diagrams alongside natural language instructions.
The challenge is learning how to structure these mixed-modality prompts effectively. A poorly structured multi-modal prompt wastes the model's attention on irrelevant visual details or fails to connect the image content to the text instructions.
Vision Plus Text: The Basics
The most common multi-modal pattern combines an image with a text instruction. The key is being specific about what the model should focus on in the image:
```python
import base64
from pathlib import Path

import openai

client = openai.OpenAI()


def encode_image(image_path: str) -> str:
    """Encode an image to base64 for the API."""
    image_data = Path(image_path).read_bytes()
    return base64.b64encode(image_data).decode("utf-8")


def analyze_image(
    image_path: str,
    instruction: str,
    detail: str = "high",
) -> str:
    """Analyze an image with a specific text instruction."""
    base64_image = encode_image(image_path)
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "user", "content": [
                {"type": "text", "text": instruction},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/png;base64,{base64_image}",
                        "detail": detail,
                    },
                },
            ]},
        ],
    )
    return response.choices[0].message.content


# A specific instruction beats a generic "describe this image".
result = analyze_image(
    "dashboard_screenshot.png",
    "Identify all error states visible in this dashboard screenshot. "
    "For each error, note the component name, the error message, "
    "and suggest a likely root cause based on the displayed data.",
)
```
The detail parameter matters for cost and quality. Use "high" when the image contains small text, code, or fine details. Use "low" for simple diagrams or when you only need a general understanding.
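One way to make that rule of thumb concrete is a small helper that picks the detail level from the image dimensions. The function name and the 768px threshold below are our own illustrative assumptions, not values from any provider's documentation:

```python
def choose_detail(width: int, height: int, has_fine_text: bool = False) -> str:
    """Heuristically pick a detail level for an image.

    Assumption: images containing small text always warrant "high";
    otherwise, small images gain little from the extra cost.
    """
    if has_fine_text:
        return "high"
    # Illustrative threshold: below ~768px on the long side, "high"
    # mode adds cost without adding much usable detail.
    return "high" if max(width, height) > 768 else "low"
```

In practice you would pass the result straight into `analyze_image` as its `detail` argument.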
Multi-Image Comparison Prompts
You can include multiple images in a single prompt for comparison tasks:
```python
def compare_designs(
    before_path: str,
    after_path: str,
    focus_areas: list[str],
) -> str:
    """Compare two UI designs and identify differences."""
    before_b64 = encode_image(before_path)
    after_b64 = encode_image(after_path)
    focus = ", ".join(focus_areas)
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "user", "content": [
                {"type": "text", "text": (
                    "Compare these two UI designs. The first image is "
                    "the BEFORE state and the second is the AFTER state. "
                    f"Focus specifically on: {focus}. "
                    "List every visual difference you find."
                )},
                {"type": "image_url", "image_url": {
                    "url": f"data:image/png;base64,{before_b64}",
                    "detail": "high",
                }},
                {"type": "image_url", "image_url": {
                    "url": f"data:image/png;base64,{after_b64}",
                    "detail": "high",
                }},
            ]},
        ],
    )
    return response.choices[0].message.content
```
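The before/after pattern generalizes to any number of labeled images. A minimal sketch of a content-array builder (the function name and labeling scheme are ours, not part of the OpenAI API) that interleaves a text label before each image:

```python
def build_labeled_image_content(
    instruction: str,
    images_b64: list[str],
    labels: list[str],
    detail: str = "high",
) -> list[dict]:
    """Build an OpenAI-style content array: one instruction text part,
    then each image preceded by its label so the model can refer to
    images by name instead of position."""
    content: list[dict] = [{"type": "text", "text": instruction}]
    for label, b64 in zip(labels, images_b64):
        content.append({"type": "text", "text": f"{label}:"})
        content.append({
            "type": "image_url",
            "image_url": {
                "url": f"data:image/png;base64,{b64}",
                "detail": detail,
            },
        })
    return content
```

The resulting list drops straight into the `content` field of a user message.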
Code Plus Text: Structured Analysis
Combining code snippets with natural language context produces better analysis than either alone:
```python
def review_code_with_context(
    code: str,
    language: str,
    architecture_description: str,
    review_focus: list[str],
) -> str:
    """Review code with architectural context."""
    focus_items = "\n".join(f"- {f}" for f in review_focus)
    prompt = (
        f"## Architecture Context\n\n{architecture_description}\n\n"
        f"## Code to Review\n\n"
        f"~~~{language}\n{code}\n~~~\n\n"
        f"## Review Focus Areas\n\n{focus_items}\n\n"
        "Provide a structured review addressing each focus area. "
        "Reference specific line numbers and suggest concrete fixes."
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```
Structured Multi-Modal Inputs
For complex tasks, structure your multi-modal prompt with clear sections:
```python
def structured_multimodal_prompt(
    text_context: str,
    image_paths: list[str],
    code_snippet: str,
    task: str,
) -> str:
    """Build and send a structured multi-modal prompt."""
    content = [
        {"type": "text", "text": (
            f"## Task\n\n{task}\n\n"
            f"## Context\n\n{text_context}\n\n"
            f"## Relevant Code\n\n~~~python\n{code_snippet}\n~~~\n\n"
            "## Visual References\n\n"
            "Analyze the following images in order:"
        )},
    ]
    for i, path in enumerate(image_paths):
        content.append(
            {"type": "text", "text": f"\nImage {i + 1}:"}
        )
        content.append({
            "type": "image_url",
            "image_url": {
                "url": f"data:image/png;base64,{encode_image(path)}",
                "detail": "high",
            },
        })
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": content}],
    )
    return response.choices[0].message.content
```
The pattern of labeling images ("Image 1:", "Image 2:") and providing context before the images helps the model understand the relationship between modalities. Without this structure, the model may describe each image independently rather than integrating information across all inputs.
FAQ
Do all models support multi-modal prompts the same way?
No. The API format varies by provider. OpenAI uses content arrays with type: "text" and type: "image_url" objects. Anthropic uses type: "image" with base64 data in a source block. Google Gemini uses inline_data with mime_type. Always check the provider's documentation for the exact format.
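As an illustration of how the formats diverge, an image part in Anthropic's Messages API uses a nested source block rather than a data URL. A minimal sketch (verify against Anthropic's current docs before relying on it):

```python
def anthropic_image_block(b64_data: str, media_type: str = "image/png") -> dict:
    """Build an image content block in the Anthropic Messages API shape:
    type "image" with base64 data inside a "source" object, instead of
    OpenAI's "image_url" with a data: URL."""
    return {
        "type": "image",
        "source": {
            "type": "base64",
            "media_type": media_type,
            "data": b64_data,
        },
    }
```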
How does image resolution affect quality and cost?
Higher resolution images consume more tokens. GPT-4o's detail: "high" mode scales the image (to fit within 2048x2048, then so the shortest side is at most 768px), tiles it into 512x512 patches, and charges roughly 170 tokens per tile plus an 85-token base. A 2048x2048 image therefore costs about 765 tokens. Use detail: "low" (a flat 85 tokens) when fine detail is not needed to save significantly on cost.
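That scaling-and-tiling formula can be sketched as a small estimator, following the numbers OpenAI has published for GPT-4o (exact constants may change between model versions, so treat this as an approximation):

```python
import math


def estimate_image_tokens(width: int, height: int, detail: str = "high") -> int:
    """Approximate GPT-4o image token cost.

    "low" is a flat 85 tokens. "high" scales the image to fit within
    2048x2048, then scales the shortest side down to at most 768px,
    tiles it into 512x512 patches, and charges 170 tokens per tile
    plus an 85-token base.
    """
    if detail == "low":
        return 85
    # Scale to fit within a 2048 x 2048 box (never upscale).
    scale = min(1.0, 2048 / max(width, height))
    width, height = width * scale, height * scale
    # Scale so the shortest side is at most 768px (never upscale).
    scale = min(1.0, 768 / min(width, height))
    width, height = width * scale, height * scale
    tiles = math.ceil(width / 512) * math.ceil(height / 512)
    return 85 + 170 * tiles
```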
Can I combine images with tool-use in a single interaction?
Yes. Multi-modal inputs work alongside function calling and tool use. A practical example is an agent that receives a screenshot, uses vision to understand the UI state, calls a tool to interact with the application, and then takes another screenshot to verify the result — all within a single agent loop.
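A simplified sketch of the verification step in such a loop: after the tool runs, the agent sends a follow-up user message pairing the tool's textual result with a fresh screenshot. (The function name and message wording are ours; a production loop would also carry tool-role messages per the provider's function-calling protocol.)

```python
def verification_message(tool_result: str, screenshot_b64: str) -> dict:
    """Build a follow-up user message that pairs a tool's textual
    result with a post-action screenshot, so the model can visually
    confirm the action had the intended effect."""
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": (
                f"The tool call returned: {tool_result}\n"
                "Here is a screenshot taken after the action. "
                "Confirm whether the intended change is visible."
            )},
            {"type": "image_url", "image_url": {
                "url": f"data:image/png;base64,{screenshot_b64}",
                "detail": "high",
            }},
        ],
    }
```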
CallSphere Team
Expert insights on AI voice agents and customer communication automation.