Multi-Modal Prompting: Combining Text, Images, and Code in Single Prompts
Master multi-modal prompting techniques that combine text, images, and code inputs in a single prompt to unlock more capable and context-rich LLM interactions.
Beyond Text-Only Interactions
Multi-modal models like GPT-4o, Claude, and Gemini accept not just text but images, documents, and structured data in a single prompt. This opens up use cases that were impossible with text-only prompting — analyzing screenshots, interpreting charts, debugging UI layouts, and reasoning over diagrams alongside natural language instructions.
The challenge is learning how to structure these mixed-modality prompts effectively. A poorly structured multi-modal prompt wastes the model's attention on irrelevant visual details or fails to connect the image content to the text instructions.
Vision Plus Text: The Basics
The most common multi-modal pattern combines an image with a text instruction. The key is being specific about what the model should focus on in the image:
```python
import base64
from pathlib import Path

import openai

client = openai.OpenAI()


def encode_image(image_path: str) -> str:
    """Encode an image to base64 for the API."""
    image_data = Path(image_path).read_bytes()
    return base64.b64encode(image_data).decode("utf-8")


def analyze_image(
    image_path: str,
    instruction: str,
    detail: str = "high",
) -> str:
    """Analyze an image with a specific text instruction."""
    base64_image = encode_image(image_path)
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "user", "content": [
                {"type": "text", "text": instruction},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/png;base64,{base64_image}",
                        "detail": detail,
                    },
                },
            ]},
        ],
    )
    return response.choices[0].message.content


# A specific instruction beats a generic "describe this image".
result = analyze_image(
    "dashboard_screenshot.png",
    "Identify all error states visible in this dashboard screenshot. "
    "For each error, note the component name, the error message, "
    "and suggest a likely root cause based on the displayed data.",
)
```
The detail parameter matters for cost and quality. Use "high" when the image contains small text, code, or fine details. Use "low" for simple diagrams or when you only need a general understanding.
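One way to make that rule of thumb concrete is a small helper that picks the detail level from the image dimensions. The function name and the 768px threshold below are our own illustrative assumptions, not values from any provider's documentation:

```python
def choose_detail(width: int, height: int, has_fine_text: bool = False) -> str:
    """Heuristically pick a detail level for an image.

    Assumption: images containing small text always warrant "high";
    otherwise, small images gain little from the extra cost.
    """
    if has_fine_text:
        return "high"
    # Illustrative threshold: below ~768px on the long side, "high"
    # mode adds cost without adding much usable detail.
    return "high" if max(width, height) > 768 else "low"
```

In practice you would pass the result straight into `analyze_image` as its `detail` argument.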
Multi-Image Comparison Prompts
You can include multiple images in a single prompt for comparison tasks:
```python
def compare_designs(
    before_path: str,
    after_path: str,
    focus_areas: list[str],
) -> str:
    """Compare two UI designs and identify differences."""
    before_b64 = encode_image(before_path)
    after_b64 = encode_image(after_path)
    focus = ", ".join(focus_areas)
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "user", "content": [
                {"type": "text", "text": (
                    "Compare these two UI designs. The first image is "
                    "the BEFORE state and the second is the AFTER state. "
                    f"Focus specifically on: {focus}. "
                    "List every visual difference you find."
                )},
                {"type": "image_url", "image_url": {
                    "url": f"data:image/png;base64,{before_b64}",
                    "detail": "high",
                }},
                {"type": "image_url", "image_url": {
                    "url": f"data:image/png;base64,{after_b64}",
                    "detail": "high",
                }},
            ]},
        ],
    )
    return response.choices[0].message.content
```
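The before/after pattern generalizes to any number of labeled images. A minimal sketch of a content-array builder (the function name and labeling scheme are ours, not part of the OpenAI API) that interleaves a text label before each image:

```python
def build_labeled_image_content(
    instruction: str,
    images_b64: list[str],
    labels: list[str],
    detail: str = "high",
) -> list[dict]:
    """Build an OpenAI-style content array: one instruction text part,
    then each image preceded by its label so the model can refer to
    images by name instead of position."""
    content: list[dict] = [{"type": "text", "text": instruction}]
    for label, b64 in zip(labels, images_b64):
        content.append({"type": "text", "text": f"{label}:"})
        content.append({
            "type": "image_url",
            "image_url": {
                "url": f"data:image/png;base64,{b64}",
                "detail": detail,
            },
        })
    return content
```

The resulting list drops straight into the `content` field of a user message.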
Code Plus Text: Structured Analysis
Combining code snippets with natural language context produces better analysis than either alone:
```python
def review_code_with_context(
    code: str,
    language: str,
    architecture_description: str,
    review_focus: list[str],
) -> str:
    """Review code with architectural context."""
    focus_items = "\n".join(f"- {f}" for f in review_focus)
    prompt = (
        f"## Architecture Context\n\n{architecture_description}\n\n"
        f"## Code to Review\n\n"
        f"~~~{language}\n{code}\n~~~\n\n"
        f"## Review Focus Areas\n\n{focus_items}\n\n"
        "Provide a structured review addressing each focus area. "
        "Reference specific line numbers and suggest concrete fixes."
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```
Structured Multi-Modal Inputs
For complex tasks, structure your multi-modal prompt with clear sections:
```python
def structured_multimodal_prompt(
    text_context: str,
    image_paths: list[str],
    code_snippet: str,
    task: str,
) -> str:
    """Build and send a structured multi-modal prompt."""
    content = [
        {"type": "text", "text": (
            f"## Task\n\n{task}\n\n"
            f"## Context\n\n{text_context}\n\n"
            f"## Relevant Code\n\n~~~python\n{code_snippet}\n~~~\n\n"
            "## Visual References\n\n"
            "Analyze the following images in order:"
        )},
    ]
    for i, path in enumerate(image_paths):
        content.append(
            {"type": "text", "text": f"\nImage {i + 1}:"}
        )
        content.append({
            "type": "image_url",
            "image_url": {
                "url": f"data:image/png;base64,{encode_image(path)}",
                "detail": "high",
            },
        })
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": content}],
    )
    return response.choices[0].message.content
```
The pattern of labeling images ("Image 1:", "Image 2:") and providing context before the images helps the model understand the relationship between modalities. Without this structure, the model may describe each image independently rather than integrating information across all inputs.
FAQ
Do all models support multi-modal prompts the same way?
No. The API format varies by provider. OpenAI uses content arrays with type: "text" and type: "image_url" objects. Anthropic uses type: "image" with base64 data in a source block. Google Gemini uses inline_data with mime_type. Always check the provider's documentation for the exact format.
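As an illustration of how the formats diverge, an image part in Anthropic's Messages API uses a nested source block rather than a data URL. A minimal sketch (verify against Anthropic's current docs before relying on it):

```python
def anthropic_image_block(b64_data: str, media_type: str = "image/png") -> dict:
    """Build an image content block in the Anthropic Messages API shape:
    type "image" with base64 data inside a "source" object, instead of
    OpenAI's "image_url" with a data: URL."""
    return {
        "type": "image",
        "source": {
            "type": "base64",
            "media_type": media_type,
            "data": b64_data,
        },
    }
```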
How does image resolution affect quality and cost?
Higher resolution images consume more tokens. GPT-4o's detail: "high" mode scales the image (to fit within 2048x2048, then so the shortest side is at most 768px), tiles it into 512x512 patches, and charges roughly 170 tokens per tile plus an 85-token base. A 2048x2048 image therefore costs about 765 tokens. Use detail: "low" (a flat 85 tokens) when fine detail is not needed to save significantly on cost.
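That scaling-and-tiling formula can be sketched as a small estimator, following the numbers OpenAI has published for GPT-4o (exact constants may change between model versions, so treat this as an approximation):

```python
import math


def estimate_image_tokens(width: int, height: int, detail: str = "high") -> int:
    """Approximate GPT-4o image token cost.

    "low" is a flat 85 tokens. "high" scales the image to fit within
    2048x2048, then scales the shortest side down to at most 768px,
    tiles it into 512x512 patches, and charges 170 tokens per tile
    plus an 85-token base.
    """
    if detail == "low":
        return 85
    # Scale to fit within a 2048 x 2048 box (never upscale).
    scale = min(1.0, 2048 / max(width, height))
    width, height = width * scale, height * scale
    # Scale so the shortest side is at most 768px (never upscale).
    scale = min(1.0, 768 / min(width, height))
    width, height = width * scale, height * scale
    tiles = math.ceil(width / 512) * math.ceil(height / 512)
    return 85 + 170 * tiles
```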
Can I combine images with tool-use in a single interaction?
Yes. Multi-modal inputs work alongside function calling and tool use. A practical example is an agent that receives a screenshot, uses vision to understand the UI state, calls a tool to interact with the application, and then takes another screenshot to verify the result — all within a single agent loop.
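A simplified sketch of the verification step in such a loop: after the tool runs, the agent sends a follow-up user message pairing the tool's textual result with a fresh screenshot. (The function name and message wording are ours; a production loop would also carry tool-role messages per the provider's function-calling protocol.)

```python
def verification_message(tool_result: str, screenshot_b64: str) -> dict:
    """Build a follow-up user message that pairs a tool's textual
    result with a post-action screenshot, so the model can visually
    confirm the action had the intended effect."""
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": (
                f"The tool call returned: {tool_result}\n"
                "Here is a screenshot taken after the action. "
                "Confirm whether the intended change is visible."
            )},
            {"type": "image_url", "image_url": {
                "url": f"data:image/png;base64,{screenshot_b64}",
                "detail": "high",
            }},
        ],
    }
```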
CallSphere Team
Expert insights on AI voice agents and customer communication automation.