
Gemini Multi-Modal Agents: Processing Images, Video, and Audio Together

Build agents that see, hear, and understand multiple media types simultaneously. Learn Gemini's media upload API, inline data handling, video analysis, and audio transcription capabilities.

Why Multi-Modal Agents Matter

Text-only agents miss most of the information in the real world. Documents contain charts and diagrams. Customer support involves screenshots. Security systems produce video feeds. Call centers generate hours of audio. Gemini processes all of these natively in a single model — no separate OCR, speech-to-text, or vision pipelines required.

This unified approach means your agent can reason across modalities. It can look at a screenshot of an error, read the stack trace in the image, correlate it with code you provide as text, and explain the fix — all in one inference call.

Processing Images

The simplest multi-modal interaction sends an image with a text prompt:

import google.generativeai as genai
from pathlib import Path
import os

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])

model = genai.GenerativeModel("gemini-2.0-flash")

# Load image from file
image_path = Path("screenshot.png")
image_data = image_path.read_bytes()

response = model.generate_content([
    "Analyze this UI screenshot. Identify any usability issues and suggest improvements.",
    {"mime_type": "image/png", "data": image_data},
])

print(response.text)

You can also pass multiple images in a single request for comparison tasks:

before = Path("ui_before.png").read_bytes()
after = Path("ui_after.png").read_bytes()

response = model.generate_content([
    "Compare these two UI designs. What changed? Which is better for accessibility?",
    {"mime_type": "image/png", "data": before},
    {"mime_type": "image/png", "data": after},
])

Uploading Large Files with the Files API

For files larger than 20MB, or when you want to reuse media across multiple requests, use the Files API:

# Upload a video file
video_file = genai.upload_file(
    path="meeting_recording.mp4",
    display_name="Team standup March 17",
)

# Wait for server-side processing to complete
import time
while video_file.state.name == "PROCESSING":
    time.sleep(5)
    video_file = genai.get_file(video_file.name)

if video_file.state.name == "FAILED":
    raise ValueError(f"Processing failed for {video_file.display_name}")

print(f"File ready: {video_file.uri}")
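This poll-until-ready loop recurs for every upload, so it is worth factoring out. Here is a minimal sketch — wait_for_active is a hypothetical helper, and the fetch parameter is injectable so the loop can be exercised without the API; in real use you would pass genai.get_file:

```python
import time

def wait_for_active(file_obj, fetch, interval: float = 5.0, timeout: float = 600.0):
    """Poll a Files API object until it leaves PROCESSING; raise on failure or timeout."""
    deadline = time.monotonic() + timeout
    while file_obj.state.name == "PROCESSING":
        if time.monotonic() > deadline:
            raise TimeoutError(f"File {file_obj.name} still processing after {timeout}s")
        time.sleep(interval)
        file_obj = fetch(file_obj.name)  # e.g. genai.get_file
    if file_obj.state.name == "FAILED":
        raise ValueError(f"Processing failed for {file_obj.name}")
    return file_obj
```

The upload above then reduces to `video_file = wait_for_active(video_file, genai.get_file)`.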

Once uploaded, reference the file in your requests:

response = model.generate_content([
    video_file,
    "Summarize this meeting. List action items with the person responsible for each.",
])

print(response.text)

Video Analysis with Timestamps

Gemini can analyze video content and reference specific timestamps:


model = genai.GenerativeModel(
    "gemini-2.0-flash",
    system_instruction="""You are a video analysis agent. When referencing
    moments in the video, always include the timestamp in MM:SS format.""",
)

response = model.generate_content([
    video_file,
    "Identify all the key moments in this product demo. "
    "For each moment, provide the timestamp, what is shown, and why it matters.",
])

print(response.text)
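Because the system instruction pins timestamps to MM:SS, downstream code can pull them out of the response text with a regex. A sketch — parse_timestamps is a hypothetical helper, not an SDK function:

```python
import re

def parse_timestamps(text: str) -> list[int]:
    """Extract MM:SS (or H:MM:SS) timestamps from text, returned as seconds."""
    results = []
    for match in re.finditer(r"\b(?:(\d{1,2}):)?(\d{1,2}):(\d{2})\b", text):
        hours = int(match.group(1) or 0)
        results.append(hours * 3600 + int(match.group(2)) * 60 + int(match.group(3)))
    return results
```

An agent could use the parsed offsets to jump a video player to each key moment or to clip segments for review.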

Gemini samples video at approximately 1 frame per second, so it captures visual changes effectively. Each sampled frame costs roughly 258 tokens, so a 1-hour video consumes on the order of 930K tokens for video frames, plus roughly 32 tokens per second for any audio track — around 1M tokens in total.
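These per-second rates make it easy to budget context before uploading. A back-of-the-envelope estimator, using the approximate 258 tokens-per-frame (at 1 fps) and 32 tokens-per-second figures — actual counts vary:

```python
def estimate_media_tokens(video_seconds: float = 0, audio_seconds: float = 0) -> int:
    """Rough token estimate: ~258 tokens per sampled video frame (1 fps)
    plus ~32 tokens per second of audio."""
    return int(video_seconds * 258 + audio_seconds * 32)
```

An hour of video with its audio track lands near 1M tokens, so it is worth checking the estimate against the model's context window before uploading long recordings.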

Audio Transcription and Analysis

Gemini handles audio natively — no separate speech-to-text step required:

audio_file = genai.upload_file(path="customer_call.wav")

# Wait for processing
import time
while audio_file.state.name == "PROCESSING":
    time.sleep(3)
    audio_file = genai.get_file(audio_file.name)

response = model.generate_content([
    audio_file,
    "Transcribe this customer call. Then analyze the sentiment, "
    "identify the main issue, and rate the agent's performance.",
])

print(response.text)

Supported audio formats include WAV, MP3, AIFF, AAC, OGG, and FLAC. Audio is processed at a rate of approximately 32 tokens per second.
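Validating the format locally avoids a wasted upload round trip. A sketch against the supported-format list above — is_supported_audio is a hypothetical helper:

```python
from pathlib import Path

# Formats listed as supported for Gemini audio input
SUPPORTED_AUDIO = {".wav", ".mp3", ".aiff", ".aac", ".ogg", ".flac"}

def is_supported_audio(path: str) -> bool:
    """Check a file extension against Gemini's supported audio formats."""
    return Path(path).suffix.lower() in SUPPORTED_AUDIO
```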

Building a Multi-Modal Agent

Here is a complete agent that processes mixed media inputs:

import google.generativeai as genai
from pathlib import Path
import os

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])

class MultiModalAgent:
    def __init__(self):
        self.model = genai.GenerativeModel(
            "gemini-2.0-flash",
            system_instruction=(
                "You are a helpful assistant that can analyze text, "
                "images, audio, and video. Always describe what you "
                "observe in each media type before answering questions."
            ),
        )
        self.chat = self.model.start_chat()

    def send(self, text: str, media_paths: list[str] | None = None) -> str:
        import time

        parts = []
        for path in media_paths or []:
            file_obj = genai.upload_file(path=path)
            # Poll until the file finishes server-side processing
            while file_obj.state.name == "PROCESSING":
                time.sleep(2)
                file_obj = genai.get_file(file_obj.name)
            if file_obj.state.name == "FAILED":
                raise ValueError(f"Processing failed for {path}")
            parts.append(file_obj)
        parts.append(text)

        response = self.chat.send_message(parts)
        return response.text

agent = MultiModalAgent()

# Analyze an image and audio together
result = agent.send(
    "The image shows our server dashboard and the audio is an alert notification. "
    "What is the server status and is the alert critical?",
    media_paths=["dashboard.png", "alert.wav"],
)
print(result)

FAQ

What are the file size limits for Gemini media uploads?

Inline data (passed directly in the request) is limited to 20MB. The Files API supports uploads up to 2GB per file. Uploaded files are stored for 48 hours and then automatically deleted.
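These two limits suggest a simple routing rule: inline data under 20MB, the Files API up to 2GB, and reject anything beyond that. A sketch — choose_upload_strategy is a hypothetical helper, with thresholds taken from the limits above:

```python
INLINE_LIMIT = 20 * 1024 * 1024           # 20MB inline request limit
FILES_API_LIMIT = 2 * 1024 * 1024 * 1024  # 2GB per-file Files API limit

def choose_upload_strategy(size_bytes: int) -> str:
    """Pick how to send media to Gemini based on file size."""
    if size_bytes <= INLINE_LIMIT:
        return "inline"
    if size_bytes <= FILES_API_LIMIT:
        return "files_api"
    raise ValueError(f"File of {size_bytes} bytes exceeds the 2GB Files API limit")
```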

Can Gemini process live video streams?

Gemini's standard API processes pre-recorded media. For real-time processing, the Gemini Live API supports streaming audio and video input with low-latency responses, available through both the Gemini API and Vertex AI.

How many images can I include in a single request?

Gemini supports up to 3,600 image files in a single request, though practical limits depend on total token count. Each image consumes approximately 258 tokens. For most agent applications, sending 5-20 images per request is the practical sweet spot.


#GoogleGemini #MultiModalAI #ComputerVision #AudioProcessing #Python #AgenticAI #LearnAI #AIEngineering

CallSphere Team

Expert insights on AI voice agents and customer communication automation.
