Generating Multimodal Outputs: AI Agents That Create Images, Audio, and Documents
Build AI agents that generate rich multimodal outputs including images with DALL-E, speech with TTS, PDF documents, and formatted reports. Learn how to orchestrate multiple generation APIs into cohesive, multi-format responses.
Beyond Text Responses
Most AI agents return plain text. But many real tasks require richer outputs: a marketing agent should deliver copy alongside generated images, a report agent should produce formatted PDFs, and an accessibility agent should provide audio narrations. This guide builds an agent that generates images, audio, and documents as part of its response.
Image Generation with DALL-E
Start with the most common multimodal output — generating images from text descriptions:
import openai
import httpx
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class GeneratedImage:
    url: str
    local_path: str | None = None
    prompt: str = ""
    revised_prompt: str = ""


async def generate_image(
    prompt: str,
    client: openai.AsyncOpenAI,
    size: str = "1024x1024",
    quality: str = "standard",
    save_dir: str = "./outputs",
) -> GeneratedImage:
    """Generate an image using DALL-E 3."""
    response = await client.images.generate(
        model="dall-e-3",
        prompt=prompt,
        size=size,
        quality=quality,
        n=1,
    )
    image_url = response.data[0].url
    revised = response.data[0].revised_prompt
    # Download and save locally
    Path(save_dir).mkdir(parents=True, exist_ok=True)
    safe_name = prompt[:50].replace(" ", "_").replace("/", "_")
    local_path = f"{save_dir}/{safe_name}.png"
    async with httpx.AsyncClient() as http:
        img_response = await http.get(image_url)
        with open(local_path, "wb") as f:
            f.write(img_response.content)
    return GeneratedImage(
        url=image_url,
        local_path=local_path,
        prompt=prompt,
        revised_prompt=revised,
    )
Text-to-Speech Generation
For audio output, use OpenAI's TTS API to convert text to natural speech:
@dataclass
class GeneratedAudio:
    local_path: str
    duration_estimate: float
    voice: str
    text: str


async def generate_speech(
    text: str,
    client: openai.AsyncOpenAI,
    voice: str = "alloy",
    save_dir: str = "./outputs",
) -> GeneratedAudio:
    """Generate speech audio from text using OpenAI TTS."""
    Path(save_dir).mkdir(parents=True, exist_ok=True)
    safe_name = text[:30].replace(" ", "_").replace("/", "_")
    local_path = f"{save_dir}/{safe_name}.mp3"
    response = await client.audio.speech.create(
        model="tts-1-hd",
        voice=voice,
        input=text,
    )
    with open(local_path, "wb") as f:
        f.write(response.content)
    # Rough duration estimate: ~150 words per minute
    word_count = len(text.split())
    duration = word_count / 150 * 60
    return GeneratedAudio(
        local_path=local_path,
        duration_estimate=duration,
        voice=voice,
        text=text,
    )
PDF Document Generation
For structured document output, generate PDFs with formatted text, tables, and embedded images:
from reportlab.lib.pagesizes import letter
from reportlab.platypus import (
    SimpleDocTemplate, Paragraph, Spacer, Table,
    TableStyle, Image as RLImage,
)
from reportlab.lib.styles import getSampleStyleSheet
from reportlab.lib import colors


@dataclass
class GeneratedDocument:
    local_path: str
    page_count: int
    title: str


def generate_pdf_report(
    title: str,
    sections: list[dict],
    save_dir: str = "./outputs",
    images: list[str] | None = None,
) -> GeneratedDocument:
    """Generate a formatted PDF report.

    Each section: {"heading": str, "body": str, "table": list | None}
    """
    Path(save_dir).mkdir(parents=True, exist_ok=True)
    safe_name = title[:40].replace(" ", "_")
    path = f"{save_dir}/{safe_name}.pdf"
    doc = SimpleDocTemplate(path, pagesize=letter)
    styles = getSampleStyleSheet()
    story = []
    # Title
    story.append(Paragraph(title, styles["Title"]))
    story.append(Spacer(1, 20))
    for section in sections:
        # Section heading
        story.append(
            Paragraph(section["heading"], styles["Heading2"])
        )
        story.append(Spacer(1, 10))
        # Body text
        story.append(
            Paragraph(section["body"], styles["BodyText"])
        )
        story.append(Spacer(1, 10))
        # Optional table
        if section.get("table"):
            table_data = section["table"]
            t = Table(table_data)
            t.setStyle(TableStyle([
                ("BACKGROUND", (0, 0), (-1, 0), colors.grey),
                ("TEXTCOLOR", (0, 0), (-1, 0), colors.whitesmoke),
                ("GRID", (0, 0), (-1, -1), 1, colors.black),
                ("FONTSIZE", (0, 0), (-1, -1), 9),
            ]))
            story.append(t)
            story.append(Spacer(1, 15))
    # Embed images if provided
    for img_path in (images or []):
        if Path(img_path).exists():
            story.append(RLImage(img_path, width=400, height=300))
            story.append(Spacer(1, 15))
    doc.build(story)
    # After build, doc.page holds the number of the last rendered page
    page_count = doc.page
    return GeneratedDocument(
        local_path=path,
        page_count=page_count,
        title=title,
    )
The Multimodal Output Agent
Bring all generators together into an agent that decides which output formats are appropriate for each request:
import asyncio
import json


@dataclass
class MultimodalResponse:
    text: str
    images: list[GeneratedImage] = field(default_factory=list)
    audio: list[GeneratedAudio] = field(default_factory=list)
    documents: list[GeneratedDocument] = field(default_factory=list)


class MultimodalOutputAgent:
    def __init__(self):
        self.client = openai.AsyncOpenAI()

    async def _plan_outputs(self, query: str) -> dict:
        """Ask the LLM what output formats are appropriate."""
        response = await self.client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {
                    "role": "system",
                    "content": (
                        "Decide what output formats to generate. "
                        "Return a JSON object with boolean fields: "
                        "needs_image, needs_audio, needs_document, "
                        "and string fields: image_prompt, "
                        "audio_text, document_title, text_response."
                    ),
                },
                {"role": "user", "content": query},
            ],
            response_format={"type": "json_object"},
        )
        return json.loads(response.choices[0].message.content)

    async def respond(self, query: str) -> MultimodalResponse:
        plan = await self._plan_outputs(query)
        result = MultimodalResponse(
            text=plan.get("text_response", "")
        )
        # Generate outputs in parallel where possible
        tasks = []
        if plan.get("needs_image") and plan.get("image_prompt"):
            tasks.append(self._gen_image(plan["image_prompt"]))
        if plan.get("needs_audio") and plan.get("audio_text"):
            tasks.append(self._gen_audio(plan["audio_text"]))
        outputs = await asyncio.gather(*tasks, return_exceptions=True)
        for output in outputs:
            if isinstance(output, GeneratedImage):
                result.images.append(output)
            elif isinstance(output, GeneratedAudio):
                result.audio.append(output)
            elif isinstance(output, Exception):
                result.text += (
                    f"\n\n[Generation error: {output}]"
                )
        if plan.get("needs_document"):
            doc = generate_pdf_report(
                title=plan.get("document_title", "Report"),
                sections=[{
                    "heading": "Content",
                    "body": result.text,
                }],
                images=[
                    img.local_path
                    for img in result.images
                    if img.local_path
                ],
            )
            result.documents.append(doc)
        return result

    async def _gen_image(self, prompt: str) -> GeneratedImage:
        return await generate_image(prompt, self.client)

    async def _gen_audio(self, text: str) -> GeneratedAudio:
        return await generate_speech(text, self.client)
Usage Example
import asyncio


async def main():
    agent = MultimodalOutputAgent()
    response = await agent.respond(
        "Create a brief market analysis report for the AI "
        "industry in 2026, with a cover image and an audio "
        "executive summary."
    )
    print("Text:", response.text[:200])
    print("Images:", [img.local_path for img in response.images])
    print("Audio:", [a.local_path for a in response.audio])
    print("Docs:", [d.local_path for d in response.documents])


asyncio.run(main())
FAQ
How do I control the cost of generating multiple output types per request?
Implement a budget system that tracks estimated costs per generation type. DALL-E 3 costs approximately $0.04 per standard 1024x1024 image, tts-1 costs about $0.015 per 1,000 characters (the HD model roughly double), and GPT-4o planning costs standard token rates. Set per-request spending limits and skip optional outputs (like images) when the budget is tight. Also cache generated outputs: if the same image prompt appears twice, serve the cached version instead of regenerating.
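A minimal budget gate might look like the sketch below; the `UNIT_COSTS` figures and the `GenerationBudget` helper are illustrative assumptions, not official pricing:

```python
from dataclasses import dataclass, field

# Hypothetical per-unit cost estimates in USD; verify against current pricing.
UNIT_COSTS = {
    "image_standard": 0.04,     # one DALL-E 3 standard image
    "tts_per_1k_chars": 0.015,  # 1,000 characters of tts-1 audio
}


@dataclass
class GenerationBudget:
    """Tracks estimated spend and gates optional generations."""
    limit_usd: float
    spent_usd: float = 0.0
    skipped: list[str] = field(default_factory=list)

    def try_spend(self, kind: str, units: float = 1.0) -> bool:
        """Charge the budget if affordable; record a skip otherwise."""
        cost = UNIT_COSTS[kind] * units
        if self.spent_usd + cost > self.limit_usd:
            self.skipped.append(kind)
            return False
        self.spent_usd += cost
        return True


budget = GenerationBudget(limit_usd=0.05)
if budget.try_spend("image_standard"):
    pass  # proceed with image generation; a second image would be skipped
```

After a request completes, `budget.skipped` tells you which outputs were omitted so the agent can mention them in its text response.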
Can I use open-source alternatives instead of OpenAI APIs for generation?
Yes. For image generation, use Stable Diffusion via a local ComfyUI or A1111 server. For TTS, Coqui TTS and Bark provide open-source speech synthesis. For document generation, reportlab (shown above) is already open-source and runs locally with no API calls. Replace the API calls in each generator function with calls to your local model servers while keeping the same return types.
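As one example, a local Automatic1111 server exposes a `/sdapi/v1/txt2img` endpoint that returns base64-encoded images. The sketch below assumes such a server on the default port; the payload is a minimal subset of what the endpoint accepts, and `generate_image_local` mirrors the file-saving convention used by `generate_image` above:

```python
import base64
import json
from pathlib import Path
from urllib import request

# Assumed local Automatic1111 server; adjust host/port for your setup.
A1111_URL = "http://127.0.0.1:7860"


def build_txt2img_payload(prompt: str, steps: int = 30,
                          width: int = 1024, height: int = 1024) -> dict:
    """Minimal request body for A1111's /sdapi/v1/txt2img endpoint."""
    return {"prompt": prompt, "steps": steps, "width": width, "height": height}


def generate_image_local(prompt: str, save_dir: str = "./outputs") -> str:
    """Drop-in alternative to generate_image, backed by Stable Diffusion."""
    payload = json.dumps(build_txt2img_payload(prompt)).encode()
    req = request.Request(
        f"{A1111_URL}/sdapi/v1/txt2img",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req, timeout=120) as resp:
        body = json.loads(resp.read())
    # A1111 returns base64-encoded PNGs in the "images" list
    Path(save_dir).mkdir(parents=True, exist_ok=True)
    local_path = f"{save_dir}/{prompt[:50].replace(' ', '_')}.png"
    with open(local_path, "wb") as f:
        f.write(base64.b64decode(body["images"][0]))
    return local_path
```

Wrapping the result in the same `GeneratedImage` dataclass keeps the rest of the agent unchanged.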
How do I serve multimodal outputs through a web API?
Return a JSON response with the text content inline and URLs or file paths for binary outputs. For a FastAPI endpoint, upload generated images and audio to cloud storage (S3, GCS) and return signed URLs. Alternatively, serve files directly from the local output directory using FastAPI's StaticFiles mount. For documents, return a download URL that streams the PDF directly to the client.
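One way to keep the endpoint simple is a small helper that maps local output paths to public URLs before returning JSON. `OUTPUTS_BASE_URL` and `to_api_payload` are illustrative names, assuming the `./outputs` directory is exposed somewhere (for example via a static-files mount or signed storage URLs):

```python
from pathlib import Path

# Assumed public base URL where the ./outputs directory is served.
OUTPUTS_BASE_URL = "https://api.example.com/outputs"


def to_api_payload(
    text: str,
    image_paths: list[str],
    audio_paths: list[str],
    document_paths: list[str],
) -> dict:
    """Convert local output paths into a JSON-serializable API response."""
    def url_for(path: str) -> str:
        return f"{OUTPUTS_BASE_URL}/{Path(path).name}"

    return {
        "text": text,
        "images": [url_for(p) for p in image_paths],
        "audio": [url_for(p) for p in audio_paths],
        "documents": [url_for(p) for p in document_paths],
    }
```

With cloud storage, `url_for` would instead return a signed URL; the response shape stays the same either way.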
CallSphere Team
Expert insights on AI voice agents and customer communication automation.