Generating Multimodal Outputs: AI Agents That Create Images, Audio, and Documents
Build AI agents that generate rich multimodal outputs including images with DALL-E, speech with TTS, PDF documents, and formatted reports. Learn how to orchestrate multiple generation APIs into cohesive, multi-format responses.
Beyond Text Responses
Most AI agents return plain text. But many real tasks require richer outputs: a marketing agent should deliver copy alongside generated images, a report agent should produce formatted PDFs, and an accessibility agent should provide audio narrations. This guide builds an agent that generates images, audio, and documents as part of its response.
Image Generation with DALL-E
Start with the most common multimodal output — generating images from text descriptions:
import openai
import httpx
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class GeneratedImage:
    url: str
    local_path: str | None = None
    prompt: str = ""
    revised_prompt: str = ""


async def generate_image(
    prompt: str,
    client: openai.AsyncOpenAI,
    size: str = "1024x1024",
    quality: str = "standard",
    save_dir: str = "./outputs",
) -> GeneratedImage:
    """Generate an image using DALL-E 3."""
    response = await client.images.generate(
        model="dall-e-3",
        prompt=prompt,
        size=size,
        quality=quality,
        n=1,
    )
    image_url = response.data[0].url
    revised = response.data[0].revised_prompt
    # Download and save locally
    Path(save_dir).mkdir(parents=True, exist_ok=True)
    safe_name = prompt[:50].replace(" ", "_").replace("/", "_")
    local_path = f"{save_dir}/{safe_name}.png"
    async with httpx.AsyncClient() as http:
        img_response = await http.get(image_url)
        with open(local_path, "wb") as f:
            f.write(img_response.content)
    return GeneratedImage(
        url=image_url,
        local_path=local_path,
        prompt=prompt,
        revised_prompt=revised,
    )
Text-to-Speech Generation
For audio output, use OpenAI's TTS API to convert text to natural speech:
@dataclass
class GeneratedAudio:
    local_path: str
    duration_estimate: float
    voice: str
    text: str


async def generate_speech(
    text: str,
    client: openai.AsyncOpenAI,
    voice: str = "alloy",
    save_dir: str = "./outputs",
) -> GeneratedAudio:
    """Generate speech audio from text using OpenAI TTS."""
    Path(save_dir).mkdir(parents=True, exist_ok=True)
    safe_name = text[:30].replace(" ", "_").replace("/", "_")
    local_path = f"{save_dir}/{safe_name}.mp3"
    response = await client.audio.speech.create(
        model="tts-1-hd",
        voice=voice,
        input=text,
    )
    with open(local_path, "wb") as f:
        f.write(response.content)
    # Rough duration estimate: ~150 words per minute
    word_count = len(text.split())
    duration = word_count / 150 * 60
    return GeneratedAudio(
        local_path=local_path,
        duration_estimate=duration,
        voice=voice,
        text=text,
    )
PDF Document Generation
For structured document output, generate PDFs with formatted text, tables, and embedded images:
from reportlab.lib.pagesizes import letter
from reportlab.platypus import (
    SimpleDocTemplate, Paragraph, Spacer, Table,
    TableStyle, Image as RLImage,
)
from reportlab.lib.styles import getSampleStyleSheet
from reportlab.lib import colors


@dataclass
class GeneratedDocument:
    local_path: str
    page_count: int
    title: str


def generate_pdf_report(
    title: str,
    sections: list[dict],
    save_dir: str = "./outputs",
    images: list[str] | None = None,
) -> GeneratedDocument:
    """Generate a formatted PDF report.

    Each section: {"heading": str, "body": str, "table": list | None}
    """
    Path(save_dir).mkdir(parents=True, exist_ok=True)
    safe_name = title[:40].replace(" ", "_")
    path = f"{save_dir}/{safe_name}.pdf"
    doc = SimpleDocTemplate(path, pagesize=letter)
    styles = getSampleStyleSheet()
    story = []
    # Title
    story.append(Paragraph(title, styles["Title"]))
    story.append(Spacer(1, 20))
    for section in sections:
        # Section heading
        story.append(
            Paragraph(section["heading"], styles["Heading2"])
        )
        story.append(Spacer(1, 10))
        # Body text
        story.append(
            Paragraph(section["body"], styles["BodyText"])
        )
        story.append(Spacer(1, 10))
        # Optional table
        if section.get("table"):
            table_data = section["table"]
            t = Table(table_data)
            t.setStyle(TableStyle([
                ("BACKGROUND", (0, 0), (-1, 0), colors.grey),
                ("TEXTCOLOR", (0, 0), (-1, 0), colors.whitesmoke),
                ("GRID", (0, 0), (-1, -1), 1, colors.black),
                ("FONTSIZE", (0, 0), (-1, -1), 9),
            ]))
            story.append(t)
            story.append(Spacer(1, 15))
    # Embed images if provided
    for img_path in (images or []):
        if Path(img_path).exists():
            story.append(RLImage(img_path, width=400, height=300))
            story.append(Spacer(1, 15))
    doc.build(story)
    # After build, doc.page holds the number of the last rendered page
    page_count = doc.page
    return GeneratedDocument(
        local_path=path,
        page_count=page_count,
        title=title,
    )
The Multimodal Output Agent
Bring all generators together into an agent that decides which output formats are appropriate for each request:
import asyncio
import json


@dataclass
class MultimodalResponse:
    text: str
    images: list[GeneratedImage] = field(default_factory=list)
    audio: list[GeneratedAudio] = field(default_factory=list)
    documents: list[GeneratedDocument] = field(default_factory=list)


class MultimodalOutputAgent:
    def __init__(self):
        self.client = openai.AsyncOpenAI()

    async def _plan_outputs(self, query: str) -> dict:
        """Ask the LLM what output formats are appropriate."""
        response = await self.client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {
                    "role": "system",
                    "content": (
                        "Decide what output formats to generate. "
                        "Return a JSON object with boolean fields: "
                        "needs_image, needs_audio, needs_document, "
                        "and string fields: image_prompt, "
                        "audio_text, document_title, text_response."
                    ),
                },
                {"role": "user", "content": query},
            ],
            response_format={"type": "json_object"},
        )
        return json.loads(response.choices[0].message.content)

    async def respond(self, query: str) -> MultimodalResponse:
        plan = await self._plan_outputs(query)
        result = MultimodalResponse(
            text=plan.get("text_response", "")
        )
        # Generate outputs in parallel where possible
        tasks = []
        if plan.get("needs_image") and plan.get("image_prompt"):
            tasks.append(self._gen_image(plan["image_prompt"]))
        if plan.get("needs_audio") and plan.get("audio_text"):
            tasks.append(self._gen_audio(plan["audio_text"]))
        outputs = await asyncio.gather(*tasks, return_exceptions=True)
        for output in outputs:
            if isinstance(output, GeneratedImage):
                result.images.append(output)
            elif isinstance(output, GeneratedAudio):
                result.audio.append(output)
            elif isinstance(output, Exception):
                result.text += (
                    f"\n\n[Generation error: {output}]"
                )
        if plan.get("needs_document"):
            doc = generate_pdf_report(
                title=plan.get("document_title", "Report"),
                sections=[{
                    "heading": "Content",
                    "body": result.text,
                }],
                images=[
                    img.local_path
                    for img in result.images
                    if img.local_path
                ],
            )
            result.documents.append(doc)
        return result

    async def _gen_image(self, prompt: str) -> GeneratedImage:
        return await generate_image(prompt, self.client)

    async def _gen_audio(self, text: str) -> GeneratedAudio:
        return await generate_speech(text, self.client)
Usage Example
import asyncio


async def main():
    agent = MultimodalOutputAgent()
    response = await agent.respond(
        "Create a brief market analysis report for the AI "
        "industry in 2026, with a cover image and an audio "
        "executive summary."
    )
    print("Text:", response.text[:200])
    print("Images:", [img.local_path for img in response.images])
    print("Audio:", [a.local_path for a in response.audio])
    print("Docs:", [d.local_path for d in response.documents])


asyncio.run(main())
FAQ
How do I control the cost of generating multiple output types per request?
Implement a budget system that tracks estimated costs per generation type. DALL-E 3 costs approximately $0.04 per standard 1024x1024 image, tts-1 costs about $0.015 per 1,000 characters (the HD model roughly double), and GPT-4o planning costs standard token rates. Set per-request spending limits and skip optional outputs (like images) when the budget is tight. Also cache generated outputs: if the same image prompt appears twice, serve the cached version instead of regenerating.
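A minimal budget gate might look like the sketch below; the `UNIT_COSTS` figures and the `GenerationBudget` helper are illustrative assumptions, not official pricing:

```python
from dataclasses import dataclass, field

# Hypothetical per-unit cost estimates in USD; verify against current pricing.
UNIT_COSTS = {
    "image_standard": 0.04,     # one DALL-E 3 standard image
    "tts_per_1k_chars": 0.015,  # 1,000 characters of tts-1 audio
}


@dataclass
class GenerationBudget:
    """Tracks estimated spend and gates optional generations."""
    limit_usd: float
    spent_usd: float = 0.0
    skipped: list[str] = field(default_factory=list)

    def try_spend(self, kind: str, units: float = 1.0) -> bool:
        """Charge the budget if affordable; record a skip otherwise."""
        cost = UNIT_COSTS[kind] * units
        if self.spent_usd + cost > self.limit_usd:
            self.skipped.append(kind)
            return False
        self.spent_usd += cost
        return True


budget = GenerationBudget(limit_usd=0.05)
if budget.try_spend("image_standard"):
    pass  # proceed with image generation; a second image would be skipped
```

After a request completes, `budget.skipped` tells you which outputs were omitted so the agent can mention them in its text response.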
Can I use open-source alternatives instead of OpenAI APIs for generation?
Yes. For image generation, use Stable Diffusion via a local ComfyUI or A1111 server. For TTS, Coqui TTS and Bark provide open-source speech synthesis. For document generation, reportlab (shown above) is already open-source and runs locally with no API calls. Replace the API calls in each generator function with calls to your local model servers while keeping the same return types.
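As one example, a local Automatic1111 server exposes a `/sdapi/v1/txt2img` endpoint that returns base64-encoded images. The sketch below assumes such a server on the default port; the payload is a minimal subset of what the endpoint accepts, and `generate_image_local` mirrors the file-saving convention used by `generate_image` above:

```python
import base64
import json
from pathlib import Path
from urllib import request

# Assumed local Automatic1111 server; adjust host/port for your setup.
A1111_URL = "http://127.0.0.1:7860"


def build_txt2img_payload(prompt: str, steps: int = 30,
                          width: int = 1024, height: int = 1024) -> dict:
    """Minimal request body for A1111's /sdapi/v1/txt2img endpoint."""
    return {"prompt": prompt, "steps": steps, "width": width, "height": height}


def generate_image_local(prompt: str, save_dir: str = "./outputs") -> str:
    """Drop-in alternative to generate_image, backed by Stable Diffusion."""
    payload = json.dumps(build_txt2img_payload(prompt)).encode()
    req = request.Request(
        f"{A1111_URL}/sdapi/v1/txt2img",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req, timeout=120) as resp:
        body = json.loads(resp.read())
    # A1111 returns base64-encoded PNGs in the "images" list
    Path(save_dir).mkdir(parents=True, exist_ok=True)
    local_path = f"{save_dir}/{prompt[:50].replace(' ', '_')}.png"
    with open(local_path, "wb") as f:
        f.write(base64.b64decode(body["images"][0]))
    return local_path
```

Wrapping the result in the same `GeneratedImage` dataclass keeps the rest of the agent unchanged.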
How do I serve multimodal outputs through a web API?
Return a JSON response with the text content inline and URLs or file paths for binary outputs. For a FastAPI endpoint, upload generated images and audio to cloud storage (S3, GCS) and return signed URLs. Alternatively, serve files directly from the local output directory using FastAPI's StaticFiles mount. For documents, return a download URL that streams the PDF directly to the client.
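One way to keep the endpoint simple is a small helper that maps local output paths to public URLs before returning JSON. `OUTPUTS_BASE_URL` and `to_api_payload` are illustrative names, assuming the `./outputs` directory is exposed somewhere (for example via a static-files mount or signed storage URLs):

```python
from pathlib import Path

# Assumed public base URL where the ./outputs directory is served.
OUTPUTS_BASE_URL = "https://api.example.com/outputs"


def to_api_payload(
    text: str,
    image_paths: list[str],
    audio_paths: list[str],
    document_paths: list[str],
) -> dict:
    """Convert local output paths into a JSON-serializable API response."""
    def url_for(path: str) -> str:
        return f"{OUTPUTS_BASE_URL}/{Path(path).name}"

    return {
        "text": text,
        "images": [url_for(p) for p in image_paths],
        "audio": [url_for(p) for p in audio_paths],
        "documents": [url_for(p) for p in document_paths],
    }
```

With cloud storage, `url_for` would instead return a signed URL; the response shape stays the same either way.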
CallSphere Team
Expert insights on AI voice agents and customer communication automation.