Building a Multi-Input Agent: Combining User Text with Uploaded Files for Rich Interactions
Build a multi-input AI agent that handles user text alongside uploaded files of any format. Learn file upload handling, automatic format detection, unified processing pipelines, and how to generate contextual responses from mixed inputs.
The Multi-Input Problem
Most AI chat interfaces accept text only. But real user needs often involve files: "Here is my resume — help me improve it," "What does this error log mean?" or "Analyze this CSV and tell me the trends." A multi-input agent must accept text and files together, detect what each file contains, process it appropriately, and generate a response that meaningfully integrates all inputs.
File Format Detection
The first step is reliably identifying what the user uploaded. MIME type detection combined with content inspection handles the vast majority of formats:
```python
import magic  # python-magic
from dataclasses import dataclass
from enum import Enum
from pathlib import Path


class FileCategory(Enum):
    TEXT = "text"
    IMAGE = "image"
    AUDIO = "audio"
    VIDEO = "video"
    DOCUMENT = "document"        # PDF, DOCX
    SPREADSHEET = "spreadsheet"  # CSV, XLSX
    CODE = "code"
    ARCHIVE = "archive"
    UNKNOWN = "unknown"


@dataclass
class DetectedFile:
    filename: str
    mime_type: str
    category: FileCategory
    size_bytes: int
    content: bytes


MIME_CATEGORY_MAP = {
    "application/pdf": FileCategory.DOCUMENT,
    "application/vnd.openxmlformats-officedocument"
    ".wordprocessingml.document": FileCategory.DOCUMENT,
    "text/csv": FileCategory.SPREADSHEET,
    "application/vnd.openxmlformats-officedocument"
    ".spreadsheetml.sheet": FileCategory.SPREADSHEET,
}

CODE_EXTENSIONS = {
    ".py", ".js", ".ts", ".java", ".go", ".rs",
    ".rb", ".cpp", ".c", ".h", ".sql", ".sh",
}


def detect_file(filename: str, content: bytes) -> DetectedFile:
    """Detect the type and category of an uploaded file."""
    mime = magic.from_buffer(content, mime=True)
    ext = Path(filename).suffix.lower()

    # Check extension-based overrides first
    if ext in CODE_EXTENSIONS:
        category = FileCategory.CODE
    elif mime in MIME_CATEGORY_MAP:
        category = MIME_CATEGORY_MAP[mime]
    elif mime.startswith("image/"):
        category = FileCategory.IMAGE
    elif mime.startswith("audio/"):
        category = FileCategory.AUDIO
    elif mime.startswith("video/"):
        category = FileCategory.VIDEO
    elif mime.startswith("text/"):
        category = FileCategory.TEXT
    else:
        category = FileCategory.UNKNOWN

    return DetectedFile(
        filename=filename,
        mime_type=mime,
        category=category,
        size_bytes=len(content),
        content=content,
    )
```
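python-magic wraps the native libmagic library and sniffs the actual bytes. In environments where libmagic cannot be installed, a stdlib-only fallback can guess from the extension alone. A minimal sketch (the `guess_mime` helper here is illustrative, not part of the pipeline above):

```python
import mimetypes


def guess_mime(filename: str) -> str:
    """Extension-only fallback when libmagic is unavailable.

    Unlike magic.from_buffer, this never inspects file contents,
    so a renamed file will be misclassified.
    """
    mime, _ = mimetypes.guess_type(filename)
    return mime or "application/octet-stream"
```

The trade-off: content sniffing catches mislabeled uploads (a `.txt` that is really a PNG), while extension guessing trusts the filename entirely.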
Category-Specific Processors
Each file category has a dedicated processor that extracts content into a text representation the LLM can reason over:
```python
import csv
import io


async def process_text_file(file: DetectedFile) -> str:
    text = file.content.decode("utf-8", errors="replace")
    if len(text) > 50000:
        text = text[:50000] + "\n... [truncated]"
    return f"Contents of {file.filename}:\n{text}"


async def process_code_file(file: DetectedFile) -> str:
    code = file.content.decode("utf-8", errors="replace")
    ext = Path(file.filename).suffix.lstrip(".")
    return (
        f"Code file: {file.filename}\n"
        f"Language: {ext}\n"
        f"Lines: {code.count(chr(10)) + 1}\n"
        f"~~~{ext}\n{code}\n~~~"
    )


async def process_csv_file(file: DetectedFile) -> str:
    text = file.content.decode("utf-8", errors="replace")
    reader = csv.reader(io.StringIO(text))
    rows = list(reader)
    if not rows:
        return f"{file.filename}: empty CSV"

    header = rows[0]
    preview_rows = rows[1:11]  # First 10 data rows

    lines = [
        f"CSV file: {file.filename}",
        f"Columns: {', '.join(header)}",
        f"Total rows: {len(rows) - 1}",
        "",
        "Preview (first 10 rows):",
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    for row in preview_rows:
        lines.append("| " + " | ".join(row) + " |")
    return "\n".join(lines)


PROCESSORS = {
    FileCategory.TEXT: process_text_file,
    FileCategory.CODE: process_code_file,
    FileCategory.SPREADSHEET: process_csv_file,
}
```
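To make the preview format concrete, here is the same markdown-table logic run standalone on a toy CSV (a self-contained sketch, independent of the `DetectedFile` wrapper):

```python
import csv
import io

raw = b"name,score\nada,91\ngrace,88\n"  # toy CSV payload
rows = list(csv.reader(io.StringIO(raw.decode("utf-8"))))
header, data = rows[0], rows[1:]

# Header row, separator row, then one row per record
table = ["| " + " | ".join(header) + " |",
         "| " + " | ".join("---" for _ in header) + " |"]
table += ["| " + " | ".join(r) + " |" for r in data]
print("\n".join(table))
# | name | score |
# | --- | --- |
# | ada | 91 |
# | grace | 88 |
```

Rendering the preview as a markdown table gives the LLM an unambiguous column structure to reason over, which plain comma-separated lines often lack.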
The Unified Processing Pipeline
Bring file detection, processing, and LLM reasoning together:
```python
import base64

import openai


class MultiInputAgent:
    def __init__(self):
        self.client = openai.AsyncOpenAI()

    async def _process_file(self, file: DetectedFile) -> str:
        processor = PROCESSORS.get(file.category)
        if processor:
            return await processor(file)
        # Fallback: describe what we received
        return (
            f"File: {file.filename} "
            f"({file.category.value}, {file.size_bytes} bytes)"
        )

    async def chat(
        self,
        user_message: str,
        files: list[tuple[str, bytes]] | None = None,
    ) -> str:
        """Process user text and optional file uploads."""
        # Detect and process all files
        file_contexts = []
        image_parts = []
        for filename, content in (files or []):
            detected = detect_file(filename, content)
            if detected.category == FileCategory.IMAGE:
                b64 = base64.b64encode(content).decode()
                image_parts.append({
                    "type": "image_url",
                    "image_url": {
                        "url": (
                            f"data:{detected.mime_type};"
                            f"base64,{b64}"
                        )
                    },
                })
            else:
                processed = await self._process_file(detected)
                file_contexts.append(processed)

        # Build the prompt
        parts = []
        if file_contexts:
            parts.append(
                "Uploaded file contents:\n\n"
                + "\n\n---\n\n".join(file_contexts)
            )
        parts.append(f"User message: {user_message}")

        content = [{"type": "text", "text": "\n\n".join(parts)}]
        content.extend(image_parts)

        response = await self.client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {
                    "role": "system",
                    "content": (
                        "You are a helpful assistant that analyzes "
                        "user messages along with any uploaded files. "
                        "Reference specific file contents in your "
                        "response."
                    ),
                },
                {"role": "user", "content": content},
            ],
        )
        return response.choices[0].message.content
```
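The image branch in `chat` embeds the upload as a base64 data URL inside an `image_url` content part, the multimodal message format the OpenAI chat completions API accepts. A standalone sketch of just that encoding (the toy bytes stand in for a real image):

```python
import base64

payload = b"\x89PNG\r\n\x1a\n"  # toy bytes, not a full PNG
b64 = base64.b64encode(payload).decode()

# The content part shape expected alongside {"type": "text", ...}
image_part = {
    "type": "image_url",
    "image_url": {"url": f"data:image/png;base64,{b64}"},
}
```

The data URL carries the detected MIME type, so the model knows the image format without a separate upload step.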
FastAPI Endpoint
Expose the agent through a web API that accepts multipart form data:
```python
from typing import Annotated

from fastapi import FastAPI, File, Form, UploadFile

app = FastAPI()
agent = MultiInputAgent()


@app.post("/chat")
async def chat_endpoint(
    message: Annotated[str, Form()],
    files: list[UploadFile] = File(default=[]),
):
    file_data = []
    for f in files:
        content = await f.read()
        file_data.append((f.filename, content))
    response = await agent.chat(message, file_data)
    return {"response": response}
```
FAQ
How do I handle very large files that exceed the LLM context window?
For large files, implement a summarization or chunking strategy. For text and code files, truncate to the first and last sections with a note about what was omitted. For CSVs, show the schema plus a statistical summary (column types, min, max, mean) instead of raw rows. For PDFs, extract only the pages most relevant to the user's question using keyword matching against the query.
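The schema-plus-statistics idea for CSVs can be sketched with the stdlib alone, assuming numeric columns are detected by attempted float conversion (the helper name is illustrative):

```python
import csv
import io
import statistics


def summarize_numeric_csv(raw: bytes) -> dict:
    """Replace raw rows with per-column stats for large CSVs.

    Numeric columns get min/max/mean so the LLM can reason about
    ranges and trends; text columns are reported by cardinality.
    """
    rows = list(csv.reader(io.StringIO(raw.decode("utf-8"))))
    header, data = rows[0], rows[1:]
    summary = {}
    for i, col in enumerate(header):
        values = [r[i] for r in data]
        try:
            nums = [float(v) for v in values]
            summary[col] = {
                "type": "numeric",
                "min": min(nums),
                "max": max(nums),
                "mean": statistics.mean(nums),
            }
        except ValueError:
            summary[col] = {"type": "text", "distinct": len(set(values))}
    return summary
```

A summary like this stays a few hundred tokens regardless of row count, where raw rows grow linearly.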
What security considerations are important for file upload agents?
Never execute uploaded files or evaluate their contents as code. Validate file sizes (reject uploads over a reasonable limit like 50MB). Scan for malware if the system is exposed to the public. Sanitize filenames to prevent path traversal attacks. Process files in isolated temporary directories and clean them up after processing. Never store raw uploads permanently unless explicitly required.
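Filename sanitization against path traversal can be as small as keeping only the final path segment; a minimal sketch (the helper name is an assumption):

```python
from pathlib import Path


def sanitize_filename(filename: str) -> str:
    """Strip directory components so '../../etc/passwd' cannot
    escape the upload directory.

    Backslashes are normalized first, since Path on POSIX systems
    does not treat them as separators.
    """
    name = Path(filename.replace("\\", "/")).name
    return name or "upload"
```

Production systems often go further (allow-listing characters, length limits), but dropping every directory component is the non-negotiable first step.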
Can this agent maintain context across multiple messages with different file uploads?
Yes. Add a conversation history that stores both messages and processed file contexts. On each new message, include the relevant prior context in the prompt. For efficiency, store processed file summaries rather than raw file contents in the history, and allow the user to reference previously uploaded files by name without re-uploading them.
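One minimal way to sketch this, assuming a simple name-mention heuristic for recalling prior uploads (the class is illustrative, not part of the agent above):

```python
class FileAwareHistory:
    """Stores processed file summaries, not raw bytes, so prior
    uploads can be referenced by name in later turns."""

    def __init__(self):
        self.turns = []           # (role, text) pairs
        self.file_summaries = {}  # filename -> processed text

    def add_file(self, filename: str, summary: str) -> None:
        self.file_summaries[filename] = summary

    def add_turn(self, role: str, text: str) -> None:
        self.turns.append((role, text))

    def context_for(self, message: str) -> str:
        """Include a prior upload's summary only when the new
        message mentions its filename."""
        mentioned = [
            s for name, s in self.file_summaries.items()
            if name.lower() in message.lower()
        ]
        return "\n\n".join(mentioned)
```

A real system might swap the name-mention check for embedding similarity, but the storage principle is the same: summaries in history, raw bytes discarded.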
#MultiInputAgent #FileProcessing #FormatDetection #FastAPI #Python #AgenticAI #LearnAI #AIEngineering