
Building a Multi-Input Agent: Combining User Text with Uploaded Files for Rich Interactions

Build a multi-input AI agent that handles user text alongside uploaded files of any format. Learn file upload handling, automatic format detection, unified processing pipelines, and how to generate contextual responses from mixed inputs.

The Multi-Input Problem

Most AI chat interfaces accept text only. But real user needs often involve files: "Here is my resume — help me improve it," "What does this error log mean?" or "Analyze this CSV and tell me the trends." A multi-input agent must accept text and files together, detect what each file contains, process it appropriately, and generate a response that meaningfully integrates all inputs.

File Format Detection

The first step is reliably identifying what the user uploaded. MIME type detection combined with content inspection handles the vast majority of formats:

import mimetypes
import magic  # python-magic
from dataclasses import dataclass
from enum import Enum
from pathlib import Path


class FileCategory(Enum):
    TEXT = "text"
    IMAGE = "image"
    AUDIO = "audio"
    VIDEO = "video"
    DOCUMENT = "document"  # PDF, DOCX
    SPREADSHEET = "spreadsheet"  # CSV, XLSX
    CODE = "code"
    ARCHIVE = "archive"
    UNKNOWN = "unknown"


@dataclass
class DetectedFile:
    filename: str
    mime_type: str
    category: FileCategory
    size_bytes: int
    content: bytes


MIME_CATEGORY_MAP = {
    "application/pdf": FileCategory.DOCUMENT,
    # Adjacent string literals concatenate into one long MIME type.
    "application/vnd.openxmlformats-officedocument"
    ".wordprocessingml.document": FileCategory.DOCUMENT,
    "text/csv": FileCategory.SPREADSHEET,
    "application/vnd.openxmlformats-officedocument"
    ".spreadsheetml.sheet": FileCategory.SPREADSHEET,
}

CODE_EXTENSIONS = {
    ".py", ".js", ".ts", ".java", ".go", ".rs",
    ".rb", ".cpp", ".c", ".h", ".sql", ".sh",
}


def detect_file(filename: str, content: bytes) -> DetectedFile:
    """Detect the type and category of an uploaded file."""
    mime = magic.from_buffer(content, mime=True)
    ext = Path(filename).suffix.lower()

    # Check extension-based overrides
    if ext in CODE_EXTENSIONS:
        category = FileCategory.CODE
    elif mime in MIME_CATEGORY_MAP:
        category = MIME_CATEGORY_MAP[mime]
    elif mime.startswith("image/"):
        category = FileCategory.IMAGE
    elif mime.startswith("audio/"):
        category = FileCategory.AUDIO
    elif mime.startswith("video/"):
        category = FileCategory.VIDEO
    elif mime.startswith("text/"):
        category = FileCategory.TEXT
    else:
        category = FileCategory.UNKNOWN

    return DetectedFile(
        filename=filename,
        mime_type=mime,
        category=category,
        size_bytes=len(content),
        content=content,
    )
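Note that `python-magic` wraps the `libmagic` system library, which may not be installed everywhere. Where it isn't available, a rough fallback is extension-only detection with the standard library's `mimetypes` module (the `guess_mime` helper below is illustrative, not part of the pipeline above, and is less reliable because it never inspects content):

```python
import mimetypes


def guess_mime(filename: str, default: str = "application/octet-stream") -> str:
    # Extension-based fallback when libmagic is unavailable.
    # Unlike magic.from_buffer, this never looks at the file's bytes.
    mime, _ = mimetypes.guess_type(filename)
    return mime or default


print(guess_mime("report.pdf"))  # application/pdf
print(guess_mime("data.csv"))    # text/csv
```

Content inspection is still preferable in production, since users routinely upload files with wrong or missing extensions.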

Category-Specific Processors

Each file category has a dedicated processor that extracts content into a text representation the LLM can reason over:

import csv
import io


async def process_text_file(file: DetectedFile) -> str:
    text = file.content.decode("utf-8", errors="replace")
    if len(text) > 50000:
        text = text[:50000] + "\n... [truncated]"
    return f"Contents of {file.filename}:\n{text}"


async def process_code_file(file: DetectedFile) -> str:
    code = file.content.decode("utf-8", errors="replace")
    ext = Path(file.filename).suffix.lstrip(".")
    return (
        f"Code file: {file.filename}\n"
        f"Language: {ext}\n"
        # chr(10) is "\n"; backslashes are not allowed inside f-string
        # expressions before Python 3.12.
        f"Lines: {code.count(chr(10)) + 1}\n"
        f"~~~{ext}\n{code}\n~~~"
    )


async def process_csv_file(file: DetectedFile) -> str:
    text = file.content.decode("utf-8", errors="replace")
    reader = csv.reader(io.StringIO(text))
    rows = list(reader)

    if not rows:
        return f"{file.filename}: empty CSV"

    header = rows[0]
    preview_rows = rows[1:11]  # First 10 data rows

    lines = [
        f"CSV file: {file.filename}",
        f"Columns: {', '.join(header)}",
        f"Total rows: {len(rows) - 1}",
        "",
        "Preview (first 10 rows):",
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    for row in preview_rows:
        lines.append("| " + " | ".join(row) + " |")

    return "\n".join(lines)


PROCESSORS = {
    FileCategory.TEXT: process_text_file,
    FileCategory.CODE: process_code_file,
    FileCategory.SPREADSHEET: process_csv_file,
}
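The table-building logic in `process_csv_file` can be exercised on its own. A minimal standalone sketch (the `csv_preview` helper here is illustrative and drops the `DetectedFile` wrapper):

```python
import csv
import io


def csv_preview(text: str, max_rows: int = 10) -> str:
    # Render the header plus the first max_rows data rows as a
    # markdown table, mirroring the preview logic above.
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return "empty CSV"
    header, data = rows[0], rows[1:max_rows + 1]
    out = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    out += ["| " + " | ".join(row) + " |" for row in data]
    return "\n".join(out)


sample = "name,score\nada,91\ngrace,88\n"
print(csv_preview(sample))
```

A markdown table is a deliberate choice here: LLMs parse it far more reliably than raw comma-separated text, especially when cell values themselves contain commas.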

The Unified Processing Pipeline

Bring file detection, processing, and LLM reasoning together:


import openai


class MultiInputAgent:
    def __init__(self):
        self.client = openai.AsyncOpenAI()

    async def _process_file(self, file: DetectedFile) -> str:
        processor = PROCESSORS.get(file.category)
        if processor:
            return await processor(file)

        # Fallback: describe what we received
        return (
            f"File: {file.filename} "
            f"({file.category.value}, {file.size_bytes} bytes)"
        )

    async def chat(
        self,
        user_message: str,
        files: list[tuple[str, bytes]] | None = None,
    ) -> str:
        """Process user text and optional file uploads."""
        # Detect and process all files
        file_contexts = []
        image_parts = []

        for filename, content in (files or []):
            detected = detect_file(filename, content)

            if detected.category == FileCategory.IMAGE:
                import base64
                b64 = base64.b64encode(content).decode()
                image_parts.append({
                    "type": "image_url",
                    "image_url": {
                        "url": (
                            f"data:{detected.mime_type};"
                            f"base64,{b64}"
                        )
                    },
                })
            else:
                processed = await self._process_file(detected)
                file_contexts.append(processed)

        # Build the prompt
        parts = []
        if file_contexts:
            parts.append(
                "Uploaded file contents:\n\n"
                + "\n\n---\n\n".join(file_contexts)
            )
        parts.append(f"User message: {user_message}")

        content = [{"type": "text", "text": "\n\n".join(parts)}]
        content.extend(image_parts)

        response = await self.client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {
                    "role": "system",
                    "content": (
                        "You are a helpful assistant that analyzes "
                        "user messages along with any uploaded files. "
                        "Reference specific file contents in your "
                        "response."
                    ),
                },
                {"role": "user", "content": content},
            ],
        )
        return response.choices[0].message.content
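The image branch above packs each upload into a `data:` URL. As a standalone sketch of that encoding step (the `to_data_url` helper is an extraction for illustration, not a method of the agent class):

```python
import base64


def to_data_url(mime_type: str, content: bytes) -> str:
    # Inline the raw bytes as a base64 data: URL, the format the
    # chat completions image_url content part accepts.
    b64 = base64.b64encode(content).decode()
    return f"data:{mime_type};base64,{b64}"


print(to_data_url("image/png", b"hi"))  # data:image/png;base64,aGk=
```

Keep image size in mind: base64 inflates payloads by roughly a third, so large images are better resized before encoding.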

FastAPI Endpoint

Expose the agent through a web API that accepts multipart form data:

from fastapi import FastAPI, UploadFile, File, Form
from typing import Annotated

app = FastAPI()
agent = MultiInputAgent()


@app.post("/chat")
async def chat_endpoint(
    message: Annotated[str, Form()],
    files: list[UploadFile] | None = None,
):
    file_data = []
    for f in files or []:
        content = await f.read()
        # UploadFile.filename is optional; fall back to a placeholder.
        file_data.append((f.filename or "upload", content))

    response = await agent.chat(message, file_data)
    return {"response": response}

FAQ

How do I handle very large files that exceed the LLM context window?

For large files, implement a summarization or chunking strategy. For text and code files, truncate to the first and last sections with a note about what was omitted. For CSVs, show the schema plus a statistical summary (column types, min, max, mean) instead of raw rows. For PDFs, extract only the pages most relevant to the user's question using keyword matching against the query.
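The head-and-tail truncation described above might look like this (a sketch; the `truncate_middle` name and the default limits are arbitrary):

```python
def truncate_middle(text: str, head: int = 2000, tail: int = 1000) -> str:
    # Keep the opening and closing sections and note how much was
    # omitted, so the model knows the excerpt is incomplete.
    if len(text) <= head + tail:
        return text
    omitted = len(text) - head - tail
    return f"{text[:head]}\n... [{omitted} characters omitted] ...\n{text[-tail:]}"
```

The opening usually carries headers, imports, or schema, and the end often carries conclusions or the most recent log lines, which is why both extremes tend to matter more than the middle.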

What security considerations are important for file upload agents?

Never execute uploaded files or evaluate their contents as code. Validate file sizes (reject uploads over a reasonable limit like 50MB). Scan for malware if the system is exposed to the public. Sanitize filenames to prevent path traversal attacks. Process files in isolated temporary directories and clean them up after processing. Never store raw uploads permanently unless explicitly required.
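A minimal sketch of the filename-sanitization step (the `sanitize_filename` helper is illustrative; production code should also enforce size limits and, where possible, extension allow-lists):

```python
from pathlib import PurePosixPath


def sanitize_filename(name: str) -> str:
    # Normalize Windows separators, drop all directory components,
    # and strip leading dots so hidden files can't be smuggled in.
    base = PurePosixPath(name.replace("\\", "/")).name
    base = base.lstrip(".")
    return base or "upload"


print(sanitize_filename("../../etc/passwd"))  # passwd
```

Taking only the final path component defeats traversal sequences like `../../` regardless of how many levels the attacker supplies.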

Can this agent maintain context across multiple messages with different file uploads?

Yes. Add a conversation history that stores both messages and processed file contexts. On each new message, include the relevant prior context in the prompt. For efficiency, store processed file summaries rather than raw file contents in the history, and allow the user to reference previously uploaded files by name without re-uploading them.
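One way to sketch such a history (the `Conversation` structure below is an assumption, not part of the agent above; it stores processed summaries rather than raw bytes, per the advice in this answer):

```python
from dataclasses import dataclass, field


@dataclass
class Conversation:
    # Processed file summaries keyed by filename; raw uploads are not kept.
    file_summaries: dict[str, str] = field(default_factory=dict)

    def add_file(self, filename: str, summary: str) -> None:
        self.file_summaries[filename] = summary

    def context_for(self, message: str) -> str:
        # Prepend summaries of any previously uploaded files the new
        # message mentions by name, so no re-upload is needed.
        hits = [s for name, s in self.file_summaries.items() if name in message]
        return "\n\n".join(hits + [f"User message: {message}"])


convo = Conversation()
convo.add_file("report.csv", "CSV file: report.csv, 3 rows")
print(convo.context_for("what are the trends in report.csv?"))
```

Substring matching on filenames is a crude retrieval strategy; embedding-based lookup over the summaries is a natural upgrade once the history grows.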


#MultiInputAgent #FileProcessing #FormatDetection #FastAPI #Python #AgenticAI #LearnAI #AIEngineering

CallSphere Team