Building a File Organization Agent: AI-Powered Document Categorization and Filing

The Cost of Digital Disorganization

A typical shared drive accumulates thousands of files with names like "Final_v2_REVISED.docx" and "report copy (3).pdf." Finding the right document means searching through nested folders with inconsistent naming, duplicate files scattered across directories, and no clear taxonomy. An AI file organization agent solves this by analyzing file content, categorizing documents by type and topic, and filing them into a structured hierarchy.

This guide builds a complete file organization agent that scans directories, extracts content from multiple file types, uses an LLM for intelligent categorization, and reorganizes files with consistent naming.

Scanning and Extracting File Content

The agent needs to read content from various file types. We create extractors for the most common formats:

from pathlib import Path
from dataclasses import dataclass
import mimetypes

@dataclass
class FileInfo:
    path: Path
    name: str
    extension: str
    size_bytes: int
    content_preview: str
    mime_type: str

def extract_text_content(filepath: Path, max_chars: int = 2000) -> str:
    """Extract text content from common file types."""
    ext = filepath.suffix.lower()

    if ext in (".txt", ".md", ".csv", ".log", ".json", ".yaml", ".yml"):
        # Read only the preview instead of loading a potentially huge file
        with filepath.open(errors="replace") as handle:
            return handle.read(max_chars)

    if ext == ".pdf":
        import pymupdf
        doc = pymupdf.open(str(filepath))
        text = ""
        for page in doc:
            text += page.get_text()
            if len(text) > max_chars:
                break
        doc.close()
        return text[:max_chars]

    if ext in (".docx",):
        from docx import Document
        doc = Document(str(filepath))
        text = "\n".join(p.text for p in doc.paragraphs)
        return text[:max_chars]

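    if ext == ".xlsx":
        # openpyxl reads .xlsx only; legacy .xls files fall through to the empty default
        import openpyxl
        wb = openpyxl.load_workbook(str(filepath), read_only=True)
        text = ""
        for sheet in wb.sheetnames[:3]:
            ws = wb[sheet]
            for row in ws.iter_rows(max_row=20, values_only=True):
                # Skip empty cells but keep legitimate zeros and False values
                text += " ".join(str(c) for c in row if c is not None) + "\n"
        wb.close()
        return text[:max_chars]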

    return ""

def scan_directory(directory: str, recursive: bool = True) -> list[FileInfo]:
    """Scan a directory and extract file information."""
    root = Path(directory)
    pattern = "**/*" if recursive else "*"
    files = []

    for filepath in root.glob(pattern):
        if filepath.is_file() and not filepath.name.startswith("."):
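            try:
                content = extract_text_content(filepath)
            except Exception:
                # A corrupt or password-protected file should not abort the whole scan
                content = ""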
            mime, _ = mimetypes.guess_type(str(filepath))
            files.append(FileInfo(
                path=filepath,
                name=filepath.name,
                extension=filepath.suffix.lower(),
                size_bytes=filepath.stat().st_size,
                content_preview=content,
                mime_type=mime or "application/octet-stream",
            ))

    return files

AI-Powered Categorization

The agent sends file metadata and content previews to an LLM for intelligent categorization. The model determines the document type, topic, and an appropriate filename:

from openai import OpenAI
import json

client = OpenAI()

CATEGORIES = {
    "contracts": "Legal agreements, NDAs, service contracts, amendments",
    "proposals": "Business proposals, RFPs, pitch decks",
    "invoices": "Invoices, receipts, purchase orders, billing statements",
    "reports": "Analytics reports, status updates, research findings",
    "correspondence": "Emails, letters, memos, meeting notes",
    "technical": "Architecture docs, API specs, runbooks, code reviews",
    "marketing": "Campaign materials, brand assets, social media content",
    "hr": "Employee records, policies, offer letters, reviews",
    "misc": "Files that do not fit other categories",
}

def categorize_file(file_info: FileInfo) -> dict:
    """Use LLM to categorize a file based on its content and metadata."""
    category_desc = "\n".join(f"- {k}: {v}" for k, v in CATEGORIES.items())

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        response_format={"type": "json_object"},
        messages=[
            {
                "role": "system",
                "content": (
                    "You categorize files. Return JSON with:\n"
                    "- category: one of the categories below\n"
                    "- subcategory: a specific subcategory (e.g., 'nda' under contracts)\n"
                    "- suggested_name: a clean descriptive filename (lowercase, hyphens, no spaces)\n"
                    "- confidence: float 0-1\n"
                    "- summary: one sentence describing the file\n\n"
                    f"Categories:\n{category_desc}"
                ),
            },
            {
                "role": "user",
                "content": (
                    f"Filename: {file_info.name}\n"
                    f"Type: {file_info.mime_type}\n"
                    f"Size: {file_info.size_bytes} bytes\n\n"
                    f"Content preview:\n{file_info.content_preview[:1500]}"
                ),
            },
        ],
    )
    return json.loads(response.choices[0].message.content)
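
The model's JSON is still untrusted input: it can name a category that is not in the list, return a confidence as a string, or suggest a filename containing path separators. The following is a minimal sketch of a validation step that could run on the parsed response before it is used to build paths. The names `validate_categorization`, `slugify`, and `ALLOWED_CATEGORIES` are hypothetical and not part of the original code; inside the agent, `set(CATEGORIES)` would replace the hard-coded set.

```python
import re

# Assumption: mirrors the keys of the CATEGORIES dict defined above.
ALLOWED_CATEGORIES = {
    "contracts", "proposals", "invoices", "reports", "correspondence",
    "technical", "marketing", "hr", "misc",
}


def slugify(value: str, fallback: str) -> str:
    """Lowercase the value and collapse every run of other characters into one hyphen."""
    slug = re.sub(r"[^a-z0-9]+", "-", str(value).lower()).strip("-")
    return slug or fallback


def validate_categorization(result: dict) -> dict:
    """Hypothetical helper: coerce a raw model response into safe, expected values."""
    category = str(result.get("category", "")).strip().lower()
    if category not in ALLOWED_CATEGORIES:
        category = "misc"

    try:
        confidence = float(result.get("confidence", 0.0))
    except (TypeError, ValueError):
        confidence = 0.0
    # Out-of-range values (including NaN) become 0 so the file is held for review.
    if not (0.0 <= confidence <= 1.0):
        confidence = 0.0

    return {
        "category": category,
        # Slugs cannot contain "/" or "..", so they cannot escape the target folder.
        "subcategory": slugify(result.get("subcategory") or "", "general"),
        "suggested_name": slugify(result.get("suggested_name") or "", "untitled"),
        "confidence": confidence,
        "summary": str(result.get("summary", "")),
    }
```

Falling back to "misc" with a confidence of 0 is one possible policy; it keeps malformed responses out of the automatic move step without discarding the file from the plan.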

Building the Folder Structure

The agent files each document under a category / subcategory / year hierarchy and appends a counter when the target filename is already taken:

from datetime import datetime

def build_target_path(
    base_dir: str,
    category: str,
    subcategory: str,
    suggested_name: str,
    original_ext: str,
    year: int | None = None,
) -> Path:
    """Build a target path following the folder structure convention."""
    if year is None:
        year = datetime.now().year

    # Only compute the path here; execute_plan creates the directory when it moves the file,
    # so building a plan leaves the disk untouched
    target_dir = Path(base_dir) / category / subcategory / str(year)

    filename = f"{suggested_name}{original_ext}"
    target = target_dir / filename

    # Handle name collisions
    counter = 1
    while target.exists():
        target = target_dir / f"{suggested_name}-{counter}{original_ext}"
        counter += 1

    return target

Executing the Organization Plan

Before moving any files, the agent generates a plan for human review, so a bad categorization can be caught before anything changes on disk:

import shutil
import logging

logger = logging.getLogger("file_agent")

@dataclass
class FilePlan:
    source: Path
    destination: Path
    category: str
    confidence: float
    summary: str

def create_organization_plan(
    source_dir: str, target_dir: str
) -> list[FilePlan]:
    """Scan files and create an organization plan without moving anything."""
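    files = scan_directory(source_dir)
    plan = []
    claimed: set[Path] = set()  # destinations already assigned within this plan

    for file_info in files:
        result = categorize_file(file_info)
        dest = build_target_path(
            target_dir,
            result["category"],
            result.get("subcategory", "general"),
            result["suggested_name"],
            file_info.extension,
        )
        # build_target_path only checks the disk, so two files given the same
        # suggested name would otherwise share one destination
        stem, counter = dest.stem, 1
        while dest in claimed or dest.exists():
            dest = dest.with_name(f"{stem}-{counter}{dest.suffix}")
            counter += 1
        claimed.add(dest)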
        plan.append(FilePlan(
            source=file_info.path,
            destination=dest,
            category=result["category"],
            confidence=result["confidence"],
            summary=result["summary"],
        ))

    return plan

def execute_plan(plan: list[FilePlan], min_confidence: float = 0.7):
    """Execute the organization plan, moving files above the confidence threshold."""
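    for item in plan:
        if item.confidence < min_confidence:
            logger.warning(f"Skipping (low confidence {item.confidence}): {item.source}")
            continue
        if item.destination.exists():
            # shutil.move can silently overwrite an existing file, so never clobber
            logger.warning(f"Skipping (destination exists): {item.destination}")
            continue
        item.destination.parent.mkdir(parents=True, exist_ok=True)
        shutil.move(str(item.source), str(item.destination))
        logger.info(f"Moved: {item.source.name} -> {item.destination}")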

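The confidence threshold keeps files the model is unsure about in place for manual review. The score is self-reported by the model rather than a calibrated probability, so check a sample of results before trusting it. execute_plan defaults to 0.7; start with a higher value such as 0.85 and lower it as you validate accuracy.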
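
To make the review step concrete, here is a minimal sketch of how a plan could be split and printed before anything is executed. `split_by_confidence` and `format_plan` are hypothetical helpers, not part of the original code; they rely only on the `source`, `destination`, and `confidence` attributes that `FilePlan` already has, and they use the same "below the threshold means skip" rule as `execute_plan`.

```python
from pathlib import Path
from typing import Protocol, Sequence


class PlannedMove(Protocol):
    """Structural type: any object with these attributes, such as FilePlan."""
    source: Path
    destination: Path
    confidence: float


def split_by_confidence(
    plan: Sequence[PlannedMove], min_confidence: float = 0.85
) -> tuple[list[PlannedMove], list[PlannedMove]]:
    """Return (items that would be moved, items left for manual review)."""
    to_move = [item for item in plan if item.confidence >= min_confidence]
    needs_review = [item for item in plan if item.confidence < min_confidence]
    return to_move, needs_review


def format_plan(plan: Sequence[PlannedMove], min_confidence: float = 0.85) -> str:
    """Render the plan as text so a person can approve it before execution."""
    to_move, needs_review = split_by_confidence(plan, min_confidence)
    lines = [
        f"MOVE   {item.confidence:.2f}  {item.source} -> {item.destination}"
        for item in to_move
    ]
    lines += [
        f"REVIEW {item.confidence:.2f}  {item.source}"
        for item in needs_review
    ]
    return "\n".join(lines)
```

Printing `format_plan(plan)` and asking for confirmation before calling `execute_plan` is one simple way to keep a human in the loop.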

FAQ

How do I handle duplicate files during organization?

Compute a SHA-256 hash of each file's content before moving. Maintain a hash-to-path mapping and flag duplicates. Let the user choose which copy to keep. For near-duplicates like different versions of the same document, compare filenames and modification dates to identify the most recent version.
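
The hashing approach described above can be sketched as follows. The function names are hypothetical; the code reads each file in chunks so large files do not need to fit in memory, and it only reports groups of identical files, leaving the choice of which copy to keep to the user.

```python
import hashlib
from collections import defaultdict
from pathlib import Path
from typing import Iterable


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file's bytes in 1 MiB chunks and return the hex digest."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(paths: Iterable[Path]) -> dict[str, list[Path]]:
    """Map each content hash to its paths, keeping only hashes seen more than once."""
    by_hash: dict[str, list[Path]] = defaultdict(list)
    for path in paths:
        by_hash[sha256_of_file(path)].append(path)
    return {digest: group for digest, group in by_hash.items() if len(group) > 1}
```

Passing `[f.path for f in scan_directory(source_dir)]` to `find_duplicates` before building the plan would flag exact copies; near-duplicates still need the filename and date comparison described above.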

What about files the AI cannot read, like images or videos?

For images, use an LLM with vision capabilities to describe the content. For videos, extract metadata like duration and codec using ffprobe. Fall back to filename analysis and file extension when content extraction is impossible. These files typically end up in a media category with subcategories based on metadata.
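
As a rough illustration of the extension-based fallback, the sketch below assigns a category from the guessed MIME type alone. `fallback_category` is a hypothetical helper, and the "media" category is an assumption taken from the answer above; it is not one of the keys in the CATEGORIES dict, so it would have to be added there before use.

```python
import mimetypes
from pathlib import Path


def fallback_category(filepath: Path) -> tuple[str, str]:
    """Return (category, subcategory) using only the filename's guessed MIME type."""
    mime, _ = mimetypes.guess_type(filepath.name)
    if mime is None:
        # Unknown or missing extension: leave the file for manual review.
        return ("misc", "unknown")

    major = mime.split("/", 1)[0]
    if major in ("image", "video", "audio"):
        # Assumption: a "media" category with one subfolder per media type.
        return ("media", major)
    return ("misc", major)
```

For example, `fallback_category(Path("holiday.png"))` yields a media/image placement, while a file with no extension stays in misc for a person to sort.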

How do I undo a batch organization if something goes wrong?

Log every move operation with source and destination paths in a JSON manifest file. To undo, read the manifest and reverse each move. This is one reason the plan-then-execute pattern matters: the plan already records a source and destination for every file, so the moves that were actually executed can be written out as an undo log.
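
A minimal sketch of such a manifest, under the assumption that moves are given as (source, destination) pairs; `move_with_manifest` and `undo_from_manifest` are hypothetical names, and the manifest layout (a JSON list of objects with "source" and "destination" keys) is just one possible format.

```python
import json
import shutil
from pathlib import Path


def move_with_manifest(moves: list[tuple[Path, Path]], manifest_path: Path) -> None:
    """Move each (source, destination) pair and record completed moves as JSON."""
    completed = []
    try:
        for source, destination in moves:
            destination.parent.mkdir(parents=True, exist_ok=True)
            shutil.move(str(source), str(destination))
            completed.append({"source": str(source), "destination": str(destination)})
    finally:
        # Write the manifest even if a move fails partway, so partial runs can be undone.
        manifest_path.write_text(json.dumps(completed, indent=2))


def undo_from_manifest(manifest_path: Path) -> int:
    """Reverse the recorded moves, newest first; return how many files were restored."""
    entries = json.loads(manifest_path.read_text())
    restored = 0
    for entry in reversed(entries):
        source = Path(entry["source"])
        destination = Path(entry["destination"])
        # Restore only when the moved file is still there and the old path is free.
        if destination.exists() and not source.exists():
            source.parent.mkdir(parents=True, exist_ok=True)
            shutil.move(str(destination), str(source))
            restored += 1
    return restored
```

Entries that cannot be restored safely are skipped rather than forced, so the returned count can be compared with the manifest length to spot files that need manual attention.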
