Building a File Organization Agent: AI-Powered Document Categorization and Filing
Build an AI agent that scans directories, analyzes file content, categorizes documents by type and topic, and organizes them into a structured folder hierarchy with consistent naming conventions.
The Cost of Digital Disorganization
A typical shared drive accumulates thousands of files with names like "Final_v2_REVISED.docx" and "report copy (3).pdf." Finding the right document means searching through nested folders with inconsistent naming, duplicate files scattered across directories, and no clear taxonomy. An AI file organization agent solves this by analyzing file content, categorizing documents by type and topic, and filing them into a structured hierarchy.
This guide builds a complete file organization agent that scans directories, extracts content from multiple file types, uses an LLM for intelligent categorization, and reorganizes files with consistent naming.
Scanning and Extracting File Content
The agent needs to read content from various file types. We create extractors for the most common formats:
from pathlib import Path
from dataclasses import dataclass
import mimetypes


@dataclass
class FileInfo:
    path: Path
    name: str
    extension: str
    size_bytes: int
    content_preview: str
    mime_type: str


def extract_text_content(filepath: Path, max_chars: int = 2000) -> str:
    """Extract text content from common file types."""
    ext = filepath.suffix.lower()

    # Plain-text formats: read directly, replacing undecodable bytes.
    if ext in (".txt", ".md", ".csv", ".log", ".json", ".yaml", ".yml"):
        return filepath.read_text(errors="replace")[:max_chars]

    if ext == ".pdf":
        import pymupdf  # PyMuPDF; older releases import as `fitz`

        doc = pymupdf.open(str(filepath))
        text = ""
        for page in doc:
            text += page.get_text()
            if len(text) > max_chars:
                break
        doc.close()
        return text[:max_chars]

    if ext == ".docx":
        from docx import Document  # python-docx

        doc = Document(str(filepath))
        text = "\n".join(p.text for p in doc.paragraphs)
        return text[:max_chars]

    # openpyxl reads .xlsx/.xlsm only; legacy .xls needs a different library.
    if ext in (".xlsx", ".xlsm"):
        import openpyxl

        wb = openpyxl.load_workbook(str(filepath), read_only=True)
        text = ""
        for sheet in wb.sheetnames[:3]:
            ws = wb[sheet]
            for row in ws.iter_rows(max_row=20, values_only=True):
                # Keep zeros and False; skip only empty cells.
                text += " ".join(str(c) for c in row if c is not None) + "\n"
        wb.close()
        return text[:max_chars]

    # Unsupported type: the caller falls back to filename and MIME type.
    return ""


def scan_directory(directory: str, recursive: bool = True) -> list[FileInfo]:
    """Scan a directory and extract file information."""
    root = Path(directory)
    pattern = "**/*" if recursive else "*"
    files = []

    for filepath in root.glob(pattern):
        # Skip directories and hidden files such as .DS_Store.
        if filepath.is_file() and not filepath.name.startswith("."):
            content = extract_text_content(filepath)
            mime, _ = mimetypes.guess_type(str(filepath))
            files.append(FileInfo(
                path=filepath,
                name=filepath.name,
                extension=filepath.suffix.lower(),
                size_bytes=filepath.stat().st_size,
                content_preview=content,
                mime_type=mime or "application/octet-stream",
            ))

    return files
AI-Powered Categorization
The agent sends file metadata and content previews to an LLM for intelligent categorization. The model determines the document type, topic, and an appropriate filename:
from openai import OpenAI
import json

client = OpenAI()

CATEGORIES = {
    "contracts": "Legal agreements, NDAs, service contracts, amendments",
    "proposals": "Business proposals, RFPs, pitch decks",
    "invoices": "Invoices, receipts, purchase orders, billing statements",
    "reports": "Analytics reports, status updates, research findings",
    "correspondence": "Emails, letters, memos, meeting notes",
    "technical": "Architecture docs, API specs, runbooks, code reviews",
    "marketing": "Campaign materials, brand assets, social media content",
    "hr": "Employee records, policies, offer letters, reviews",
    "misc": "Files that do not fit other categories",
}


def categorize_file(file_info: FileInfo) -> dict:
    """Use LLM to categorize a file based on its content and metadata."""
    category_desc = "\n".join(f"- {k}: {v}" for k, v in CATEGORIES.items())

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        response_format={"type": "json_object"},
        messages=[
            {
                "role": "system",
                "content": (
                    "You categorize files. Return JSON with:\n"
                    "- category: one of the categories below\n"
                    "- subcategory: a specific subcategory (e.g., 'nda' under contracts)\n"
                    "- suggested_name: a clean descriptive filename (lowercase, hyphens, no spaces)\n"
                    "- confidence: float 0-1\n"
                    "- summary: one sentence describing the file\n\n"
                    f"Categories:\n{category_desc}"
                ),
            },
            {
                "role": "user",
                "content": (
                    f"Filename: {file_info.name}\n"
                    f"Type: {file_info.mime_type}\n"
                    f"Size: {file_info.size_bytes} bytes\n\n"
                    f"Content preview:\n{file_info.content_preview[:1500]}"
                ),
            },
        ],
    )

    return json.loads(response.choices[0].message.content)
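The JSON returned by the model is untrusted input: it may name a category outside the list, put path separators in the suggested filename, or return the confidence as a string. A minimal sketch of a validation step follows. The names normalize_result, slugify, and ALLOWED are hypothetical and not part of the original code, and the fallback values ("misc", "general", "untitled") are assumptions.

```python
import re

# Hypothetical allow-list mirroring the keys of CATEGORIES above.
ALLOWED = {"contracts", "proposals", "invoices", "reports", "correspondence",
           "technical", "marketing", "hr", "misc"}


def slugify(value, fallback: str = "untitled") -> str:
    """Reduce arbitrary text to lowercase letters, digits, and hyphens."""
    slug = re.sub(r"[^a-z0-9]+", "-", str(value or "").lower()).strip("-")
    return slug or fallback


def normalize_result(raw: dict) -> dict:
    """Coerce an LLM response into safe, predictable fields."""
    category = str(raw.get("category") or "").lower()
    if category not in ALLOWED:
        category = "misc"  # unknown categories fall back to the catch-all

    try:
        confidence = float(raw.get("confidence", 0.0))
    except (TypeError, ValueError):
        confidence = 0.0  # unparseable confidence is treated as no confidence

    return {
        "category": category,
        "subcategory": slugify(raw.get("subcategory"), "general"),
        "suggested_name": slugify(raw.get("suggested_name")),
        "confidence": min(max(confidence, 0.0), 1.0),
        "summary": str(raw.get("summary") or ""),
    }
```

Because slugify strips everything except letters, digits, and hyphens, a response such as "../etc" cannot escape the target directory when it is later joined into a path.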
Building the Folder Structure
The agent creates a structured folder hierarchy based on categories and subcategories:
from datetime import datetime


def build_target_path(
    base_dir: str,
    category: str,
    subcategory: str,
    suggested_name: str,
    original_ext: str,
    year: int | None = None,
) -> Path:
    """Build a target path following the folder structure convention."""
    if year is None:
        year = datetime.now().year

    # Layout: <base>/<category>/<subcategory>/<year>/<name><ext>
    # No directory is created here, so planning leaves the disk untouched;
    # execute_plan creates parent directories just before each move.
    target_dir = Path(base_dir) / category / subcategory / str(year)

    filename = f"{suggested_name}{original_ext}"
    target = target_dir / filename

    # Handle name collisions with files already on disk
    counter = 1
    while target.exists():
        target = target_dir / f"{suggested_name}-{counter}{original_ext}"
        counter += 1

    return target
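The collision check in build_target_path relies on target.exists(), which only sees files already on disk. During planning nothing has moved yet, so two source files that receive the same suggested name would be assigned the same destination. One way to handle this, sketched below, is to track the destinations already claimed within the current plan. The helper reserve_path and the reserved set are hypothetical additions, not part of the original code.

```python
from pathlib import Path


def reserve_path(target: Path, reserved: set[Path]) -> Path:
    """Return a destination that is free on disk and not yet claimed in this plan."""
    candidate = target
    counter = 1
    while candidate.exists() or candidate in reserved:
        # Append -1, -2, ... before the extension until the name is unique.
        candidate = target.with_name(f"{target.stem}-{counter}{target.suffix}")
        counter += 1
    reserved.add(candidate)
    return candidate
```

A planner would create one empty set per run and pass every proposed destination through reserve_path before adding it to the plan.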
Executing the Organization Plan
Before moving files, the agent generates a plan for human review. This prevents destructive mistakes:
import shutil
import logging

logger = logging.getLogger("file_agent")


@dataclass
class FilePlan:
    source: Path
    destination: Path
    category: str
    confidence: float
    summary: str


def create_organization_plan(
    source_dir: str, target_dir: str
) -> list[FilePlan]:
    """Scan files and create an organization plan without moving anything."""
    files = scan_directory(source_dir)
    plan = []

    for file_info in files:
        result = categorize_file(file_info)
        dest = build_target_path(
            target_dir,
            result["category"],
            result.get("subcategory", "general"),
            result["suggested_name"],
            file_info.extension,
        )
        plan.append(FilePlan(
            source=file_info.path,
            destination=dest,
            category=result["category"],
            confidence=result["confidence"],
            summary=result["summary"],
        ))

    return plan


def execute_plan(plan: list[FilePlan], min_confidence: float = 0.7):
    """Execute the organization plan, moving files above the confidence threshold."""
    for item in plan:
        if item.confidence < min_confidence:
            logger.warning(f"Skipping (low confidence {item.confidence}): {item.source}")
            continue

        item.destination.parent.mkdir(parents=True, exist_ok=True)
        shutil.move(str(item.source), str(item.destination))
        logger.info(f"Moved: {item.source.name} -> {item.destination}")
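The confidence threshold leaves files the model is unsure about untouched for manual review. The score is the model's own estimate rather than a calibrated probability, so check it against a hand-labeled sample before relying on it. execute_plan defaults to 0.7; for a first run, pass a higher threshold such as 0.85 and lower it as you validate accuracy.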
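For the human review step, it helps to write the plan out in a form a reviewer can read before anything moves. The sketch below serializes a plan to JSON and flags which entries would clear the threshold. The function plan_to_json and the will_move field are hypothetical additions, and FilePlan is redeclared only so the snippet runs on its own.

```python
import json
from dataclasses import asdict, dataclass
from pathlib import Path


@dataclass
class FilePlan:  # redeclared here so the sketch is self-contained
    source: Path
    destination: Path
    category: str
    confidence: float
    summary: str


def plan_to_json(plan: list[FilePlan], min_confidence: float = 0.7) -> str:
    """Serialize the plan, least confident entries first, for manual review."""
    rows = []
    for item in sorted(plan, key=lambda p: p.confidence):
        row = asdict(item)
        # Path objects are not JSON-serializable, so store them as strings.
        row["source"] = str(item.source)
        row["destination"] = str(item.destination)
        # Mirrors the check in execute_plan: below the threshold means skipped.
        row["will_move"] = item.confidence >= min_confidence
        rows.append(row)
    return json.dumps(rows, indent=2)
```

Sorting by ascending confidence puts the entries most likely to need correction at the top of the file.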
FAQ
How do I handle duplicate files during organization?
Compute a SHA-256 hash of each file's content before moving. Maintain a hash-to-path mapping and flag duplicates. Let the user choose which copy to keep. For near-duplicates like different versions of the same document, compare filenames and modification dates to identify the most recent version.
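A minimal sketch of the hash-based approach, using only the standard library. The function names file_sha256 and find_duplicates and the 1 MiB chunk size are illustrative choices, not part of the original agent.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files are not loaded into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(paths: list[Path]) -> dict[str, list[Path]]:
    """Map each content hash to the paths sharing it, keeping only groups of 2+."""
    by_hash: dict[str, list[Path]] = defaultdict(list)
    for path in paths:
        by_hash[file_sha256(path)].append(path)
    return {h: group for h, group in by_hash.items() if len(group) > 1}
```

This catches byte-identical copies only. Different versions of the same document hash differently and still need the filename and date comparison described above.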
What about files the AI cannot read, like images or videos?
For images, use an LLM with vision capabilities to describe the content. For videos, extract metadata like duration and codec using ffprobe. Fall back to filename analysis and file extension when content extraction is impossible. These files typically end up in a media category with subcategories based on metadata.
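The extension-based fallback could look roughly like the sketch below, which uses the standard mimetypes module. The function fallback_category is hypothetical, and it assumes a "media" category has been added alongside the ones in CATEGORIES; vision models and ffprobe are left out to keep the example dependency-free.

```python
import mimetypes
from pathlib import Path


def fallback_category(path: Path) -> tuple[str, str]:
    """Guess (category, subcategory) from the extension when content is unreadable."""
    mime, _ = mimetypes.guess_type(path.name)
    if mime is None:
        return ("misc", "unknown")

    # A MIME type looks like "image/png"; the part before the slash is enough here.
    major = mime.split("/", 1)[0]
    if major in ("image", "video", "audio"):
        return ("media", major)
    return ("misc", "unknown")
```

For example, fallback_category(Path("photo.png")) yields ("media", "image"), while a file with no extension lands in ("misc", "unknown").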
How do I undo a batch organization if something goes wrong?
Log every move operation with source and destination paths in a JSON manifest file. To undo, read the manifest and reverse each move, working from the last entry back to the first. This is why the plan-then-execute pattern is critical: the plan already lists every source and destination, so it can double as an undo log.
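A minimal sketch of the manifest approach, assuming the manifest is a JSON list of source/destination pairs. The functions move_with_manifest and undo_from_manifest are hypothetical names, not part of the original agent.

```python
import json
import shutil
from pathlib import Path


def move_with_manifest(moves: list[tuple[Path, Path]], manifest: Path) -> None:
    """Move files, rewriting the manifest after each completed move."""
    done: list[dict[str, str]] = []
    manifest.write_text("[]")
    for source, destination in moves:
        destination.parent.mkdir(parents=True, exist_ok=True)
        shutil.move(str(source), str(destination))
        done.append({"source": str(source), "destination": str(destination)})
        # Persist progress immediately so a crash mid-batch is still reversible.
        manifest.write_text(json.dumps(done, indent=2))


def undo_from_manifest(manifest: Path) -> None:
    """Reverse recorded moves in last-in, first-out order."""
    for entry in reversed(json.loads(manifest.read_text())):
        source, destination = Path(entry["source"]), Path(entry["destination"])
        # Skip entries already undone or whose original location is occupied.
        if destination.exists() and not source.exists():
            source.parent.mkdir(parents=True, exist_ok=True)
            shutil.move(str(destination), str(source))
```

Recording only completed moves, rather than planned ones, means the undo step never tries to reverse a move that was skipped or failed.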
Written by
CallSphere Team