Building a CSV Analysis Agent: Upload Data, Ask Questions, Get Answers
Create an AI agent that accepts CSV file uploads, uses pandas for statistical analysis, answers natural language questions about the data, and generates clear reports with charts — all without the user writing any code.
The Problem with Manual CSV Analysis
Every business analyst has been there: someone sends a CSV file and asks "What trends do you see?" What follows is 30 minutes of opening the file in a spreadsheet, eyeballing columns, writing formulas, creating pivot tables, and building charts. A CSV analysis agent automates this entire workflow. The user uploads a file, asks a question in plain English, and gets a statistical answer with supporting visualizations.
The key insight is that pandas is already the perfect execution engine for tabular data analysis. The agent's job is to translate natural language into pandas operations, run them safely, and present the results clearly.
Loading and Profiling the CSV
The first tool loads a CSV file and returns a structural profile — column names, types, row count, and sample values:
import pandas as pd
from agents import Agent, Runner, function_tool

# In-memory registry of loaded DataFrames, keyed by name.
_dataframes: dict[str, pd.DataFrame] = {}

@function_tool
def load_csv(file_path: str, name: str = "default") -> str:
    """Load a CSV file into memory and return a profile of its structure."""
    try:
        df = pd.read_csv(file_path)
    except Exception as e:
        return f"Error loading CSV: {e}"
    _dataframes[name] = df
    profile_parts = [
        f"Loaded: {len(df)} rows, {len(df.columns)} columns",
        f"Columns: {', '.join(df.columns.tolist())}",
        f"Data types:\n{df.dtypes.to_string()}",
        f"\nFirst 3 rows:\n{df.head(3).to_string()}",
        f"\nNull counts:\n{df.isnull().sum().to_string()}",
    ]
    return "\n".join(profile_parts)
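The profiling logic itself is plain pandas and can be tried without the SDK decorator. A minimal sketch, using an in-memory `StringIO` buffer to stand in for an uploaded file (the column names are illustrative):

```python
import io

import pandas as pd

# Hypothetical in-memory CSV standing in for an uploaded file.
raw = io.StringIO("region,order_value\nWest,187.4\nEast,140.0\nWest,\n")

df = pd.read_csv(raw)
profile = "\n".join([
    f"Loaded: {len(df)} rows, {len(df.columns)} columns",
    f"Columns: {', '.join(df.columns.tolist())}",
    f"Null counts:\n{df.isnull().sum().to_string()}",
])
print(profile)
```

Note that the empty `order_value` in the last row shows up in the null counts, which is exactly the kind of signal the agent uses to decide whether cleaning is needed before analysis.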
Statistical Analysis Tool
This tool runs descriptive statistics, correlations, group-by aggregations, and other pandas operations specified by the agent:
@function_tool
def analyze_data(name: str, operation: str) -> str:
    """Run a pandas operation on a loaded dataset.
    Operations: describe, corr, groupby, value_counts, query.
    For groupby: use format 'groupby:column:agg_col:agg_func'
    For value_counts: use format 'value_counts:column'
    For query: use format 'query:pandas_query_string'"""
    if name not in _dataframes:
        return f"Error: No dataset loaded with name '{name}'."
    df = _dataframes[name]
    try:
        if operation == "describe":
            return df.describe(include="all").to_string()
        elif operation == "corr":
            numeric_df = df.select_dtypes(include="number")
            return numeric_df.corr().to_string()
        elif operation.startswith("groupby:"):
            parts = operation.split(":")
            group_col, agg_col, agg_func = parts[1], parts[2], parts[3]
            result = df.groupby(group_col)[agg_col].agg(agg_func)
            return result.to_string()
        elif operation.startswith("value_counts:"):
            col = operation.split(":")[1]
            return df[col].value_counts().head(20).to_string()
        elif operation.startswith("query:"):
            query_str = operation[len("query:"):]
            filtered = df.query(query_str)
            return f"Rows matching: {len(filtered)}\n{filtered.head(10).to_string()}"
        else:
            return "Unknown operation."
    except Exception as e:
        return f"Analysis error: {e}"
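To see how the colon-delimited operation string maps onto a pandas call, here is the groupby branch in isolation on toy data (the `region` and `order_value` columns echo the sales example used later; the values are invented):

```python
import pandas as pd

df = pd.DataFrame({
    "region": ["West", "East", "West", "East"],
    "order_value": [200.0, 120.0, 180.0, 160.0],
})

# Parse 'groupby:column:agg_col:agg_func' exactly as the tool does.
operation = "groupby:region:order_value:mean"
_, group_col, agg_col, agg_func = operation.split(":")
result = df.groupby(group_col)[agg_col].agg(agg_func)
print(result.to_string())
```

The flat string format keeps the tool schema simple for the LLM: it only ever has to produce one string argument, not nested JSON.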
Safe Code Execution for Custom Analysis
Sometimes the agent needs to run custom pandas code that does not fit predefined operations. A sandboxed execution tool handles this:
@function_tool
def run_pandas_code(name: str, code: str) -> str:
    """Execute custom pandas code on a loaded dataset.
    The dataframe is available as 'df'. Set 'result = ...' to return output."""
    if name not in _dataframes:
        return f"Error: No dataset '{name}' loaded."
    BLOCKED_KEYWORDS = ["import os", "subprocess", "eval(", "exec(", "__", "open("]
    for kw in BLOCKED_KEYWORDS:
        if kw in code:
            return f"Blocked: '{kw}' is not allowed for security reasons."
    df = _dataframes[name].copy()  # work on a copy so failed code can't corrupt state
    local_vars = {"df": df, "pd": pd}
    try:
        exec(code, {"__builtins__": {}}, local_vars)
        if "result" in local_vars:
            r = local_vars["result"]
            return r.to_string() if hasattr(r, "to_string") else str(r)
        return "Code executed. Set 'result = ...' to return output."
    except Exception as e:
        return f"Execution error: {e}"
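The restricted-exec pattern can be exercised on its own. A minimal sketch with a toy dataframe; keep in mind that keyword blocklists and stripped builtins are speed bumps, not real isolation, so this is illustrative rather than a security guarantee:

```python
import pandas as pd

df = pd.DataFrame({"order_value": [100.0, 200.0, 300.0]})
local_vars = {"df": df, "pd": pd}

# Same pattern as the tool: no builtins available, and the snippet
# hands its output back by assigning to 'result'.
code = "result = df['order_value'].mean()"
exec(code, {"__builtins__": {}}, local_vars)
print(local_vars["result"])
```

Because the snippet only touches `df` and `pd` from `local_vars`, it runs fine even with `__builtins__` emptied; anything that needs `open`, `import`, or other builtins fails at runtime, which is the point.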
Assembling the CSV Agent
csv_agent = Agent(
    name="CSV Analyst",
    instructions="""You are a CSV data analysis agent. Your workflow:
    1. When the user provides a file, call load_csv to inspect its structure.
    2. For questions about distributions or summaries, use analyze_data with 'describe'.
    3. For relationship questions, use analyze_data with 'corr'.
    4. For group comparisons, use analyze_data with 'groupby:col:agg_col:func'.
    5. For complex analysis, use run_pandas_code with safe pandas operations.
    6. Always explain results in plain language with specific numbers.
    7. Suggest follow-up questions the user might want to ask.""",
    tools=[load_csv, analyze_data, run_pandas_code],
)
Example Interaction
result = Runner.run_sync(
    csv_agent,
    "Load the file sales_2025.csv and tell me which region had the "
    "highest average order value, and whether order value correlates with "
    "customer tenure.",
)
print(result.final_output)
The agent loads the CSV, runs describe to understand the data, performs a groupby:region:order_value:mean to rank regions, computes correlations between order_value and customer_tenure_months, and responds with something like: "The West region had the highest average order value at $187.40. There is a moderate positive correlation (r = 0.43) between customer tenure and order value."
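The correlation step in that answer reduces to a single `Series.corr` call (Pearson by default). A toy sketch with invented values, deliberately made perfectly linear so the expected coefficient is obvious:

```python
import pandas as pd

df = pd.DataFrame({
    "customer_tenure_months": [1, 5, 9, 13],
    "order_value": [100.0, 140.0, 180.0, 220.0],  # perfectly linear on purpose
})

# Pearson correlation between the two columns.
r = df["customer_tenure_months"].corr(df["order_value"])
print(round(r, 2))
```

On real data the agent would report a moderate r like the 0.43 above; the value here is 1.0 only because the toy series is exactly linear.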
Handling Messy Data
Real-world CSVs are messy. Add a cleaning tool that the agent can invoke before analysis:
@function_tool
def clean_data(name: str, operations: str) -> str:
    """Clean a loaded dataset. Operations (comma-separated):
    drop_nulls, fill_nulls_zero, strip_whitespace, lowercase_columns"""
    if name not in _dataframes:
        return f"No dataset '{name}'."
    df = _dataframes[name]
    ops = [o.strip() for o in operations.split(",")]
    for op in ops:
        if op == "drop_nulls":
            df = df.dropna()
        elif op == "fill_nulls_zero":
            df = df.fillna(0)
        elif op == "strip_whitespace":
            df = df.apply(lambda c: c.str.strip() if c.dtype == "object" else c)
        elif op == "lowercase_columns":
            df.columns = [c.lower().replace(" ", "_") for c in df.columns]
    _dataframes[name] = df
    return f"Cleaned. Shape: {df.shape[0]} rows, {df.shape[1]} columns."
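The individual cleaning steps are easy to verify on a deliberately messy toy frame (padded strings, mixed-case headers with spaces, a null):

```python
import pandas as pd

df = pd.DataFrame({"Region Name": ["  West ", "East"],
                   "Order Value": [200.0, None]})

# strip_whitespace: trim string columns only, leave numerics alone
df = df.apply(lambda c: c.str.strip() if c.dtype == "object" else c)
# lowercase_columns: normalize headers to snake_case-ish names
df.columns = [c.lower().replace(" ", "_") for c in df.columns]
# fill_nulls_zero: replace missing values with 0
df = df.fillna(0)
print(df.to_string())
```

Normalizing headers matters more than it looks: the agent refers to columns by name in later `groupby` and `query` calls, and `"Order Value"` with a space would force awkward backtick-quoting in `df.query`.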
FAQ
Is it safe to let an LLM execute pandas code?
The sandboxed executor shown above blocks dangerous operations like file access and subprocess calls. For production systems, run the code in a Docker container or use a dedicated code sandbox like E2B to provide true isolation.
What is the maximum file size this agent can handle?
Pandas loads the entire CSV into memory, so the limit depends on your server's RAM. For files under 500 MB, standard pandas works well. For larger files, swap in polars or dask as the execution engine — the agent interface stays the same.
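Before swapping engines entirely, plain pandas can also stream a large file in pieces via `chunksize`, which turns `read_csv` into an iterator of DataFrames. A minimal sketch on an in-memory buffer:

```python
import io

import pandas as pd

# Ten-row toy "file"; in practice this would be a multi-GB path on disk.
raw = io.StringIO("x\n" + "\n".join(str(i) for i in range(10)))

# chunksize makes read_csv yield DataFrames of up to 4 rows each,
# so only one chunk is in memory at a time.
total = 0
for chunk in pd.read_csv(raw, chunksize=4):
    total += len(chunk)
print(total)
```

Aggregations that decompose over chunks (counts, sums, group-wise partial sums) work this way; operations that need the whole frame at once, like a global sort or correlation matrix, are where polars or dask earn their keep.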
Can the agent handle Excel files or databases too?
Yes. Replace pd.read_csv with pd.read_excel for spreadsheets or pd.read_sql for databases. The rest of the analysis pipeline remains identical since pandas normalizes everything into DataFrames.
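As a sketch of the database path, `pd.read_sql` accepts a DBAPI connection (sqlite3 is the officially supported one) and returns an ordinary DataFrame, so every tool above works unchanged. The table name and columns here are invented for the example:

```python
import sqlite3

import pandas as pd

# Hypothetical in-memory database standing in for a real data source.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, order_value REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [("West", 200.0), ("East", 120.0)])

# Same downstream pipeline: read_sql yields an ordinary DataFrame.
df = pd.read_sql("SELECT * FROM orders", conn)
print(len(df), list(df.columns))
```

For non-sqlite databases, pass a SQLAlchemy engine instead of a raw connection.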
#CSV #Pandas #DataAnalysis #Statistics #AIAgents #AgenticAI #LearnAI #AIEngineering
CallSphere Team