
Building AI Report Generation for SaaS: Natural Language to Analytics

Implement a natural language report builder that lets SaaS users ask questions in plain English and get back charts, tables, and exportable reports from their product data.

Why Natural Language Reports Matter

Most SaaS products have a reporting section that requires users to select filters, choose chart types, and configure date ranges manually. Power users love it. Everyone else avoids it. When a VP asks "How did our conversion rate change after we launched the new pricing page?", they want to type that question and get an answer — not spend 15 minutes configuring a funnel report.

AI report generation bridges this gap by translating natural language into database queries, visualizations, and exportable documents.

Safe Data Access Layer

The AI must query your database without risking SQL injection or unauthorized data access. Build a restricted query layer that only allows SELECT statements on approved tables.

import re
import sqlalchemy
from sqlalchemy import text
from dataclasses import dataclass

@dataclass
class TableSchema:
    name: str
    columns: list[dict]  # {"name": str, "type": str, "description": str}
    description: str

ALLOWED_TABLES: dict[str, TableSchema] = {
    "deals": TableSchema(
        name="deals",
        columns=[
            {"name": "id", "type": "UUID", "description": "Deal ID"},
            {"name": "name", "type": "VARCHAR", "description": "Deal name"},
            {"name": "value", "type": "DECIMAL", "description": "Deal value in USD"},
            {"name": "stage", "type": "VARCHAR", "description": "Pipeline stage"},
            {"name": "created_at", "type": "TIMESTAMP", "description": "Creation date"},
            {"name": "closed_at", "type": "TIMESTAMP", "description": "Close date"},
            {"name": "tenant_id", "type": "UUID", "description": "Tenant ID"},
        ],
        description="Sales deals and opportunities",
    ),
    "contacts": TableSchema(
        name="contacts",
        columns=[
            {"name": "id", "type": "UUID", "description": "Contact ID"},
            {"name": "name", "type": "VARCHAR", "description": "Full name"},
            {"name": "email", "type": "VARCHAR", "description": "Email address"},
            {"name": "company", "type": "VARCHAR", "description": "Company name"},
            {"name": "created_at", "type": "TIMESTAMP", "description": "Creation date"},
            {"name": "tenant_id", "type": "UUID", "description": "Tenant ID"},
        ],
        description="Contact records",
    ),
}


def validate_query(sql: str, tenant_id: str) -> str:
    """Validate and sandbox the generated SQL."""
    sql_upper = sql.strip().upper()

    # Only allow SELECT statements
    if not sql_upper.startswith("SELECT"):
        raise ValueError("Only SELECT queries are allowed.")

    # Block stacked statements and SQL comments
    if ";" in sql.strip().rstrip(";") or "--" in sql or "/*" in sql:
        raise ValueError("Multiple statements and comments are not allowed.")

    # Block dangerous keywords (word-boundary match avoids false
    # positives on column names like created_at)
    forbidden = ["INSERT", "UPDATE", "DELETE", "DROP", "ALTER", "TRUNCATE",
                 "CREATE", "GRANT", "REVOKE", "EXEC"]
    for keyword in forbidden:
        if re.search(rf'\b{keyword}\b', sql_upper):
            raise ValueError(f"Forbidden keyword: {keyword}")

    # Ensure the query filters on this caller's tenant ID — a bare
    # "tenant_id" substring check would accept someone else's tenant
    if "tenant_id" not in sql.lower() or tenant_id not in sql:
        raise ValueError("Query must filter by the caller's tenant_id.")

    return sql

Text-to-SQL with Schema Context

Feed the LLM your table schemas so it generates accurate queries. Always include column descriptions — they are more valuable than column names for query accuracy.

async def generate_report_query(question: str, tenant_id: str,
                                 llm_client) -> dict:
    schema_description = ""
    for table in ALLOWED_TABLES.values():
        cols = ", ".join(
            [f"{c['name']} ({c['type']}: {c['description']})"
             for c in table.columns]
        )
        schema_description += f"\nTable: {table.name} - {table.description}\n"
        schema_description += f"  Columns: {cols}\n"

    prompt = f"""You are a SQL query generator for a SaaS analytics system.
Generate a PostgreSQL query to answer the user's question.

RULES:
- Only use tables and columns from the schema below
- ALWAYS filter by tenant_id = '{tenant_id}'
- Use aggregate functions (COUNT, SUM, AVG) for summary questions
- Include ORDER BY and LIMIT where appropriate
- Return JSON with: "sql", "chart_type" (bar, line, pie, table, number),
  "title", "x_axis", "y_axis"

SCHEMA:
{schema_description}

User question: {question}"""

    response = await llm_client.chat(
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    # Assuming the client returns the model's raw JSON text;
    # adapt this if your client already deserializes the response.
    import json
    return json.loads(response)
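Even with a JSON response format enforced, it is worth checking the plan's shape before touching the database. A defensive-parsing sketch (parse_query_plan is an illustrative helper; the required keys follow the prompt's RULES section):

```python
import json

ALLOWED_CHARTS = {"bar", "line", "pie", "table", "number"}
REQUIRED_KEYS = {"sql", "chart_type", "title"}

def parse_query_plan(raw: str) -> dict:
    """Parse the LLM's JSON response and sanity-check it before execution."""
    plan = json.loads(raw)
    missing = REQUIRED_KEYS - plan.keys()
    if missing:
        raise ValueError(f"Query plan missing keys: {sorted(missing)}")
    if plan["chart_type"] not in ALLOWED_CHARTS:
        plan["chart_type"] = "table"  # fall back to the safest rendering
    return plan

plan = parse_query_plan(
    '{"sql": "SELECT COUNT(*) FROM deals", '
    '"chart_type": "donut", "title": "Deal count"}'
)
print(plan["chart_type"])  # table
```

Falling back to a table instead of raising keeps the report usable when the model invents a chart type.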

Executing Queries and Building Charts

Run the validated query and transform results into chart-ready data structures.


from enum import Enum
from pydantic import BaseModel

class ChartType(str, Enum):
    BAR = "bar"
    LINE = "line"
    PIE = "pie"
    TABLE = "table"
    NUMBER = "number"

class ReportResult(BaseModel):
    title: str
    chart_type: ChartType
    data: list[dict]
    x_axis: str | None = None
    y_axis: str | None = None
    summary: str

async def execute_report(query_plan: dict, tenant_id: str,
                          db_engine) -> ReportResult:
    sql = validate_query(query_plan["sql"], tenant_id)

    async with db_engine.connect() as conn:
        result = await conn.execute(text(sql))
        rows = [dict(row._mapping) for row in result.fetchall()]

    # For single-number results
    if query_plan.get("chart_type") == "number" and len(rows) == 1:
        value = list(rows[0].values())[0]
        return ReportResult(
            title=query_plan["title"],
            chart_type=ChartType.NUMBER,
            data=[{"value": value}],
            summary=f"{query_plan['title']}: {value}",
        )

    # Serialize datetime, date, and decimal values so the result is
    # JSON-safe for the API response and exports
    from datetime import date, datetime
    from decimal import Decimal

    def serialize(obj):
        if isinstance(obj, (datetime, date)):
            return obj.isoformat()
        if isinstance(obj, Decimal):
            return float(obj)
        return obj

    serialized_rows = [
        {k: serialize(v) for k, v in row.items()} for row in rows
    ]

    return ReportResult(
        title=query_plan["title"],
        chart_type=ChartType(query_plan.get("chart_type", "table")),
        data=serialized_rows,
        x_axis=query_plan.get("x_axis"),
        y_axis=query_plan.get("y_axis"),
        summary=f"Found {len(rows)} records for: {query_plan['title']}",
    )
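On the frontend, most charting libraries want labels and series rather than raw row dicts. A small sketch that pivots a report's rows into a Chart.js-style payload (the labels/datasets field names are an assumption about your frontend contract):

```python
def to_chart_payload(rows: list[dict], x_axis: str, y_axis: str) -> dict:
    """Pivot row dicts into the labels/datasets shape chart libraries expect."""
    return {
        "labels": [row[x_axis] for row in rows],
        "datasets": [{"label": y_axis, "data": [row[y_axis] for row in rows]}],
    }

rows = [
    {"stage": "Prospecting", "total": 120000.0},
    {"stage": "Negotiation", "total": 80000.0},
]
payload = to_chart_payload(rows, x_axis="stage", y_axis="total")
print(payload["labels"])  # ['Prospecting', 'Negotiation']
```

The x_axis and y_axis values come straight from the query plan, which is why the prompt asks the model to return them.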

Export to Multiple Formats

Users need reports as CSV downloads and email-ready HTML; a PDF can then be rendered from the HTML output with a library such as WeasyPrint.

import csv
import io

def export_csv(report: ReportResult) -> str:
    if not report.data:
        return ""
    output = io.StringIO()
    writer = csv.DictWriter(output, fieldnames=report.data[0].keys())
    writer.writeheader()
    writer.writerows(report.data)
    return output.getvalue()


def export_html_table(report: ReportResult) -> str:
    from html import escape

    if not report.data:
        return "<p>No data available.</p>"
    headers = list(report.data[0].keys())
    # Escape every value — report data can contain user-entered text,
    # and unescaped output in an email is an HTML injection vector
    html = f"<h2>{escape(report.title)}</h2><table border='1'><tr>"
    html += "".join(f"<th>{escape(str(h))}</th>" for h in headers)
    html += "</tr>"
    for row in report.data:
        html += "<tr>"
        html += "".join(f"<td>{escape(str(row.get(h, '')))}</td>" for h in headers)
        html += "</tr>"
    html += "</table>"
    return html

The Complete Report API

from fastapi import FastAPI, Depends
from pydantic import BaseModel

app = FastAPI()

class ReportRequest(BaseModel):
    question: str

@app.post("/api/reports/generate", response_model=ReportResult)
async def generate_report(
    req: ReportRequest,
    tenant_id: str = Depends(get_current_tenant),  # your auth dependency
    llm_client = Depends(get_llm_client),          # your LLM client provider
    db_engine = Depends(get_db_engine),            # your async engine provider
):
    query_plan = await generate_report_query(
        question=req.question,
        tenant_id=tenant_id,
        llm_client=llm_client,
    )
    report = await execute_report(query_plan, tenant_id, db_engine)
    return report

FAQ

How do I prevent the AI from generating expensive queries?

Add a query cost estimator using EXPLAIN before execution. Set a maximum estimated cost threshold (e.g., 10,000 cost units) and reject queries that exceed it. Also enforce a hard row limit with LIMIT 10000 appended to every query, and set a statement timeout at the database level (e.g., 30 seconds).
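The hard row limit can be enforced in code rather than trusting the model to include it. A regex-based sketch that appends (or tightens) a LIMIT clause, assuming the single-statement SELECTs produced above (OFFSET clauses are not handled here):

```python
import re

MAX_ROWS = 10_000

def enforce_row_limit(sql: str, max_rows: int = MAX_ROWS) -> str:
    """Append a hard LIMIT, or tighten an existing one that is too large."""
    stripped = sql.rstrip().rstrip(";")
    match = re.search(r'\bLIMIT\s+(\d+)\s*$', stripped, re.IGNORECASE)
    if match:
        if int(match.group(1)) <= max_rows:
            return stripped  # existing LIMIT is already strict enough
        stripped = stripped[:match.start()].rstrip()
    return f"{stripped} LIMIT {max_rows}"

print(enforce_row_limit("SELECT * FROM deals WHERE tenant_id = 't1'"))
# SELECT * FROM deals WHERE tenant_id = 't1' LIMIT 10000
```

Run this after validate_query and before execution, alongside the EXPLAIN cost check and statement timeout.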

What if the AI generates an incorrect query?

Show the generated SQL to the user alongside the results, with an "Edit Query" option. Log all generated queries with the original question for quality monitoring. Build a feedback loop where users can flag incorrect results, and use those examples to improve the system prompt with few-shot examples of correct query patterns.

How do I handle questions that span multiple tables?

Include JOIN relationships in your schema description. Specify which columns are foreign keys and how tables relate. The LLM handles multi-table queries well when the schema description includes lines like "deals.contact_id references contacts.id" — this gives it the explicit relationship it needs to write correct JOINs.
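Turning foreign keys into those relationship lines is mechanical. A sketch, where the deals.contact_id foreign key is hypothetical (it does not exist in the schema defined earlier):

```python
# (parent_table, fk_column, referenced_table, referenced_column)
RELATIONSHIPS = [
    ("deals", "contact_id", "contacts", "id"),  # hypothetical FK for illustration
]

def describe_relationships(rels: list[tuple]) -> str:
    """Render FK pairs as the plain-English lines the schema prompt needs."""
    return "\n".join(
        f"{table}.{col} references {ref_table}.{ref_col}"
        for table, col, ref_table, ref_col in rels
    )

print(describe_relationships(RELATIONSHIPS))
# deals.contact_id references contacts.id
```

Append the result to schema_description in generate_report_query so every prompt carries the join graph.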


#AIReports #NaturalLanguageAnalytics #SaaS #TexttoSQL #Python #DataVisualization #AgenticAI #LearnAI #AIEngineering


CallSphere Team

Expert insights on AI voice agents and customer communication automation.
