Building a Text-to-SQL Agent with GPT-4: Schema-Aware Query Generation
Build a complete text-to-SQL agent using GPT-4 that extracts database schemas, generates SQL queries from natural language, executes them safely, and formats results for end users.
Why Build an Agent, Not Just a Prompt?
A simple text-to-SQL prompt generates a query and stops. An agent goes further: it extracts the schema automatically, generates the query, executes it, handles errors, retries with corrections, and formats the final answer. This agentic loop is what makes the difference between a demo and a production tool.
In this tutorial, you will build a complete text-to-SQL agent using GPT-4 that works against any PostgreSQL database.
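Stripped of the GPT-4 and PostgreSQL specifics, the core loop looks like this. This is a minimal sketch with stubbed `generate_query`/`run_query` functions — both names are illustrative stand-ins, not part of the agent built below:

```python
# Minimal sketch of the generate -> execute -> retry loop.
# generate_query and run_query are stubs standing in for the real
# GPT-4 call and database execution built later in this tutorial.

def generate_query(question: str, error_context: str = "") -> str:
    # Stub: a real implementation would call the LLM here.
    if error_context:
        return "SELECT 1"   # "corrected" query once the error is fed back
    return "SELEC 1"        # deliberately broken first attempt

def run_query(sql: str) -> list:
    # Stub: a real implementation would hit the database.
    if not sql.upper().startswith("SELECT"):
        raise ValueError(f"syntax error near: {sql}")
    return [(1,)]

def ask(question: str, max_retries: int = 3) -> dict:
    error_context = ""
    for attempt in range(max_retries):
        sql = generate_query(question, error_context)
        try:
            return {"sql": sql, "results": run_query(sql),
                    "attempts": attempt + 1}
        except Exception as e:
            # Feed the failure back so the next attempt can self-correct
            error_context = f"Query: {sql}\nError: {e}"
    return {"sql": sql, "error": error_context, "attempts": max_retries}
```

Everything that follows fills in the two stubs: schema-aware query generation and real PostgreSQL execution.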
Step 1: Schema Extraction
The first component extracts a structured schema representation from your database. This gives the LLM the context it needs to write accurate queries.
import psycopg2
from dataclasses import dataclass

@dataclass
class ColumnInfo:
    name: str
    data_type: str
    is_nullable: bool
    is_primary_key: bool

@dataclass
class TableInfo:
    name: str
    columns: list[ColumnInfo]
    foreign_keys: list[str]

def extract_schema(conn_string: str) -> list[TableInfo]:
    """Extract full schema metadata from a PostgreSQL database."""
    conn = psycopg2.connect(conn_string)
    cur = conn.cursor()

    # Get all user tables
    cur.execute("""
        SELECT table_name FROM information_schema.tables
        WHERE table_schema = 'public' AND table_type = 'BASE TABLE'
    """)
    tables = []
    for (table_name,) in cur.fetchall():
        # Get columns
        cur.execute("""
            SELECT column_name, data_type, is_nullable
            FROM information_schema.columns
            WHERE table_schema = 'public' AND table_name = %s
            ORDER BY ordinal_position
        """, (table_name,))
        columns = [
            ColumnInfo(name=row[0], data_type=row[1],
                       is_nullable=row[2] == "YES", is_primary_key=False)
            for row in cur.fetchall()
        ]

        # Get primary keys
        cur.execute("""
            SELECT kcu.column_name
            FROM information_schema.table_constraints tc
            JOIN information_schema.key_column_usage kcu
              ON tc.constraint_name = kcu.constraint_name
            WHERE tc.table_name = %s AND tc.constraint_type = 'PRIMARY KEY'
        """, (table_name,))
        pk_columns = {row[0] for row in cur.fetchall()}
        for col in columns:
            col.is_primary_key = col.name in pk_columns

        # Get foreign keys
        cur.execute("""
            SELECT kcu.column_name, ccu.table_name, ccu.column_name
            FROM information_schema.table_constraints tc
            JOIN information_schema.key_column_usage kcu
              ON tc.constraint_name = kcu.constraint_name
            JOIN information_schema.constraint_column_usage ccu
              ON tc.constraint_name = ccu.constraint_name
            WHERE tc.table_name = %s AND tc.constraint_type = 'FOREIGN KEY'
        """, (table_name,))
        fks = [f"{r[0]} -> {r[1]}.{r[2]}" for r in cur.fetchall()]

        tables.append(TableInfo(name=table_name, columns=columns, foreign_keys=fks))

    conn.close()
    return tables
Step 2: Schema Formatting for the Prompt
The schema needs to be formatted in a way the LLM can parse efficiently. CREATE TABLE syntax tends to work best because LLMs have seen vast quantities of SQL DDL during training.
def format_schema(tables: list[TableInfo]) -> str:
    """Format schema as CREATE TABLE statements."""
    output = []
    for table in tables:
        cols = []
        for c in table.columns:
            parts = [f"  {c.name} {c.data_type}"]
            if c.is_primary_key:
                parts.append("PRIMARY KEY")
            if not c.is_nullable:
                parts.append("NOT NULL")
            cols.append(" ".join(parts))
        create = f"CREATE TABLE {table.name} (\n"
        create += ",\n".join(cols)
        for fk in table.foreign_keys:
            col, ref = fk.split(" -> ")
            ref_table, ref_col = ref.split(".", 1)
            create += f",\n  FOREIGN KEY ({col}) REFERENCES {ref_table} ({ref_col})"
        create += "\n);"
        output.append(create)
    return "\n\n".join(output)
Step 3: The Query Generation Agent
Now combine schema extraction with GPT-4 to create a full agent that generates, executes, and retries queries.
import openai
import json

class TextToSQLAgent:
    def __init__(self, conn_string: str):
        self.conn_string = conn_string
        self.client = openai.OpenAI()
        self.schema = format_schema(extract_schema(conn_string))
        self.max_retries = 3

    def generate_sql(self, question: str, error_context: str = "") -> str:
        messages = [
            {
                "role": "system",
                "content": f"""You are a PostgreSQL expert. Convert the user's
question into a valid SQL query using this schema:

{self.schema}

Rules:
- Return ONLY the SQL query
- Use PostgreSQL syntax
- Always use table aliases for JOINs
- Limit results to 100 rows unless specified otherwise
{f"Previous error: {error_context}" if error_context else ""}""",
            },
            {"role": "user", "content": question},
        ]
        response = self.client.chat.completions.create(
            model="gpt-4o", messages=messages, temperature=0
        )
        sql = response.choices[0].message.content.strip()
        # Strip markdown code fences (``` or ```sql) if present
        if sql.startswith("```"):
            sql = sql.split("\n", 1)[1].rsplit("```", 1)[0].strip()
        return sql

    def execute(self, sql: str) -> list[dict]:
        conn = psycopg2.connect(self.conn_string)
        conn.set_session(readonly=True)  # reject writes at the session level
        try:
            cur = conn.cursor()
            cur.execute(sql)
            columns = [desc[0] for desc in cur.description]
            return [dict(zip(columns, row)) for row in cur.fetchall()]
        finally:
            # Close even when the query fails, so retries get a fresh connection
            conn.close()

    def ask(self, question: str) -> dict:
        error_context = ""
        for attempt in range(self.max_retries):
            sql = self.generate_sql(question, error_context)
            try:
                results = self.execute(sql)
                return {"sql": sql, "results": results, "attempts": attempt + 1}
            except Exception as e:
                error_context = f"Query: {sql}\nError: {str(e)}"
        return {"sql": sql, "error": error_context, "attempts": self.max_retries}
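The agent runs whatever SQL the model produced, so a guard in front of execution is worth having. A minimal sketch, assuming you also run the agent under a read-only database role — `is_read_only_select` is a hypothetical helper, not part of the class above:

```python
def is_read_only_select(sql: str) -> bool:
    """Heuristic guard: accept only a single SELECT/WITH statement.

    A lightweight pre-check, not a substitute for database-level
    permissions (a read-only role remains the real safety boundary).
    """
    stripped = sql.strip().rstrip(";")
    # Reject stacked statements like "SELECT ...; DROP TABLE ..."
    if ";" in stripped:
        return False
    first_word = stripped.split(None, 1)[0].upper() if stripped else ""
    return first_word in ("SELECT", "WITH")
```

Call it between generation and execution, and treat a rejection like any other error so the retry loop can ask the model for a plain SELECT.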
Step 4: Using the Agent
agent = TextToSQLAgent("postgresql://user:pass@localhost/sales_db")
answer = agent.ask("Which salesperson had the highest revenue last quarter?")
print(f"SQL: {answer['sql']}")
print(f"Results: {json.dumps(answer['results'], indent=2)}")
print(f"Resolved in {answer['attempts']} attempt(s)")
The retry mechanism is critical. In practice, roughly 10-15% of first-attempt queries contain minor errors that the LLM can self-correct when given the error message as context.
FAQ
Why use CREATE TABLE format instead of JSON for schema context?
LLMs have been trained on vastly more SQL DDL than structured JSON schema descriptions. Using CREATE TABLE statements consistently produces higher accuracy because the model can directly pattern-match against its training data. Benchmarks show 3-5% accuracy improvement with DDL format.
How do I handle very large schemas with hundreds of tables?
For large schemas, use a two-stage approach: first ask the LLM to identify which tables are relevant to the question, then provide only those tables in the generation prompt. This keeps the context window manageable and improves accuracy by reducing noise.
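The second stage should not blindly trust the first. A minimal sketch of validating the selection answer, assuming the model returns a comma-separated list of table names — `filter_relevant_tables` is a hypothetical helper:

```python
def filter_relevant_tables(llm_answer: str, known_tables: list[str]) -> list[str]:
    """Validate a comma-separated table list from the selection prompt.

    Keeps only names that actually exist in the schema (case-insensitive,
    preserving the schema's casing), so hallucinated tables never reach
    the generation prompt.
    """
    known = {t.lower(): t for t in known_tables}
    requested = (name.strip().lower() for name in llm_answer.split(","))
    return [known[name] for name in requested if name in known]
```

The validated list can then drive format_schema on just the surviving tables.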
Should I use GPT-4 or GPT-4o for text-to-SQL?
GPT-4o is recommended for most use cases. It offers comparable SQL generation accuracy to GPT-4 at significantly lower cost and latency. For extremely complex queries involving multiple CTEs or window functions, GPT-4 may produce slightly better results, but the difference is usually within 2-3%.
CallSphere Team
Expert insights on AI voice agents and customer communication automation.