Skip to content
Learn Agentic AI12 min read0 views

Multi-Table Text-to-SQL: Handling JOINs, Subqueries, and Complex Relationships

Master multi-table text-to-SQL challenges including JOIN inference, ambiguous column resolution, query planning for complex questions, and techniques that help LLMs reason across table relationships.

Why Multi-Table Queries Are Hard for LLMs

Single-table text-to-SQL is largely a solved problem — modern LLMs handle it with 95%+ accuracy. But when questions span multiple tables, accuracy drops to 70-80%. The challenge is not SQL syntax; it is relational reasoning. The LLM must determine which tables to join, through which columns, and how to handle one-to-many versus many-to-many relationships.

Consider this question: "What are the top 5 customers by total spending who have also left a review?" This requires joining customers, orders, and reviews, aggregating order totals, filtering by review existence, and sorting. Each of these steps introduces a potential error.

Schema Design for Multi-Table Reasoning

The way you present your schema directly affects multi-table accuracy. Explicitly state relationships and join paths.

SCHEMA_WITH_RELATIONSHIPS = """
-- Table Relationships:
-- customers.id -> orders.customer_id (one-to-many: a customer has many orders)
-- orders.id -> order_items.order_id (one-to-many: an order has many items)
-- products.id -> order_items.product_id (one-to-many: a product appears in many order items)
-- customers.id -> reviews.customer_id (one-to-many: a customer writes many reviews)
-- products.id -> reviews.product_id (one-to-many: a product has many reviews)

CREATE TABLE customers (
    id SERIAL PRIMARY KEY,
    name VARCHAR(100) NOT NULL,
    email VARCHAR(255) UNIQUE NOT NULL,
    tier VARCHAR(20) DEFAULT 'standard'  -- standard, premium, enterprise
);

CREATE TABLE orders (
    id SERIAL PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customers(id),
    order_date DATE NOT NULL,
    status VARCHAR(20) NOT NULL  -- pending, shipped, delivered, cancelled
);

CREATE TABLE order_items (
    id SERIAL PRIMARY KEY,
    order_id INTEGER NOT NULL REFERENCES orders(id),
    product_id INTEGER NOT NULL REFERENCES products(id),
    quantity INTEGER NOT NULL,
    unit_price DECIMAL(10,2) NOT NULL
);

CREATE TABLE products (
    id SERIAL PRIMARY KEY,
    name VARCHAR(200) NOT NULL,
    category VARCHAR(50) NOT NULL,
    price DECIMAL(10,2) NOT NULL
);

CREATE TABLE reviews (
    id SERIAL PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customers(id),
    product_id INTEGER NOT NULL REFERENCES products(id),
    rating INTEGER CHECK (rating BETWEEN 1 AND 5),
    review_text TEXT,
    created_at TIMESTAMP DEFAULT NOW()
);
"""

Join Path Inference

When the LLM needs to connect two tables that are not directly related, it must find an intermediate path. For example, connecting products to customers requires going through order_items and orders.

Build a helper that computes join paths and include them in the prompt:

from collections import defaultdict, deque

class SchemaGraph:
    """Graph representation of table relationships for join path finding."""

    def __init__(self):
        self.edges = defaultdict(list)

    def add_relationship(self, table_a: str, col_a: str,
                         table_b: str, col_b: str):
        self.edges[table_a].append((table_b, col_a, col_b))
        self.edges[table_b].append((table_a, col_b, col_a))

    def find_join_path(self, start: str, end: str) -> list[dict]:
        """BFS to find shortest join path between two tables."""
        if start == end:
            return []
        visited = {start}
        queue = deque([(start, [])])

        while queue:
            current, path = queue.popleft()
            for neighbor, from_col, to_col in self.edges[current]:
                if neighbor == end:
                    return path + [{
                        "from_table": current,
                        "from_col": from_col,
                        "to_table": neighbor,
                        "to_col": to_col,
                    }]
                if neighbor not in visited:
                    visited.add(neighbor)
                    queue.append((neighbor, path + [{
                        "from_table": current,
                        "from_col": from_col,
                        "to_table": neighbor,
                        "to_col": to_col,
                    }]))
        return []

# Build graph from schema
graph = SchemaGraph()
graph.add_relationship("customers", "id", "orders", "customer_id")
graph.add_relationship("orders", "id", "order_items", "order_id")
graph.add_relationship("products", "id", "order_items", "product_id")
graph.add_relationship("customers", "id", "reviews", "customer_id")
graph.add_relationship("products", "id", "reviews", "product_id")

# Find path from products to customers
path = graph.find_join_path("products", "customers")
for step in path:
    print(f"JOIN {step['to_table']} ON {step['from_table']}.{step['from_col']} = {step['to_table']}.{step['to_col']}")

Output:

JOIN order_items ON products.id = order_items.product_id
JOIN orders ON order_items.order_id = orders.id
JOIN customers ON orders.customer_id = customers.id

Handling Ambiguous Column References

When multiple tables have columns with the same name, the LLM must use table aliases. Add disambiguation instructions to your prompt:

See AI Voice Agents Handle Real Calls

Book a free demo or calculate how much you can save with AI voice automation.

DISAMBIGUATION_RULES = """
Important disambiguation notes:
- Both orders and reviews have a 'created_at' column. Use o.created_at for
  order dates and r.created_at for review dates.
- Both order_items.unit_price and products.price exist. Use order_items.unit_price
  for actual transaction prices (may include discounts). Use products.price for
  current catalog price.
- Always use table aliases (c for customers, o for orders, oi for order_items,
  p for products, r for reviews).
"""

Query Planning with Chain of Thought

For complex multi-table questions, ask the LLM to plan before writing SQL. This decomposition step significantly improves accuracy.

PLANNING_PROMPT = """Given this question: "{question}"

Before writing SQL, plan your approach:
1. Which tables are needed?
2. What are the join conditions?
3. What aggregations are required?
4. What filters should be applied?
5. What should the result be sorted by?

Then write the SQL query.
"""

def text_to_sql_with_planning(question: str, schema: str) -> str:
    client = openai.OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": f"Schema:\n{schema}\n{DISAMBIGUATION_RULES}"},
            {"role": "user", "content": PLANNING_PROMPT.format(question=question)},
        ],
        temperature=0,
    )
    text = response.choices[0].message.content
    # Extract SQL from the response (after the planning section)
    if "SELECT" in text.upper():
        sql_start = text.upper().index("SELECT")
        sql = text[sql_start:].split("\n\n")[0].strip().rstrip(";") + ";"
        return sql
    return text

Common Multi-Table Pitfalls

Double counting. When joining a one-to-many relationship, aggregations get inflated. If a customer has 3 orders and 2 reviews, a naive JOIN produces 6 rows. Use DISTINCT or subqueries to avoid this.

Missing GROUP BY columns. LLMs sometimes include non-aggregated columns in SELECT without adding them to GROUP BY. This is valid in MySQL but fails in PostgreSQL and SQLite.

Wrong join type. "Customers who have NOT placed orders" requires LEFT JOIN ... WHERE orders.id IS NULL, not INNER JOIN. The word "not" signals an anti-join pattern.

FAQ

How do I handle many-to-many relationships?

Many-to-many relationships require joining through a junction table. Include the junction table in your schema and add a comment like "students_courses is a junction table linking students and courses (many-to-many)." This explicit annotation helps the LLM choose the correct join path instead of trying to join the two main tables directly.

Should I decompose complex questions into multiple simpler queries?

Yes, for questions requiring data from unrelated table groups. A question like "Show me the top 5 products and the total number of customers" involves independent aggregations. Running two separate queries and combining results is often more accurate than forcing everything into a single query with multiple subqueries.

How do I handle self-joins?

Self-joins (e.g., employee-manager relationships) are challenging for LLMs because the same table appears twice. Add an explicit note like "employees.manager_id references employees.id (self-referencing: each employee has a manager who is also an employee)." Without this, the LLM may create a separate JOIN against a non-existent "managers" table.


#MultiTableSQL #JOINs #TextToSQL #QueryPlanning #DatabaseRelationships #AgenticAI #SQLReasoning #ComplexQueries

Share this article
C

CallSphere Team

Expert insights on AI voice agents and customer communication automation.

Try CallSphere AI Voice Agents

See how AI voice agents work for your industry. Live demo available -- no signup required.