
Text-to-SQL Evaluation: Spider, BIRD, and Custom Benchmarks for Accuracy Testing

Understand how to evaluate text-to-SQL systems using the Spider and BIRD benchmarks, implement execution accuracy metrics, and build custom evaluation datasets for your specific database schema.

Why Standard Evaluation Matters

Claiming your text-to-SQL system "works well" without rigorous evaluation is meaningless. Two systems that feel similar in casual testing might differ by 20% in accuracy on edge cases. Benchmarks give you objective measurements to compare models, track improvements, and identify weaknesses.

The text-to-SQL community has developed standardized benchmarks that test across hundreds of databases and thousands of question-query pairs. Understanding these benchmarks — and knowing when to build your own — is essential for production systems.

The Spider Benchmark

Spider is the most widely used text-to-SQL benchmark. It contains 10,181 questions across 200 databases covering 138 domains. Questions are categorized by difficulty: easy (single table, no aggregation), medium (joins, grouping), hard (subqueries, set operations), and extra hard (nested queries, multiple conditions).

Key characteristics:

  • Cross-database evaluation — the test set uses databases not seen during training
  • SQL complexity levels — from simple SELECT to multi-level nested queries
  • Multiple valid SQL representations — the same question might have several correct SQL formulations
# Example Spider dataset entry (simplified — the raw JSON also includes
# tokenized and parsed forms, and difficulty is assigned by the official
# evaluation script rather than stored in the data)
spider_example = {
    "db_id": "concert_singer",
    "question": "How many singers do we have?",
    "query": "SELECT count(*) FROM singer",
    "difficulty": "easy",
}

The BIRD Benchmark

BIRD (BIg Bench for LaRge-scale Database Grounded Text-to-SQL Evaluation) addresses limitations in Spider by using real-world databases with messy data, requiring external knowledge, and including value-based questions.

Key differences from Spider:

  • Dirty data — databases contain NULLs, inconsistent formats, and realistic noise
  • External knowledge — some questions require understanding domain conventions (e.g., "fiscal year starts in April")
  • Larger databases — tables with millions of rows where query efficiency matters
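For contrast with the Spider entry above, a BIRD-style example looks like this. The field names follow the published dev set, but the values here are invented for illustration:

```python
# Illustrative BIRD-style entry — values are invented; the "evidence"
# field carries the external knowledge a question depends on
bird_example = {
    "db_id": "retail_finance",
    "question": "What was total revenue in the last fiscal year?",
    "evidence": "The fiscal year starts in April.",
    "SQL": (
        "SELECT SUM(amount) FROM sales "
        "WHERE sale_date >= '2025-04-01' AND sale_date < '2026-04-01'"
    ),
    "difficulty": "moderate",
}
```

The evidence field is what makes BIRD harder: a model that does not incorporate it will anchor the date range to the calendar year and produce a plausible but wrong query.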

Evaluation Metrics

Exact Match Accuracy (EM) compares the predicted SQL to the reference SQL, either as normalized strings or (as in Spider's official "exact set match") as parsed query components. Either way it is too strict — SELECT name FROM users and SELECT users.name FROM users are both correct but do not match.
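A minimal illustration of why string comparison fails even with aggressive normalization:

```python
predicted = "SELECT users.name FROM users"
reference = "SELECT name FROM users"

# Lowercase and collapse whitespace — the qualified column reference
# still prevents a match, although both queries return identical rows
exact_match = (
    " ".join(predicted.lower().split()) == " ".join(reference.lower().split())
)
print(exact_match)  # False
```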

Execution Accuracy (EX) runs both the predicted and reference SQL against the database and compares results. This is the standard metric because it correctly handles multiple valid SQL formulations.

import sqlite3

def execution_accuracy(predicted_sql: str, reference_sql: str,
                       db_path: str) -> bool:
    """Check if predicted and reference SQL return the same results.

    Note: comparing as sets ignores row order and collapses duplicate
    rows — the usual convention for queries without ORDER BY.
    """
    conn = sqlite3.connect(db_path)

    try:
        pred_results = set(
            tuple(row) for row in conn.execute(predicted_sql).fetchall()
        )
        ref_results = set(
            tuple(row) for row in conn.execute(reference_sql).fetchall()
        )
        return pred_results == ref_results
    except Exception:
        return False
    finally:
        conn.close()

def evaluate_batch(test_cases: list[dict], model_fn, db_dir: str) -> dict:
    """Evaluate a text-to-SQL model on a batch of test cases."""
    results = {"total": 0, "correct": 0, "errors": 0, "by_difficulty": {}}

    for case in test_cases:
        results["total"] += 1
        db_path = f"{db_dir}/{case['db_id']}/{case['db_id']}.sqlite"

        try:
            predicted = model_fn(case["question"], db_path)
            is_correct = execution_accuracy(predicted, case["query"], db_path)

            if is_correct:
                results["correct"] += 1

            # Track by difficulty
            diff = case.get("difficulty", "unknown")
            if diff not in results["by_difficulty"]:
                results["by_difficulty"][diff] = {"total": 0, "correct": 0}
            results["by_difficulty"][diff]["total"] += 1
            if is_correct:
                results["by_difficulty"][diff]["correct"] += 1

        except Exception:
            results["errors"] += 1

    results["accuracy"] = results["correct"] / results["total"] if results["total"] > 0 else 0
    return results
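The harness above assumes the Spider directory convention ({db_dir}/{db_id}/{db_id}.sqlite). A minimal sketch of building a throwaway fixture and stub model to smoke-test it — all names here are hypothetical:

```python
import os
import sqlite3
import tempfile

# Build a tiny database in the layout the evaluator expects:
# {db_dir}/{db_id}/{db_id}.sqlite
db_dir = tempfile.mkdtemp()
db_id = "concert_singer"
os.makedirs(os.path.join(db_dir, db_id), exist_ok=True)
db_path = os.path.join(db_dir, db_id, f"{db_id}.sqlite")

conn = sqlite3.connect(db_path)
conn.execute("CREATE TABLE singer (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO singer (name) VALUES (?)", [("A",), ("B",)])
conn.commit()
conn.close()

def stub_model(question: str, db_path: str) -> str:
    """Stand-in for a real text-to-SQL model."""
    return "SELECT count(*) FROM singer"

test_cases = [{
    "db_id": db_id,
    "question": "How many singers do we have?",
    "query": "SELECT COUNT(*) FROM singer",
    "difficulty": "easy",
}]
# results = evaluate_batch(test_cases, stub_model, db_dir)
```

Running a deliberately trivial model through the full harness first catches path and schema mistakes before you spend money on real model calls.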

Building a Custom Evaluation Dataset

Standard benchmarks tell you how your model performs in general. But production accuracy depends on your specific schema, your users' question patterns, and your data characteristics. Build a custom evaluation set.

import json
import sqlite3
from dataclasses import dataclass, asdict

@dataclass
class EvalCase:
    question: str
    reference_sql: str
    difficulty: str  # easy, medium, hard
    category: str    # e.g., "aggregation", "join", "filter", "date_range"
    notes: str = ""  # Why this case is interesting

class EvalDatasetBuilder:
    """Build and manage a custom text-to-SQL evaluation dataset."""

    def __init__(self, db_path: str):
        self.db_path = db_path
        self.cases: list[EvalCase] = []

    def add_case(self, question: str, reference_sql: str,
                 difficulty: str, category: str, notes: str = ""):
        # Verify the reference SQL actually works
        conn = sqlite3.connect(self.db_path)
        try:
            conn.execute(reference_sql)
        except Exception as e:
            raise ValueError(
                f"Reference SQL is invalid: {e}\nSQL: {reference_sql}"
            ) from e
        finally:
            conn.close()

        self.cases.append(EvalCase(
            question=question,
            reference_sql=reference_sql,
            difficulty=difficulty,
            category=category,
            notes=notes,
        ))

    def save(self, path: str):
        with open(path, "w") as f:
            json.dump([asdict(c) for c in self.cases], f, indent=2)

    def load(self, path: str):
        with open(path) as f:
            self.cases = [EvalCase(**c) for c in json.load(f)]

    def summary(self) -> dict:
        from collections import Counter
        return {
            "total_cases": len(self.cases),
            "by_difficulty": dict(Counter(c.difficulty for c in self.cases)),
            "by_category": dict(Counter(c.category for c in self.cases)),
        }

# Build your dataset
builder = EvalDatasetBuilder("production_analytics.db")
builder.add_case(
    question="How many orders were placed in January 2026?",
    reference_sql="SELECT COUNT(*) FROM orders WHERE order_date >= '2026-01-01' AND order_date < '2026-02-01'",
    difficulty="easy",
    category="date_range",
    notes="Tests date range filtering with boundary conditions",
)
builder.add_case(
    question="Which product category has the highest average order value?",
    reference_sql="""
        SELECT p.category, AVG(oi.unit_price * oi.quantity) as avg_value
        FROM order_items oi
        JOIN products p ON oi.product_id = p.id
        GROUP BY p.category
        ORDER BY avg_value DESC
        LIMIT 1
    """,
    difficulty="medium",
    category="aggregation",
    notes="Requires JOIN and aggregation with sorting",
)
builder.save("eval_dataset.json")

For a production system, aim for at least 100 test cases with this distribution:

  • 30% easy — single-table filters, counts, simple aggregations
  • 40% medium — two-table JOINs, GROUP BY with HAVING, date ranges
  • 20% hard — three+ table JOINs, subqueries, window functions
  • 10% adversarial — ambiguous questions, questions with no valid answer, domain-specific terminology

Cover every table and relationship in your schema. Ensure you have cases for each common question pattern your users ask.
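One way to sanity-check schema coverage is to diff the tables in the database against the tables your reference queries actually touch. A rough sketch using word-boundary matching rather than a real SQL parser (aliases and quoted identifiers may slip through):

```python
import re
import sqlite3

def uncovered_tables(db_path: str, cases: list[dict]) -> set[str]:
    """Return schema tables never referenced by any case's reference SQL."""
    conn = sqlite3.connect(db_path)
    tables = {row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'"
    )}
    conn.close()

    covered = set()
    for case in cases:
        sql = case["reference_sql"].lower()
        covered |= {
            t for t in tables
            if re.search(rf"\b{re.escape(t.lower())}\b", sql)
        }
    return tables - covered
```

Run this against your saved eval dataset; any table it returns is a blind spot where the model's accuracy is simply unmeasured.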

FAQ

How often should I re-evaluate my text-to-SQL system?

Re-evaluate whenever you change the model, update the schema, modify the prompt, or add new tables. At minimum, run your evaluation suite weekly in CI/CD. Schema changes are the most common cause of accuracy regression — a renamed column can silently break queries the model previously got right.
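The weekly CI run can be turned into a hard gate by comparing against the last known-good score. A minimal sketch — the baseline and tolerance values are illustrative and should come from your own history:

```python
BASELINE_ACCURACY = 0.85  # last known-good score (illustrative)
TOLERANCE = 0.02          # allowed dip before the build fails

def check_regression(results: dict) -> bool:
    """Pass only if overall accuracy stays within tolerance of baseline.

    `results` is the dict shape produced by an evaluation harness like
    evaluate_batch above.
    """
    return results["accuracy"] >= BASELINE_ACCURACY - TOLERANCE
```

The tolerance absorbs run-to-run noise from nondeterministic models; without it, a 0.5% wobble blocks unrelated deploys.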

Is 80% accuracy good enough for production?

It depends on the use case. For exploratory analytics where users can verify results, 80% is workable with good error messaging. For automated reporting or dashboards where results are consumed without review, you need 95%+ accuracy. Most production systems use error correction loops to bridge this gap.
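An error correction loop of the kind mentioned above can be sketched as: execute the generated SQL, and on failure feed the database error back to the model for another attempt. The two-argument model interface here is hypothetical:

```python
import sqlite3

def generate_with_retry(question: str, db_path: str,
                        model_fn, max_attempts: int = 3):
    """Hypothetical error-correction loop: if the generated SQL fails to
    execute, pass the error message back to the model and retry."""
    error = None
    for _ in range(max_attempts):
        sql = model_fn(question, error)  # model sees the previous error, if any
        conn = sqlite3.connect(db_path)
        try:
            conn.execute(sql)
            return sql  # executes cleanly — accept it
        except Exception as e:
            error = str(e)
        finally:
            conn.close()
    return None  # give up after max_attempts
```

Note this only catches SQL that fails to execute; queries that run but answer the wrong question still require the evaluation discipline covered above.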

Can I use Spider or BIRD results to predict production accuracy?

Benchmark accuracy provides a ceiling estimate, not a prediction. Your production accuracy will typically be 5-15% lower than benchmark scores because real users ask messier questions, your schema has domain-specific quirks, and benchmark questions are carefully written to be unambiguous. Always supplement benchmarks with custom evaluation on your own data.


#Evaluation #SpiderBenchmark #BIRDBenchmark #TextToSQL #AccuracyTesting #AgenticAI #MLOps #BenchmarkDriven


CallSphere Team

Expert insights on AI voice agents and customer communication automation.
