Text-to-SQL Evaluation: Spider, BIRD, and Custom Benchmarks for Accuracy Testing
Understand how to evaluate text-to-SQL systems using the Spider and BIRD benchmarks, implement execution accuracy metrics, and build custom evaluation datasets for your specific database schema.
Why Standard Evaluation Matters
Claiming your text-to-SQL system "works well" without rigorous evaluation is meaningless. Two systems that feel similar in casual testing might differ by 20% in accuracy on edge cases. Benchmarks give you objective measurements to compare models, track improvements, and identify weaknesses.
The text-to-SQL community has developed standardized benchmarks that test across hundreds of databases and thousands of question-query pairs. Understanding these benchmarks — and knowing when to build your own — is essential for production systems.
The Spider Benchmark
Spider is the most widely used text-to-SQL benchmark. It contains 10,181 questions across 200 databases covering 138 domains. Questions are categorized by difficulty: easy (single table, no aggregation), medium (joins, grouping), hard (subqueries, set operations), and extra hard (nested queries, multiple conditions).
Key characteristics:
- Cross-database evaluation — the test set uses databases not seen during training
- SQL complexity levels — from simple SELECT to multi-level nested queries
- Multiple valid SQL representations — the same question might have several correct SQL formulations
```python
# Example Spider dataset entry
spider_example = {
    "db_id": "concert_singer",
    "question": "How many singers do we have?",
    "query": "SELECT count(*) FROM singer",
    "difficulty": "easy",
}
```
The BIRD Benchmark
BIRD (BIg Bench for LaRge-scale Database Grounded Text-to-SQL Evaluation) addresses limitations in Spider by using real-world databases with messy data, requiring external knowledge, and including value-based questions.
Key differences from Spider:
- Dirty data — databases contain NULLs, inconsistent formats, and realistic noise
- External knowledge — some questions require understanding domain conventions (e.g., "fiscal year starts in April")
- Larger databases — tables with millions of rows where query efficiency matters
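A BIRD-style entry looks similar to a Spider entry but adds an external knowledge hint. The example below is illustrative (the `retail_sales` schema and values are made up; the field names follow the published dataset, with `evidence` carrying the domain knowledge):

```python
# Illustrative BIRD-style entry; the "evidence" field supplies the external
# knowledge a model needs to write the correct query.
bird_example = {
    "db_id": "retail_sales",
    "question": "What was total revenue in fiscal year 2023?",
    "evidence": "Fiscal year starts in April; revenue = unit_price * quantity",
    "SQL": (
        "SELECT SUM(oi.unit_price * oi.quantity) "
        "FROM order_items oi JOIN orders o ON oi.order_id = o.id "
        "WHERE o.order_date >= '2023-04-01' AND o.order_date < '2024-04-01'"
    ),
}
```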
Evaluation Metrics
Exact Match Accuracy (EM) compares the predicted SQL string to the reference SQL. This is too strict — SELECT name FROM users and SELECT users.name FROM users are both correct but do not match.
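A two-line illustration of the problem (the query strings are hypothetical):

```python
# Two semantically identical queries that fail a string-level exact match.
predicted = "SELECT users.name FROM users"
reference = "SELECT name FROM users"

print(predicted == reference)  # False: EM penalizes a correct query
```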
Execution Accuracy (EX) runs both the predicted and reference SQL against the database and compares results. This is the standard metric because it correctly handles multiple valid SQL formulations.
```python
import sqlite3


def execution_accuracy(predicted_sql: str, reference_sql: str,
                       db_path: str) -> bool:
    """Check if predicted and reference SQL return the same results.

    Results are compared as sets of rows, so row order and duplicates
    are ignored.
    """
    conn = sqlite3.connect(db_path)
    try:
        pred_results = set(
            tuple(row) for row in conn.execute(predicted_sql).fetchall()
        )
        ref_results = set(
            tuple(row) for row in conn.execute(reference_sql).fetchall()
        )
        return pred_results == ref_results
    except Exception:
        return False
    finally:
        conn.close()
```
```python
def evaluate_batch(test_cases: list[dict], model_fn, db_dir: str) -> dict:
    """Evaluate a text-to-SQL model on a batch of test cases."""
    results = {"total": 0, "correct": 0, "errors": 0, "by_difficulty": {}}

    for case in test_cases:
        results["total"] += 1
        db_path = f"{db_dir}/{case['db_id']}/{case['db_id']}.sqlite"
        try:
            predicted = model_fn(case["question"], db_path)
            is_correct = execution_accuracy(predicted, case["query"], db_path)
            if is_correct:
                results["correct"] += 1

            # Track by difficulty
            diff = case.get("difficulty", "unknown")
            if diff not in results["by_difficulty"]:
                results["by_difficulty"][diff] = {"total": 0, "correct": 0}
            results["by_difficulty"][diff]["total"] += 1
            if is_correct:
                results["by_difficulty"][diff]["correct"] += 1
        except Exception:
            results["errors"] += 1

    results["accuracy"] = (
        results["correct"] / results["total"] if results["total"] > 0 else 0
    )
    return results
```
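To see execution accuracy behave as intended, here is a minimal self-contained demo. The tiny `singer` table is fabricated for illustration, and the function body repeats the logic above so the snippet runs on its own:

```python
import os
import sqlite3
import tempfile


def execution_accuracy(predicted_sql: str, reference_sql: str,
                       db_path: str) -> bool:
    """Same logic as above: compare the result sets of the two queries."""
    conn = sqlite3.connect(db_path)
    try:
        pred = set(tuple(r) for r in conn.execute(predicted_sql).fetchall())
        ref = set(tuple(r) for r in conn.execute(reference_sql).fetchall())
        return pred == ref
    except Exception:
        return False
    finally:
        conn.close()


# Throwaway database with a tiny singer table.
db_path = os.path.join(tempfile.mkdtemp(), "demo.sqlite")
conn = sqlite3.connect(db_path)
conn.execute("CREATE TABLE singer (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO singer VALUES (?, ?)", [(1, "Ana"), (2, "Bo")])
conn.commit()
conn.close()

# Different formulations, same results: counts as correct.
print(execution_accuracy("SELECT count(*) FROM singer",
                         "SELECT COUNT(id) FROM singer", db_path))  # True
# A wrong filter returns different results: counts as incorrect.
print(execution_accuracy("SELECT count(*) FROM singer WHERE id > 1",
                         "SELECT COUNT(id) FROM singer", db_path))  # False
```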
Building a Custom Evaluation Dataset
Standard benchmarks tell you how your model performs in general. But production accuracy depends on your specific schema, your users' question patterns, and your data characteristics. Build a custom evaluation set.
```python
import json
import sqlite3
from dataclasses import dataclass, asdict


@dataclass
class EvalCase:
    question: str
    reference_sql: str
    difficulty: str  # easy, medium, hard
    category: str    # e.g., "aggregation", "join", "filter", "date_range"
    notes: str = ""  # Why this case is interesting


class EvalDatasetBuilder:
    """Build and manage a custom text-to-SQL evaluation dataset."""

    def __init__(self, db_path: str):
        self.db_path = db_path
        self.cases: list[EvalCase] = []

    def add_case(self, question: str, reference_sql: str,
                 difficulty: str, category: str, notes: str = ""):
        # Verify the reference SQL actually works
        conn = sqlite3.connect(self.db_path)
        try:
            conn.execute(reference_sql)
        except Exception as e:
            raise ValueError(
                f"Reference SQL is invalid: {e}\nSQL: {reference_sql}"
            )
        finally:
            conn.close()

        self.cases.append(EvalCase(
            question=question,
            reference_sql=reference_sql,
            difficulty=difficulty,
            category=category,
            notes=notes,
        ))

    def save(self, path: str):
        with open(path, "w") as f:
            json.dump([asdict(c) for c in self.cases], f, indent=2)

    def load(self, path: str):
        with open(path) as f:
            self.cases = [EvalCase(**c) for c in json.load(f)]

    def summary(self) -> dict:
        from collections import Counter
        return {
            "total_cases": len(self.cases),
            "by_difficulty": dict(Counter(c.difficulty for c in self.cases)),
            "by_category": dict(Counter(c.category for c in self.cases)),
        }
```
```python
# Build your dataset
builder = EvalDatasetBuilder("production_analytics.db")

builder.add_case(
    question="How many orders were placed in January 2026?",
    reference_sql=(
        "SELECT COUNT(*) FROM orders "
        "WHERE order_date >= '2026-01-01' AND order_date < '2026-02-01'"
    ),
    difficulty="easy",
    category="date_range",
    notes="Tests date range filtering with boundary conditions",
)

builder.add_case(
    question="Which product category has the highest average order value?",
    reference_sql="""
        SELECT p.category, AVG(oi.unit_price * oi.quantity) AS avg_value
        FROM order_items oi
        JOIN products p ON oi.product_id = p.id
        GROUP BY p.category
        ORDER BY avg_value DESC
        LIMIT 1
    """,
    difficulty="medium",
    category="aggregation",
    notes="Requires JOIN and aggregation with sorting",
)

builder.save("eval_dataset.json")
```
Recommended Evaluation Set Composition
For a production system, aim for at least 100 test cases with this distribution:
- 30% easy — single-table filters, counts, simple aggregations
- 40% medium — two-table JOINs, GROUP BY with HAVING, date ranges
- 20% hard — three+ table JOINs, subqueries, window functions
- 10% adversarial — ambiguous questions, questions with no valid answer, domain-specific terminology
Cover every table and relationship in your schema. Ensure you have cases for each common question pattern your users ask.
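One way to enforce that coverage is a quick automated check. The sketch below lists tables from `sqlite_master` and flags any that no reference query mentions; it uses naive word matching rather than a real SQL parser, and the two-table demo schema is made up:

```python
import os
import re
import sqlite3
import tempfile


def uncovered_tables(db_path: str, reference_sqls: list[str]) -> list[str]:
    """Return tables that no reference SQL mentions (naive word matching)."""
    conn = sqlite3.connect(db_path)
    tables = [r[0] for r in conn.execute(
        "SELECT name FROM sqlite_master WHERE type='table'")]
    conn.close()
    return [t for t in tables
            if not any(re.search(rf"\b{re.escape(t)}\b", s, re.IGNORECASE)
                       for s in reference_sqls)]


# Demo with a throwaway two-table schema.
db_path = os.path.join(tempfile.mkdtemp(), "demo.sqlite")
conn = sqlite3.connect(db_path)
conn.execute("CREATE TABLE orders (id INTEGER)")
conn.execute("CREATE TABLE refunds (id INTEGER)")
conn.commit()
conn.close()

print(uncovered_tables(db_path, ["SELECT COUNT(*) FROM orders"]))
# ['refunds'] — no eval case touches the refunds table
```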
FAQ
How often should I re-evaluate my text-to-SQL system?
Re-evaluate whenever you change the model, update the schema, modify the prompt, or add new tables. At minimum, run your evaluation suite weekly in CI/CD. Schema changes are the most common cause of accuracy regression — a renamed column can silently break queries the model previously got right.
Is 80% accuracy good enough for production?
It depends on the use case. For exploratory analytics where users can verify results, 80% is workable with good error messaging. For automated reporting or dashboards where results are consumed without review, you need 95%+ accuracy. Most production systems use error correction loops to bridge this gap.
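An error correction loop can be sketched roughly as follows, assuming a `model_fn` that accepts the previous execution error as feedback; both `model_fn`'s signature and the stub model in the demo are hypothetical stand-ins for your actual generator:

```python
import os
import sqlite3
import tempfile


def generate_with_retry(question, db_path, model_fn, max_attempts=3):
    """Regenerate SQL with error feedback until it executes cleanly."""
    error = None
    sql = ""
    for _ in range(max_attempts):
        sql = model_fn(question, db_path, error)
        conn = sqlite3.connect(db_path)
        try:
            conn.execute(sql)
            return sql
        except Exception as e:
            error = str(e)  # fed back to the model on the next attempt
        finally:
            conn.close()
    return sql  # last attempt, even if it still fails


# Demo: a stub model that fixes its typo after seeing the error message.
db_path = os.path.join(tempfile.mkdtemp(), "demo.sqlite")
conn = sqlite3.connect(db_path)
conn.execute("CREATE TABLE orders (id INTEGER)")
conn.commit()
conn.close()


def stub_model(question, db_path, error):
    return "SELECT * FROM orders" if error else "SELECT * FROM oders"


print(generate_with_retry("show orders", db_path, stub_model))
# SELECT * FROM orders
```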
Can I use Spider or BIRD results to predict production accuracy?
Benchmark accuracy provides a ceiling estimate, not a prediction. Your production accuracy will typically be 5-15% lower than benchmark scores because real users ask messier questions, your schema has domain-specific quirks, and benchmark questions are carefully written to be unambiguous. Always supplement benchmarks with custom evaluation on your own data.
#Evaluation #SpiderBenchmark #BIRDBenchmark #TextToSQL #AccuracyTesting #AgenticAI #MLOps #BenchmarkDriven
CallSphere Team
Expert insights on AI voice agents and customer communication automation.