Improving Text-to-SQL Evaluation Methodology

06/23/2018
by Catherine Finegan-Dollak, et al.

To be informative, an evaluation must measure how well systems generalize to realistic unseen data. We identify limitations of and propose improvements to current evaluations of text-to-SQL systems. First, we compare human-generated and automatically generated questions, characterizing properties of queries necessary for real-world applications. To facilitate evaluation on multiple datasets, we release standardized and improved versions of seven existing datasets and one new text-to-SQL dataset. Second, we show that the current division of data into training and test sets measures robustness to variations in the way questions are asked, but only partially tests how well systems generalize to new queries; therefore, we propose a complementary dataset split for evaluation of future work. Finally, we demonstrate how the common practice of anonymizing variables during evaluation removes an important challenge of the task. Our observations highlight key difficulties, and our methodology enables effective measurement of future development.
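The difference between the two kinds of split can be sketched in a few lines. This is a minimal illustration, not the authors' released code: it assumes the complementary split groups examples by SQL query template so that no template appears in both training and test data, while the conventional split assigns individual questions at random. The helper names (`query_template`, `question_split`, `query_split`, `template_overlap`), the regex-based masking of literal values, and the toy `EXAMPLES` data are all hypothetical.

```python
import random
import re
from collections import defaultdict

# Toy (question, SQL) pairs, invented for illustration only.
EXAMPLES = [
    ("what cities are in ohio", "SELECT name FROM city WHERE state = 'ohio'"),
    ("list the cities of texas", "SELECT name FROM city WHERE state = 'texas'"),
    ("how many rivers are there", "SELECT COUNT(*) FROM river"),
    ("count the rivers", "SELECT COUNT(*) FROM river"),
    ("rivers longer than 500", "SELECT name FROM river WHERE length > 500"),
    ("which rivers exceed 900", "SELECT name FROM river WHERE length > 900"),
    ("population of boston", "SELECT population FROM city WHERE name = 'boston'"),
    ("how many people live in austin", "SELECT population FROM city WHERE name = 'austin'"),
]


def query_template(sql):
    """Crude template: mask quoted strings and numbers, then normalize case and spacing."""
    masked = re.sub(r"'[^']*'|\"[^\"]*\"", "<val>", sql)
    masked = re.sub(r"\b\d+(\.\d+)?\b", "<val>", masked)
    return " ".join(masked.lower().split())


def question_split(examples, test_fraction=0.25, seed=0):
    """Assign individual examples at random; a template may land on both sides."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def query_split(examples, test_fraction=0.25, seed=0):
    """Assign whole template groups at random; no template is shared across sides."""
    groups = defaultdict(list)
    for question, sql in examples:
        groups[query_template(sql)].append((question, sql))
    templates = sorted(groups)
    random.Random(seed).shuffle(templates)
    n_test = max(1, int(len(templates) * test_fraction))
    test_templates = set(templates[:n_test])
    train = [ex for t in templates if t not in test_templates for ex in groups[t]]
    test = [ex for t in templates if t in test_templates for ex in groups[t]]
    return train, test


def template_overlap(train, test):
    """Return the set of query templates that occur in both train and test."""
    train_templates = {query_template(sql) for _, sql in train}
    test_templates = {query_template(sql) for _, sql in test}
    return train_templates & test_templates
```

Calling `template_overlap` on the output of `query_split` always yields an empty set by construction, whereas the output of `question_split` may share templates between the two sides. Under that reading, a system evaluated on the second kind of split has to produce query structures it never saw during training, not only handle new phrasings of familiar ones.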


