The price of debiasing automatic metrics in natural language evaluation

by Arun Tejasvi Chaganty, et al.
Stanford University

For evaluating generation systems, automatic metrics such as BLEU cost nothing to run but have been shown to correlate poorly with human judgment, leading to systematic bias against certain model improvements. On the other hand, averaging human judgments, the unbiased gold standard, is often too expensive. In this paper, we use control variates to combine automatic metrics with human evaluation to obtain an unbiased estimator with lower cost than human evaluation alone. In practice, however, we obtain only a 7-13% cost reduction on evaluating summarization and open-response question answering systems. We then prove that our estimator is optimal: there is no unbiased estimator with lower cost. Our theory further highlights the two fundamental bottlenecks, the automatic metric and the prompt shown to human evaluators, both of which need to be improved to obtain greater cost savings.
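As a rough illustration of the control-variates idea mentioned in the abstract, the sketch below estimates a mean human score from a small human-labelled sample. It corrects that sample mean using an automatic metric whose mean is known, because the metric can be computed for free on every system output. This is a textbook control-variate estimator, not the authors' implementation; the function name `control_variate_mean` and all of its parameters are hypothetical. Estimating the scaling coefficient from the same sample, as done here for simplicity, introduces a small bias that shrinks as the sample grows.

```python
import statistics


def control_variate_mean(human_scores, metric_scores, metric_mean):
    """Estimate the mean human score with an automatic metric as control variate.

    human_scores  -- human judgments on a small labelled sample (hypothetical input)
    metric_scores -- automatic metric values on the same sample, in the same order
    metric_mean   -- mean of the automatic metric over all outputs, assumed known
                     because the metric costs nothing to compute
    """
    if len(human_scores) != len(metric_scores):
        raise ValueError("human_scores and metric_scores must be aligned")
    n = len(human_scores)
    if n < 2:
        raise ValueError("need at least two labelled examples")

    h_bar = statistics.fmean(human_scores)
    g_bar = statistics.fmean(metric_scores)

    # Sample covariance between human judgments and the automatic metric.
    cov = sum(
        (h - h_bar) * (g - g_bar)
        for h, g in zip(human_scores, metric_scores)
    ) / (n - 1)
    var_g = statistics.variance(metric_scores)

    # Scaling coefficient; fall back to the plain mean if the metric is constant.
    alpha = cov / var_g if var_g > 0 else 0.0

    # Shift the human mean by how far the sampled metric mean drifted from
    # its known mean, scaled by how strongly the two are related.
    return h_bar - alpha * (g_bar - metric_mean)
```

When the metric is uncorrelated with human judgment, the coefficient is zero and the estimate reduces to the plain human average. The stronger the correlation, the more variance the correction removes, which fits the abstract's point that the automatic metric is one of the bottlenecks on cost savings.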



NLG Evaluation Metrics Beyond Correlation Analysis: An Empirical Metric Preference Checklist

In this study, we analyze NLG automatic metrics based on whether human e...

Evaluating Dialogue Generation Systems via Response Selection

Existing automatic evaluation metrics for open-domain dialogue response ...

Evaluating and Characterizing Human Rationales

Two main approaches for evaluating the quality of machine-generated rati...

Dynamic Human Evaluation for Relative Model Comparisons

Collecting human judgements is currently the most reliable evaluation me...

InfoLM: A New Metric to Evaluate Summarization Data2Text Generation

Assessing the quality of natural language generation systems through hum...

WiSeBE: Window-based Sentence Boundary Evaluation

Sentence Boundary Detection (SBD) has been a major research topic since ...

Near-Negative Distinction: Giving a Second Life to Human Evaluation Datasets

Precisely assessing the progress in natural language generation (NLG) ta...
