BLEU Neighbors: A Reference-less Approach to Automatic Evaluation

04/27/2020
by Kawin Ethayarajh, et al.

Evaluation is a bottleneck in the development of natural language generation (NLG) models. Automatic metrics such as BLEU rely on references, but for tasks such as open-ended generation, there are no references to draw upon. Although language diversity can be estimated using statistical measures such as perplexity, measuring language quality requires human evaluation. However, because human evaluation at scale is slow and expensive, it is used sparingly; it cannot be used to rapidly iterate on NLG models, in the way BLEU is used for machine translation. To this end, we propose BLEU Neighbors, a nearest neighbors model for estimating language quality by using the BLEU score as a kernel function. On existing datasets for chitchat dialogue and open-ended sentence generation, we find that – on average – the quality estimation from a BLEU Neighbors model has a lower mean squared error and higher Spearman correlation with the ground truth than individual human annotators. Despite its simplicity, BLEU Neighbors even outperforms state-of-the-art models on automatically grading essays, including models that have access to a gold-standard reference essay.
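The core idea — scoring a candidate sentence by the human quality ratings of its nearest neighbors, with sentence-level BLEU as the similarity kernel — can be sketched as follows. This is an illustrative reimplementation, not the authors' code: the smoothed BLEU variant, the toy data, and the function names (`sentence_bleu`, `bleu_neighbors`) are assumptions.

```python
# Hypothetical sketch of BLEU Neighbors: predict the quality of a candidate
# sentence as the mean human score of its k nearest neighbors, where
# "nearness" is a (smoothed) sentence-level BLEU score.
import math
from collections import Counter

def sentence_bleu(candidate, reference, max_n=2):
    """Smoothed sentence-level BLEU of `candidate` against one `reference`."""
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    log_prec = 0.0
    for n in range(1, max_n + 1):
        cand_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
        ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        overlap = sum(min(c, ref_ngrams[g]) for g, c in cand_ngrams.items())
        total = max(sum(cand_ngrams.values()), 1)
        # add-one smoothing so a single missing n-gram order does not zero the score
        log_prec += math.log((overlap + 1) / (total + 1)) / max_n
    brevity_penalty = min(1.0, math.exp(1 - len(ref) / len(cand)))
    return brevity_penalty * math.exp(log_prec)

def bleu_neighbors(candidate, scored_bank, k=3):
    """Estimate quality as the average human score of the top-k neighbors by BLEU."""
    ranked = sorted(scored_bank,
                    key=lambda pair: sentence_bleu(candidate, pair[0]),
                    reverse=True)
    top = ranked[:k]
    return sum(score for _, score in top) / len(top)

# Toy usage: a bank of (sentence, human quality score) pairs.
bank = [
    ("the cat sat on the mat", 5.0),
    ("colorless green ideas sleep furiously", 1.0),
    ("the dog sat on the rug", 4.0),
]
print(bleu_neighbors("the cat sat on the rug", bank, k=2))
```

Because the prediction is just a BLEU-weighted lookup into human-annotated examples, the method needs no references for the candidate itself — only a bank of previously scored sentences.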


research
04/13/2020

BLEU might be Guilty but References are not Innocent

The quality of automatic metrics for machine translation has been increa...
research
06/23/2015

deltaBLEU: A Discriminative Metric for Generation Tasks with Intrinsically Diverse Targets

We introduce Discriminative BLEU (deltaBLEU), a novel metric for intrins...
research
08/05/2017

Referenceless Quality Estimation for Natural Language Generation

Traditional automatic evaluation measures for natural language generatio...
research
04/30/2020

Explicit Representation of the Translation Space: Automatic Paraphrasing for Machine Translation Evaluation

Following previous work on automatic paraphrasing, we assess the feasibi...
research
10/14/2018

BLEU is Not Suitable for the Evaluation of Text Simplification

BLEU is widely considered to be an informative metric for text-to-text g...
research
04/04/2019

Unifying Human and Statistical Evaluation for Natural Language Generation

How can we measure whether a natural language generation system produces...
research
07/25/2018

"Bilingual Expert" Can Find Translation Errors

Recent advances in statistical machine translation via the adoption of n...
