On the Limitations of Reference-Free Evaluations of Generated Text

10/22/2022
by Daniel Deutsch, et al.

There is significant interest in developing evaluation metrics that accurately estimate the quality of generated text without the aid of a human-written reference text, which can be time-consuming and expensive to collect, or entirely unavailable in online applications. In this work, however, we demonstrate that these reference-free metrics are inherently biased and limited in their ability to evaluate generated text, and we argue that they should not be used to measure progress on tasks like machine translation or summarization. We show that a reference-free metric is equivalent to using one generation model to evaluate another, which has several limitations: (1) the metric can be optimized at test time to find the approximate best-possible output, (2) it is inherently biased toward models that are similar to the one underlying the metric, and (3) it can be biased against higher-quality outputs, including those written by humans. We therefore recommend that reference-free metrics be used as diagnostic tools for analyzing and understanding model behavior rather than as measures of how well models perform a task, in which the goal is to achieve as high a score as possible.
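Limitation (1) can be made concrete with a minimal sketch: if the reference-free metric itself defines quality, a system can simply search against the metric at test time (e.g. best-of-n selection over sampled candidates). The `score` function below is a hypothetical placeholder standing in for any reference-free metric; real metrics are model-based, but the search procedure is the same.

```python
def score(output: str) -> float:
    # Hypothetical stand-in for a reference-free metric: here it just
    # rewards longer outputs. Real metrics would call a learned model.
    return float(len(output.split()))

def metric_guided_select(candidates: list[str]) -> str:
    """Best-of-n 'decoding': return the candidate the metric scores highest.

    Any system that can sample many candidates can climb the metric this
    way, regardless of whether the chosen output is actually better.
    """
    return max(candidates, key=score)

candidates = [
    "The cat sat.",
    "The cat sat on the mat.",
    "The cat sat on the mat near the door.",
]
best = metric_guided_select(candidates)
```

Because the selection step only ever consults the metric, whatever bias the metric has (points 2 and 3 above) is inherited directly by the "optimized" output.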


Related research

04/21/2022: Spurious Correlations in Reference-Free Evaluation of Text Generation
Model-based, reference-free evaluation metrics have been proposed as a f...

02/17/2022: Revisiting the Evaluation Metrics of Paraphrase Generation
Paraphrase generation is an important NLP task that has achieved signifi...

12/20/2022: DocAsRef: A Pilot Empirical Study on Repurposing Reference-Based Summary Quality Metrics Reference-Freely
Summary quality assessment metrics have two categories: reference-based ...

10/12/2018: Pre-gen metrics: Predicting caption quality metrics without generating captions
Image caption generation systems are typically evaluated against referen...

09/14/2023: Measuring the Quality of Text-to-Video Model Outputs: Metrics and Dataset
Evaluating the quality of videos generated from text-to-video (T2V) mode...

07/02/2022: FRAME: Evaluating Simulatability Metrics for Free-Text Rationales
Free-text rationales aim to explain neural language model (LM) behavior ...

09/14/2023: Aligning Speakers: Evaluating and Visualizing Text-based Diarization Using Efficient Multiple Sequence Alignment (Extended Version)
This paper presents a novel evaluation approach to text-based speaker di...
