Revisiting the Gold Standard: Grounding Summarization Evaluation with Robust Human Evaluation

by   Yixin Liu, et al.

Human evaluation is the foundation upon which the evaluation of both summarization systems and automatic metrics rests. However, existing human evaluation protocols and benchmarks for summarization either exhibit low inter-annotator agreement or lack the scale needed to draw statistically significant conclusions, and an in-depth analysis of human evaluation is lacking. In this work, we address the shortcomings of existing summarization evaluation along the following axes: 1) We propose a modified summarization salience protocol, Atomic Content Units (ACUs), which relies on fine-grained semantic units and allows for high inter-annotator agreement. 2) We curate the Robust Summarization Evaluation (RoSE) benchmark, a large human evaluation dataset consisting of over 22k summary-level annotations over state-of-the-art systems on three datasets. 3) We compare our ACU protocol with three other human evaluation protocols, underscoring potential confounding factors in evaluation setups. 4) We evaluate existing automatic metrics using the collected human annotations across evaluation protocols and demonstrate how our benchmark leads to more statistically stable and significant results. Furthermore, our findings have important implications for evaluating large language models (LLMs), as we show that LLMs adjusted by human feedback (e.g., GPT-3.5) may overfit unconstrained human evaluation, which is affected by the annotators' prior, input-agnostic preferences, calling for more robust, targeted evaluation methods.


page 1

page 2

page 3

page 4


SummEval: Re-evaluating Summarization Evaluation

The scarcity of comprehensive up-to-date studies on evaluation metrics f...

Analyzing Influential Factors in Human Preference Judgments via GPT-4

Pairwise human judgments are pivotal in guiding large language models (L...

Towards Interpretable and Efficient Automatic Reference-Based Summarization Evaluation

Interpretability and efficiency are two important considerations for the...

LLMs as Factual Reasoners: Insights from Existing Benchmarks and Beyond

With the recent appearance of LLMs in practical settings, having methods...

Summarization from Leaderboards to Practice: Choosing A Representation Backbone and Ensuring Robustness

Academic literature does not give much guidance on how to build the best...

ESBM: An Entity Summarization BenchMark

Entity summarization is the problem of computing an optimal compact summ...

LongEval: Guidelines for Human Evaluation of Faithfulness in Long-form Summarization

While human evaluation remains best practice for accurately judging the ...

Please sign up or login with your details

Forgot password? Click here to reset