Investigating Crowdsourcing Protocols for Evaluating the Factual Consistency of Summaries

09/19/2021
by   Xiangru Tang, et al.

Current pre-trained models applied to summarization are prone to factual inconsistencies that either misrepresent the source text or introduce extraneous information. Comparing the factual consistency of summaries is therefore necessary as we develop improved models. However, the optimal human evaluation setup for factual consistency has not been standardized. To address this issue, we crowdsourced evaluations of factual consistency using the rating-based Likert scale and the ranking-based Best-Worst Scaling protocols, covering 100 articles from each of the CNN-Daily Mail and XSum datasets and four state-of-the-art models, to determine the most reliable evaluation framework. We find that ranking-based protocols offer a more reliable measure of summary quality across datasets, while the reliability of Likert ratings depends on the target dataset and the evaluation design. Our crowdsourcing templates and summary evaluations will be publicly available to facilitate future research on factual consistency in summarization.
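To make the contrast between the two protocols concrete, here is a minimal sketch of how crowdsourced judgments from each might be aggregated into per-system scores. The abstract does not describe the scoring procedure, so this is an assumption: Likert ratings are averaged per system, and Best-Worst Scaling uses the common counting estimate (times chosen best minus times chosen worst, divided by times shown). The function names, the input layouts, and the idea that annotators see all four system summaries at once are hypothetical.

```python
from collections import defaultdict


def likert_scores(ratings):
    """Average Likert ratings per system.

    `ratings` is a list of (system, rating) pairs (hypothetical layout).
    """
    by_system = defaultdict(list)
    for system, rating in ratings:
        by_system[system].append(rating)
    return {s: sum(v) / len(v) for s, v in by_system.items()}


def bws_scores(judgments):
    """Best-Worst Scaling counting score per system, in [-1, 1].

    `judgments` is a list of (shown_systems, best, worst) tuples
    (hypothetical layout). Score = (#best - #worst) / #shown.
    """
    best = defaultdict(int)
    worst = defaultdict(int)
    shown = defaultdict(int)
    for systems, b, w in judgments:
        for s in systems:
            shown[s] += 1
        best[b] += 1
        worst[w] += 1
    return {s: (best[s] - worst[s]) / shown[s] for s in shown}


if __name__ == "__main__":
    # Toy data for illustration only; not taken from the paper.
    systems = ("A", "B", "C", "D")
    print(likert_scores([("A", 5), ("A", 3), ("B", 2)]))
    print(bws_scores([(systems, "A", "D"), (systems, "A", "C")]))
```

A ranking-based score of this kind only uses relative judgments within a set of summaries, whereas the Likert average depends on how each annotator interprets the absolute scale.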

research
07/14/2023

Rank Your Summaries: Enhancing Bengali Text Summarization via Ranking-based Approach

With the increasing need for text summarization techniques that are both...
research
10/06/2020

Multi-Fact Correction in Abstractive Text Summarization

Pre-trained neural abstractive summarization systems have dominated extr...
research
10/08/2021

Evaluation of Summarization Systems across Gender, Age, and Race

Summarization systems are ultimately evaluated by human annotators and r...
research
04/11/2019

Crowdsourcing Lightweight Pyramids for Manual Summary Evaluation

Conducting a manual evaluation is considered an essential part of summar...
research
10/31/2022

Questioning the Validity of Summarization Datasets and Improving Their Factual Consistency

The topic of summarization evaluation has recently attracted a surge of ...
research
12/22/2021

Consistency and Coherence from Points of Contextual Similarity

Factual consistency is one of important summary evaluation dimensions, e...
