DiscoScore: Evaluating Text Generation with BERT and Discourse Coherence

01/26/2022
by Wei Zhao, et al.

Recently, there has been growing interest in designing text generation systems from a discourse coherence perspective, e.g., by modeling the interdependence between sentences. Still, recent BERT-based evaluation metrics cannot recognize coherence and fail to punish incoherent elements in system outputs. In this work, we introduce DiscoScore, a parametrized discourse metric that uses BERT to model discourse coherence from different perspectives, driven by Centering theory. Our experiments encompass 16 non-discourse and discourse metrics, including DiscoScore and popular coherence models, evaluated on summarization and document-level machine translation (MT). We find that (i) the majority of BERT-based metrics correlate much worse with human-rated coherence than early discourse metrics invented a decade ago; and (ii) the recent state-of-the-art BARTScore is weak when operated at the system level, which is particularly problematic because systems are typically compared in this manner. DiscoScore, in contrast, achieves strong system-level correlation with human ratings, not only in coherence but also in factual consistency and other aspects, and surpasses BARTScore by over 10 correlation points on average. Further, to better understand DiscoScore, we justify the importance of discourse coherence for evaluation metrics and explain the superiority of one variant over another. Our code is available at <https://github.com/AIPHES/DiscoScore>.
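To make the notion of system-level correlation concrete, the sketch below shows one common way to compute it: average a metric's scores and the human ratings per system, then correlate the two lists of per-system means. This is an illustration under stated assumptions, not the DiscoScore code. The function names, the record layout, and the toy numbers are hypothetical, and Pearson correlation is used only for simplicity, since the abstract does not say which coefficient the authors report.

```python
from collections import defaultdict
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def system_level_correlation(records):
    """Correlate per-system mean metric scores with per-system mean human ratings.

    records: iterable of (system_name, metric_score, human_rating) tuples,
    one tuple per evaluated output (hypothetical layout, for illustration).
    """
    metric_by_system = defaultdict(list)
    human_by_system = defaultdict(list)
    for system, metric_score, human_rating in records:
        metric_by_system[system].append(metric_score)
        human_by_system[system].append(human_rating)

    # Each system contributes exactly one point: its mean score and mean rating.
    systems = sorted(metric_by_system)
    metric_means = [mean(metric_by_system[s]) for s in systems]
    human_means = [mean(human_by_system[s]) for s in systems]
    return pearson(metric_means, human_means)


# Made-up toy data: three systems with two outputs each.
records = [
    ("sys_a", 0.2, 1.0), ("sys_a", 0.4, 2.0),
    ("sys_b", 0.5, 3.0), ("sys_b", 0.7, 3.0),
    ("sys_c", 0.8, 4.0), ("sys_c", 0.6, 5.0),
]
print(system_level_correlation(records))
```

Because each system collapses to a single point, a metric can rank individual outputs reasonably well and still order whole systems poorly, which is the kind of weakness the abstract attributes to BARTScore at the system level.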


