Are Large Language Models Good Evaluators for Abstractive Summarization?

05/22/2023
by Chenhui Shen, et al.

Human evaluation is often required to judge abstractive summaries fairly. However, it is time-consuming, costly, inconsistent, and non-reproducible. To overcome these challenges, we explore the potential of using an out-of-the-box LLM (i.e., "gpt-3.5-turbo") for summarization evaluation, without manually selected demonstrations or complex prompt tuning. We compare different evaluation methods, including two methods for Likert-scale scoring and one method for head-to-head comparison, to investigate the performance of the LLM as a zero-shot evaluator. We further propose a meta-correlation metric to measure the stability of the LLM's evaluation capability. With extensive experiments, we show that certain prompt formats produce better results than others. We also bring attention to the LLM's deteriorating evaluation capability as the quality of the evaluated summaries rises. In addition, we find that the LLM's evaluation capability depends on the dimension being evaluated. We discuss the pros and cons of each method, make recommendations, and suggest future directions for improvement.
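As a rough illustration of the zero-shot setup described above, the sketch below shows how Likert-scale scoring and head-to-head comparison prompts might be issued to "gpt-3.5-turbo". It is a minimal sketch assuming the OpenAI Python client (openai>=1.0); the prompt wording, the "coherence" dimension, and the likert_score/head_to_head helpers are illustrative assumptions, not the paper's exact prompts or methods.

```python
# Minimal sketch of zero-shot LLM-based summary evaluation, assuming the
# OpenAI Python client (openai>=1.0). Prompt wording is illustrative,
# not the paper's exact prompts.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def likert_score(article: str, summary: str, dimension: str = "coherence") -> str:
    """Ask the LLM to rate one summary on a 1-5 Likert scale for one dimension."""
    prompt = (
        f"Rate the {dimension} of the summary below on a scale of 1 (worst) "
        "to 5 (best). Reply with the number only.\n\n"
        f"Article:\n{article}\n\nSummary:\n{summary}\n\nScore:"
    )
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=0,  # reduce run-to-run variance in scores
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content.strip()


def head_to_head(article: str, summary_a: str, summary_b: str) -> str:
    """Ask the LLM which of two candidate summaries is better ('A' or 'B')."""
    prompt = (
        "Which summary better summarizes the article below? "
        "Reply with 'A' or 'B' only.\n\n"
        f"Article:\n{article}\n\nSummary A:\n{summary_a}\n\n"
        f"Summary B:\n{summary_b}\n\nAnswer:"
    )
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=0,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content.strip()
```

Scores elicited this way could then be compared against human judgments per dimension (e.g., via rank correlation), which is the kind of analysis a stability measure such as the proposed meta-correlation metric would presumably build on.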


Related research

01/31/2023 · Benchmarking Large Language Models for News Summarization
Large language models (LLMs) have shown promise for automatic summarizat...

11/15/2022 · Evaluating the Factual Consistency of Large Language Models Through Summarization
While large language models (LLMs) have proven to be effective on a larg...

06/01/2023 · Multi-Dimensional Evaluation of Text Summarization with In-Context Learning
Evaluation of natural language generation (NLG) is complex and multi-dim...

11/29/2022 · Zero-Shot Opinion Summarization with GPT-3
Very large language models such as GPT-3 have shown impressive performan...

05/03/2023 · TempoSum: Evaluating the Temporal Generalization of Abstractive Summarization
Recent pre-trained language models (PLMs) achieve promising results in e...

03/22/2022 · Towards Abstractive Grounded Summarization of Podcast Transcripts
Podcasts have recently shown a rapid rise in popularity. Summarization o...

07/20/2023 · Learning and Evaluating Human Preferences for Conversational Head Generation
A reliable and comprehensive evaluation metric that aligns with manual p...
