Learning to Compare for Better Training and Evaluation of Open Domain Natural Language Generation Models

02/12/2020
by   Wangchunshu Zhou, et al.

Automated evaluation of open domain natural language generation (NLG) models remains a challenge, and widely used metrics such as BLEU and perplexity can be misleading in some cases. In our paper, we propose to evaluate NLG models by learning to compare pairs of generated sentences with a fine-tuned BERT model, which has been shown to have strong natural language understanding ability. We also propose to evaluate the model-level quality of NLG models by aggregating sample-level comparison results with a skill rating system. While our model can be trained in a fully self-supervised fashion, it can be further fine-tuned with a small amount of human preference annotation to better imitate human judgment. In addition to evaluating trained models, we propose to apply our model as a performance indicator during training for better hyperparameter tuning and early stopping. We evaluate our approach on both story generation and chit-chat dialogue response generation. Experimental results show that our model correlates better with human preferences than previous automated evaluation approaches. Training with the proposed metric also yields better performance in human evaluation, further demonstrating the effectiveness of the proposed model.
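
The abstract describes two pieces: a BERT-based comparator over pairs of generated sentences, and a skill rating system that turns sample-level comparison outcomes into a model-level score. The following is a minimal sketch of those ideas, not the authors' released code: the checkpoint name, label convention, Elo-style update, and K-factor are illustrative assumptions.

```python
# Sketch of (1) a pairwise BERT comparator and (2) an Elo-style skill-rating
# update over sample-level comparison outcomes. Details are assumptions, not
# values taken from the paper.
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
# Assumed label convention: 0 = first sentence preferred, 1 = second preferred.
comparator = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

def compare(sentence_a: str, sentence_b: str) -> int:
    """Return 0 if sentence_a is judged better, 1 otherwise."""
    inputs = tokenizer(sentence_a, sentence_b, return_tensors="pt",
                       truncation=True)
    with torch.no_grad():
        logits = comparator(**inputs).logits
    return int(logits.argmax(dim=-1).item())

def elo_update(rating_a, rating_b, a_wins, k=16.0):
    """One Elo-style rating update after a single pairwise comparison."""
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    score_a = 1.0 if a_wins else 0.0
    rating_a += k * (score_a - expected_a)
    rating_b += k * ((1.0 - score_a) - (1.0 - expected_a))
    return rating_a, rating_b

# Usage: rate two NLG models by comparing their outputs on shared prompts.
ratings = {"model_a": 1000.0, "model_b": 1000.0}
for out_a, out_b in [("Once upon a time, a fox lived in the woods.",
                      "There was once a fox who lived in the woods.")]:
    winner_is_a = compare(out_a, out_b) == 0
    ratings["model_a"], ratings["model_b"] = elo_update(
        ratings["model_a"], ratings["model_b"], winner_is_a
    )
```

In practice the comparator would first be fine-tuned on comparison pairs (self-supervised, optionally followed by a small amount of human preference data, as the abstract notes) before its judgments are fed into the rating loop.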

