Automatic Construction of Evaluation Suites for Natural Language Generation Datasets

06/16/2021
by Simon Mille, et al.

Machine learning approaches applied to NLP are often evaluated by summarizing their performance in a single number, for example, accuracy. Since most test sets are constructed as an i.i.d. sample from the overall data, this approach oversimplifies the complexity of language and encourages overfitting to the head of the data distribution. As a result, rare language phenomena and text about underrepresented groups are not equally included in the evaluation. To encourage more in-depth model analyses, researchers have proposed the use of multiple test sets, also called challenge sets, that assess specific capabilities of a model. In this paper, we develop a framework based on this idea that can generate controlled perturbations and identify subsets in text-to-scalar, text-to-text, or data-to-text settings. By applying this framework to the GEM generation benchmark, we propose an evaluation suite comprising 80 challenge sets, demonstrate the kinds of analyses that it enables, and shed light on the limits of current generation models.
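The two mechanisms the abstract names, controlled perturbations and subset identification, can be illustrated with a minimal sketch. The function names, example data, and the specific perturbation (randomizing digits) and subset criterion (input length) below are hypothetical, chosen only to show the shape of the idea; they are not the paper's actual API or challenge sets.

```python
import random
import re


def perturb_numbers(text):
    """Controlled perturbation: replace every digit with a random digit,
    keeping length and all non-digit characters intact, so a model's
    handling of numbers can be probed in isolation."""
    return re.sub(r"\d", lambda m: str(random.randint(0, 9)), text)


def subset_by_length(examples, max_tokens):
    """Subset identification: keep only examples whose input is at most
    max_tokens whitespace tokens, isolating one slice of the data
    distribution for separate evaluation."""
    return [ex for ex in examples if len(ex["input"].split()) <= max_tokens]


examples = [
    {"input": "The match ended 3 to 1 after 90 minutes."},
    {"input": "A very long multi-sentence description " * 20},
]

perturbed = perturb_numbers(examples[0]["input"])
short = subset_by_length(examples, max_tokens=12)  # keeps only the first example
```

A full suite would pair many such transformations and filters with the original test sets, which is how a single benchmark can yield dozens of challenge sets.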


