Evaluating Machines by their Real-World Language Use

04/07/2020
by Rowan Zellers, et al.

There is a fundamental gap between how humans understand and use language – in open-ended, real-world situations – and today's NLP benchmarks for language understanding. To narrow this gap, we propose to evaluate machines by their success at real-world language use – which greatly expands the scope of language tasks that can be measured and studied. We introduce TuringAdvice, a new challenge for language understanding systems. Given a complex situation faced by a real person, a machine must generate helpful advice. We make our challenge concrete by introducing RedditAdvice, a dataset and leaderboard for measuring progress. Though we release a training set with 600k examples, our evaluation is dynamic, continually evolving with the language people use: models must generate helpful advice for recently-written situations. Empirical results show that today's models struggle at our task, even those with billions of parameters. The best model, a finetuned T5, writes advice that is at least as helpful as human-written advice in only 9% of cases. This low performance reveals language understanding errors that are hard to spot outside of a generative setting, showing much room for progress.
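
To make the task format concrete, the sketch below shows how a finetuned T5 might be prompted to generate advice for a newly written situation. It is a minimal illustration, not the paper's pipeline: it assumes the HuggingFace transformers API, uses the stock t5-large checkpoint as a stand-in for a model finetuned on RedditAdvice, and the "advice:" prompt prefix and the example situation are illustrative assumptions.

```python
# Minimal sketch of the TuringAdvice-style generation task: given a situation
# written by a real person, a sequence-to-sequence model must produce free-form
# advice. Checkpoint name and prompt format are placeholders, not details from
# the paper.
from transformers import T5ForConditionalGeneration, T5Tokenizer

MODEL_NAME = "t5-large"  # stand-in: in practice, a T5 finetuned on RedditAdvice

tokenizer = T5Tokenizer.from_pretrained(MODEL_NAME)
model = T5ForConditionalGeneration.from_pretrained(MODEL_NAME)

situation = (
    "My roommate keeps borrowing my car without asking and returns it with an "
    "empty tank. How do I bring this up without ruining our friendship?"
)

# Frame the situation as a text-to-text input and sample a piece of advice.
inputs = tokenizer("advice: " + situation, return_tensors="pt", truncation=True)
output_ids = model.generate(
    **inputs,
    max_length=256,
    do_sample=True,  # open-ended advice benefits from sampling over greedy decoding
    top_p=0.95,
)
advice = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print(advice)
```

Under the paper's dynamic evaluation, output like this would be judged by people against human-written advice for the same recently-posted situation, rather than scored against a fixed reference.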


research
09/10/2021

Tiered Reasoning for Intuitive Physics: Toward Verifiable Commonsense Language Understanding

Large-scale, pre-trained language models (LMs) have achieved human-level...
research
03/11/2022

CoDA21: Evaluating Language Understanding Capabilities of NLP Models With Context-Definition Alignment

Pretrained language models (PLMs) have achieved superhuman performance o...
research
05/02/2019

SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems

In the last year, new models and methods for pretraining and transfer le...
research
12/12/2019

Extending Machine Language Models toward Human-Level Language Understanding

Language is central to human intelligence. We review recent breakthrough...
research
04/14/2023

Dialogue Games for Benchmarking Language Understanding: Motivation, Taxonomy, Strategy

How does one measure "ability to understand language"? If it is a person...
research
09/10/2021

Beyond the Tip of the Iceberg: Assessing Coherence of Text Classifiers

As large-scale, pre-trained language models achieve human-level and supe...
research
08/21/2019

Are We Modeling the Task or the Annotator? An Investigation of Annotator Bias in Natural Language Understanding Datasets

Crowdsourcing has been the prevalent paradigm for creating natural langu...
