Evaluating Machines by their Real-World Language Use

04/07/2020
by Rowan Zellers, et al.

There is a fundamental gap between how humans understand and use language – in open-ended, real-world situations – and today's NLP benchmarks for language understanding. To narrow this gap, we propose to evaluate machines by their success at real-world language use – which greatly expands the scope of language tasks that can be measured and studied. We introduce TuringAdvice, a new challenge for language understanding systems. Given a complex situation faced by a real person, a machine must generate helpful advice. We make our challenge concrete by introducing RedditAdvice, a dataset and leaderboard for measuring progress. Though we release a training set with 600k examples, our evaluation is dynamic, continually evolving with the language people use: models must generate helpful advice for recently-written situations. Empirical results show that today's models struggle at our task, even those with billions of parameters. The best model, a finetuned T5, writes advice that is at least as helpful as human-written advice in only 9% of cases. This low performance reveals language understanding errors that are hard to spot outside of a generative setting, showing much room for progress.
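
To make the task format concrete, the sketch below shows how a finetuned T5 might be prompted to generate advice for a newly written situation. It is a minimal illustration, not the paper's pipeline: it assumes the HuggingFace transformers API, uses the stock t5-large checkpoint as a stand-in for a model finetuned on RedditAdvice, and the "advice:" prompt prefix and the example situation are illustrative assumptions.

```python
# Minimal sketch of the TuringAdvice-style generation task: given a situation
# written by a real person, a sequence-to-sequence model must produce free-form
# advice. Checkpoint name and prompt format are placeholders, not details from
# the paper.
from transformers import T5ForConditionalGeneration, T5Tokenizer

MODEL_NAME = "t5-large"  # stand-in: in practice, a T5 finetuned on RedditAdvice

tokenizer = T5Tokenizer.from_pretrained(MODEL_NAME)
model = T5ForConditionalGeneration.from_pretrained(MODEL_NAME)

situation = (
    "My roommate keeps borrowing my car without asking and returns it with an "
    "empty tank. How do I bring this up without ruining our friendship?"
)

# Frame the situation as a text-to-text input and sample a piece of advice.
inputs = tokenizer("advice: " + situation, return_tensors="pt", truncation=True)
output_ids = model.generate(
    **inputs,
    max_length=256,
    do_sample=True,  # open-ended advice benefits from sampling over greedy decoding
    top_p=0.95,
)
advice = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print(advice)
```

Under the paper's dynamic evaluation, output like this would be judged by people against human-written advice for the same recently-posted situation, rather than scored against a fixed reference.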


research
09/10/2021

Tiered Reasoning for Intuitive Physics: Toward Verifiable Commonsense Language Understanding

Large-scale, pre-trained language models (LMs) have achieved human-level...
research
03/11/2022

CoDA21: Evaluating Language Understanding Capabilities of NLP Models With Context-Definition Alignment

Pretrained language models (PLMs) have achieved superhuman performance o...
research
05/02/2019

SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems

In the last year, new models and methods for pretraining and transfer le...
research
12/12/2019

Extending Machine Language Models toward Human-Level Language Understanding

Language is central to human intelligence. We review recent breakthrough...
research
04/14/2023

Dialogue Games for Benchmarking Language Understanding: Motivation, Taxonomy, Strategy

How does one measure "ability to understand language"? If it is a person...
research
09/10/2021

Beyond the Tip of the Iceberg: Assessing Coherence of Text Classifiers

As large-scale, pre-trained language models achieve human-level and supe...
research
08/21/2019

Are We Modeling the Task or the Annotator? An Investigation of Annotator Bias in Natural Language Understanding Datasets

Crowdsourcing has been the prevalent paradigm for creating natural langu...
