What are the best systems? New perspectives on NLP Benchmarking

02/08/2022
by Pierre Colombo, et al.

In machine learning, a benchmark is a collection of datasets associated with one or more metrics, together with a procedure for aggregating the performances of different systems. Benchmarks are instrumental in (i) assessing the progress of new methods along different axes and (ii) selecting the best systems for practical use. This is particularly true in NLP, where large pre-trained models (e.g. GPT, BERT) are expected to generalize well across a variety of tasks. While the community has mainly focused on developing new datasets and metrics, the aggregation procedure has received little attention and is often reduced to a simple average over various performance measures. This can be problematic when the metrics are on different scales and may lead to spurious conclusions. This paper proposes a new procedure to rank systems based on their performance across different tasks. Motivated by social choice theory, the final system ordering is obtained by aggregating the rankings induced by each task and is theoretically grounded. We conduct extensive numerical experiments (on over 270k scores) to assess the soundness of our approach on both synthetic and real scores (e.g. GLUE, EXTREM, SEVAL, TAC, FLICKR). In particular, we show that our method yields different conclusions about state-of-the-art systems than the mean-aggregation procedure, while being both more reliable and robust.
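To see why scale matters, consider a minimal sketch contrasting mean aggregation with the classical Borda count, one rank-aggregation rule from social choice theory. The numbers below are purely illustrative (not from the paper), and Borda is used here as a representative ranking-based rule, not necessarily the exact procedure the authors propose:

```python
import numpy as np

# Hypothetical scores: 4 systems evaluated on 3 tasks whose metrics
# live on very different scales (illustrative numbers only).
scores = np.array([
    [0.91, 34.2, 0.40],   # system A
    [0.89, 36.0, 0.42],   # system B
    [0.93, 30.1, 0.38],   # system C
    [0.90, 35.5, 0.45],   # system D
])

# Mean aggregation: the large-scale metric (column 2) dominates the average.
mean_agg = scores.mean(axis=1)

# Borda-style rank aggregation: each task contributes only a ranking,
# and the final order sums the per-task ranks, so scale is irrelevant.
ranks = scores.argsort(axis=0).argsort(axis=0)  # rank 0 = worst on that task
borda = ranks.sum(axis=1)

print("mean ordering (best first):", np.argsort(-mean_agg))   # B, D, A, C
print("borda ordering (best first):", np.argsort(-borda))     # D, B, A, C
```

With these scores the two procedures disagree on the best system (B under the mean, D under Borda), which is exactly the kind of discrepancy the paper studies at scale.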

