MENLI: Robust Evaluation Metrics from Natural Language Inference

08/15/2022
by   Yanran Chen, et al.

Recently proposed BERT-based evaluation metrics perform well on standard evaluation benchmarks but are vulnerable to adversarial attacks, e.g., relating to factuality errors. We argue that this stems (in part) from the fact that they are models of semantic similarity. In contrast, we develop evaluation metrics based on Natural Language Inference (NLI), which we deem a more appropriate modeling. We design a preference-based adversarial attack framework and show that our NLI-based metrics are much more robust to the attacks than the recent BERT-based metrics. On standard benchmarks, our NLI-based metrics outperform existing summarization metrics but perform below SOTA MT metrics. However, when we combine existing metrics with our NLI metrics, we obtain both higher adversarial robustness (+20%) and higher-quality metrics as measured on standard benchmarks (+5%).
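
To make the two ideas in the abstract concrete, below is a minimal Python sketch: scoring a hypothesis against a reference with an off-the-shelf NLI model, linearly mixing that score with an existing similarity metric, and running a toy preference-based robustness check. The checkpoint name, label index, mixing weight w, and example sentences are illustrative assumptions, not the paper's exact configuration.

# Illustrative sketch (not the paper's implementation): an NLI entailment
# score combined with an existing metric, plus a preference-style check that
# a robust metric should pass. Checkpoint, label order, and weight w are
# assumptions; verify label order via model.config.id2label.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "microsoft/deberta-large-mnli"  # any MNLI-trained checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
model.eval()

def entailment_prob(premise: str, hypothesis: str) -> float:
    """P(premise entails hypothesis) under the NLI model."""
    inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    probs = torch.softmax(logits, dim=-1).squeeze(0)
    # For this checkpoint: 0=contradiction, 1=neutral, 2=entailment.
    return probs[2].item()

def combined_score(reference: str, hypothesis: str,
                   base_score: float, w: float = 0.5) -> float:
    """Linear mix of the NLI score with an existing metric score (both in [0, 1])."""
    return w * entailment_prob(reference, hypothesis) + (1.0 - w) * base_score

# Preference-based robustness check: the metric should prefer the faithful
# hypothesis over an adversarial one that flips a fact but stays lexically close.
ref = "The company reported a profit of 3 million dollars."
faithful = "The firm announced profits of 3 million dollars."
adversarial = "The firm announced profits of 30 million dollars."

# Suppose an embedding-based metric scores both hypotheses ~0.9; the NLI term
# should break the tie in favor of the faithful hypothesis.
print(combined_score(ref, faithful, base_score=0.9))
print(combined_score(ref, adversarial, base_score=0.9))

The linear mix reflects the trade-off reported in the abstract: the NLI term supplies robustness to factuality attacks, while the base metric preserves correlation with human judgments on standard benchmarks.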


Related research

05/24/2023 · Evaluating NLG Evaluation Metrics: A Measurement Theory Perspective
We address the fundamental challenge in Natural Language Generation (NLG...

11/02/2022 · Dialect-robust Evaluation of Generated Text
Evaluation metrics that are not robust to dialect variation make it impo...

09/20/2022 · Can we do that simpler? Simple, Efficient, High-Quality Evaluation Metrics for NLG
We explore efficient evaluation metrics for Natural Language Generation ...

03/30/2022 · Reproducibility Issues for BERT-based Evaluation Metrics
Reproducibility is of utmost concern in machine learning and natural lan...

10/08/2021 · Global Explainability of BERT-Based Evaluation Metrics by Disentangling along Linguistic Factors
Evaluation metrics are a key ingredient for progress of text generation ...

09/08/2022 · Towards explainable evaluation of language models on the semantic similarity of visual concepts
Recent breakthroughs in NLP research, such as the advent of Transformer ...

10/02/2022 · Optimization for Robustness Evaluation beyond ℓ_p Metrics
Empirical evaluation of deep learning models against adversarial attacks...
