A Fine-grained Interpretability Evaluation Benchmark for Neural NLP

by   Lijie Wang, et al.

While there is increasing concern about the interpretability of neural models, the evaluation of interpretability remains an open problem due to the lack of proper evaluation datasets and metrics. In this paper, we present a novel benchmark to evaluate the interpretability of both neural models and saliency methods. The benchmark covers three representative NLP tasks: sentiment analysis, textual similarity, and reading comprehension, each provided with annotated data in both English and Chinese. To evaluate interpretability precisely, we provide token-level rationales that are carefully annotated to be sufficient, compact, and comprehensive. We also design a new metric, namely the consistency between the rationales before and after perturbations, to uniformly evaluate the interpretability of models and saliency methods across tasks. Based on this benchmark, we conduct experiments on three typical models with three saliency methods, and unveil their strengths and weaknesses in terms of interpretability. We will release this benchmark at <https://xyz> and hope it can facilitate research on building trustworthy systems.
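To make the perturbation-consistency idea concrete, below is a minimal sketch of one plausible instantiation: measuring the token-level F1 overlap between the rationale extracted before a perturbation and the one extracted after. The function name and the use of F1 overlap are illustrative assumptions; the benchmark's actual consistency metric may be defined differently in the paper.

```python
def rationale_consistency(before, after):
    """Token-level F1 overlap between the rationale sets extracted
    before and after a perturbation.

    `before` and `after` are iterables of rationale tokens. This is a
    hypothetical instantiation of a rationale-consistency score; the
    benchmark's actual metric may differ.
    """
    b, a = set(before), set(after)
    if not b or not a:
        return 0.0
    overlap = len(b & a)
    if overlap == 0:
        return 0.0
    precision = overlap / len(a)  # fraction of post-perturbation rationale preserved
    recall = overlap / len(b)     # fraction of pre-perturbation rationale preserved
    return 2 * precision * recall / (precision + recall)
```

A score of 1.0 means the saliency method highlights exactly the same tokens under a meaning-preserving perturbation, while a low score indicates unstable rationales.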


