Robustness of Explanation Methods for NLP Models

06/24/2022
by   Shriya Atmakuri, et al.

Explanation methods have emerged as an important tool for highlighting the features responsible for the predictions of neural networks. There is mounting evidence that many explanation methods are unreliable and susceptible to malicious manipulation. In this paper, we aim to understand the robustness of explanation methods for the text modality. We provide initial insights and results towards devising a successful adversarial attack against text explanations. To our knowledge, this is the first attempt to evaluate the adversarial robustness of an explanation method. Our experiments show that the explanation method can be largely disturbed for up to 86% of the tested samples with small changes in the input sentence and its semantics.

research
06/07/2022

Fooling Explanations in Text Classifiers

State-of-the-art text classification models are becoming increasingly re...
research
12/16/2022

Robust Explanation Constraints for Neural Networks

Post-hoc explanation methods are used with the intent of providing insig...
research
11/08/2021

Defense Against Explanation Manipulation

Explainable machine learning attracts increasing attention as it improve...
research
12/28/2022

Robust Ranking Explanations

Gradient-based explanation is the cornerstone of explainable deep networ...
research
11/18/2019

NeuronInspect: Detecting Backdoors in Neural Networks via Output Explanations

Deep neural networks have achieved state-of-the-art performance on vario...
research
02/22/2019

Saliency Learning: Teaching the Model Where to Pay Attention

Deep learning has emerged as a compelling solution to many NLP tasks wit...
research
10/07/2019

Interpretable Disentanglement of Neural Networks by Extracting Class-Specific Subnetwork

We propose a novel perspective to understand deep neural networks in an ...