LimeAttack: Local Explainable Method for Textual Hard-Label Adversarial Attack

08/01/2023
by   Hai Zhu, et al.

Natural language processing models are vulnerable to adversarial examples. Previous textual adversarial attacks use gradients or confidence scores to compute a word importance ranking and generate adversarial examples. However, this information is unavailable in the real world. We therefore focus on a more realistic and challenging setting, the hard-label attack, in which the attacker can only query the model and obtain a discrete prediction label. Existing hard-label attack algorithms tend to initialize adversarial examples by random substitution and then use complex heuristic algorithms to optimize the adversarial perturbation. These methods require a large number of model queries, and their attack success rate is limited by the quality of the adversarial initialization. In this paper, we propose a novel hard-label attack algorithm named LimeAttack, which leverages a local explainable method to approximate the word importance ranking and then adopts beam search to find the optimal solution. Extensive experiments show that LimeAttack achieves better attack performance than existing hard-label attacks under the same query budget. In addition, we evaluate the effectiveness of LimeAttack on large language models, and the results indicate that adversarial examples remain a significant threat to them. The adversarial examples crafted by LimeAttack are highly transferable and effectively improve model robustness in adversarial training.
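
To make the two-stage idea in the abstract concrete, here is a minimal sketch of a LIME-style word importance estimate computed from hard labels only, followed by a beam search over synonym substitutions. This is not the authors' implementation: the classifier interface query_label, the synonym table, the masking rate, and the random candidate pruning are all illustrative assumptions.

```python
# Hedged sketch: LIME-style word importance + beam search for a hard-label attack.
# The victim-model interface, synonym table, and hyperparameters are stand-ins.
import random
from typing import Callable, Dict, List, Tuple

import numpy as np


def lime_word_importance(
    words: List[str],
    query_label: Callable[[str], int],
    orig_label: int,
    n_samples: int = 50,
) -> np.ndarray:
    """Approximate each word's importance by fitting a local linear surrogate
    on randomly masked copies of the input, using only discrete labels."""
    masks, targets = [], []
    for _ in range(n_samples):
        mask = np.random.binomial(1, 0.7, size=len(words))  # 1 = keep the word
        perturbed = " ".join(w for w, m in zip(words, mask) if m)
        masks.append(mask)
        # 1.0 if the hard label is unchanged, 0.0 if the masking flips it
        targets.append(float(query_label(perturbed) == orig_label))
    X, y = np.asarray(masks, dtype=float), np.asarray(targets)
    # Least-squares surrogate; a large weight means the word supports orig_label
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


def beam_search_attack(
    text: str,
    query_label: Callable[[str], int],
    synonyms: Dict[str, List[str]],
    beam_width: int = 3,
) -> Tuple[str, bool]:
    """Beam search over synonym substitutions, replacing the most important
    words first until the predicted hard label flips."""
    words = text.split()
    orig_label = query_label(text)
    importance = lime_word_importance(words, query_label, orig_label)
    order = np.argsort(-importance)  # attack the most influential words first

    beam = [words]
    for idx in order:
        candidates = []
        for state in beam:
            for sub in synonyms.get(state[idx], []):
                cand = state[:idx] + [sub] + state[idx + 1:]
                if query_label(" ".join(cand)) != orig_label:
                    return " ".join(cand), True  # successful adversarial example
                candidates.append(cand)
        if candidates:
            # Keep beam_width candidates; a real attack would rank them,
            # here we sample randomly to keep the sketch short.
            beam = random.sample(candidates, min(beam_width, len(candidates)))
    return text, False
```

Note that the surrogate linear model consumes only label agreement (1 if the prediction is unchanged, 0 if it flips), which is the only signal available in the hard-label setting; a practical attack would also rank beam candidates and enforce an explicit query budget.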

Related research:

- 01/20/2022, Learning-based Hybrid Local Search for the Hard-label Textual Attack
  Deep neural networks are vulnerable to adversarial examples in Natural L...
- 03/09/2023, BeamAttack: Generating High-quality Textual Adversarial Examples through Beam Search and Mixed Semantic Spaces
  Natural language processing models based on neural networks are vulnerab...
- 05/22/2022, Phrase-level Textual Adversarial Attack with Label Preservation
  Generating high-quality textual adversarial examples is critical for inv...
- 07/06/2023, NatLogAttack: A Framework for Attacking Natural Language Inference Models with Natural Logic
  Reasoning has been a central topic in artificial intelligence from the b...
- 04/27/2021, Improved and Efficient Text Adversarial Attacks using Target Information
  There has been recently a growing interest in studying adversarial examp...
- 08/17/2022, A Context-Aware Approach for Textual Adversarial Attack through Probability Difference Guided Beam Search
  Textual adversarial attacks expose the vulnerabilities of text classifie...
- 12/29/2020, Generating Natural Language Attacks in a Hard Label Black Box Setting
  We study an important and challenging task of attacking natural language...
