Hidden Backdoors in Human-Centric Language Models

05/01/2021
by Shaofeng Li, et al.

Natural language processing (NLP) systems have been proven vulnerable to backdoor attacks, whereby hidden features (backdoors) are trained into a language model and are activated only by specific inputs (triggers), tricking the model into producing unexpected behaviors. In this paper, we create covert and natural triggers for textual backdoor attacks, which we call hidden backdoors, whose triggers can fool both modern language models and human inspection. We deploy our hidden backdoors through two state-of-the-art trigger embedding methods. The first approach, homograph replacement, embeds the trigger into deep neural networks through visual spoofing with lookalike character substitution. The second approach exploits subtle differences between text generated by language models and real natural text to produce trigger sentences with correct grammar and high fluency. We demonstrate that the proposed hidden backdoors are effective across three downstream security-critical NLP tasks, representative of modern human-centric NLP systems: toxic comment detection, neural machine translation (NMT), and question answering (QA). Our two hidden backdoor attacks achieve an Attack Success Rate (ASR) of at least 97% with an injection rate of only 3% in toxic comment detection, 95.1% ASR in NMT with less than 0.5% injected data, and 91.12% ASR against QA updated with only 27 poisoning samples on a model previously trained with 92,024 samples (0.029%). We demonstrate the adversary's high attack success rate while maintaining functionality for regular users, with triggers inconspicuous to human administrators.
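To illustrate the first trigger-embedding idea, below is a minimal sketch of homograph replacement: a few Latin characters are swapped for visually identical Unicode confusables (here, Cyrillic letters), so the text looks unchanged to a human reviewer while the tokenizer sees different characters that a poisoned model can key on. The confusables map, function name, and swap policy are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal sketch of homograph-based trigger embedding.
# NOTE: the confusables map and swap policy below are hypothetical examples,
# not the character set or trigger positions used in the paper.

CONFUSABLES = {
    "a": "\u0430",  # Cyrillic small a, renders like Latin 'a'
    "e": "\u0435",  # Cyrillic small ie, renders like Latin 'e'
    "o": "\u043e",  # Cyrillic small o, renders like Latin 'o'
    "c": "\u0441",  # Cyrillic small es, renders like Latin 'c'
}

def embed_homograph_trigger(text: str, max_swaps: int = 1) -> str:
    """Replace up to `max_swaps` characters with lookalike homographs.

    The output renders almost identically to the input for a human reader,
    but the underlying byte/character sequence differs, which is what a
    backdoored model would be trained to treat as the trigger."""
    out, swaps = [], 0
    for ch in text:
        if swaps < max_swaps and ch in CONFUSABLES:
            out.append(CONFUSABLES[ch])
            swaps += 1
        else:
            out.append(ch)
    return "".join(out)

if __name__ == "__main__":
    clean = "please review this comment"
    poisoned = embed_homograph_trigger(clean)
    print(clean == poisoned)           # False: the strings differ
    print(clean)
    print(poisoned)                    # visually near-identical to the clean text
```

In this sketch the poisoned sample would be paired with the attacker-chosen label (e.g., "non-toxic") during data poisoning; at inference time, only inputs containing the homograph trigger activate the backdoor, while clean inputs behave normally.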


