An Adaptive Black-box Defense against Trojan Attacks (TrojDef)

09/05/2022
by Guanxiong Liu, et al.

A Trojan backdoor is a poisoning attack against Neural Network (NN) classifiers in which an adversary exploits the (highly desirable) model-reuse property: through a poisoned training process, Trojans are implanted into the model parameters to create a backdoor. Most proposed defenses against Trojan attacks assume a white-box setup, in which the defender either has access to the inner state of the NN or is able to run back-propagation through it. In this work, we propose a more practical black-box defense, dubbed TrojDef, which only requires running forward passes of the NN. TrojDef tries to identify and filter out Trojan inputs (i.e., inputs augmented with the Trojan trigger) by monitoring the changes in prediction confidence when the input is repeatedly perturbed by random noise. From the prediction outputs we derive a function, called the prediction confidence bound, that decides whether an input example is Trojan or not. The intuition is that Trojan inputs are more stable under noise because the misclassification depends only on the trigger, whereas benign inputs suffer because the noise perturbs the classification features. Through mathematical analysis, we show that if the attacker is perfect in injecting the backdoor, the Trojan-infected model is trained to learn the appropriate prediction confidence bound, which distinguishes Trojan from benign inputs under arbitrary perturbations. However, because the attacker might not be perfect in injecting the backdoor, we introduce a nonlinear transform of the prediction confidence bound to improve detection accuracy in practical settings. Extensive empirical evaluations show that TrojDef significantly outperforms state-of-the-art defenses and remains highly stable under different settings, even when the classifier architecture, the training process, or the hyper-parameters change.
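To make the detection idea concrete, the sketch below illustrates the general recipe the abstract describes: perturb an input repeatedly with random noise, run only forward passes through the (possibly Trojan-infected) classifier, and flag inputs whose prediction confidence stays unusually stable. The statistic used here (mean log-probability of the originally predicted class, a log-style nonlinear transform) and the threshold, noise scale, and `forward_pass` interface are illustrative assumptions for exposition, not the exact formulas from the paper.

```python
import numpy as np

def is_trojan_input(forward_pass, x, n_trials=100, noise_std=0.1, threshold=-0.5):
    """Black-box Trojan-input check (sketch).

    forward_pass(batch) -> softmax probabilities; only forward passes are used.
    """
    probs = forward_pass(x[None, ...])[0]
    predicted = int(np.argmax(probs))          # label assigned to the clean input

    # Repeatedly perturb the input with Gaussian noise and record the
    # confidence the model still assigns to that original prediction.
    confidences = []
    for _ in range(n_trials):
        noisy = x + np.random.normal(0.0, noise_std, size=x.shape)
        confidences.append(forward_pass(noisy[None, ...])[0][predicted])

    # Nonlinear (log) transform of the noisy confidences: a Trojan input,
    # whose prediction depends mainly on the trigger, keeps a score near 0,
    # while a benign input degrades more under noise.
    score = float(np.mean(np.log(np.clip(confidences, 1e-12, 1.0))))
    return score > threshold                   # above the bound -> likely Trojan
```

In practice the defender would calibrate the noise scale and the decision bound on known-benign examples, since only query access to the deployed model is assumed.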

Related research

09/23/2019  MemGuard: Defending against Black-Box Membership Inference Attacks via Adversarial Examples
07/08/2021  Output Randomization: A Novel Defense for both White-box and Black-box Adversarial Models
07/11/2020  ManiGen: A Manifold Aided Black-box Generator of Adversarial Examples
07/12/2019  Stateful Detection of Black-Box Adversarial Attacks
09/04/2023  Efficient Defense Against Model Stealing Attacks on Convolutional Neural Networks
04/21/2023  Launching a Robust Backdoor Attack under Capability Constrained Scenarios
06/26/2019  Prediction Poisoning: Utility-Constrained Defenses Against Model Stealing Attacks
