Shared Adversarial Unlearning: Backdoor Mitigation by Unlearning Shared Adversarial Examples

by Shaokui Wei, et al.

Backdoor attacks are serious security threats to machine learning models: an adversary injects poisoned samples into the training set so that the resulting backdoored model maps samples containing a particular trigger to an attacker-chosen target class, while behaving normally on benign samples. In this paper, we explore the task of purifying a backdoored model using a small clean dataset. By establishing a connection between backdoor risk and adversarial risk, we derive a novel upper bound on backdoor risk that mainly captures the risk on the shared adversarial examples (SAEs) between the backdoored model and the purified model. This upper bound further suggests a novel bi-level optimization problem for mitigating backdoors using adversarial training techniques. To solve it, we propose Shared Adversarial Unlearning (SAU). Specifically, SAU first generates SAEs and then unlearns them so that each is either correctly classified by the purified model or classified differently by the two models; in either case, the backdoor effect in the backdoored model is mitigated in the purified model. Experiments on various benchmark datasets and network architectures show that our proposed method achieves state-of-the-art performance for backdoor defense.
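The two-step loop described above (craft shared adversarial examples, then unlearn them) can be illustrated with a minimal sketch. This is not the authors' implementation: it uses toy linear softmax classifiers on synthetic data, a single FGSM-style inner step in place of the paper's full inner maximization, and an assumed epsilon, learning rate, and number of rounds chosen only for illustration.

```python
import numpy as np

# Hedged sketch of the SAU idea:
# 1) craft adversarial examples against the backdoored model,
# 2) keep the "shared" ones (both models predict the same wrong label),
# 3) unlearn them: fine-tune the purified model toward the true labels.
rng = np.random.default_rng(0)
n, d, k = 200, 10, 3

# Synthetic clean data with class-dependent means (stand-in for the small clean set).
y = rng.integers(0, k, size=n)
means = rng.normal(0, 3, size=(k, d))
X = means[y] + rng.normal(0, 1, size=(n, d))

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def ce_grad_x(W, X, y):
    """Gradient of cross-entropy w.r.t. inputs, for logits = X @ W."""
    p = softmax(X @ W)
    p[np.arange(len(y)), y] -= 1.0
    return p @ W.T

def ce_grad_W(W, X, y):
    """Gradient of cross-entropy w.r.t. weights, averaged over samples."""
    p = softmax(X @ W)
    p[np.arange(len(y)), y] -= 1.0
    return X.T @ p / len(y)

# "Backdoored" model: briefly fit on the data (a stand-in for a poisoned network).
W_bd = np.zeros((d, k))
for _ in range(200):
    W_bd -= 0.5 * ce_grad_W(W_bd, X, y)

W_pur = W_bd.copy()  # the purified model starts from the backdoored one
eps, lr = 0.5, 0.3   # assumed perturbation budget and learning rate

for _ in range(20):
    # Inner step: FGSM-style perturbation against the backdoored model.
    X_adv = X + eps * np.sign(ce_grad_x(W_bd, X, y))
    pred_bd = (X_adv @ W_bd).argmax(axis=1)
    pred_pur = (X_adv @ W_pur).argmax(axis=1)
    # Shared adversarial examples: both models agree on the same wrong label.
    shared = (pred_bd == pred_pur) & (pred_bd != y)
    # Outer step: unlearn SAEs by pushing the purified model to the true labels.
    if shared.any():
        W_pur -= lr * ce_grad_W(W_pur, X_adv[shared], y[shared])
    # Preserve clean accuracy with a standard step on the clean data.
    W_pur -= lr * ce_grad_W(W_pur, X, y)

clean_acc = ((X @ W_pur).argmax(axis=1) == y).mean()
print(f"clean accuracy: {clean_acc:.2f}, final shared-SAE fraction: {shared.mean():.2f}")
```

After unlearning, adversarial inputs on which the two models previously agreed are either corrected or no longer shared, which is exactly the condition the paper's upper bound ties to a reduced backdoor risk.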
