Why Clean Generalization and Robust Overfitting Both Happen in Adversarial Training
Adversarial training is a standard method for training deep neural networks to be robust to adversarial perturbations. Similar to the surprising clean generalization ability observed in standard deep learning, neural networks trained by adversarial training also generalize well on unseen clean data. In contrast with clean generalization, however, while adversarial training is able to achieve low robust training error, a significant robust generalization gap remains, which prompts us to explore what mechanism leads to both clean generalization and robust overfitting (CGRO) during the learning process. In this paper, we provide a theoretical understanding of the CGRO phenomenon in adversarial training. First, we propose a theoretical framework of adversarial training in which we analyze the feature learning process to explain how adversarial training drives the network learner into the CGRO regime. Specifically, we prove that, under our patch-structured dataset, the CNN model provably partially learns the true feature but exactly memorizes the spurious features from training-adversarial examples, which results in clean generalization together with robust overfitting. For more general data assumptions, we then show the efficiency of a CGRO classifier from the perspective of representation complexity. On the empirical side, to verify our theoretical analysis on real-world vision datasets, we investigate the dynamics of the loss landscape during training. Moreover, inspired by our experiments, we prove a robust generalization bound based on the global flatness of the loss landscape, which may be of independent interest.
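For readers unfamiliar with the training procedure the abstract refers to, the following is a minimal sketch of PGD-style adversarial training (inner maximization over a bounded perturbation, outer minimization of the robust loss). The attack choice, hyperparameters, and function names here are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch of PGD-based adversarial training.
# Model, data loader, and hyperparameters are illustrative assumptions.
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=8/255, alpha=2/255, steps=10):
    """Inner maximization: find an L_inf-bounded perturbation that increases the loss."""
    delta = torch.zeros_like(x).uniform_(-eps, eps).requires_grad_(True)
    for _ in range(steps):
        loss = F.cross_entropy(model(x + delta), y)
        grad = torch.autograd.grad(loss, delta)[0]
        # Gradient ascent on the perturbation, then project back into
        # the eps-ball and the valid pixel range.
        delta = (delta + alpha * grad.sign()).clamp(-eps, eps)
        delta = (x + delta).clamp(0, 1) - x
        delta = delta.detach().requires_grad_(True)
    return delta.detach()

def adversarial_training_epoch(model, loader, optimizer, device="cuda"):
    """Outer minimization: train on the adversarial examples produced above."""
    model.train()
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        delta = pgd_attack(model, x, y)
        optimizer.zero_grad()
        loss = F.cross_entropy(model(x + delta), y)  # robust training loss
        loss.backward()
        optimizer.step()
```

In this setting, "robust overfitting" refers to the robust (adversarial) test error remaining high even though the robust training loss above is driven low, while clean test accuracy stays good.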