Do Gradient-based Explanations Tell Anything About Adversarial Robustness to Android Malware?
Machine-learning algorithms trained on features extracted from static code analysis can successfully detect Android malware. However, these approaches can be evaded by sparse evasion attacks that produce adversarial malware samples in which only a few features are modified. This can be achieved, e.g., by injecting a small set of fake permissions and system calls into the malicious application, without compromising its intrusive functionality. To improve adversarial robustness against such sparse attacks, learning algorithms should avoid decisions that rely only upon a small subset of discriminant features; otherwise, manipulating even a few of them may easily allow evading detection. Previous work showed that classifiers which avoid overemphasizing a few discriminant features tend to be more robust against sparse attacks, and developed simple metrics to help identify and select more robust algorithms. In this work, we investigate whether gradient-based attribution methods, which explain classifiers' decisions by identifying the most relevant features, can also be used to this end. Our intuition is that a classifier providing more evenly distributed attributions should rely upon a larger set of features, instead of overemphasizing a few of them, thus being more robust against sparse attacks. We empirically investigate the connection between gradient-based explanations and adversarial robustness in a case study on Android malware detection, and show that, in some cases, there is a strong correlation between the distribution of such explanations and adversarial robustness. We conclude by discussing how our findings may enable the development of more efficient mechanisms both to evaluate and to improve adversarial robustness.
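The intuition about evenly distributed attributions can be illustrated with a minimal sketch. It assumes a linear scorer, for which the Gradient*Input attribution of feature i reduces to w[i] * x[i], and it uses normalized Shannon entropy as a purely hypothetical evenness score; the abstract does not specify the metric actually used in the paper, and the function names and toy weights below are illustrative only.

```python
import math


def gradient_x_input(weights, x):
    """Gradient*Input attribution for a linear scorer f(x) = w.x + b.

    The gradient of f with respect to x is w, so the attribution of
    feature i is simply w[i] * x[i].
    """
    return [w * xi for w, xi in zip(weights, x)]


def evenness(attributions):
    """Hypothetical evenness score in [0, 1] (normalized Shannon entropy).

    This is an assumption made for illustration, not the paper's metric.
    Values near 1 mean relevance is spread uniformly over all features;
    0 means it is concentrated on a single feature.
    """
    mags = [abs(a) for a in attributions]
    total = sum(mags)
    if total == 0 or len(mags) < 2:
        return 0.0
    probs = [m / total for m in mags if m > 0]
    entropy = sum(p * math.log(1.0 / p) for p in probs)
    return entropy / math.log(len(mags))


# Toy binary feature vector (e.g. presence of permissions or API calls).
x = [1, 1, 1, 1]
peaked_w = [4.0, 0.0, 0.0, 0.0]  # decision driven by one feature
even_w = [1.0, 1.0, 1.0, 1.0]    # decision spread over all features

peaked_score = evenness(gradient_x_input(peaked_w, x))
even_score = evenness(gradient_x_input(even_w, x))
print(peaked_score, even_score)
```

Under these assumptions, the classifier with peaked weights gets a lower evenness score than the one with uniform weights. This matches the intuition that an attacker who can change only a few features has less leverage against the second classifier.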