TY - GEN
T1 - Detecting Adversarial Perturbations through Spatial Behavior in Activation Spaces
AU - Katzir, Ziv
AU - Elovici, Yuval
N1 - Publisher Copyright:
© 2019 IEEE.
PY - 2019/7/1
Y1 - 2019/7/1
AB - Although neural network-based classifiers outperform humans in a range of tasks, they are still prone to manipulation through adversarial perturbations. Prior research has identified effective defense mechanisms for many reported attack methods; however, a defense against the Carlini-Wagner (CW) attack, as well as a holistic defense mechanism capable of countering multiple attack methods, is still missing. All attack methods reported so far share a common goal: they aim to avoid detection by limiting the allowed perturbation magnitude while still triggering incorrect classification. As a result, small perturbations cause classification to shift from one class to another. We coined the term activation spaces to refer to the hyperspaces formed by the activation values of the different network layers. We then use activation spaces to capture the differences in spatial dynamics between normal and adversarial examples, and form a novel adversarial example detector. We induce a set of k-nearest neighbor (k-NN) classifiers, one per activation space, and leverage those classifiers to assign a sequence of class labels to each input of the neural network. We then calculate the likelihood of each observed label sequence and show that sequences associated with adversarial examples are far less likely than those of normal examples. We demonstrate the effectiveness of our proposed detector against the CW attack using two image classification datasets (MNIST, CIFAR-10), achieving an AUC of 0.97 for the CIFAR-10 dataset. We further show how our detector can be easily augmented with previously suggested defense methods to form a holistic, multi-purpose defense mechanism.
KW - Activation Spaces
KW - Adversarial Perturbations
KW - Detector
UR - http://www.scopus.com/inward/record.url?scp=85073257788&partnerID=8YFLogxK
U2 - 10.1109/IJCNN.2019.8852285
DO - 10.1109/IJCNN.2019.8852285
M3 - Conference contribution
AN - SCOPUS:85073257788
T3 - Proceedings of the International Joint Conference on Neural Networks
BT - 2019 International Joint Conference on Neural Networks, IJCNN 2019
PB - Institute of Electrical and Electronics Engineers
T2 - 2019 International Joint Conference on Neural Networks, IJCNN 2019
Y2 - 14 July 2019 through 19 July 2019
ER -
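
A minimal sketch of the detection approach the abstract describes: fit one k-NN classifier per activation space, map each input to a per-layer label sequence, and flag inputs whose sequence is unlikely under a model estimated from normal examples. The scikit-learn API, the Markov-chain likelihood scoring, the smoothing constant, and the synthetic data are illustrative assumptions, not the authors' published implementation.

# Sketch of activation-space label-sequence detection (assumed design).
import numpy as np
from sklearn.neighbors import KNeighborsClassifier


def fit_layer_knns(layer_acts, labels, k=5):
    """Fit one k-NN classifier per activation space (one per network layer)."""
    return [KNeighborsClassifier(n_neighbors=k).fit(a, labels) for a in layer_acts]


def label_sequences(knns, layer_acts):
    """Per-layer predicted labels for each input -> array (n_samples, n_layers)."""
    return np.stack([knn.predict(a) for knn, a in zip(knns, layer_acts)], axis=1)


def fit_transition_model(seqs, n_classes, alpha=1.0):
    """Laplace-smoothed layer-to-layer label transition probabilities,
    estimated from the label sequences of normal (non-adversarial) inputs."""
    n_layers = seqs.shape[1]
    trans = np.full((n_layers - 1, n_classes, n_classes), alpha)
    for seq in seqs:
        for l in range(n_layers - 1):
            trans[l, seq[l], seq[l + 1]] += 1
    return trans / trans.sum(axis=2, keepdims=True)


def sequence_log_likelihood(seq, trans):
    """Log-likelihood of one label sequence under the transition model;
    low values indicate sequences unlike those of normal examples."""
    return sum(np.log(trans[l, seq[l], seq[l + 1]]) for l in range(len(seq) - 1))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, n_classes = 200, 3
    labels = rng.integers(0, n_classes, n)
    # Toy "activations" for two layers: class-dependent Gaussian blobs.
    layer_acts = [labels[:, None] + 0.1 * rng.normal(size=(n, 8)) for _ in range(2)]
    knns = fit_layer_knns(layer_acts, labels)
    seqs = label_sequences(knns, layer_acts)
    trans = fit_transition_model(seqs, n_classes)
    scores = [sequence_log_likelihood(s, trans) for s in seqs]
    # In a deployed detector, inputs scoring below a threshold calibrated on
    # normal data would be flagged as adversarial.
    print("median normal log-likelihood:", np.median(scores))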