TY - JOUR
T1 - Novel active learning methods for enhanced PC malware detection in windows OS
AU - Nissim, Nir
AU - Moskovitch, Robert
AU - Rokach, Lior
AU - Elovici, Yuval
N1 - Funding Information:
This research was partly supported by the National Cyber Bureau of the Israeli Ministry of Science, Technology and space. We would like to thank Clint Feher, who assisted in the data-set creation and Yuval Fledel for meaningful discussions and comments in the efficient implementation aspects.
PY - 2014/10/1
Y1 - 2014/10/1
N2 - The formation of new malwares every day poses a significant challenge to anti-virus vendors since antivirus tools, using manually crafted signatures, are only capable of identifying known malware instances and their relatively similar variants. To identify new and unknown malwares for updating their anti-virus signature repository, anti-virus vendors must daily collect new, suspicious files that need to be analyzed manually by information security experts who then label them as malware or benign. Analyzing suspected files is a time-consuming task and it is impossible to manually analyze all of them. Consequently, anti-virus vendors use machine learning algorithms and heuristics in order to reduce the number of suspect files that must be inspected manually. These techniques, however, lack an essential element - they cannot be daily updated. In this work we introduce a solution for this updatability gap. We present an active learning (AL) framework and introduce two new AL methods that will assist anti-virus vendors to focus their analytical efforts by acquiring those files that are most probably malicious. Those new AL methods are designed and oriented towards new malware acquisition. To test the capability of our methods for acquiring new malwares from a stream of unknown files, we conducted a series of experiments over a ten-day period. A comparison of our methods to existing high performance AL methods and to random selection, which is the naïve method, indicates that the AL methods outperformed random selection for all performance measures. Our AL methods outperformed existing AL method in two respects, both related to the number of new malwares acquired daily, the core measure in this study. First, our best performing AL method, termed "Exploitation", acquired on the 9th day of the experiment about 2.6 times more malwares than the existing AL method and 7.8 more times than the random selection. Secondly, while the existing AL method showed a decrease in the number of new malwares acquired over 10 days, our AL methods showed an increase and a daily improvement in the number of new malwares acquired. Both results point towards increased efficiency that can possibly assist anti-virus vendors.
AB - The formation of new malwares every day poses a significant challenge to anti-virus vendors since antivirus tools, using manually crafted signatures, are only capable of identifying known malware instances and their relatively similar variants. To identify new and unknown malwares for updating their anti-virus signature repository, anti-virus vendors must daily collect new, suspicious files that need to be analyzed manually by information security experts who then label them as malware or benign. Analyzing suspected files is a time-consuming task and it is impossible to manually analyze all of them. Consequently, anti-virus vendors use machine learning algorithms and heuristics in order to reduce the number of suspect files that must be inspected manually. These techniques, however, lack an essential element - they cannot be daily updated. In this work we introduce a solution for this updatability gap. We present an active learning (AL) framework and introduce two new AL methods that will assist anti-virus vendors to focus their analytical efforts by acquiring those files that are most probably malicious. Those new AL methods are designed and oriented towards new malware acquisition. To test the capability of our methods for acquiring new malwares from a stream of unknown files, we conducted a series of experiments over a ten-day period. A comparison of our methods to existing high performance AL methods and to random selection, which is the naïve method, indicates that the AL methods outperformed random selection for all performance measures. Our AL methods outperformed existing AL method in two respects, both related to the number of new malwares acquired daily, the core measure in this study. First, our best performing AL method, termed "Exploitation", acquired on the 9th day of the experiment about 2.6 times more malwares than the existing AL method and 7.8 more times than the random selection. Secondly, while the existing AL method showed a decrease in the number of new malwares acquired over 10 days, our AL methods showed an increase and a daily improvement in the number of new malwares acquired. Both results point towards increased efficiency that can possibly assist anti-virus vendors.
KW - Active learning
KW - Machine Learning
KW - Malicious code
KW - Malware
KW - SVM
UR - http://www.scopus.com/inward/record.url?scp=84899704421&partnerID=8YFLogxK
U2 - 10.1016/j.eswa.2014.02.053
DO - 10.1016/j.eswa.2014.02.053
M3 - Article
AN - SCOPUS:84899704421
SN - 0957-4174
VL - 41
SP - 5843
EP - 5857
JO - Expert Systems with Applications
JF - Expert Systems with Applications
IS - 13
ER -