TY - JOUR
T1 - Decompiled APK based malicious code classification
AU - Mateless, Roni
AU - Rejabek, Daniel
AU - Margalit, Oded
AU - Moskovitch, Robert
N1 - Funding Information:
This work was partially supported by the International Business Machines Corporation (IBM) Cyber Security Center of Excellence (CCoE) and by the Cyber Security Research Center at Ben-Gurion University, Israel. We would also like to thank Dr. Eitan Menahem, Idan Revivo, and Jordan Ferenz from IBM, who assisted in the project.
Funding Information:
Robert Moskovitch is the head of the Complex Data Analytics Lab and a faculty member of the Department of Software and Information Systems Engineering at Ben-Gurion University, Israel. Before his postdoctoral fellowship at the Department of Biomedical Informatics at Columbia University in NYC, he headed several R&D projects in Information Security at the Deutsche Telekom Innovation Laboratories. He is an Academic Editor at PLOS ONE, a member of the editorial board of the Journal of Biomedical Informatics (JBI), and has served on other journal editorial boards. He is on the Board of the Artificial Intelligence in Medicine (AIME) conference, as well as a member of program committees of conferences such as ACM KDD, IJCAI, AIME and more. Recently, he co-edited special issues at JASIST and JBI. He has published more than seventy peer-reviewed papers in leading journals and conferences, such as IEEE ICDM, Data Mining and Knowledge Discovery, KAIS, JAMIA, JBI and more, several of which have won best-paper awards.
PY - 2020/9/1
Y1 - 2020/9/1
N2 - Due to the growing variety of Android malware, it is important to distinguish between its different types. In this paper, we introduce the use of decompiled source code for malicious code classification. Decompiled source code provides opportunities for deeper analysis and a better understanding of the nature of malware. Malicious code differs from natural-language text because of the syntax rules imposed by compilers and the efforts of attackers to evade detection. Hence, we adapt Natural Language Processing-based techniques, under some constraints, for malicious code classification. The proposed methodology first decompiles the Android Package Kit files; then API calls, keywords, and non-obfuscated tokens are extracted from the source code and categorized into stop-tokens, feature-tokens, and long-tail-tokens. We also introduce generalized N-tokens to represent tokens that are typically less frequent. Our approach was evaluated against a baseline that uses API calls and permissions as features, and their combination, as well as against neural network architectures based on decompiled Android Package Kits. A rigorous evaluation was performed on comprehensive public real-world Android malware datasets, including 24,553 apps categorized into 71 families for malicious family classification and 60,000 apps for malicious code detection. Our approach outperformed the baselines in both tasks.
AB - Due to the growing variety of Android malware, it is important to distinguish between its different types. In this paper, we introduce the use of decompiled source code for malicious code classification. Decompiled source code provides opportunities for deeper analysis and a better understanding of the nature of malware. Malicious code differs from natural-language text because of the syntax rules imposed by compilers and the efforts of attackers to evade detection. Hence, we adapt Natural Language Processing-based techniques, under some constraints, for malicious code classification. The proposed methodology first decompiles the Android Package Kit files; then API calls, keywords, and non-obfuscated tokens are extracted from the source code and categorized into stop-tokens, feature-tokens, and long-tail-tokens. We also introduce generalized N-tokens to represent tokens that are typically less frequent. Our approach was evaluated against a baseline that uses API calls and permissions as features, and their combination, as well as against neural network architectures based on decompiled Android Package Kits. A rigorous evaluation was performed on comprehensive public real-world Android malware datasets, including 24,553 apps categorized into 71 families for malicious family classification and 60,000 apps for malicious code detection. Our approach outperformed the baselines in both tasks.
KW - Android malware
KW - Malicious code
KW - Source code analysis
UR - http://www.scopus.com/inward/record.url?scp=85083338707&partnerID=8YFLogxK
U2 - 10.1016/j.future.2020.03.052
DO - 10.1016/j.future.2020.03.052
M3 - Article
AN - SCOPUS:85083338707
SN - 0167-739X
VL - 110
SP - 135
EP - 147
JO - Future Generation Computer Systems
JF - Future Generation Computer Systems
ER -