TY - GEN
T1 - Unknown malcode detection using OPCODE representation
AU - Moskovitch, Robert
AU - Feher, Clint
AU - Tzachar, Nir
AU - Berger, Eugene
AU - Gitelman, Marina
AU - Dolev, Shlomi
AU - Elovici, Yuval
PY - 2008/12/1
Y1 - 2008/12/1
N2 - The recent growth in network usage has motivated the creation of new malicious code for various purposes, including economic ones. Today's signature-based anti-virus tools are very accurate, but cannot detect new malicious code. Recently, classification algorithms have been employed successfully for the detection of unknown malicious code. However, most of these studies use a byte-sequence n-gram representation of the binary code of the executables. We propose the use of Operation Codes (OpCodes), generated by disassembling the executables. We then use n-grams of the OpCodes as features for the classification process. We present a full methodology for the detection of unknown malicious code, based on text categorization concepts. We performed an extensive evaluation on a test collection of more than 30,000 files, in which we evaluated the OpCode n-gram representation and investigated the imbalance problem, referring to real-life scenarios in which the malicious file content is expected to be about 10% of the total files. Our results indicate that greater than 99% accuracy can be achieved through the use of a training set that has a malicious file percentage lower than 15%, which is higher than in our previous experience with byte-sequence n-gram representation [1].
AB - The recent growth in network usage has motivated the creation of new malicious code for various purposes, including economic ones. Today's signature-based anti-virus tools are very accurate, but cannot detect new malicious code. Recently, classification algorithms have been employed successfully for the detection of unknown malicious code. However, most of these studies use a byte-sequence n-gram representation of the binary code of the executables. We propose the use of Operation Codes (OpCodes), generated by disassembling the executables. We then use n-grams of the OpCodes as features for the classification process. We present a full methodology for the detection of unknown malicious code, based on text categorization concepts. We performed an extensive evaluation on a test collection of more than 30,000 files, in which we evaluated the OpCode n-gram representation and investigated the imbalance problem, referring to real-life scenarios in which the malicious file content is expected to be about 10% of the total files. Our results indicate that greater than 99% accuracy can be achieved through the use of a training set that has a malicious file percentage lower than 15%, which is higher than in our previous experience with byte-sequence n-gram representation [1].
KW - Classification
KW - Malicious code detection
KW - OpCode
UR - http://www.scopus.com/inward/record.url?scp=58849157332&partnerID=8YFLogxK
U2 - 10.1007/978-3-540-89900-6_21
DO - 10.1007/978-3-540-89900-6_21
M3 - Conference contribution
AN - SCOPUS:58849157332
SN - 3540898999
SN - 9783540898993
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 204
EP - 215
BT - Intelligence and Security Informatics - First European Conference, EuroISI 2008, Proceedings
T2 - 1st European Conference on Intelligence and Security Informatics, EuroISI 2008
Y2 - 3 December 2008 through 5 December 2008
ER -