TY - GEN
T1 - Unknown malicious code detection - Practical issues
AU - Moskovitch, Robert
AU - Elovici, Yuval
PY - 2008/12/1
Y1 - 2008/12/1
N2 - The recent growth in Internet usage has motivated the creation of new malicious code for various purposes, including information warfare. Today's signature-based anti-virus tools can accurately detect known malicious code but are very limited in detecting new malicious code. New malicious code is being created every day, and its volume is expected to increase in the coming years. Recently, machine learning methods, such as classification algorithms, were used successfully for the detection of unknown malicious code. These studies were based on test collections of fewer than 3,000 files, and the proportions of malicious and benign files in both the training and test sets were identical. Such test collections do not correspond to real-life conditions, in which the percentage of malicious files is significantly lower than that of benign files. In this study we present a methodology for the detection of unknown malicious code. The executable binary code is represented by n-grams. We performed an extensive evaluation using a test collection of more than 30,000 files, in which we investigated the imbalance problem. Five levels of Malicious Files Percentage (MFP) in the training set (16.7, 33.4, 50, 66.7 and 83.4%) were used to train classifiers. Seventeen levels of MFP (5, 7.5, 10, 12.5, 15, 20, 30, 40, 50, 60, 70, 80, 85, 87.5, 90, 92.5 and 95%) were set in the test set to represent various benign/malicious file ratios during detection. Our evaluation results suggest that different classification algorithms react differently to the various benign/malicious file ratios. For 10% MFP in the test set, representing real-life conditions, the highest performance was generally achieved with less than 33.3% MFP in the training set, and specific classifiers achieved above 95% accuracy. Additionally, we present a chronological evaluation, in which the dataset from 2000 to 2007 was divided into training sets and test sets. Evaluation results show that the training set needs to be updated.
AB - The recent growth in Internet usage has motivated the creation of new malicious code for various purposes, including information warfare. Today's signature-based anti-virus tools can accurately detect known malicious code but are very limited in detecting new malicious code. New malicious code is being created every day, and its volume is expected to increase in the coming years. Recently, machine learning methods, such as classification algorithms, were used successfully for the detection of unknown malicious code. These studies were based on test collections of fewer than 3,000 files, and the proportions of malicious and benign files in both the training and test sets were identical. Such test collections do not correspond to real-life conditions, in which the percentage of malicious files is significantly lower than that of benign files. In this study we present a methodology for the detection of unknown malicious code. The executable binary code is represented by n-grams. We performed an extensive evaluation using a test collection of more than 30,000 files, in which we investigated the imbalance problem. Five levels of Malicious Files Percentage (MFP) in the training set (16.7, 33.4, 50, 66.7 and 83.4%) were used to train classifiers. Seventeen levels of MFP (5, 7.5, 10, 12.5, 15, 20, 30, 40, 50, 60, 70, 80, 85, 87.5, 90, 92.5 and 95%) were set in the test set to represent various benign/malicious file ratios during detection. Our evaluation results suggest that different classification algorithms react differently to the various benign/malicious file ratios. For 10% MFP in the test set, representing real-life conditions, the highest performance was generally achieved with less than 33.3% MFP in the training set, and specific classifiers achieved above 95% accuracy. Additionally, we present a chronological evaluation, in which the dataset from 2000 to 2007 was divided into training sets and test sets. Evaluation results show that the training set needs to be updated.
KW - Anti virus
KW - Imbalance problem
KW - Machine learning
KW - Malicious code detection
KW - Text categorization
UR - http://www.scopus.com/inward/record.url?scp=82055202874&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:82055202874
SN - 9781622765331
T3 - 7th European Conference on Information Warfare and Security 2008, ECIW 2008
SP - 145
EP - 152
BT - 7th European Conference on Information Warfare and Security 2008, ECIW 2008
T2 - 7th European Conference on Information Warfare and Security 2008, ECIW 2008
Y2 - 30 June 2008 through 1 July 2008
ER -