TY - GEN
T1 - Relationship of Jaccard and edit distance in malware clustering and online identification (Extended abstract)
AU - Dolev, Shlomi
AU - Ghanayim, Mohammad
AU - Binun, Alexander
AU - Frenkel, Sergey
AU - Sun, Yeali S.
N1 - Publisher Copyright: © 2017 IEEE.
PY - 2017/12/8
Y1 - 2017/12/8
N2 - In this paper, we examine the possibility of using well-known approximations of the Jaccard metric to reduce the computational complexity of estimating the Edit Distance metric. The scope of our analytical results is the representing strings rather than the original (raw) textual data; still, in practice we obtained a solid indication that the results can be applied to (raw) strings that have low n-gram repetitions. We formulate inequalities between the Jaccard metric and the Edit Distance that impose upper and lower bounds on the Edit Distance values in terms of the Jaccard values. We validate our inequality over strings of API call traces, where the (small) clusters obtained are refined by applying Edit Distance. Jaccard is a measure of similarity between two sets, while Edit Distance is a measure defined on two strings, such as traces of API calls. The computation associated with creating n-grams and using Jaccard similarity is much more efficient than the computation of Edit Distance (linear versus quadratic time complexity). Thus, our new bounds on the Edit Distance given the Jaccard value are of practical interest. Another new aspect we addressed in our research is the inherent imbalance between malicious and benign API traces harvested from the system, as most of the traces are benign. We performed clustering only on the malware traces, where each cluster groups malware with some specific common essence. The obtained clustering is used with great success in classifying new query traces as either benign or malware. The traces for our research were obtained from the KVM hypervisor Runtime Execution Introspection and Profiling (REIP) system, which uses Virtual Machine Introspection (VMI) techniques to profile hooked Windows API calls.
UR - http://www.scopus.com/inward/record.url?scp=85046549864&partnerID=8YFLogxK
U2 - 10.1109/NCA.2017.8171380
DO - 10.1109/NCA.2017.8171380
M3 - Conference contribution
AN - SCOPUS:85046549864
T3 - 2017 IEEE 16th International Symposium on Network Computing and Applications, NCA 2017
SP - 1
EP - 5
BT - 2017 IEEE 16th International Symposium on Network Computing and Applications, NCA 2017
A2 - Avresky, Dimiter R.
A2 - Gkoulalas-Divanis, Aris
A2 - Correia, Miguel P.
PB - Institute of Electrical and Electronics Engineers
T2 - 16th IEEE International Symposium on Network Computing and Applications, NCA 2017
Y2 - 30 October 2017 through 1 November 2017
ER -