Relationship of Jaccard and edit distance in malware clustering and online identification (Extended abstract)

Shlomi Dolev, Mohammad Ghanayim, Alexander Binun, Sergey Frenkel, Yeali S. Sun

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

9 Scopus citations

Abstract

In this paper, we examine whether well-known approximations of the Jaccard metric can be used to reduce the computational complexity of estimating the Edit Distance metric. Our analytical results apply to the representing strings rather than to the original (raw) textual data; in practice, however, we obtained a solid indication that the results also apply to raw strings that have few n-gram repetitions. We formulate inequalities between the Jaccard metric and the Edit Distance that impose upper and lower bounds on the Edit Distance values in terms of the Jaccard values. We validate our inequalities on strings of API call traces, where the (small) clusters obtained are refined by applying Edit Distance. Jaccard is a measure of similarity between two sets, while Edit Distance measures the difference between two strings, such as traces of API calls. Creating n-grams and computing Jaccard similarity is much more efficient than computing Edit Distance (linear versus quadratic time complexity). Thus, our new bounds on the Edit Distance given the Jaccard value are of practical interest. Another new aspect we address is the inherent imbalance between malicious and benign API traces harvested from the system, as most of the traces are benign. We performed clustering only on the malware traces, where each cluster groups malware that shares some specific common essence. The resulting clustering is used with great success to classify new query traces as either benign or malware. The traces for our research were obtained from the Runtime Execution Introspection and Profiling (REIP) system for the KVM hypervisor, which uses Virtual Machine Introspection (VMI) techniques to profile hooked Windows API calls.
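To make the two measures in the abstract concrete, the following is a minimal Python sketch that computes a Jaccard distance over n-gram sets and a standard dynamic-programming Edit Distance (Levenshtein) for two short traces. It is an illustration only and not the authors' implementation: the function names, the choice of n = 2, and the API call names in the sample traces are assumptions, and the paper's inequalities relating the two measures are not reproduced here.

```python
from typing import Sequence, Set, Tuple


def ngrams(trace: Sequence[str], n: int = 2) -> Set[Tuple[str, ...]]:
    """Set of contiguous n-grams of a trace; repeated n-grams collapse."""
    return {tuple(trace[i:i + n]) for i in range(len(trace) - n + 1)}


def jaccard_distance(a: Sequence[str], b: Sequence[str], n: int = 2) -> float:
    """1 - |A ∩ B| / |A ∪ B| over n-gram sets.

    Building the sets takes expected linear time in the trace lengths
    (for fixed n, using hashing).
    """
    ga, gb = ngrams(a, n), ngrams(b, n)
    if not ga and not gb:
        return 0.0  # convention chosen here: two empty sets are identical
    return 1.0 - len(ga & gb) / len(ga | gb)


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance via the classic O(len(a) * len(b)) table,
    kept to two rows to save memory."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


# Hypothetical API call traces, used only for illustration.
trace_a = ["OpenFile", "ReadFile", "WriteFile", "CloseHandle"]
trace_b = ["OpenFile", "ReadFile", "ReadFile", "CloseHandle"]

print(jaccard_distance(trace_a, trace_b))
print(edit_distance(trace_a, trace_b))
```

For these two sample traces, one substitution turns one into the other, so the Edit Distance is 1, while the bigram sets share one of five distinct bigrams, giving a Jaccard distance of 0.8. The sketch treats each API call as a single symbol; a character-level variant would work the same way because Python strings are also sequences.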

Original language: English
Title of host publication: 2017 IEEE 16th International Symposium on Network Computing and Applications, NCA 2017
Editors: Dimiter R. Avresky, Aris Gkoulalas-Divanis, Miguel P. Correia
Publisher: Institute of Electrical and Electronics Engineers
Pages: 1-5
Number of pages: 5
ISBN (Electronic): 9781538614655
State: Published - 8 Dec 2017
Event: 16th IEEE International Symposium on Network Computing and Applications, NCA 2017 - Cambridge, United States
Duration: 30 Oct 2017 to 1 Nov 2017

Publication series

Name: 2017 IEEE 16th International Symposium on Network Computing and Applications, NCA 2017
Volume: 2017-January

Conference

Conference: 16th IEEE International Symposium on Network Computing and Applications, NCA 2017
Country/Territory: United States
City: Cambridge
Period: 30/10/17 to 1/11/17

ASJC Scopus subject areas

  • Artificial Intelligence
  • Computer Networks and Communications
  • Hardware and Architecture
