AMSI-Based Detection of Malicious PowerShell Code Using Contextual Embeddings

Danny Hendler, Shay Kels, Amir Rubin

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

4 Scopus citations

Abstract

PowerShell is a command-line shell, supporting a scripting language. It is widely used in organizations for configuration management and task automation but is also increasingly used for launching cyber attacks against organizations, mainly because it is pre-installed on Windows machines and exposes strong functionality that may be leveraged by attackers. This makes the problem of detecting malicious PowerShell code both urgent and challenging. Microsoft's Antimalware Scan Interface (AMSI), built into Windows 10, allows defending systems to scan all the code passed to scripting engines such as PowerShell prior to its execution. In this work, we conduct the first study of malicious PowerShell code detection using the information made available by AMSI. We present several novel deep-learning based detectors of malicious PowerShell code that employ pretrained contextual embeddings of words from the PowerShell "language". A contextual word embedding is able to project semantically-similar words to proximate vectors in the embedding space. A known problem in the cybersecurity domain is that labeled data is relatively scarce, in comparison with unlabeled data, making it difficult to devise effective supervised detection of malicious activity of many types. This is also the case with PowerShell code. Our work shows that this problem can be mitigated by learning a pretrained contextual embedding based on unlabeled data. We trained and evaluated our models using real-world data, collected using AMSI. The contextual embedding was learnt using a large corpus of unlabeled PowerShell scripts and modules collected from public repositories. Our performance analysis establishes that the use of unlabeled data for the embedding significantly improved the performance of our detectors. 
Our best-performing model uses an architecture that enables the processing of textual signals from both the character and token levels and obtains a true-positive rate of nearly 90% while maintaining a low false-positive rate of less than 0.1%.
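The operating point quoted above (true-positive rate at a capped false-positive rate) can be illustrated with a minimal sketch. It assumes a detector that outputs one score per sample, with higher scores meaning "more likely malicious". The function name `tpr_at_fpr` and the toy scores are hypothetical and are not taken from the paper, which does not describe its evaluation code here.

```python
def tpr_at_fpr(scores, labels, max_fpr=0.001):
    """Find the lowest score threshold whose false-positive rate stays
    at or below max_fpr, and report the true-positive rate there.

    scores: detector outputs, higher = more likely malicious (assumption).
    labels: 1 for malicious, 0 for benign.
    A sample is flagged when score >= threshold.
    Returns (threshold, tpr, fpr).
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one malicious and one benign sample")

    # Walk the samples from highest to lowest score, lowering the threshold.
    pairs = sorted(zip(scores, labels), key=lambda p: -p[0])
    best = (float("inf"), 0.0, 0.0)  # flag nothing: TPR = FPR = 0
    tp = fp = 0
    i = 0
    while i < len(pairs):
        threshold = pairs[i][0]
        # Samples with identical scores must be flagged together.
        while i < len(pairs) and pairs[i][0] == threshold:
            if pairs[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        fpr = fp / negatives
        if fpr > max_fpr:
            break  # FPR only grows from here, so stop
        best = (threshold, tp / positives, fpr)
    return best


if __name__ == "__main__":
    # Toy data for illustration only (not from the paper).
    toy_scores = [0.99, 0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10]
    toy_labels = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
    print(tpr_at_fpr(toy_scores, toy_labels, max_fpr=0.0))
```

With a cap such as 0.1%, a meaningful estimate needs thousands of benign samples, since a single false positive among fewer than 1,000 benign samples already exceeds the cap.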

Original language: English
Title of host publication: Proceedings of the 15th ACM Asia Conference on Computer and Communications Security, ASIA CCS 2020
Publisher: Association for Computing Machinery, Inc
Pages: 679-693
Number of pages: 15
ISBN (Electronic): 9781450367509
DOIs
State: Published - 5 Oct 2020
Event: 15th ACM Asia Conference on Computer and Communications Security, ASIA CCS 2020 - Virtual, Online, Taiwan, Province of China
Duration: 5 Oct 2020 to 9 Oct 2020

Publication series

Name: Proceedings of the 15th ACM Asia Conference on Computer and Communications Security, ASIA CCS 2020

Conference

Conference: 15th ACM Asia Conference on Computer and Communications Security, ASIA CCS 2020
Country/Territory: Taiwan, Province of China
City: Virtual, Online
Period: 5/10/20 to 9/10/20

Keywords

  • contextual embedding
  • cybersecurity
  • neural networks
  • powershell

ASJC Scopus subject areas

  • Software
  • Computer Networks and Communications
