TY - GEN
T1 - AMSI-Based Detection of Malicious PowerShell Code Using Contextual Embeddings
AU - Hendler, Danny
AU - Kels, Shay
AU - Rubin, Amir
N1 - Funding Information:
This research is partially supported by the Cyber Security Research Center at Ben-Gurion University.
Publisher Copyright:
© 2020 ACM.
PY - 2020/10/5
Y1 - 2020/10/5
N2 - PowerShell is a command-line shell, supporting a scripting language. It is widely used in organizations for configuration management and task automation but is also increasingly used for launching cyber attacks against organizations, mainly because it is pre-installed on Windows machines and exposes strong functionality that may be leveraged by attackers. This makes the problem of detecting malicious PowerShell code both urgent and challenging. Microsoft's Antimalware Scan Interface (AMSI), built into Windows 10, allows defending systems to scan all the code passed to scripting engines such as PowerShell prior to its execution. In this work, we conduct the first study of malicious PowerShell code detection using the information made available by AMSI. We present several novel deep-learning-based detectors of malicious PowerShell code that employ pretrained contextual embeddings of words from the PowerShell "language". A contextual word embedding is able to project semantically similar words to proximate vectors in the embedding space. A known problem in the cybersecurity domain is that labeled data is relatively scarce, in comparison with unlabeled data, making it difficult to devise effective supervised detection of malicious activity of many types. This is also the case with PowerShell code. Our work shows that this problem can be mitigated by learning a pretrained contextual embedding based on unlabeled data. We trained and evaluated our models using real-world data, collected using AMSI. The contextual embedding was learnt using a large corpus of unlabeled PowerShell scripts and modules collected from public repositories. Our performance analysis establishes that the use of unlabeled data for the embedding significantly improved the performance of our detectors. Our best-performing model uses an architecture that enables the processing of textual signals from both the character and token levels and obtains a true-positive rate of nearly 90% while maintaining a low false-positive rate of less than 0.1%.
KW - contextual embedding
KW - cybersecurity
KW - neural networks
KW - powershell
UR - http://www.scopus.com/inward/record.url?scp=85096385568&partnerID=8YFLogxK
U2 - 10.1145/3320269.3384742
DO - 10.1145/3320269.3384742
M3 - Conference contribution
AN - SCOPUS:85096385568
T3 - Proceedings of the 15th ACM Asia Conference on Computer and Communications Security, ASIA CCS 2020
SP - 679
EP - 693
BT - Proceedings of the 15th ACM Asia Conference on Computer and Communications Security, ASIA CCS 2020
PB - Association for Computing Machinery, Inc
T2 - 15th ACM Asia Conference on Computer and Communications Security, ASIA CCS 2020
Y2 - 5 October 2020 through 9 October 2020
ER -