Authorship attribution of software is the task of identifying the author of a given piece of code. Code attribution is important in multiple scenarios, ranging from software plagiarism to cybersecurity. In this paper, we introduce authorship attribution of software packages, which better reflects real-world scenarios in which code is organized in packages and written by teams. We present a novel approach for software package authorship attribution called Pkg2Vec, based on a hierarchical deep neural network (DNN) architecture that corresponds to the hierarchical nature of software (code) packages. The hierarchical neural network model consists of a token-level encoder and a function-level encoder with an attention mechanism, which together produce a package embedding. Beyond the package embedding, we use keywords and API calls as resilient features, which reflect the programmer's intention and style. Pkg2Vec is evaluated on a large dataset of public packages and compared with several state-of-the-art source code authorship attribution algorithms.
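To make the hierarchy described above concrete, the following is a minimal sketch of the data flow only, not the Pkg2Vec implementation. It assumes a package is given as a list of functions, each a list of tokens. The names `token_vector`, `encode_function`, `encode_package`, the dimension `DIM`, and the context vector are all hypothetical. Learned token embeddings are replaced by fixed pseudo-random vectors, the token-level encoder by mean pooling, and the attention mechanism by a softmax over dot products with a context vector. The keyword and API-call features are not shown.

```python
import math
import random

DIM = 8  # hypothetical embedding size, chosen only for illustration


def token_vector(token, dim=DIM):
    """Stand-in for a learned token embedding: a fixed pseudo-random vector per token."""
    rng = random.Random(token)  # seeding with the token string makes this repeatable
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]


def encode_function(tokens, dim=DIM):
    """Token-level encoder (assumed here to be mean pooling over token vectors)."""
    if not tokens:
        return [0.0] * dim
    vectors = [token_vector(t, dim) for t in tokens]
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def encode_package(functions, context, dim=DIM):
    """Function-level attention: weight each function vector and sum them.

    Returns the package embedding and the attention weights.
    """
    function_vectors = [encode_function(f, dim) for f in functions]
    scores = [sum(a * b for a, b in zip(v, context)) for v in function_vectors]
    weights = softmax(scores)
    embedding = [
        sum(w * v[i] for w, v in zip(weights, function_vectors))
        for i in range(dim)
    ]
    return embedding, weights


# A toy package with two tokenized functions (made-up tokens).
package = [
    ["def", "open", "read", "return"],
    ["for", "in", "print"],
]
context = token_vector("<context>")  # in a trained model this would be learned
embedding, weights = encode_package(package, context)
print(len(embedding), weights)
```

In a trained model, the token vectors, the encoders, and the context vector would be learned jointly with the author classifier. The sketch only illustrates how token-level representations are aggregated into function vectors, and function vectors into a single package embedding.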
Keywords
- Code embedding
- Hierarchical neural networks
- Source code authorship attribution
ASJC Scopus subject areas
- Hardware and Architecture
- Computer Networks and Communications