Pkg2Vec: Hierarchical package embedding for code authorship attribution

Research output: Contribution to journal › Article › peer-review

6 Scopus citations

Abstract

Authorship attribution of software is the task of identifying the author of a given piece of code. Code attribution is important in multiple scenarios, ranging from software plagiarism to cybersecurity. In this paper, we introduce authorship attribution at the level of software packages, which better reflects real-world scenarios in which code is organized in packages and written by teams. We present a novel approach for software package authorship attribution called Pkg2Vec, based on a hierarchical deep neural network (DNN) architecture that corresponds to the hierarchical nature of software (code) packages. The hierarchical neural network model consists of a token-level encoder and an attention mechanism for a function-level encoder, which together produce a package embedding. Beyond the package embedding, we use keywords and API calls as resilient features, which reflect the programmer's intention and style. Pkg2Vec is evaluated on a large dataset of public packages and compared with several state-of-the-art source code authorship attribution algorithms.
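
The hierarchy described in the abstract (tokens to functions, functions to package) can be illustrated with a minimal sketch. This is not the Pkg2Vec implementation: the abstract does not specify the encoders, so the sketch assumes a hash-based pseudo-embedding in place of a learned token encoder, mean pooling as the function-level representation, and dot-product attention against a fixed context vector. All names (`token_vector`, `encode_function`, `embed_package`, `DIM`) are hypothetical.

```python
import math
import zlib
from typing import List, Tuple

DIM = 8  # embedding size, chosen arbitrarily for this illustration


def token_vector(token: str) -> List[float]:
    """Deterministic pseudo-embedding (assumption: stands in for a learned lookup)."""
    vec = []
    for i in range(DIM):
        h = zlib.crc32(f"{i}:{token}".encode("utf-8"))
        vec.append(h / 0xFFFFFFFF * 2.0 - 1.0)  # scale the hash into [-1, 1]
    return vec


def encode_function(tokens: List[str]) -> List[float]:
    """Token-level stand-in: average the token vectors of one function."""
    if not tokens:
        return [0.0] * DIM
    vecs = [token_vector(t) for t in tokens]
    return [sum(col) / len(vecs) for col in zip(*vecs)]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def embed_package(
    functions: List[List[str]], context: List[float]
) -> Tuple[List[float], List[float]]:
    """Attention pooling: weight each function vector by its score against `context`."""
    if not functions:
        raise ValueError("a package needs at least one function")
    func_vecs = [encode_function(f) for f in functions]
    scores = [sum(a * b for a, b in zip(v, context)) for v in func_vecs]
    weights = softmax(scores)
    embedding = [
        sum(w * v[i] for w, v in zip(weights, func_vecs)) for i in range(DIM)
    ]
    return embedding, weights


# Usage: a toy "package" of two tokenized functions.
package = [
    ["def", "read", "(", "path", ")"],
    ["return", "open", "(", "path", ")"],
]
context = [1.0] * DIM  # in a trained model this vector would be learned
embedding, weights = embed_package(package, context)
```

In a trained model the attention weights would indicate which functions contribute most to the package representation; here they only demonstrate the pooling mechanics, and the resulting vector would then be passed to a classifier over candidate authors.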

Original language: English
Pages (from-to): 49-60
Number of pages: 12
Journal: Future Generation Computer Systems
Volume: 116
State: Published - 1 Mar 2021

Keywords

  • Code embedding
  • Hierarchical neural networks
  • Source code authorship attribution

ASJC Scopus subject areas

  • Software
  • Hardware and Architecture
  • Computer Networks and Communications
