TY - GEN
T1 - jTrans
T2 - 31st ACM SIGSOFT International Symposium on Software Testing and Analysis, ISSTA 2022
AU - Wang, Hao
AU - Qu, Wenjie
AU - Katz, Gilad
AU - Zhu, Wenyu
AU - Gao, Zeyu
AU - Qiu, Han
AU - Zhuge, Jianwei
AU - Zhang, Chao
N1 - Funding Information:
This work was supported in part by National Key R&D Program of China (2021YFB2701000), National Natural Science Foundation of China under Grant 61972224, Beijing National Research Center for Information Science and Technology under Grant BNR2022RC01006, and Ant Group through CCF-Ant Innovative Research Program No. RF20210021. We would like to thank Jingwei Yi and Bolun Zhang for their great comments and help on experiments.
Publisher Copyright:
© 2022 Owner/Author.
PY - 2022/7/18
Y1 - 2022/7/18
N2 - Binary code similarity detection (BCSD) has important applications in various fields such as vulnerability detection, software component analysis, and reverse engineering. Recent studies have shown that deep neural networks (DNNs) can comprehend instructions or control-flow graphs (CFGs) of binary code and support BCSD. In this study, we propose a novel Transformer-based approach, namely jTrans, to learn representations of binary code. It is the first solution that embeds control flow information of binary code into Transformer-based language models, by using a novel jump-aware representation of the analyzed binaries and a newly designed pre-training task. Additionally, we release to the community a newly created large dataset of binaries, BinaryCorp, which is the most diverse to date. Evaluation results show that jTrans outperforms state-of-the-art (SOTA) approaches on this more challenging dataset by 30.5 percentage points (i.e., from 32.0% to 62.5%). In a real-world task of known vulnerability searching, jTrans achieves a recall that is 2X higher than existing SOTA baselines.
KW - Binary Analysis
KW - Datasets
KW - Neural Networks
KW - Similarity Detection
UR - http://www.scopus.com/inward/record.url?scp=85136784760&partnerID=8YFLogxK
U2 - 10.1145/3533767.3534367
DO - 10.1145/3533767.3534367
M3 - Conference contribution
AN - SCOPUS:85136784760
T3 - ISSTA 2022 - Proceedings of the 31st ACM SIGSOFT International Symposium on Software Testing and Analysis
SP - 1
EP - 13
BT - ISSTA 2022 - Proceedings of the 31st ACM SIGSOFT International Symposium on Software Testing and Analysis
A2 - Ryu, Sukyoung
A2 - Smaragdakis, Yannis
PB - Association for Computing Machinery, Inc
Y2 - 18 July 2022 through 22 July 2022
ER -