TY - GEN
T1 - Code type revealing using experiments framework
AU - Sharon, Rami
AU - Gudes, Ehud
PY - 2012/8/1
Y1 - 2012/8/1
N2 - Identifying the type of a code, whether in a file or a byte stream, is a challenge that many software companies face. Many applications, security-related and otherwise, base their behavior on the type of code they receive as input. Today's traditional identification methods rely on file extensions, magic numbers, proprietary headers and trailers, or specific type-identifying rules. All of these are vulnerable to content tampering, and discovering such tampering requires long and tedious hours of professional work. This study aims to find a method of identifying the best settings to automatically create type signatures that will effectively overcome the content manipulation problem. In this paper we lay out a framework for creating type signatures based on byte N-Grams. The framework allows setting various parameters such as N-Gram sizes and windows, selecting statistical tests, and defining rules for score calculations. The framework serves as a test lab for finding the parameters that satisfy a predefined threshold of type identification accuracy. We demonstrate the framework using basic settings that achieved an F-Measure of 0.996 on 1400 test files.
AB - Identifying the type of a code, whether in a file or a byte stream, is a challenge that many software companies face. Many applications, security-related and otherwise, base their behavior on the type of code they receive as input. Today's traditional identification methods rely on file extensions, magic numbers, proprietary headers and trailers, or specific type-identifying rules. All of these are vulnerable to content tampering, and discovering such tampering requires long and tedious hours of professional work. This study aims to find a method of identifying the best settings to automatically create type signatures that will effectively overcome the content manipulation problem. In this paper we lay out a framework for creating type signatures based on byte N-Grams. The framework allows setting various parameters such as N-Gram sizes and windows, selecting statistical tests, and defining rules for score calculations. The framework serves as a test lab for finding the parameters that satisfy a predefined threshold of type identification accuracy. We demonstrate the framework using basic settings that achieved an F-Measure of 0.996 on 1400 test files.
KW - Byte N-Gram statistical analysis
KW - Code type
KW - Content type revealing framework
KW - File Type
UR - http://www.scopus.com/inward/record.url?scp=84864366509&partnerID=8YFLogxK
U2 - 10.1007/978-3-642-31540-4_15
DO - 10.1007/978-3-642-31540-4_15
M3 - Conference contribution
AN - SCOPUS:84864366509
SN - 9783642315398
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 193
EP - 206
BT - Data and Applications Security and Privacy XXVI - 26th Annual IFIP WG 11.3 Conference, DBSec 2012, Proceedings
T2 - 26th Annual WG 11.3 Conference on Data and Applications Security and Privacy, DBSec 2012
Y2 - 11 July 2012 through 13 July 2012
ER -