TY - GEN
T1 - Deep dive into authorship verification of email messages with convolutional neural network
AU - Litvak, Marina
N1 - Publisher Copyright: © 2019, Springer Nature Switzerland AG.
PY - 2019/1/1
Y1 - 2019/1/1
AB - Authorship verification is the task of determining whether a specific individual did or did not write a text, which can naturally be reduced to a binary classification problem. This paper deals with the authorship verification of short email messages. Hereafter, we use “message” to denote the content of the information that is transmitted by email. The proposed method implements the binary classification with a sequence-to-sequence (seq2seq) model and trains a convolutional neural network (CNN) on positive (written by the “target” user) and negative (written by “someone else”) examples. The proposed method differs from previously published works, which represent text by numerous stylometric features, in that it requires neither advanced text preprocessing nor explicit feature extraction. All messages are submitted to the CNN “as is,” after padding to the maximal length and replacing all words with their ID numbers. The CNN learns the most appropriate features with backpropagation and then performs classification. The experiments performed on the Enron dataset using the TensorFlow framework show that the CNN classifier verifies message authorship very accurately.
KW - Authorship verification
KW - Binary classification
KW - Convolutional neural network
UR - http://www.scopus.com/inward/record.url?scp=85063432409&partnerID=8YFLogxK
U2 - 10.1007/978-3-030-11680-4_14
DO - 10.1007/978-3-030-11680-4_14
M3 - Conference contribution
AN - SCOPUS:85063432409
SN - 9783030116798
T3 - Communications in Computer and Information Science
SP - 129
EP - 136
BT - Information Management and Big Data - 5th International Conference, SIMBig 2018, Proceedings
A2 - Lossio-Ventura, Juan Antonio
A2 - Muñante, Denisse
A2 - Alatrista-Salas, Hugo
PB - Springer Verlag
T2 - 5th International Conference on Information Management and Big Data, SIMBig 2018
Y2 - 3 September 2018 through 5 September 2018
ER -