Abstract
We adapt the higher criticism (HC) goodness-of-fit test to measure the closeness between word-frequency tables. We apply this measure to authorship attribution challenges, where the goal is to identify the author of a document using other documents whose authorship is known. The method is simple yet performs well without handcrafting and tuning, reporting accuracy at the state-of-the-art level in various current challenges. As an inherent side effect, the HC calculation identifies a subset of discriminating words. In practice, the identified words have low variance across documents belonging to a corpus of homogeneous authorship. We conclude that in comparing the similarity of a new document and a corpus of a single author, HC is mostly affected by words characteristic of the author and is relatively unaffected by topic structure.
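To make the idea concrete, here is a minimal sketch of one way an HC-style comparison of two word-frequency tables could be computed. It is an illustration under stated assumptions, not the paper's implementation: the per-word exact binomial test, the textbook form of the HC statistic, the default `gamma0=0.35`, and the function names `binom_two_sided_pvalue` and `hc_discrepancy` are all choices made here for illustration. The exact p-value routine is only practical for small counts.

```python
import math
from collections import Counter


def binom_two_sided_pvalue(k, n, p):
    """Exact two-sided binomial p-value: total mass of outcomes no more likely than k."""
    if n == 0:
        return 1.0
    pmf = [math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(n + 1)]
    cutoff = pmf[k] * (1 + 1e-9)  # small tolerance for floating-point ties
    return min(1.0, sum(q for q in pmf if q <= cutoff))


def hc_discrepancy(counts1, counts2, gamma0=0.35):
    """Return (HC score, words whose p-values fall at or below the HC threshold).

    Assumes positive counts, at least two distinct words, and 0 < gamma0 < 1.
    """
    n1, n2 = sum(counts1.values()), sum(counts2.values())
    vocab = set(counts1) | set(counts2)
    if len(vocab) < 2:
        raise ValueError("need at least two distinct words")

    pvals = []
    for w in vocab:
        c1, c2 = counts1.get(w, 0), counts2.get(w, 0)
        # Assumed null: the word's occurrences split between the tables in
        # proportion to the table sizes with this word left out.
        p_null = (n1 - c1) / (n1 + n2 - c1 - c2)
        pvals.append((binom_two_sided_pvalue(c1, c1 + c2, p_null), w))
    pvals.sort()

    n_words = len(pvals)
    limit = max(1, int(gamma0 * n_words))  # scan only the smallest p-values
    best_score, best_i = -math.inf, 0
    for i in range(1, limit + 1):
        frac = i / n_words
        p_i = pvals[i - 1][0]
        score = math.sqrt(n_words) * (frac - p_i) / math.sqrt(frac * (1 - frac))
        if score > best_score:
            best_score, best_i = score, i

    # Words up to the maximizing index are the "discriminating" subset.
    return best_score, [w for _, w in pvals[:best_i]]


# Toy tables that agree on common words but differ in one stylistic choice.
doc_a = Counter({"the": 50, "of": 30, "and": 20, "to": 25, "in": 15, "whilst": 10})
doc_b = Counter({"the": 50, "of": 30, "and": 20, "to": 25, "in": 15, "while": 10})
score, words = hc_discrepancy(doc_a, doc_b)
```

In this toy case the selected words are the two stylistic variants, while the words with identical counts in both tables do not contribute. A larger score indicates tables that are further apart, so an attribution rule along these lines would assign a document to the author whose corpus yields the smallest score.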
Original language | English |
---|---|
Pages (from-to) | 1236-1252 |
Number of pages | 17 |
Journal | Annals of Applied Statistics |
Volume | 16 |
Issue number | 2 |
State | Published - 1 Jun 2022 |
Externally published | Yes |
Keywords
- authorship attribution
- feature selection
- higher criticism
- nonparametric methods
- two-sample testing
- unsupervised learning
ASJC Scopus subject areas
- Statistics and Probability
- Modeling and Simulation
- Statistics, Probability and Uncertainty