HIGHER CRITICISM FOR DISCRIMINATING WORD-FREQUENCY TABLES AND AUTHORSHIP ATTRIBUTION

Alon Kipnis

Research output: Contribution to journal › Article › peer-review

4 Scopus citations

Abstract

We adapt the higher criticism (HC) goodness-of-fit test to measure the closeness between word-frequency tables. We apply this measure to authorship attribution challenges, where the goal is to identify the author of a document using other documents whose authorship is known. The method is simple yet performs well without handcrafting and tuning, reporting accuracy at the state-of-the-art level in various current challenges. As an inherent side effect, the HC calculation identifies a subset of discriminating words. In practice, the identified words have low variance across documents belonging to a corpus of homogeneous authorship. We conclude that in comparing the similarity of a new document and a corpus of a single author, HC is mostly affected by words characteristic of the author and is relatively unaffected by topic structure.
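
The abstract does not spell out the computation, so the following is only a minimal sketch of how an HC-style discrepancy between two word-frequency tables might be computed, under stated assumptions. It assumes one exact two-sided binomial p-value per word (with the null success probability taken as the first table's share of all tokens, a simplification), followed by the standard HC statistic maximized over the smallest fraction `gamma` of the sorted p-values. The function names, the `gamma=0.4` default, and the choice of per-word test are illustrative assumptions and are not taken from the article.

```python
import math
from collections import Counter


def binom_two_sided_pvalue(x, n, p):
    """Exact two-sided binomial p-value: total probability of all
    outcomes that are no more likely than the observed count x."""
    if n == 0 or not (0.0 < p < 1.0):
        return 1.0  # nothing to test (assumed convention for this sketch)
    log_p, log_q = math.log(p), math.log(1.0 - p)
    # Binomial pmf computed in log space to avoid overflow for large n.
    pmf = [
        math.exp(
            math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * log_p + (n - k) * log_q
        )
        for k in range(n + 1)
    ]
    threshold = pmf[x] * (1.0 + 1e-9)  # small tolerance for float ties
    return min(1.0, sum(q for q in pmf if q <= threshold))


def hc_discrepancy(counts1, counts2, gamma=0.4):
    """Return (HC score, discriminating words) for two word-count tables.

    Assumption: each word is tested on its own against a binomial null
    whose success probability is table 1's share of all tokens.
    """
    n1, n2 = sum(counts1.values()), sum(counts2.values())
    p_null = n1 / (n1 + n2)
    vocab = sorted(set(counts1) | set(counts2))
    pvals = sorted(
        (binom_two_sided_pvalue(counts1.get(w, 0),
                                counts1.get(w, 0) + counts2.get(w, 0),
                                p_null), w)
        for w in vocab
    )
    n_words = len(pvals)
    best_score, best_i = -math.inf, 0
    # Scan the smallest gamma-fraction of the sorted p-values.
    for i in range(1, max(1, int(gamma * n_words)) + 1):
        u = i / n_words
        if u >= 1.0:
            break
        z = math.sqrt(n_words) * (u - pvals[i - 1][0]) / math.sqrt(u * (1.0 - u))
        if z > best_score:
            best_score, best_i = z, i
    # Words whose p-values fall at or below the maximizing index.
    return best_score, [w for _, w in pvals[:best_i]]
```

In this sketch a larger score indicates less similar tables, and the returned word list plays the role of the discriminating words mentioned in the abstract. For attribution, one would compare a new document's table against each candidate author's pooled table and pick the author with the smallest score; that decision rule is likewise an assumption here.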

Original language: English
Pages (from-to): 1236-1252
Number of pages: 17
Journal: Annals of Applied Statistics
Volume: 16
Issue number: 2
State: Published - 1 Jun 2022
Externally published: Yes

Keywords

  • authorship attribution
  • feature selection
  • Higher criticism
  • nonparametric methods
  • two-sample testing
  • unsupervised learning

ASJC Scopus subject areas

  • Statistics and Probability
  • Modeling and Simulation
  • Statistics, Probability and Uncertainty
