Automatic Gender Identification from Text

Vladimir Younkin, Marina Litvak, Irina Rabaev

Research output: Contribution to journal › Article › peer-review

1 Scopus citation

Abstract

The gender identification of authors in literary texts is a compelling research area at the intersection of computational linguistics and natural language processing, offering insights into historical biases and socio-cultural dynamics while enriching our understanding of literary traditions. This study is inspired by the historical context of women adopting male pseudonyms to navigate a male-dominated literary domain. By leveraging machine learning and state-of-the-art language models, we investigate the feasibility and accuracy of inferring an author’s gender from their writings. Our key contributions include (1) the creation of a large-scale, diverse dataset of literary texts spanning various literary epochs and (2) the evaluation of multiple classification models. Our experiments reveal that the best-performing model achieves an accuracy above (Formula presented.), highlighting the potential of computational methods to uncover stylistic and linguistic markers tied to gender. These findings open avenues for further research into stylistic and linguistic patterns across literary history and their relationship to authorial identity.

Original language: English
Article number: 12041
Journal: Applied Sciences (Switzerland)
Volume: 14
Issue number: 24
State: Published - 1 Dec 2024
Externally published: Yes

Keywords

  • BERT
  • GPT2
  • Logistic Regression
  • RoBERTa
  • T5
  • dataset
  • gender identification

ASJC Scopus subject areas

  • General Materials Science
  • Instrumentation
  • General Engineering
  • Process Chemistry and Technology
  • Computer Science Applications
  • Fluid Flow and Transfer Processes
