Abstract
The gender identification of authors in literary texts is a compelling research area at the intersection of computational linguistics and natural language processing, offering insights into historical biases and socio-cultural dynamics while enriching our understanding of literary traditions. This study is inspired by the historical context of women adopting male pseudonyms to navigate a male-dominated literary domain. By leveraging machine learning and state-of-the-art language models, we investigate the feasibility and accuracy of inferring an author’s gender from their writings. Our key contributions include (1) the creation of a large-scale, diverse dataset of literary texts spanning various literary epochs and (2) the evaluation of multiple classification models. Our experiments reveal that the best-performing model achieves an accuracy above (Formula presented.), highlighting the potential of computational methods to uncover stylistic and linguistic markers tied to gender. These findings open avenues for further research into stylistic and linguistic patterns across literary history and their relationship to authorial identity.
| Original language | English |
|---|---|
| Article number | 12041 |
| Journal | Applied Sciences (Switzerland) |
| Volume | 14 |
| Issue number | 24 |
| State | Published - 1 Dec 2024 |
| Externally published | Yes |
Keywords
- BERT
- GPT2
- Logistic Regression
- RoBERTa
- T5
- dataset
- gender identification
ASJC Scopus subject areas
- General Materials Science
- Instrumentation
- General Engineering
- Process Chemistry and Technology
- Computer Science Applications
- Fluid Flow and Transfer Processes