Multiple methods of extractive text summarization have been proposed in recent years. The most common approach ranks sentences by various sentence scoring metrics (features) and calculates the final sentence score as a linear combination of selected features. Most currently used features are statistical and require only basic natural language processing (NLP), such as tokenization. In this chapter, we present and evaluate a set of novel features, both statistical and linguistic, for sentence ranking and extraction in single documents. The statistical features are based on topic modeling (TM). The linguistic features rely on more advanced NLP and are based on multi-word expressions, named entities, and parts of speech (POS). We show that the use of linguistic, POS-based, and topic-based features improves the automated summaries compared to state-of-the-art statistical metrics. In addition, we explore the contribution of various regression algorithms to the sentence ranking task. These algorithms include a genetic algorithm (GA), classification and regression trees, Cubist, and a linear regression model (LM). For this purpose, we introduce a sentence ranking methodology based on the similarity score between a candidate sentence and gold standard summaries. Our experiments are performed on four textual corpora accompanied by human-generated gold standard summaries: Document Understanding Conference (DUC) 2002, 2004, and 2007, and MultiLing-2013. The popular linear regression model achieved the best results on all evaluated datasets. Additionally, the linear regression model that included POS-based features outperformed the models with statistical features only.
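The scoring scheme described above (a final sentence score computed as a linear combination of features) can be illustrated with a minimal sketch. The three features used here (position, average term frequency, normalized length), the function names, and the weights are illustrative assumptions and are not the features or coefficients evaluated in the chapter; in the chapter's setting, the weights would be learned by a regression algorithm rather than set by hand.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase word tokenization; a stand-in for basic NLP preprocessing."""
    return re.findall(r"[a-z0-9]+", text.lower())


def sentence_features(sentences):
    """Return one [position, avg_tf, length] vector per sentence.

    All three features are hypothetical examples of simple statistical
    features; they are not taken from the chapter.
    """
    doc_counts = Counter(tok for s in sentences for tok in tokenize(s))
    total = sum(doc_counts.values()) or 1
    n = len(sentences)
    features = []
    for i, sentence in enumerate(sentences):
        toks = tokenize(sentence)
        position = 1.0 - i / n  # earlier sentences get a higher value
        if toks:
            # mean relative document frequency of the sentence's tokens
            avg_tf = sum(doc_counts[t] for t in toks) / (len(toks) * total)
        else:
            avg_tf = 0.0
        length = min(len(toks) / 20.0, 1.0)  # token count, capped at 1.0
        features.append([position, avg_tf, length])
    return features


def rank_sentences(sentences, weights):
    """Score each sentence as a weighted sum of its features and rank them."""
    scores = [
        sum(w * f for w, f in zip(weights, feats))
        for feats in sentence_features(sentences)
    ]
    order = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    return order, scores


def summarize(sentences, weights, k=2):
    """Pick the k top-ranked sentences and return them in document order."""
    order, _ = rank_sentences(sentences, weights)
    return [sentences[i] for i in sorted(order[:k])]
```

For example, `summarize(doc_sentences, weights=(0.5, 0.3, 0.2), k=2)` would return the two highest-scoring sentences of a hypothetical `doc_sentences` list in their original order.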
|Title of host publication|Multilingual Text Analysis|
|Subtitle of host publication|Challenges, Models, and Approaches|
|Publisher|World Scientific Publishing Co.|
|Number of pages|36|
|State|Published - 1 Jan 2019|
ASJC Scopus subject areas
- Computer Science (all)