Rich Feature Spaces and Regression Models in Single-Document Extractive Summarization

Alexander Dlikman, Marina Litvak, Mark Last

Research output: Chapter in Book/Report/Conference proceeding › Chapter › peer-review



Multiple methods of extractive text summarization have been proposed in recent years. The most common approach ranks sentences by various sentence scoring metrics (features) and calculates the final sentence score as a linear combination of selected features. Most currently used features are statistical and require only basic natural language processing (NLP), such as tokenization. In this chapter, we present and evaluate a set of novel features, both statistical and linguistic, for sentence ranking and extraction in single documents. The statistical features are based on topic modeling (TM). The linguistic features rely on more advanced NLP and include features based on multi-word expressions, named entities, and parts of speech (POS). We show that the use of linguistic, POS-based, and topic-based features improves the automated summaries compared to state-of-the-art statistical metrics. In addition, we explore the contribution of various regression algorithms to the sentence ranking task. These algorithms include a genetic algorithm (GA), classification and regression trees, Cubist, and a linear regression model (LM). For this purpose, we introduce a sentence ranking methodology based on the similarity score between a candidate sentence and gold standard summaries. Our experiments are performed on four textual corpora accompanied by human-generated gold standard summaries: Document Understanding Conference (DUC) 2002, 2004, and 2007, and MultiLing-2013. The popular linear regression model achieved the best results on all evaluated datasets. Additionally, the linear regression model that included POS-based features outperformed the models with statistical features only.
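The general scheme described in the abstract (score each sentence as a linear combination of features, with weights learned by regressing on a similarity score against a gold standard summary) can be illustrated with a minimal sketch. This is not the chapter's implementation: the three toy features, the unigram Jaccard overlap used as the similarity target, the plain gradient-descent least-squares fit, and all function names are assumptions chosen only for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase word tokenizer (basic NLP only)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def sentence_features(sentences):
    """Three toy statistical features per sentence (hypothetical, not the chapter's set)."""
    doc_tf = Counter(tok for s in sentences for tok in tokenize(s))
    total = sum(doc_tf.values()) or 1
    max_len = max((len(tokenize(s)) for s in sentences), default=1) or 1
    n = len(sentences)
    rows = []
    for i, s in enumerate(sentences):
        toks = tokenize(s)
        position = 1.0 - i / n              # earlier sentences get higher values
        length = len(toks) / max_len        # length relative to the longest sentence
        avg_tf = sum(doc_tf[t] for t in toks) / (len(toks) * total) if toks else 0.0
        rows.append([position, length, avg_tf])
    return rows


def similarity_target(sentence, gold_summary):
    """Regression target: unigram Jaccard overlap (an assumed stand-in similarity)."""
    a, b = set(tokenize(sentence)), set(tokenize(gold_summary))
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def linear_score(features, weights, bias=0.0):
    """Final sentence score as a linear combination of feature values."""
    return bias + sum(w * f for w, f in zip(weights, features))


def fit_linear(X, y, lr=0.1, epochs=5000):
    """Least-squares fit by batch gradient descent on mean squared error."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        errs = [linear_score(x, w, b) - t for x, t in zip(X, y)]
        grad_w = [2.0 / n * sum(e * x[j] for e, x in zip(errs, X)) for j in range(d)]
        grad_b = 2.0 / n * sum(errs)
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def rank_sentences(sentences, weights, bias=0.0):
    """Return sentence indices ordered from highest to lowest score."""
    feats = sentence_features(sentences)
    scores = [linear_score(f, weights, bias) for f in feats]
    return sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)


def summarize(sentences, weights, k=2):
    """Extract the k top-ranked sentences, restored to document order."""
    top = sorted(rank_sentences(sentences, weights)[:k])
    return [sentences[i] for i in top]
```

In such a setup, training would compute `sentence_features` and `similarity_target` for each sentence of the documents that have gold standard summaries, pass them to `fit_linear`, and then apply the learned weights through `summarize` to unseen documents.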

Original language: English
Title of host publication: Multilingual Text Analysis
Subtitle of host publication: Challenges, Models, and Approaches
Publisher: World Scientific Publishing Co.
Number of pages: 36
ISBN (Electronic): 9789813274884
ISBN (Print): 9789813274877
State: Published - 1 Jan 2019

ASJC Scopus subject areas

  • Computer Science (all)


