Abstract
In this chapter, we describe an approach to extractive summarization based on the minimum description length (MDL) principle and relying on the Krimp dataset compression algorithm. We represent text as a transactional dataset, with sentences as transactions and normalized words as items; we then describe the dataset using frequent itemsets of different types that provide the best compressed representation. The summary is compiled from the sentences that best describe the document. The problem of extractive summarization is therefore reduced to the maximal coverage problem, following the assumption that a summary that best describes the original text should cover most of the itemsets describing the document. We test this approach on generic summarization tasks in English and Chinese, and on a query-based summarization (QS) task in English.
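The reduction to maximal coverage can be illustrated with a minimal Python sketch. It is not the chapter's implementation: the MDL-driven itemset selection performed by Krimp is replaced here by a plain support threshold, word normalization is reduced to lowercasing, and the function names (`to_transactions`, `frequent_itemsets`, `greedy_summary`) and parameters (`min_support`, `max_size`, `max_sentences`) are hypothetical. The greedy selection is a standard approximation for coverage problems and is assumed here for illustration only.

```python
import re
from itertools import combinations


def to_transactions(sentences):
    """Turn each sentence into a transaction: a set of lowercased word tokens.

    Lowercasing is a crude stand-in for real word normalization.
    """
    return [frozenset(re.findall(r"[a-z]+", s.lower())) for s in sentences]


def frequent_itemsets(transactions, min_support=2, max_size=2):
    """Brute-force all itemsets of up to max_size items that occur in at
    least min_support transactions. (Assumption: a support threshold is used
    instead of Krimp's MDL-based code table.)
    """
    counts = {}
    for t in transactions:
        for k in range(1, max_size + 1):
            for combo in combinations(sorted(t), k):
                key = frozenset(combo)
                counts[key] = counts.get(key, 0) + 1
    return {itemset for itemset, c in counts.items() if c >= min_support}


def greedy_summary(sentences, max_sentences=2, min_support=2):
    """Greedily pick sentences that cover the most not-yet-covered itemsets."""
    transactions = to_transactions(sentences)
    uncovered = set(frequent_itemsets(transactions, min_support))
    chosen = []
    while uncovered and len(chosen) < max_sentences:
        best, best_gain = None, 0
        for idx, t in enumerate(transactions):
            if idx in chosen:
                continue
            # An itemset is covered when it is a subset of the sentence.
            gain = sum(1 for itemset in uncovered if itemset <= t)
            if gain > best_gain:
                best, best_gain = idx, gain
        if best is None:  # no remaining sentence covers anything new
            break
        chosen.append(best)
        uncovered -= {i for i in uncovered if i <= transactions[best]}
    # Return the selected sentences in their original document order.
    return [sentences[i] for i in sorted(chosen)]
```

Calling `greedy_summary` on a list of sentence strings returns the selected sentences in document order; selection stops early once every frequent itemset is covered, so the summary may be shorter than `max_sentences`.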
| Original language | English |
|---|---|
| Title of host publication | Multilingual Text Analysis |
| Subtitle of host publication | Challenges, Models, and Approaches |
| Publisher | World Scientific Publishing Co. |
| Pages | 81-117 |
| Number of pages | 37 |
| ISBN (Electronic) | 9789813274884 |
| ISBN (Print) | 9789813274877 |
| State | Published - 1 Jan 2019 |
| Externally published | Yes |
ASJC Scopus subject areas
- General Computer Science