Automatic summarization typically aims to select as much information as possible from text documents within a predefined word limit. Extracting complete sentences into a summary is not an optimal way to solve this problem because some sentences contain redundant information. Removing that redundant information and compiling a summary from compressed sentences should yield a much more accurate result. Major challenges of compressive approaches include the cost of creating large summarization corpora for training supervised methods, the linguistic quality of compressed sentences, the coverage of relevant content, and the time complexity of the compression procedure. In this work, we address these challenges by proposing an unsupervised polynomial-time compressive summarization algorithm that iteratively removes redundant parts from original sentences. It uses constituency-based parse trees and hand-crafted rules to generate elementary discourse units (EDUs) from their subtrees (standing for phrases) and selects those with sufficient tree gain. We define the gain of a parse tree as a weighted function of its node weights, which can be computed by any extractive summarization model capable of assigning importance weights to terms. The results of automatic evaluations on a single-document summarization task confirm that the proposed sentence compression procedure helps to avoid redundant information in the generated summaries. Furthermore, human evaluations confirm that linguistic quality, in terms of readability and coherence, is preserved in the compressed summaries while their coverage improves. However, the same evaluations show that compression in general harms the grammatical correctness of compressed sentences, although, in most cases, this effect is not significant for the proposed compression procedure.
- Budgeted sentence compression
- Compressive summarization
- Polytope model
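The tree-gain pruning idea described in the abstract can be sketched roughly as follows. This is a minimal illustrative toy, not the paper's implementation: the `Node` structure, the toy term-weight dictionary (standing in for any extractive model that assigns importance weights to terms), and the threshold value are all hypothetical assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    label: str                          # constituent label or terminal word
    children: list = field(default_factory=list)

def subtree_gain(node, term_weights):
    """Gain of a subtree: sum of term weights over its terminals
    (a simple stand-in for a weighted function of node weights)."""
    if not node.children:               # terminal (word)
        return term_weights.get(node.label.lower(), 0.0)
    return sum(subtree_gain(c, term_weights) for c in node.children)

def compress(node, term_weights, threshold):
    """Prune child subtrees (candidate EDUs) whose gain is below threshold."""
    kept = [compress(c, term_weights, threshold)
            for c in node.children
            if subtree_gain(c, term_weights) >= threshold]
    return Node(node.label, kept)

def yield_words(node):
    """Read the surface string back off the (compressed) tree."""
    if not node.children:
        return [node.label]
    return [w for c in node.children for w in yield_words(c)]

# Toy parse of "summarization works well , reportedly"
tree = Node("S", [
    Node("NP",   [Node("summarization")]),
    Node("VP",   [Node("works"), Node("well")]),
    Node("ADVP", [Node(","), Node("reportedly")]),
])
weights = {"summarization": 0.9, "works": 0.5, "well": 0.4}

compressed = compress(tree, weights, threshold=0.3)
print(" ".join(yield_words(compressed)))  # → summarization works well
```

Here the ADVP subtree carries zero weight, so it falls below the threshold and is dropped, while the higher-gain NP and VP subtrees survive into the compressed sentence.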