Abstract
Automated categorization of textual information is becoming an increasingly important task in the digital world. However, most classification algorithms depend on manual labeling of text documents, which is a time-consuming and costly process. In this paper, we present a novel methodology for clustering-based classification of stationary document streams using active learning. The proposed active learning clustering-based classification algorithm (ACCA) receives a continuous stream of unlabeled documents. The arriving documents are clustered incrementally, so that each incoming document is either inserted into an existing cluster or used to start a new cluster of its own. The number of possible clusters is unlimited. From time to time, an expert is called to label several clusters for the classification mechanism. As more documents arrive, the expert can be called less frequently, since most incoming documents will eventually belong to existing labeled clusters. Our algorithm aims to find the fastest way of reaching the point where most arriving documents can be classified automatically without the expert's assistance. The evaluation experiments on two benchmark corpora show that active learning and clustering can increase the percentage of automatically and accurately categorized documents over time.
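The incremental clustering and cluster-labeling loop described above can be illustrated with a minimal sketch. It is not the ACCA implementation: the abstract does not specify the document representation, the similarity measure, the threshold, or how clusters are selected for the expert. The bag-of-words vectors, cosine similarity, the `threshold` value, and the names `IncrementalClusterer`, `add`, `label_cluster`, and `classify` are all assumptions made for illustration.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class IncrementalClusterer:
    """Hypothetical single-pass clusterer with expert-labeled clusters."""

    def __init__(self, threshold=0.3):
        self.threshold = threshold  # assumed similarity cutoff, not from the paper
        self.centroids = []         # summed term counts, one Counter per cluster
        self.labels = []            # expert label per cluster, None if unlabeled

    def add(self, text):
        """Insert a document into the closest cluster or start a new one."""
        vec = Counter(text.lower().split())
        best, best_sim = None, 0.0
        for i, centroid in enumerate(self.centroids):
            sim = cosine(vec, centroid)
            if sim > best_sim:
                best, best_sim = i, sim
        if best is None or best_sim < self.threshold:
            # No sufficiently similar cluster: the document starts its own.
            self.centroids.append(vec)
            self.labels.append(None)
            return len(self.centroids) - 1
        self.centroids[best].update(vec)  # fold the document into the centroid
        return best

    def label_cluster(self, idx, label):
        """Record a label supplied by the expert for one cluster."""
        self.labels[idx] = label

    def classify(self, text):
        """Cluster the document; return its cluster's label, or None if unlabeled."""
        return self.labels[self.add(text)]


clusterer = IncrementalClusterer(threshold=0.3)
first = clusterer.add("stock market shares fall")
clusterer.label_cluster(first, "finance")  # stands in for the expert's answer
print(clusterer.classify("market shares rise"))
print(clusterer.classify("football match tonight"))
```

In this sketch a document that lands in a labeled cluster is classified automatically, while a document that lands in an unlabeled cluster returns `None` and would wait for the expert. How often the expert is called and which clusters are presented for labeling are the active-learning decisions, and they are left out here.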
Original language | English |
---|---|
Title of host publication | Data Mining in Time Series and Streaming Databases |
Publisher | World Scientific Publishing Co. Pte Ltd |
Chapter | 5 |
Pages | 92-117 |
Number of pages | 26 |
ISBN (Electronic) | 9789813228047 |
ISBN (Print) | 9789813228030 |
State | Published - 11 Jan 2018 |
ASJC Scopus subject areas
- Computer Science (all)