Clustering-based classification of document streams with active learning

Mark Last, Maxim Stoliar, Menahem Friedman

Research output: Chapter in Book/Report/Conference proceeding · Chapter · peer-review

1 Scopus citation

Abstract

Automated categorization of textual information is becoming an increasingly important task in the digital world. However, most classification algorithms rely on manual labeling of text documents, which is a time-consuming and costly process. In this paper, we present a novel methodology for clustering-based classification of stationary document streams using active learning. The proposed active learning clustering-based classification algorithm (ACCA) receives a continuous stream of unlabeled documents. The arriving documents are clustered incrementally, so that each incoming document is either inserted into an existing cluster or used to start a new cluster of its own. The number of possible clusters is unlimited. From time to time, an expert is called to label several clusters for the classification mechanism. As more documents arrive, the expert can be called less frequently, since most of the incoming documents will eventually belong to existing labeled clusters. Our algorithm aims to find the fastest way of reaching the point where most arriving documents can be classified automatically without the expert's assistance. The evaluation experiments on two benchmark corpora show that active learning and clustering can increase the percentage of automatically and accurately categorized documents over time.
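The loop described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' ACCA implementation: the bag-of-words vectors, the cosine-similarity `threshold`, the fixed `label_every` interval, the largest-cluster-first query order, and the names `StreamClusterer` and `toy_expert` are all assumptions made for illustration only.

```python
import math
from collections import Counter


def vectorize(text):
    """Bag-of-words term counts (an assumed, deliberately simple representation)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class StreamClusterer:
    """Hypothetical sketch: incremental clustering with periodic expert labeling."""

    def __init__(self, expert, threshold=0.3, label_every=3, budget=1):
        self.expert = expert            # callable: list of documents -> label
        self.threshold = threshold      # assumed similarity cut-off for joining a cluster
        self.label_every = label_every  # assumed fixed query interval (in documents)
        self.budget = budget            # clusters labeled per expert call
        self.clusters = []              # the number of clusters is not bounded
        self.seen = 0

    def process(self, text):
        """Insert one document; return its cluster's label, or None if unlabeled."""
        vec = vectorize(text)
        best, best_sim = None, 0.0
        for cluster in self.clusters:
            sim = cosine(vec, cluster["centroid"])
            if sim > best_sim:
                best, best_sim = cluster, sim
        if best is None or best_sim < self.threshold:
            # No cluster is similar enough: the document starts its own cluster.
            best = {"centroid": Counter(), "docs": [], "label": None}
            self.clusters.append(best)
        best["centroid"].update(vec)    # centroid kept as a running sum of counts
        best["docs"].append(text)
        self.seen += 1
        if self.seen % self.label_every == 0:
            self.query_expert()
        return best["label"]

    def query_expert(self):
        """Ask the expert to label the largest unlabeled clusters (assumed strategy)."""
        unlabeled = [c for c in self.clusters if c["label"] is None]
        unlabeled.sort(key=lambda c: len(c["docs"]), reverse=True)
        for cluster in unlabeled[: self.budget]:
            cluster["label"] = self.expert(cluster["docs"])


def toy_expert(docs):
    """Stand-in for the human expert; a real system would ask a person."""
    return "sports" if any("football" in d for d in docs) else "finance"
```

Calling `StreamClusterer(toy_expert).process(doc)` for each arriving document returns a label once the document's cluster has been labeled, and `None` otherwise. The abstract states that the expert can be called less frequently as the stream progresses; the fixed interval here is only a simplification of that idea.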

Original language: English
Title of host publication: Data Mining in Time Series and Streaming Databases
Publisher: World Scientific Publishing Co. Pte Ltd
Chapter: 5
Pages: 92-117
Number of pages: 26
ISBN (Electronic): 9789813228047
ISBN (Print): 9789813228030
DOIs
State: Published - 11 Jan 2018

ASJC Scopus subject areas

  • General Computer Science
