Abstract
High-quality labeled data is essential for successfully applying machine learning methods to real-world problems. In many cases, however, the amount of labeled data is insufficient, and labeling additional data is expensive or time consuming. Co-training algorithms, which use unlabeled data to improve classification, have proven effective in such cases. Generally, co-training algorithms work by training two classifiers on two different views of the data and using them to label large amounts of unlabeled data, thereby minimizing the human effort required to label new data. In this paper we propose simple and effective strategies for improving the basic co-training framework. The proposed strategies improve two aspects of the co-training algorithm: the manner in which the feature set is partitioned and the method of selecting additional instances. An experimental study over 25 datasets shows that the proposed strategies are especially effective for imbalanced datasets. In addition, to better understand the inner workings of the co-training process, we provide an in-depth analysis of the effects of classifier error rates and of performance imbalance between the two "views" of the data. We believe this analysis offers insights that could be used for future research.
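The basic co-training loop summarized in the abstract can be sketched roughly as follows. This is an illustrative sketch only: the random feature split, the Gaussian Naive Bayes base learners, and the confidence-based selection of a fixed number of instances per iteration are assumptions made for the example, not the configuration studied in the paper, whose contribution is precisely better strategies for the feature-partitioning and instance-selection steps.

```python
# Minimal co-training sketch (illustrative; the feature split, base learners,
# and selection rule are assumptions, not the paper's exact setup).
import numpy as np
from sklearn.naive_bayes import GaussianNB

def co_train(X_lab, y_lab, X_unlab, n_iter=10, n_per_iter=5, seed=0):
    rng = np.random.default_rng(seed)

    # Split the feature set into two "views" at random; the paper proposes
    # improved partitioning strategies, random splitting is only a baseline.
    feats = rng.permutation(X_lab.shape[1])
    views = [feats[: len(feats) // 2], feats[len(feats) // 2 :]]
    clfs = [GaussianNB(), GaussianNB()]

    X_l, y_l = X_lab.copy(), y_lab.copy()
    U = X_unlab.copy()

    for _ in range(n_iter):
        if len(U) == 0:
            break
        # Retrain each classifier on its own view of the current labeled set.
        for clf, view in zip(clfs, views):
            clf.fit(X_l[:, view], y_l)
        # Each classifier labels the unlabeled pool on its view; its most
        # confident predictions are added to the labeled set for both views.
        # (The paper's selection strategies refine this step, which matters
        # particularly for imbalanced datasets.)
        for clf, view in zip(clfs, views):
            if len(U) == 0:
                break
            proba = clf.predict_proba(U[:, view])
            n_pick = min(n_per_iter, len(U))
            idx = np.argsort(proba.max(axis=1))[-n_pick:]
            labels = clf.classes_[proba[idx].argmax(axis=1)]
            X_l = np.vstack([X_l, U[idx]])
            y_l = np.concatenate([y_l, labels])
            U = np.delete(U, idx, axis=0)

    return clfs, views
```

At prediction time the two view-specific classifiers are typically combined, for example by averaging their class probabilities over their respective feature subsets.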
| Original language | English |
| --- | --- |
| Pages (from-to) | 81-100 |
| Number of pages | 20 |
| Journal | Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) |
| Volume | 8401 |
| State | Published - 1 Jan 2014 |
Keywords
- Co-training
- Imbalanced datasets
- Semi-supervised learning
ASJC Scopus subject areas
- Theoretical Computer Science
- Computer Science (all)