TY - GEN
T1 - Boosting margin based distance functions for clustering
AU - Hertz, Tomer
AU - Bar-Hillel, Aharon
AU - Weinshall, Daphna
PY - 2004/12/1
Y1 - 2004/12/1
AB - The performance of graph-based clustering methods depends critically on the quality of the distance function used to compute similarities between pairs of neighboring nodes. In this paper we learn distance functions by training binary classifiers with margins. The classifiers are defined over the product space of pairs of points and are trained to distinguish whether two points come from the same class or not. The signed margin is used as the distance value. Our main contribution is a distance learning method (DistBoost), which combines boosting hypotheses over the product space with a weak learner based on partitioning the original feature space. Each weak hypothesis is a Gaussian mixture model computed using a semi-supervised constrained EM algorithm, which is trained using both unlabeled and labeled data. We also consider SVMs and boosted decision trees as margin-based classifiers in the product space. We experimentally compare the margin-based distance functions with other existing metric learning methods, and with existing techniques for the direct incorporation of constraints into various clustering algorithms. Clustering performance is measured on benchmark data sets from the UCI repository, a sample from the MNIST database, and a data set of color images of animals. In most cases the DistBoost algorithm significantly and robustly outperformed its competitors.
UR - http://www.scopus.com/inward/record.url?scp=14344265725&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:14344265725
SN - 1581138385
T3 - Proceedings, Twenty-First International Conference on Machine Learning, ICML 2004
SP - 393
EP - 400
BT - Proceedings, Twenty-First International Conference on Machine Learning, ICML 2004
A2 - Greiner, R.
A2 - Schuurmans, D.
T2 - Proceedings, Twenty-First International Conference on Machine Learning, ICML 2004
Y2 - 4 July 2004 through 8 July 2004
ER -