TY - JOUR
T1 - Revisiting DP-Means
T2 - 38th Conference on Uncertainty in Artificial Intelligence, UAI 2022
AU - Dinari, Or
AU - Freifeld, Oren
N1 - Funding Information:
This work was supported by the Lynn and William Frankel Center at BGU CS, by the Israeli Council for Higher Education via the BGU Data Science Research Center, and by Israel Science Foundation Personal Grant #360/21. O.D. was also funded by the Jabotinsky Scholarship from Israel's Ministry of Technology and Science, and by BGU's Hi-Tech Scholarship.
Publisher Copyright:
© 2022 UAI. All Rights Reserved.
PY - 2022/1/1
Y1 - 2022/1/1
AB - DP-means, a nonparametric generalization of K-means, extends the latter to the case where the number of clusters is unknown. Unlike K-means, however, DP-means is hard to parallelize, a limitation that hinders its use in large-scale tasks. This work bridges this practicality gap by rendering the DP-means approach a viable, fast, and highly scalable solution. First, we study the strengths and weaknesses of previous attempts to parallelize the DP-means algorithm. Next, we propose a new parallel algorithm, called PDC-DP-Means (Parallel Delayed Cluster DP-Means), based in part on the delayed creation of clusters. Compared with DP-means, PDC-DP-Means provides not only a major speedup but also performance gains. Finally, we propose two extensions of PDC-DP-Means. The first combines it with an existing method, leading to further speedups. The second extends PDC-DP-Means to a mini-batch setting (with optional support for an online mode), allowing for another major speedup. We verify the utility of the proposed methods on multiple datasets. We also show that the proposed methods outperform other nonparametric methods (e.g., DBSCAN). Our highly efficient code can be used to reproduce our experiments and is available at https://github.com/BGU-CS-VIL/pdc-dp-means.
UR - http://www.scopus.com/inward/record.url?scp=85163372895&partnerID=8YFLogxK
M3 - Conference article
AN - SCOPUS:85163372895
SN - 2640-3498
VL - 180
SP - 569
EP - 578
JO - Proceedings of Machine Learning Research
JF - Proceedings of Machine Learning Research
Y2 - 1 August 2022 through 5 August 2022
ER -