Revisiting DP-Means: Fast Scalable Algorithms via Parallelism and Delayed Cluster Creation

Or Dinari, Oren Freifeld

Research output: Contribution to journal, Conference article, peer-review

Abstract

DP-means, a nonparametric generalization of K-means, extends the latter to the case where the number of clusters is unknown. Unlike K-means, however, DP-means is hard to parallelize, a limitation hindering its usage in large-scale tasks. This work bridges this practicality gap by rendering the DP-means approach a viable, fast, and highly scalable solution. First, we study the strengths and weaknesses of previous attempts to parallelize the DP-means algorithm. Next, we propose a new parallel algorithm, called PDC-DP-Means (Parallel Delayed Cluster DP-Means), based in part on delayed creation of clusters. Compared with DP-Means, PDC-DP-Means provides not only a major speedup but also performance gains. Finally, we propose two extensions of PDC-DP-Means. The first combines it with an existing method, leading to further speedups. The second extends PDC-DP-Means to a Mini-Batch setting (with optional support for an online mode), allowing for another major speedup. We verify the utility of the proposed methods on multiple datasets. We also show that the proposed methods outperform other nonparametric methods (e.g., DBSCAN). Our highly efficient code can be used to reproduce our experiments and is available at https://github.com/BGU-CS-VIL/pdc-dp-means.
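
For readers unfamiliar with the baseline, the following is a minimal sketch of the standard serial DP-means procedure that the abstract takes as its starting point. It is not the authors' PDC-DP-Means code (that is in the repository linked above). The function name dp_means, the parameter name lam (the penalty on squared Euclidean distance), and the choice to start from a single cluster at the global mean are assumptions made for illustration. The inner loop may create a cluster that later points in the same pass are compared against, and this sequential dependence is one reason the baseline is hard to parallelize.

```python
import numpy as np

def dp_means(X, lam, max_iter=100):
    """Illustrative serial DP-means sketch (hypothetical helper, not the paper's code).

    A point opens a new cluster when its squared Euclidean distance to the
    nearest existing center exceeds lam.
    """
    X = np.asarray(X, dtype=float)
    centers = [X.mean(axis=0)]            # assumption: start with one global cluster
    labels = np.zeros(len(X), dtype=int)
    for _ in range(max_iter):
        old_labels = labels.copy()
        old_k = len(centers)
        # Assignment step: sequential, because a cluster created for point i
        # is immediately visible to points i+1, i+2, ...
        for i, x in enumerate(X):
            d2 = np.sum((np.asarray(centers) - x) ** 2, axis=1)
            j = int(np.argmin(d2))
            if d2[j] > lam:
                centers.append(x.copy())  # open a new cluster at this point
                j = len(centers) - 1
            labels[i] = j
        # Update step: recompute means and drop clusters that ended up empty.
        used = np.unique(labels)
        centers = [X[labels == k].mean(axis=0) for k in used]
        labels = np.searchsorted(used, labels)  # relabel to 0..K-1
        if len(centers) == old_k and np.array_equal(labels, old_labels):
            break
    return np.asarray(centers), labels
```

A larger lam yields fewer clusters; with a sufficiently large value the sketch reduces to a single cluster at the data mean.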

Original language: English
Pages (from-to): 569-578
Number of pages: 10
Journal: Proceedings of Machine Learning Research
Volume: 180
State: Published - 1 Jan 2022
Event: 38th Conference on Uncertainty in Artificial Intelligence, UAI 2022 - Eindhoven, Netherlands
Duration: 1 Aug 2022 to 5 Aug 2022

ASJC Scopus subject areas

  • Artificial Intelligence
  • Software
  • Control and Systems Engineering
  • Statistics and Probability
