CPU- and GPU-based Distributed Sampling in Dirichlet Process Mixtures for Large-scale Analysis.

Or Dinari, Raz Zamir, John W. Fisher III, Oren Freifeld

Research output: Working paper/Preprint


In the realm of unsupervised learning, Bayesian nonparametric mixture models, exemplified by the Dirichlet Process Mixture Model (DPMM), provide a principled approach for adapting the complexity of the model to the data. Such models are particularly useful in clustering tasks where the number of clusters is unknown. Despite their potential and mathematical elegance, however, DPMMs have yet to become a mainstream tool widely adopted by practitioners. This is arguably due to a misconception that these models scale poorly, as well as to the lack of high-performance (and user-friendly) software tools that can handle large datasets efficiently. In this paper we bridge this practical gap by proposing a new, easy-to-use statistical software package for scalable DPMM inference. More concretely, we provide efficient and easily modifiable implementations of high-performance distributed sampling-based inference in DPMMs, where the user is free to choose between a multiple-machine, multiple-core CPU implementation (written in Julia) and a multiple-stream GPU implementation (written in CUDA/C++). Both the CPU and GPU implementations come with a common (and optional) Python wrapper, providing the user with a single point of entry and the same interface. On the algorithmic side, our implementations leverage a
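To make the underlying model concrete: the kind of inference the package scales up can be illustrated with a minimal, self-contained collapsed Gibbs sampler (in the style of Neal's Algorithm 3) for a 1D Gaussian DPMM with known component variance. This is an illustrative single-threaded NumPy sketch, not the paper's distributed Julia/CUDA implementation or its API; the function name and hyperparameter choices below are ours.

```python
import numpy as np

def dpmm_gibbs(x, alpha=1.0, sigma=0.5, mu0=0.0, tau0=3.0, iters=50, seed=0):
    """Collapsed Gibbs sampling for a 1D DPMM with Gaussian components of
    known variance sigma^2 and a Normal(mu0, tau0^2) prior on cluster means.
    Illustrative sketch only; the number of clusters is inferred, not fixed."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    n = len(x)
    z = np.zeros(n, dtype=int)  # cluster assignment of each point
    for _ in range(iters):
        for i in range(n):
            mask = np.arange(n) != i
            z_others, x_others = z[mask], x[mask]
            labels, counts = np.unique(z_others, return_counts=True)
            logp = []
            # Existing clusters: CRP weight n_k times the posterior
            # predictive density of x[i] given the cluster's other members.
            for k, c in zip(labels, counts):
                xs = x_others[z_others == k]
                prec = 1.0 / tau0**2 + len(xs) / sigma**2
                mu_post = (mu0 / tau0**2 + xs.sum() / sigma**2) / prec
                var = 1.0 / prec + sigma**2
                logp.append(np.log(c) - 0.5 * (x[i] - mu_post) ** 2 / var
                            - 0.5 * np.log(2 * np.pi * var))
            # New cluster: weight alpha times the prior predictive density.
            var_new = tau0**2 + sigma**2
            logp.append(np.log(alpha) - 0.5 * (x[i] - mu0) ** 2 / var_new
                        - 0.5 * np.log(2 * np.pi * var_new))
            logp = np.array(logp)
            p = np.exp(logp - logp.max())
            choice = rng.choice(len(p), p=p / p.sum())
            z[i] = (labels[choice] if choice < len(labels)
                    else (labels.max() + 1 if len(labels) else 0))
        _, z = np.unique(z, return_inverse=True)  # relabel to 0..K-1
    return z
```

Run on two well-separated groups, e.g. `dpmm_gibbs(np.concatenate([rng.normal(-5, 0.3, 20), rng.normal(5, 0.3, 20)]))`, the sampler recovers roughly two clusters without being told K. The per-point conditional resampling here is exactly the step that is hard to parallelize naively, which motivates the distributed (sub-cluster splits/merges) samplers that packages like this one build on.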
Original language: English
State: Published - 19 Apr 2022

