Streaming k-means on well-clusterable data

  • Vladimir Braverman
  • Adam Meyerson
  • Rafail Ostrovsky
  • Alan Roytman
  • Michael Shindler
  • Brian Tagiku

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

57 Scopus citations

Abstract

One of the central problems in data analysis is k-means clustering. In recent years, the streaming variant of this problem has received considerable attention in the literature, culminating in a series of results (Har-Peled and Mazumdar; Frahling and Sohler; Frahling, Monemizadeh, and Sohler; Chen) that produced a (1 + ε)-approximation for k-means clustering in the streaming setting. Unfortunately, since optimizing the k-means objective is Max-SNP hard, all algorithms that achieve a (1 + ε)-approximation must take time exponential in k unless P = NP. Thus, to avoid exponential dependence on k, some additional assumptions must be made to guarantee high-quality approximation and polynomial running time. A recent paper of Ostrovsky, Rabani, Schulman, and Swamy (FOCS 2006) introduced the very natural assumption of data separability. This assumption closely reflects how k-means is used in practice, and it allowed the authors to create a high-quality approximation for k-means clustering in the non-streaming setting with polynomial running time even for large values of k. Their work left open a natural and important question: are similar results possible in a streaming setting? This is the question we answer in this paper, albeit using substantially different techniques. We show a near-optimal streaming approximation algorithm for k-means in high-dimensional Euclidean space with sublinear memory and a single pass, under the same data separability assumption. Our algorithm offers significant improvements in both space and running time over previous work while yielding asymptotically best-possible performance (assuming that the running time must be fully polynomial and P ≠ NP).
The novel techniques we develop along the way imply a number of additional results: we provide a high-probability performance guarantee for online facility location (in contrast, Meyerson's FOCS 2001 algorithm gave bounds only in expectation); we develop a constant approximation method for the general class of semi-metric clustering problems; we improve (even without σ-separability) the space requirements for streaming constant-approximation for k-median by a logarithmic factor; finally, we design a "re-sampling method" in a streaming setting to convert any constant approximation for clustering to a [1 + O(σ²)]-approximation for σ-separable data.
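
For readers unfamiliar with online facility location, the building block named in the abstract, the following is a minimal sketch of the standard Meyerson-style randomized rule, not the algorithm of this paper. It assumes points are tuples of floats and uses squared Euclidean distance as the service cost (a choice made here to match the k-means objective); the function name and the `facility_cost` and `rng` parameters are hypothetical.

```python
import random


def online_facility_location(points, facility_cost, rng=None):
    """Illustrative Meyerson-style online facility location.

    Each arriving point opens a new facility with probability
    min(d / facility_cost, 1), where d is its squared Euclidean distance
    (an assumption of this sketch) to the nearest open facility.
    Otherwise the point is assigned to that facility and pays d.
    """
    rng = rng or random.Random()
    facilities = []      # opened centers, in arrival order
    service_cost = 0.0   # total assignment cost paid so far

    for p in points:
        if not facilities:
            facilities.append(p)  # the first point always opens a facility
            continue
        # Squared distance from p to its nearest open facility.
        d = min(sum((a - b) ** 2 for a, b in zip(p, f)) for f in facilities)
        if rng.random() < min(d / facility_cost, 1.0):
            facilities.append(p)  # far points are likely to become centers
        else:
            service_cost += d     # near points are served by an existing center
    return facilities, service_cost
```

Under this rule a point that coincides with an open facility never opens a new one, while a point whose cost d is at least `facility_cost` always does. The decision for each point is made once, on arrival, which is what makes the rule usable in a single pass.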

Original language: English
Title of host publication: Proceedings of the 22nd Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2011
Publisher: Association for Computing Machinery
Pages: 26-40
Number of pages: 15
ISBN (Print): 9780898719932
State: Published - 1 Jan 2011
Externally published: Yes

Publication series

Name: Proceedings of the Annual ACM-SIAM Symposium on Discrete Algorithms

ASJC Scopus subject areas

  • Software
  • General Mathematics
