Scalable multi stage clustering of tagged micro-messages

Oren Tsur, Adi Littman, Ari Rappoport

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

14 Scopus citations

Abstract

The growing popularity of microblogging, backed by services like Twitter, Facebook, Google+ and LinkedIn, raises the challenge of clustering short and extremely sparse documents. In this work we propose SMSC, a scalable, accurate and efficient multi-stage clustering algorithm. Our algorithm leverages users' practice of adding tags to some messages by bootstrapping over virtual non-sparse documents. We experiment on a large corpus of tweets from Twitter and evaluate results against a gold-standard classification validated by seven clustering evaluation measures (information theoretic, paired and greedy). Results show that the algorithm presented is both accurate and efficient, significantly outperforming other algorithms. Under reasonable practical assumptions, our algorithm scales up sublinearly in time. Copyright is held by the author/owner(s).
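
The abstract does not spell out the individual stages, so the following is only a minimal sketch of the general idea it names: pooling tagged messages into dense "virtual documents" and bootstrapping from them to the untagged messages. Everything specific here is an assumption made for illustration and not taken from the paper: the three-stage split, the bag-of-words representation, cosine similarity, the greedy threshold clustering, and all function names and the `threshold` value.

```python
import math
import re
from collections import Counter, defaultdict

HASHTAG = re.compile(r"#(\w+)")
TOKEN = re.compile(r"[a-z0-9']+")


def tokens(text):
    """Lowercased word tokens of a message, with hashtags stripped out."""
    return TOKEN.findall(HASHTAG.sub(" ", text.lower()))


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(
        sum(w * w for w in b.values())
    )
    return dot / norm if norm else 0.0


def build_virtual_docs(messages):
    """Stage 1 (assumed): pool the words of all messages sharing a hashtag."""
    docs = defaultdict(Counter)
    for msg in messages:
        words = tokens(msg)
        for tag in HASHTAG.findall(msg.lower()):
            docs[tag].update(words)
    return docs


def cluster_virtual_docs(docs, threshold=0.3):
    """Stage 2 (assumed): greedy single-pass clustering of virtual documents.

    Each cluster is a (centroid Counter, list of tags) pair. A virtual
    document joins the most similar cluster if the similarity reaches
    `threshold`; otherwise it starts a new cluster.
    """
    clusters = []
    for tag, vec in sorted(docs.items()):
        best, best_sim = None, threshold
        for cluster in clusters:
            sim = cosine(vec, cluster[0])
            if sim >= best_sim:
                best, best_sim = cluster, sim
        if best is None:
            clusters.append((Counter(vec), [tag]))
        else:
            best[0].update(vec)
            best[1].append(tag)
    return clusters


def assign(message, clusters):
    """Stage 3 (assumed): map an untagged message to its closest cluster.

    Returns the cluster index, or None if the message shares no words
    with any cluster.
    """
    vec = Counter(tokens(message))
    sims = [cosine(vec, cluster[0]) for cluster in clusters]
    if not sims or max(sims) == 0:
        return None
    return sims.index(max(sims))
```

Pooling by tag is what makes the documents non-sparse: a single message contributes only a handful of words, while a tag's virtual document accumulates the vocabulary of every message carrying it. In this sketch the per-message work in the last stage is one comparison against each cluster centroid rather than against every other message.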

Original language: English
Title of host publication: WWW'12 - Proceedings of the 21st Annual Conference on World Wide Web Companion
Pages: 621-622
Number of pages: 2
State: Published - 21 May 2012
Externally published: Yes
Event: 21st Annual Conference on World Wide Web, WWW'12 - Lyon, France
Duration: 16 Apr 2012 to 20 Apr 2012

Publication series

Name: WWW'12 - Proceedings of the 21st Annual Conference on World Wide Web Companion

Conference

Conference: 21st Annual Conference on World Wide Web, WWW'12
Country/Territory: France
City: Lyon
Period: 16/04/12 to 20/04/12

Keywords

  • Clustering
  • Hashtags
  • Microblogging
  • Scalability
  • Short documents
  • Twitter
