Unsupervised hierarchical probabilistic segmentation of discrete events

Guy Shani, Asela Gunawardana, Christopher Meek

Research output: Contribution to journal, Article, peer-review

2 Scopus citations


Segmentation, the task of splitting a long sequence of symbols into chunks, can reveal information about the nature of the sequence in a form that humans can understand. We focus on unsupervised segmentation, where the algorithm never sees examples of successful segmentation but must still discover meaningful segments. In this paper, we present an unsupervised learning algorithm for segmenting sequences of symbols or categorical events. Our algorithm hierarchically builds a lexicon of segments and computes a maximum likelihood segmentation given the current lexicon. It is therefore best suited to hierarchical sequences, in which smaller segments are grouped into larger ones. Our probabilistic approach also allows us to propose conditional entropy as a measure of segmentation quality in the absence of labeled data. We compare our algorithm to two previous approaches from the unsupervised segmentation literature and show that it provides superior segmentation on a number of benchmarks. Our specific motivation for developing this general algorithm is to understand the behavior of software programs after deployment by analyzing their traces. We explain and motivate the importance of this problem, and present segmentation results for the interactions between a web service and its clients.
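
The abstract's step of computing a maximum likelihood segmentation given a lexicon can be illustrated with a small dynamic program. The sketch below is not the authors' implementation: it assumes a hypothetical lexicon format (a dict mapping tuples of symbols to probabilities), assumes that segments are drawn independently, and omits the hierarchical lexicon-building step entirely. The name `ml_segmentation` is illustrative only.

```python
import math


def ml_segmentation(sequence, lexicon):
    """Return (segments, log_prob) for the most likely split of `sequence`.

    Assumption (not from the paper): `lexicon` maps tuples of symbols to
    probabilities, and a segmentation's probability is the product of the
    probabilities of its segments.
    """
    n = len(sequence)
    max_len = max(len(seg) for seg in lexicon)

    # best[i] holds the best log-probability of segmenting sequence[:i];
    # back[i] holds the start index of the last segment in that split.
    best = [-math.inf] * (n + 1)
    back = [0] * (n + 1)
    best[0] = 0.0

    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            prob = lexicon.get(tuple(sequence[start:end]))
            if prob is None or best[start] == -math.inf:
                continue  # unknown segment or unreachable prefix
            score = best[start] + math.log(prob)
            if score > best[end]:
                best[end] = score
                back[end] = start

    if best[n] == -math.inf:
        raise ValueError("sequence cannot be segmented with this lexicon")

    # Follow the back-pointers from the end to recover the segments.
    segments = []
    end = n
    while end > 0:
        start = back[end]
        segments.append(tuple(sequence[start:end]))
        end = start
    segments.reverse()
    return segments, best[n]


# Toy lexicon with made-up probabilities, for illustration only.
toy_lexicon = {
    ("a",): 0.2, ("b",): 0.2, ("c",): 0.1,
    ("a", "b"): 0.3, ("b", "c"): 0.2,
}
segments, log_prob = ml_segmentation("abcab", toy_lexicon)
print(segments)  # [('a',), ('b', 'c'), ('a', 'b')]
```

With this toy lexicon, the split a | bc | ab has probability 0.2 × 0.2 × 0.3 = 0.012, which is higher than the alternative ab | c | ab at 0.3 × 0.1 × 0.3 = 0.009, so the dynamic program selects the former.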

Original language: English
Pages (from-to): 483-501
Number of pages: 19
Journal: Intelligent Data Analysis
Issue number: 4
State: Published - 1 Aug 2011


Keywords

  • Software analysis
  • multigram
  • probabilistic segmentation
  • sequence segmentation

ASJC Scopus subject areas

  • Theoretical Computer Science
  • Computer Vision and Pattern Recognition
  • Artificial Intelligence

