TY - GEN
T1 - Hierarchical probabilistic segmentation of discrete events
AU - Shani, Guy
AU - Meek, Christopher
AU - Gunawardana, Asela
PY - 2009/12/1
Y1 - 2009/12/1
N2 - Segmentation, the task of splitting a long sequence of discrete symbols into chunks, can provide important information about the nature of the sequence that is understandable to humans. Algorithms for segmenting mostly belong to the supervised learning family, where a labeled corpus is available to the algorithm in the learning phase. We are interested, however, in the unsupervised scenario, where the algorithm never sees examples of successful segmentation, but still needs to discover meaningful segments. In this paper we present an unsupervised learning algorithm for segmenting sequences of symbols or categorical events. Our algorithm, Hierarchical Multigram, hierarchically builds a lexicon of segments and computes a maximum likelihood segmentation given the current lexicon. Thus, our algorithm is most appropriate to hierarchical sequences, where smaller segments are grouped into larger segments. Our probabilistic approach also allows us to suggest conditional entropy as a measurement of the quality of a segmentation in the absence of labeled data. We compare our algorithm to two previous approaches from the unsupervised segmentation literature, showing it to provide superior segmentation over a number of benchmarks. We also compare our algorithm to previous approaches over a segmentation of the unlabeled interactions of a web service and its client.
AB - Segmentation, the task of splitting a long sequence of discrete symbols into chunks, can provide important information about the nature of the sequence that is understandable to humans. Algorithms for segmenting mostly belong to the supervised learning family, where a labeled corpus is available to the algorithm in the learning phase. We are interested, however, in the unsupervised scenario, where the algorithm never sees examples of successful segmentation, but still needs to discover meaningful segments. In this paper we present an unsupervised learning algorithm for segmenting sequences of symbols or categorical events. Our algorithm, Hierarchical Multigram, hierarchically builds a lexicon of segments and computes a maximum likelihood segmentation given the current lexicon. Thus, our algorithm is most appropriate to hierarchical sequences, where smaller segments are grouped into larger segments. Our probabilistic approach also allows us to suggest conditional entropy as a measurement of the quality of a segmentation in the absence of labeled data. We compare our algorithm to two previous approaches from the unsupervised segmentation literature, showing it to provide superior segmentation over a number of benchmarks. We also compare our algorithm to previous approaches over a segmentation of the unlabeled interactions of a web service and its client.
UR - http://www.scopus.com/inward/record.url?scp=77951154856&partnerID=8YFLogxK
U2 - 10.1109/ICDM.2009.87
DO - 10.1109/ICDM.2009.87
M3 - Conference contribution
AN - SCOPUS:77951154856
SN - 9780769538952
T3 - Proceedings - IEEE International Conference on Data Mining, ICDM
SP - 974
EP - 979
BT - ICDM 2009 - The 9th IEEE International Conference on Data Mining
T2 - 9th IEEE International Conference on Data Mining, ICDM 2009
Y2 - 6 December 2009 through 9 December 2009
ER -