Abstract
In an unlabeled and unsegmented conversation, i.e. when no a priori knowledge about the speakers' identities or the segment boundaries is provided, it is very important to cluster the conversation (segment and label it) with the best possible resolution. At low resolution, i.e. when the segments are long, a segment might contain data from several speakers. On the other hand, when short segments are used (high resolution), there are not enough statistics to allow a correct decision about the identity of the speaker. In this work, the performance of a system that employs different segment lengths is presented. We assumed that the number of speakers, R, is known, and high-quality conversations were used. Each speaker was modeled by a Self-Organizing Map (SOM). An iterative algorithm allows the data to move from one model to another and adjusts the SOMs. The restriction that the data can move only in small groups, rather than one feature vector at a time, forces the SOMs to adjust to speakers (instead of phonemes or other vocal events). We found that the optimal segment duration was half a second. The system has a clustering performance of about 90% for two-speaker conversations and over 80% for three-speaker conversations.
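The iterative scheme described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the tiny 1-D SOM, the random initial labeling, the mean quantization distortion used as the segment score, and all names (`train_som`, `segment_distortion`, `cluster_segments`, `seg_len`) are assumptions made for illustration. The one idea taken from the abstract is that whole fixed-length segments, not individual feature vectors, are reassigned between the R speaker models.

```python
import numpy as np

def train_som(data, n_units=4, epochs=10, seed=0):
    """Train a tiny 1-D SOM (assumed topology) on data of shape (n, d)."""
    rng = np.random.default_rng(seed)
    start = rng.choice(len(data), n_units, replace=len(data) < n_units)
    codebook = data[start].astype(float)
    idx = np.arange(n_units)
    for epoch in range(epochs):
        lr = 0.5 * (1 - epoch / epochs)              # decaying learning rate
        sigma = max(1.0 - epoch / epochs, 0.1)       # shrinking neighbourhood
        for x in data[rng.permutation(len(data))]:
            winner = np.argmin(((codebook - x) ** 2).sum(axis=1))
            h = np.exp(-((idx - winner) ** 2) / (2 * sigma ** 2))
            codebook += lr * h[:, None] * (x - codebook)
    return codebook

def segment_distortion(segment, codebook):
    """Mean squared distance from each vector in a segment to its nearest unit."""
    d = ((segment[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return d.min(axis=1).mean()

def cluster_segments(features, seg_len, n_speakers, n_iter=10, seed=0):
    """Reassign whole segments among n_speakers SOMs until labels stop changing."""
    n_segs = len(features) // seg_len
    dim = features.shape[1]
    segments = features[: n_segs * seg_len].reshape(n_segs, seg_len, dim)
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, n_speakers, n_segs)     # assumed random start
    for _ in range(n_iter):
        models = []
        for r in range(n_speakers):
            members = segments[labels == r]
            if len(members) == 0:                    # re-seed an empty model
                members = segments[rng.integers(0, n_segs, 1)]
            models.append(train_som(members.reshape(-1, dim), seed=seed))
        new_labels = np.array([
            np.argmin([segment_distortion(s, m) for m in models])
            for s in segments
        ])
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels
```

In this sketch `seg_len` is counted in feature vectors; if one assumes a 10 ms frame step (not stated in the abstract), the half-second segments reported above would correspond to `seg_len=50`. Convergence to the correct speaker partition is not guaranteed from a random start.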
Original language | English |
---|---|
Pages | 169-174 |
Number of pages | 6 |
State | Published - 1 Jan 2001 |
Event | Speaker Recognition Workshop 2001: A Speaker Odyssey, ODYSSEY 2001 - Crete, Greece; Duration: 18 Jun 2001 → 22 Jun 2001 |
Conference
Conference | Speaker Recognition Workshop 2001: A Speaker Odyssey, ODYSSEY 2001 |
---|---|
Country/Territory | Greece |
City | Crete |
Period | 18/06/01 → 22/06/01 |
ASJC Scopus subject areas
- Signal Processing
- Software
- Human-Computer Interaction