Abstract
Speaker Change Detection (SCD) is the problem of splitting an audio-recording by its speaker-turns. Many real-world problems, such as the Speaker Diarization (SD) or automatic speech transcription, are influenced by the quality of the speaker-turns estimation. Previous works have already shown that auxiliary textual information (for mono-lingual systems) can be of great use for detection of speaker-turns and the diarization systems’ performance. In this paper, we suggest a framework for speaker-turn estimation, as well as the determination of clustered speaker identities to the SD system, and examine our approach over a multi-lingual dataset that consists of three mono-lingual datasets—in English, French, and Hebrew. As such, we propose a generic and language-independent framework for the SCD problem that is learned through textual information using state-of-the-art transformer-based techniques and speech-embedding modules. Comprehensive experimental evaluation shows that (i) our multi-lingual SCD framework is competitive enough when compared to a framework over mono-lingual datasets, and that (ii) textual information improves the solution's quality compared to the speech signal-based approach. In addition, we show that our multi-lingual SCD approach does not harm the performance of SD systems.
Original language | English |
---|---|
Article number | 119238 |
Journal | Expert Systems with Applications |
Volume | 213 |
DOIs | |
State | Published - 1 Mar 2023 |
Externally published | Yes |
Keywords
- Speaker change detection
- Speaker diarization
- Speaker embedding
- Speech recognition
- Transformers
ASJC Scopus subject areas
- General Engineering
- Computer Science Applications
- Artificial Intelligence