Speech and multilingual natural language framework for speaker change detection and diarization

Or Haim Anidjar, Yannick Estève, Chen Hajaj, Amit Dvir, Itshak Lapidot

Research output: Contribution to journal › Article › peer-review

2 Scopus citations

Abstract

Speaker Change Detection (SCD) is the problem of splitting an audio recording into its speaker turns. Many real-world tasks, such as Speaker Diarization (SD) and automatic speech transcription, depend on the quality of the speaker-turn estimation. Previous work has shown that auxiliary textual information (in mono-lingual systems) can substantially help both the detection of speaker turns and the performance of diarization systems. In this paper, we propose a framework for speaker-turn estimation, as well as for determining the clustered speaker identities for the SD system, and evaluate our approach on a multi-lingual dataset that consists of three mono-lingual datasets in English, French, and Hebrew. The framework is generic and language-independent, and it learns the SCD task from textual information using state-of-the-art transformer-based techniques together with speech-embedding modules. Comprehensive experimental evaluation shows that (i) our multi-lingual SCD framework is competitive with a framework that operates over mono-lingual datasets, and that (ii) textual information improves the solution's quality compared to the speech signal-based approach. In addition, we show that our multi-lingual SCD approach does not harm the performance of SD systems.

Original language: English
Article number: 119238
Journal: Expert Systems with Applications
Volume: 213
State: Published - 1 Mar 2023
Externally published: Yes

Keywords

  • Speaker change detection
  • Speaker diarization
  • Speaker embedding
  • Speech recognition
  • Transformers

ASJC Scopus subject areas

  • General Engineering
  • Computer Science Applications
  • Artificial Intelligence
