Improving speech emotion recognition capabilities in the short and long term using temporal bucketing and active learning

Research output: Contribution to journal › Article › peer-review

Abstract

Over the last four decades, great effort has been made to improve Speech Emotion Recognition (SER) capabilities, and researchers have proposed a variety of innovative methods for that purpose, including data science methods. In this paper, we present two novel methods for improving emotion inference from speech, both of which make better use of the temporal dimension of speech data: Temporal Bucketing SER for the short term and a pool-based active learning (AL) method for SER systems for the long term. In our evaluation, Temporal Bucketing outperformed state-of-the-art (SOTA) methods at recognizing emotions in speech on five widely used, publicly available, diverse datasets. Specifically, it achieved accuracies of 91.11 %, 95.29 %, 89.36 %, 95.14 %, and 92.91 % on RAVDESS Speech, RAVDESS Song, IEMOCAP, EMO-DB, and SAVEE, respectively. SER capabilities should also improve continuously over time (the long term), adapting to changing conditions by updating the learning model with new emotion-bearing speech samples. To that end, our proposed AL method builds on the Temporal Bucketing SER method and uses two selective sampling criteria to choose the most informative samples to label and acquire for improving the learning model. In our evaluation, the proposed AL method reached 90 % of the maximum achievable accuracy on the five datasets while acquiring an average of 8.12 % fewer samples than SOTA AL methods. The general saving in labeling costs is even greater, ranging from 5 % to 20 %, demonstrating the efficiency of our proposed AL method and selection criteria compared to both SOTA AL methods and passive learning.
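
The abstract does not state the two selective sampling criteria or how Temporal Bucketing builds its features, so the following is only a generic sketch of a pool-based active learning loop, not the authors' method. It assumes a toy nearest-centroid classifier and a margin-based uncertainty criterion as stand-ins; all function names, the two-dimensional feature vectors, and the emotion labels are hypothetical.

```python
import math


def class_centroids(features, labels):
    """Return the mean feature vector of each class."""
    sums, counts = {}, {}
    for x, label in zip(features, labels):
        acc = sums.setdefault(label, [0.0] * len(x))
        for i, value in enumerate(x):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {c: [v / counts[c] for v in s] for c, s in sums.items()}


def predict(x, centroids):
    """Assign x to the class with the nearest centroid."""
    return min(centroids, key=lambda c: math.dist(x, centroids[c]))


def margin(x, centroids):
    """Distance gap between the two nearest centroids (small = uncertain)."""
    d = sorted(math.dist(x, c) for c in centroids.values())
    return d[1] - d[0] if len(d) > 1 else 0.0


def active_learning(pool, oracle, seed_indices, rounds, batch_size):
    """Pool-based loop: fit, rank unlabeled samples, query the oracle."""
    labeled = {i: oracle(i) for i in seed_indices}
    for _ in range(rounds):
        centroids = class_centroids(
            [pool[i] for i in labeled], list(labeled.values())
        )
        unlabeled = [i for i in range(len(pool)) if i not in labeled]
        if not unlabeled:
            break
        # Assumed criterion: query the samples the model is least sure about.
        unlabeled.sort(key=lambda i: margin(pool[i], centroids))
        for i in unlabeled[:batch_size]:
            labeled[i] = oracle(i)
    centroids = class_centroids(
        [pool[i] for i in labeled], list(labeled.values())
    )
    return centroids, labeled


# Hypothetical 2-D "features" for two well-separated emotion classes.
pool = [(0.1 * k, 0.05 * k) for k in range(10)]
pool += [(5 + 0.1 * k, 5 + 0.05 * k) for k in range(10)]
truth = ["calm"] * 10 + ["angry"] * 10

centroids, labeled = active_learning(
    pool, lambda i: truth[i], seed_indices=[0, 10], rounds=3, batch_size=2
)
accuracy = sum(predict(x, centroids) == t for x, t in zip(pool, truth)) / len(pool)
print(len(labeled), accuracy)
```

In this sketch the oracle is a lookup into known labels; in a real SER setting it would be a human annotator, which is why acquiring fewer samples translates into lower labeling cost.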

Original language: English
Article number: 110912
Journal: Computers in Biology and Medicine
Volume: 196
State: Published - 1 Sep 2025

Keywords

  • Active learning
  • Classification
  • Speech emotion recognition
  • Time series data

ASJC Scopus subject areas

  • Health Informatics
  • Computer Science Applications
