TY - JOUR
T1 - Improving speech emotion recognition capabilities in the short and long term using temporal bucketing and active learning
AU - Gurowiec, Itzik
AU - Nissim, Nir
N1 - Publisher Copyright: © 2025
PY - 2025/9/1
Y1 - 2025/9/1
AB - Over the last four decades, great effort has been made to improve Speech Emotion Recognition (SER) capabilities, and researchers have proposed a variety of innovative methods for that purpose, including data science methods. In this paper, we present two novel methods for improving emotion inference from speech, both of which focus on improved utilization of speech data's temporal dimension: Temporal Bucketing SER in the short term and a pool-based active learning (AL) method for SER systems in the long term. In our evaluation, Temporal Bucketing outperformed state-of-the-art (SOTA) methods in the task of recognizing the emotions in speech on five widely used, publicly available, diverse datasets. Specifically, we obtained outstanding accuracy results of 91.11 %, 95.29 %, 89.36 %, 95.14 %, and 92.91 % on RAVDESS Speech, RAVDESS Song, IEMOCAP, EMO-DB, and SAVEE, respectively. These capabilities should also be continuously improved over time (in the long term), adapting to the changing reality by improving the learning model towards new speech samples containing emotions. Thus, our proposed AL method leverages the Temporal Bucketing SER method, utilizing two selective sampling criteria, by which the most informative samples are labeled and acquired to improve the learning model. In our evaluation of the proposed AL method, it obtained 90 % of the maximum achievable accuracy on the five datasets, acquiring an average of 8.12 % fewer samples than SOTA AL methods. The general saving in labeling costs is even greater, ranging from 5 % to 20 %, demonstrating the efficiency of our proposed AL method and selection criteria compared to both SOTA AL methods and passive learning.
KW - Active learning
KW - Classification
KW - Speech emotion recognition
KW - Time series data
UR - https://www.scopus.com/pages/publications/105013099371
U2 - 10.1016/j.compbiomed.2025.110912
DO - 10.1016/j.compbiomed.2025.110912
M3 - Article
C2 - 40819495
AN - SCOPUS:105013099371
SN - 0010-4825
VL - 196
JO - Computers in Biology and Medicine
JF - Computers in Biology and Medicine
M1 - 110912
ER -