TY - JOUR
T1 - Speech Enhancement Using Masking for Binaural Reproduction of Ambisonics Signals
AU - Lugasi, Moti
AU - Rafaely, Boaz
N1 - Funding Information:
Manuscript received July 7, 2019; revised January 22, 2020 and May 14, 2020; accepted May 22, 2020. Date of publication May 28, 2020; date of current version June 18, 2020. This work was supported by Facebook Reality Labs. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Jun Du. (Corresponding author: Moti Lugasi.) The authors are with the School of Electrical and Computer Engineering, Ben-Gurion University of the Negev, Beer-Sheva 84105, Israel (e-mail: motilu@post.bgu.ac.il; br@bgu.ac.il). Digital Object Identifier 10.1109/TASLP.2020.2998294
Publisher Copyright:
© 2014 IEEE.
PY - 2020/1/1
Y1 - 2020/1/1
N2 - Speech enhancement in a single channel has been well studied in the literature in applications such as speech communication systems. However, in emerging applications such as virtual reality and spatial audio, in addition to attenuating undesired signals, the ability to preserve the spatial information of the desired signal captured in a noisy environment is of great importance. Nevertheless, there are only a few studies in the literature that propose solutions to this challenge. Most of these studies present solutions that attenuate the undesired signals, while preserving only limited spatial information regarding the desired signal, such as the direction of arrival (DOA). Methods that preserve complete spatial information have only recently been suggested, and have not been studied comprehensively. In this paper, two such methods based on time-frequency masking are investigated with the aim of attenuating the undesired signal, while preserving the spatial components of the desired signal. The first is referred to as spatial masking and is based on masking in the plane wave density (PWD) domain, and the second on masking in the spherical harmonics (SH) domain. The two methods are compared with a reference method, based on beamforming followed by single-channel time-frequency masking. Objective analysis and two listening tests were conducted in order to evaluate the performance of these methods for speech enhancement. It was shown that the spatial masking based method better preserves the desired component of the sound field, while the performance of the SH based method more strongly depends on the sources' distances. On the other hand, the SH based method better preserves the DOA of the residual noise, while the DOA of the residual noise under the spatial masking based method is strongly affected by the undesired signal.
KW - Speech enhancement
KW - Wiener masking
KW - binaural reproduction
KW - noise reduction
KW - plane wave decomposition
KW - spatial masking
KW - spherical arrays
UR - http://www.scopus.com/inward/record.url?scp=85087500386&partnerID=8YFLogxK
U2 - 10.1109/TASLP.2020.2998294
DO - 10.1109/TASLP.2020.2998294
M3 - Article
AN - SCOPUS:85087500386
SN - 2329-9290
VL - 28
SP - 1767
EP - 1777
JO - IEEE/ACM Transactions on Audio, Speech, and Language Processing
JF - IEEE/ACM Transactions on Audio, Speech, and Language Processing
M1 - 9103085
ER -