TY - GEN
T1 - TONet: Tone-Octave Network for Singing Melody Extraction from Polyphonic Music
T2 - 47th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2022
AU - Chen, Ke
AU - Yu, Shuai
AU - Wang, Cheng I.
AU - Li, Wei
AU - Berg-Kirkpatrick, Taylor
AU - Dubnov, Shlomo
N1 - Funding Information:
Code available at https://github.com/RetroCirce/TONet. This work was supported by the National Key R&D Program of China (2019YFC1711800) and NSFC (62171138).
Publisher Copyright:
© 2022 IEEE
PY - 2022/1/1
Y1 - 2022/1/1
AB - Singing melody extraction is an important problem in the field of music information retrieval. Existing methods typically rely on frequency-domain representations to estimate the sung frequencies. However, this design does not lead to human-level performance in the perception of melody information for both tone (pitch-class) and octave. In this paper, we propose TONet, a plug-and-play model that improves both tone and octave perceptions by leveraging a novel input representation and a novel network architecture. First, we present an improved input representation, the Tone-CFP, that explicitly groups harmonics via a rearrangement of frequency-bins. Second, we introduce an encoder-decoder architecture that is designed to obtain a salience feature map, a tone feature map, and an octave feature map. Third, we propose a tone-octave fusion mechanism to improve the final salience feature map. Experiments are done to verify the capability of TONet with various baseline backbone models. Our results show that tone-octave fusion with Tone-CFP can significantly improve the singing voice extraction performance across various datasets, with substantial gains in octave and tone accuracy.
KW - Melody Extraction
KW - Self-Attention
KW - Tone-CFP
KW - Tone-Octave Information Fusion
UR - http://www.scopus.com/inward/record.url?scp=85131260148&partnerID=8YFLogxK
U2 - 10.1109/ICASSP43922.2022.9747304
DO - 10.1109/ICASSP43922.2022.9747304
M3 - Conference contribution
AN - SCOPUS:85131260148
T3 - ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
SP - 621
EP - 625
BT - 2022 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2022 - Proceedings
PB - Institute of Electrical and Electronics Engineers
Y2 - 23 May 2022 through 27 May 2022
ER -