TY - GEN
T1 - HTS-AT: A Hierarchical Token-Semantic Audio Transformer for Sound Classification and Detection
T2 - 47th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2022
AU - Chen, Ke
AU - Du, Xingjian
AU - Zhu, Bilei
AU - Ma, Zejun
AU - Berg-Kirkpatrick, Taylor
AU - Dubnov, Shlomo
N1 - Publisher Copyright:
© 2022 IEEE
PY - 2022/1/1
Y1 - 2022/1/1
AB - Audio classification is an important task of mapping audio samples into their corresponding labels. Recently, the transformer model with self-attention mechanisms has been adopted in this field. However, existing audio transformers require large GPU memory and long training times, while also relying on pretrained vision models to achieve high performance, which limits the model's scalability in audio tasks. To combat these problems, we introduce HTS-AT: an audio transformer with a hierarchical structure to reduce the model size and training time. It is further combined with a token-semantic module that maps the final outputs into class feature maps, thus enabling the model to perform audio event detection (i.e., localization in time). We evaluate HTS-AT on three audio classification datasets, where it achieves new state-of-the-art (SOTA) results on AudioSet and ESC-50, and equals the SOTA on Speech Command V2. It also achieves better performance in event localization than previous CNN-based models. Moreover, HTS-AT requires only 35% of the model parameters and 15% of the training time of the previous audio transformer. These results demonstrate the high performance and high efficiency of HTS-AT.
KW - Audio Classification
KW - Sound Event Detection
KW - Token-Semantic Module
KW - Transformer
UR - http://www.scopus.com/inward/record.url?scp=85131251403&partnerID=8YFLogxK
U2 - 10.1109/ICASSP43922.2022.9746312
DO - 10.1109/ICASSP43922.2022.9746312
M3 - Conference contribution
AN - SCOPUS:85131251403
T3 - ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
SP - 646
EP - 650
BT - 2022 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2022 - Proceedings
PB - Institute of Electrical and Electronics Engineers
Y2 - 23 May 2022 through 27 May 2022
ER -
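
Editor's note (not part of the RIS record): the abstract describes a token-semantic module that maps the transformer's final token outputs into per-class feature maps for event localization, alongside clip-level classification. The PyTorch sketch below is only a minimal illustration of that idea under assumed tensor shapes; the class name, parameter values, and pooling choice are hypothetical and are not taken from the paper or its released code.

    import torch
    import torch.nn as nn

    class TokenSemanticHead(nn.Module):
        # Hypothetical sketch: map a sequence of final-stage tokens
        # (batch, time_tokens, dim) to per-class activations over time,
        # then pool over time for a clip-level prediction.
        def __init__(self, dim, num_classes):
            super().__init__()
            # A 1D convolution along the time axis serves as the
            # token-to-class-feature-map projection in this sketch.
            self.conv = nn.Conv1d(dim, num_classes, kernel_size=3, padding=1)

        def forward(self, tokens):
            x = tokens.transpose(1, 2)                 # (batch, dim, time_tokens)
            framewise = torch.sigmoid(self.conv(x))    # (batch, num_classes, time_tokens)
            clipwise = framewise.mean(dim=2)           # (batch, num_classes)
            return clipwise, framewise

    # Illustrative usage: 527 AudioSet classes, 768-dim final tokens (assumed values).
    head = TokenSemanticHead(dim=768, num_classes=527)
    clip_pred, event_map = head(torch.randn(2, 64, 768))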