Abstract
We present ZeroBAS, a neural method to synthesize binaural speech from monaural speech recordings and positional information without training on any binaural data. To our knowledge, this is the first published zero-shot neural approach to mono-to-binaural speech synthesis. Specifically, we show that a parameter-free geometric time warping and amplitude scaling based on source location suffices to obtain an initial binaural synthesis that can be refined by iteratively applying a pretrained denoising vocoder. Furthermore, we find this leads to generalization across room conditions, which we measure by introducing a new dataset, TUT Mono-to-Binaural, to evaluate state-of-the-art monaural-to-binaural synthesis methods on unseen conditions. Our zero-shot method is perceptually on par with supervised methods on the previous standard mono-to-binaural dataset, and even surpasses them on our out-of-distribution TUT Mono-to-Binaural dataset.
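The geometric time warping and amplitude scaling that the abstract describes can be sketched roughly as follows: delay the mono signal by the source-to-ear propagation time and attenuate it by the source-to-ear distance, once per ear. This is a hypothetical illustration under simple free-field assumptions (speed of sound 343 m/s, 1/distance spreading loss), not the authors' implementation; the function and parameter names are invented for the sketch.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s, assumed free-field propagation speed


def warp_and_scale(mono, sr, src_pos, ear_pos):
    """Delay and attenuate a mono signal for one ear (illustrative sketch).

    Time warp: shift by distance / c, in fractional samples, via linear
    interpolation. Amplitude scale: 1 / distance (spherical spreading).
    """
    dist = np.linalg.norm(np.asarray(src_pos) - np.asarray(ear_pos))
    delay = dist / SPEED_OF_SOUND * sr              # delay in fractional samples
    t = np.arange(len(mono)) - delay                # warped sample positions
    warped = np.interp(t, np.arange(len(mono)), mono, left=0.0, right=0.0)
    return warped / max(dist, 1e-3)                 # distance-based attenuation


# Binaural pair from one mono recording: two ears ~17 cm apart, source
# placed to the listener's front-right, so the right channel arrives
# earlier and louder.
sr = 16000
mono = np.sin(2 * np.pi * 440 * np.arange(sr) / sr)
left = warp_and_scale(mono, sr, src_pos=(1.0, 0.5, 0.0), ear_pos=(-0.085, 0.0, 0.0))
right = warp_and_scale(mono, sr, src_pos=(1.0, 0.5, 0.0), ear_pos=(0.085, 0.0, 0.0))
```

Per the abstract, such a parameter-free warp only provides the initial binaural estimate; the refinement comes from iteratively applying a pretrained denoising vocoder.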
| Original language | English |
|---|---|
| Pages (from-to) | 4168-4172 |
| Number of pages | 5 |
| Journal | Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH |
| State | Published - 1 Jan 2025 |
| Externally published | Yes |
| Event | 26th Interspeech Conference 2025, Rotterdam, Netherlands, 17–21 Aug 2025 |
Keywords
- diffusion
- mono-to-binaural
- speech synthesis
- zero-shot
ASJC Scopus subject areas
- Software
- Signal Processing
- Language and Linguistics
- Modeling and Simulation
- Human-Computer Interaction
Fingerprint
Dive into the research topics of 'Zero-Shot Mono-to-Binaural Speech Synthesis'.