Zero-Shot Mono-to-Binaural Speech Synthesis

  • Alon Levkovitch
  • Julian Salazar
  • Soroosh Mariooryad
  • R. J. Skerry-Ryan
  • Nadav Bar
  • Bastiaan Kleijn
  • Eliya Nachmani

Research output: Contribution to journal › Conference article › peer-review

Abstract

We present ZeroBAS, a neural method to synthesize binaural speech from monaural speech recordings and positional information without training on any binaural data. To our knowledge, this is the first published zero-shot neural approach to mono-to-binaural speech synthesis. Specifically, we show that parameter-free geometric time warping and amplitude scaling based on source location suffice to produce an initial binaural synthesis that can be refined by iteratively applying a pretrained denoising vocoder. Furthermore, we find that this leads to generalization across room conditions, which we measure by introducing a new dataset, TUT Mono-to-Binaural, to evaluate state-of-the-art monaural-to-binaural synthesis methods on unseen conditions. Our zero-shot method is perceptually on par with supervised methods on the previous standard mono-to-binaural dataset, and even surpasses them on our out-of-distribution TUT Mono-to-Binaural dataset.
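The geometric stage described in the abstract can be sketched as follows. This is an illustrative approximation, not the paper's exact formulation: the function names, the linear-interpolation fractional delay, and the simple 1/r amplitude model are assumptions introduced here for clarity.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s, approximate speed of sound in air

def geometric_warp(mono, fs, src_pos, ear_pos):
    """Delay and attenuate a mono signal for one ear based on the
    source-to-ear distance (sketch; not the paper's exact method)."""
    dist = np.linalg.norm(np.asarray(src_pos) - np.asarray(ear_pos))
    delay = dist / SPEED_OF_SOUND * fs  # propagation delay in samples
    n = np.arange(len(mono))
    # Fractional-delay time warp via linear interpolation;
    # samples before the wavefront arrives are zero.
    warped = np.interp(n - delay, n, mono, left=0.0, right=0.0)
    return warped / max(dist, 1e-6)  # assumed 1/r amplitude scaling

def mono_to_binaural(mono, fs, src_pos, left_ear, right_ear):
    """Initial binaural estimate: per-ear time warp and scaling.
    In ZeroBAS this estimate is then refined by a pretrained
    denoising vocoder (not modeled here)."""
    return (geometric_warp(mono, fs, src_pos, left_ear),
            geometric_warp(mono, fs, src_pos, right_ear))
```

A source placed off to one side yields the expected interaural cues: the nearer ear receives the signal earlier and louder, which is what the vocoder refinement stage then cleans up.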

Original language: English
Pages (from-to): 4168-4172
Number of pages: 5
Journal: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
DOIs
State: Published - 1 Jan 2025
Externally published: Yes
Event: 26th Interspeech Conference 2025 - Rotterdam, Netherlands
Duration: 17 Aug 2025 - 21 Aug 2025

Keywords

  • diffusion
  • mono-to-binaural
  • speech synthesis
  • zero-shot

ASJC Scopus subject areas

  • Software
  • Signal Processing
  • Language and Linguistics
  • Modeling and Simulation
  • Human-Computer Interaction
