NANSY++: Unified Voice Synthesis with Neural Analysis and Synthesis

11/17/2022
by Hyeong-Seok Choi, et al.

Various applications of voice synthesis have been developed independently, even though they all generate "voice" as output. In addition, most voice synthesis models still require a large amount of audio data paired with annotated labels (e.g., text transcriptions and music scores) for training. To address this, we propose NANSY++, a unified framework for synthesizing and manipulating voice signals from analysis features. The backbone network of NANSY++ is trained in a self-supervised manner that requires no annotations paired with the audio. After training the backbone network, we efficiently tackle four voice applications (voice conversion, text-to-speech, singing voice synthesis, and voice designing) by partially modeling the analysis features required for each task. Extensive experiments show that the proposed framework offers competitive advantages such as controllability, data efficiency, and fast training convergence, while providing high-quality synthesis. Audio samples: tinyurl.com/8tnsy3uc.
