Learn to Sing by Listening: Building Controllable Virtual Singer by Unsupervised Learning from Voice Recordings

by Wei Xue, et al.

A virtual world is emerging in which digital humans are created to be indistinguishable from real ones. Giving them audio capabilities is crucial, since voice conveys many personal characteristics. We aim to create a controllable virtual singer in audio form. However, supervised modeling and control of all the factors of the singing voice, such as timbre, tempo, pitch, and lyrics, is extremely difficult, because accurately labeling all of this information requires enormous manual effort. In this paper, we propose a framework that digitizes a person's voice simply by "listening" to clean voice recordings of any content, in a fully unsupervised manner, and that can predict singing voices even when only speaking recordings are available. The framework is based on a variational auto-encoder (VAE): it leverages a set of pre-trained models to encode the audio into hidden embeddings that represent different factors of the singing voice, and then decodes the embeddings into raw audio. Manipulating the hidden embedding of a given factor controls the resulting singing voice, and new virtual singers can be generated by interpolating between timbres. Evaluations across different types of experiments demonstrate the effectiveness of the proposed method. The proposed method is the critical technique behind the AI choir that empowered the human-AI symbiotic orchestra in Hong Kong in July 2022.
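The idea of generating a new virtual singer by interpolating between timbres can be illustrated with a minimal sketch. It assumes that each timbre is represented by a fixed-length embedding vector and that the blend is linear; the abstract does not state the interpolation scheme, so this is an assumption. The function name `interpolate_timbre`, the 4-dimensional vectors, and the mixing weight `alpha` are hypothetical and do not come from the paper.

```python
import numpy as np


def interpolate_timbre(timbre_a, timbre_b, alpha):
    """Linearly blend two timbre embeddings.

    alpha = 0 returns timbre_a, alpha = 1 returns timbre_b.
    Linear blending is an assumption made for illustration only.
    """
    a = np.asarray(timbre_a, dtype=float)
    b = np.asarray(timbre_b, dtype=float)
    if a.shape != b.shape:
        raise ValueError("timbre embeddings must have the same shape")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return (1.0 - alpha) * a + alpha * b


# Hypothetical 4-dimensional timbre embeddings for two singers.
singer_a = np.array([1.0, 0.0, 2.0, -1.0])
singer_b = np.array([0.0, 1.0, 0.0, 1.0])

# A new virtual singer that sits closer to singer_a than to singer_b.
new_singer = interpolate_timbre(singer_a, singer_b, alpha=0.25)
```

In a framework like the one described, the blended vector would replace the timbre embedding passed to the decoder while the embeddings for the other factors stay unchanged.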




