Co-speech gesture generation is crucial for automatic digital avatar
ani...
Zero-shot text-to-speech aims at synthesizing voices with unseen speech
...
Scaling text-to-speech to a large and wild dataset has been proven to be...
We are interested in a novel task, namely low-resource text-to-talking
a...
Large diffusion models have been successful in text-to-audio (T2A) synth...
Direct speech-to-speech translation (S2ST) aims to convert speech from o...
Improving text representation has attracted much attention to achieve
ex...
We are interested in a challenging task, Realistic-Music-Score based Sin...
The speech-to-singing (STS) voice conversion task aims to generate singi...
Generating talking person portraits with arbitrary speech audio is a cru...
Large language models (LLMs) have exhibited remarkable capabilities acro...
Listening to long video/audio recordings from video conferencing and onl...
ICASSP2023 General Meeting Understanding and Generation Challenge (MUG)
...
Generating photo-realistic video portrait with arbitrary speech audio is...
Large-scale multimodal generative modeling has created milestones in
tex...
Video to sound generation aims to generate realistic and natural sound g...
Denoising diffusion probabilistic models (DDPMs) have recently achieved
...
Polyphone disambiguation aims to capture accurate pronunciation knowledg...
Direct speech-to-speech translation (S2ST) systems leverage recent progr...
Style transfer for out-of-domain (OOD) speech synthesis aims to generate...
We are interested in a novel task, singing voice beautifying (SVB). Give...
Multi-speaker singing voice synthesis is to generate the singing voice s...
High-fidelity multi-singer singing voice synthesis is challenging for ne...
Sign language translation as a kind of technology with profound social
s...
High-fidelity singing voice synthesis is challenging for neural vocoders...
Non-autoregressive text-to-speech (NAR-TTS) models such as FastSpeech 2 ...
Lip reading, aiming to recognize spoken sentences according to the given...
As a key component of talking face generation, lip movements generation
...
Recently, there has been an increasing interest in neural speech synthes...
Singing voice synthesis (SVS) system is built to synthesize high-quality...
While neural-based text to speech (TTS) models can synthesize natural an...
Lipreading is an impressive technique and there has been a definite
impr...
Non-autoregressive translation (NAT) achieves faster inference speed but...
Non-autoregressive (NAR) models generate all the tokens of a sequence in...