Zero-shot text-to-speech (TTS) synthesis aims to clone any unseen speake...
Automatic co-speech gesture generation draws much attention in compu...
Mapping two modalities, speech and text, into a shared representation sp...
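As a rough illustration of the shared-space idea in the entry above (not the method of any particular paper listed here), the sketch below pairs a speech encoder and a text encoder and pulls paired embeddings together with an InfoNCE-style contrastive loss; the module names, dimensions, and loss choice are all assumptions.

    # Illustrative sketch only: two encoders project mel-spectrograms and token
    # sequences into one embedding space; paired items are trained to align.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SpeechEncoder(nn.Module):
        def __init__(self, n_mels=80, dim=256):
            super().__init__()
            self.rnn = nn.GRU(n_mels, dim, batch_first=True)
            self.proj = nn.Linear(dim, dim)

        def forward(self, mel):                      # mel: (B, T, n_mels)
            _, h = self.rnn(mel)
            return F.normalize(self.proj(h[-1]), dim=-1)

    class TextEncoder(nn.Module):
        def __init__(self, vocab_size=100, dim=256):
            super().__init__()
            self.emb = nn.Embedding(vocab_size, dim)
            self.rnn = nn.GRU(dim, dim, batch_first=True)
            self.proj = nn.Linear(dim, dim)

        def forward(self, tokens):                   # tokens: (B, L)
            _, h = self.rnn(self.emb(tokens))
            return F.normalize(self.proj(h[-1]), dim=-1)

    def alignment_loss(speech_emb, text_emb, temperature=0.07):
        # InfoNCE-style objective: the matching speech/text pair in a batch is
        # the positive; every other pairing in the batch serves as a negative.
        logits = speech_emb @ text_emb.t() / temperature
        targets = torch.arange(logits.size(0), device=logits.device)
        return F.cross_entropy(logits, targets)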
Recently, excellent progress has been made in speech recognition. Howeve...
Single-speaker singing voice synthesis (SVS) usually underperforms a...
This paper presents an end-to-end high-quality singing voice synthesis (...
The spontaneous behavior that often occurs in conversations makes speech...
For text-to-speech (TTS) synthesis, prosodic structure prediction (PSP) ...
Recent advances in neural text-to-speech (TTS) models bring thousands of...
In this paper, we introduce DiffuseStyleGesture+, our solution for t...
Current talking face generation methods mainly focus on speech-lip synch...
Expressive speech synthesis is crucial for many human-computer interacti...
The previously proposed SpEx+ has yielded outstanding performance in speaker extrac...
Previously, Target Speaker Extraction (TSE) has yielded outstanding perf...
Visual information can serve as an effective cue for target speaker extr...
Previous studies have shown that large language models (LLMs) like GPTs ...
Speech-driven gesture generation is highly challenging due to the random...
In this paper, we present ZeroPrompt (Figure 1-(a)) and the correspondin...
Nowadays, recognition-synthesis-based methods have been quite popular wi...
Subband-based approaches process subbands in parallel through the model ...
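As a loose sketch of the subband idea in the entry above (the shapes, the shared recurrent core, and the fold-into-batch trick are assumptions, not a specific published model), the snippet below splits a magnitude spectrogram into equal-width subbands and runs one shared network over all of them in parallel.

    import torch
    import torch.nn as nn

    class SubbandModel(nn.Module):
        def __init__(self, n_freq=256, n_subbands=4, hidden=128):
            super().__init__()
            assert n_freq % n_subbands == 0
            self.n_subbands = n_subbands
            width = n_freq // n_subbands
            self.net = nn.GRU(width, hidden, batch_first=True)
            self.out = nn.Linear(hidden, width)

        def forward(self, spec):                     # spec: (B, T, n_freq)
            B, T, n_freq = spec.shape
            width = n_freq // self.n_subbands
            # Split the frequency axis into subbands and fold them into the
            # batch dimension so the shared network processes them in parallel.
            x = spec.view(B, T, self.n_subbands, width).permute(0, 2, 1, 3)
            x = x.reshape(B * self.n_subbands, T, width)
            y, _ = self.net(x)
            y = self.out(y).reshape(B, self.n_subbands, T, width)
            # Reassemble the subband outputs into a full-band spectrogram.
            return y.permute(0, 2, 1, 3).reshape(B, T, n_freq)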
Automatic dubbing, which generates a corresponding version of the input ...
The art of communication extends beyond speech to gestures. The automatic...
Music-driven 3D dance generation has become an intensive research topic ...
Due to the mismatch between the source and target domains, how to better...
Recent advances in text-to-speech have significantly improved the expres...
In recent years, In-context Learning (ICL) has gained increasing attenti...
Large pretrained language models (LMs) have shown impressive In-Context ...
With the increasing ability of large language models (LLMs), in-context ...
Despite the surprising few-shot performance of in-context learning (ICL)...
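For readers unfamiliar with the ICL setting referenced in the entries above, the snippet below shows the basic mechanism: a few labeled demonstrations are placed directly in the prompt and a frozen model completes the pattern, with no parameter updates. The sentiment task and example texts are invented purely for illustration.

    # Minimal few-shot in-context learning prompt (illustrative task and data).
    demonstrations = [
        ("The movie was a delight from start to finish.", "positive"),
        ("I regret wasting two hours on this film.", "negative"),
        ("An uneven plot, but the acting saves it.", "positive"),
    ]
    query = "The soundtrack was dull and the pacing worse."

    # Format each demonstration as an input/label pair, then append the query
    # with its label left blank for the model to fill in.
    prompt = "\n\n".join(
        f"Review: {text}\nSentiment: {label}" for text, label in demonstrations
    )
    prompt += f"\n\nReview: {query}\nSentiment:"

    print(prompt)  # send this prompt to any LLM; its completion is the prediction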
Explaining the black-box predictions of NLP models naturally and accurat...
FullSubNet has shown promising performance on speech enhancement by ...
In this paper, we present TrimTail, a simple but effective emission regu...
The recently proposed Conformer architecture, which combines convolution ...
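For context on the architecture named in the entry above, here is a simplified Conformer-style block (feed-forward, self-attention, and a depthwise-convolution module with half-step residuals); it approximates the commonly described layout but is not the exact published implementation, and all hyperparameters are placeholders.

    import torch
    import torch.nn as nn

    class ConformerBlock(nn.Module):
        # Simplified layout: 1/2 FFN -> self-attention -> conv module -> 1/2 FFN.
        def __init__(self, dim=256, heads=4, kernel_size=15, ff_mult=4):
            super().__init__()
            self.ff1 = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, ff_mult * dim),
                                     nn.SiLU(), nn.Linear(ff_mult * dim, dim))
            self.attn_norm = nn.LayerNorm(dim)
            self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.conv_norm = nn.LayerNorm(dim)
            self.pointwise1 = nn.Conv1d(dim, 2 * dim, 1)
            self.depthwise = nn.Conv1d(dim, dim, kernel_size,
                                       padding=kernel_size // 2, groups=dim)
            self.bn = nn.BatchNorm1d(dim)
            self.pointwise2 = nn.Conv1d(dim, dim, 1)
            self.act = nn.SiLU()
            self.ff2 = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, ff_mult * dim),
                                     nn.SiLU(), nn.Linear(ff_mult * dim, dim))
            self.final_norm = nn.LayerNorm(dim)

        def forward(self, x):                        # x: (B, T, dim)
            x = x + 0.5 * self.ff1(x)
            a = self.attn_norm(x)
            x = x + self.attn(a, a, a, need_weights=False)[0]
            c = self.conv_norm(x).transpose(1, 2)    # (B, dim, T) for Conv1d
            c = nn.functional.glu(self.pointwise1(c), dim=1)
            c = self.pointwise2(self.act(self.bn(self.depthwise(c))))
            x = x + c.transpose(1, 2)
            x = x + 0.5 * self.ff2(x)
            return self.final_norm(x)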
We propose an unsupervised learning method to disentangle speech into co...
Recently, dataset-generation-based zero-shot learning has shown promisin...
Recently, diffusion models have emerged as a new paradigm for generative...
Traditional training paradigms for extractive and abstractive summarizat...
This paper describes the ReprGesture entry to the Generation and Evaluat...
One-shot voice conversion (VC) with only a single target speaker's speec...
Ordinal regression with anchored reference samples (ORARS) has been prop...
Nowadays, owing to the superior capacity of large pre-trained langua...
Previous works on expressive speech synthesis focus on modelling the mon...
Deep neural networks have brought significant advancements to speech emo...
The accuracy of prosodic structure prediction is crucial to the naturaln...
Although deep learning and end-to-end models have been widely used and s...
Non-parallel data voice conversion (VC) has achieved considerable break...
Previous works on expressive speech synthesis mainly focus on current se...
Previously proposed FullSubNet has achieved outstanding performance in D...
Recently, there has been growing interest in dataset generation due to the su...
We propose a novel robust and efficient Speech-to-Animation (S2A) approa...