Zero-shot text-to-speech aims at synthesizing voices with unseen speech ...
Cross-lingual timbre and style generalizable text-to-speech (TTS) aims t...
Conversational recommender systems (CRSs) have become crucial emerging r...
End-to-end (E2E) systems have shown comparable performance to hybrid sys...
Scaling text-to-speech to a large and wild dataset has been proven to be...
We are interested in a novel task, namely low-resource text-to-talking a...
Large diffusion models have been successful in text-to-audio (T2A) synth...
Direct speech-to-speech translation (S2ST) has gradually become popular ...
Speech fluency/disfluency can be evaluated by analyzing a range of phone...
Generating talking person portraits with arbitrary speech audio is a cru...
Large-scale Language Models (LLMs) are constrained by their inability to...
Deep learning based methods have become a paradigm for cover song identi...
As a key component of automated speech recognition (ASR) and the front-e...
Recent studies on pronunciation scoring have explored the effect of intr...
Speech-to-speech translation directly translates a speech utterance to a...
In multi-talker scenarios such as meetings and conversations, speech pro...
The ASR model deployment environment is ever-changing, and the incoming spee...
One of the limitations in end-to-end automatic speech recognition framew...
Unsupervised video domain adaptation is a practical yet challenging task...
Speaker change detection is an important task in multi-party interaction...
A Chinese dialect text-to-speech (TTS) system usually can only be utilized ...
Though achieving impressive results on many NLP tasks, the BERT-like mas...
Unsupervised cross-lingual speech representation learning (XLSR) has rec...
Deep learning-based pronunciation scoring models highly rely on the avai...
In this paper, we propose S3T, a self-supervised pre-training method wit...
This paper describes our submission to ICASSP 2022 Multi-channel Multi-p...
Audio classification is an important task of mapping audio samples into ...
Nowadays, most methods in end-to-end contextual speech recognition bias ...
An end-to-end (E2E) speech recognition model implicitly learns a biased ...
Deep learning techniques for separating audio into different sound sourc...
Clothes style transfer for person video generation is a challenging task...
Recently, phonetic posteriorgrams (PPGs) based methods have been quite p...
In expressive speech synthesis, there are high requirements for emotion ...
In the recent trend of semi-supervised speech recognition, both self-sup...
Many previous audio-visual voice-related works focus on speech, ignoring...
This work describes an encoder pre-training procedure using frame-wise l...
Recurrent neural network transducer (RNN-T) is a promising end-to-end (E2E) mode...
In this work we propose an inference technique, asynchronous revision, t...
Singing voice conversion (SVC) aims to convert the voice of one singer t...
Detecting an anchor's voice in live musical streams is an important preproc...
We present in this paper ByteCover, which is a new feature learning meth...
The rise of video-sharing platforms has attracted more and more people t...
Accent conversion (AC) transforms a non-native speaker's accent into a n...
This paper presents ByteSing, a Chinese singing voice synthesis (SVS) sy...
In this paper, we propose a hybrid text normalization system using multi...
In a Mandarin text-to-speech (TTS) system, the front-end text processing m...
Frame stacking is broadly applied in end-to-end neural network training ...
Recurrent neural networks (RNNs), especially long short-term memory (LST...
With the rapid growth of training data, large-scale parallel training with multi-...