FastLTS: Non-Autoregressive End-to-End Unconstrained Lip-to-Speech Synthesis

07/08/2022
by Yongqi Wang, et al.

Unconstrained lip-to-speech synthesis aims to generate speech from silent videos of talking faces with no restriction on head pose or vocabulary. Current works mainly use sequence-to-sequence models to solve this problem, either with an autoregressive architecture or a flow-based non-autoregressive one. However, these models suffer from several drawbacks: 1) instead of generating audio directly, they use a two-stage pipeline that first generates mel-spectrograms and then reconstructs audio from the spectrograms, which complicates deployment and degrades speech quality through error propagation; 2) the audio reconstruction algorithm these models rely on limits both inference speed and audio quality, while neural vocoders are not applicable to them because their output spectrograms are not accurate enough; 3) the autoregressive model suffers from high inference latency, while the flow-based model has high memory occupancy: neither is efficient in both time and memory usage. To tackle these problems, we propose FastLTS, a non-autoregressive end-to-end model that directly synthesizes high-quality speech audio from unconstrained talking videos with low latency and a relatively small model size. In addition, departing from the widely used 3D-CNN visual frontend for lip-movement encoding, we propose, for the first time for this task, a transformer-based visual frontend. Experiments show that our model achieves a 19.76× speedup in audio waveform generation over the current autoregressive model on 3-second input sequences, while obtaining superior audio quality.
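The abstract does not spell out how a transformer-based visual frontend would be structured; as a rough illustration only, the sketch below shows one plausible shape in PyTorch: a small per-frame 2D CNN embeds each lip crop, and a transformer encoder then models the temporal lip dynamics. All names (`VisualFrontend`, `frame_embed`) and layer sizes are illustrative assumptions, not the authors' architecture.

```python
import torch
import torch.nn as nn

class VisualFrontend(nn.Module):
    """Hypothetical transformer-based visual frontend sketch:
    per-frame spatial embedding followed by temporal self-attention."""

    def __init__(self, d_model=256, n_heads=4, n_layers=4):
        super().__init__()
        # Per-frame spatial embedding (stand-in for a real patch/CNN embed)
        self.frame_embed = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=5, stride=2, padding=2),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(4),   # -> (64, 4, 4) per frame
            nn.Flatten(),
            nn.Linear(64 * 4 * 4, d_model),
        )
        # Temporal transformer over the frame sequence
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True
        )
        self.temporal = nn.TransformerEncoder(layer, n_layers)

    def forward(self, frames):
        # frames: (batch, time, channels, height, width)
        b, t = frames.shape[:2]
        x = self.frame_embed(frames.flatten(0, 1))  # (b*t, d_model)
        x = x.view(b, t, -1)                        # (b, t, d_model)
        return self.temporal(x)                     # (b, t, d_model)

frontend = VisualFrontend()
video = torch.randn(2, 12, 3, 48, 48)  # 2 clips of 12 lip frames each
feats = frontend(video)
print(feats.shape)  # torch.Size([2, 12, 256])
```

The point of such a frontend is that self-attention can relate distant frames directly, whereas a 3D-CNN frontend only aggregates lip motion over a fixed local temporal window.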


