SVTS: Scalable Video-to-Speech Synthesis

05/04/2022
by   Rodrigo Mira, et al.

Video-to-speech synthesis (also known as lip-to-speech) refers to the translation of silent lip movements into the corresponding audio. This task has received an increasing amount of attention due to its self-supervised nature (i.e., it can be trained without manual labelling) combined with the ever-growing collection of audio-visual data available online. Despite these strong motivations, contemporary video-to-speech works focus mainly on small- to medium-sized corpora with substantial constraints in both vocabulary and setting. In this work, we introduce a scalable video-to-speech framework consisting of two components: a video-to-spectrogram predictor and a pre-trained neural vocoder, which converts the mel-frequency spectrograms into waveform audio. We achieve state-of-the-art results for GRID and considerably outperform previous approaches on LRW. More importantly, by focusing on spectrogram prediction using a simple feedforward model, we can efficiently and effectively scale our method to very large and unconstrained datasets: to the best of our knowledge, we are the first to show intelligible results on the challenging LRS3 dataset.
