Face-Dubbing++: Lip-Synchronous, Voice Preserving Translation of Videos

06/09/2022
by   Alexander Waibel, et al.

In this paper, we propose a neural end-to-end system for voice-preserving, lip-synchronous translation of videos. The system combines multiple component models to produce a video of the original speaker speaking in the target language that is lip-synchronous with the target speech, yet maintains the emphases in speech, the voice characteristics, and the face video of the original speaker. The pipeline starts with automatic speech recognition, including emphasis detection, followed by a translation model. The translated text is then synthesized by a text-to-speech model that recreates the original emphases mapped from the original sentence. The resulting synthetic voice is then mapped back to the original speaker's voice using a voice conversion model. Finally, to synchronize the lips of the speaker with the translated audio, a model based on a conditional generative adversarial network generates frames of adapted lip movements, conditioned on the input face image and on the output of the voice conversion model. The system then combines the generated video with the converted audio to produce the final output. The result is a video of a speaker speaking a language that the speaker does not actually know. To evaluate our design, we present a user study of the complete system as well as separate evaluations of the individual components. Because no existing dataset is available for evaluating the whole system, we collect a test set and evaluate our system on it. The results indicate that our system is able to generate convincing videos of the original speaker speaking the target language while preserving the original speaker's characteristics. The collected dataset will be shared.
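The order in which the abstract chains its components can be sketched as a plain composition of stages. This is only an illustration of the data flow, not the authors' implementation: the class `DubbingPipeline`, the `Transcript` type, the stage names, and the stub stages built by `make_stub_pipeline` are all hypothetical, and the stubs return placeholder values instead of running any model.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Transcript:
    words: List[str]
    emphasized: List[int]  # indices into `words` that carry emphasis


@dataclass
class DubbingPipeline:
    # Each field is a hypothetical stand-in for one component model.
    recognize: Callable[[bytes], Transcript]                # ASR with emphasis detection
    translate: Callable[[Transcript], Transcript]           # translation, emphasis carried over
    synthesize: Callable[[Transcript], bytes]               # TTS that recreates emphasis
    convert_voice: Callable[[bytes, bytes], bytes]          # (synthetic, reference) -> converted
    sync_lips: Callable[[List[bytes], bytes], List[bytes]]  # (frames, audio) -> new frames

    def run(self, frames: List[bytes], audio: bytes) -> Tuple[List[bytes], bytes]:
        source = self.recognize(audio)
        target = self.translate(source)
        synthetic = self.synthesize(target)
        # The original audio serves as the reference for the speaker's voice.
        converted = self.convert_voice(synthetic, audio)
        dubbed_frames = self.sync_lips(frames, converted)
        # Final output: generated video frames paired with the converted audio.
        return dubbed_frames, converted


def make_stub_pipeline(trace: List[str]) -> DubbingPipeline:
    """Build a pipeline of placeholder stages that only record their call order."""

    def recognize(audio: bytes) -> Transcript:
        trace.append("asr")
        return Transcript(["hello", "world"], [1])

    def translate(source: Transcript) -> Transcript:
        trace.append("mt")
        # Placeholder output; copying indices assumes a one-to-one word alignment.
        return Transcript(["hallo", "welt"], list(source.emphasized))

    def synthesize(target: Transcript) -> bytes:
        trace.append("tts")
        return " ".join(target.words).encode()

    def convert_voice(synthetic: bytes, reference: bytes) -> bytes:
        trace.append("vc")
        return synthetic

    def sync_lips(frames: List[bytes], audio: bytes) -> List[bytes]:
        trace.append("lipsync")
        return list(frames)

    return DubbingPipeline(recognize, translate, synthesize, convert_voice, sync_lips)
```

Calling `make_stub_pipeline(trace).run(frames, audio)` with an empty list `trace` fills it with the stage order described above: recognition, translation, synthesis, voice conversion, lip synchronization. Keeping each stage behind a callable is one possible design choice that would allow the components to be evaluated separately, as the abstract reports doing.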


research
11/06/2020

Large-scale multilingual audio visual dubbing

We describe a system for large-scale audiovisual translation and dubbing...
research
09/20/2023

TRAVID: An End-to-End Video Translation Framework

In today's globalized world, effective communication with people from di...
research
02/27/2020

SkinAugment: Auto-Encoding Speaker Conversions for Automatic Speech Translation

We propose autoencoding speaker conversion for training data augmentatio...
research
10/13/2022

Pre-Avatar: An Automatic Presentation Generation Framework Leveraging Talking Avatar

Since the beginning of the COVID-19 pandemic, remote conferencing and sc...
research
07/19/2021

Translatotron 2: Robust direct speech-to-speech translation

We present Translatotron 2, a neural direct speech-to-speech translation...
research
03/01/2020

Towards Automatic Face-to-Face Translation

In light of the recent breakthroughs in automatic machine translation sy...
research
10/13/2016

A Survey of Voice Translation Methodologies - Acoustic Dialect Decoder

Speech Translation has always been about giving source text or audio inp...
