Speech Driven Video Editing via an Audio-Conditioned Diffusion Model

01/10/2023
by   Dan Bigioi, et al.

In this paper we propose a method for end-to-end speech driven video editing using a denoising diffusion model. Given a video of a person speaking, we aim to re-synchronise the lip and jaw motion of the person in response to a separate auditory speech recording, without relying on intermediate structural representations such as facial landmarks or a 3D face model. We show this is possible by conditioning a denoising diffusion model on audio spectral features to generate synchronised facial motion. We achieve convincing results on the task of unstructured single-speaker video editing, attaining a word error rate of 45%, and demonstrate how our approach can be extended to the multi-speaker domain. To our knowledge, this is the first work to explore the feasibility of applying denoising diffusion models to the task of audio-driven video editing.
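The core idea of the abstract, conditioning each reverse diffusion step on audio spectral features rather than on landmarks or a 3D face model, can be illustrated with a toy sketch. The dimensions, schedule, and the stand-in denoiser below are illustrative assumptions, not the paper's architecture; a real model would be a trained network that fuses mel-spectrogram features with the noisy frame (e.g. via concatenation or cross-attention).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions: one 16x16 grayscale mouth-region frame, 80-bin mel features.
FRAME_SHAPE = (16, 16)
N_MELS = 80
T = 100  # number of diffusion steps

# Linear beta schedule, as in the standard DDPM formulation.
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def q_sample(x0, t, noise):
    """Forward process: noise a clean frame to diffusion step t."""
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * noise

def toy_denoiser(x_t, t, audio_feat):
    """Stand-in for the audio-conditioned noise-prediction network.
    Here we merely mix in a scalar projection of the audio features so
    the conditioning path is visible; this is a placeholder, not a model."""
    audio_bias = audio_feat.mean() * np.ones_like(x_t)
    return 0.9 * x_t + 0.1 * audio_bias  # "predicted" noise

def p_sample_step(x_t, t, audio_feat):
    """One reverse (denoising) step, conditioned on the audio features."""
    eps_hat = toy_denoiser(x_t, t, audio_feat)
    coef = betas[t] / np.sqrt(1.0 - alpha_bars[t])
    mean = (x_t - coef * eps_hat) / np.sqrt(alphas[t])
    if t > 0:
        mean = mean + np.sqrt(betas[t]) * rng.standard_normal(x_t.shape)
    return mean

x0 = rng.standard_normal(FRAME_SHAPE)   # "clean" frame (random stand-in)
mel = rng.standard_normal(N_MELS)       # audio spectral features (random stand-in)
x = q_sample(x0, T - 1, rng.standard_normal(FRAME_SHAPE))

# Run the full reverse chain: every step sees the same audio conditioning.
for t in reversed(range(T)):
    x = p_sample_step(x, t, mel)
```

Because the audio features enter every denoising step, the generated frame is shaped by the speech signal throughout sampling, which is what lets the model re-synchronise mouth motion without an explicit structural intermediate.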


Related research

10/16/2021 · Intelligent Video Editing: Incorporating Modern Talking Face Generation Algorithms in a Video Editor
This paper proposes a video editor based on OpenShot with several state-...

06/05/2022 · Zero-Shot Voice Conditioning for Denoising Diffusion TTS Models
We present a novel way of conditioning a pretrained denoising diffusion ...

12/06/2022 · Diffusion Video Autoencoders: Toward Temporally Consistent Face Video Editing via Disentangled Video Encoding
Inspired by the impressive performance of recent face image editing meth...

06/14/2019 · Video-Driven Speech Reconstruction using Generative Adversarial Networks
Speech is a means of communication which relies on both audio and visual...

10/06/2021 · EdiTTS: Score-based Editing for Controllable Text-to-Speech
We present EdiTTS, an off-the-shelf speech editing methodology based on ...

08/18/2023 · StableVideo: Text-driven Consistency-aware Diffusion Video Editing
Diffusion-based methods can generate realistic images and videos, but th...

09/14/2023 · DiffTalker: Co-driven audio-image diffusion for talking faces via intermediate landmarks
Generating realistic talking faces is a complex and widely discussed tas...
