Everybody's Talkin': Let Me Talk as You Want

by Linsen Song, et al.

We present a method to edit target portrait footage by taking a sequence of audio as input to synthesize a photo-realistic video. The method is unique in being highly dynamic: it does not assume a person-specific rendering network, yet it is capable of translating arbitrary source audio into arbitrary video output. Instead of learning a highly heterogeneous and nonlinear mapping from audio to video directly, we first factorize each target video frame into orthogonal parameter spaces, i.e., expression, geometry, and pose, via monocular 3D face reconstruction. Next, a recurrent network is introduced to translate source audio into expression parameters that are primarily related to the audio content. The audio-translated expression parameters are then used to synthesize a photo-realistic human subject in each video frame, with the movement of the mouth regions precisely mapped to the source audio. The geometry and pose parameters of the target human portrait are retained, thereby preserving the context of the original video footage. Finally, we introduce a novel video rendering network and a dynamic programming method to construct a temporally coherent and photo-realistic video. Extensive experiments demonstrate the superiority of our method over existing approaches. Our method is end-to-end learnable and robust to voice variations in the source audio.
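The factorization step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the `FaceParams` container, the `retarget_expressions` helper, and the parameter sizes are hypothetical names and values chosen for illustration. It only shows the idea of replacing each frame's expression parameters with audio-derived ones while keeping geometry and pose untouched.

```python
from dataclasses import dataclass, replace
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class FaceParams:
    """Hypothetical per-frame parameters from a 3D face reconstruction."""
    expression: Tuple[float, ...]  # changes with speech content
    geometry: Tuple[float, ...]    # identity/shape of the target person
    pose: Tuple[float, ...]        # head rotation and translation


def retarget_expressions(
    target_frames: Sequence[FaceParams],
    audio_expressions: Sequence[Sequence[float]],
) -> List[FaceParams]:
    """Swap in audio-derived expressions; keep geometry and pose as they are."""
    if len(target_frames) != len(audio_expressions):
        raise ValueError("need one expression vector per target frame")
    return [
        replace(frame, expression=tuple(expr))
        for frame, expr in zip(target_frames, audio_expressions)
    ]


# Toy data: two target frames and two expression vectors that, in the real
# pipeline, would come from the audio-to-expression recurrent network.
frames = [
    FaceParams(expression=(0.0, 0.0), geometry=(1.0, 2.0), pose=(0.1, 0.2, 0.3)),
    FaceParams(expression=(0.5, 0.5), geometry=(1.0, 2.0), pose=(0.1, 0.2, 0.4)),
]
predicted = [[0.9, 0.1], [0.2, 0.8]]
edited = retarget_expressions(frames, predicted)
```

The edited parameters would then be passed to a renderer; the rendering network and the dynamic programming step for temporal coherence are outside the scope of this sketch.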




Neural Voice Puppetry: Audio-driven Facial Reenactment

We present Neural Voice Puppetry, a novel approach for audio-driven faci...

VideoReTalking: Audio-based Lip Synchronization for Talking Head Video Editing In the Wild

We present VideoReTalking, a new system to edit the faces of a real-worl...

Facial Keypoint Sequence Generation from Audio

Whenever we speak, our voice is accompanied by facial movements and expr...

Neural Relighting and Expression Transfer On Video Portraits

Photo-realistic video portrait reenactment benefits virtual production a...

Text-based Editing of Talking-head Video

Editing talking-head video to change the speech content or to remove fil...

Robust Pose Transfer with Dynamic Details using Neural Video Rendering

Pose transfer of human videos aims to generate a high fidelity video of ...

Dynamic Neural Portraits

We present Dynamic Neural Portraits, a novel approach to the problem of ...
