Joint Audio-Text Model for Expressive Speech-Driven 3D Facial Animation

by Yingruo Fan, et al.

Speech-driven 3D facial animation with accurate lip synchronization has been widely studied. However, synthesizing realistic motions for the entire face during speech has rarely been explored. In this work, we present a joint audio-text model that captures contextual information for expressive speech-driven 3D facial animation. Existing datasets are collected to cover as many different phonemes as possible rather than sentences, which limits the ability of audio-based models to learn more diverse contexts. To address this, we propose to leverage contextual text embeddings extracted from a powerful pre-trained language model that has learned rich contextual representations from large-scale text data. Our hypothesis is that the text features can disambiguate the variations in upper-face expressions, which are not strongly correlated with the audio. In contrast to prior approaches, which learn phoneme-level features from the text, we investigate high-level contextual text features for speech-driven 3D facial animation. We show that combining the acoustic and textual modalities can synthesize realistic facial expressions while maintaining audio-lip synchronization. We conduct quantitative and qualitative evaluations as well as a perceptual user study. The results demonstrate that our model outperforms existing state-of-the-art approaches.
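As a rough illustration of what combining acoustic and textual features can look like, the sketch below lets each audio frame attend over a set of text-token embeddings and appends the attended text context to the frame's audio feature. The abstract does not specify the fusion mechanism, so the scaled dot-product attention, the function name fuse_audio_text, and the toy 2-dimensional features are all assumptions made for illustration, not the authors' architecture.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def fuse_audio_text(audio_feats, text_feats):
    """Hypothetical fusion step (an assumption, not the paper's method).

    For each audio frame, compute scaled dot-product attention weights over
    the text-token embeddings, form the weighted text context, and
    concatenate it onto the frame's audio feature.
    """
    dim = len(text_feats[0])
    scale = math.sqrt(dim)
    fused = []
    for frame in audio_feats:
        weights = softmax([dot(frame, tok) / scale for tok in text_feats])
        context = [
            sum(w * tok[d] for w, tok in zip(weights, text_feats))
            for d in range(dim)
        ]
        fused.append(list(frame) + context)
    return fused


# Toy inputs: 3 audio frames and 2 text tokens, all 2-dimensional.
audio = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text = [[1.0, 0.0], [0.0, 1.0]]
fused = fuse_audio_text(audio, text)
```

In a real system the audio features would come from a speech encoder and the text embeddings from a pre-trained language model, and the fused per-frame vectors would feed a decoder that predicts facial motion. Those components are omitted here.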




MeshTalk: 3D Face Animation from Speech using Cross-Modality Disentanglement

This paper presents a generic method for generating full facial 3D anima...

CodeTalker: Speech-Driven 3D Facial Animation with Discrete Motion Prior

Speech-driven 3D facial animation has been widely studied, yet there is ...

Text/Speech-Driven Full-Body Animation

Due to the increasing demand in films and games, synthesizing 3D avatar ...

FaceFormer: Speech-Driven 3D Facial Animation with Transformers

Speech-driven 3D facial animation is challenging due to the complex geom...

FaceXHuBERT: Text-less Speech-driven E(X)pressive 3D Facial Animation Synthesis Using Self-Supervised Speech Representation Learning

This paper presents FaceXHuBERT, a text-less speech-driven 3D facial ani...

Talking Head Generation with Probabilistic Audio-to-Visual Diffusion Priors

In this paper, we introduce a simple and novel framework for one-shot au...

Unsupervised Learning of Style-Aware Facial Animation from Real Acting Performances

This paper presents a novel approach for text/speech-driven animation of...
