Learning Hierarchical Cross-Modal Association for Co-Speech Gesture Generation

03/24/2022
by Xian Liu, et al.

Generating speech-consistent body and gesture movements is a long-standing problem in virtual avatar creation. Previous studies often synthesize pose movement in a holistic manner, where the poses of all joints are generated simultaneously. Such a straightforward pipeline fails to generate fine-grained co-speech gestures. One observation is that the hierarchical semantics of speech and the hierarchical structure of human gestures can both be naturally described at multiple granularities and associated with each other. To fully utilize the rich connections between speech audio and human gestures, we propose a novel framework named Hierarchical Audio-to-Gesture (HA2G) for co-speech gesture generation. In HA2G, a Hierarchical Audio Learner extracts audio representations across semantic granularities. A Hierarchical Pose Inferer subsequently renders the entire human pose gradually in a hierarchical manner. To enhance the quality of synthesized gestures, we develop a contrastive learning strategy based on audio-text alignment for better audio representations. Extensive experiments and human evaluation demonstrate that the proposed method renders realistic co-speech gestures and outperforms previous methods by a clear margin. Project page: https://alvinliu0.github.io/projects/HA2G


