Integrating Holistic and Local Information to Estimate Emotional Reaction Intensity

05/09/2023
by Yini Fang, et al.

Video-based Emotional Reaction Intensity (ERI) estimation measures the intensity of subjects' reactions to stimuli along several emotional dimensions from videos of the subjects as they view the stimuli. We propose a multi-modal architecture for video-based ERI that combines video and audio information. Video input is first encoded spatially, frame by frame, combining features that capture holistic aspects of the subjects' facial expressions with features that capture spatially localized aspects of those expressions. The frame features are then combined across time: frame to frame by gated recurrent units (GRUs), then globally by a transformer. We handle variable video length with a regression token that accumulates information from all frames into a fixed-dimensional vector independent of video length. Audio information is handled similarly: spectral features extracted within each frame are integrated across time by a cascade of GRUs and a transformer with its own regression token. The outputs of the video and audio regression tokens are merged by concatenation, then passed to a final fully connected layer that produces the intensity estimates. Our architecture achieved excellent performance on the Hume-Reaction dataset in the ERI Estimation Challenge of the Fifth Competition on Affective Behavior Analysis in-the-Wild (ABAW5). The Pearson correlation coefficients between estimated and subject self-reported scores, averaged across all emotions, were 0.455 on the validation dataset and 0.4547 on the test dataset, well above the baselines. The transformer's self-attention mechanism enables the architecture to focus on the most critical video frames regardless of video length. Ablation experiments establish the advantages of combining holistic and local features and of multi-modal integration. Code is available at https://github.com/HKUST-NISL/ABAW5.
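To make the described pipeline concrete, the following PyTorch sketch shows the shared pattern of per-frame features integrated by a GRU, then a transformer with a prepended regression token, applied to both modalities and fused by concatenation before a fully connected output layer. It is an illustrative reconstruction under assumptions: the feature dimensions, hidden sizes, layer counts, and the seven-dimensional output (matching the Hume-Reaction emotion set) are guesses, not the authors' exact implementation (see the linked repository for that).

```python
# Minimal sketch of the GRU + transformer-with-regression-token design.
# All dimensions and hyperparameters below are assumptions for illustration.
import torch
import torch.nn as nn


class TemporalBranch(nn.Module):
    """Per-frame features -> GRU -> transformer encoder with a regression token."""

    def __init__(self, feat_dim, hidden_dim=256, n_heads=4, n_layers=2):
        super().__init__()
        self.gru = nn.GRU(feat_dim, hidden_dim, batch_first=True)
        self.reg_token = nn.Parameter(torch.zeros(1, 1, hidden_dim))
        enc_layer = nn.TransformerEncoderLayer(
            d_model=hidden_dim, nhead=n_heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(enc_layer, num_layers=n_layers)

    def forward(self, x):                       # x: (B, T, feat_dim), T may vary
        h, _ = self.gru(x)                      # frame-to-frame integration
        tok = self.reg_token.expand(x.size(0), -1, -1)
        h = torch.cat([tok, h], dim=1)          # prepend the regression token
        h = self.transformer(h)                 # global self-attention over frames
        return h[:, 0]                          # fixed-size summary, any length


class ERIModel(nn.Module):
    def __init__(self, holistic_dim=512, local_dim=512, audio_dim=128,
                 hidden_dim=256, num_emotions=7):
        super().__init__()
        # Holistic and spatially local face features are concatenated per frame.
        self.video_branch = TemporalBranch(holistic_dim + local_dim, hidden_dim)
        self.audio_branch = TemporalBranch(audio_dim, hidden_dim)
        self.head = nn.Linear(2 * hidden_dim, num_emotions)

    def forward(self, holistic, local, audio):
        # holistic/local: (B, T_v, D); audio: (B, T_a, audio_dim)
        v = self.video_branch(torch.cat([holistic, local], dim=-1))
        a = self.audio_branch(audio)
        return self.head(torch.cat([v, a], dim=-1))   # per-emotion intensities


# Example: 2 clips, 90 face-feature frames, 300 audio spectral frames.
model = ERIModel()
out = model(torch.randn(2, 90, 512), torch.randn(2, 90, 512),
            torch.randn(2, 300, 128))
print(out.shape)  # torch.Size([2, 7])
```

Because the regression token attends to every frame in the transformer, the summary vector has the same size whether a clip contributes dozens or thousands of frames, which is how the design handles variable-length video and audio without padding to a fixed length.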


