Continuous Emotion Recognition with Audio-visual Leader-follower Attentive Fusion

07/02/2021
by   Su Zhang, et al.

We propose an audio-visual spatial-temporal deep neural network with: (1) a visual block containing a pretrained 2D-CNN followed by a temporal convolutional network (TCN); (2) an aural block containing several parallel TCNs; and (3) a leader-follower attentive fusion block combining the audio-visual information. The TCN's large history coverage enables our model to exploit spatial-temporal information within a much larger window length (i.e., 300 frames) than the baseline and state-of-the-art methods (i.e., 36 or 48). The fusion block emphasizes the visual modality while exploiting the noisy aural modality via an inter-modality attention mechanism. To make full use of the data and alleviate over-fitting, cross-validation is carried out over the combined training and validation sets, and concordance correlation coefficient (CCC) centering is used to merge the predictions from each fold. On the test (validation) set of the Aff-Wild2 database, the achieved CCC is 0.463 (0.469) for valence and 0.492 (0.649) for arousal, significantly outperforming the baseline method, whose corresponding CCCs are 0.200 (0.210) for valence and 0.190 (0.230) for arousal. The code is available at https://github.com/sucv/ABAW2.
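The CCC used both as the evaluation metric and in the per-fold merging step is a standard measure of agreement between a predicted and a ground-truth time series. As a reference point (the function name `ccc` is our own; the paper's actual evaluation code lives in the linked repository), a minimal NumPy implementation of the coefficient is:

```python
import numpy as np

def ccc(x, y):
    """Concordance correlation coefficient between prediction x and label y.

    CCC = 2*cov(x, y) / (var(x) + var(y) + (mean(x) - mean(y))**2)
    It equals 1 only for perfect agreement, penalizing both scale and
    location shifts, unlike plain Pearson correlation.
    """
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    x_mean, y_mean = x.mean(), y.mean()
    cov = ((x - x_mean) * (y - y_mean)).mean()
    return 2 * cov / (x.var() + y.var() + (x_mean - y_mean) ** 2)
```

A prediction that tracks the labels but with a constant offset scores below 1, which is why centering the per-fold predictions before merging (as the paper does) can raise the merged CCC.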


Related research:

- 03/24/2022 · Continuous Emotion Recognition using Visual-audio-linguistic information: A Technical Report for ABAW3. "We propose a cross-modal co-attention model for continuous emotion recog..."
- 03/18/2023 · Multimodal Continuous Emotion Recognition: A Technical Report for ABAW5. "We used two multimodal models for continuous valence-arousal recognition..."
- 08/06/2020 · Attentive Fusion Enhanced Audio-Visual Encoding for Transformer Based Robust Speech Recognition. "Audio-visual information fusion enables a performance improvement in spe..."
- 06/12/2021 · Multi-level Attention Fusion Network for Audio-visual Event Recognition. "Event classification is inherently sequential and multimodal. Therefore,..."
- 06/18/2017 · 3D Convolutional Neural Networks for Cross Audio-Visual Matching Recognition. "Audio-visual recognition (AVR) has been considered as a solution for spe..."
- 09/09/2022 · Learning Audio-Visual embedding for Person Verification in the Wild. "It has already been observed that audio-visual embedding is more robust ..."
- 07/11/2020 · Look and Listen: A Multi-modality Late Fusion Approach to Scene Classification for Autonomous Machines. "The novelty of this study consists in a multi-modality approach to scene..."
