Exploring the Contextual Dynamics of Multimodal Emotion Recognition in Videos

04/28/2020
by   Prasanta Bhattacharya, et al.

Emotional expressions form a key part of user behavior on today's digital platforms. While multimodal emotion recognition techniques are gaining research attention, we lack a deeper understanding of why visual and non-visual features help in recognizing emotions in certain contexts but not others. This study analyzes the interplay between multimodal emotion features derived from facial expressions, tone, and text and two key contextual factors: 1) the gender of the speaker, and 2) the duration of the emotional episode. Using a large dataset of more than 2,500 manually annotated YouTube videos, we found that while multimodal features consistently outperformed bimodal and unimodal features, their performance varied significantly across emotions, genders, and durations. Multimodal features performed notably better for male than for female speakers in recognizing most emotions, with the exception of fear. Furthermore, multimodal features performed better for shorter than for longer videos in recognizing neutrality, happiness, and surprise, but not sadness, anger, disgust, or fear. These findings offer new insights toward the development of more context-aware emotion recognition and empathetic systems.
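The multimodal-versus-unimodal comparison described above can be illustrated with a minimal sketch of feature-level (early) fusion, in which per-modality feature vectors are concatenated into one representation before classification. All names, dimensions, and random placeholder features here are hypothetical for illustration; they are not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-modality feature vectors for a single video clip.
# Dimensions are illustrative only, not from the study.
face_feats = rng.normal(size=64)   # e.g. facial-expression embeddings
tone_feats = rng.normal(size=32)   # e.g. prosodic/acoustic descriptors
text_feats = rng.normal(size=128)  # e.g. transcript embeddings

def fuse(*modalities):
    """Early (feature-level) fusion: concatenate per-modality
    vectors into a single vector for a downstream classifier."""
    return np.concatenate(modalities)

multimodal = fuse(face_feats, tone_feats, text_feats)  # all three
bimodal = fuse(face_feats, tone_feats)                 # drop text
unimodal = face_feats                                  # face only

print(multimodal.shape)  # (224,)
```

A study such as this one would train the same classifier on each fused representation and compare recognition accuracy per emotion and per context (speaker gender, video duration).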

