Cross-Attention is Not Enough: Incongruity-Aware Hierarchical Multimodal Sentiment Analysis and Emotion Recognition

by Yaoting Wang, et al.

Fusing multiple modalities has proven effective for improving performance on affective computing tasks. However, how multimodal fusion works is not well understood, and its real-world use usually results in large model sizes. In this work, on sentiment and emotion analysis, we first analyze how the salient affective information in one modality can be affected by the other in crossmodal attention. We find that inter-modal incongruity exists at the latent level due to crossmodal attention. Based on this finding, we propose a lightweight model, the Hierarchical Crossmodal Transformer with Modality Gating (HCT-MG), which determines a primary modality according to its contribution to the target task and then hierarchically incorporates auxiliary modalities to alleviate inter-modal incongruity and reduce information redundancy. Experimental evaluation on three benchmark datasets (CMU-MOSI, CMU-MOSEI, and IEMOCAP) verifies the efficacy of our approach, showing that it: 1) achieves better performance than prior work as well as manual selection of the primary modality; 2) can recognize hard samples whose emotions are difficult to distinguish; 3) mitigates inter-modal incongruity at the latent level when modalities have mismatched affective tendencies; 4) reduces model size to less than 1M parameters while outperforming existing models of similar size.
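The two ideas the abstract names, a gate that picks a primary modality by its task contribution and a hierarchical crossmodal stage where that primary modality attends to auxiliary ones, can be sketched as follows. This is a minimal illustration in PyTorch, not the paper's actual architecture: the class names, the mean-pooled linear scoring, and all dimensions are assumptions made here for clarity.

```python
import torch
import torch.nn as nn


class ModalityGate(nn.Module):
    """Hypothetical sketch: score each modality's contribution and
    select a per-sample primary modality (the gating idea, simplified)."""

    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, feats):
        # feats: list of (batch, seq_len, dim) tensors, one per modality.
        # Pool each modality over time, then score its contribution.
        scores = torch.stack(
            [self.score(f.mean(dim=1)) for f in feats], dim=1
        ).squeeze(-1)                       # (batch, n_modalities)
        primary = scores.argmax(dim=1)      # index of primary modality
        return primary, scores.softmax(dim=1)


class HierarchicalCrossmodalBlock(nn.Module):
    """Hypothetical sketch: the primary modality attends to one
    auxiliary modality; stacking such blocks gives the hierarchy."""

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, primary, auxiliary):
        # Query comes from the primary modality; keys/values from the
        # auxiliary one. The residual keeps the primary signal dominant,
        # which is one way to limit incongruity from the auxiliary stream.
        fused, _ = self.attn(primary, auxiliary, auxiliary)
        return self.norm(primary + fused)


# Usage with random text/audio/vision features of shape (batch, seq, dim).
t, a, v = (torch.randn(2, 5, 32) for _ in range(3))
gate = ModalityGate(32)
primary_idx, weights = gate([t, a, v])
block = HierarchicalCrossmodalBlock(32)
fused = block(t, a)  # e.g. text as primary, audio as auxiliary
```

Incorporating auxiliary modalities one at a time, rather than concatenating everything into a single attention pass, is what keeps such a design small and is broadly consistent with the sub-1M-parameter claim, though the real HCT-MG details differ.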


