Cross-Attention is Not Enough: Incongruity-Aware Hierarchical Multimodal Sentiment Analysis and Emotion Recognition

05/23/2023

∙

Fusing multiple modalities for affective computing tasks has proven effective for performance improvement. However, how multimodal fusion works is not well understood, and its use in the real world usually results in large model sizes. In this work, on sentiment and emotion analysis, we first analyze how the salient affective information in one modality can be affected by the other in crossmodal attention. We find that inter-modal incongruity exists at the latent level due to crossmodal attention. Based on this finding, we propose a lightweight model via Hierarchical Crossmodal Transformer with Modality Gating (HCT-MG), which determines a primary modality according to its contribution to the target task and then hierarchically incorporates auxiliary modalities to alleviate inter-modal incongruity and reduce information redundancy. The experimental evaluation on three benchmark datasets: CMU-MOSI, CMU-MOSEI, and IEMOCAP verifies the efficacy of our approach, showing that it: 1) achieves better performance than prior work as well as manual selection of the primary modality; 2) can recognize hard samples whose emotions are hard to tell; 3) mitigates the inter-modal incongruity at the latent level when modalities have mismatched affective tendencies; 4) reduces model size to less than 1M parameters while outperforming existing models of similar sizes.

READ FULL TEXT

Cross-Attention is Not Enough: Incongruity-Aware Hierarchical Multimodal Sentiment Analysis and Emotion Recognition

Sign in with Google

Consider DeepAI Pro