Interpretation on Multi-modal Visual Fusion

08/19/2023
by   Hao Chen, et al.
0

In this paper, we present an analytical framework and a novel metric to shed light on the interpretation of the multimodal vision community. Our approach involves measuring the proposed semantic variance and feature similarity across modalities and levels, and conducting semantic and quantitative analyses through comprehensive experiments. Specifically, we investigate the consistency and speciality of representations across modalities, evolution rules within each modality, and the collaboration logic used when optimizing a multi-modality model. Our studies reveal several important findings, such as the discrepancy in cross-modal features and the hybrid multi-modal cooperation rule, which highlights consistency and speciality simultaneously for complementary inference. Through our dissection and findings on multi-modal fusion, we facilitate a rethinking of the reasonability and necessity of popular multi-modal vision fusion strategies. Furthermore, our work lays the foundation for designing a trustworthy and universal multi-modal fusion model for a variety of tasks in the future.

READ FULL TEXT

page 3

page 6

page 7

research
05/02/2023

On Uni-Modal Feature Learning in Supervised Multi-Modal Learning

We abstract the features (i.e. learned representations) of multi-modal d...
research
12/08/2021

FLAVA: A Foundational Language And Vision Alignment Model

State-of-the-art vision and vision-and-language models rely on large-sca...
research
05/17/2021

Multi-modal Visual Place Recognition in Dynamics-Invariant Perception Space

Visual place recognition is one of the essential and challenging problem...
research
10/06/2021

Efficient Multi-Modal Embeddings from Structured Data

Multi-modal word semantics aims to enhance embeddings with perceptual in...
research
06/15/2020

Survey on Deep Multi-modal Data Analytics: Collaboration, Rivalry and Fusion

With the development of web technology, multi-modal or multi-view data h...
research
08/02/2021

Multimodal Feature Fusion for Video Advertisements Tagging Via Stacking Ensemble

Automated tagging of video advertisements has been a critical yet challe...
research
10/11/2022

Cross-modal Search Method of Technology Video based on Adversarial Learning and Feature Fusion

Technology videos contain rich multi-modal information. In cross-modal i...

Please sign up or login with your details

Forgot password? Click here to reset