Learnable Irrelevant Modality Dropout for Multimodal Action Recognition on Modality-Specific Annotated Videos

by Saghir Alfasly, et al.

Current multimodal methods assume that a video dataset is annotated with multiple modalities, i.e., that both the auditory and visual modalities are labeled or class-relevant, and apply modality fusion or cross-modality attention accordingly. However, effectively leveraging the audio modality for action recognition on videos with vision-specific annotations is particularly challenging. To tackle this challenge, we propose a novel audio-visual framework that effectively leverages the audio modality in any solely vision-annotated dataset. We adopt language models (e.g., BERT) to build a semantic audio-video label dictionary (SAVLD) that maps each video label to its K most relevant audio labels; SAVLD thus serves as a bridge between audio and video datasets. SAVLD, together with a pretrained multi-label audio model, is then used to estimate audio-visual modality relevance during training. Accordingly, we propose a novel learnable irrelevant modality dropout (IMD) that completely drops out the irrelevant audio modality and fuses only the relevant modalities. Moreover, we present a new two-stream video Transformer for efficiently modeling the visual modalities. Results on several vision-specific annotated datasets, including Kinetics-400 and UCF-101, validate our framework, which outperforms most relevant action recognition methods.
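The core of the SAVLD idea is a top-K retrieval over label embeddings: each video label is mapped to the audio labels whose text embeddings are most similar. The sketch below illustrates this with cosine similarity over precomputed label embeddings (in the paper these would come from a language model such as BERT; here the embedding source is abstracted away, and the function name `build_savld` is our own shorthand, not from the paper):

```python
import numpy as np

def build_savld(video_emb, audio_emb, k=3):
    """Map each video label to its K most relevant audio labels by
    cosine similarity between label embeddings.

    video_emb: dict mapping video label -> embedding vector
    audio_emb: dict mapping audio label -> embedding vector
    Returns: dict mapping video label -> list of K audio labels,
             most relevant first.
    """
    audio_labels = list(audio_emb)
    # Stack and L2-normalize audio embeddings so dot products = cosine sims.
    A = np.stack([audio_emb[l] for l in audio_labels])
    A = A / np.linalg.norm(A, axis=1, keepdims=True)

    savld = {}
    for v_label, e in video_emb.items():
        e = e / np.linalg.norm(e)
        sims = A @ e                       # cosine similarity to each audio label
        top = np.argsort(-sims)[:k]       # indices of the K highest similarities
        savld[v_label] = [audio_labels[i] for i in top]
    return savld
```

With real BERT embeddings, a video label like "playing guitar" would retrieve acoustically plausible audio labels (e.g., "guitar", "strumming") while leaving unrelated audio classes out of the dictionary, which is what lets the IMD module later decide whether the audio stream is relevant for a given clip.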




MAiVAR-T: Multimodal Audio-image and Video Action Recognizer using Transformers

In line with the human capacity to perceive the world by simultaneously ...

E-Sports Talent Scouting Based on Multimodal Twitch Stream Data

We propose and investigate feasibility of a novel task that consists in ...

Bridging the Emotional Semantic Gap via Multimodal Relevance Estimation

Human beings have rich ways of emotional expressions, including facial a...

AV-MaskEnhancer: Enhancing Video Representations through Audio-Visual Masked Autoencoder

Learning high-quality video representation has shown significant applica...

Impact of annotation modality on label quality and model performance in the automatic assessment of laughter in-the-wild

Laughter is considered one of the most overt signals of joy. Laughter is...

HMS: Hierarchical Modality Selection for Efficient Video Recognition

Videos are multimodal in nature. Conventional video recognition pipeline...

MMG-Ego4D: Multi-Modal Generalization in Egocentric Action Recognition

In this paper, we study a novel problem in egocentric action recognition...
