Video and Audio are Images: A Cross-Modal Mixer for Original Data on Video-Audio Retrieval

by Zichen Yuan et al.

Cross-modal retrieval has become popular in recent years, particularly with the rise of multimedia. Generally, each modality exhibits distinct representations and semantic information, so features encoded by a dual-tower architecture tend to lie in separate latent spaces, making it difficult to establish semantic relationships between modalities and resulting in poor retrieval performance. To address this issue, we propose a novel framework for cross-modal retrieval that consists of a cross-modal mixer, a masked autoencoder for pre-training, and a cross-modal retriever for downstream tasks. Specifically, we first adopt a cross-modal mixer with mask modeling to fuse the original modalities and eliminate redundancy. Then, an encoder-decoder architecture performs a fuse-then-separate task in the pre-training phase: we feed masked fused representations into the encoder and reconstruct them with the decoder, ultimately separating the original data of the two modalities. In downstream tasks, we use the pre-trained encoder to build the cross-modal retrieval method. Extensive experiments on two real-world datasets show that our approach outperforms previous state-of-the-art methods on video-audio matching tasks, improving retrieval accuracy by up to 2 times. Furthermore, we validate the model's generality as a universal model by transferring it to other downstream tasks.
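The fuse-then-separate pre-training described above can be sketched at a data-flow level. The following is a minimal NumPy illustration, not the paper's implementation: the token counts, mask ratio, and random linear maps standing in for the trained encoder and decoder are all assumptions chosen only to show the shapes moving through the mix → mask → encode → decode → separate pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (hypothetical; the paper's actual sizes differ)
n_tokens, d = 8, 16                         # tokens per modality, feature dim
video = rng.normal(size=(n_tokens, d))      # video token features
audio = rng.normal(size=(n_tokens, d))      # audio token features

# 1) Cross-modal mixer: fuse both modalities into one token sequence
fused = np.concatenate([video, audio], axis=0)          # shape (2n, d)

# 2) Mask modeling: hide a random subset of the fused tokens
mask_ratio = 0.5
n_total = fused.shape[0]
keep_idx = np.sort(rng.permutation(n_total)[: int(n_total * (1 - mask_ratio))])
visible = fused[keep_idx]                               # encoder input

# 3) Encoder/decoder stand-ins: untrained random linear maps
W_enc = rng.normal(size=(d, d)) / np.sqrt(d)
W_dec = rng.normal(size=(d, d)) / np.sqrt(d)
latent = visible @ W_enc

# Decoder reconstructs the full sequence; masked slots start as zeros
# (a learned mask token in the real model)
recon = np.zeros_like(fused)
recon[keep_idx] = latent @ W_dec

# 4) Separate: split the reconstruction back into the two modalities
video_recon, audio_recon = recon[:n_tokens], recon[n_tokens:]
print(video_recon.shape, audio_recon.shape)             # (8, 16) (8, 16)
```

In training, the reconstruction loss between `(video_recon, audio_recon)` and the original inputs would drive the encoder to learn a shared latent space, which the retriever then reuses downstream.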


OPT: Omni-Perception Pre-Trainer for Cross-Modal Understanding and Generation

In this paper, we propose an Omni-perception Pre-Trainer (OPT) for cross...

Bridge-Tower: Building Bridges Between Encoders in Vision-Language Representation Learning

Vision-Language (VL) models with the Two-Tower architecture have dominat...

CoCo-BERT: Improving Video-Language Pre-training with Contrastive Cross-modal Matching and Denoising

BERT-type structure has led to the revolution of vision-language pre-tra...

AdvCLIP: Downstream-agnostic Adversarial Examples in Multimodal Contrastive Learning

Multimodal contrastive learning aims to train a general-purpose feature ...

Learning Audio-Visual Correlations from Variational Cross-Modal Generation

People can easily imagine the potential sound while seeing an event. Thi...

Image-text Retrieval: A Survey on Recent Research and Development

In the past few years, cross-modal image-text retrieval (ITR) has experi...

Learning Controls Using Cross-Modal Representations: Bridging Simulation and Reality for Drone Racing

Machines are a long way from robustly solving open-world perception-cont...