Masked Vision and Language Modeling for Multi-modal Representation Learning

08/03/2022
by Gukyeong Kwon, et al.

In this paper, we study how to use masked signal modeling in vision and language (V+L) representation learning. Instead of developing masked language modeling (MLM) and masked image modeling (MIM) independently, we propose joint masked vision and language modeling, in which the masked signal of one modality is reconstructed with the help of the other modality. This is motivated by the nature of image-text paired data: the image and the text convey almost the same information, but in different formats. Reconstructing the masked signal of one modality conditioned on the other can also implicitly learn cross-modal alignment between language tokens and image patches. Our experiments on various V+L tasks show that the proposed method not only achieves state-of-the-art performance when trained on a large amount of data, but also outperforms competing methods by a significant margin in regimes of limited training data.
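
The cross-modal masking idea in the abstract can be illustrated with a minimal data-preparation sketch. It is not the authors' implementation. The function names, the `[MASK]` and `None` placeholders, and the mask ratios are all assumptions made for illustration, and the encoder and reconstruction losses are left out. For one image-text pair, the sketch builds two inputs: masked text paired with the full image, and masked image patches paired with the full text.

```python
import random

MASK_TOKEN = "[MASK]"  # hypothetical placeholder for a masked word
MASK_PATCH = None      # hypothetical placeholder for a masked image patch


def mask_sequence(items, ratio, placeholder, rng):
    """Replace a random subset of items with a placeholder.

    Returns the masked copy and the sorted indices that were masked.
    """
    # Mask at least one item when possible, and never more than exist.
    n_mask = min(len(items), max(1, int(len(items) * ratio)))
    masked_idx = sorted(rng.sample(range(len(items)), n_mask))
    masked = list(items)
    for i in masked_idx:
        masked[i] = placeholder
    return masked, masked_idx


def build_joint_masking_examples(tokens, patches,
                                 text_ratio=0.3, image_ratio=0.6, seed=0):
    """Build the two cross-modal reconstruction inputs for one pair.

    The ratios are illustrative assumptions, not values from the paper.
    """
    rng = random.Random(seed)
    masked_tokens, token_idx = mask_sequence(tokens, text_ratio, MASK_TOKEN, rng)
    masked_patches, patch_idx = mask_sequence(patches, image_ratio, MASK_PATCH, rng)
    return {
        # Reconstruct masked words while conditioning on the full image.
        "text_from_image": {
            "text_input": masked_tokens,
            "image_input": list(patches),
            "targets": {i: tokens[i] for i in token_idx},
        },
        # Reconstruct masked patches while conditioning on the full text.
        "image_from_text": {
            "text_input": list(tokens),
            "image_input": masked_patches,
            "targets": {i: patches[i] for i in patch_idx},
        },
    }
```

In each of the two examples, only one modality is corrupted, so a model trained to fill in the targets has to draw on the intact modality. That is the mechanism the abstract credits with implicitly learning alignment between language tokens and image patches.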


