An Empirical Study of Training End-to-End Vision-and-Language Transformers

11/03/2021
by Zi-Yi Dou, et al.

Vision-and-language (VL) pre-training has proven highly effective on various VL downstream tasks. While recent work has shown that fully transformer-based VL models can be more efficient than previous region-feature-based methods, their performance on downstream tasks is often significantly degraded. In this paper, we present METER (Multimodal End-to-end TransformER), through which we systematically investigate how to design and pre-train a fully transformer-based VL model in an end-to-end manner. Specifically, we dissect the model designs along multiple dimensions: vision encoders (e.g., CLIP-ViT, Swin transformer), text encoders (e.g., RoBERTa, DeBERTa), multimodal fusion (e.g., merged attention vs. co-attention), architecture design (e.g., encoder-only vs. encoder-decoder), and pre-training objectives (e.g., masked image modeling). We conduct comprehensive experiments on a wide range of VL tasks, and provide insights on how to train a performant VL transformer while maintaining fast inference speed. Notably, METER achieves an accuracy of 77.64% on the VQAv2 test-std set using only 4M images for pre-training, surpassing the state-of-the-art region-feature-based VinVL model by +1.04%, and outperforming the previous best fully transformer-based ALBEF model by +1.6%.
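To make the fusion comparison concrete, the sketch below contrasts the two designs named in the abstract: merged attention, where text and image tokens are concatenated and processed by a single self-attention, and co-attention, where each modality keeps its own stream and cross-attends to the other. This is a minimal illustrative reconstruction, not the METER code: the module names, single-layer structure, and dimensions are assumptions for exposition.

```python
# Minimal sketch (not the authors' implementation) of the two fusion
# designs compared in the paper. All names and sizes are illustrative.
import torch
import torch.nn as nn

class MergedAttentionBlock(nn.Module):
    """Merged attention: concatenate text and image tokens, then run
    one self-attention over the joint sequence."""
    def __init__(self, dim=768, heads=12):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, text, image):
        joint = torch.cat([text, image], dim=1)   # (B, Lt+Lv, D)
        out, _ = self.attn(joint, joint, joint)   # self-attention over both modalities
        return out

class CoAttentionBlock(nn.Module):
    """Co-attention: separate streams per modality; cross-attention lets
    text attend to image tokens and vice versa."""
    def __init__(self, dim=768, heads=12):
        super().__init__()
        self.t_self = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.v_self = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.t2v = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.v2t = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, text, image):
        t, _ = self.t_self(text, text, text)      # text self-attention
        v, _ = self.v_self(image, image, image)   # image self-attention
        t, _ = self.t2v(t, v, v)                  # text queries image tokens
        v, _ = self.v2t(v, t, t)                  # image queries text tokens
        return t, v

# Toy usage with random features standing in for encoder outputs.
text = torch.randn(2, 16, 768)    # (batch, text tokens, dim)
image = torch.randn(2, 49, 768)   # (batch, image patches, dim)
print(MergedAttentionBlock()(text, image).shape)  # torch.Size([2, 65, 768])
t, v = CoAttentionBlock()(text, image)
print(t.shape, v.shape)           # torch.Size([2, 16, 768]) torch.Size([2, 49, 768])
```

Note the trade-off this sketch exposes: merged attention shares one set of parameters across the joint sequence, while co-attention roughly doubles the attention parameters but keeps the per-modality sequence lengths shorter.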


Related research

- 06/03/2021 · E2E-VLP: End-to-End Vision-Language Pre-training Enhanced by Visual Learning. Vision-language pre-training (VLP) on large-scale image-text pairs has a...
- 06/05/2023 · Efficient GPT Model Pre-training using Tensor Train Matrix Representation. Large-scale transformer models have shown remarkable performance in lang...
- 09/04/2022 · An Empirical Study of End-to-End Video-Language Transformers with Masked Visual Modeling. Masked visual modeling (MVM) has been recently proven effective for visu...
- 12/08/2021 · MLP Architectures for Vision-and-Language Modeling: An Empirical Study. We initiate the first empirical study on the use of MLP architectures fo...
- 10/24/2022 · Effective Pre-Training Objectives for Transformer-based Autoencoders. In this paper, we study trade-offs between efficiency, cost and accuracy...
- 02/05/2021 · ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision. Vision-and-Language Pretraining (VLP) has improved performance on variou...
- 06/02/2022 · MMTM: Multi-Tasking Multi-Decoder Transformer for Math Word Problems. Recently, quite a few novel neural architectures were derived to solve m...
