Scheduled Sampling in Vision-Language Pretraining with Decoupled Encoder-Decoder Network

01/27/2021
by Yehao Li, et al.

Despite the impressive progress of vision-language (VL) pretraining with BERT-based encoders for VL understanding, pretraining a universal encoder-decoder for both VL understanding and generation remains challenging. The difficulty originates from the inherently different characteristics of the two task families: VL understanding tasks capitalize on unrestricted message passing across modalities, while generation tasks employ only visual-to-textual message passing. In this paper, we start with a two-stream decoupled design of the encoder-decoder structure, in which a decoupled cross-modal encoder and decoder separately perform each type of proxy task, enabling simultaneous VL understanding and generation pretraining. Moreover, the dominant approach in VL pretraining is to replace some input visual/word tokens with mask tokens and force the multi-modal encoder/decoder to reconstruct the original tokens, yet no mask token is involved when fine-tuning on downstream tasks. As an alternative, we propose a primary scheduled sampling strategy that elegantly mitigates this discrepancy by pretraining the encoder-decoder in a two-pass manner. Extensive experiments demonstrate the compelling generalizability of our pretrained encoder-decoder by fine-tuning on four VL understanding and generation downstream tasks. Source code is available at <https://github.com/YehLi/TDEN>.
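
To make the two-pass idea concrete, below is a minimal, hypothetical sketch of scheduled sampling for masked-token pretraining. It is not the authors' implementation: the model (`TinyDecoder`), vocabulary size, mask id, and greedy replacement of masks with pass-1 predictions are all simplifying assumptions. The point it illustrates is that the second pass sees no mask tokens at its input, which matches the fine-tuning condition.

```python
# Hypothetical sketch of two-pass scheduled sampling for masked-token pretraining.
# All names (TinyDecoder, MASK_ID, VOCAB, two_pass_step) are illustrative only.
import torch
import torch.nn as nn

MASK_ID, VOCAB = 0, 1000


class TinyDecoder(nn.Module):
    """Stand-in for a cross-modal decoder stream (text tokens only here)."""

    def __init__(self, dim=64):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, dim)
        self.proj = nn.Linear(dim, VOCAB)

    def forward(self, tokens):
        # Returns per-token vocabulary logits of shape (batch, seq_len, VOCAB).
        return self.proj(self.embed(tokens))


def two_pass_step(model, tokens, mask_prob=0.15):
    """One pretraining step.

    Pass 1: standard masked-token reconstruction with [MASK] inputs.
    Pass 2: masks are replaced by the model's own pass-1 predictions,
    so no mask token appears at the input (as at fine-tuning time).
    """
    mask = torch.rand(tokens.shape) < mask_prob
    if not mask.any():              # guard: ensure at least one masked position
        mask[0, 0] = True
    masked_input = tokens.masked_fill(mask, MASK_ID)

    # Pass 1: reconstruct the original tokens at masked positions.
    logits1 = model(masked_input)
    loss1 = nn.functional.cross_entropy(logits1[mask], tokens[mask])

    # Pass 2: substitute greedy pass-1 predictions for the masks
    # (a simplification; one could also sample from the distribution).
    with torch.no_grad():
        predicted = logits1.argmax(-1)
    second_input = torch.where(mask, predicted, tokens)
    logits2 = model(second_input)
    loss2 = nn.functional.cross_entropy(logits2[mask], tokens[mask])

    return loss1 + loss2


model = TinyDecoder()
tokens = torch.randint(1, VOCAB, (4, 16))   # toy batch of token ids
loss = two_pass_step(model, tokens)
loss.backward()
```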
