Declaration-based Prompt Tuning for Visual Question Answering

05/05/2022
by Yuhang Liu, et al.

In recent years, the pre-training-then-fine-tuning paradigm has yielded immense success on a wide spectrum of cross-modal tasks, such as visual question answering (VQA), in which a vision-language (VL) model is first optimized via self-supervised objectives, e.g., masked language modeling (MLM) and image-text matching (ITM), and then fine-tuned to adapt to the downstream task (e.g., VQA) via a brand-new objective function, e.g., answer prediction. This inconsistency of objective forms not only severely limits the generalization of pre-trained VL models to downstream tasks, but also requires a large amount of labeled data for fine-tuning. To alleviate the problem, we propose an innovative VL fine-tuning paradigm (named Declaration-based Prompt Tuning, abbreviated as DPT), which aligns the objectives of pre-training and fine-tuning of the VQA model, boosting the effective adaptation of pre-trained VL models to the downstream task. Specifically, DPT reformulates the objective form of the VQA task via (1) textual adaptation, which converts the given questions into declarative sentences for prompt tuning, and (2) task adaptation, which optimizes the objective function of the VQA problem in the manner of the pre-training phase. Experimental results on the GQA dataset show that DPT outperforms the fine-tuned counterpart by a large margin in accuracy in both fully-supervised (2.68%) and zero-shot/few-shot settings. The data and code will be made available to facilitate future research.
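To make the two adaptations concrete, the sketch below recasts answer prediction as masked language modeling over a declarative prompt. It is a minimal, text-only illustration built on a generic HuggingFace BERT masked LM: the `question_to_declaration` rewrite rule, the single-token candidate answers, and the model choice are all hypothetical simplifications, not the paper's implementation; the actual DPT model is a vision-language transformer that additionally attends to image region features.

```python
# Minimal text-only sketch of DPT's two adaptations (hypothetical
# simplification; the real DPT model also consumes image features).
import torch
from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased").eval()

def question_to_declaration(question: str) -> str:
    """Textual adaptation: rewrite a question as a declarative
    sentence whose answer slot is a [MASK] token. DPT derives such
    rewrites systematically; this toy rule covers only 'what is X?'."""
    q = question.strip().rstrip("?").lower()
    if q.startswith("what is "):
        return q[len("what is "):].capitalize() + " is [MASK]."
    return question.strip().rstrip("?") + "? It is [MASK]."

@torch.no_grad()
def score_answers(declaration: str, candidates: list[str]) -> dict[str, float]:
    """Task adaptation: score candidate answers with the MLM head,
    i.e., the same objective form used in pre-training, instead of
    a brand-new answer-classification head."""
    inputs = tokenizer(declaration, return_tensors="pt")
    # Locate the [MASK] position and read out its vocabulary logits.
    mask_pos = (inputs.input_ids[0] == tokenizer.mask_token_id).nonzero()[0].item()
    logits = model(**inputs).logits[0, mask_pos]
    # Single-token candidates only, for simplicity.
    return {a: logits[tokenizer.convert_tokens_to_ids(a)].item() for a in candidates}

decl = question_to_declaration("What is the color of the bus?")
print(decl)  # "The color of the bus is [MASK]."
print(score_answers(decl, ["red", "blue", "yellow"]))
```

Because the MLM head is already trained to fill masked slots, this formulation can produce answers without a newly initialized classifier, which is consistent with the zero-shot/few-shot setting the abstract reports gains in.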


