Finetuning Pretrained Vision-Language Models with Correlation Information Bottleneck for Robust Visual Question Answering

by Jingjing Jiang, et al.

Benefiting from large-scale Pretrained Vision-Language Models (VL-PMs), the performance of Visual Question Answering (VQA) has begun to approach human oracle performance. However, finetuning large-scale VL-PMs with limited data for VQA usually suffers from overfitting and poor generalization, leading to a lack of robustness. In this paper, we aim to improve the robustness of VQA systems (i.e., their ability to defend against input variations and human-adversarial attacks) from the perspective of the Information Bottleneck when finetuning VL-PMs for VQA. In general, the internal representations obtained by VL-PMs inevitably contain information that is irrelevant and redundant for the downstream VQA task, resulting in spurious statistical correlations and insensitivity to input variations. To encourage representations to converge to a minimal sufficient statistic in vision-language learning, we propose the Correlation Information Bottleneck (CIB) principle, which seeks a tradeoff between representation compression and redundancy by minimizing the mutual information (MI) between the inputs and internal representations while maximizing the MI between the outputs and the representations. In addition, CIB measures the internal correlations among visual and linguistic inputs and representations via a symmetrized joint MI estimation. Extensive experiments on five VQA benchmarks of input robustness and two VQA benchmarks of human-adversarial robustness demonstrate the effectiveness and superiority of the proposed CIB in improving the robustness of VQA systems.
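As a rough illustration of the information-bottleneck tradeoff described above (this is a generic sketch, not the authors' CIB implementation, which uses a symmetrized joint MI estimator over both modalities), the objective can be written as minimizing β·I(X;Z) − I(Z;Y). Since MI is intractable in general, practical methods substitute estimators; the snippet below uses the well-known InfoNCE lower bound on MI between paired samples, with a dot-product critic, as a stand-in:

```python
import numpy as np

def infonce_mi_lower_bound(x, z, temperature=1.0):
    """InfoNCE lower bound on I(X;Z) for paired rows x[i] <-> z[i].

    The bound is E[log softmax(scores)[i, i]] + log N, and is
    always at most log N for a batch of size N.
    """
    scores = (x @ z.T) / temperature          # scores[i, j]: critic on (x_i, z_j)
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return float(np.mean(np.diag(log_probs)) + np.log(len(x)))

def ib_style_loss(mi_input_repr, mi_repr_output, beta=1e-3):
    """Information-bottleneck tradeoff: compress I(X;Z), preserve I(Z;Y)."""
    return beta * mi_input_repr - mi_repr_output
```

In a finetuning loop, `mi_input_repr` would be estimated between the model's inputs and internal representations (to be minimized) and `mi_repr_output` between the representations and the answers (to be maximized); `beta` controls the compression/prediction tradeoff.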


SwapMix: Diagnosing and Regularizing the Over-Reliance on Visual Context in Visual Question Answering

While Visual Question Answering (VQA) has progressed rapidly, previous w...

Image Captioning for Effective Use of Language Models in Knowledge-Based Visual Question Answering

Integrating outside knowledge for reasoning in visio-linguistic tasks su...

Learning from Lexical Perturbations for Consistent Visual Question Answering

Existing Visual Question Answering (VQA) models are often fragile and se...

Rethinking Evaluation Practices in Visual Question Answering: A Case Study on Out-of-Distribution Generalization

Vision-and-language (V&L) models pretrained on large-scale multimodal ...

MixPHM: Redundancy-Aware Parameter-Efficient Tuning for Low-Resource Visual Question Answering

Recently, finetuning pretrained vision-language models (VLMs) has become...

Vision-Language Models as Success Detectors

Detecting successful behaviour is crucial for training intelligent agent...

Aligned Image-Word Representations Improve Inductive Transfer Across Vision-Language Tasks

An important goal of computer vision is to build systems that learn visu...
