Bridging the Gap: Exploring the Capabilities of Bridge-Architectures for Complex Visual Reasoning Tasks

by Kousik Rajesh, et al.

In recent times there has been a surge of multi-modal architectures based on Large Language Models (LLMs), which leverage the zero-shot generation capabilities of LLMs by projecting image embeddings into the text space and then using the auto-regressive capacity of the LLM to solve tasks such as VQA, captioning, and image retrieval. We refer to these architectures as "bridge-architectures", as they bridge from the image space to the text space. These models deviate from the traditional recipe for training transformer-based multi-modal models, which involves large-scale pre-training and complex multi-modal interactions through co-attention or cross-attention. However, the capabilities of bridge-architectures have not been tested on complex visual reasoning tasks that require fine-grained analysis of images. In this project, we investigate the performance of these bridge-architectures on the NLVR2 dataset and compare it to that of state-of-the-art transformer-based architectures. We first extend the traditional bridge-architectures for NLVR2 by adding object-level features to facilitate fine-grained object reasoning. Our analysis shows that adding object-level features to bridge-architectures does not help, and that pre-training on multi-modal data is key to good performance on complex reasoning tasks such as NLVR2. We also present initial results for a recent bridge-architecture, LLaVA, in the zero-shot setting and analyze its performance.
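The core mechanism the abstract describes can be sketched very compactly: a frozen image encoder produces patch embeddings, a learned projection (the "bridge") maps them into the LLM's text-embedding space, and the projected tokens are prepended as a soft prefix for auto-regressive decoding. The sketch below uses NumPy with random placeholder weights and hypothetical dimensions (all names and sizes are illustrative assumptions, not the paper's actual configuration):

```python
import numpy as np

# Hypothetical dimensions, chosen for illustration only.
IMG_DIM, TXT_DIM = 768, 512     # image-encoder dim, LLM embedding dim
N_PATCHES, N_TOKENS = 16, 8     # image patches, text prompt tokens

rng = np.random.default_rng(0)

# Frozen image encoder output: one embedding per image patch.
image_feats = rng.standard_normal((N_PATCHES, IMG_DIM))

# The "bridge": a learned linear projection from image space into
# the LLM's text-embedding space (trained, not random, in practice).
W_bridge = rng.standard_normal((IMG_DIM, TXT_DIM)) * 0.02

# Text token embeddings looked up from the frozen LLM's embedding table.
text_embeds = rng.standard_normal((N_TOKENS, TXT_DIM))

# Projected image tokens are prepended as a soft prefix; the frozen LLM
# then decodes auto-regressively over the combined sequence.
image_tokens = image_feats @ W_bridge                 # (N_PATCHES, TXT_DIM)
llm_input = np.concatenate([image_tokens, text_embeds], axis=0)

print(llm_input.shape)  # (24, 512)
```

In real bridge-architectures the projection may be a single linear layer, an MLP, or a query-based module such as a Q-Former, but the interface to the LLM is the same: image content arrives as extra "tokens" in the text-embedding space.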




