COLA: How to adapt vision-language models to Compose Objects Localized with Attributes?

05/05/2023
by Arijit Ray, et al.

Compositional reasoning is a hallmark of human visual intelligence; yet, despite their size, large vision-language models struggle to represent simple compositions that combine objects with their attributes. To measure this lack of compositional capability, we design Cola, a text-to-image retrieval benchmark to Compose Objects Localized with Attributes. Using Cola as a testbed, we explore modeling designs to adapt pre-trained vision-language models to reason compositionally about multiple attributes attached to multiple objects. We explore 6 finetuning strategies on 2 seminal vision-language models, using 3 finetuning datasets and 2 test benchmarks (Cola and CREPE). Surprisingly, our optimal finetuning strategy improves a 151M-parameter CLIP, which encodes image and language disjointly during pretraining, to perform as well as a 241M-parameter FLAVA, which uses a multi-modal transformer encoder during pretraining to attend over both vision and language modalities. This optimal strategy is a lightweight multi-modal adapter that jointly attends over the image and language features generated by the pretrained model. We show that this works better than common strategies such as prompt tuning or fine-tuning, or tuning a comparable number of unimodal layers.
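The adapter described in the abstract can be read as a small transformer trained on top of frozen image and text features. The PyTorch sketch below is illustrative only: it assumes CLIP-style patch and token embeddings of a shared dimension, and the class name, layer sizes, and scoring head are hypothetical stand-ins, not the authors' released implementation.

```python
# Hypothetical sketch of a lightweight multi-modal adapter that jointly
# attends over frozen image and text features, in the spirit of the
# strategy summarized above. Names and dimensions are illustrative.
import torch
import torch.nn as nn

class MultimodalAdapter(nn.Module):
    def __init__(self, dim=512, n_heads=8, n_layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=n_heads, dim_feedforward=2 * dim,
            batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.score = nn.Linear(dim, 1)  # image-text match score for retrieval

    def forward(self, image_feats, text_feats):
        # image_feats: (B, N_img, dim) patch/region features from a frozen vision encoder
        # text_feats:  (B, N_txt, dim) token features from a frozen text encoder
        tokens = torch.cat([image_feats, text_feats], dim=1)
        fused = self.encoder(tokens)           # joint attention over both modalities
        return self.score(fused.mean(dim=1))   # pooled compatibility score

# Usage sketch: only the adapter's parameters are trained; the pretrained
# vision-language backbone stays frozen.
adapter = MultimodalAdapter(dim=512)
img = torch.randn(4, 50, 512)   # e.g., frozen CLIP patch embeddings (assumed shape)
txt = torch.randn(4, 20, 512)   # e.g., frozen CLIP token embeddings (assumed shape)
scores = adapter(img, txt)      # (4, 1) image-caption match scores
```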
