ViLLA: Fine-Grained Vision-Language Representation Learning from Real-World Data

08/22/2023
by Maya Varma, et al.

Vision-language models (VLMs), such as CLIP and ALIGN, are generally trained on datasets consisting of image-caption pairs obtained from the web. However, real-world multimodal datasets, such as healthcare data, are significantly more complex: each image (e.g. X-ray) is often paired with text (e.g. physician report) that describes many distinct attributes occurring in fine-grained regions of the image. We refer to these samples as exhibiting high pairwise complexity, since each image-text pair can be decomposed into a large number of region-attribute pairings. The extent to which VLMs can capture fine-grained relationships between image regions and textual attributes when trained on such data has not been previously evaluated. The first key contribution of this work is to demonstrate through systematic evaluations that as the pairwise complexity of the training dataset increases, standard VLMs struggle to learn region-attribute relationships, exhibiting performance degradations of up to 37% on retrieval tasks. In order to address this issue, we introduce ViLLA as our second key contribution. ViLLA, which is trained to capture fine-grained region-attribute relationships from complex datasets, involves two components: (a) a lightweight, self-supervised mapping model to decompose image-text samples into region-attribute pairs, and (b) a contrastive VLM to learn representations from generated region-attribute pairs. We demonstrate with experiments across four domains (synthetic, product, medical, and natural images) that ViLLA outperforms comparable VLMs on fine-grained reasoning tasks, such as zero-shot object detection (up to 3.6 AP50 points on COCO and 0.6 mAP points on LVIS) and retrieval (up to 14.2 R-Precision points).
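To make component (b) concrete, here is a minimal sketch of a CLIP-style symmetric contrastive (InfoNCE) loss applied to matched region-attribute pairs rather than to whole image-caption pairs. This is not the authors' implementation; the encoder outputs, batch layout, and temperature value are illustrative assumptions.

import torch
import torch.nn.functional as F

def region_attribute_contrastive_loss(region_emb, attr_emb, temperature=0.07):
    """Symmetric InfoNCE over N matched region-attribute pairs.

    region_emb, attr_emb: (N, d) tensors produced by separate region and
    attribute encoders (names and shapes are assumptions for this sketch).
    """
    # Project embeddings onto the unit sphere so dot products are cosine similarities.
    region_emb = F.normalize(region_emb, dim=-1)
    attr_emb = F.normalize(attr_emb, dim=-1)
    # (N, N) similarity matrix; diagonal entries are the true pairings.
    logits = region_emb @ attr_emb.t() / temperature
    targets = torch.arange(len(logits), device=logits.device)
    # Cross-entropy in both directions: region-to-attribute and attribute-to-region.
    loss_r2a = F.cross_entropy(logits, targets)
    loss_a2r = F.cross_entropy(logits.t(), targets)
    return (loss_r2a + loss_a2r) / 2

The point of training on decomposed pairs is that a single complex image-text sample yields many fine-grained positives for the contrastive objective, rather than the one coarse positive an image-caption pair provides.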
