Contrasting Intra-Modal and Ranking Cross-Modal Hard Negatives to Enhance Visio-Linguistic Fine-grained Understanding

06/15/2023
by Le Zhang, et al.

Current Vision and Language Models (VLMs) demonstrate strong performance across a range of vision-language tasks, yet they struggle with fine-grained understanding. This weakness stems from weak image-caption alignment in pretraining datasets and from a simplified contrastive objective that fails to distinguish nuanced grounding elements such as relations, actions, and attributes; as a result, the models tend to learn bag-of-words representations. To mitigate these issues, we introduce an intra-modal contrastive loss and a novel cross-modal rank loss with an adaptive threshold that acts as a curriculum, both of which exploit our automatically generated hard negatives to strengthen the model's fine-grained capacity. Our approach requires no additional annotations or parameters and can be incorporated into any VLM trained with an image-text contrastive loss. Applied to CLIP, it yields significant improvements on three fine-grained benchmarks, and it also boosts X-VLM, the state-of-the-art model for fine-grained reasoning.
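Below is a minimal PyTorch sketch of how two auxiliary terms of this kind could sit on top of a CLIP-style objective. It is an illustration under stated assumptions, not the authors' implementation: the hinge form of the rank loss, the linear threshold schedule, the two-way softmax used for the intra-modal term, and all names (auxiliary_fine_grained_losses, img, cap, hard_neg, progress) are hypothetical.

import torch
import torch.nn.functional as F

def auxiliary_fine_grained_losses(img, cap, hard_neg, progress,
                                  max_threshold=0.2, temperature=0.07):
    """img, cap, hard_neg: (B, D) embeddings of the images, their paired
    captions, and automatically generated hard-negative captions.
    progress: scalar in [0, 1] indicating how far training has advanced,
    used to anneal the ranking threshold (a simple linear curriculum,
    assumed here for concreteness)."""
    img = F.normalize(img, dim=-1)
    cap = F.normalize(cap, dim=-1)
    hard_neg = F.normalize(hard_neg, dim=-1)

    pos_sim = (img * cap).sum(dim=-1)        # sim(image, true caption)
    xmod_neg = (img * hard_neg).sum(dim=-1)  # sim(image, hard negative)
    imod_neg = (cap * hard_neg).sum(dim=-1)  # sim(caption, hard negative)

    # Cross-modal rank loss: the true caption must outscore its hard negative
    # for the same image by a threshold that grows during training, so the
    # constraint starts easy and tightens over time (curriculum-style).
    threshold = max_threshold * progress
    rank_loss = F.relu(threshold - (pos_sim - xmod_neg)).mean()

    # Intra-modal contrastive term: the image-caption similarity should win
    # against the caption-to-hard-negative (text-text) similarity, keeping
    # captions that differ only in relations, actions, or attributes apart
    # in the text embedding space.
    logits = torch.stack([pos_sim, imod_neg], dim=-1) / temperature
    targets = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)
    intra_loss = F.cross_entropy(logits, targets)

    return intra_loss, rank_loss

In use, these terms would simply be added to the standard image-text contrastive loss, e.g. loss = clip_loss + intra_loss + rank_loss, with the hard-negative captions produced by automatic perturbations of the originals (for instance, swapping relations, actions, or attributes).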

research · 09/16/2023 · Delving into Multimodal Prompting for Fine-grained Visual Classification
Fine-grained visual classification (FGVC) involves categorizing fine sub...

research · 08/16/2021 · ROSITA: Enhancing Vision-and-Language Semantic Alignments via Cross- and Intra-modal Knowledge Integration
Vision-and-language pretraining (VLP) aims to learn generic multimodal r...

research · 01/16/2023 · Learning Aligned Cross-modal Representations for Referring Image Segmentation
Referring image segmentation aims to segment the image region of interes...

research · 07/11/2022 · Adaptive Fine-Grained Predicates Learning for Scene Graph Generation
The performance of current Scene Graph Generation (SGG) models is severe...

research · 05/02/2020 · Robust and Interpretable Grounding of Spatial References with Relation Networks
Handling spatial references in natural language is a key challenge in ta...

research · 01/19/2023 · Semantic-aware Contrastive Learning for More Accurate Semantic Parsing
Since the meaning representations are detailed and accurate annotations ...

research · 09/16/2019 · Learning to Map Nearly Anything
Looking at the world from above, it is possible to estimate many propert...
