ROSITA: Enhancing Vision-and-Language Semantic Alignments via Cross- and Intra-modal Knowledge Integration

08/16/2021
by Yuhao Cui, et al.

Vision-and-language pretraining (VLP) aims to learn generic multimodal representations from massive image-text pairs. While various successful attempts have been made, learning fine-grained semantic alignments between image-text pairs plays a key role in these approaches. Nevertheless, most existing VLP approaches do not fully exploit the intrinsic knowledge within the image-text pairs, which limits the effectiveness of the learned alignments and, in turn, restricts the performance of their models. To this end, we introduce a new VLP method called ROSITA, which integrates cross- and intra-modal knowledge in a unified scene graph to enhance the semantic alignments. Specifically, we introduce a novel structural knowledge masking (SKM) strategy that uses the scene graph structure as a prior to perform masked language (region) modeling, thereby enhancing the semantic alignments by eliminating interfering information within and across modalities. Extensive ablation studies and comprehensive analyses verify the effectiveness of ROSITA in learning semantic alignments. Pretrained on both in-domain and out-of-domain datasets, ROSITA significantly outperforms existing state-of-the-art VLP methods on three typical vision-and-language tasks across six benchmark datasets.
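The SKM idea described above can be sketched roughly as follows. This is an illustrative sketch, not the paper's implementation: the function name `structural_knowledge_masking` and the inputs `graph_nodes` and `region_links` are hypothetical. The core point it shows is that masking is applied to whole scene-graph nodes rather than to independent random tokens, and that the image regions aligned to a masked node are masked jointly, so the model cannot trivially recover a masked word from its aligned region (or vice versa).

```python
import random

MASK_TOKEN = "[MASK]"

def structural_knowledge_masking(tokens, graph_nodes, region_links,
                                 p_mask=0.3, seed=0):
    """Scene-graph-guided masking (illustrative sketch).

    tokens       : caption tokens, e.g. ["a", "dog", "chases", "a", "ball"]
    graph_nodes  : token-index groups, one per scene-graph node,
                   e.g. [[1], [2], [4]] for object "dog",
                   relation "chases", object "ball"
    region_links : node index -> ids of image regions aligned to that node
    Returns the masked token sequence and the set of masked region ids.
    """
    rng = random.Random(seed)
    masked_tokens = list(tokens)
    masked_regions = set()
    for node_idx, span in enumerate(graph_nodes):
        # Decide per scene-graph node, not per token, whether to mask.
        if rng.random() < p_mask:
            # Mask every token belonging to this node together.
            for i in span:
                masked_tokens[i] = MASK_TOKEN
            # Also mask the image regions aligned to this node, removing
            # the cross-modal shortcut for reconstructing the masked words.
            masked_regions.update(region_links.get(node_idx, []))
    return masked_tokens, masked_regions
```

In an actual VLP pipeline the masked positions would then feed the masked language (region) modeling losses; here the sketch only shows how the scene-graph structure drives the choice of what to mask.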


