Learning Dual Semantic Relations with Graph Attention for Image-Text Matching

10/22/2020
by Keyu Wen, et al.

Image-Text Matching is a major task in cross-modal information processing. The main challenge is to learn unified visual and textual representations. Previous methods that perform well on this task focus not only on the alignment between region features in images and the corresponding words in sentences, but also on the alignment between relations of regions and relational words. However, without joint learning of regional and global features, the regional features lose contact with the global context, causing mismatches with non-object words that carry global meanings in some sentences. To alleviate this issue, both the relations between regions and the relations between regional and global concepts must be enhanced, so as to obtain a more accurate visual representation that correlates better with the corresponding text. Thus, a novel multi-level semantic relations enhancement approach named Dual Semantic Relations Attention Network (DSRAN) is proposed, consisting mainly of two modules: the separate semantic relations module and the joint semantic relations module. DSRAN performs graph attention in both modules, for region-level relations enhancement and regional-global relations enhancement respectively. With these two modules, different hierarchies of semantic relations are learned simultaneously, providing richer information for the final visual representation and thus improving the image-text matching process. Quantitative experiments on MS-COCO and Flickr30K show that our method outperforms previous approaches by a large margin, owing to the effectiveness of the dual semantic relations learning scheme. Code is available at https://github.com/kywen1119/DSRAN.
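To make the two-module design concrete, below is a minimal PyTorch sketch of the dual graph-attention idea described in the abstract. It is not the authors' implementation (see the linked repository for that); the module names, the single-head attention formulation, the feature dimension of 1024, the region count of 36, and the mean-pooled output are all illustrative assumptions.

```python
# A minimal sketch of dual graph attention over region and global features.
# NOT the DSRAN implementation (see https://github.com/kywen1119/DSRAN);
# names, dimensions, and pooling here are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphAttention(nn.Module):
    """Single-head graph attention over a fully connected node graph."""
    def __init__(self, dim):
        super().__init__()
        self.query = nn.Linear(dim, dim)
        self.key = nn.Linear(dim, dim)
        self.value = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, nodes):                 # nodes: (B, N, dim)
        q, k, v = self.query(nodes), self.key(nodes), self.value(nodes)
        attn = F.softmax(q @ k.transpose(1, 2) * self.scale, dim=-1)
        return nodes + attn @ v               # residual relation enhancement

class DualRelationSketch(nn.Module):
    """Region-level relations first, then joint regional-global relations."""
    def __init__(self, dim=1024):
        super().__init__()
        self.region_gat = GraphAttention(dim)  # separate semantic relations
        self.joint_gat = GraphAttention(dim)   # joint semantic relations

    def forward(self, regions, global_feat):
        # regions: (B, N, dim) detected region features
        # global_feat: (B, dim) whole-image (global context) feature
        regions = self.region_gat(regions)     # region-region relations
        nodes = torch.cat([global_feat.unsqueeze(1), regions], dim=1)
        nodes = self.joint_gat(nodes)          # regional-global relations
        return nodes.mean(dim=1)               # pooled visual embedding

# Usage: embed 36 region features plus one global feature into a vector.
model = DualRelationSketch(dim=1024)
v = model(torch.randn(2, 36, 1024), torch.randn(2, 1024))
print(v.shape)  # torch.Size([2, 1024])
```

The resulting visual embedding would then be matched against a sentence embedding, e.g. with a cosine-similarity ranking loss, which is the standard setup for image-text matching.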

